Description
This article covers what to do when too many alert notifications are being received, and suggests ways to be notified only when a network resource is actually "up" or "down."
Solution
There is no single, universal answer to this issue. Generally, it is necessary to review the alert history for patterns (for example, the time of day or the day of the week) or other symptoms that would explain why the health check is flapping, that is, switching repeatedly between up and down.
An example of a flapping health check is shown below.
In these scenarios, it is recommended to adjust the health check using the dampening settings under the advanced health check options. These settings reduce the sensitivity of checks that are flapping.
Settings to consider adjusting to reduce health check sensitivity:
Node %: Percentage of requesting healthcheck nodes that must report success for the Network Resource to be considered available (Default 70%).
Retry Down: Number of failures (in a row) by a single healthcheck node before the Network Resource is declared unavailable by that node (Default 1). Checks run every minute, so this would be equivalent to the number of minutes to wait.
Retry Up: After failure, the number of successful checks (in a row) by a single healthcheck node before the Network Resource is re-declared available by that node (Default 1). Checks run every minute, so this would be equivalent to the number of minutes to wait.
Try experimenting with the number of retries each healthcheck node requires before it reports a down event (Retry Down) or an up event (Retry Up), or with the overall node threshold (Node %).
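To get a feel for how these settings interact before changing a live health check, the behavior described above can be modeled offline. The following is a minimal Python sketch, not the product's actual implementation: the function names (`node_states`, `resource_available`, `count_transitions`) are hypothetical, and it assumes that a node only changes its declared state after the configured number of consecutive contradicting results, and that "Node %" means "at least this percentage of nodes report up."

```python
def node_states(results, retry_down=1, retry_up=1, initial_up=True):
    """Return the state one healthcheck node declares after each check.

    results: list of booleans, one per one-minute check (True = success).
    Assumption: the node flips state only after `retry_down` consecutive
    failures (when up) or `retry_up` consecutive successes (when down).
    """
    up = initial_up
    streak = 0  # consecutive results that contradict the declared state
    declared = []
    for ok in results:
        if ok == up:
            streak = 0  # result agrees with the declared state
        else:
            streak += 1
            needed = retry_down if up else retry_up
            if streak >= needed:
                up = ok
                streak = 0
        declared.append(up)
    return declared


def resource_available(node_up_flags, node_percent=70):
    """True when at least node_percent of the nodes declare the resource up.

    Assumption: the threshold is inclusive ("at least"), which the
    article does not state explicitly.
    """
    if not node_up_flags:
        return False
    return sum(node_up_flags) * 100 >= node_percent * len(node_up_flags)


def count_transitions(states):
    """Count up/down state changes, a rough proxy for alerts sent."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


if __name__ == "__main__":
    # Hypothetical check history for one node: intermittent failures.
    checks = [True, False, True, False, False, True, True]
    for retries in (1, 2):
        states = node_states(checks, retry_down=retries, retry_up=retries)
        print(f"retries={retries}: {count_transitions(states)} state changes")
```

With this made-up check history, the default of one retry yields four state changes, while requiring two consecutive results yields two: the isolated one-minute failure is ignored and only the two-minute outage is reported. The trade-off is that a real outage is declared later, by roughly one minute per additional retry.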