How can we control an alert that enters an alarm state and sends thousands of emails? For example, a group of servers hits an issue caused by a common event and, until the issue is resolved, constantly flaps between alert and recovery, sending thousands of emails.
There are 3 ways to handle this:
- If these alerts all stem from a common upstream issue, you can set up dependencies so that when the common event occurs, the dependent items don't alert until that sev 1 alert has cleared.
- An aggregation window on the contact group queues up emails and limits the number that are sent out.
- If the alert flips between alerting and clearing with each polling cycle, you can set your alerting rule to trigger only when the threshold has been breached for multiple consecutive polling cycles, such as 3 to 5 cycles (3-5 minutes with the standard 1-minute check).
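To make the last two options concrete, here is a minimal Python sketch of the logic behind them. It is an illustration only and not the product's configuration or API: the names `ConsecutiveBreachRule`, `aggregate`, `required`, and `window_seconds` are hypothetical, and the numbers (3 cycles, a 300-second window) are assumed values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConsecutiveBreachRule:
    """Hypothetical rule: alert only after `required` consecutive breaches."""
    threshold: float
    required: int = 3  # assumed value: 3 polling cycles
    _streak: int = field(default=0, init=False)
    _alerting: bool = field(default=False, init=False)

    def observe(self, value: float) -> Optional[str]:
        """Return "ALERT" or "RECOVERY" on a state change, otherwise None."""
        if value > self.threshold:
            self._streak += 1
            if not self._alerting and self._streak >= self.required:
                self._alerting = True
                return "ALERT"
        else:
            # Any good poll resets the streak, so a flapping check never fires.
            self._streak = 0
            if self._alerting:
                self._alerting = False
                return "RECOVERY"
        return None


def aggregate(events: List[Tuple[float, str]], window_seconds: float) -> List[str]:
    """Collapse (timestamp, message) events into one digest per window.

    A window opens at the first queued event and closes once an event
    arrives `window_seconds` or more after that opening timestamp.
    """
    digests: List[str] = []
    batch: List[str] = []
    window_start: Optional[float] = None
    for ts, msg in sorted(events):
        if window_start is not None and ts - window_start >= window_seconds:
            digests.append(f"{len(batch)} notifications: " + "; ".join(batch))
            batch = []
            window_start = None
        if window_start is None:
            window_start = ts
        batch.append(msg)
    if batch:
        digests.append(f"{len(batch)} notifications: " + "; ".join(batch))
    return digests
```

With a threshold of 90, a check that alternates between 95 and 10 on every poll never produces a notification from `ConsecutiveBreachRule`, while three breaches in a row produce a single "ALERT". Likewise, `aggregate` turns many events that fall inside one window into a single digest, which is the effect an aggregation window has on the number of emails sent.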
You may also be able to solve this with ruleset groups and expressions. For more details on how to accomplish this, contact firstname.lastname@example.org.