Alerting is key to any monitoring system. Collecting data allows you to perform historical analysis and identify the root cause of a problem after the fact, but to ensure the best possible customer experience and meet "always on" expectations, IT professionals need to be able to respond to problems in their environment as soon as possible. They need a monitoring system with flexible alerting tools that enable them to act on incoming data. That's why Circonus is expanding on its existing alerting capabilities.
Users will already be familiar with threshold-based alerts in Circonus, but now Circonus has a new Analytics Alerting feature that enables you to create alerts based on CAQL-transformed metrics. This opens up a wide range of advanced application scenarios for alerting. In this tutorial, we'll discuss just a few of these use case scenarios: SOA Monitoring, Velocity Alerting, and Anomaly Alerting.
SOA Monitoring with Histograms and Percentiles
Percentile Overlays are commonly used in Service Oriented Architecture (SOA) Monitoring in order to validate Service Level Agreement (SLA) requirements. Historical analysis of this data allows IT to identify the causes of performance spikes, and allows your company to plan ahead and create practical SLAs based on actual historical performance instead of arbitrary values. However, the ability to alert on this data in real-time is tremendously important for SOA Monitoring, because it allows IT to take rapid action to avoid or correct SLA violations as they occur.
Let's look at a practical example of how you could use Analytics Alerting to monitor API performance:
- First, we create or open a graph containing an API latency histogram metric.
- Next we select an appropriate percentile that reflects our performance expectations for the API. We can use percentile overlays to inspect the behavior of different percentiles over historic time windows. For this example, we'll assume this metric must stay below the 80th percentile.
Figure 1: API Latency Histogram
- Now we need to navigate to the rules page for that particular metric. The quickest way to do this is to hover over the information symbol (i) for the metric in the graph legend and click the "view" link for the metric. This will open a metric overview page displaying a list with a single item: this metric.
- On the metric overview page, click the menu button for the metric, and then the "Set Rules" link. This will display the Rulesets page.
Figure 3: Metric menu. Note the Set Rules link.
- On the Rulesets page for the metric, create the rule using the "Add Analytics Rule" button.
Figure 4: New Rulesets page. Note the new "Add Analytics Rule +" button.
- In this example, we will create a rule to alert when our API performance metrics rise above the 80th percentile. Select "Percentile Alerting" from the drop down and set the field to 80.
Figure 5: "Add an Analytics Rule" dialog
Saving the rule activates alerting, and we will receive an alert whenever the rule is violated in the future.
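To make the idea behind a percentile rule concrete, here is a minimal Python sketch. It is not Circonus's implementation: the bin layout, the function names `histogram_percentile` and `percentile_rule_violated`, and the latency limit in milliseconds are all assumptions made for illustration. It estimates a percentile from histogram bin counts and checks it against a limit.

```python
from typing import Dict


def histogram_percentile(bins: Dict[float, int], p: float) -> float:
    """Estimate the p-th percentile from a histogram.

    `bins` maps each bin's upper bound (for example latency in ms) to the
    number of samples that fell into that bin. The estimate is the upper
    bound of the first bin at which the cumulative count reaches p percent.
    """
    if not bins or not 0 < p <= 100:
        raise ValueError("need a non-empty histogram and 0 < p <= 100")
    total = sum(bins.values())
    if total == 0:
        raise ValueError("histogram contains no samples")
    target = total * p / 100.0
    seen = 0
    for upper in sorted(bins):
        seen += bins[upper]
        if seen >= target:
            return upper
    return max(bins)


def percentile_rule_violated(bins: Dict[float, int], p: float, limit: float) -> bool:
    """Hypothetical rule: fire when the p-th percentile exceeds `limit`."""
    return histogram_percentile(bins, p) > limit


# Made-up sample data: 100 requests, most of them fast.
latency_bins = {10: 50, 20: 30, 50: 15, 200: 5}
print(histogram_percentile(latency_bins, 80))          # 20
print(percentile_rule_violated(latency_bins, 80, 15))  # True
```

Because the check looks at a percentile rather than a mean, a handful of very slow requests in the 200 ms bin does not trigger the rule by itself, while a broad slowdown does.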
Read more about the applications of histograms in API monitoring in the ACM Queue article, Statistics for Engineers, by our Chief Data Scientist, Heinrich Hartmann. Werner Vogels, Amazon CTO, describes the importance of this kind of monitoring in his paper, Amazon's Dynamo, paragraph 2.1.
Velocity Alerting and Forecasting
Forecasting future data based on current trends is the holy grail of statistical analysis. The existing Capacity Planning overlays make predictions based on current data to enable your company to plan ahead for the growing needs of your IT environment. How soon will you need more disk space? When will you run out of memory? These are Capacity Planning questions.
Velocity Alerting creates alerts based on forecasted data, so you can be notified about what you need before you run out. Without alerting, you would need to check your forecasts regularly to plan ahead, but now you can set up a ruleset and tell Circonus, for example:
- "Alert me when I will be running out of disk space in 2 weeks."
- "Alert me when I will be out of memory in 2 hours."
- "Tell me if my disk usage will reach 1 TB in less than 5 days."
- "Tell me when my CPU load, across my Metric Cluster, will get to 80%."
Let's look at another practical example: using Analytics Alerting to create a Velocity Alert. For this example, we assume that we are monitoring resource usage (for example, disk space) with a numeric metric:
- We can examine forecasts of future values with a Capacity Planning overlay.
Figure 6: Resource Usage with Capacity Planning overlay
- We navigate to the Rulesets page for this particular metric, as described in the previous section.
- We add an Analytics rule. Click the "Add Analytics Rule +" button.
Figure 7: "Forecasted Value" options
- Select "Velocity Alerting" from the drop down and specify the following:
  - The amount of time we want to forecast into the future
  - The threshold value on which we want to alert
  - Optionally, the precise forecasting method that should be used. Leave this unspecified to let Circonus select the model that best fits the data.
- We save the rule in order to start receiving alerts.
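The two required settings above, a forecast horizon and a threshold, can be illustrated with a small Python sketch. This is only an assumption-laden illustration, not how Circonus forecasts: it uses a plain least-squares line as the model, and the names `linear_forecast` and `velocity_rule_violated` are hypothetical.

```python
from typing import Sequence


def linear_forecast(times: Sequence[float], values: Sequence[float], horizon: float) -> float:
    """Fit a least-squares line to (time, value) samples and extrapolate it
    `horizon` time units past the last sample."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples with matching lengths")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (times[-1] + horizon)


def velocity_rule_violated(times: Sequence[float], values: Sequence[float],
                           horizon: float, threshold: float) -> bool:
    """Hypothetical rule: fire when the forecasted value reaches `threshold`."""
    return linear_forecast(times, values, horizon) >= threshold


# Made-up sample data: disk usage in GB, sampled once per day.
days = [0, 1, 2, 3, 4]
used_gb = [500, 520, 540, 560, 580]
print(linear_forecast(days, used_gb, 5))              # 680.0
print(velocity_rule_violated(days, used_gb, 5, 650))  # True
```

The rule fires today even though current usage (580 GB) is still below the 650 GB threshold, which is the point of alerting on a forecast rather than on the current value.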
For more information about the forecasting functions, check out our article on Forecasting Values with CAQL.
Anomaly Alerting
Anomaly Detection Overlays highlight areas where collected data doesn't conform to expectations. They can be a huge time saver when conducting historical analysis to identify when things started to go wrong. Now you can receive alerts in real-time when there are anomalies in your data. This allows IT to respond rapidly to unexpected changes in your data stream, which could be signs of trouble, without necessarily specifying ahead of time what those changes would be. Instead of alerting on a specific threshold, you can choose to alert simply because the data deviates too much from what you expect.
Warning: Anomaly Detection has a high probability of generating false positives, especially in real-time data. You can mitigate this by lowering the sensitivity of your Anomaly Detection Overlays, but there is the potential to be flooded with unwarranted alerts.
Here's an example of how you could set up Anomaly Alerting. Let's assume that we're monitoring ping/dns latencies or request rates using a numeric metric.
- We examine anomalies using a graph overlay.
- We tune the sensitivity parameter until only events that we care about deeply are highlighted by the Anomaly Detection Overlay in the graph. These should be the sort of events we'd want to be alerted about in the future.
Figure 8: Anomaly Detection Overlay
- We navigate to the Rulesets page for this particular metric, as described above.
- Click the "Add Analytics Rule +" button.
- We select "Anomaly Alerting" from the drop down menu.
Figure 9: Anomaly Detection rule options
- Now we can review the rule and change the model or adjust the sensitivity parameter. Be careful, because a very high sensitivity can result in false positives that will spam your alerts.
Figure 10: Adjusting the sensitivity parameter
- We save the rule so we can begin receiving alerts.
Be aware that anomalies are not the same as faults, so you can expect anomaly alerts that do not point to failure cases. You can tune the sensitivity to help minimize those false positives.
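The trade-off between sensitivity and false positives can be seen in a toy Python sketch. It does not reproduce the Circonus anomaly detection model. It assumes a simple deviation-from-mean check, and both the function name `is_anomaly` and the mapping from a 0 to 100 sensitivity value to an allowed number of standard deviations are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomaly(history: Sequence[float], value: float, sensitivity: float) -> bool:
    """Flag `value` when it deviates too far from recent `history`.

    Hypothetical mapping: sensitivity 0 tolerates 5 standard deviations,
    sensitivity 100 tolerates only 1, so a higher sensitivity flags more points.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    if not 0 <= sensitivity <= 100:
        raise ValueError("sensitivity must be between 0 and 100")
    allowed_sigmas = 5.0 - 4.0 * sensitivity / 100.0
    center = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unexpected.
        return value != center
    return abs(value - center) / spread > allowed_sigmas


# Made-up sample data: recent ping latencies in ms, then a new reading of 13 ms.
recent = [10, 12, 11, 9, 10, 11, 10, 9]
print(is_anomaly(recent, 13, sensitivity=20))  # False
print(is_anomaly(recent, 13, sensitivity=80))  # True
```

The same reading of 13 ms is ignored at a low sensitivity and flagged at a high one, which is why tuning against the overlay first helps avoid a flood of alerts.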
For more information about Anomaly Detection in Circonus, refer to our user documentation.
Analytics Alerting with CAQL is a very flexible tool. In addition to the above examples, you can also alert on CAQL statements, which exposes additional parameters and options. The "Convert to CAQL" button allows you to convert any other type of Analytics Alert into a CAQL alert, or you can write your own CAQL statements from scratch. For more information about CAQL and its many functions, refer to the CAQL Reference Manual, or check out some of the articles about CAQL here on the Support Portal.
Check the step-by-step instructions in our User Manual to learn more about Alerting with Analytics.