Monitoring solutions give enterprise IT professionals a head start on outages and performance issues. One would expect alerts from a monitoring solution to be logical and important, but this isn’t always the case. To learn more about optimizing alerts for monitoring, we spoke with James Sparenberg, DevOps Engineer at Ericsson.
James has years of experience as a Linux Admin. He has worked in Docker, Puppet, monitoring, troubleshooting, and much more.
What are symptomatic alerts in monitoring?

Symptomatic alerting is monitoring and alerting designed to detect the symptoms of poor performance, not its exact cause. This ensures that any alert you receive is actionable, and that you catch every problem you have, not just the ones you were expecting.
Alerting is the first step in analyzing the data we gathered in our monitoring. Here we want the following:
- Alerts should be actionable – there should be an action, or a series of actions, attached to every alert. Purely informational alerts contribute to pager fatigue.
- Alerts should give you the four Ws – Who (the object name or other physical identifier), What (the event that triggered), When (the time it triggered), and Where (the location of documentation related to the monitor).
- Alerts should indicate that human analysis must happen. If the analysis can be done programmatically, then you should also write a program that fixes the problem automatically. Cut the human out of the loop and put the result in a report, not an alert.
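As a concrete, purely illustrative sketch, an alert payload carrying the four Ws might be assembled like this. The hostname, event text, and runbook URL are invented examples, not values from any real monitoring system:

```python
from datetime import datetime, timezone

def build_alert(who: str, what: str, when: datetime, where: str) -> dict:
    """Assemble an alert payload that carries the four Ws."""
    return {
        "who": who,                # object name or other physical identifier
        "what": what,              # the event that triggered
        "when": when.isoformat(),  # the time it triggered
        "where": where,            # location of documentation for the monitor
    }

# Hypothetical example values for illustration only.
alert = build_alert(
    who="db01.example.com",
    what="entries_per_minute below threshold",
    when=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    where="https://wiki.example.com/runbooks/db-low-entries",
)
```

Whatever tooling you use, the point is that a receiver should never have to go hunting for any of the four fields.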
Actionable alerts are the first item. Take a database where you monitor the number of entries per minute: your monitor reports the number of entries recorded in the last minute, which gives you three conditions of note – a low, high, or average number of entries. Should you create an alert if the number of entries hits a high rate? Probably not. Handling spikes is what you built the system for. At 3am, do you care if people are suddenly ordering the new iWidget for their holiday shopping? Yes, if you oversee logistics for delivery. But at 3am you aren't going to stop people from shopping (or at 3pm, for that matter), so why alert and wake someone up? Simply log it, chart it, and use it later for capacity planning and growth.
Low (or no) entry alerts
Now comes the second alert possibility: low (or no) entries. This could be coincidence – no one happens to be up at 3am this morning. It could be that your threshold is too high (action: adjust the threshold). It could be the frontend not sending data to your DB because the overseas team put a typo into the code they just pushed (it's their afternoon). A low number of entries requires you to take action and analyze why the number has gone south. This is an actionable alert, and the kind we want to be woken up for. (Note: this is likely an error that should have been caught by CI/CD, but that's another topic.)
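A minimal sketch of this policy in Python, with thresholds invented for illustration: low entry counts page a human, high counts are only logged for capacity planning, and everything in between is fine.

```python
# Illustrative thresholds for entries per minute; tune to your real baseline.
LOW_THRESHOLD = 50
HIGH_THRESHOLD = 5000

def evaluate(entries_last_minute: int) -> str:
    """Decide what to do with the latest entries-per-minute sample."""
    if entries_last_minute < LOW_THRESHOLD:
        return "page"   # actionable: someone must analyze why entries dropped
    if entries_last_minute > HIGH_THRESHOLD:
        return "log"    # record it and use it for capacity planning later
    return "ok"         # normal range: nothing to do
```

Note the asymmetry: only the low condition wakes someone up, because only the low condition demands analysis now.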
Alerts that require action (tests, log checks, etc.) don't let people simply acknowledge and go back to sleep without fully comprehending the alert. Instead, they push people to find solutions that prevent the next alert of this kind, or that otherwise improve the performance of the system.
Monitor everything, alert selectively
This follows the credo of monitor everything, alert only on the things that need action taken. We can create a graph of the data entry rate over time, see that we are peaking at 90% of capacity for four hours a day, and find a way to automate a capacity increase so that we never stay above 70% for longer than 10 minutes – and we should. What we don't want is for the method of increasing capacity to be our admin building a new DB server and putting it online at 3am.
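A rough sketch of automating that trigger, assuming one utilization sample per minute; the window size and threshold are the figures above, and the function name is a made-up placeholder:

```python
from collections import deque

WINDOW = 10        # consecutive one-minute samples to consider
THRESHOLD = 0.70   # utilization fraction we don't want to sustain

samples: deque = deque(maxlen=WINDOW)

def record(utilization: float) -> bool:
    """Record a sample; return True when capacity should grow automatically.

    Triggers only when the window is full and every sample in it
    exceeds the threshold, i.e. load has been sustained, not a spike.
    """
    samples.append(utilization)
    return len(samples) == WINDOW and min(samples) > THRESHOLD

# Twelve minutes at 75% utilization: the trigger fires once the
# tenth consecutive over-threshold sample arrives.
triggered = [record(u) for u in [0.75] * 12]
```

The return value would feed an automated scaling action, not a pager.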
Conversely, at 3am, zero data input immediately affects the company's bottom line and requires action now to resolve the problem. Customers are affected. Customers drive your reason to work. Action is required.
Let’s take a look at the two ways in which we could set up our alert.
- Event based – we monitor to see whether the URL the frontend has is correct for the DB, then act.
- Symptom based – we monitor for the symptom that indicates a problem (low or no data entries in X timeframe), then act.
Both cases are actionable events, so that requirement is met. The problem is with the first item: it asks the monitor to identify a specific cause rather than a symptom, violating the analysis rule above. It would also require you to add a rule for each of the other possible causes of lost entries between frontend and backend, with more time spent writing and testing each and every possibility. More rules mean more load on your alerting system, since each rule has to be run against the data set for every event until you either run out of rules or find a match. This loads up your monitoring system and delays your ability to get the problem solved, increasing the loss of income.
In symptom-based monitoring you would instead monitor for three events – for example, 500 errors on the app servers, a low data entry rate on the DB, and backend errors in the logs – and alert on two of them (500 errors and the low data rate), possibly on backend errors as well, depending on the setup. That is three rules in total, which cover every possible cause, instead of one rule for each possible cause.
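As an illustration, those three symptom rules might be expressed as simple predicates. The function names and thresholds here are assumptions for the sketch, not values from the interview:

```python
# Three symptom rules that together cover every cause of lost entries,
# rather than one rule per possible cause.

def app_500_rate_high(rate_per_min: float) -> bool:
    """Alert: users are seeing errors, whatever the underlying cause."""
    return rate_per_min > 1.0   # illustrative threshold

def db_entry_rate_low(entries_per_min: int) -> bool:
    """Alert: data has stopped arriving at the DB."""
    return entries_per_min < 50   # illustrative threshold

def backend_log_errors_present(error_count: int) -> bool:
    """Report or alert, depending on your setup."""
    return error_count > 0

# Example evaluation: healthy app tier, starved DB, clean backend logs.
results = [
    app_500_rate_high(0.2),
    db_entry_rate_low(3),
    backend_log_errors_present(0),
]
```

A typo in the frontend's DB URL, a dead network path, or a crashed connection pool all surface through the same three checks.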
The only thing worse than having an alarm go off at 3am is not having one go off when it should have. If you monitor for a specific problem, you will catch that problem – but you will miss the same problem condition when it arises from a different cause.
At this point, one hopes that you have a well-formed alert. It tells people when and where the problem is, and correctly points them at any documentation related to the alert. That documentation then becomes the source of the exact cause(s) of the problem and the instructions for handling it.
The TL;DR of all this: monitor for the conditions and symptoms of a non-performant environment rather than trying to write individual rules for every possible cause of a non-performant system. The end result is a smaller rule set that covers all causes, does so in less time, and leaves a smaller footprint on your network of applications.