Eavan's Blog

Alerts: You need a budget!

Published 2025-11-23

Alert fatigue is a spectre haunting DevOps teams. We all know that the spectre is there. In our post-mortems, we all acknowledge its impact on detection times and burnout. Yet, as an industry, we have no structural solution; we rely on our burnt-out colleagues to tune, improve, ignore, and delete alerts once the effects of alert fatigue have already set in. If you have ever felt that your alerts divide into "Alerts I can ignore" (warning) and "Alerts I cannot ignore" (critical), then you are well aware that your sanity is finite. You need a budget!

Let me take a step back and focus on why alerts are a necessary evil. This need might be obvious to most readers. So perhaps the best way to illustrate this necessity is to imagine the extreme case of a world with no alerting. Issues happen as normal. Either you notice them or you don't. Some issues might resolve on their own, and others will remain broken until someone notices and intervenes.

We want our services to work for our users. If we take action quickly when issues appear, then our services will have higher availability. We can then reason that the value of an individual alert is the potential service degradation avoided or reduced by taking action on that alert. For some alerts, this value can be large enough that it would be unthinkable not to implement them.

The challenge facing us is that we can easily imagine the value of alerts in monetary terms, but we rarely think about the price we are paying for that value until we are deep in debt. What should happen is that we compare the estimated value to the estimated cost, and if the value-to-cost ratio is too low, the alert is disabled.

In my experience, suggesting that spammy but otherwise functioning alerts should be disabled strikes fear into the hearts of DevOps practitioners. However, there is an unavoidable limit to how many alerts you can reasonably investigate. Beyond that limit, alerts will be ignored. But you don't know which ones.

When will you start ignoring alerts?

Let's consider that you are a highly focused and untiring DevOps practitioner and calculate at what point you must start ignoring some alerts. We will use this to establish an upper bound for your alert budget. Let's assume that you can solve all issues instantaneously and that alerts arrive at regular intervals. Let's also assume that you do not lose focus when switching from investigating one alert to another. We can calculate an upper limit for the alerts you can tackle during work hours by simply dividing the time committed to effective on-call work by the time it takes to investigate an alert. If you have multiple people on-call, then we multiply this limit by the number of people on-call. E.g., if you are the only person fully dedicated to on-call, you can work effectively for 6 hours in a day, and an alert takes 10 minutes to investigate, then: \[ 1 \, \mathrm{person} \cdot \frac{ 6 \,\frac{\mathrm{hours}}{\mathrm{person}} \cdot 60 \,\frac{\mathrm{min}}{\mathrm{hour}} }{ 10 \,\frac{\mathrm{min}}{\mathrm{alert}} } = 36 \;\mathrm{alerts} \] You can handle at most 36 alerts during the workday before some alerts start being ignored. See what the limit would be for your case below.

Your "Work Hours" Alert Upper Limit

1
5
8
Limit: 0 Alerts
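If you'd rather compute this than fiddle with sliders, here is a minimal sketch of the same calculation in Python; the example values are the ones from the text, and the function and parameter names are mine.

```python
def work_hours_alert_limit(people_on_call: int,
                           effective_hours_per_person: float,
                           minutes_per_alert: float) -> float:
    """Upper bound on alerts that can be investigated during work hours."""
    minutes_available = people_on_call * effective_hours_per_person * 60
    return minutes_available / minutes_per_alert

# Example from the text: 1 person, 6 effective hours, 10 minutes per alert.
print(work_hours_alert_limit(1, 6, 10))  # 36.0
```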

For me, I get a number somewhere in the mid double digits. But we made some pretty generous assumptions. Recall that we assumed that all issues could be solved instantly and that alerts come at regular intervals. In reality, this is never the case, which means that if we were at the alert limit, we would:

  1. Have no capacity to solve issues since all our time is going into investigating alerts as they come in.
  2. No longer be working at "reasonable" efficiency, e.g. due to mental overload or burnout.
  3. Fail to investigate some alerts that come in clusters.

The problems mentioned above still mostly hold even if we are just shy of the limit. So, if we claim that the value of alerts comes from the increase in availability that they cause, then adding more alerts when we are close to the alert limit actually reduces the total value of all alerts. This is because the time spent solving issues tends to zero as we approach the alert limit, and solving issues is what increases the availability of our services. Hence, I claim that we maximize the value of our current alerts by consistently staying far from the alert limit.

However, this only takes into consideration alerts that arrive during work hours. For alerts that arrive outside of work hours, we can't reuse the daytime limit. Let's again assume that all issues can be resolved instantly and that alerts arrive at regular intervals. But now we also need to think about your need to eat and sleep. The general idea of the calculation is the same: we divide the total time you can consistently commit to solving alerts outside of work hours by the time it takes to investigate an alert. However, most people need to sleep in connected stretches. Sleeping for 8 hours with twelve 5-minute interruptions is not the same as sleeping for 7 hours. So the number of sleep interruptions you can handle also places a limit on the number of alerts that can come at night.

Let's consider an example. Let's say you are working alone, and it now takes you 15 minutes to investigate an alert because you need time to stop what you are doing and get back into a work context. You need 8 hours of sleep. You can consistently dedicate 2 hours to active on-call outside of work hours. But you can be woken up at most 2 times per night and stay sane. Then the alert limit is as follows: \[ 1\,\mathrm{person} \cdot\, \min \left( \frac{ 2 \,\frac{\mathrm{hour}}{\mathrm{person}} \cdot 60 \,\frac{\mathrm{min}}{\mathrm{hour}} }{ 15 \,\frac{\mathrm{min}}{\mathrm{alert}} },\, \frac{2 \,\frac{\mathrm{alert}}{\mathrm{person}}}{50\%} \right ) = 4\;\mathrm{alerts} \] In this case, you can only consistently handle 4 alerts outside of work hours. The constraining factor is your need to sleep. The \(50\%\) comes from the fact that if you sleep 8 hours, then sleep takes up about \(50\%\) of your out-of-work hours, and we divide by it to extrapolate to the full out-of-hours period. See what limit you get for yourself below.

Your "Out of Hours" Alert Upper Limit

1
5
8
8
8
Limit: 0 Alerts
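The out-of-hours version is sketched below, assuming an 8-hour workday so that the out-of-hours period is 16 hours; that is where the \(50\%\) sleep fraction in the example comes from. As before, the names are mine and the defaults are the example values.

```python
def out_of_hours_alert_limit(people_on_call: int,
                             active_hours_per_person: float,
                             minutes_per_alert: float,
                             sleep_hours: float,
                             max_wakeups_per_person: int,
                             out_of_hours_period: float = 16.0) -> float:
    """Upper bound on out-of-hours alerts: the tighter of the time budget and
    the sleep-interruption budget (scaled from the sleeping fraction of the
    out-of-hours period up to the whole period)."""
    time_bound = active_hours_per_person * 60 / minutes_per_alert
    sleep_fraction = sleep_hours / out_of_hours_period
    sleep_bound = max_wakeups_per_person / sleep_fraction
    return people_on_call * min(time_bound, sleep_bound)

# Example from the text: 2 active hours, 15 min/alert, 8h sleep, 2 wake-ups.
print(out_of_hours_alert_limit(1, 2, 15, 8, 2))  # 4.0
```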

For me, this is a single-digit number. Of course, this is a very simple calculation that you could argue with and tweak. But the same general argument applies: beyond the alert limit (however you want to calculate it), alerts have no value. If there are more alerts than that, you will not be able to effectively investigate them, let alone solve the underlying issues. Some alerts will be ignored. If you filled in the numbers in good faith, you should agree that these limits should not be regularly violated. So let's get to budgeting!

How to budget?

Since value is maximized somewhere far from the limit, we should set the budget as some fraction of that limit, perhaps a third. The exact fraction depends on your team's psychological resilience, but the risks of overestimating it are higher than those of underestimating it.
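As a toy illustration, applying a one-third fraction to the two example limits from above:

```python
# Sketch: derive budgets from the limits computed above, using the suggested
# fraction of one third (tune the fraction to your team's resilience).
BUDGET_FRACTION = 1 / 3

work_hours_budget = int(36 * BUDGET_FRACTION)    # 36-alert limit -> budget of 12
out_of_hours_budget = int(4 * BUDGET_FRACTION)   # 4-alert limit  -> budget of 1
print(work_hours_budget, out_of_hours_budget)    # 12 1
```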

Once you've determined a number for the budget, the question remains: how do you make it effective? Below are a couple of suggestions.

Automatic Escalation

If the budget is exceeded, then additional help is brought in to deal with the alert overload. This is relatively low risk, but requires that a person be designated as reserve on-call.
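A minimal sketch of such routing logic in Python; the `page` function, the per-cycle counter, and the budget value are placeholders for whatever paging setup you actually run, not a real API:

```python
def page(target: str, alert: dict) -> None:
    # Stand-in for your real paging integration (hypothetical).
    print(f"paging {target}: {alert['name']}")

ALERT_BUDGET = 12  # e.g. roughly a third of the 36-alert work-hours example

def route_alert(alert: dict, alerts_seen_this_cycle: int) -> None:
    """Page the primary as usual; pull in the reserve once the budget is exceeded."""
    page("primary-on-call", alert)
    if alerts_seen_this_cycle > ALERT_BUDGET:
        page("reserve-on-call", alert)

route_alert({"name": "HighErrorRate"}, alerts_seen_this_cycle=13)
```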

Budget Violation Triggers Cleanup

Often, we have a general sense of which alerts have the potential to be spammy. We can use the budget to be proactive in improving these alerts. If the budget is violated, then alert cleanup becomes a task for the team.

This means that alert cleanup happens before alert volumes come close to the alert limit. But it doesn't do much to immediately help the person on-call.

Warning Suppression

If the budget is violated, only check critical alerts. This often happens unofficially. If there is a flood of alerts, checking the critical alerts first is perfectly sensible. In many cases, there will not be time to go back to check the warning alerts. Making it the official policy can streamline this workflow and remove distractions. The on-call person knows they are supported in focusing only on what's of highest importance: Resolving the critical alerts.
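Here is a sketch of what that policy could look like, assuming each alert carries a severity label and you track how many alerts the current cycle has already produced (names are illustrative):

```python
ALERT_BUDGET = 12

def surfaced_alerts(alerts: list[dict], alerts_seen_this_cycle: int) -> list[dict]:
    """Under budget: surface everything. Over budget: surface only criticals."""
    if alerts_seen_this_cycle <= ALERT_BUDGET:
        return alerts
    return [a for a in alerts if a["severity"] == "critical"]

incoming = [{"name": "DiskFillingUp", "severity": "warning"},
            {"name": "ServiceDown", "severity": "critical"}]
print(surfaced_alerts(incoming, alerts_seen_this_cycle=20))  # only ServiceDown
```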

Of course, the downside is that this does nothing if there is an alert flood of critical alerts. So this should probably be combined with some other method.

CI/CD Budget Warnings

If you manage your alerts as code, then when new alerts are proposed in a PR, you could backtest them against previous duty cycles to see how many alert firings they would have added. Reviewers can then compare this to the current state of the alert budget and the value that the new alert brings.

For example, if the assumed value of a new alert is high and there is budget remaining, then the alert is added. If the assumed value is high, but there is no budget remaining, then reviewers could suggest tweaking thresholds on lower value alerts to free up more budget for the newer, higher value alert.
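I don't know of a standard tool for this, but the core of such a check could be as small as the sketch below: replay the proposed rule's threshold over historical metric samples and count distinct firings. The sample format and the "sustained for N minutes" firing condition are assumptions for illustration.

```python
# Hypothetical backtest: count how often a proposed alert rule would have
# fired over past duty cycles. `history` is assumed to be per-minute samples
# of the metric the rule watches; a real setup would query a TSDB instead.
def backtest_firings(history: list[float], threshold: float,
                     for_minutes: int) -> int:
    """Count distinct firings: threshold breached for `for_minutes` in a row."""
    firings, streak, firing = 0, 0, False
    for sample in history:
        if sample > threshold:
            streak += 1
            if streak >= for_minutes and not firing:
                firings, firing = firings + 1, True
        else:
            streak, firing = 0, False
    return firings

# Toy example: latency samples, alert on > 500 ms sustained for 3 minutes.
samples = [120, 510, 530, 560, 540, 130, 125, 610, 620, 630, 640, 100]
print(backtest_firings(samples, threshold=500, for_minutes=3))  # 2
```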

Getting to a Better Place

I doubt any of the above will turn a horrible on-call situation into a pleasant one. But I think that setting an explicit budget turns a bad experience into a quantifiable "problem" which can be worked on. If we can say something like "We went over our alert budget, let's see what we can improve", then that will lead to more productive discussions than simply "Oh man, the last on-call cycle was rough".

I also don't expect that budgets will never be violated or that alert floods can be fully prevented. But we should pay more attention to these problems instead of treating them as a fact of life. Just like a financial budget doesn't magically create more money, an alert budget won't magically create stable applications. What it does is guide you toward better decisions.