Waking People Up for Nothing: The Hidden Cost of Bad Alerting


Let’s set the scene. Your company has asked your team to implement alerting. You added monitoring, and now the product owner wants the on-call engineer to be notified whenever something goes wrong. You implemented an alert that triggers whenever a Kubernetes pod fails. However, pod failures can occur for many different - and often non-critical - reasons, so the on-call person receives alerts every day throughout their shift.

After being paged 10 times for minor or self-healing issues, the 11th alert will not trigger the same sense of urgency. Your colleague becomes fatigued by the constant interruptions and, over time, may start ignoring calls, assuming, “It’s probably the same issue again and will resolve itself.” This is a classic example of alert fatigue — and the alerts begin to resemble the story of the boy who cried wolf.

What is the issue?

Alert fatigue sits at the intersection of human psychology and IT architecture. When alerts are poorly configured or too noisy, people quickly become desensitized to them - much like the beeping of reversing vehicles that fades into the background over time. What initially signals danger gradually turns into ignorable noise.

This phenomenon can be dangerous and, in some domains, even life-threatening. Consider a nurse who stops reacting promptly to medical device alarms because they sound every few minutes. When real emergencies are indistinguishable from routine noise, the risk of missing critical events increases significantly.

Who is at fault?

It is tempting to blame the people - ignoring an alert sounds like negligence, and that is the first common-sense reaction. But individuals are rarely the problem when it comes to alert fatigue. The real culprit is the design of the alerting itself.

How does the issue creep in?

Alert fatigue does not appear overnight. It gradually creeps in when alert quality and signal-to-noise ratio are not actively monitored and managed. Without deliberate ownership and regular tuning, even well-intentioned alerting systems tend to become noisy over time.

Alerts based on symptoms and impact

Alerting an engineer because 60% of pods are unhealthy may sound reasonable at first glance — but is it actually actionable? Kubernetes is designed to maintain service availability by automatically replacing failed pods and distributing traffic across healthy instances. Partial pod churn is therefore expected behavior in many production systems.

Events such as traffic spikes or even some DDoS patterns can temporarily increase pod failures, especially before upstream protections (firewalls, proxies, rate limits) fully engage. In many cases, the user-facing service remains healthy despite significant pod turnover.

From an alerting perspective, the key question is not “Are pods failing?” but “Is user experience degraded?” If the application is still serving traffic within acceptable latency and error-rate thresholds, paging someone in the middle of the night is likely unnecessary and contributes to alert fatigue. Alerts should focus on symptoms that matter to users, not merely on internal turbulence that the platform is designed to absorb.
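The "is the user impacted?" question can be made explicit in code. Below is a minimal sketch of symptom-based paging logic; the metric names and thresholds are illustrative assumptions, not recommendations:

```python
# Symptom-based paging: page on user-facing impact, never on pod churn alone.
# Thresholds are hypothetical examples - tune them to your own SLOs.

def should_page(error_rate: float, p99_latency_ms: float,
                unhealthy_pod_ratio: float) -> bool:
    """Decide whether to page a human, based on user-facing symptoms."""
    ERROR_RATE_THRESHOLD = 0.05   # 5% of requests failing
    LATENCY_THRESHOLD_MS = 1500   # p99 latency budget in milliseconds

    # unhealthy_pod_ratio is deliberately ignored here: Kubernetes is
    # expected to absorb pod turnover without user-visible impact.
    return (error_rate > ERROR_RATE_THRESHOLD
            or p99_latency_ms > LATENCY_THRESHOLD_MS)

# 60% of pods unhealthy, but users are fine: no page.
print(should_page(error_rate=0.01, p99_latency_ms=300, unhealthy_pod_ratio=0.6))
# Error rate breaches the budget: page.
print(should_page(error_rate=0.12, p99_latency_ms=300, unhealthy_pod_ratio=0.0))
```

The design choice is the point: internal signals like pod health can feed dashboards and warnings, but only symptom thresholds should ever wake a person.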

Lack of context in the alert

What do your alert names look like? If they resemble any of the examples below, you may have a problem:

  • Alert! Pod OOMKilled
  • App is running slow
  • Database connection lost
  • Error in app X

At first glance, these look reasonable — but they lack critical context. They don’t answer the questions an on-call engineer immediately has:

  • Is the user impacted?
  • How severe is the outage?
  • What action is expected right now?

Effective alerts should be explicit, actionable, and user-focused. A stronger alert name might look like:

App X unavailable - global service outage. Check database connectivity.

Good alert titles help the responder quickly understand impact and next steps, reducing cognitive load during incidents and helping prevent alert fatigue. It may look similar to the earlier examples, but let’s break it down:

App X unavailable → Clearly states what is happening
Global service outage → Immediately communicates the scope and who is affected
Check database connectivity → Points to the most likely area that requires investigation

Together, this alert title answers the three critical questions an on-call engineer has within seconds: what is broken, how big the impact is, and where to start looking.

Keep alert titles short and self-descriptive. Alerts are delivered through many channels - push notifications, SMS, Slack, or phone calls - often in high-stress situations. A well-crafted title ensures that even when someone glances at a smartwatch notification, they can understand in a few words both the nature and the severity of the problem.
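The what / scope / hint structure above can be enforced rather than remembered. Here is a small sketch of a title builder; the `Alert` class and its field names are hypothetical, not part of any real alerting tool:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """A hypothetical alert whose title always answers the three questions."""
    what: str   # what is broken
    scope: str  # how big the impact is / who is affected
    hint: str   # where to start looking

    def title(self) -> str:
        # Force every alert into the "what - scope. hint." shape,
        # so no field can be forgotten when the alert is defined.
        return f"{self.what} - {self.scope}. {self.hint}."

alert = Alert(what="App X unavailable",
              scope="global service outage",
              hint="Check database connectivity")
print(alert.title())
# App X unavailable - global service outage. Check database connectivity.
```

Making the three fields mandatory at definition time is a cheap structural guard: a vague title like "Error in app X" simply cannot be expressed.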

Every issue is critical

In my experience, this happens to almost every team at the beginning of the alerting journey. You initially configure alerts only for critical errors - but nothing stops you from adding a lower-severity alert that anticipates the critical one before it fires.

This is a good idea, but it needs refinement so it does not create a bigger problem than the one it solves. Start with a warning alert limited to a Slack channel - engineers who are off duty simply will not see it. If that Slack warning reliably predicts a larger outage, you can promote it to SMS - but never deliver it through the same channels as critical alerts. Reserve critical notifications and phone calls for the biggest failures.
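This severity-to-channel mapping can be captured in a simple routing table. The channel names below are placeholders for whatever your paging tool actually supports:

```python
# Hypothetical severity -> delivery-channel routing table.
# Warnings stay in chat; only criticals may wake a person.
ROUTES = {
    "warning":  ["slack"],                 # visible, but never pages anyone
    "critical": ["slack", "sms", "call"],  # reserved for real outages
}

def channels_for(severity: str) -> list[str]:
    # Unknown severities fall back to the quiet channel on purpose:
    # a misconfigured alert should never accidentally page someone.
    return ROUTES.get(severity, ["slack"])

print(channels_for("warning"))   # ['slack']
print(channels_for("critical"))  # ['slack', 'sms', 'call']
```

The fallback direction matters: when in doubt, route to the low-noise channel, so escalation to paging is always a deliberate decision.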

Psychology

The most critical dimension of alerting is the human one. A well-designed alert should create an immediate, justified sense that human attention is truly required - even in the middle of the night. The scenarios below are not hypothetical; they emerge naturally in teams experiencing alert fatigue.

Reduced trust

After receiving the same low-value alert ten times in a week, responders begin to mentally downgrade its importance. The next occurrence is likely to be ignored because past experience suggests it is not actionable. Once trust in an alert erodes, restoring it is difficult - which is why alerts must never become routine noise.

Slower response

An SMS arrives from the alerting system. However, earlier that day a similar alert turned out to be a false positive. Instead of reacting immediately, the responder delays investigation. Alert fatigue does not just cause missed alerts - it increases mean time to acknowledge (MTTA) and response.

Burnout and stress

Engineers on call should feel confident, not anxious. If someone spends Friday afternoon anticipating avoidable pod crashes and inevitable pages, the alerting system is already failing the team. Persistent noisy paging directly harms morale and makes on-call rotations harder to staff and sustain.

Real cost of improper alerting

As the scenario at the start of this post shows, this phenomenon causes alerts to no longer be taken seriously - and that impacts the company in real, measurable ways:

  • Fixes are delayed, so outages last longer than they need to
  • Responders waste time investigating healthy components because the alert gave no hint of what actually happened
  • Stakeholders lose confidence in alerting because it fires for every single error

Rethinking what an alert really is

Alerts are requests for immediate human intervention in genuinely critical situations - this is the core message of this post. If a problem is likely to resolve automatically or does not require timely human action, it should not page a person.

High-quality alerts share a few essential properties: they are self-explanatory, actionable, rare, user-impacting, and reserved for truly critical conditions. If your alerts violate any of these principles, it is a strong signal that your alerting strategy needs refinement.

Design alerts for humans

Because there is always a human responder on the receiving end, design alerts with clarity and empathy. Carefully consider what information must appear first, what language is most unambiguous, and how to ensure shared understanding across the team.

The alert title should communicate impact and severity at a glance. The alert description is the right place for supporting context - for example:

  • direct links to relevant dashboards and logs
  • recent deployment or configuration changes
  • runbook references or first troubleshooting steps

Well-structured alerts reduce cognitive load and help responders move from notification to diagnosis as quickly as possible.
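Putting the title and the supporting context together, a well-structured alert payload might look like the sketch below. Every URL, field name, and change entry is an illustrative assumption:

```python
# A hypothetical alert payload: title answers "what / scope / hint",
# description carries everything the responder needs one click away.
alert = {
    "title": "App X unavailable - global service outage. Check database connectivity.",
    "description": {
        "dashboard": "https://grafana.example.com/d/app-x",      # placeholder URL
        "logs": "https://logs.example.com/app-x?last=15m",       # placeholder URL
        "recent_changes": ["app-x v2.4.1 deployed 40 min ago"],  # example entry
        "runbook": "https://wiki.example.com/runbooks/app-x-db", # placeholder URL
        "first_steps": [
            "Check database connection pool saturation",
            "Verify the database host is reachable from the cluster",
        ],
    },
}

# The title alone must make sense on a smartwatch; the description
# is what the responder opens once they are at a keyboard.
print(alert["title"])
```

Keeping links and first steps in the description rather than the title preserves the at-a-glance readability the previous section argued for.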

Review your alerting periodically

Set aside dedicated time each month to review which alerts fired and whether they were truly necessary. Challenge their continued value and analyze what would have happened if no one had responded. This kind of review helps identify noisy, low-value alerts.

Make this a recurring forum that includes the on-call engineers. Encourage responders to share whether the alert led to meaningful action or could have been handled automatically. These discussions build shared ownership of alert quality.
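The monthly review can be backed by a simple metric: for each alert, what fraction of its firings actually required human action? Here is a sketch using a hypothetical, hand-collected review log:

```python
from collections import Counter

# Hypothetical review data gathered during the monthly forum:
# (alert_name, was_actionable) pairs for every page that fired.
review_log = [
    ("PodOOMKilled", False), ("PodOOMKilled", False), ("PodOOMKilled", True),
    ("AppXUnavailable", True), ("PodOOMKilled", False),
]

fired = Counter(name for name, _ in review_log)
actionable = Counter(name for name, acted in review_log if acted)

for name in fired:
    rate = actionable[name] / fired[name]
    flag = "  <- candidate for tuning or removal" if rate < 0.5 else ""
    print(f"{name}: {actionable[name]}/{fired[name]} actionable ({rate:.0%}){flag}")
```

An alert that is actionable in only a quarter of its firings is, by the definitions in this post, mostly noise - the review data makes that visible instead of anecdotal.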

Alerting is not a one-time implementation - it is an evolving system that must continuously adapt to the needs of the service and the team. While setting up alerts is relatively easy, designing a high-signal, low-noise alerting system is a discipline that takes time and deliberate practice to master.

💡
If your pager went off right now, would you trust that it truly requires your immediate attention?