Jun 25, 2025 | 3 min read

DARE to question your alerts?

Effective alerting isn’t about how many alerts you configure. It’s about how many actually help you detect real issues. Too many alerts, without measurement or refinement, create noise. This noise overwhelms teams, hides actual problems, and leads to burnout.

The Trap: Set-and-Forget Alerting

Many teams inherit or configure alerts based on gut instinct or past incidents and move on. But alerting isn't a one-time task. It needs constant feedback and tuning.

To improve it, treat alerting as a system, not a checklist.

Build a feedback loop using DARE:

Good alerting isn’t static. It evolves. To improve it, your system needs to collect data about the alerts themselves: when they fire, how often, and whether they are actually useful.

Start by collecting alert metadata, such as what fired, when, and how often, and route it to your observability stack. This could be your incident management tool, a log analytics system, or a custom dashboard.
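As a concrete starting point, here is a minimal sketch of that collection step: a small webhook receiver that appends alert metadata to a JSON-lines file. It assumes the Prometheus Alertmanager webhook payload shape; the endpoint, port, and file path are illustrative, and in practice you would route the same fields into whatever log analytics system or dashboard you already use.

```python
# Minimal sketch: record alert metadata as JSON lines.
# Assumes the Prometheus Alertmanager webhook payload format; adapt the
# field names to whatever your alerting tool actually sends.
import json
import time
from flask import Flask, request

app = Flask(__name__)
LOG_PATH = "alert_events.jsonl"  # hypothetical destination; could be any log sink

@app.route("/alert-webhook", methods=["POST"])
def record_alerts():
    payload = request.get_json(force=True)
    with open(LOG_PATH, "a") as f:
        for alert in payload.get("alerts", []):
            f.write(json.dumps({
                "received_at": time.time(),
                "name": alert.get("labels", {}).get("alertname", "unknown"),
                "status": alert.get("status"),        # "firing" or "resolved"
                "starts_at": alert.get("startsAt"),
                "labels": alert.get("labels", {}),
            }) + "\n")
    return {"ok": True}

if __name__ == "__main__":
    app.run(port=9095)
```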

Once you have this data, start asking:

  • Which alerts fire most often?

  • Are they acknowledged or ignored?

  • Do they lead to action, or just add noise?

  • Do they correlate with real incidents?

This isn’t extra work. It is an essential aspect of alert hygiene. It’s how you turn alerting into a system that gets better over time.
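Answering those questions can be as simple as counting. A rough sketch, assuming the JSON-lines file produced by the collection sketch above (the field names are that sketch's assumptions, not a standard):

```python
# Minimal sketch: summarize logged alert metadata to spot the noisiest alerts.
import json
from collections import Counter

def summarize(log_path="alert_events.jsonl", top_n=10):
    fires = Counter()
    resolved = Counter()
    for line in open(log_path):
        event = json.loads(line)
        if event.get("status") == "firing":
            fires[event["name"]] += 1
        elif event.get("status") == "resolved":
            resolved[event["name"]] += 1

    # Alerts that fire constantly but rarely lead to action are the first
    # candidates for tuning, suppression, or deletion.
    print(f"{'alert':40s} {'fired':>7s} {'resolved':>9s}")
    for name, count in fires.most_common(top_n):
        print(f"{name:40s} {count:7d} {resolved[name]:9d}")

if __name__ == "__main__":
    summarize()
```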

Now, measure alerts as part of your reliability metrics.

Most teams already track RED metrics: Request rate, Error rate, and Duration. But one key signal is missing: Alerts.

Add A for Alerts per unit time to complete the picture. We call this the DARE framework:

DARE = Duration, Alerts, Request rate, and Error rate

Frequent alerts are often signs of instability, and measuring them helps you get a clearer view of your service’s health.
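To make the idea concrete, here is a minimal sketch of a DARE summary for one service over a time window. The inputs are plain counts and latencies you would pull from your own metrics backend; the function and field names are illustrative, not a prescribed API.

```python
# Minimal sketch: report DARE (Duration, Alerts, Request rate, Error rate)
# for a service over a time window.
from dataclasses import dataclass

@dataclass
class DareReport:
    duration_p95_ms: float   # D: request duration (95th percentile)
    alerts_per_hour: float   # A: alerts fired per unit time
    requests_per_sec: float  # R: request rate
    error_rate: float        # E: fraction of requests that failed

def dare_report(latencies_ms, alert_count, request_count, error_count, window_sec):
    latencies = sorted(latencies_ms)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    return DareReport(
        duration_p95_ms=p95,
        alerts_per_hour=alert_count / (window_sec / 3600),
        requests_per_sec=request_count / window_sec,
        error_rate=error_count / request_count if request_count else 0.0,
    )

# Example: one hour of data
print(dare_report([120, 95, 300, 210], alert_count=12, request_count=36000,
                  error_count=180, window_sec=3600))
```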

Real-World Examples

CPU Alert During Backups

Scenario:

A team had set an alert for high CPU usage across their infrastructure. The alert fired 200+ times a day, but it wasn’t tied to any specific workload. After reviewing the data, we found that these alerts almost exclusively fired during scheduled backup jobs.

Problem:

The alert threshold was valid (high CPU usage), but the context was missing. There was no awareness of the predictable load during maintenance windows, like backups.

Impact:

These frequent, high-priority alerts flooded the on-call queue during critical times when backups were running. As a result, engineers were woken up for issues that were expected and harmless. This led to burnout and unnecessary noise.

Solution:

We implemented alert suppression during known backup windows. This stopped the false positives from clogging up the on-call queue, allowing engineers to focus on genuine issues. It reduced alert fatigue and helped the team manage real incidents more effectively.
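Most alerting tools have built-in silencing or mute windows for exactly this case; the sketch below only illustrates the logic, with an assumed nightly backup window and a hypothetical alert name.

```python
# Minimal sketch: suppress pages during a known backup window before
# forwarding an alert to the on-call queue.
from datetime import datetime, time

BACKUP_WINDOWS = [(time(1, 0), time(3, 0))]  # assumed nightly backup window, 01:00-03:00

def should_page(alert_name, now=None):
    now = now or datetime.now()
    if alert_name == "HighCPUUsage":  # hypothetical alert name
        for start, end in BACKUP_WINDOWS:
            if start <= now.time() <= end:
                return False  # expected load during backups: do not page
    return True

print(should_page("HighCPUUsage", datetime(2025, 6, 25, 2, 15)))  # False: inside window
print(should_page("HighCPUUsage", datetime(2025, 6, 25, 14, 0)))  # True: outside window
```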

Disk Space Alerting on Temporary Files

Scenario:

A team had an alert set for disk space usage. Whenever disk space exceeded 90%, the alert would fire. This was a recurring issue, primarily caused by temporary files created during data processing jobs that were later deleted.

Problem:

The alert was triggering unnecessarily because it didn’t differentiate between temporary files that were regularly cleaned up after jobs and permanent storage growth. This led to noise, with the team spending time investigating issues that weren’t critical.

Impact:

The team spent significant time investigating disk space usage, even though the problem was routine and not an immediate concern. The alerting system caused inefficiencies by focusing on transient issues rather than real problems.

Solution:

We adjusted the alert to focus only on permanent storage growth instead of total disk usage. We also excluded known temporary directories that were routinely purged after processing. This cut down on unnecessary alerts and allowed the team to focus on issues that actually required attention.
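One way to implement that distinction is to measure usage while skipping the known temporary directories, so the alert tracks only persistent growth. A rough sketch, with illustrative paths:

```python
# Minimal sketch: measure disk usage excluding known temporary directories,
# so the alert reflects permanent storage growth. Paths are illustrative.
import os

TEMP_DIRS = ["/data/tmp", "/data/processing/scratch"]  # assumed temp locations

def persistent_usage_bytes(root="/data"):
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip anything under a known temporary directory
        if any(dirpath.startswith(t) for t in TEMP_DIRS):
            dirnames[:] = []  # do not descend further
            continue
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file may have been removed mid-walk
    return total

if __name__ == "__main__":
    used = persistent_usage_bytes()
    print(f"persistent usage: {used / 1e9:.1f} GB")
```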

There were many other scenarios, but the aim here was to highlight common alerts which, if ignored, can build up and increase alerting noise.

Conclusion: Don’t Guess. Measure.

Start by logging every alert. Review the most frequent offenders. Track trends over time. Treat alerting as a dynamic system that improves, not a box you check once.

By implementing the right context around your alerts, you’ll help your team prioritize real incidents, reduce alert fatigue, and improve system resilience.

If you're not sure where to begin, that's where we come in. We help teams go from firefighting to foresight by building observability systems that actually serve the people on call. Let's make alerting better together.

Want to dive deeper? We can help you set up automated alert analysis, weekly reporting pipelines, and RCA templates for alert effectiveness. Reach out to us.
