Jun 25, 2025 | 3 min read

DARE to question your alerts?

Effective alerting isn’t about how many alerts you configure. It’s about how many actually help you detect real issues. Too many alerts, without measurement or refinement, create noise. This noise overwhelms teams, hides actual problems, and leads to burnout.

The Trap: Set-and-Forget Alerting

Many teams inherit or configure alerts based on gut instinct or past incidents and move on. But alerting isn't a one-time task. It needs constant feedback and tuning.

To improve it, treat alerting as a system, not a checklist.

Build a feedback loop using DARE:

Good alerting isn’t static. It evolves. To improve it, your system needs to collect data about the alerts themselves: when they fire, how often, and whether they are actually useful.

Start by collecting alert metadata, such as what fired, when, and how often, and route it to your observability stack. This could be your incident management tool, a log analytics system, or a custom dashboard.
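As a concrete starting point, here is a minimal sketch of that collection step: a small webhook receiver that appends alert metadata to a JSON-lines file. It assumes the Prometheus Alertmanager webhook payload shape; the endpoint, port, and file path are illustrative, and in practice you would route the same fields into whatever log analytics system or dashboard you already use.

```python
# Minimal sketch: record alert metadata as JSON lines.
# Assumes the Prometheus Alertmanager webhook payload format; adapt the
# field names to whatever your alerting tool actually sends.
import json
import time
from flask import Flask, request

app = Flask(__name__)
LOG_PATH = "alert_events.jsonl"  # hypothetical destination; could be any log sink

@app.route("/alert-webhook", methods=["POST"])
def record_alerts():
    payload = request.get_json(force=True)
    with open(LOG_PATH, "a") as f:
        for alert in payload.get("alerts", []):
            f.write(json.dumps({
                "received_at": time.time(),
                "name": alert.get("labels", {}).get("alertname", "unknown"),
                "status": alert.get("status"),        # "firing" or "resolved"
                "starts_at": alert.get("startsAt"),
                "labels": alert.get("labels", {}),
            }) + "\n")
    return {"ok": True}

if __name__ == "__main__":
    app.run(port=9095)
```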

Once you have this data, start asking:

  • Which alerts fire most often?

  • Are they acknowledged or ignored?

  • Do they lead to action, or just add noise?

  • Do they correlate with real incidents?

This isn’t extra work. It is an essential aspect of alert hygiene. It’s how you turn alerting into a system that gets better over time.
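Answering those questions can be as simple as counting. A rough sketch, assuming the JSON-lines file produced by the collection sketch above (the field names are that sketch's assumptions, not a standard):

```python
# Minimal sketch: summarize logged alert metadata to spot the noisiest alerts.
import json
from collections import Counter

def summarize(log_path="alert_events.jsonl", top_n=10):
    fires = Counter()
    resolved = Counter()
    for line in open(log_path):
        event = json.loads(line)
        if event.get("status") == "firing":
            fires[event["name"]] += 1
        elif event.get("status") == "resolved":
            resolved[event["name"]] += 1

    # Alerts that fire constantly but rarely lead to action are the first
    # candidates for tuning, suppression, or deletion.
    print(f"{'alert':40s} {'fired':>7s} {'resolved':>9s}")
    for name, count in fires.most_common(top_n):
        print(f"{name:40s} {count:7d} {resolved[name]:9d}")

if __name__ == "__main__":
    summarize()
```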

Now, measure alerts as part of your reliability metrics.

Most teams already track RED metrics: Request rate, Error rate, and Duration. But one key signal is missing: Alerts.

Add A for Alerts per unit time to complete the picture. We call this the DARE framework:

DARE = Duration, Alerts, Request rate, and Error rate

Frequent alerts are often signs of instability, and measuring them helps you get a clearer view of your service’s health.
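To make the idea concrete, here is a minimal sketch of a DARE summary for one service over a time window. The inputs are plain counts and latencies you would pull from your own metrics backend; the function and field names are illustrative, not a prescribed API.

```python
# Minimal sketch: report DARE (Duration, Alerts, Request rate, Error rate)
# for a service over a time window.
from dataclasses import dataclass

@dataclass
class DareReport:
    duration_p95_ms: float   # D: request duration (95th percentile)
    alerts_per_hour: float   # A: alerts fired per unit time
    requests_per_sec: float  # R: request rate
    error_rate: float        # E: fraction of requests that failed

def dare_report(latencies_ms, alert_count, request_count, error_count, window_sec):
    latencies = sorted(latencies_ms)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0
    return DareReport(
        duration_p95_ms=p95,
        alerts_per_hour=alert_count / (window_sec / 3600),
        requests_per_sec=request_count / window_sec,
        error_rate=error_count / request_count if request_count else 0.0,
    )

# Example: one hour of data
print(dare_report([120, 95, 300, 210], alert_count=12, request_count=36000,
                  error_count=180, window_sec=3600))
```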

Real-World Examples

CPU Alert During Backups

Scenario:

A team had set an alert for high CPU usage across their infrastructure. The alert fired 200+ times a day, but it wasn’t tied to any specific workload. After reviewing the data, we found that these alerts almost exclusively fired during scheduled backup jobs.

Problem:

The alert threshold was valid (high CPU usage), but the context was missing. There was no awareness of the predictable load during maintenance windows, like backups.

Impact:

These frequent, high-priority alerts flooded the on-call queue during critical times when backups were running. As a result, engineers were woken up for issues that were expected and harmless. This led to burnout and unnecessary noise.

Solution:

We implemented alert suppression during known backup windows. This stopped the false positives from clogging up the on-call queue, allowing engineers to focus on genuine issues. It reduced alert fatigue and helped the team manage real incidents more effectively.
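Most alerting tools have built-in silencing or mute windows for exactly this case; the sketch below only illustrates the logic, with an assumed nightly backup window and a hypothetical alert name.

```python
# Minimal sketch: suppress pages during a known backup window before
# forwarding an alert to the on-call queue.
from datetime import datetime, time

BACKUP_WINDOWS = [(time(1, 0), time(3, 0))]  # assumed nightly backup window, 01:00-03:00

def should_page(alert_name, now=None):
    now = now or datetime.now()
    if alert_name == "HighCPUUsage":  # hypothetical alert name
        for start, end in BACKUP_WINDOWS:
            if start <= now.time() <= end:
                return False  # expected load during backups: do not page
    return True

print(should_page("HighCPUUsage", datetime(2025, 6, 25, 2, 15)))  # False: inside window
print(should_page("HighCPUUsage", datetime(2025, 6, 25, 14, 0)))  # True: outside window
```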

Disk Space Alerting on Temporary Files

Scenario:

A team had an alert set for disk space usage. Whenever disk space exceeded 90%, the alert would fire. This was a recurring issue, primarily caused by temporary files created during data processing jobs that were later deleted.

Problem:

The alert was triggering unnecessarily because it didn’t differentiate between temporary files that were regularly cleaned up after jobs and permanent storage growth. This led to noise, with the team spending time investigating issues that weren’t critical.

Impact:

The team spent significant time investigating disk space usage, even though the problem was routine and not an immediate concern. The alerting system caused inefficiencies by focusing on transient issues rather than real problems.

Solution:

We adjusted the alert to focus only on permanent storage growth instead of total disk usage. We also excluded known temporary directories that were routinely purged after processing. This cut down on unnecessary alerts and allowed the team to focus on issues that actually required attention.
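One way to implement that distinction is to measure usage while skipping the known temporary directories, so the alert tracks only persistent growth. A rough sketch, with illustrative paths:

```python
# Minimal sketch: measure disk usage excluding known temporary directories,
# so the alert reflects permanent storage growth. Paths are illustrative.
import os

TEMP_DIRS = ["/data/tmp", "/data/processing/scratch"]  # assumed temp locations

def persistent_usage_bytes(root="/data"):
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip anything under a known temporary directory
        if any(dirpath.startswith(t) for t in TEMP_DIRS):
            dirnames[:] = []  # do not descend further
            continue
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file may have been removed mid-walk
    return total

if __name__ == "__main__":
    used = persistent_usage_bytes()
    print(f"persistent usage: {used / 1e9:.1f} GB")
```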

There were many other scenarios, but the aim here was to highlight common alerts which, if ignored, can build up and increase alerting noise.

Conclusion: Don’t Guess. Measure.

Start by logging every alert. Review the most frequent offenders. Track trends over time. Treat alerting as a dynamic system that improves, not a box you check once.

By implementing the right context around your alerts, you’ll help your team prioritize real incidents, reduce alert fatigue, and improve system resilience.

If you're not sure where to begin, that's where we come in. We help teams go from firefighting to foresight by building observability systems that actually serve the people on call. Let's make alerting better together.

Want to dive deeper? We can help you set up automated alert analysis, weekly reporting pipelines, and RCA templates for alert effectiveness. Reach out to us.
