Introduction: why percentiles matter in Site Reliability Engineering
In Site Reliability Engineering, metrics are the foundation of decision making. But not all metrics are equal. Many monitoring tools default to averages, and averages are dangerously misleading. A flat average graph may suggest stability, but users may still be waiting seconds for responses. If your team only watches averages, you are blind to the part of the system that causes the most pain.
At One2N, we often find this gap in client systems during reliability reviews. Engineers confidently point to “stable” averages, while percentile charts tell a very different story. This article explains why percentiles such as p50, p95, and p99 are essential to SRE practice. We will look at how to read them, what they reveal about user experience, and how to decide which percentiles to monitor in production.
The problem with averages
Consider this example. Your system handles 1,000 requests at 100 ms each. At the same time, 50 requests take between 2 and 5 seconds. Averaged together, the math produces a mean of roughly 260 ms: (1,000 × 100 ms + 50 × 3,500 ms, taking the midpoint) / 1,050 requests. The dashboard shows a stable, flat number.
But those 50 requests still matter. Each one represents a frustrated user who waited seconds instead of milliseconds. They are the ones who open support tickets, retry transactions, and abandon checkouts. The average hides their pain.
This is why SREs treat averages with suspicion. They compress a distribution into a single value and erase the very extremes that define user experience.
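A minimal sketch of that arithmetic, assuming Python with NumPy and the synthetic request mix from the example above:

```python
import numpy as np

# Synthetic latencies matching the example above:
# 1,000 fast requests around 100 ms, 50 slow requests between 2 and 5 seconds.
rng = np.random.default_rng(seed=42)
fast = rng.normal(loc=100, scale=10, size=1000)    # milliseconds
slow = rng.uniform(low=2000, high=5000, size=50)   # milliseconds
latencies = np.concatenate([fast, slow])

print(f"mean: {latencies.mean():.0f} ms")              # ~260 ms: looks fine
print(f"p50:  {np.percentile(latencies, 50):.0f} ms")  # ~100 ms: the typical case
print(f"p95:  {np.percentile(latencies, 95):.0f} ms")  # ~130 ms: still looks fast
print(f"p99:  {np.percentile(latencies, 99):.0f} ms")  # several seconds: the hidden pain
```

The mean lands near 260 ms while p99 lands in the seconds. That gap is exactly what the flat dashboard line erases.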
Percentiles: speaking the language of tails
Percentiles describe distributions in a way that matches reality.
p50 (median): Half of requests are faster, half slower. A good measure of “typical” behaviour.
p95: 5 percent of requests are slower than this. A strong early warning for degraded performance.
p99: 99 percent of requests are faster than this; the slowest 1 percent sit above it. In systems serving thousands of requests per second, that 1 percent is still thousands of slow experiences every minute.
This framing is powerful. Instead of saying “average latency is 200 ms,” you can say “95 percent of users get responses under 500 ms, but 1 percent wait up to 5 seconds.” That difference changes how incidents are diagnosed and how service levels are negotiated with product teams.
Code and graph: average vs percentiles
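Below is a minimal plotting sketch that produces the chart described under the figure, assuming Python with NumPy and Matplotlib, reusing the synthetic request mix from earlier. The dark theme and line styles are chosen to match the description (white dotted mean, dashed percentiles) and are otherwise assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same synthetic mix as before: a fast bulk plus a slow tail.
rng = np.random.default_rng(seed=42)
latencies = np.concatenate([
    rng.normal(loc=100, scale=10, size=1000),    # fast bulk, ~100 ms
    rng.uniform(low=2000, high=5000, size=50),   # slow tail, 2-5 s
])

mean = latencies.mean()
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])

plt.style.use("dark_background")  # the white mean line needs a dark canvas
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(latencies, bins=100, color="steelblue")

# Mark the average (white dotted) and the percentiles (dashed).
ax.axvline(mean, color="white", linestyle=":", label=f"mean = {mean:.0f} ms")
for value, name, color in [(p50, "p50", "cyan"), (p95, "p95", "orange"), (p99, "p99", "red")]:
    ax.axvline(value, color=color, linestyle="--", label=f"{name} = {value:.0f} ms")

ax.set_xlabel("latency (ms)")
ax.set_ylabel("request count")
ax.set_title("Latency distribution: average vs percentiles")
ax.legend()
plt.show()
```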

Latency distribution: average vs percentiles
The histogram makes the point clear. The white dotted line shows the average, but it sits far away from the real experience. The dashed lines (p50, p95, p99) reveal the true shape: a fast bulk, a long slow tail, and a small group of users suffering seconds of delay.
How to choose the right percentiles to monitor
Not every system needs every percentile. Picking which ones to track is a decision that depends on your workload and your users. Here is a decision table to guide that choice.
| System type | Typical user expectation | Suggested percentiles | Why it matters |
|---|---|---|---|
| Web checkout or payments | Fast and predictable | p50, p95, p99 | Even 1% of slow checkouts means lost revenue. |
| Internal admin tools | Tolerant of occasional slowness | p50, p95 | A smooth experience for most users is good enough. |
| APIs serving mobile apps | Sensitive to tail latency | p50, p90, p99 | Network and app latency compound; the tail hurts adoption. |
| Batch jobs | Throughput focused | p95 | Bulk completion time is key; tails are less critical. |
This decision table shows why there is no universal rule. The percentiles you choose should match the stakes of your system.
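One way to act on the table is to encode the choice next to your monitoring configuration. A minimal sketch in plain Python; the service names and threshold values here are hypothetical:

```python
# Hypothetical latency objectives, in ms, derived from the decision table above.
# Which percentiles appear per service is the real decision; thresholds are examples.
LATENCY_OBJECTIVES_MS = {
    "checkout-api":  {"p50": 200, "p95": 500, "p99": 1000},  # revenue-critical: watch the tail
    "admin-portal":  {"p50": 300, "p95": 1500},              # tolerant: p99 noise not worth paging on
    "mobile-api":    {"p50": 150, "p90": 400, "p99": 800},   # tail compounds with network latency
    "nightly-batch": {"p95": 60_000},                        # bulk completion time is what matters
}

def violated_objectives(service: str, observed_ms: dict) -> list[str]:
    """Return the percentiles whose observed latency exceeds the objective."""
    targets = LATENCY_OBJECTIVES_MS.get(service, {})
    return [p for p, limit in targets.items() if observed_ms.get(p, 0) > limit]

# Example: p99 breached on checkout while p50 and p95 look healthy.
print(violated_objectives("checkout-api", {"p50": 180, "p95": 450, "p99": 2400}))
# -> ['p99']
```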
Tying percentiles back to SRE practice
Percentiles are not just about better graphs. They tie directly into SRE practices:
Error budgets: A spike in p99 latency often coincides with budget burn. Tracking tails helps teams spot and prevent breaches (see the sketch after this list).
Capacity planning: Rising p95 or p99 latency as throughput grows indicates queues forming. This links directly to the next cluster, Latency vs Throughput.
On-call clarity: During incidents, percentile graphs shorten diagnosis time. You see at a glance whether pain is concentrated in the tail or spread across all users.
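To make the error-budget point concrete, here is a minimal example assuming a latency SLO of the form “99% of requests complete in under 500 ms”; the target and threshold are illustrative:

```python
# Illustrative SLO: 99% of requests complete in under 500 ms per window.
SLO_TARGET = 0.99
THRESHOLD_MS = 500.0

def budget_burn_rate(latencies_ms: list[float]) -> float:
    """How fast this batch of requests consumes the error budget (1.0 = sustainable)."""
    slow = sum(1 for latency in latencies_ms if latency > THRESHOLD_MS)
    bad_fraction = slow / len(latencies_ms)
    allowed_fraction = 1.0 - SLO_TARGET  # the error budget: 1% of requests may be slow
    return bad_fraction / allowed_fraction

# A p99 spike shows up directly as budget burn: 3% slow requests
# consume the budget at three times the sustainable rate.
sample = [100.0] * 970 + [2500.0] * 30
print(f"burn rate: {budget_burn_rate(sample):.1f}x")  # -> 3.0x
```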
At One2N, we encourage teams to start every reliability conversation with percentiles. They are the most direct way to connect system metrics with user experience.
Putting it all together
When you are on-call, do not stop at averages. Look at the full distribution. Percentiles expose the reality that averages flatten away.
In practice, this means:
Use p50 to describe the “typical” case.
Use p95 to check if a minority of users are degraded.
Use p99 to protect against tail latency that destroys trust.
Always connect percentiles back to business outcomes. A 1% tail in payments is far more costly than in an internal tool.
This perspective makes you more effective as an engineer. You stop debating “is the system fine?” and start answering “which users are suffering, and why?” That is the essence of Site Reliability Engineering.