How to read SRE graphs without lying to yourself

Nov 14, 2025 | 3 min read

Introduction: why graph literacy is a core SRE skill

Dashboards are the nervous system of modern reliability engineering. They are the first thing you open when the pager goes off and the first thing management wants to see when the incident is over. But dashboards are not reality. They are summaries of reality, stitched together from sampled data, aggregated into windows, and plotted on scales that may or may not match what your users feel.

Most engineers assume that if the dashboard is polished, it is telling the truth. But truth in dashboards is slippery. Averages can look calm while users rage. A five-minute window can erase ten-second meltdowns. A log scale can hide orders of magnitude.

Incidents often run long not because the fix is hard, but because the team misreads the graphs. At One2N we teach engineers that graph literacy is as important as knowing how to restart a pod or roll back a release. If you cannot read graphs honestly, you cannot run systems reliably.

This article is a deep guide to reading SRE graphs without fooling yourself. We will start with averages and percentiles, move through sampling windows and axis scales, dive into heatmaps, and end with a practical checklist and a worked incident walkthrough.

Averages vs percentiles: why the mean can be misleading

Most dashboards default to averages. They are convenient to compute and produce smooth lines. But averages erase the very outliers that users complain about.

Imagine a checkout service. In one minute it serves 1,000 requests around 100 ms and 50 requests that take 3 seconds. The average is roughly 240 ms. The line is flat, calm, and looks acceptable. But 50 customers just waited three seconds to pay. Some retried, some gave up. Those are the customers whose voices fill Slack and Jira tickets.

Percentiles tell the real story. A p50 of 100 ms shows the bulk of requests. A p95 of ~160 ms tells you 95 percent of requests are still fine. A p99 near 3,000 ms reveals the painful tail. In large systems, one percent can mean thousands of people per minute.

The yellow dotted line shows the average (~240 ms). But the purple p99 line at ~3,000 ms shows the real pain. This is why averages cannot be trusted in production systems. Percentiles expose the tails where reliability is won or lost.
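A quick way to keep yourself honest is to compute the mean and the percentiles from the same raw sample. Here is a minimal sketch using the hypothetical checkout numbers above; it assumes numpy and synthetic latencies, not data from any real system.

```python
import numpy as np

# Hypothetical one-minute sample: 1,000 requests around 100 ms and
# 50 requests stuck near 3,000 ms (all values in milliseconds).
rng = np.random.default_rng(42)
fast = rng.normal(loc=100, scale=20, size=1000)
slow = rng.normal(loc=3000, scale=200, size=50)
latencies = np.concatenate([fast, slow])

print(f"mean: {latencies.mean():6.0f} ms")              # ~240 ms, looks calm
print(f"p50:  {np.percentile(latencies, 50):6.0f} ms")  # ~100 ms, the bulk
print(f"p95:  {np.percentile(latencies, 95):6.0f} ms")  # upper edge of the fast cluster
print(f"p99:  {np.percentile(latencies, 99):6.0f} ms")  # ~3,000 ms, the painful tail
```

The mean and the p99 disagree by more than an order of magnitude, which is exactly the gap the average hides.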

Sampling windows: how resolution changes the story

Metrics are aggregated into time windows. The size of that window can hide or reveal incidents.

Suppose an API spikes for 10 seconds every 5 minutes:

  • At 1-second resolution, the spikes stand out.

  • At 1-minute averages, the spikes are softened but visible.

  • At 5-minute averages, the spikes almost vanish.

  • At 1-hour averages, the service looks perfect.

Figure: Effect of sampling window on spikes
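You can reproduce the effect with a few lines. This is a sketch over a synthetic per-second latency series with a 10-second spike every 5 minutes; the numbers are illustrative, not from a real service.

```python
import numpy as np

# One hour of per-second latency: ~100 ms baseline, with a 10-second
# spike to 2,000 ms every 5 minutes (synthetic data).
latency = np.full(3600, 100.0)
for start in range(0, 3600, 300):       # every 5 minutes
    latency[start:start + 10] = 2000.0  # 10-second spike

for window in (1, 60, 300, 3600):       # 1 s, 1 min, 5 min, 1 h
    averaged = latency.reshape(-1, window).mean(axis=1)
    print(f"{window:>4}-second windows -> peak of averaged series: {averaged.max():.0f} ms")

# At 1 s the peak is the true 2,000 ms; at 1 min it is ~420 ms; at 5 min
# and above every bucket averages out to the same ~160 ms and the spikes
# disappear from the chart entirely.
```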

Axis scales: linear vs log

Dashboards sometimes use log scales. They can be useful for data spanning several orders of magnitude, but in incidents they often hide meaningful jumps.

Suppose your error rate climbs from 0.1% to 1%. That is a 10x increase. On a linear scale, the change is alarming. On a log scale, it looks like a tiny step.

Figure: Linear scale vs log scale

On the left, the linear scale shows a steady climb. On the right, the log scale flattens the same climb. Both are correct mathematically, but only one matches what users feel.

In practice:

  • Use linear during incidents so you see what customers see.

  • Use log only when comparing long-tailed distributions or capacity planning.
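To see the difference for yourself, a small sketch like the one below plots the same 10x climb on both scales. It assumes matplotlib and a synthetic error-rate series; swap in your own data source.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic error rate: steady at 0.1%, then a ramp up to 1% (a 10x jump).
t = np.arange(120)                                        # minutes
error_rate = np.where(t < 60, 0.1, 0.1 + 0.9 * (t - 60) / 59)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
for ax, scale in ((ax_lin, "linear"), (ax_log, "log")):
    ax.plot(t, error_rate)
    ax.set_yscale(scale)                                  # same data, different story
    ax.set_title(f"{scale} y-axis")
    ax.set_xlabel("minutes")
    ax.set_ylabel("error rate (%)")
fig.suptitle("Same 10x error-rate increase on two axis scales")
plt.tight_layout()
plt.show()
```

The linear panel shows a climb you would act on immediately; the log panel shows a polite step.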

Heatmaps and density

Heatmaps show distributions over time. They are powerful, but density can trick you. A faint streak may represent thousands of slow or failing requests.

Let’s simulate a service where most requests complete at ~100 ms, but bursts at 5,000 ms happen every 100 seconds.

Figure: Latency heatmap with hidden bursts

Most requests cluster in the dark band at 100 ms. But the faint vertical streaks every 100 seconds show bursts at 5,000 ms. Without care, an engineer could ignore them. With percentiles or absolute counts, you see that these are real users suffering.

Overlaying percentiles on heatmaps is the safest practice.
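A rough sketch of that practice, assuming numpy and matplotlib and the synthetic workload described above (a 100 ms baseline with 5,000 ms bursts every 100 seconds), is shown below. The p99 overlay makes the faint streaks impossible to dismiss.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Simulate 10 minutes of traffic: ~100 requests/second around 100 ms,
# plus a small burst of ~5,000 ms requests every 100 seconds (synthetic).
times, lats = [], []
for second in range(600):
    times.append(np.full(100, second))
    lats.append(rng.normal(100, 15, 100))
    if second % 100 == 0:
        times.append(np.full(20, second))
        lats.append(rng.normal(5000, 300, 20))
times, lats = np.concatenate(times), np.concatenate(lats)

# 2-D histogram: time buckets on x, latency buckets on y.
t_bins = np.arange(0, 601, 10)
l_bins = np.linspace(0, 6000, 61)
hist, _, _ = np.histogram2d(times, lats, bins=[t_bins, l_bins])

im = plt.imshow(hist.T, origin="lower", aspect="auto",
                extent=[0, 600, 0, 6000], cmap="viridis")

# Overlay p99 per time bucket so the faint burst streaks cannot be missed.
p99 = [np.percentile(lats[(times >= lo) & (times < hi)], 99)
       for lo, hi in zip(t_bins[:-1], t_bins[1:])]
plt.plot(t_bins[:-1] + 5, p99, color="white", linewidth=1.5, label="p99")

plt.xlabel("time (s)")
plt.ylabel("latency (ms)")
plt.colorbar(im, label="request count")
plt.legend()
plt.show()
```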

Checklist: habits for honest graph reading

Rather than a short list, here are the habits in detail.

1. Metric type

Always check whether you are looking at averages or percentiles. Averages erase pain; percentiles reveal tails. For example, in one client system we saw “stable” averages at 200 ms while p99s spiked to 3,500 ms. The real user pain was invisible until we switched to percentiles.

2. Window size

Short windows expose spikes. Long windows smooth them away. During outages, zoom to 10-to-30-second windows. For planning, 1-hour or 1-day windows are fine. Misreading windows is one of the most common failure modes during triage.

3. Axis scale

Know whether you are looking at a linear or a log axis. Linear matches user experience; log is for capacity planning and comparing long-tailed distributions. We once saw a 10x error spike hidden by a log scale, costing two hours of triage.

4. Units

Confirm units: ms vs s, requests per second vs per minute. During an auth outage, engineers confused 10 ms with 10 s, thinking DB queries were crawling when they were actually fine. That cost the team an hour.

5. Density in heatmaps

Do not dismiss faint streaks. They can represent many requests. Cross-check with absolute numbers.

6. Correlation traps

Two graphs placed side by side that move together are not proof of a relationship. A CPU rise and a latency rise may be coincidence. Validate with deeper metrics before drawing conclusions.
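A small sketch of why this trap works: two independent random walks, standing in for hypothetical CPU and latency series, will often correlate strongly just because both drift over the same window.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent random walks with a mild upward drift, standing in for
# hypothetical "CPU %" and "p99 latency" series over an hour of samples.
cpu = np.cumsum(rng.normal(0.1, 1.0, 3600))
latency = np.cumsum(rng.normal(0.1, 1.0, 3600))

r = np.corrcoef(cpu, latency)[0, 1]
print(f"Pearson correlation between two unrelated series: {r:.2f}")

# Because both series merely trend over the same window, the correlation
# usually comes out large. A shared trend is not a causal link; treat
# side-by-side graphs as a hint to dig deeper, not as proof.
```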

7. Context

Graphs must be checked against logs and user reports. If users complain but graphs look calm, the graphs are wrong, misconfigured, or hiding pain in averages or long windows.

Worked incident walkthrough

At 2 a.m. the on-call SRE is paged. The dashboard insists checkout latency is “stable at 200 ms,” yet users are complaining.

  • Step 1: Check percentiles. Average is calm, but p99 is spiking to 3,500 ms. The tail is burning.

  • Step 2: Check window. Default dashboard uses 5-minute averages. Switching to 30-second view shows repeated spikes.

  • Step 3: Check axis. Error rate chart uses log scale. On log, the increase looks flat. On linear, error rate is up 10x.

  • Step 4: Check heatmap. Faint streaks at 5,000 ms appear every 10 minutes. Investigating shows DB lockups.

  • Step 5: Correlate with logs. Lock contention during promotional campaign confirmed.

Time  | Graph misread          | Correction          | Insight
02:00 | Average latency flat   | Check p99           | Tail on fire
02:05 | 5m window hides spikes | Switch to 30s       | Spikes visible
02:10 | Log axis hides rise    | Switch to linear    | Error up 10x
02:15 | Heatmap streak ignored | Investigate streaks | DB lock found

This timeline shows how every trap played a role. The fix was straightforward once the graphs were read correctly.

Putting it all together

Reading graphs honestly is not a luxury. It is survival for on-call engineers.

  • Percentiles show user pain that averages hide.

  • Window size determines whether spikes are visible or lost.

  • Linear axes reveal impact; log axes flatten it.

  • Heatmaps must be read with attention to faint streaks.

  • Context from logs and reports validates what the dashboard shows.

At One2N, we treat graph literacy as part of the SRE math toolkit, alongside error budgets and queueing models. Teams must practise this, not assume dashboards are self-explanatory. The next time your pager goes off, remember: the graph is not the system. It is only a story. And it is your job to read that story honestly.
