Introduction: Why SREs must care about Queues
Almost every system you run as an SRE is a queue. Requests come in, they wait for resources, and then they are served. Sometimes the wait is negligible. Other times, it balloons and users see timeouts.
Queueing theory is the math that explains this behaviour. The reason it matters is simple: the difference between 70% and 90% utilisation is not just “20 percentage points.” It is the difference between a system that feels stable and one that collapses under load.
At One2N, we often meet teams who say: “Our CPUs are at 90% but everything seems fine.” Then suddenly a spike hits, queues explode, latency multiplies, and the incident takes hours to unwind. Queueing math explains exactly why this happens, and why designing with headroom is not waste, but insurance.
The M/M/1 model in plain English
The M/M/1 queue is the simplest way to understand this.
M: arrivals follow a Poisson process (random arrivals at a fixed average rate λ).
M: service times are exponentially distributed (each request takes a random amount of time; the server completes work at an average rate μ).
1: there is one “server” (this can mean a CPU core, a database worker, or a single-threaded process).
The model gives us this formula:

W = 1 / (μ − λ)

Where:
W = average response time (time spent waiting in the queue plus time being served)
μ = service rate (max requests per second the system can handle)
λ = arrival rate (actual requests per second)
As λ approaches μ, response time W explodes.
You do not need the math to see the danger. Imagine a coffee shop with one barista who can make 60 coffees per hour. If 50 customers arrive per hour, the queue stays short. If 58 arrive per hour, the barista barely keeps up and the average time from walking in to getting your coffee stretches to half an hour. And if more than 60 arrive per hour, the line grows without bound and never clears.
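To see those coffee-shop numbers fall out of the formula, here is a minimal sketch in Python (no external libraries; the helper name mm1_response_time is just for illustration):

```python
def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Average time in system (waiting + service) for an M/M/1 queue.

    W = 1 / (mu - lambda); only valid while arrival_rate < service_rate.
    """
    if arrival_rate >= service_rate:
        raise ValueError("Queue is unstable: arrivals meet or exceed capacity")
    return 1.0 / (service_rate - arrival_rate)

# Coffee shop from the example: one barista, 60 coffees per hour.
service_rate = 60.0  # coffees per hour
for arrivals in (50, 58, 59):
    hours = mm1_response_time(arrivals, service_rate)
    print(f"{arrivals}/hr arrivals -> average time in system: {hours * 60:.0f} minutes")
# 50/hr -> 6 minutes, 58/hr -> 30 minutes, 59/hr -> 60 minutes
```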
Visualising Latency vs Utilisation
Figure: Queueing delay grows sharply as utilisation approaches 100% (average latency vs utilisation).
The graph makes the cliff effect obvious. Latency is flat until about 70%. Past 85%, it climbs fast. At 95%, the system is essentially unusable. This is why SREs insist on leaving headroom and push back when teams try to “maximise efficiency.”
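If you want to reproduce a curve like this yourself, a short sketch along these lines works (assuming numpy and matplotlib are available; the 10 ms service time is an arbitrary illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

service_time = 0.010  # 10 ms per request, purely illustrative
utilisation = np.linspace(0.0, 0.99, 200)

# M/M/1: average response time W = service_time / (1 - utilisation)
latency_ms = service_time / (1.0 - utilisation) * 1000

plt.plot(utilisation * 100, latency_ms)
plt.axvline(70, linestyle="--", label="70% – still flat")
plt.axvline(85, linestyle="--", label="85% – climbing fast")
plt.xlabel("Utilisation (%)")
plt.ylabel("Average latency (ms)")
plt.title("Queueing delay vs utilisation (M/M/1)")
plt.legend()
plt.show()
```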
Practical insights for reliability engineers
1. Headroom is not waste
Teams often think of unused capacity as “money left on the table.” But the math shows the opposite. Running consistently at 60 to 70% utilisation means your system can absorb traffic spikes, failover events, and retry storms. That unused margin is what buys you reliability.
2. Retries amplify the problem
When a request fails, clients retry. If the system is already at 90% utilisation, retries add more load than it can handle. The result is a feedback loop where every failure generates new requests, making queues grow faster than operators can react.
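A rough sketch of that amplification, with made-up numbers purely for illustration: if a fraction of requests fail and each failure is retried once, the offered load grows by that same fraction, which is enough to push a system at 90% utilisation past its capacity.

```python
def offered_load(base_rps: float, failure_rate: float, retries_per_failure: int = 1) -> float:
    """Effective arrival rate when every failed request is retried.

    Simplified model: ignores retries of retries.
    """
    return base_rps * (1 + failure_rate * retries_per_failure)

capacity_rps = 1000.0
base_rps = 900.0       # 90% utilisation before anything goes wrong
failure_rate = 0.2     # 20% of requests time out during a blip

effective = offered_load(base_rps, failure_rate)
print(f"Offered load: {effective:.0f} rps vs capacity {capacity_rps:.0f} rps")
# 1080 rps > 1000 rps: the queue now grows without bound until load is shed.
```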
3. Tail latency hides in queues
Even if the median request looks fine, queueing makes slow requests much worse. One unlucky user might wait 10× longer than the median. That “tail” is what users feel, and why SREs monitor p95 and p99 percentiles instead of averages.
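Under the M/M/1 model the response-time distribution is exponential, so the percentiles can be written down directly. The sketch below uses illustrative values (10 ms service time) to show how far p99 sits from the median:

```python
import math

def mm1_latency_percentile(p: float, service_time: float, utilisation: float) -> float:
    """p-th percentile of M/M/1 response time.

    Response time is exponential with mean service_time / (1 - utilisation),
    so the p-th percentile is -ln(1 - p) times that mean.
    """
    mean = service_time / (1.0 - utilisation)
    return -math.log(1.0 - p) * mean

service_time = 0.010  # 10 ms, illustrative
for util in (0.70, 0.90):
    p50 = mm1_latency_percentile(0.50, service_time, util)
    p99 = mm1_latency_percentile(0.99, service_time, util)
    print(f"util {util:.0%}: p50 = {p50 * 1000:.0f} ms, p99 = {p99 * 1000:.0f} ms")
# At 70%: p50 ~ 23 ms, p99 ~ 154 ms. At 90%: p50 ~ 69 ms, p99 ~ 461 ms.
```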
4. Error budgets shrink quickly
A system running near 95% utilisation does not just have higher latency, it also burns through error budgets. A 30-minute overload can consume a month’s allowance. Queueing theory makes the connection between high utilisation and SLO breaches explicit.
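The arithmetic behind that claim is straightforward. The sketch below assumes a 30-day month and a handful of common SLO targets; the 30-minute overload is the figure from the paragraph above:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def monthly_error_budget_minutes(slo: float) -> float:
    """Allowed bad minutes per month for an availability SLO such as 0.999."""
    return MINUTES_PER_MONTH * (1.0 - slo)

overload_minutes = 30.0
for slo in (0.99, 0.999, 0.9995):
    budget = monthly_error_budget_minutes(slo)
    print(f"SLO {slo:.2%}: budget {budget:.1f} min/month, "
          f"30-min overload burns {overload_minutes / budget:.0%}")
# 99%: 432 min (7%); 99.9%: 43.2 min (69%); 99.95%: 21.6 min (139%, budget blown)
```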
Decision Table: Making Queueing Operational
| Utilisation range | Latency behaviour | Risk level | SRE guidance |
|---|---|---|---|
| 0 to 70% | Latency flat, queues minimal | Low | Safe to run steadily |
| 70 to 85% | Latency rising, early warnings | Medium | Start planning capacity increase |
| 85 to 95% | Latency unstable, retries dangerous | High | Alert and scale before users notice |
| 95 to 100% | Latency unbounded, collapse imminent | Critical | Incident response required |
This is how you turn abstract queueing math into actionable rules for on-call engineers.
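One way to make the table operational is to encode it directly in alerting or autoscaling logic. The thresholds below come straight from the table; the function name is only an illustration:

```python
def utilisation_risk(utilisation: float) -> tuple[str, str]:
    """Map a utilisation fraction (0.0 to 1.0) to the decision table above."""
    if utilisation < 0.70:
        return "Low", "Safe to run steadily"
    if utilisation < 0.85:
        return "Medium", "Start planning capacity increase"
    if utilisation < 0.95:
        return "High", "Alert and scale before users notice"
    return "Critical", "Incident response required"

risk, guidance = utilisation_risk(0.88)
print(f"Risk: {risk} – {guidance}")  # Risk: High – Alert and scale before users notice
```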
Worked Examples
Example 1: API Gateway at 70% vs 90%
Suppose average latency sits near 50 ms at 70% utilisation. Push the same gateway to 90% and pure M/M/1 queueing already triples it to roughly 150 ms; add retries and tail amplification and latency climbs past timeout thresholds. For an API gateway, this means user apps start timing out. The same hardware looks fine one day and broken the next, purely because of load.
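A quick sanity check on that example: under M/M/1, average latency scales with 1 / (1 − utilisation), so the multiplier between two utilisation levels is easy to compute. The 50 ms baseline is the illustrative figure from the example:

```python
def latency_scaling(util_from: float, util_to: float) -> float:
    """How much M/M/1 average latency multiplies when utilisation rises."""
    return (1.0 - util_from) / (1.0 - util_to)

baseline_ms = 50.0  # illustrative average latency at 70% utilisation
factor = latency_scaling(0.70, 0.90)
print(f"70% -> 90% multiplies latency by {factor:.1f}x: "
      f"{baseline_ms:.0f} ms becomes ~{baseline_ms * factor:.0f} ms from queueing alone")
# 3.0x: 50 ms becomes ~150 ms before retries and tail effects make it worse.
```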
Example 2: Background Jobs
A queue of batch jobs may tolerate longer delays. Running at 85% utilisation is acceptable here because users are not directly waiting. But the same numbers in a checkout path would cause lost revenue. Queueing math helps teams decide where they can afford to run hotter.
Example 3: Retry Storm After Outage
A database outage lasts 5 minutes. When it recovers, clients resend thousands of failed requests. The system jumps from 60% utilisation to 95% instantly. Queues explode, and the system fails again. The root cause is not just the database, but how queues amplify retries.
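A rough backlog calculation shows why the recovered system tips over (all numbers below are illustrative assumptions): every request from the 5-minute outage is still waiting or being retried when the database returns, and only the spare capacity is available to drain it.

```python
capacity_rps = 1000.0
steady_rps = 600.0        # 60% utilisation in normal operation
outage_seconds = 5 * 60

# Requests that piled up (or will be retried) during the outage.
backlog = steady_rps * outage_seconds

# After recovery, only the spare capacity can work off the backlog,
# because new traffic keeps arriving at the steady rate.
spare_rps = capacity_rps - steady_rps
drain_seconds = backlog / spare_rps

print(f"Backlog: {backlog:.0f} requests, drain time: {drain_seconds / 60:.1f} minutes")
# 180,000 requests take about 7.5 minutes to drain even with no extra retries;
# aggressive client retries multiply the backlog and can prevent recovery entirely.
```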
How Queueing Theory connects to Error Budgets
This cluster links directly to the previous one on error budgets. Operating in the “red zone” above 90% means:
You risk consuming the entire budget in one incident.
Error budget burn rate spikes because queues delay every user.
Recovery becomes slow, so even short spikes have long tails.
Queueing math gives you the evidence to argue for conservative capacity targets. It is not opinion. It is math that matches real user pain.
Putting it all together
Queueing theory turns reliability from guesswork into numbers. For reliability engineers, the key lessons are:
Systems that look efficient at 90% are fragile.
Headroom around 60 to 70% is the safest operating point.
Retries and spikes make queues grow far faster than intuition.
Tail latency, not averages, defines user experience.
Error budgets disappear quickly once queues grow.
At One2N, we use these principles when designing both AI and cloud-native systems. Whether it is a payment API or an observability pipeline, the same math applies. By teaching engineers to read queues, we help them predict and prevent incidents rather than just reacting at 2 A.M.
Internal links:
Previous cluster: Error Budget Calculation: Downtime Minutes for Every SLO
Next cluster: Reading SRE Graphs Without Fooling Yourself