A curious case of Postgres choosing the wrong index

It’s a regular weekday. You're catching up on Slack, sipping your morning coffee. Then pings start coming in. A core feature on the user dashboard is down for some users. Not flaky. Down. In peak traffic hours. Support is escalating. You’re on call that week.

You jump in.

What’s Breaking?

The feature is a core part of the SaaS platform you're working on, and it's deeply customizable: each user can toggle their own filters, preferences, and sorting. That means the backend query gets complex fast. Lots of JOINs. Multiple CTEs. Dynamic filters on top. But it's been stable. Until now.

You're seeing requests hang. Logs indicate timeouts. Retries don't help. After some trial and error, the team manages to reproduce the issue. What's very odd, though, is that the same query runs fine with LIMIT 100 but is 15x slower with LIMIT 20.

Wait, what?

You have seen enough DB issues in your life, but this one is one of a kind.

What Changed?

Your first instinct is the usual checklist:

  • Was there a deployment?

  • Any recent schema changes?

  • Postgres config tweak?

None of that.

You start comparing query plans. And suddenly, it makes sense. Postgres is picking a completely different plan when you use LIMIT 20 versus LIMIT 100. For the same query!
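
Reproducing the comparison is just a matter of running EXPLAIN on both variants and reading the plans side by side. A simplified sketch - the table, columns, and filters below are made up, and the real query is far hairier:

```sql
-- Hypothetical, stripped-down stand-in for the dashboard query.
-- Comparing the two plans is what exposes the planner flip.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM dashboard_items
WHERE account_id = 42
  AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;    -- slow: the planner reaches for the old index

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM dashboard_items
WHERE account_id = 42
  AND status = 'active'
ORDER BY created_at DESC
LIMIT 100;   -- fast: the planner takes the expected path
```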

Even stranger: the slower plan chooses a very old index. You track down its origin. It was created two years ago by another team to speed up an internal data pipeline. And over time, it's become... junk. The table it targets has seen massive churn - tons of reads and writes every day. The index is still around, but it's badly bloated and a poor fit for how the table is queried today.
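
Sizing up the suspect doesn't require anything fancy; Postgres's own statistics views show how often an index is used and how big it has grown. A quick sketch, with a made-up index name standing in for the real one:

```sql
-- How often has the suspect index been scanned, and how much space does it take?
-- 'idx_pipeline_legacy' is a hypothetical name for the old index.
SELECT
    relname                                      AS table_name,
    indexrelname                                 AS index_name,
    idx_scan                                     AS scans_since_stats_reset,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_pipeline_legacy';
```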

And because a recent autovacuum refreshed the table's statistics, Postgres reconsiders its plan and decides this old index is now a good idea. For LIMIT 20. But not for LIMIT 100.

Why? You're not sure, but you park this rabbit hole to explore some other day. Not today, since production is still down.

Attempts, Failures, and Finally, a Fix

At first, you try the polite route. “Can we hint Postgres to not use that index?”

Nope. Core Postgres has no query hints and no way to blacklist a single index.
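
The closest thing on offer is a set of coarse, session-level planner switches, and they turn off whole plan node types rather than one specific index - a blunt instrument for a surgical problem:

```sql
-- Session-level planner toggles exist, but they are all-or-nothing:
-- this discourages *every* plain index scan and bitmap scan, not just the bad index.
SET enable_indexscan = off;
SET enable_bitmapscan = off;

-- ...inspect the plan, then put things back:
RESET enable_indexscan;
RESET enable_bitmapscan;
```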

Next, you try live experiments on prod:

  • Creating fresh indexes?

  • Tweaking query structure?

  • Playing with statistics?

All dead ends.
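
The statistics experiment, for what it's worth, is cheap to write down. A minimal sketch with hypothetical names - it raises the sample size for a heavily filtered column and re-samples, which can shift the planner's row estimates:

```sql
-- Collect a larger sample for a heavily filtered column, then re-analyze.
-- (Hypothetical names; in this incident it did not change the chosen plan.)
ALTER TABLE dashboard_items ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE dashboard_items;
```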

Eventually, you take the risky route: drop the old index, then run a VACUUM ANALYZE again.
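
In SQL terms, the fix is two statements (again with the hypothetical index name); dropping CONCURRENTLY is one way to avoid taking a blocking lock on a table that's already struggling:

```sql
-- Drop the legacy index without blocking reads and writes on the table,
-- then refresh statistics so the planner starts from a clean slate.
DROP INDEX CONCURRENTLY idx_pipeline_legacy;
VACUUM ANALYZE dashboard_items;
```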

Postgres re-plans.

Now both LIMIT 20 and LIMIT 100 take the same (fast) path. Queries are snappy. No timeouts. User dashboards start loading again.

Aftermath

It took hours. Not because the fix was hard, but because the behavior was unintuitive. Why would a smaller LIMIT make a query 15x slower? Most likely because Postgres tried to be too smart. The planner charges a LIMIT only a fraction of the underlying scan's cost, assuming it can stop early - so a small LIMIT makes an ordered walk of an index look cheap. With a bloated index that didn't match the query, that assumption broke everything.
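
You can see that assumption in the cost numbers. A Limit node is charged roughly its child's cost scaled by the fraction of rows it expects to need. The fragment below is purely illustrative - invented numbers and names, not output from the real incident - but it shows the shape:

```sql
-- Illustrative plan shape (all numbers and names are made up):
--
--   Limit  (cost=0.56..310.38 rows=20 width=128)
--     ->  Index Scan using idx_pipeline_legacy on dashboard_items
--           (cost=0.56..812345.90 rows=52440 width=128)
--
-- The Limit node is charged roughly 20/52440 of the index scan's total cost,
-- on the assumption that matching rows are spread evenly through the index
-- and the scan can stop after ~20 of them. When matching rows sit late in a
-- bloated, poorly matched index, the real work lands much closer to the full
-- estimate.
EXPLAIN
SELECT *
FROM dashboard_items
WHERE account_id = 42
  AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;
```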

Lessons

  1. Postgres is smart until it’s not. The query planner relies on statistics, and a single stale, ill-fitting index can throw everything off. Keep your statistics and configuration tuned to how the database is actually used.

  2. Don’t keep legacy indexes around “just in case”. If you don’t need it, kill it. (A starter query for spotting dead weight is sketched after this list.)

  3. Always compare query plans, even with minor changes. A LIMIT tweak is all it took.

  4. You can't debug what you don't measure. Without query plans, you'd be flying blind.
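
On that second lesson: Postgres already tracks enough to build a shortlist of dead weight. A rough starting point - treat it as a list to investigate, not a drop list, since unique and primary-key indexes enforce constraints even when they are never scanned:

```sql
-- List indexes that have not been scanned since statistics were last reset,
-- biggest first. Investigate before dropping anything.
SELECT
    schemaname,
    relname                                      AS table_name,
    indexrelname                                 AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```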

In the end, the system is back. But you're left with a simple thought:

Some bugs hide in plain sight. Others hide in the query planner.

I write stories like this about software engineering.

There's no specific frequency, as I don't make these up.

If you liked this one, you might love: Data engineering mystery - rerouting large data in Kafka

Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!

Subscribe for more such content

Stay updated with the latest insights and best practices in software engineering and site reliability engineering by subscribing to our content.
