Database Reliability - Zero Downtime Schema Migrations with MySQL

(Database reliability story 1)

You join a team as the lead SRE, and the CTO asks you to manage and scale a self-managed, 6-node MySQL cluster in production.

Here's the story of how you turn this around - from multiple database failures per month to three nines of uptime.

Context

To get the context, you ask some questions:

  1. Why self-manage the database instead of using a cloud SaaS offering?

  2. Who set this up, and who's managing it currently?

  3. Is there any observability set up for the DB?

  4. What's the traffic pattern: peak RPS, data size, VM specs, etc.?

You have many questions, but you start with these.

The CTO candidly replies.

  1. Why self-manage: We started with a small self-hosted MySQL. The business grew so much that we kept scaling vertically.

  2. Who set it up: The lead who set this up is no longer with the company.

  3. Observability: We have a basic Grafana dashboard with infra metrics for the cluster, nothing more.

  4. Traffic pattern and scale: Peak of 20k read and 5k write requests per second; data size 1.5 TB, growing 6 GB/day, across 150+ tables; a 6-node cluster (128 GB RAM, 32 vCPUs).

Your plan

With this, you suggest these improvements:

  • Achieve zero downtime schema migrations

  • Separate OLTP and OLAP workloads

  • Allow read-write traffic routing

  • Control replication lag

  • Improve the security posture

  • Improve observability

  • Target 99.9% uptime

Finally, migrate to a SaaS offering, e.g., AWS RDS.

[Each of these improvements deserves a separate post (and a conference talk). But for now, I'll write about one specific outcome.]

"Zero Downtime Schema Migration"

Before getting to the solution, let me explain why schema migrations on large tables are a challenge in MySQL.

Challenges in MySQL schema migration

A schema migration with a plain ALTER TABLE causes several problems in MySQL:

  • For large tables (millions of rows), it causes replication lag

  • Replicas can't process other statements while applying the migration

  • Applications read stale data from the lagging replicas

  • The operation can't be paused once it starts on a replica
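
For illustration, here's the kind of statement that causes the trouble above, along with a quick way to watch replication lag while it runs. This is a rough sketch: the host, database, and table names are hypothetical, and the exact status field depends on your MySQL version.

```bash
# Hypothetical hosts/schema. A plain ALTER on a large table is replayed
# serially on each replica, blocking other writes there and building up lag
# for the entire duration of the statement.
mysql -h primary.db.internal my_app -e \
  "ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);"

# Watch replication lag on a replica while the ALTER is being applied.
# (On MySQL 8.0.22+ use SHOW REPLICA STATUS and Seconds_Behind_Source.)
watch -n 5 'mysql -h replica-1.db.internal -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master'
```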

You find out that at least 7 product features are blocked because schema migrations can't be performed on large tables in production.

The engineering team has created PRs for these features, but they can't be merged until zero downtime schema migrations are possible.

You know you're not the first to face this problem (MySQL has been around long enough that there must be a way).

Your approach

You do the research and find two approaches, each with a mature tool:

  1. Trigger-based: Percona's pt-online-schema-change (pt-osc)

  2. Binlog-based: GitHub's gh-ost

You research more and prepare a document comparing the two approaches.

Comparison of pt-osc and gh-ost for MySQL schema migration

After much consideration, you go ahead with GitHub's gh-ost, as it gives you an asynchronous data copy mechanism, a controllable switchover (cut-over), and throttling support.

But what do these terms mean?

Comparison of MySQL schema migration tools

See, zero downtime MySQL schema migration works this way:

  1. Create a table with the same structure as the source table (the shadow table)

  2. Apply the schema change to the shadow table while it's still empty

  3. Copy data from the source table to the shadow table

  4. Rename the tables atomically

  5. Optionally delete the old table

Both gh-ost and pt-osc work this way. But they differ in step 3: how data is copied from the source table to the shadow table, and how writes that arrive during the copy are kept in sync.

pt-osc uses triggers, while gh-ost uses MySQL binlogs: pt-osc installs triggers on the source table that mirror every write into the shadow table, whereas gh-ost tails the binary log and applies those changes to the shadow table asynchronously.

Both mechanisms have their own pros and cons. For this use case, you go ahead with gh-ost.
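
To make these steps concrete, here's a minimal manual sketch of the shadow-table flow in plain SQL, run via the mysql client (the database, table, and index names are hypothetical). It only illustrates steps 1-5; it does not handle writes that hit the source table while the copy is running, which is exactly the part that pt-osc (triggers) and gh-ost (binlogs) automate.

```bash
mysql my_app <<'SQL'
-- 1. Create an empty shadow table with the same structure as the source
CREATE TABLE orders_shadow LIKE orders;

-- 2. Apply the schema change to the shadow table while it's still empty (cheap)
ALTER TABLE orders_shadow ADD INDEX idx_customer_id (customer_id);

-- 3. Copy data from the source to the shadow table in small chunks
--    (one chunk shown; real tools iterate over the primary-key range and
--     also replay ongoing changes via triggers or binlog events)
INSERT INTO orders_shadow SELECT * FROM orders WHERE id BETWEEN 1 AND 10000;

-- 4. Swap the tables atomically; clients never see a missing table
RENAME TABLE orders TO orders_old, orders_shadow TO orders;

-- 5. Optionally drop the old table once the migration is verified
-- DROP TABLE orders_old;
SQL
```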

Prod setup

You set up a replica of the production environment where you'd try out gh-ost and its various configuration options.

The two most important configuration knobs are (see the example invocation below):

  1. Threads_running: the number of active threads on the primary; via gh-ost's max-load setting, crossing this threshold makes gh-ost throttle the copy

  2. max-lag-millis: the maximum replication lag on the replicas that gh-ost tolerates before throttling
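
A typical gh-ost invocation on this test setup might look like the sketch below. The hosts, database, and table names are placeholders, and the threshold values are illustrative, not recommendations:

```bash
# gh-ost copies rows in chunks and throttles itself when the primary gets busy
# (--max-load) or when replication lag grows beyond --max-lag-millis.
# Point it at a replica; it discovers the primary from the replication topology.
gh-ost \
  --host=replica-1.db.internal \
  --database=my_app \
  --table=orders \
  --alter="ADD INDEX idx_customer_id (customer_id)" \
  --chunk-size=1000 \
  --max-load="Threads_running=25" \
  --critical-load="Threads_running=500" \
  --max-lag-millis=1500 \
  --throttle-control-replicas="replica-1.db.internal:3306,replica-2.db.internal:3306" \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
  --verbose \
  --execute
```

While the postpone flag file exists, gh-ost keeps the shadow table in sync but holds off the final rename; deleting the file triggers the cut-over, which is what makes the switchover controllable. Omitting --execute runs gh-ost in dry-run mode, a good first step on the test replica.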

You go through gh-ost documentation and even the source code to find out how it works. You read through most use cases and lurk in forums to understand production challenges. You simulate prod-like scale (400M rows) and run multiple migrations on such tables with various flags.

Fast-forwarding the story:

Once you have the confidence, you perform your first migration on a 10M-row table. It works well, with zero downtime. Gradually, you move on to migrating a 400M-row table. It takes 12+ hours, but things go mostly smoothly. You document the runbook.

Outcome

Thanks to your efforts,

  • Product teams can ship features faster

  • Database reliability is improved

  • Anyone on the team can run schema migrations on large tables by following your runbook

So far, the team has run hundreds of schema migrations without any problems.

Learnings

Scaling stateless systems is easy. Stateful, less so.

Lessons:

  • Learn to go deep in your stack

  • Read the documentation and the code, and participate in forums

  • Try out tools at production scale, not hello-world scale

  • Automate the process

  • Give it time; don't expect results on day 1

I write such stories on software engineering.

There's no specific frequency, as I don't make these up.

If you liked this one, you might love - 🔍Building Pull Request-based ephemeral Preview environments on Kubernetes

Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!

Oh, by the way, if you need help with database reliability and scaling, reach out on Twitter or LinkedIn via DMs. We have worked at terabyte scale with both relational and non-relational databases.

This was one of the reasons I started One2N - to help growing orgs scale sustainably.

Subscribe for more such content

Stay updated with the latest insights and best practices in software engineering and site reliability engineering by subscribing to our content.
