Migrating Terabytes of metrics data with zero downtime

~ Chinmay Naik

You're an SRE responsible for a VictoriaMetrics deployment handling 30 million time series per minute. The CTO wants you to drastically reduce the cost of this infra without compromising reliability.

You come up with a solution that looks ridiculous at first, but makes total sense.

Context

You're managing the observability setup for a large B2B org. The metrics setup consists of self-hosted VictoriaMetrics that handles 30 million time series per minute. It's all hosted on AWS in the EU (Ireland) region.

The setup consists of 50+ VMs, and you're running VictoriaMetrics in cluster mode with separate nodes for vmstorage, vminsert, and vmselect. Running just this monitoring infra costs upwards of USD 10k/month.

You need 6+ months of metrics retention to run analytics queries. The total metrics data size is 35+ TB (yes, terabytes) and growing every day. It's Q4 of 2023, you're running low on budget, and the CTO asks that costs be reduced drastically.

All this - without any downtime!

Your approach

"How are you supposed to reduce the costs?"

You think it over and try to analyze the workloads. You have gone through the entire VictoriaMetrics documentation, to the point that you see histograms and percentiles in your dreams.

There must be some way...

You think of a few approaches, but none of them will get you the cost savings that you want. You're desperate and about to write to the CTO that this can't be done. You can't reduce these costs, given the constraints.

You decide to sleep on it and send that update tomorrow.

The next morning, while taking a shower, you have a crazy thought. You hurriedly finish up and start comparing EC2 instance prices across various regions while still in your bathrobe.

You can't stop smiling, but you're not sure whether the CTO will approve your crazy idea.

First thing, you set up a call with the CTO and open with your ridiculous solution: migrate the entire metrics data storage to a different AWS region (IN/Mumbai in your case).

Before the CTO can say anything, you ask him to hear you out.

You go:

See, I did a lot of thinking and tried many hypotheses, but we're running a tight ship. The only way to save up to 40-45 percent of the cost is to migrate all the data from the EU to IN. I did some napkin math, and I think we can save up to 40% in AWS infra cost if we migrate the instances to IN.

You show him the numbers, and it makes sense. You should definitely migrate!
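The napkin math can be sketched in a few lines. This is only an illustration: the baseline treats the "upwards of USD 10k/month" figure as a round 10,000, the fractions are the 40-45 percent range from the pitch, and the function name is made up for this example. The actual per-region instance prices are not given in the story, so they are not modeled here.

```python
def monthly_savings(baseline_usd: float, savings_fraction: float) -> float:
    """Return the monthly saving for a baseline cost and a fractional reduction."""
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be between 0 and 1")
    return baseline_usd * savings_fraction


# Assumption: the story only says "upwards of USD 10k/month".
BASELINE_USD = 10_000

for fraction in (0.40, 0.45):
    saved = monthly_savings(BASELINE_USD, fraction)
    print(f"{fraction:.0%} saving: USD {saved:,.0f}/month, USD {saved * 12:,.0f}/year")
```

At the assumed baseline, that is roughly USD 4,000 to 4,500 a month, or USD 48,000 to 54,000 a year.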

But the next question is:

How do we migrate, and with zero downtime at that?

Zero downtime metrics data migration

You ask him for a couple of days so that you can prepare the plan. Skipping a few details, you end up with the following migration plan:

1. Spin up the infra in the IN region.
2. Start dual writes to both the EU and IN regions from time t.
3. After some time, copy the old EU data (everything written before t) to IN.
4. Route read traffic to IN.
5. Stop EU writes and shut down the EU nodes.
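One way to reason about the plan is to write down, for each step, which regions receive writes and which region serves reads. The sketch below is a hypothetical model, not the tooling used in the story: the `Phase` class, the phase names, and the helper functions are all invented for illustration. It checks that the region serving reads is always one that is also receiving writes, and that each step has a previous state to fall back to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    write_regions: tuple[str, ...]  # regions receiving new samples
    read_region: str                # region serving queries


# Hypothetical encoding of the migration steps listed above.
PLAN = (
    Phase("baseline", ("EU",), "EU"),
    Phase("IN infra up", ("EU",), "EU"),
    Phase("dual writes from t", ("EU", "IN"), "EU"),
    Phase("old EU data copied to IN", ("EU", "IN"), "EU"),
    Phase("reads routed to IN", ("EU", "IN"), "IN"),
    Phase("EU retired", ("IN",), "IN"),
)


def reads_are_served(phase: Phase) -> bool:
    """Queries stay fresh only if the read region also receives writes."""
    return phase.read_region in phase.write_regions


def rollback_target(index: int) -> Phase:
    """The state to fall back to if the phase at `index` goes wrong."""
    return PLAN[max(index - 1, 0)]


for i, phase in enumerate(PLAN):
    print(f"{phase.name}: writes={phase.write_regions} reads={phase.read_region} "
          f"ok={reads_are_served(phase)} fallback='{rollback_target(i).name}'")
```

Note that EU keeps receiving writes until the very last step, which is what keeps the fallback to EU valid after reads move to IN. The story does not say how dual writes were implemented; as one option, vmagent accepts multiple `-remoteWrite.url` flags and sends the data to each destination.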

Obviously, there's a lot of detail in each of these steps. You create a proper checklist, assign owners, plan for contingencies, and document rollback plans.

Your approach is such that you always have a working solution at any point in time that you can easily fall back to.
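A fallback is only useful if you know when to use it. The story does not describe how the two regions were verified, so the following is purely an assumed check: run the same query against EU and IN, then compare the results before routing reads to IN. The function name, the sample series keys, and the 1% tolerance are all hypothetical.

```python
import math


def mismatched_series(eu: dict[str, float], in_: dict[str, float],
                      rel_tol: float = 0.01) -> list[str]:
    """Return series that are missing on one side or differ beyond rel_tol."""
    bad = []
    for key in sorted(eu.keys() | in_.keys()):
        if key not in eu or key not in in_:
            bad.append(key)  # present in only one region
        elif not math.isclose(eu[key], in_[key], rel_tol=rel_tol):
            bad.append(key)  # values disagree beyond the tolerance
    return bad


# Made-up query results keyed by series, for illustration only.
eu_result = {"cpu{host=a}": 0.52, "cpu{host=b}": 0.31}
in_result = {"cpu{host=a}": 0.52, "cpu{host=b}": 0.40}

print(mismatched_series(eu_result, in_result))  # ['cpu{host=b}']
```

An empty list would suggest the regions agree for that query; anything else is a reason to stay on EU and investigate.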

The CTO is happy with the details and your approach. The entire migration takes about 2 weeks (1 week of planning and 1 week of execution + testing). A month passes, and you check the actual realized savings.
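For a sense of scale, here is a rough estimate of the sustained throughput needed to copy 35 TB within a given window. The copy windows of 1, 3, and 5 days are assumptions chosen for illustration (the story only says execution and testing took about a week), and the calculation uses decimal units and ignores compression, retries, and the new data arriving during the copy.

```python
def required_throughput_mb_s(data_tb: float, days: float) -> float:
    """Sustained MB/s needed to copy `data_tb` decimal terabytes in `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    total_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB in decimal units
    return total_mb / (days * 86_400)  # 86,400 seconds per day


for days in (1, 3, 5):
    print(f"{days} day(s): {required_throughput_mb_s(35, days):,.0f} MB/s")
```

That works out to roughly 405, 135, and 81 MB/s respectively, which is why the copy needs to be planned as its own step rather than treated as a quick background job.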

You achieved 40% cost savings month on month. Not bad!

Your solution to migrate from one region to another sounded ridiculous at first, but it worked and got you the savings you wanted.

Since there was no issue with data residency requirements (as you don't store sensitive data in metrics storage), this worked out well.

I write such stories on software engineering.

There's no specific frequency, as I don't make these up.

If you liked this one, you might love - A curious story of debugging Machine Learning model performance.

Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!

