Migrating Terabytes of metrics data with zero downtime

You're an SRE responsible for a VictoriaMetrics deployment that handles 30 million time series per minute. The CTO wants you to drastically reduce the cost of this infra without compromising reliability.

You come up with a solution that looks ridiculous at first, but makes total sense.

Context

You're managing the observability setup for a large B2B org. The metrics setup consists of self-hosted VictoriaMetrics that handles 30 million time series per minute. It's all hosted on AWS in the EU (Ireland) region.

The setup consists of 50+ VMs, and you're running VictoriaMetrics in cluster mode with separate nodes for vmstorage, vminsert, and vmselect. The cost of running just this monitoring infra is upwards of USD 10k/month.

You need 6+ months of metrics retention to run analytics queries. The total metrics data size is 35+ TB (yes, terabytes) and growing every day. It's Q4 of 2023, so you're running low on budget, and the CTO asks that the costs be reduced drastically.

All this - without any downtime!

Your approach

"How are you supposed to reduce the costs?"

You're thinking, trying to analyze the workloads. You have gone through the entire VictoriaMetrics documentation, to the point that you see histograms and percentiles in your dreams.

There must be some way...

You think of a few approaches, but none of them will get you the cost savings that you want. You're desperate and about to write to the CTO that this can't be done. You can't reduce these costs, given the constraints.

You decide to sleep on it and send that update tomorrow.

The next morning, while taking a shower, you have a crazy thought. You hurriedly finish your shower and start comparing EC2 instance prices across various regions while still in your bathrobe.

You can't stop smiling 😀 but you're not sure whether the CTO will approve your crazy idea.

First thing, you set up a call with the CTO and open with your ridiculous solution: migrate the entire metrics data storage to a different AWS region (IN/Mumbai in your case).

Before the CTO can say anything, you ask him to hear you out.

You go:

"See, I did a lot of thinking and tried many hypotheses, but we're running a tight ship. The only way to save up to 40-45 percent of the cost is to migrate all the data from the EU to IN. I did paper napkin math, and I think we can save up to 40% in AWS infra cost if we migrate the instances to IN."

You show him the numbers, and it makes sense. You should definitely migrate!

But the next question is:

How do we migrate, and with zero downtime at that?

Zero downtime metrics data migration

You ask him for a couple of days so that you can prepare the plan. Skipping ahead a few details, in the end, you devise the following migration plan:

  • Spin up infra in IN region

  • Dual writes to EU and IN region from time t.

  • After some time, copy old EU data to IN.

  • Route read traffic to IN.

  • Stop EU writes and stop EU nodes.

Obviously, there's a lot of detail in each of these steps. You create a proper checklist, assign owners, plan for contingencies, and document rollback plans.

Your approach is such that you always have a working solution at any point in time that you can easily fall back to.
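The "always have a fallback" property of the plan can be written down and checked. The sketch below is a hypothetical model, not the tooling used in the migration: the `Phase` fields, the phase names, and the `fallback_region` helper are assumptions made for illustration. Each phase records which regions receive new writes, which regions hold the full history, and which region serves reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    writes: frozenset    # regions receiving new samples during this phase
    complete: frozenset  # regions holding the full retention window
    reads: str           # region serving queries during this phase


# Hypothetical encoding of the five steps of the migration plan.
PLAN = [
    Phase("spin up IN infra", frozenset({"EU"}), frozenset({"EU"}), "EU"),
    Phase("dual writes from t", frozenset({"EU", "IN"}), frozenset({"EU"}), "EU"),
    Phase("copy pre-t data to IN", frozenset({"EU", "IN"}), frozenset({"EU"}), "EU"),
    Phase("route reads to IN", frozenset({"EU", "IN"}), frozenset({"EU", "IN"}), "IN"),
    Phase("stop EU writes and nodes", frozenset({"IN"}), frozenset({"IN"}), "IN"),
]


def fallback_region(phase):
    """Return a region that has full history and still receives writes."""
    candidates = phase.complete & phase.writes
    if "EU" in candidates:
        # Prefer the original setup for as long as it is still valid.
        return "EU"
    return min(candidates) if candidates else None


def check_plan(plan):
    """Raise if any phase serves reads from a region that is incomplete or stale."""
    for phase in plan:
        if phase.reads not in (phase.complete & phase.writes):
            raise ValueError(f"{phase.name}: reads would hit incomplete data")
        if fallback_region(phase) is None:
            raise ValueError(f"{phase.name}: no region to fall back to")
    return True


if __name__ == "__main__":
    check_plan(PLAN)
    for phase in PLAN:
        print(f"{phase.name}: reads={phase.reads}, fallback={fallback_region(phase)}")
```

In this model the EU cluster stays a valid fallback through every phase except the last one, which is why stopping the EU nodes comes only after reads have been served from IN.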

The CTO is happy with the details and your approach. The entire migration takes about 2 weeks (1 week of planning and 1 week of execution + testing). A month passes, and you check the actual realized savings.

You achieved 40% cost savings month on month. Not bad!
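To put that in absolute terms, here is the napkin math as a tiny sketch. It uses only the two figures from this story: the USD 10k/month cost, which is a lower bound since the real bill was "upwards of" that, and the 40% realized savings. The function name is made up for illustration.

```python
def napkin_savings(monthly_cost_usd, savings_percent):
    """Return (monthly, yearly) savings in USD for a given percentage cut."""
    monthly = monthly_cost_usd * savings_percent / 100
    return monthly, monthly * 12


# USD 10k/month is a lower bound on the bill, so these are lower bounds too.
monthly, yearly = napkin_savings(10_000, 40)
print(f"at least USD {monthly:,.0f}/month, USD {yearly:,.0f}/year")
```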

Your solution to migrate from one region to another sounded ridiculous at first, but it really worked and got you the savings you wanted.

Since there was no issue with data residency requirements (as you don't store sensitive data in metrics storage), this worked out well.

I write such stories on software engineering.

There's no specific frequency, as I don't make these up.

If you liked this one, you might love - 🪀A curious story of debugging Machine Learning model performance.

Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!

Subscribe for more such content

Stay updated with the latest insights and best practices in software engineering and site reliability engineering by subscribing to our content.
