Context
Flip is a fintech company in Indonesia. Their primary MySQL database was self-managed and self-hosted on a cloud provider. They were looking to move away from that cloud provider to GCP and wanted to migrate the self-hosted MySQL setup to GCP’s CloudSQL. The MySQL server received peak traffic of 10,000 database requests per second, and the total database size was 1.3 TB.
Problem Statement
Migrate from a self-hosted MySQL cluster of three nodes in a primary/replica setup to GCP CloudSQL. As a fintech company handling financial data, downtime had to be minimal without sacrificing data consistency, integrity, or durability, while adhering to the organization's SLA.
Outcome/Impact
The migration was completed in two months by a team of four engineers, with only about 10 minutes of downtime for the MySQL cutover.
Solution
The diagram below shows the initial setup. The applications run as Pods, which point to a ProxySQL high-availability cluster. This cluster routes all write queries to the MySQL primary and read queries to the MySQL read replicas. The read replicas stay in sync using binary log replication. The SRE team managed each MySQL server node.
Existing self-hosted MySQL setup
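To make the routing concrete, here is a minimal sketch of how a read/write split like this is typically expressed through ProxySQL's admin interface. The hostgroup IDs, hostnames, and query rules below are illustrative assumptions, not our exact configuration.

    -- Assumed hostgroups: 10 = writer (MySQL primary), 20 = readers (replicas)
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (10, 'mysql-primary.internal', 3306, 1000);
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (20, 'mysql-replica-1.internal', 3306, 1000);
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (20, 'mysql-replica-2.internal', 3306, 1000);

    -- Send SELECT ... FOR UPDATE to the writer and all other SELECTs to the readers;
    -- anything that matches no rule falls through to the user's default (writer) hostgroup.
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply) VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1);
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply) VALUES (2, 1, '^SELECT', 20, 1);

    LOAD MYSQL SERVERS TO RUNTIME;  SAVE MYSQL SERVERS TO DISK;
    LOAD MYSQL QUERY RULES TO RUNTIME;  SAVE MYSQL QUERY RULES TO DISK;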
We evaluated several options for migrating the data from MySQL to CloudSQL:
Point-in-time backup and restore
GCP Data Migration Service (DMS) with support for continuous replication
CloudSQL External Replication
After comparing these options, we found DMS the easiest to set up: it uses native binary logs for replication and comes with built-in monitoring of replication lag and the amount of data copied.
We did, however, find a few downsides with DMS:
DMS provisions the CloudSQL instance itself, so we could not create it with Terraform; we could, however, import it into Terraform later.
The CloudSQL instance that DMS provisions runs in replica mode and is not HA. Reads and writes are blocked while the instance is promoted to primary, so we planned a small downtime window around the promotion.
DMS could not reach the source MySQL primary in the other cloud provider even though we had set up VPC peering. We had to add some iptables NAT rules to work around it.
DMS runs large Select * … statements over entire tables, which blocks writes and hurts database performance for huge tables. To work around that, we used a read-only replica with binary logs enabled as the source for DMS (see the sketch after this list).
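As a rough sketch of what using a replica as the DMS source involves: the replica needs binary logging with log_slave_updates enabled so that changes replicated from the primary land in its own binlog, and DMS connects with a user holding replication and read privileges. The checks and the user below are illustrative assumptions, not our exact commands.

    -- On the source read replica used by DMS
    SHOW VARIABLES LIKE 'log_bin';            -- must be ON
    SHOW VARIABLES LIKE 'log_slave_updates';  -- must be ON so replicated writes reach the binlog
    SHOW MASTER STATUS;                       -- confirms binlog files are being produced

    -- Hypothetical migration user for DMS to connect as
    CREATE USER 'dms_user'@'%' IDENTIFIED BY '<password>';
    GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'dms_user'@'%';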
We started DMS a couple of days before D-Day. It took around three days to copy the 1.3 TB of data, after which the DMS replication lag trended towards zero. At this point, the CloudSQL replica was in sync with the MySQL primary.
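Beyond the DMS dashboard, the lag can also be spot-checked directly on the destination over a regular MySQL connection; a minimal sketch, assuming a MySQL 5.7-style setup where SHOW SLAVE STATUS is available:

    -- On the CloudSQL destination (still acting as a replica at this point)
    SHOW SLAVE STATUS\G
    -- Seconds_Behind_Master should be at or near 0, and
    -- Slave_IO_Running / Slave_SQL_Running should both report 'Yes'.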
For the cutover:
We put all customer-facing apps into maintenance mode so that users would see a clear, easy-to-understand response.
We were already using ProxySQL for read/write workload separation, so we used it to our advantage to simplify the cutover. We reduced the write weights to ensure that internal apps were no longer writing to the primary database, added a backend host entry for CloudSQL, and adjusted the weights during the cutover to shift traffic from the self-hosted MySQL to CloudSQL (see the sketch after this list).
Once basic read/write was working and tested with internal apps, we provisioned more CloudSQL replicas to match our original cluster size and updated ProxySQL to add the new read replica entries.
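A minimal sketch of what the weight-based cutover looks like through ProxySQL's admin interface; the hostgroup IDs and hostnames are illustrative assumptions, and the CloudSQL instance must already have been promoted to primary before it receives writes.

    -- 1. Add CloudSQL to the writer hostgroup (10) with zero weight so it gets no traffic yet
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (10, 'cloudsql-primary.internal', 3306, 0);

    -- 2. During cutover: drain writes from the old primary and shift them to CloudSQL
    UPDATE mysql_servers SET weight = 0 WHERE hostname = 'mysql-primary.internal';
    UPDATE mysql_servers SET weight = 1000 WHERE hostname = 'cloudsql-primary.internal';

    -- 3. As new CloudSQL read replicas come online, add them to the reader hostgroup (20)
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES (20, 'cloudsql-replica-1.internal', 3306, 1000);

    LOAD MYSQL SERVERS TO RUNTIME;  SAVE MYSQL SERVERS TO DISK;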
At a high level, the cutover process looked like the diagram below.
Cutover process
After production traffic was being served successfully, we observed frequent tiny spikes in replication lag. We tuned the innodb_flush_log_at_trx_commit flag. CloudSQL also supports parallel replication, but tuning the flag worked well enough for us that we decided to defer enabling parallel replication to a later date.
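For context, a hedged sketch of that tuning: one common approach is to lower innodb_flush_log_at_trx_commit on the replicas, which reduces fsync pressure at commit time at the cost of a small durability window. The value shown is a typical choice for this trade-off, not necessarily the exact value we used, and on CloudSQL it is applied as a database flag rather than via SET GLOBAL.

    -- Check the current value over a normal MySQL connection
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
    -- 1 (default): write and sync the redo log on every commit (most durable, most fsyncs)
    -- 2: write on commit but sync to disk roughly once per second (fewer fsyncs, ~1s risk window)
    -- On CloudSQL the flag is changed with, for example:
    --   gcloud sql instances patch <replica-instance> --database-flags=innodb_flush_log_at_trx_commit=2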
We also observed that GCP Cloud Monitoring did not reflect the true values of some metrics, such as replication lag. We therefore set up a native MySQL agent (Datadog in our case) that connects to CloudSQL over a MySQL connection to collect more accurate values, which we used for alerting.
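For reference, a rough sketch of the database-level access such an agent typically needs; the user name and privilege set below are illustrative assumptions (the Datadog MySQL integration documents its exact requirements).

    -- Hypothetical monitoring user for the agent
    CREATE USER 'datadog'@'%' IDENTIFIED BY '<password>';
    GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'datadog'@'%';
    GRANT SELECT ON performance_schema.* TO 'datadog'@'%';
    -- REPLICATION CLIENT allows the agent to run SHOW SLAVE STATUS on the replicas
    -- and report replication lag as MySQL itself sees it.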
Once the cutover was complete, the final setup looked like the diagram below: the application Pods connect to the ProxySQL HA setup, which routes write queries to the CloudSQL primary and read queries to the CloudSQL read replicas.
The entire migration, including planning for MySQL and the other stateless workloads, took a team of four engineers two months. We migrated the stateless services without any downtime. MySQL was the most challenging part of the overall journey, and we completed its migration with a mere 10 minutes of downtime.