Context.
Flip is a fintech company in Indonesia. We migrated their stateless services, running on 25+ VMs on Alibaba Cloud, to Alibaba's managed Kubernetes offering, Container Service for Kubernetes (ACK).
These application services served Flip's B2B and B2C traffic, which often peaked at 25,000 requests/second.
Problem Statement.
Manual scaling of VMs during sudden traffic peaks was slow, human-dependent, and error-prone, leading to higher latencies and occasional gateway timeouts.
Outcome/Impact
- 30% savings on compute costs
- 120 pods at peak load
- 0 seconds of downtime during the migration
Solution
The existing setup consisted of web and mobile apps accessing backend APIs running on Alibaba Cloud VMs. User traffic was routed via Cloudflare. Each VM ran Nginx and PHP-FPM processes, and the SRE team manually scaled the infrastructure to serve the traffic.
Existing VM-based setup
Making the application cloud native
We started by containerizing the application, separating the tightly coupled Nginx and PHP-FPM processes into separate stateless containers. One backend service relied on sticky sessions, so we worked with the dev team to make it stateless.
We migrated the existing Nginx configuration and routing rules to a ConfigMap in Kubernetes, ran the PHP-FPM container as a sidecar to Nginx in the same pod, and mounted the ConfigMap into it.
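A minimal sketch of what such a Deployment could look like, assuming an application image named flip/php-app and a ConfigMap called nginx-config (both names are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: nginx
          image: nginx:1.25                    # serves HTTP, forwards PHP requests to the sidecar
          ports:
            - containerPort: 80
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/conf.d     # routing rules from the ConfigMap
        - name: php-fpm                        # sidecar: runs the application code
          image: flip/php-app:latest           # hypothetical application image
          ports:
            - containerPort: 9000              # Nginx reaches FPM over localhost, since containers in a pod share a network namespace
      volumes:
        - name: nginx-config
          configMap:
            name: nginx-config
```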
We also introduced a ConfigMap reloader to detect config changes and automatically reload the application without downtime, which saved us operational overhead.
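The post doesn't name the tool; one common option is Stakater's Reloader, which watches mounted ConfigMaps and triggers a rolling restart of annotated workloads. Assuming that tool, enabling it is a single annotation on the Deployment (excerpt):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  annotations:
    reloader.stakater.com/auto: "true"   # roll the pods whenever a mounted ConfigMap or Secret changes
```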
From a security perspective, we adopted cloud-native secrets management and restricted access in production to the SRE team.
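The post doesn't detail the mechanism; in Kubernetes this kind of restriction is typically expressed with RBAC. A hedged sketch that limits Secret access in the production namespace to an SRE group (the group name is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sre-secret-reader
  namespace: production
subjects:
  - kind: Group
    name: sre-team                     # hypothetical group name mapped from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io
```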
Flip being a fintech, reliability is of the utmost priority. We updated the continuous deployment process to do canary, controlled rollouts to production, so we could detect errors before they impacted all users.
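The post doesn't say which canary mechanism was used. One minimal, tool-free approach is to run a small canary Deployment behind the same Service selector as the stable one, so the canary receives a proportional share of traffic:

```yaml
# Canary Deployment; the stable Deployment is identical but with track: stable
# and, for a ~10% split, 9 replicas. The Service selects only app: backend-api,
# so both tracks receive traffic in proportion to their pod counts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api-canary
spec:
  replicas: 1                            # 1 canary pod alongside 9 stable pods ≈ 10% of traffic
  selector:
    matchLabels:
      app: backend-api
      track: canary
  template:
    metadata:
      labels:
        app: backend-api
        track: canary                    # lets dashboards and alerts split error rates by track
    spec:
      containers:
        - name: nginx
          image: flip/php-app:candidate  # hypothetical image tag of the release under test
```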
The cloud-native architecture allowed us to quickly set up observability tools that provide better visibility into operations. We also used the Kubernetes Horizontal Pod Autoscaler (HPA) to automate scaling the Pods in and out based on traffic, reducing operational toil. The overall system looked as below.
Application setup on Kubernetes
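The post doesn't give the HPA spec; a minimal sketch for the Deployment above, assuming CPU-based scaling (the actual metric and thresholds are assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  minReplicas: 10                  # hypothetical floor for non-peak hours
  maxReplicas: 150                 # headroom above the ~120 pods observed at peak
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when average CPU crosses 60%
```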
Gradual traffic movement from VMs to Kubernetes
To ensure no users were impacted, we ran the VM and Kubernetes setups in parallel and gradually migrated all traffic from the VMs to Kubernetes. We updated the deployment pipeline to deploy the application to both environments in parallel.

We were already using Cloudflare for DNS and DDoS protection, so we decided to use Cloudflare's external load balancer to shift traffic from the VMs to Kubernetes. We started by routing 10% of live traffic to the Kubernetes setup and monitored RED (rate, errors, duration) metrics to make sure we didn't breach our SLOs.

Within 3 weeks, we moved 100% of the traffic to Kubernetes and decommissioned the VM setup. During this time there was no business impact; the entire migration was transparent to the users. Throughout the migration period, the product team kept shipping new features, and we deployed those features to both the VM and Kubernetes environments.
Phased rollout using Cloudflare external load balancer
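Cloudflare's load balancer can split traffic across origin pools by weight. A hedged sketch of the relevant configuration (shown as YAML for readability; the API takes JSON, and the pool IDs and hostname are placeholders):

```yaml
# Load balancer object: random steering splits traffic by pool weight
name: api.flip.id                  # hypothetical hostname
default_pools:
  - vm-pool-id                     # placeholder pool ID for the VM origins
  - k8s-pool-id                    # placeholder pool ID for the Kubernetes ingress
fallback_pool: vm-pool-id
steering_policy: random
random_steering:
  pool_weights:
    vm-pool-id: 0.9                # 90% of traffic stays on the VMs
    k8s-pool-id: 0.1               # 10% slice to Kubernetes, increased gradually
```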
Managing reliability at scale
Initially, during sudden peak loads, the application often faced higher latencies, sometimes resulting in gateway timeouts and business impact. Manual scaling of VMs was time-consuming, human-dependent, and error-prone, and it took substantial time for a new VM to be ready to accept user requests.

With the HPA, we scaled down by 70% during non-peak hours, saving 30% in compute costs. Under additional load, the HPA would automatically kick in and scale the capacity without human intervention; we observed the pod count reaching 120 during peak load.
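This scale-down pattern can be tuned via the HPA's behavior field (available in autoscaling/v2); the values below are illustrative, not from the post:

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of sustained low load before removing pods
      policies:
        - type: Percent
          value: 20                     # shed at most 20% of pods per minute on the way down
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately to traffic spikes
```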