Context.
The client provides eKYC SaaS APIs (face matching, OCR, etc.), accessible via an Android SDK, to its B2B customers. The platform was deployed for a major telecom operator in India and needed to scale to 2 million API requests per day.
The tech stack consisted of a Golang-based API, Python-based machine learning models, a message queue (RabbitMQ), distributed blob storage (Minio), and a PostgreSQL database, deployed via Docker on Google Cloud using the HashiCorp toolchain (Nomad, Consul).
Problem Statement.
Outcome/Impact.
Solution.
In the final auto-scaling solution, we needed a manual override option. At the time, neither Nomad nor Kubernetes provided this capability out of the box, and migrating to Kubernetes would have meant significant rework for the entire team. Hence, we decided to build a custom auto-scaling solution on top of Nomad and the existing toolchain used in the project.
The autoscaler runs every 20 minutes, predicts the traffic for the next cycle, and ensures that the required number of ML workers is available. The same logic handles both scale-up and scale-down.
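A minimal sketch of this scaling loop is shown below, assuming a hypothetical predictNextCycle() forecaster, a scaleWorkers() helper wrapping the Nomad job-scaling call, and illustrative thresholds; none of these names or numbers come from the production code.

```go
// Hypothetical sketch of the 20-minute autoscaling loop.
package main

import (
	"log"
	"math"
	"time"
)

const (
	cycle          = 20 * time.Minute
	perWorkerRPS   = 5.0 // assumed throughput of a single ML worker
	minWorkers     = 4   // floor so capacity never drops to zero
	manualOverride = 0   // >0 pins the worker count, bypassing prediction
)

func main() {
	for range time.Tick(cycle) {
		predicted := predictNextCycle() // requests/sec expected in the next cycle

		required := int(math.Ceil(predicted / perWorkerRPS))
		if required < minWorkers {
			required = minWorkers
		}
		if manualOverride > 0 {
			required = manualOverride
		}

		log.Printf("predicted %.1f rps, scaling ML workers to %d", predicted, required)
		if err := scaleWorkers(required); err != nil {
			log.Printf("scaling failed, keeping current count: %v", err)
		}
	}
}

// predictNextCycle stands in for the real forecast, which extrapolates
// traffic from recent request counts (see Capacity Planning below).
func predictNextCycle() float64 { return 120.0 }

// scaleWorkers stands in for the call that updates the ML worker group
// count in the Nomad job.
func scaleWorkers(n int) error { return nil }
```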
Some of the other challenges we encountered were as follows:
Eliminating single points of failure
We ran RabbitMQ in high-availability mode with queue replication across three GCP zones.
We used GCloud Storage as a fallback for Minio: if Minio is unavailable, the application transparently uses GCloud Storage as the image store (see the sketch after this list).
We split ML workers across two availability zones in an odd-even fashion.
For the other components (PostgreSQL, Redis, API), we used GCP's managed offerings and ran redundant instances of those components.
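The fallback behaviour can be sketched as a thin wrapper around a generic blob-store interface; the minioStore and gcsStore types below are placeholders standing in for the real Minio and GCloud Storage clients, not the project's actual code.

```go
// Sketch of a blob store with transparent fallback: write to Minio first,
// fall back to GCloud Storage if Minio is unavailable.
package main

import (
	"context"
	"fmt"
)

type BlobStore interface {
	Put(ctx context.Context, key string, data []byte) error
}

type fallbackStore struct {
	primary  BlobStore // Minio
	fallback BlobStore // GCloud Storage
}

func (s *fallbackStore) Put(ctx context.Context, key string, data []byte) error {
	if err := s.primary.Put(ctx, key, data); err != nil {
		// Primary is unavailable: the caller never notices, the image just
		// lands in GCloud Storage instead.
		return s.fallback.Put(ctx, key, data)
	}
	return nil
}

// Placeholder implementations standing in for the real clients.
type minioStore struct{}
type gcsStore struct{}

func (minioStore) Put(ctx context.Context, key string, data []byte) error {
	return fmt.Errorf("minio unreachable") // simulate an outage
}
func (gcsStore) Put(ctx context.Context, key string, data []byte) error {
	fmt.Printf("stored %q (%d bytes) in GCS\n", key, len(data))
	return nil
}

func main() {
	store := &fallbackStore{primary: minioStore{}, fallback: gcsStore{}}
	_ = store.Put(context.Background(), "selfie/123.jpg", []byte("..."))
}
```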
GPU utilization bug in Nomad (https://github.com/hashicorp/nomad/issues/6708)
To work around it, we used Nomad's Raw Exec driver and launched multiple ML containers on a single VM using docker-compose.
We also implemented custom health checks and CPU stickiness for individual containers.
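A simplified sketch of such a health check is shown below, assuming each ML container exposes an HTTP /health endpoint and can be restarted individually through docker-compose; the service names, ports, and intervals are illustrative assumptions.

```go
// Sketch of per-container health checks for ML workers launched with
// docker-compose under Nomad's Raw Exec driver.
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// Assumed mapping of compose service name -> local health endpoint.
var workers = map[string]string{
	"ml-worker-1": "http://127.0.0.1:9001/health",
	"ml-worker-2": "http://127.0.0.1:9002/health",
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for range time.Tick(30 * time.Second) {
		for name, url := range workers {
			healthy := false
			if resp, err := client.Get(url); err == nil {
				healthy = resp.StatusCode == http.StatusOK
				resp.Body.Close()
			}
			if healthy {
				continue
			}
			log.Printf("%s unhealthy, restarting it via docker-compose", name)
			// Restart only the failed container; CPU pinning (cpuset) lives in
			// the compose file, so the container keeps its dedicated cores.
			if out, err := exec.Command("docker-compose", "restart", name).CombinedOutput(); err != nil {
				log.Printf("restart of %s failed: %v: %s", name, err, out)
			}
		}
	}
}
```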
Capacity Planning
We determined the optimal VM configuration, weighing factors such as GPU, RAM, and cost per hour.
We studied past request patterns and derived a simple formula for predicting traffic based on the slope of the request growth line.
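An illustrative version of this slope-based forecast is shown below; the window size, per-worker throughput, and request counts are made-up numbers, not the production values.

```go
// Illustrative slope-based forecast: extrapolate the next cycle's request
// count from the average growth per cycle over a recent window.
package main

import "fmt"

func predictNext(history []float64) float64 {
	n := len(history)
	if n == 0 {
		return 0
	}
	if n == 1 {
		return history[0]
	}
	// Slope of the growth line: average increase per cycle across the window.
	slope := (history[n-1] - history[0]) / float64(n-1)
	return history[n-1] + slope
}

func main() {
	// Requests per 20-minute cycle (made-up numbers).
	history := []float64{18000, 21000, 24500, 28000}
	predicted := predictNext(history)

	const perWorker = 900.0 // assumed requests one ML worker handles per cycle
	workers := int(predicted/perWorker) + 1
	fmt.Printf("predicted %.0f requests next cycle -> %d ML workers\n", predicted, workers)
}
```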
Automated Rolling Deployments during peak time
We pre-fetched the ML worker Docker images (7GB+) on the nodes for faster startup during deployments.
The golden image was updated on the first node; once that deployment succeeded, it was rolled out to the remaining nodes one at a time. This allowed us to deploy services even during peak load.
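A schematic version of this rollout is sketched below; deployTo and healthy are hypothetical placeholders for the actual image pull, service restart, and health-check steps.

```go
// Schematic rolling deployment: update a canary node first, then the
// remaining nodes one at a time, so capacity never drops by more than one
// node even during peak load.
package main

import (
	"fmt"
	"log"
)

var nodes = []string{"gpu-node-1", "gpu-node-2", "gpu-node-3", "gpu-node-4"}

func main() {
	// Canary: the first node gets the new golden image.
	if err := deployTo(nodes[0]); err != nil || !healthy(nodes[0]) {
		log.Fatalf("canary failed on %s, aborting rollout: %v", nodes[0], err)
	}

	// Remaining nodes, strictly one at a time.
	for _, node := range nodes[1:] {
		if err := deployTo(node); err != nil || !healthy(node) {
			log.Fatalf("rollout halted at %s: %v", node, err)
		}
	}
	fmt.Println("rollout complete")
}

// deployTo would pull the (pre-fetched) image and restart the ML workers on
// the node; healthy would run the custom health checks against it.
func deployTo(node string) error { fmt.Println("deploying to", node); return nil }
func healthy(node string) bool   { return true }
```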
Monitoring, Alerting, and Auto-healing
We made various SLI reports and latency dashboards available to all stakeholders.
We set up PagerDuty and on-call schedules.
We implemented scripted actions for common operational issues (for example, handling live VM migration for GPU VMs); a sketch of one such action follows this list.
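As an example, a scripted action of this kind can be sketched as a watcher on GCE's maintenance-event instance metadata that drains a GPU node before host maintenance hits it; drainNode and the polling interval are illustrative placeholders, not the actual script.

```go
// Sketch of one scripted action: watch GCE's maintenance-event metadata on a
// GPU VM and drain the node's ML workers before host maintenance.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

const metadataURL = "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

func main() {
	for range time.Tick(time.Minute) {
		req, _ := http.NewRequest("GET", metadataURL, nil)
		req.Header.Set("Metadata-Flavor", "Google") // required by the GCE metadata server

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("metadata check failed: %v", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		if event := string(body); event != "NONE" {
			log.Printf("maintenance event %q detected, draining node", event)
			drainNode()
		}
	}
}

// drainNode would stop accepting new work on this VM and let the autoscaler
// shift load to workers in the other zone.
func drainNode() {}
```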
Testing the whole setup and fixing issues
We performed extensive, long-running load tests with production-like traffic patterns to ensure the autoscaler worked as expected.
We tested the redundancies and HA setup by introducing chaos (shutting down nodes and services like Minio) during load testing.
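A toy version of such a load-test driver is sketched below, ramping the request rate step by step to mimic a production-like traffic pattern; the endpoint and per-step rates are placeholders, and chaos injection would run separately alongside it.

```go
// Toy load-test driver: ramp the request rate step by step to mimic a
// production-like daily traffic pattern against the API under test.
package main

import (
	"log"
	"net/http"
	"time"
)

const endpoint = "http://api.example.internal/v1/ocr" // placeholder URL

// Requests per second for each step of the ramp (one step per 20 minutes).
var ramp = []int{10, 25, 50, 80, 120, 80, 40}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for _, rps := range ramp {
		log.Printf("load step: %d requests/sec", rps)
		ticker := time.NewTicker(time.Second / time.Duration(rps))
		deadline := time.Now().Add(20 * time.Minute)
		for time.Now().Before(deadline) {
			<-ticker.C
			go func() {
				// Fire-and-forget request (plain GET for brevity); latency and
				// errors would be recorded by the monitoring stack above.
				if resp, err := client.Get(endpoint); err == nil {
					resp.Body.Close()
				}
			}()
		}
		ticker.Stop()
	}
}
```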
Tech stack used.
For a detailed understanding of the work, watch our talk presented at HashiTalks India 2021.