Homogeneous Monitoring across Cloud and On-premise.

Context.

The client provides credit insights to financial institutions and non-banking financial companies (NBFCs) across Southeast Asia.

They run 15 deployments across AWS and GCP, plus six on-premise data centers in different geographies.

Every on-premise data center has 40+ bare-metal machines, and every cloud deployment has 10+ VMs with 30+ microservices.

Problem Statement.

Monitor CPU, Memory, and Disk Utilization of the host systems and application containers. We needed a uniform solution to fetch metrics across AWS, GCP and on-premise data centers.

All metrics should be stored and served from a single storage backend for uniformity, visualization and alerting.

Set alerts on utilization thresholds for host CPU and disk. Set alerts on machine or container restarts.

Alerts should be configured and raised through a single channel.

The system should handle up to 2,000 metric events per minute.

Outcome/Impact.

A homogeneous monitoring and alerting solution for deployments across cloud and on-premise data centers.

Solution.

Every deployment (Cloud or On-Premise) had its own Nomad cluster for scheduling workloads. Prometheus and StatsD exporter were used for metrics collection.

We configured the Nomad agents and servers to emit metrics to the StatsD exporter over UDP, and configured Prometheus to scrape the StatsD exporter's HTTP endpoint. This allowed us to set up monitoring for new VMs without having to restart Prometheus.
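
To make the push side of this concrete, here is a minimal Go sketch of what emitting a StatsD gauge over UDP looks like. It is an illustration only, not the client's code: the metric name `example.host.cpu.percent` and the address `127.0.0.1:9125` are assumptions made for the example. Because senders push to an exporter that Prometheus already scrapes, a new sender would not normally require a change to the Prometheus scrape configuration.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// formatGauge renders one metric in the plain StatsD line format:
// <name>:<value>|g
func formatGauge(name string, value float64) string {
	return name + ":" + strconv.FormatFloat(value, 'f', -1, 64) + "|g"
}

// sendGauge writes a single gauge datagram to a StatsD-compatible listener.
// UDP is fire-and-forget: a nil error does not prove the packet was received.
func sendGauge(addr, name string, value float64) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(formatGauge(name, value)))
	return err
}

func main() {
	// Hypothetical address and metric name, chosen only for this sketch.
	if err := sendGauge("127.0.0.1:9125", "example.host.cpu.percent", 42.5); err != nil {
		fmt.Println("send failed:", err)
	}
}
```

In the setup described above, Nomad's own telemetry does the sending, so no custom sender like this would be needed; the sketch only shows the wire format the exporter receives.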

We configured Prometheus to write metrics to a custom remote HTTP service written in Go. The service then forwards these metrics to CloudWatch or Stackdriver. This allowed us to scale Prometheus by using the underlying CloudWatch or Stackdriver as a remote storage backend.

Host and service monitoring alerts were created with Terraform scripts and integrated with PagerDuty.

We used Grafana for visualizations, with CloudWatch and Stackdriver as backends.

We configured custom alerts in Grafana with the PagerDuty integration.
