Optimizing MongoDB backup strategy: Lessons from achieving a 1-Hour RPO

May 14, 2025 | 8 min read

Overview

Disaster recovery isn’t just a checkbox; it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it has some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support: The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets: The MongoDB Community operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic because replica set configuration is embedded in each node's local database. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots: Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores and to meet low RPO targets efficiently.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, for the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

We take daily full backups using the native mongodump utility provided by MongoDB:

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

The oplog-backup script was written to:

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket, with a timestamp embedded in the filename

This script can be scheduled via a Kubernetes CronJob that runs daily.
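
A minimal sketch of what such a daily backup job can look like is shown below. It assumes a hypothetical GCS bucket name and a placeholder connection URI, shells out to mongodump, and uploads the archive with the google-cloud-storage client; the actual script in the repository may differ in details.

import subprocess
from datetime import datetime, timezone

from google.cloud import storage  # requires the google-cloud-storage package

MONGO_URI = "<mongodb-connection-uri>"   # placeholder, as in the commands above
BUCKET_NAME = "my-mongodb-backups"       # hypothetical GCS bucket name

def run_daily_full_backup():
    # Embed a UTC timestamp in the archive name so every daily run is unique
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    archive = f"full-backup-{ts}.gz"

    # Full gzipped archive dump of the deployment
    subprocess.run(
        ["mongodump", f"--uri={MONGO_URI}", "--gzip", f"--archive={archive}"],
        check=True,
    )

    # Upload the archive to GCS under a per-day prefix
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(f"full/{ts[:10]}/{archive}").upload_from_filename(archive)

if __name__ == "__main__":
    run_daily_full_backup()

A Kubernetes CronJob then simply runs this script in a container on a daily schedule.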

Incremental Oplog Backup

What are Oplogs?

The oplog (operations log) is a special capped collection that logs all write operations in a replica set. It allows secondary replicas to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742893200, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742896800, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \
          --db=local \
          --collection=oplog.rs \
          --query=${query_string} \
          -o -
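
Since the t values are plain Unix epoch seconds, the window can be derived directly from UTC datetimes. The helper below is purely illustrative (it is not one of the published scripts) and builds the query document for an arbitrary window:

import json
from datetime import datetime, timezone

def oplog_window_query(start: datetime, end: datetime) -> str:
    # Convert timezone-aware datetimes to Unix epoch seconds for the BSON timestamp query
    return json.dumps({
        "ts": {
            "$gte": {"$timestamp": {"t": int(start.timestamp()), "i": 1}},
            "$lte": {"$timestamp": {"t": int(end.timestamp()), "i": 0}},
        }
    })

# 9:00 - 10:00 UTC on March 25, 2025
query_string = oplog_window_query(
    datetime(2025, 3, 25, 9, 0, tzinfo=timezone.utc),
    datetime(2025, 3, 25, 10, 0, tzinfo=timezone.utc),
)
print(query_string)  # pass this to mongodump --query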

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration:

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 to 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario 1: CronJob scheduled at 10:00 AM UTC (expected to cover 9:00 – 10:00 AM), actual start 10:05 AM

  • Expected oplog window: 9:00 – 10:00 AM UTC

  • Window actually used by the script: 9:05 – 10:05 AM UTC

  • Window with fixed timestamps: 09:00 – 10:00

  • Remarks: 5-minute gap from 9:00 to 9:05

Scenario 2: Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00 – 11:00 AM), actual start 11:09 AM

  • Expected oplog window: 10:00 – 11:00 AM UTC

  • Window actually used by the script: 10:09 – 11:09 AM UTC

  • Window with fixed timestamps: 10:00 – 11:00

  • Remarks: 4-minute gap from 10:05 to 10:09 (following on from the previous delayed window)

To account for potential delays in Kubernetes CronJob execution, the end timestamp is therefore always calculated from the start of the current hour. This ensures that even if the CronJob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.
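
A minimal sketch of that hour-alignment logic, together with the <ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM> prefix construction described earlier, is shown below; the actual script may differ in details:

from datetime import datetime, timedelta, timezone

def hourly_backup_window(now=None):
    # Anchor the window to the start of the current hour rather than the actual
    # run time, so a delayed CronJob still covers a contiguous one-hour window
    now = now or datetime.now(timezone.utc)
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return start, end

def gcs_prefix(env, start):
    # <ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM> naming convention described above
    return start.strftime(f"{env}/%Y/%m/%d/%H/%M")

# A job scheduled for 10:00 but delayed to 10:05 still covers 09:00 - 10:00
start, end = hourly_backup_window(datetime(2025, 3, 25, 10, 5, tzinfo=timezone.utc))
print(start, end)                 # 2025-03-25 09:00:00+00:00  2025-03-25 10:00:00+00:00
print(gcs_prefix("prod", start))  # prod/2025/03/25/09/00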

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes; it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging for two reasons:

  • Collections had TTL (time-to-live) indexes:

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion:

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving-target problem: by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of the data, we created a db_stats.js script that:

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to keep the metadata in sync and avoid delay-related inconsistencies; the metadata is then used post-restore for verification.
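
The script itself is written for mongosh; the sketch below illustrates the same idea in Python with pymongo. The database name and the exact fields collected are assumptions for illustration, and in the real workflow the resulting JSON is uploaded to GCS next to the backup.

import json
from datetime import datetime, timezone

from pymongo import MongoClient  # requires the pymongo package

def collect_db_stats(uri, databases):
    # Collect per-collection document counts, sizes, and _id ranges
    client = MongoClient(uri)
    stats = {}
    for db_name in databases:
        db = client[db_name]
        stats[db_name] = {}
        for coll_name in db.list_collection_names():
            coll = db[coll_name]
            first = coll.find_one(sort=[("_id", 1)])   # oldest _id
            last = coll.find_one(sort=[("_id", -1)])   # newest _id
            stats[db_name][coll_name] = {
                "count": coll.estimated_document_count(),
                "size_bytes": db.command("collStats", coll_name).get("size"),
                "min_id": str(first["_id"]) if first else None,
                "max_id": str(last["_id"]) if last else None,
            }
    return stats

if __name__ == "__main__":
    snapshot = {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "stats": collect_db_stats("<mongodb-connection-uri>", ["app_db"]),  # hypothetical database
    }
    print(json.dumps(snapshot, indent=2))  # in practice, upload this JSON to GCS next to the backup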

Oplog Restoration Script

The oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with --oplogReplay

  • Optionally uses --oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything, but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \
                 --oplogReplay \
                 --oplogLimit="1742898900:0" \
                 --oplogFile=${oplog_file}

    This applies changes only up to 10:34:59 UTC, keeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order, based on the timestamp encoded in their file path (see the sketch below).
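
A simplified sketch of that replay loop, covering both the chronological ordering and the optional --oplogLimit cutoff, is shown below. It assumes the oplog dumps have already been downloaded locally into directories mirroring the GCS prefix; the real oplog-restore script may differ in details.

import subprocess
from pathlib import Path

def replay_oplogs(mongo_uri, oplog_root, oplog_limit=None):
    # Directories follow the <YYYY>/<MM>/<DD>/<HH>/<MM> layout, so a lexicographic
    # sort of their paths is also a chronological sort
    windows = sorted(p for p in Path(oplog_root).glob("*/*/*/*/*") if p.is_dir())
    for window in windows:
        cmd = [
            "mongorestore",
            f"--uri={mongo_uri}",
            "--oplogReplay",
            f"--dir={window}",
        ]
        if oplog_limit:
            # Only matters for the final window; earlier windows end before the limit
            cmd.append(f"--oplogLimit={oplog_limit}")
        subprocess.run(cmd, check=True)

# Replay everything up to (but not including) the bad deployment,
# e.g. "1742898900:0" stops before 10:35:00 UTC on March 25, 2025
replay_oplogs("<mongodb-connection-uri>", "./oplogs/prod", oplog_limit="1742898900:0")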

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be backed up and restored to a secondary instance using mongodump and mongorestore. While this entire process is automated through the scripts in the GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \
  --uri="$SECONDARY_URI" \
  --nsInclude="*" \
  --nsExclude="admin.system.*" \
  --nsExclude="admin.sessions.*" \
  --nsExclude="local.*" \
  --nsExclude="config.system.sessions" \
  --nsExclude="config.transactions" \
  --nsExclude="config.image_collection" \
  --nsExclude="config.system.indexBuilds" \
  --nsExclude="config.system.preimages" \
  --nsExclude="*.system.buckets" \
  --nsExclude="*.system.profile" \
  --nsExclude="*.system.views" \
  --nsExclude="*.system.js" \
  --gzip \
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now let’s assume that you have a deployment to be rolled out at 16:15 UTC on 05/05/2025.

Primary MongoDB stats before deployment

This deployment introduced a bug and added corrupted documents to your database.

Primary MongoDB stats after deployment

With the oplog backup CronJobs in place, you will have an oplog backup covering 16:00 – 17:00, but if you replay this entire file, it will add the corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \
--uri="$SECONDARY_URI" \
--oplogReplay \
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying the oplog file up to 16:14 using the --oplogLimit option

mongorestore \
--uri="$SECONDARY_URI" \
--oplogReplay \
--oplogLimit="1746461700:0" \
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1: Using CDC (Change Data Capture) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations and stream those changes into Apache Kafka.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB and Kafka, helping teams set up end-to-end pipelines with minimal effort.
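
In practice this pipeline would be assembled from the official MongoDB Kafka source connector or an Airbyte connection rather than custom code, but a rough, hand-rolled sketch of the same idea, using pymongo change streams and the kafka-python producer (broker address and topic name are hypothetical), looks something like this:

from bson import json_util           # ships with pymongo; handles ObjectId/Timestamp
from kafka import KafkaProducer      # requires the kafka-python package
from pymongo import MongoClient

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],                      # hypothetical brokers
    value_serializer=lambda e: json_util.dumps(e).encode("utf-8"),
)

client = MongoClient("<mongodb-connection-uri>")

# Stream every insert/update/delete in the deployment into a Kafka topic,
# where it can be retained, replayed, or sunk to GCS by downstream consumers
with client.watch(full_document="updateLookup") as stream:
    for change in stream:
        producer.send("mongodb.cdc.events", change)        # hypothetical topic name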

Method 2: MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provide a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.
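
A minimal sketch of such a resilient consumer is shown below. Persisting the resume token to a local file and the placeholder handle() target are illustrative assumptions; a production consumer would keep the token in durable storage and forward events to Kafka or GCS.

from pathlib import Path

from bson import json_util
from pymongo import MongoClient
from pymongo.errors import PyMongoError

TOKEN_FILE = Path("resume_token.json")   # illustrative; use durable storage in production

def load_resume_token():
    return json_util.loads(TOKEN_FILE.read_text()) if TOKEN_FILE.exists() else None

def handle(change):
    # Placeholder: forward the event to Kafka, GCS, or another backup target
    print(change["operationType"], change.get("ns"))

def watch_forever(uri):
    client = MongoClient(uri)
    while True:
        try:
            # resume_after lets the stream continue exactly where it left off
            with client.watch(resume_after=load_resume_token()) as stream:
                for change in stream:
                    handle(change)
                    # Persist the token after each processed event so a crash or
                    # disconnect can resume without losing its place in the stream
                    TOKEN_FILE.write_text(json_util.dumps(stream.resume_token))
        except PyMongoError:
            continue  # reconnect and resume from the last saved token

if __name__ == "__main__":
    watch_forever("<mongodb-connection-uri>")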

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups with hourly oplog captures, we created a resilient and low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it.

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.

Overview

Disaster recovery isn’t just a checkbox, it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to Achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery ?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it comes with some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets The MDBC operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic due to embedded replica set information stored in local databases. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores, leading to inefficiencies in meeting low RPO targets.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, due to the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

Daily full backups using the native mongodump utility provided by MongoDB.

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

oplog-backup script was written to

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket with timestamp embedded in the filename

This script can scheduled via a kubernetes cronjob that runs on a daily basis.

Incremental Oplog Backup

What are Oplogs?

Oplogs (operation logs) are special capped collections that log all write operations in a replica set. They allow a secondary replica to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742873400, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742877000, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \\
          --db=local \\
          --collection=oplog.rs \\
          --query=${query_string} \\
          -o -

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario

Expected Oplog End Time

Actual Oplog End Time Used by Script

Expected Oplog Start Time

Actual Oplog Start Time Used by Script

Fixed Timestamps (Start End)

Remarks

CronJob scheduled at 10:00 AM UTC (expected to cover 9:00–10:00 AM)

10:00 AM UTC

10:05 AM UTC

9:00 AM UTC

9:05 AM UTC

09:00 10:00

5-minute gap from 9:00 - 9:05

Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00–11:00 AM)

11:00 AM UTC

11:09 AM UTC

10:00 AM UTC

10:09 AM UTC

10:00 11:00

4-minute gap from 10:05 - 10:09

To account for potential delays in kubernetes cronjob execution, the end timestamp is always calculated based on the start of the current hour. This ensures that even if the cronjob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes, it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging due to two reasons

  • Collections had TTL (time-to-live) :

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion :

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving target problem , by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of data, we created db_stats.js script that,

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to ensure synchronization and avoid delay-related inconsistencies and metadata is used post-restore for verification.

Oplog Restoration Script

oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with -oplogReplay

  • Optionally uses -oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything,but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \\
                 --oplogReplay \\
                 --oplogLimit="1742878380:0" \\
                 --oplogFile=${oplog_file}

    This restores changes only up to 10:34 UTCkeeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order based on timestamp extracted from their filename structure.

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be restored using monogdump and mongorestore . While this entire process is automated through scripts available in a GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \\
  --uri="$SECONDARY_URI" \\
  --nsInclude="*" \\
  --nsExclude="admin.system.*" \\
  --nsExclude="admin.sessions.*" \\
  --nsExclude="local.*" \\
  --nsExclude="config.system.sessions" \\
  --nsExclude="config.transactions" \\
  --nsExclude="config.image_collection" \\
  --nsExclude="config.system.indexBuilds" \\
  --nsExclude="config.system.preimages" \\
  --nsExclude="*.system.buckets" \\
  --nsExclude="*.system.profile" \\
  --nsExclude="*.system.views" \\
  --nsExclude="*.system.js" \\
  --gzip \\
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now Let’s Assume that you have deployment to be rolled out on 16:15 05/05/2025 UTC. Primary MongoDB stats before deployment

This deployment introduced some bug and added some corrupted documents to your database.

Primary MongoDB stats after deployment

With the setup of Oplog Backup Cronjobs you will have oplog backup from 16:00 -> 17:00 , but if you replay this entire file , it will add corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying oplog file till 16:14 using --oploglimit option

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--oplogLimit="1746461700:0" \\
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1 : Using CDC ( Change Data Capture ) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations, then stream those changes into Apache Kafka using a sink connector.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB, Kafka, helping teams set up end-to-end pipelines with minimal effort.

Method 2 : MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provides a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups, hourly oplog captures, we created a resilient and low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.

Overview

Disaster recovery isn’t just a checkbox, it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to Achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery ?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it comes with some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets The MDBC operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic due to embedded replica set information stored in local databases. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores, leading to inefficiencies in meeting low RPO targets.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, due to the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

Daily full backups using the native mongodump utility provided by MongoDB.

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

oplog-backup script was written to

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket with timestamp embedded in the filename

This script can scheduled via a kubernetes cronjob that runs on a daily basis.

Incremental Oplog Backup

What are Oplogs?

Oplogs (operation logs) are special capped collections that log all write operations in a replica set. They allow a secondary replica to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742873400, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742877000, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \\
          --db=local \\
          --collection=oplog.rs \\
          --query=${query_string} \\
          -o -

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario

Expected Oplog End Time

Actual Oplog End Time Used by Script

Expected Oplog Start Time

Actual Oplog Start Time Used by Script

Fixed Timestamps (Start End)

Remarks

CronJob scheduled at 10:00 AM UTC (expected to cover 9:00–10:00 AM)

10:00 AM UTC

10:05 AM UTC

9:00 AM UTC

9:05 AM UTC

09:00 10:00

5-minute gap from 9:00 - 9:05

Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00–11:00 AM)

11:00 AM UTC

11:09 AM UTC

10:00 AM UTC

10:09 AM UTC

10:00 11:00

4-minute gap from 10:05 - 10:09

To account for potential delays in kubernetes cronjob execution, the end timestamp is always calculated based on the start of the current hour. This ensures that even if the cronjob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes, it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging due to two reasons

  • Collections had TTL (time-to-live) :

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion :

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving target problem , by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of data, we created db_stats.js script that,

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to ensure synchronization and avoid delay-related inconsistencies and metadata is used post-restore for verification.

Oplog Restoration Script

oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with -oplogReplay

  • Optionally uses -oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything,but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \\
                 --oplogReplay \\
                 --oplogLimit="1742878380:0" \\
                 --oplogFile=${oplog_file}

    This restores changes only up to 10:34 UTCkeeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order based on timestamp extracted from their filename structure.

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be restored using monogdump and mongorestore . While this entire process is automated through scripts available in a GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \\
  --uri="$SECONDARY_URI" \\
  --nsInclude="*" \\
  --nsExclude="admin.system.*" \\
  --nsExclude="admin.sessions.*" \\
  --nsExclude="local.*" \\
  --nsExclude="config.system.sessions" \\
  --nsExclude="config.transactions" \\
  --nsExclude="config.image_collection" \\
  --nsExclude="config.system.indexBuilds" \\
  --nsExclude="config.system.preimages" \\
  --nsExclude="*.system.buckets" \\
  --nsExclude="*.system.profile" \\
  --nsExclude="*.system.views" \\
  --nsExclude="*.system.js" \\
  --gzip \\
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now Let’s Assume that you have deployment to be rolled out on 16:15 05/05/2025 UTC. Primary MongoDB stats before deployment

This deployment introduced some bug and added some corrupted documents to your database.

Primary MongoDB stats after deployment

With the setup of Oplog Backup Cronjobs you will have oplog backup from 16:00 -> 17:00 , but if you replay this entire file , it will add corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying oplog file till 16:14 using --oploglimit option

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--oplogLimit="1746461700:0" \\
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1 : Using CDC ( Change Data Capture ) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations, then stream those changes into Apache Kafka using a sink connector.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB, Kafka, helping teams set up end-to-end pipelines with minimal effort.

Method 2 : MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provides a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups, hourly oplog captures, we created a resilient and low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.

Overview

Disaster recovery isn’t just a checkbox, it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to Achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery ?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it comes with some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets The MDBC operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic due to embedded replica set information stored in local databases. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores, leading to inefficiencies in meeting low RPO targets.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, due to the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

Daily full backups using the native mongodump utility provided by MongoDB.

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

oplog-backup script was written to

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket with timestamp embedded in the filename

This script can scheduled via a kubernetes cronjob that runs on a daily basis.

Incremental Oplog Backup

What are Oplogs?

Oplogs (operation logs) are special capped collections that log all write operations in a replica set. They allow a secondary replica to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742873400, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742877000, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \\
          --db=local \\
          --collection=oplog.rs \\
          --query=${query_string} \\
          -o -

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario

Expected Oplog End Time

Actual Oplog End Time Used by Script

Expected Oplog Start Time

Actual Oplog Start Time Used by Script

Fixed Timestamps (Start End)

Remarks

CronJob scheduled at 10:00 AM UTC (expected to cover 9:00–10:00 AM)

10:00 AM UTC

10:05 AM UTC

9:00 AM UTC

9:05 AM UTC

09:00 10:00

5-minute gap from 9:00 - 9:05

Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00–11:00 AM)

11:00 AM UTC

11:09 AM UTC

10:00 AM UTC

10:09 AM UTC

10:00 11:00

4-minute gap from 10:05 - 10:09

To account for potential delays in kubernetes cronjob execution, the end timestamp is always calculated based on the start of the current hour. This ensures that even if the cronjob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes, it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging due to two reasons

  • Collections had TTL (time-to-live) :

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion :

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving target problem , by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of data, we created db_stats.js script that,

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to ensure synchronization and avoid delay-related inconsistencies and metadata is used post-restore for verification.

Oplog Restoration Script

oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with -oplogReplay

  • Optionally uses -oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything,but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \\
                 --oplogReplay \\
                 --oplogLimit="1742878380:0" \\
                 --oplogFile=${oplog_file}

    This restores changes only up to 10:34 UTCkeeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order based on timestamp extracted from their filename structure.

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be restored using monogdump and mongorestore . While this entire process is automated through scripts available in a GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \\
  --uri="$SECONDARY_URI" \\
  --nsInclude="*" \\
  --nsExclude="admin.system.*" \\
  --nsExclude="admin.sessions.*" \\
  --nsExclude="local.*" \\
  --nsExclude="config.system.sessions" \\
  --nsExclude="config.transactions" \\
  --nsExclude="config.image_collection" \\
  --nsExclude="config.system.indexBuilds" \\
  --nsExclude="config.system.preimages" \\
  --nsExclude="*.system.buckets" \\
  --nsExclude="*.system.profile" \\
  --nsExclude="*.system.views" \\
  --nsExclude="*.system.js" \\
  --gzip \\
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now Let’s Assume that you have deployment to be rolled out on 16:15 05/05/2025 UTC. Primary MongoDB stats before deployment

This deployment introduced some bug and added some corrupted documents to your database.

Primary MongoDB stats after deployment

With the setup of Oplog Backup Cronjobs you will have oplog backup from 16:00 -> 17:00 , but if you replay this entire file , it will add corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying oplog file till 16:14 using --oploglimit option

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--oplogLimit="1746461700:0" \\
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1 : Using CDC ( Change Data Capture ) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations, then stream those changes into Apache Kafka using a sink connector.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB, Kafka, helping teams set up end-to-end pipelines with minimal effort.

Method 2 : MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provides a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups with hourly oplog captures, we created a resilient, low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it.

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.


