Optimizing MongoDB backup strategy: Lessons from achieving a 1-Hour RPO

May 14, 2025 | 8 min read

Overview

Disaster recovery isn’t just a checkbox; it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it has some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support: The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets: The MongoDB Community operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic because replica set configuration is embedded in each node's local database. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots: Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores and to meet low RPO targets efficiently.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, for the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

We take daily full backups using the native mongodump utility provided by MongoDB:

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

The oplog-backup script was written to:

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket, with a timestamp embedded in the filename

This script can be scheduled via a Kubernetes CronJob that runs daily.
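
A minimal sketch of what such a daily backup job can look like is shown below. It assumes a hypothetical GCS bucket name and a placeholder connection URI, shells out to mongodump, and uploads the archive with the google-cloud-storage client; the actual script in the repository may differ in details.

import subprocess
from datetime import datetime, timezone

from google.cloud import storage  # requires the google-cloud-storage package

MONGO_URI = "<mongodb-connection-uri>"   # placeholder, as in the commands above
BUCKET_NAME = "my-mongodb-backups"       # hypothetical GCS bucket name

def run_daily_full_backup():
    # Embed a UTC timestamp in the archive name so every daily run is unique
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M%S")
    archive = f"full-backup-{ts}.gz"

    # Full gzipped archive dump of the deployment
    subprocess.run(
        ["mongodump", f"--uri={MONGO_URI}", "--gzip", f"--archive={archive}"],
        check=True,
    )

    # Upload the archive to GCS under a per-day prefix
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(f"full/{ts[:10]}/{archive}").upload_from_filename(archive)

if __name__ == "__main__":
    run_daily_full_backup()

A Kubernetes CronJob then simply runs this script in a container on a daily schedule.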

Incremental Oplog Backup

What are Oplogs?

The oplog (operations log) is a special capped collection that logs all write operations in a replica set. It allows secondary replicas to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742893200, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742896800, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \
          --db=local \
          --collection=oplog.rs \
          --query=${query_string} \
          -o -
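
Since the t values are plain Unix epoch seconds, the window can be derived directly from UTC datetimes. The helper below is purely illustrative (it is not one of the published scripts) and builds the query document for an arbitrary window:

import json
from datetime import datetime, timezone

def oplog_window_query(start: datetime, end: datetime) -> str:
    # Convert timezone-aware datetimes to Unix epoch seconds for the BSON timestamp query
    return json.dumps({
        "ts": {
            "$gte": {"$timestamp": {"t": int(start.timestamp()), "i": 1}},
            "$lte": {"$timestamp": {"t": int(end.timestamp()), "i": 0}},
        }
    })

# 9:00 - 10:00 UTC on March 25, 2025
query_string = oplog_window_query(
    datetime(2025, 3, 25, 9, 0, tzinfo=timezone.utc),
    datetime(2025, 3, 25, 10, 0, tzinfo=timezone.utc),
)
print(query_string)  # pass this to mongodump --query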

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration:

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 to 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario 1: CronJob scheduled at 10:00 AM UTC (expected to cover 9:00 – 10:00 AM), actual start 10:05 AM

  • Expected oplog window: 9:00 – 10:00 AM UTC

  • Window actually used by the script: 9:05 – 10:05 AM UTC

  • Window with fixed timestamps: 09:00 – 10:00

  • Remarks: 5-minute gap from 9:00 to 9:05

Scenario 2: Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00 – 11:00 AM), actual start 11:09 AM

  • Expected oplog window: 10:00 – 11:00 AM UTC

  • Window actually used by the script: 10:09 – 11:09 AM UTC

  • Window with fixed timestamps: 10:00 – 11:00

  • Remarks: 4-minute gap from 10:05 to 10:09 (following on from the previous delayed window)

To account for potential delays in Kubernetes CronJob execution, the end timestamp is therefore always calculated from the start of the current hour. This ensures that even if the CronJob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.
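
A minimal sketch of that hour-alignment logic, together with the <ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM> prefix construction described earlier, is shown below; the actual script may differ in details:

from datetime import datetime, timedelta, timezone

def hourly_backup_window(now=None):
    # Anchor the window to the start of the current hour rather than the actual
    # run time, so a delayed CronJob still covers a contiguous one-hour window
    now = now or datetime.now(timezone.utc)
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return start, end

def gcs_prefix(env, start):
    # <ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM> naming convention described above
    return start.strftime(f"{env}/%Y/%m/%d/%H/%M")

# A job scheduled for 10:00 but delayed to 10:05 still covers 09:00 - 10:00
start, end = hourly_backup_window(datetime(2025, 3, 25, 10, 5, tzinfo=timezone.utc))
print(start, end)                 # 2025-03-25 09:00:00+00:00  2025-03-25 10:00:00+00:00
print(gcs_prefix("prod", start))  # prod/2025/03/25/09/00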

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes; it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging for two reasons:

  • Collections had TTL (time-to-live) indexes:

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion:

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving-target problem: by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of the data, we created a db_stats.js script that:

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to keep the metadata in sync and avoid delay-related inconsistencies; the metadata is then used post-restore for verification.
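
The script itself is written for mongosh; the sketch below illustrates the same idea in Python with pymongo. The database name and the exact fields collected are assumptions for illustration, and in the real workflow the resulting JSON is uploaded to GCS next to the backup.

import json
from datetime import datetime, timezone

from pymongo import MongoClient  # requires the pymongo package

def collect_db_stats(uri, databases):
    # Collect per-collection document counts, sizes, and _id ranges
    client = MongoClient(uri)
    stats = {}
    for db_name in databases:
        db = client[db_name]
        stats[db_name] = {}
        for coll_name in db.list_collection_names():
            coll = db[coll_name]
            first = coll.find_one(sort=[("_id", 1)])   # oldest _id
            last = coll.find_one(sort=[("_id", -1)])   # newest _id
            stats[db_name][coll_name] = {
                "count": coll.estimated_document_count(),
                "size_bytes": db.command("collStats", coll_name).get("size"),
                "min_id": str(first["_id"]) if first else None,
                "max_id": str(last["_id"]) if last else None,
            }
    return stats

if __name__ == "__main__":
    snapshot = {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "stats": collect_db_stats("<mongodb-connection-uri>", ["app_db"]),  # hypothetical database
    }
    print(json.dumps(snapshot, indent=2))  # in practice, upload this JSON to GCS next to the backup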

Oplog Restoration Script

The oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with --oplogReplay

  • Optionally uses --oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything, but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \
                 --oplogReplay \
                 --oplogLimit="1742898900:0" \
                 --oplogFile=${oplog_file}

    This applies changes only up to 10:34:59 UTC, keeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order, based on the timestamp encoded in their file path (see the sketch below).
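
A simplified sketch of that replay loop, covering both the chronological ordering and the optional --oplogLimit cutoff, is shown below. It assumes the oplog dumps have already been downloaded locally into directories mirroring the GCS prefix; the real oplog-restore script may differ in details.

import subprocess
from pathlib import Path

def replay_oplogs(mongo_uri, oplog_root, oplog_limit=None):
    # Directories follow the <YYYY>/<MM>/<DD>/<HH>/<MM> layout, so a lexicographic
    # sort of their paths is also a chronological sort
    windows = sorted(p for p in Path(oplog_root).glob("*/*/*/*/*") if p.is_dir())
    for window in windows:
        cmd = [
            "mongorestore",
            f"--uri={mongo_uri}",
            "--oplogReplay",
            f"--dir={window}",
        ]
        if oplog_limit:
            # Only matters for the final window; earlier windows end before the limit
            cmd.append(f"--oplogLimit={oplog_limit}")
        subprocess.run(cmd, check=True)

# Replay everything up to (but not including) the bad deployment,
# e.g. "1742898900:0" stops before 10:35:00 UTC on March 25, 2025
replay_oplogs("<mongodb-connection-uri>", "./oplogs/prod", oplog_limit="1742898900:0")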

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be backed up and restored to a secondary instance using mongodump and mongorestore. While this entire process is automated through the scripts in the GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \
  --uri="$SECONDARY_URI" \
  --nsInclude="*" \
  --nsExclude="admin.system.*" \
  --nsExclude="admin.sessions.*" \
  --nsExclude="local.*" \
  --nsExclude="config.system.sessions" \
  --nsExclude="config.transactions" \
  --nsExclude="config.image_collection" \
  --nsExclude="config.system.indexBuilds" \
  --nsExclude="config.system.preimages" \
  --nsExclude="*.system.buckets" \
  --nsExclude="*.system.profile" \
  --nsExclude="*.system.views" \
  --nsExclude="*.system.js" \
  --gzip \
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now let’s assume that you have a deployment to be rolled out at 16:15 UTC on 05/05/2025.

Primary MongoDB stats before deployment

This deployment introduced a bug and added corrupted documents to your database.

Primary MongoDB stats after deployment

With the oplog backup CronJobs in place, you will have an oplog backup covering 16:00 – 17:00, but if you replay this entire file, it will add the corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \
--uri="$SECONDARY_URI" \
--oplogReplay \
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying the oplog file up to 16:14 using the --oplogLimit option

mongorestore \
--uri="$SECONDARY_URI" \
--oplogReplay \
--oplogLimit="1746461700:0" \
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1: Using CDC (Change Data Capture) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations and stream those changes into Apache Kafka.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB and Kafka, helping teams set up end-to-end pipelines with minimal effort.
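
In practice this pipeline would be assembled from the official MongoDB Kafka source connector or an Airbyte connection rather than custom code, but a rough, hand-rolled sketch of the same idea, using pymongo change streams and the kafka-python producer (broker address and topic name are hypothetical), looks something like this:

from bson import json_util           # ships with pymongo; handles ObjectId/Timestamp
from kafka import KafkaProducer      # requires the kafka-python package
from pymongo import MongoClient

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],                      # hypothetical brokers
    value_serializer=lambda e: json_util.dumps(e).encode("utf-8"),
)

client = MongoClient("<mongodb-connection-uri>")

# Stream every insert/update/delete in the deployment into a Kafka topic,
# where it can be retained, replayed, or sunk to GCS by downstream consumers
with client.watch(full_document="updateLookup") as stream:
    for change in stream:
        producer.send("mongodb.cdc.events", change)        # hypothetical topic name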

Method 2: MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provide a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.
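
A minimal sketch of such a resilient consumer is shown below. Persisting the resume token to a local file and the placeholder handle() target are illustrative assumptions; a production consumer would keep the token in durable storage and forward events to Kafka or GCS.

from pathlib import Path

from bson import json_util
from pymongo import MongoClient
from pymongo.errors import PyMongoError

TOKEN_FILE = Path("resume_token.json")   # illustrative; use durable storage in production

def load_resume_token():
    return json_util.loads(TOKEN_FILE.read_text()) if TOKEN_FILE.exists() else None

def handle(change):
    # Placeholder: forward the event to Kafka, GCS, or another backup target
    print(change["operationType"], change.get("ns"))

def watch_forever(uri):
    client = MongoClient(uri)
    while True:
        try:
            # resume_after lets the stream continue exactly where it left off
            with client.watch(resume_after=load_resume_token()) as stream:
                for change in stream:
                    handle(change)
                    # Persist the token after each processed event so a crash or
                    # disconnect can resume without losing its place in the stream
                    TOKEN_FILE.write_text(json_util.dumps(stream.resume_token))
        except PyMongoError:
            continue  # reconnect and resume from the last saved token

if __name__ == "__main__":
    watch_forever("<mongodb-connection-uri>")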

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups with hourly oplog captures, we created a resilient and low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it.

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.

Overview

Disaster recovery isn’t just a checkbox, it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to Achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery ?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it comes with some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets The MDBC operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic due to embedded replica set information stored in local databases. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores, leading to inefficiencies in meeting low RPO targets.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, due to the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

Daily full backups using the native mongodump utility provided by MongoDB.

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

oplog-backup script was written to

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket with timestamp embedded in the filename

This script can scheduled via a kubernetes cronjob that runs on a daily basis.

Incremental Oplog Backup

What are Oplogs?

Oplogs (operation logs) are special capped collections that log all write operations in a replica set. They allow a secondary replica to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742873400, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742877000, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \\
          --db=local \\
          --collection=oplog.rs \\
          --query=${query_string} \\
          -o -

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario

Expected Oplog End Time

Actual Oplog End Time Used by Script

Expected Oplog Start Time

Actual Oplog Start Time Used by Script

Fixed Timestamps (Start End)

Remarks

CronJob scheduled at 10:00 AM UTC (expected to cover 9:00–10:00 AM)

10:00 AM UTC

10:05 AM UTC

9:00 AM UTC

9:05 AM UTC

09:00 10:00

5-minute gap from 9:00 - 9:05

Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00–11:00 AM)

11:00 AM UTC

11:09 AM UTC

10:00 AM UTC

10:09 AM UTC

10:00 11:00

4-minute gap from 10:05 - 10:09

To account for potential delays in kubernetes cronjob execution, the end timestamp is always calculated based on the start of the current hour. This ensures that even if the cronjob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes, it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging due to two reasons

  • Collections had TTL (time-to-live) :

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion :

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving target problem , by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of data, we created db_stats.js script that,

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to ensure synchronization and avoid delay-related inconsistencies and metadata is used post-restore for verification.

Oplog Restoration Script

oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with -oplogReplay

  • Optionally uses -oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything,but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \\
                 --oplogReplay \\
                 --oplogLimit="1742878380:0" \\
                 --oplogFile=${oplog_file}

    This restores changes only up to 10:34 UTCkeeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order based on timestamp extracted from their filename structure.

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be restored using monogdump and mongorestore . While this entire process is automated through scripts available in a GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \\
  --uri="$SECONDARY_URI" \\
  --nsInclude="*" \\
  --nsExclude="admin.system.*" \\
  --nsExclude="admin.sessions.*" \\
  --nsExclude="local.*" \\
  --nsExclude="config.system.sessions" \\
  --nsExclude="config.transactions" \\
  --nsExclude="config.image_collection" \\
  --nsExclude="config.system.indexBuilds" \\
  --nsExclude="config.system.preimages" \\
  --nsExclude="*.system.buckets" \\
  --nsExclude="*.system.profile" \\
  --nsExclude="*.system.views" \\
  --nsExclude="*.system.js" \\
  --gzip \\
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now Let’s Assume that you have deployment to be rolled out on 16:15 05/05/2025 UTC. Primary MongoDB stats before deployment

This deployment introduced some bug and added some corrupted documents to your database.

Primary MongoDB stats after deployment

With the setup of Oplog Backup Cronjobs you will have oplog backup from 16:00 -> 17:00 , but if you replay this entire file , it will add corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying oplog file till 16:14 using --oploglimit option

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--oplogLimit="1746461700:0" \\
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1 : Using CDC ( Change Data Capture ) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations, then stream those changes into Apache Kafka using a sink connector.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB, Kafka, helping teams set up end-to-end pipelines with minimal effort.

Method 2 : MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provides a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups, hourly oplog captures, we created a resilient and low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.

Overview

Disaster recovery isn’t just a checkbox, it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to Achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery ?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it comes with some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets The MDBC operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic due to embedded replica set information stored in local databases. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores, leading to inefficiencies in meeting low RPO targets.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, due to the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

Daily full backups using the native mongodump utility provided by MongoDB.

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

oplog-backup script was written to

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket with timestamp embedded in the filename

This script can scheduled via a kubernetes cronjob that runs on a daily basis.

Incremental Oplog Backup

What are Oplogs?

Oplogs (operation logs) are special capped collections that log all write operations in a replica set. They allow a secondary replica to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742873400, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742877000, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \\
          --db=local \\
          --collection=oplog.rs \\
          --query=${query_string} \\
          -o -

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario

Expected Oplog End Time

Actual Oplog End Time Used by Script

Expected Oplog Start Time

Actual Oplog Start Time Used by Script

Fixed Timestamps (Start End)

Remarks

CronJob scheduled at 10:00 AM UTC (expected to cover 9:00–10:00 AM)

10:00 AM UTC

10:05 AM UTC

9:00 AM UTC

9:05 AM UTC

09:00 10:00

5-minute gap from 9:00 - 9:05

Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00–11:00 AM)

11:00 AM UTC

11:09 AM UTC

10:00 AM UTC

10:09 AM UTC

10:00 11:00

4-minute gap from 10:05 - 10:09

To account for potential delays in kubernetes cronjob execution, the end timestamp is always calculated based on the start of the current hour. This ensures that even if the cronjob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes, it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging due to two reasons

  • Collections had TTL (time-to-live) :

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion :

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving target problem , by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of data, we created db_stats.js script that,

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to ensure synchronization and avoid delay-related inconsistencies and metadata is used post-restore for verification.

Oplog Restoration Script

oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with -oplogReplay

  • Optionally uses -oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything,but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \\
                 --oplogReplay \\
                 --oplogLimit="1742878380:0" \\
                 --oplogFile=${oplog_file}

    This restores changes only up to 10:34 UTCkeeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order based on timestamp extracted from their filename structure.

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be restored using monogdump and mongorestore . While this entire process is automated through scripts available in a GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \\
  --uri="$SECONDARY_URI" \\
  --nsInclude="*" \\
  --nsExclude="admin.system.*" \\
  --nsExclude="admin.sessions.*" \\
  --nsExclude="local.*" \\
  --nsExclude="config.system.sessions" \\
  --nsExclude="config.transactions" \\
  --nsExclude="config.image_collection" \\
  --nsExclude="config.system.indexBuilds" \\
  --nsExclude="config.system.preimages" \\
  --nsExclude="*.system.buckets" \\
  --nsExclude="*.system.profile" \\
  --nsExclude="*.system.views" \\
  --nsExclude="*.system.js" \\
  --gzip \\
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now Let’s Assume that you have deployment to be rolled out on 16:15 05/05/2025 UTC. Primary MongoDB stats before deployment

This deployment introduced some bug and added some corrupted documents to your database.

Primary MongoDB stats after deployment

With the setup of Oplog Backup Cronjobs you will have oplog backup from 16:00 -> 17:00 , but if you replay this entire file , it will add corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying oplog file till 16:14 using --oploglimit option

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--oplogLimit="1746461700:0" \\
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1 : Using CDC ( Change Data Capture ) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations, then stream those changes into Apache Kafka using a sink connector.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB, Kafka, helping teams set up end-to-end pipelines with minimal effort.

Method 2 : MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provides a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups, hourly oplog captures, we created a resilient and low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.

Overview

Disaster recovery isn’t just a checkbox, it’s a critical component of ensuring data integrity, availability, and fault tolerance in distributed systems. When it comes to databases, especially a distributed system like MongoDB, designing a robust DR strategy can get tricky fast. This post walks through how we implemented a disaster recovery solution for a MongoDB cluster running on Google Kubernetes Engine (GKE) using the MongoDB Community Operator (v0.9.0).

Our goal was to Achieve an RPO (Recovery Point Objective) of 1 hour, using a combination of full backups, incremental oplog captures, and a touch of automation magic.

What is Disaster Recovery ?

Disaster recovery is the process of restoring systems and data after unexpected failures such as outages, hardware issues, or data corruption. For databases, it's essential to ensure critical information remains available, consistent, and protected, especially in distributed environments where data loss or downtime can impact entire applications. Every business, regardless of industry or size, must be able to recover quickly from events that halt day-to-day operations. Without a disaster recovery plan, companies risk data loss, reduced productivity, unplanned expenses, and reputational damage that can result in lost customers and revenue.

Why We Needed a Custom DR Strategy

The MongoDB Community Operator is a great way to manage MongoDB clusters on Kubernetes, but it comes with some serious limitations when it comes to DR. Here's what we were up against:

  1. No Built-in Backup or Restore Support The MongoDB Community Operator does not natively support automated backup and restore mechanisms.

  2. Backup Challenges with StatefulSets The MDBC operator creates StatefulSets for the primary and replica nodes. While it's possible to back up the underlying persistent volumes (PVs), restoring them can be problematic due to embedded replica set information stored in local databases. This leads to conflicts when attempting to spin up a new cluster.

  3. Inefficiencies in Volume Snapshots Snapshots taken from persistent volumes are not incremental. This makes it difficult to perform point-in-time restores, leading to inefficiencies in meeting low RPO targets.

The Game Plan: Full Dumps + Oplogs

To overcome the limitations, we designed a two-tiered backup system:

  • Daily full database backups using mongodump

  • Hourly incremental backups from the MongoDB oplog

Together, these give us the flexibility to restore to a specific point in time without capturing every single write operation individually, due to the following reasons:

  • The full daily backup provides a consistent baseline snapshot of the entire database, simplifying recovery from major failures.

  • The hourly oplog backups allow us to replay only the changes made after the full backup, enabling granular point-in-time recovery to any hour of the day.

All scripts referenced in this document are available in the accompanying GitHub repository, linked at the end.

Full Database Backup using Mongodump

Step 1: Daily Full Dump Setup

Daily full backups using the native mongodump utility provided by MongoDB.

mongodump --uri=<mongodb-connection-uri> --gzip --archive

This approach ensures a complete snapshot of the MongoDB data at a particular point in time.

Step 2: Automation using Kubernetes CronJob

oplog-backup script was written to

  • Execute mongodump

  • Upload the archive to a Google Cloud Storage (GCS) bucket with timestamp embedded in the filename

This script can scheduled via a kubernetes cronjob that runs on a daily basis.

Incremental Oplog Backup

What are Oplogs?

Oplogs (operation logs) are special capped collections that log all write operations in a replica set. They allow a secondary replica to replicate operations from the primary node, and can also be used for point-in-time recovery.

How Oplogs are Queried

To perform incremental backups, we query the oplog.rs collection for operations that occurred within a specific time window using BSON timestamp queries.

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": start_time, "i": 1}},
        "$lte": {"$timestamp": {"t": end_time, "i": 0}},
    }
}
  • t: Represents seconds since the Unix epoch.

  • i: An incrementing ordinal that serializes operations within a second.

Execution Example

To take an oplog dump for March 25, 2025 from 9:00 to 10:00 UTC:

query_string = {
    "ts": {
        "$gte": {"$timestamp": {"t": 1742873400, "i": 1}},
        "$lte": {"$timestamp": {"t": 1742877000, "i": 0}},
    }
}

mongodump --uri=<mongodb-connection-uri> \\
          --db=local \\
          --collection=oplog.rs \\
          --query=${query_string} \\
          -o -

File Organization Strategy

Oplog backups are stored in GCS buckets using a structured naming convention to simplify querying and restoration

<ENV>/<YYYY>/<MM>/<DD>/<HH>/<MM>

Sample Screenshot from Bucket

Timing is Everything

To ensure consistent oplog backup windows, we must handle potential delays in Kubernetes CronJob execution. Currently, the script calculates the oplog end time using the current timestamp and the start time by subtracting 60 minutes from it. However, if the CronJob is delayed, this dynamic end time can shift forward, creating gaps between consecutive oplog backups.

If a CronJob scheduled at 10:00 AM UTC is delayed and actually starts at 10:05 AM UTC, the script will:

  • Use 10:05 AM as the end time instead of 10:00 AM

  • Subtract 60 minutes to get 9:05 AM as the start time

  • This shifts the oplog window from 9:00 – 10:00 9:05 – 10:05, introducing a 5-minute gap at the beginning

  • These gaps accumulate over time.

Example Scenarios

Scenario

Expected Oplog End Time

Actual Oplog End Time Used by Script

Expected Oplog Start Time

Actual Oplog Start Time Used by Script

Fixed Timestamps (Start End)

Remarks

CronJob scheduled at 10:00 AM UTC (expected to cover 9:00–10:00 AM)

10:00 AM UTC

10:05 AM UTC

9:00 AM UTC

9:05 AM UTC

09:00 10:00

5-minute gap from 9:00 - 9:05

Next CronJob scheduled at 11:00 AM UTC (expected to cover 10:00–11:00 AM)

11:00 AM UTC

11:09 AM UTC

10:00 AM UTC

10:09 AM UTC

10:00 11:00

4-minute gap from 10:05 - 10:09

To account for potential delays in kubernetes cronjob execution, the end timestamp is always calculated based on the start of the current hour. This ensures that even if the cronjob is delayed, the backup window stays consistent and contiguous (e.g., always 9:00 – 10:00, 10:00 – 11:00), preventing oplog data loss.

Data Validation and Restore Workflow

Restoring data isn’t just about rehydrating bytes, it’s about verifying and validating what’s restored.

Validation Challenges

Restoring the data accurately was challenging due to two reasons

  • Collections had TTL (time-to-live) :

    Certain collections had TTL indexes configured, which automatically deleted documents after a specified duration. This meant that even after a successful restore, some documents could expire during validation, leading to inconsistent results and making it hard to verify whether the restore was accurate or if data was naturally purged.

  • Real-time data ingestion :

    The database continued ingesting new data during and after the backup window. This made it difficult to compare restored data against the live system, as new writes could mask or overlap with the expected backup state. It also introduced a moving target problem , by the time validation occurred, the dataset had already evolved.

Metadata Snapshot Script

To validate the consistency of data, we created db_stats.js script that,

  • Iterates through user-defined databases

  • Collects document counts, sizes, and _id ranges for each collection

  • Stores this metadata in GCS alongside the backup

This script is executed hourly during the oplog backup process to ensure synchronization and avoid delay-related inconsistencies and metadata is used post-restore for verification.

Oplog Restoration Script

oplog-restore script was developed to apply the oplogs in sequence. It:

  • Downloads the relevant oplog files from GCS

  • Applies them using the mongorestore utility with -oplogReplay

  • Optionally uses -oplogLimit to stop exactly at the desired timestamp, ensuring point-in-time accuracy

Key Challenges

  1. Restoration Limits

    • Let’s say your daily full backup ran at 06:00, and a bad deployment rolled out at 10:35. Your oplog file for the 10:00 – 11:00 window contains everything,but you only want to recover up to 10:34. Replay the whole file, and you’ll bring the bug right back with you.

    • MongoDB’s --oplogLimit flag gives you surgical precision. It tells mongorestore to stop applying changes at a specific timestamp.

    Example

    mongorestore --uri=${MONGO_URI} \\
                 --oplogReplay \\
                 --oplogLimit="1742878380:0" \\
                 --oplogFile=${oplog_file}

    This restores changes only up to 10:34 UTCkeeping your data intact and your bug safely in the past.

  2. Ordering of Restoration

    • Oplog files are sorted and restored in chronological order based on timestamp extracted from their filename structure.

Demo

Let us consider a scenario where you have a primary MongoDB database that needs to be restored using monogdump and mongorestore . While this entire process is automated through scripts available in a GitHub repository, we will perform a manual execution for demonstration and better understanding.

Primary MongoDB stats before taking Full Backup

Full Backup Command

mongodump --uri="$PRIMARY_URI" --gzip --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats before restoring Full Backup

Full Backup Restore Command

mongorestore \\
  --uri="$SECONDARY_URI" \\
  --nsInclude="*" \\
  --nsExclude="admin.system.*" \\
  --nsExclude="admin.sessions.*" \\
  --nsExclude="local.*" \\
  --nsExclude="config.system.sessions" \\
  --nsExclude="config.transactions" \\
  --nsExclude="config.image_collection" \\
  --nsExclude="config.system.indexBuilds" \\
  --nsExclude="config.system.preimages" \\
  --nsExclude="*.system.buckets" \\
  --nsExclude="*.system.profile" \\
  --nsExclude="*.system.views" \\
  --nsExclude="*.system.js" \\
  --gzip \\
  --archive="<BACKUP_FILE_NAME>"

Secondary MongoDB stats after restoring Full Backup

Now Let’s Assume that you have deployment to be rolled out on 16:15 05/05/2025 UTC. Primary MongoDB stats before deployment

This deployment introduced some bug and added some corrupted documents to your database.

Primary MongoDB stats after deployment

With the setup of Oplog Backup Cronjobs you will have oplog backup from 16:00 -> 17:00 , but if you replay this entire file , it will add corrupted documents to your database as well.

Secondary MongoDB stats after replaying entire oplog file

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--dir="<OPLOG_FILE_DIR>"

Secondary MongoDB stats after replaying oplog file till 16:14 using --oploglimit option

mongorestore \\
--uri="$SECONDARY_URI" \\
--oplogReplay \\
--oplogLimit="1746461700:0" \\
--dir="<OPLOG_FILE_DIR>"

Techniques Considered for RPO Less Than 1 Hour

To minimize data loss and improve the ability to restore to the most recent state possible, several potential techniques were identified and proposed. These suggestions aim to balance reliability, operational simplicity, and cost-effectiveness, though they have not been fully evaluated or implemented.

Method 1 : Using CDC ( Change Data Capture ) Tools

This approach involves leveraging Change Data Capture (CDC) tools to continuously monitor and replicate changes in the MongoDB database to external systems in real time. The strategy was to use a MongoDB source connector to capture insert, update, and delete operations, then stream those changes into Apache Kafka using a sink connector.

Kafka provides a reliable and highly durable buffer that decouples the producer (MongoDB) from the consumer (downstream processors or backup targets). This buffer allows for flexible consumption patterns and can be replayed to rebuild the database state up to the latest transaction.

Furthermore, these change events from Kafka could be consumed by Kafka Connect or custom consumers to write them to durable cloud storage systems such as Google Cloud Storage (GCS). This would provide an additional layer of durability and a historical archive of changes for long-term recovery or analytics.

To make this even easier, tools like Airbyte come with ready-made connectors for MongoDB, Kafka, helping teams set up end-to-end pipelines with minimal effort.

Method 2 : MongoDB Native Change Streams

MongoDB supports native Change Streams starting from version 3.6, which provides a way to subscribe to real-time changes in the database without polling or manually reading from the oplog. Change Streams use MongoDB’s internal replication mechanism to deliver changes in a reliable and ordered manner.

Applications can consume these streams using MongoDB drivers available in popular languages such as Python, Node.js, and Go. These changes can then be redirected to Kafka, GCS, or in-memory processing pipelines.

This method offers a native and low-latency way to track changes while abstracting away the complexity of dealing directly with the oplog. However, it requires consumers to be resilient and capable of recovering from disconnections and ensuring continuity of the stream using resume tokens.

Wrapping Up

While we didn’t have access to the MongoDB Enterprise license, which includes built-in backup and recovery tools like Cloud Manager and Ops Manager, we had a clear objective: to design a disaster recovery solution that could achieve a Recovery Point Objective (RPO) of under one hour using only open-source tools.

By combining daily mongodump backups with hourly oplog captures, we created a resilient, low-cost DR strategy for MongoDB on Kubernetes.

If you’re running MongoDB in Kubernetes and need a recovery strategy that’s both practical and powerful, this just might be it.

For those who want to try it out, here is our GitHub repository that contains reference scripts and configuration examples related to the discussed setup. Use the README provided to get started.


