💽
MongoDB Oplog to SQL Parser exercise
Problem statement
Write a program to parse MongoDB operations log (oplog) and generate equivalent SQL statements.
We have a scenario where an organization used MongoDB initially but now needs to move to an RDBMS database. This data transition can be made easy if we can find a way to convert the JSON documents in MongoDB collections to equivalent rows in relational DB tables. That's the purpose of this program.
The MongoDB server generates the Oplog, an ordered collection of all the write operations (insert, update, delete) to the MongoDB. Your job is to parse these oplogs and generate equivalent SQL statements.
There’s already an open-source tool, stampede, that converts MongoDB oplogs to SQL; we are simply attempting to develop an implementation in Go.
A sample MongoDB oplog looks like this:
The main fields in the oplog are:
op
: This indicates the type of operation. It can bei
(insert),u
(update),d
(delete),c
(command),n
(no operation). For this implementation, we’ll only care about insert, update and delete operations.
ns
: This indicates the namespace. Namespace consists of database and collection name separated by a.
In above case, database name istest
and collection name isstudent
.
o
: This indicates the new data for insert or update operation. In above case, a student document is inserted in the collection.
The oplog contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those.
We have divided the problem statement into multiple stories. You’re supposed to implement the stories.
Story 1 (parsing insert oplog)
Parse the insert oplog JSON and convert that into equivalent SQL insert statement.
Here’s the mapping of MongoDB concepts to their equivalent relational database concepts
Database in MongoDB maps to schema in relational database
Collection in MongoDB maps to table in relational database
A single JSON document in MongoDB maps typically to a row in relational database.
Sample Input:
Expected Output:
You can assume that the above oplog is generated by following MongoDB command that inserts some data in student table.
Assumptions:
The
i
inop
key stands for insert operation.
Write a function that accepts oplog JSON as input and returns the SQL statement as output. Write test case for verifying your program works as expected.
The actual MongoDB oplog also contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those. Hence, the above oplog contains only the fields which are relevant for our SQL conversion use case.
The
ns
key indicates a combination of db name and collection name. In above example, db name istest
and collection name isstudent
. Db name and collection name will be separated by a.
For simplicity, for now, assume that there is no nested JSON (arrays or objects) in MongoDB collection.
Expectations:
Your code should be generic enough to extract the db and collection name from
ns
field. It should also extract following types from JSON - string, boolean and number. In above example,name
is a string variable,roll_no
is a number variable andis_graduated
is boolean. For now, you can treatdate_of_birth
as string (and not a date field type)
Since the JSON fields are unordered, the order of columns in your
insert into
statement may differ from that of the expected output. This is okay, as long as you are able to run the SQL statement in PostgreSQL without any errors. For that, you’ll need to create thetest
schema andstudent
table first. You can do that manually as follows:
Later, in Story 4, we’ll ask you to modify the code to also generate the create table
statement. For now, your program should only generate insert into
statement.
Story 2 (parsing update oplog)
Parse the update oplog JSON and convert that into equivalent SQL update statement.
Sample Input (for setting new value to a field):
Here, we are setting is_graduated
to true
for the same student collection.
Expected Output:
Sample Input (for un-setting value to a field):
Expected Output:
Assumptions:
The
u
inop
key stands for update operation.
The
o
field contains the update operation details. In this case, it includes the following subfields:The
$v
field specifies the protocol version used for the update operation. In this case, the value is 2.The
diff
field represents the changes being made to the document. In this case, it contains the following subfield:u
: modifies a field of a document and sets the value of a field in the document. (first example above)d
: removes a field from a document (second example above)
The key
o2
represents the row identifier or the WHERE clause field in SQL
For simplicity, assume that the
_id
would always be the updation criteria
Assume that no new column will be added via this update operation for now.
Assume that the table and the data exist from before (which can be done manually).
Expectations:
You should be able to run the SQL statement generated by your program into PostgreSQL without any errors. For that, you’ll need to create the
test
schema andstudent
table and also insert a row in student table (from first story).
Story 3 (parsing delete oplog)
Parse the delete oplog JSON and convert that into equivalent SQL delete statement.
Sample Input:
Expected Output:
Assumptions:
The
d
inop
key stands for delete.
The
o
key contains the_id
of the field to be deleted
For simplicity, assume that the
_id
would always be the deletion criteria
Assume that the table exists from before (which can be done manually).
Expectations:
You should be able to run the SQL statement generated by your program into PostgreSQL without any errors. For that, you’ll need to create the
test
schema andstudent
table and also insert a row in student table (from first story). When you run the delete statement from the output, the student row should be deleted successfully from PostgreSQL table.
Story 4 (create table with one oplog entry)
This story is the modification of Story 1. In this story, you’ll parse the same insert oplog JSON from Story 1 and convert it to equivalent SQL statements. However, you’ll also generate the create schema
and create table
statements along with insert into
statement.
Features to implement in this story:
Generate
CREATE SCHEMA
SQL statementGenerate
CREATE TABLE
SQL statementGenerate
INSERT INTO
SQL statement
Sample Input:
Expected output:
Assumptions:
In the above output, the
create table
statement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.
Feel free to modify the code and tests written as part of Story 1. The input to Story 1 and 4 is same, but in the output, we now expect
create schema
andcreate table
statements as well.
Expectations:
You should be able to run all the SQL statement generated by your program into PostgreSQL without any errors.
Story 5 (create table with multiple oplog entries)
Until now, we were handling only one oplog at a time. However, now we need to handle multiple insert
oplogs for the same collection. For simplicity, let’s assume that there are no field changes across these two oplogs. The only thing that changes is the value of the fields.
As per previous story, create schema
and create table
statements are generated for every insert
oplog. Now, we need to fix the issue where the create schema
and create table
statements are generated only once for each collection.
Sample Input:
In the input, there are two insert
oplogs for the same database and collection. The only difference in the two oplogs is the values of JSON fields. The type and the number of fields are same for both oplogs.
Expected output:
Assumptions:
In the above output, the
create table
statement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.
Expectations:
The input to the program has changed from a single oplog JSON to an array of oplogs. Make sure your code is able to handle both.
You will have to modify the code and tests written as part of Story 4.
Note that both oplogs belong to the same database and collection, hence the
create schema
statement in SQL is generated only once.
Story 6 (alter table with multiple oplog entries)
The input for this story is very similar to Story 5 above. Except, in the second oplog, there’s a new field -phone
. Your job is to generate an alter table
statement and then generate an insert into
statement for the second oplog.
Thus, you’ll have to generate SQL statements in the following order:Generate CREATE SCHEMA
SQL statementGenerate CREATE TABLE
SQL statementGenerate INSERT INTO
SQL statementGenerate ALTER TABLE
SQL statementGenerate INSERT INTO
SQL statement
Sample Input:
Expected Output:
Expectation:
You will have to modify the code and tests written as part of Story 5.
Assume that there will never be an alter table case where the existing column’s data type is changed across two oplogs. For example, you will never have a case, where first oplog has roll_no as int and second oplog modifies roll_no field as string. You don’t need to handle this case.
Your code should also handle case of more than two oplogs for the same collection.
Your program should assign null values to columns for which the JSON fields are missing.
Currently, in the sample input, we are only considering an addition of one field (phone). However, your program should handle addition of any number of new fields and generate those many number of
alter table
statements.
Story 7 (handle nested Mongo documents)
So far, we have handled simple JSON documents in MongoDB. In this story, you will have to handle parsing of oplogs of nested JSON objects.
Sample Input:
In the input, we have phone
, which is a single nested JSON object and we have address
, which is an array of nested JSON objects. Let’s see how this is represented in the expected SQL below.
Expected Output:
Since we are dealing with relational data, we will have to create multiple tables with foreign key references. For simplicity, we will not create actual foreign key constraints in any of the tables.
We will create the separate tables for nested objects (i.e. address and phone), then insert records in those tables. Note the use of student__id
column in the address and phone tables. This is the soft-foreign key to the student table’s id column. Once all the records are inserted in address and phone tables, we’ll insert records in the main parent student table.
Since we are not creating foreign key constraints and reference in the database, the order in which the tables are created and the order in which the records are inserted in those tables does not matter.
Assumptions:
For simplicity, assume nesting of JSON documents only at the top level.
Assume that
_id
is a varchar always.
We will not deal with the auto generation of
_id
. It will be a varchar, and will be a randomly generated UUID for foreign tables._id
for main table comes from the_id
field of mongo oplog itself.
_id
of foreign table needs to be created by the program automatically. This value then must be used for any reference to other related tables.
All the foreign tables/associated table will have the reference of primary key from main table. And the primary table will not have any referencing key from associated tables.
There’s no need to create actual foreign key references in SQL. The SQL JOINS will happen just by soft-references (i.e. values in referenced columns), not via actual foreign key constraints in database.
Expectations:
You’ll have to create associated tables with primary key as
_id
which will be a randomly generated UUID
You’ll have to create soft-foreign keys accordingly, so that all the details of employees can be fetched using JOINS.
The SQL should be generated such that it can be run on any relational DB without any modifications. For example, note the order of
create table
above. We generate the phone table first, and then the employees table. The phone table has reference to the employee table usingemployee__id
column.
For simplicity, for a single nested object (e.g. phone), always generate a 1:M relationship. Note that the phone table has a foreign key reference to employee table using employee__id.
Story 8 (reading oplogs from a file)
In the above stories, we have parsed different types of MongoDB oplogs and generated equivalent SQL statements. Now, let’s modify our program to parse multiple oplogs by reading them from a file. Later, we’ll update the program to read directly from MongoDB’s oplog collection. For this story, we have to read oplogs, one at a time, from a JSON file containing MongoDB oplogs and convert them into equivalent SQL statements. You can use the example-input.json file as the input file.
The program should write the generated SQL statements in an output file, say output.sql
. You can compare your output file with example-output.sql to verify your program.
Assumptions:
Your program should accept the filename containing the oplogs as an argument.
The program should also accept the output filename. The program should create this file and write all SQL output to that file.
Expectations:
Make sure while reading the file we process one oplog at a time. This will be helpful to you in the next story where you would be getting data infinitely in a stream.
Write the generated SQL statements in a file. You can compare your output file with our example-output.sql.
Story 9 (reading oplogs from MongoDB)
In story 8, we are reading MongoDB oplogs from a JSON file. Now, we expect you to read the oplogs directly from MongoDB and generate equivalent SQL statements. The program should connect to MongoDB, read and parse the oplogs one at a time, convert them into SQL statements and finally, execute these SQL statements directyl on a relational database (PostgreSQL, in our case).
Assumptions:
The program should read oplogs from MongoDB, convert those into equivalent SQL statements (DDL and DML statements), and execute these statements on PostgreSQL.
Expectations:
The program should process one oplog at a time. It should work in a streaming fashion, where it reads a single oplog entry from MongoDB, converts that oplog entry into multiple SQL statements and executes these statements on PostgreSQL.
The program should terminate only after receiving a SIGTERM signal (Ctrl + c). If there are no oplogs to read, the program should be waiting for more oplogs (it should not terminate). In other words, this program should run as a service.
Write the program such that it can handle input from either an oplog file or directly from MongoDB. Also, the program should be able to send output to an SQL file or execute the SQL statements directly on PostgreSQL.
JSON File (input) → SQL File (output)
JSON File (input) → Relational database (output)
MongoDB (input) → SQL File (output)
MongoDB (input) → Relational database (output)
Story 10 (Bookmarking Support - nice to have)
In Story 9, we successfully implemented the functionality to read MongoDB oplogs directly from a live MongoDB instance and generate equivalent SQL statements. However, one critical aspect of the parser's functionality is missing: bookmarking support. Bookmarking ensures that the parser can keep track of the last processed oplog, allowing it to resume processing from the correct point after a restart. This enhancement will prevent duplicate oplogs from being processed, maintaining data consistency, and improving the overall reliability of the parser.
Assumptions:
Bookmarking support will be added to the parser to keep track of the last processed oplog.
This mechanism will enable the parser to store the position of the last processed oplog, ensuring it can pick up from the correct point after a restart.
By using bookmarking, the parser will avoid reprocessing oplogs that have already been successfully executed, thus preventing duplicate data insertion and updates.
Expectations:
To achieve the objective of adding bookmarking support, we expect the following:
The program should implement a bookmarking mechanism to store the position of the last processed oplog. This information will be crucial for resuming processing from the correct point in the MongoDB oplog after a restart.
With bookmarking support, the parser should maintain data consistency by avoiding duplicate data processing, even in the case of unexpected interruptions or system restarts.
In the event of a program restart, the parser should utilize the bookmarked information to continue processing oplogs from where it left off, ensuring a seamless data migration process.
By adding bookmarking support, the parser will be better equipped to handle real-world scenarios, ensuring data integrity and reliability during data migration from MongoDB to PostgreSQL.
Story 11 (Distributed Execution - nice to have)
It would be beneficial to implement support for running the parser in a distributed manner across multiple machines. As data volume increase, there is a need to distribute the parser's execution across multiple machines to achieve higher performance and scalability. The addition of distributed execution will enable the parser to efficiently handle large datasets and effectively utilize available resources.
Overall Instructions
Create a new Git repo in the language of your choice (Java, Go, etc).
Commit and push code + tests for each of the stories.
Implement graceful shutdown in the program. Handle interrupt signals (e.g. Ctrl+C) to initiate a clean shutdown of the parser and close all external connections to MongoDB and PostgreSQL.
Feel free to make suitable assumptions if something is unclear. Document your assumptions and reasons behind them in the readme.