MongoDB Oplog to SQL Parser exercise
Problem statement
Write a program to parse MongoDB operations log (oplog) and generate equivalent SQL statements.
We have a scenario where an organization used MongoDB initially but now needs to move to an RDBMS database. This data transition can be made easy if we can find a way to convert the JSON documents in MongoDB collections to equivalent rows in relational DB tables. That's the purpose of this program.
The MongoDB server generates the Oplog, an ordered collection of all the write operations (insert, update, delete) to the MongoDB. Your job is to parse these oplogs and generate equivalent SQL statements.
There’s already an open-source tool, stampede, that converts MongoDB oplogs to SQL; we are simply attempting to develop an implementation in Go.
A sample MongoDB oplog looks like this:
The main fields in the oplog are:
op: This indicates the type of operation. It can bei(insert),u(update),d(delete),c(command),n(no operation). For this implementation, we’ll only care about insert, update and delete operations.
ns: This indicates the namespace. Namespace consists of database and collection name separated by a.In above case, database name istestand collection name isstudent.
o: This indicates the new data for insert or update operation. In above case, a student document is inserted in the collection.
The oplog contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those.
We have divided the problem statement into multiple stories. You’re supposed to implement the stories.
Story 1 (parsing insert oplog)
Parse the insert oplog JSON and convert that into equivalent SQL insert statement.
Here’s the mapping of MongoDB concepts to their equivalent relational database concepts
Database in MongoDB maps to schema in relational database
Collection in MongoDB maps to table in relational database
A single JSON document in MongoDB maps typically to a row in relational database.
Sample Input:
Expected Output:
You can assume that the above oplog is generated by following MongoDB command that inserts some data in student table.
Assumptions:
The
iinopkey stands for insert operation.
Write a function that accepts oplog JSON as input and returns the SQL statement as output. Write test case for verifying your program works as expected.
The actual MongoDB oplog also contains some other fields like version, timestamp, etc. but for our consideration, we can ignore those. Hence, the above oplog contains only the fields which are relevant for our SQL conversion use case.
The
nskey indicates a combination of db name and collection name. In above example, db name istestand collection name isstudent. Db name and collection name will be separated by a.
For simplicity, for now, assume that there is no nested JSON (arrays or objects) in MongoDB collection.
Expectations:
Your code should be generic enough to extract the db and collection name from
nsfield. It should also extract following types from JSON - string, boolean and number. In above example,nameis a string variable,roll_nois a number variable andis_graduatedis boolean. For now, you can treatdate_of_birthas string (and not a date field type)
Since the JSON fields are unordered, the order of columns in your
insert intostatement may differ from that of the expected output. This is okay, as long as you are able to run the SQL statement in PostgreSQL without any errors. For that, you’ll need to create thetestschema andstudenttable first. You can do that manually as follows:
Later, in Story 4, we’ll ask you to modify the code to also generate the create table statement. For now, your program should only generate insert into statement.
Story 2 (parsing update oplog)
Parse the update oplog JSON and convert that into equivalent SQL update statement.
Sample Input (for setting new value to a field):
Here, we are setting is_graduated to true for the same student collection.
Expected Output:
Sample Input (for un-setting value to a field):
Expected Output:
Assumptions:
The
uinopkey stands for update operation.
The
ofield contains the update operation details. In this case, it includes the following subfields:The
$vfield specifies the protocol version used for the update operation. In this case, the value is 2.The
difffield represents the changes being made to the document. In this case, it contains the following subfield:u: modifies a field of a document and sets the value of a field in the document. (first example above)d: removes a field from a document (second example above)
The key
o2represents the row identifier or the WHERE clause field in SQL
For simplicity, assume that the
_idwould always be the updation criteria
Assume that no new column will be added via this update operation for now.
Assume that the table and the data exist from before (which can be done manually).
Expectations:
You should be able to run the SQL statement generated by your program into PostgreSQL without any errors. For that, you’ll need to create the
testschema andstudenttable and also insert a row in student table (from first story).
Story 3 (parsing delete oplog)
Parse the delete oplog JSON and convert that into equivalent SQL delete statement.
Sample Input:
Expected Output:
Assumptions:
The
dinopkey stands for delete.
The
okey contains the_idof the field to be deleted
For simplicity, assume that the
_idwould always be the deletion criteria
Assume that the table exists from before (which can be done manually).
Expectations:
You should be able to run the SQL statement generated by your program into PostgreSQL without any errors. For that, you’ll need to create the
testschema andstudenttable and also insert a row in student table (from first story). When you run the delete statement from the output, the student row should be deleted successfully from PostgreSQL table.
Story 4 (create table with one oplog entry)
This story is the modification of Story 1. In this story, you’ll parse the same insert oplog JSON from Story 1 and convert it to equivalent SQL statements. However, you’ll also generate the create schema and create table statements along with insert into statement.
Features to implement in this story:
Generate
CREATE SCHEMASQL statementGenerate
CREATE TABLESQL statementGenerate
INSERT INTOSQL statement
Sample Input:
Expected output:
Assumptions:
In the above output, the
create tablestatement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.
Feel free to modify the code and tests written as part of Story 1. The input to Story 1 and 4 is same, but in the output, we now expect
create schemaandcreate tablestatements as well.
Expectations:
You should be able to run all the SQL statement generated by your program into PostgreSQL without any errors.
Story 5 (create table with multiple oplog entries)
Until now, we were handling only one oplog at a time. However, now we need to handle multiple insert oplogs for the same collection. For simplicity, let’s assume that there are no field changes across these two oplogs. The only thing that changes is the value of the fields.
As per previous story, create schema and create table statements are generated for every insert oplog. Now, we need to fix the issue where the create schema and create table statements are generated only once for each collection.
Sample Input:
In the input, there are two insert oplogs for the same database and collection. The only difference in the two oplogs is the values of JSON fields. The type and the number of fields are same for both oplogs.
Expected output:
Assumptions:
In the above output, the
create tablestatement is split into multiple lines. This is done only for readability purpose. You should generate the create table statement in a single line.
Expectations:
The input to the program has changed from a single oplog JSON to an array of oplogs. Make sure your code is able to handle both.
You will have to modify the code and tests written as part of Story 4.
Note that both oplogs belong to the same database and collection, hence the
create schemastatement in SQL is generated only once.
Story 6 (alter table with multiple oplog entries)
The input for this story is very similar to Story 5 above. Except, in the second oplog, there’s a new field -phone. Your job is to generate an alter table statement and then generate an insert into statement for the second oplog.
Thus, you’ll have to generate SQL statements in the following order:Generate CREATE SCHEMA SQL statementGenerate CREATE TABLE SQL statementGenerate INSERT INTO SQL statementGenerate ALTER TABLE SQL statementGenerate INSERT INTO SQL statement
Sample Input:
Expected Output:
Expectation:
You will have to modify the code and tests written as part of Story 5.
Assume that there will never be an alter table case where the existing column’s data type is changed across two oplogs. For example, you will never have a case, where first oplog has roll_no as int and second oplog modifies roll_no field as string. You don’t need to handle this case.
Your code should also handle case of more than two oplogs for the same collection.
Your program should assign null values to columns for which the JSON fields are missing.
Currently, in the sample input, we are only considering an addition of one field (phone). However, your program should handle addition of any number of new fields and generate those many number of
alter tablestatements.
Story 7 (handle nested Mongo documents)
So far, we have handled simple JSON documents in MongoDB. In this story, you will have to handle parsing of oplogs of nested JSON objects.
Sample Input:
In the input, we have phone, which is a single nested JSON object and we have address, which is an array of nested JSON objects. Let’s see how this is represented in the expected SQL below.
Expected Output:
Since we are dealing with relational data, we will have to create multiple tables with foreign key references. For simplicity, we will not create actual foreign key constraints in any of the tables.
We will create the separate tables for nested objects (i.e. address and phone), then insert records in those tables. Note the use of student__id column in the address and phone tables. This is the soft-foreign key to the student table’s id column. Once all the records are inserted in address and phone tables, we’ll insert records in the main parent student table.
Since we are not creating foreign key constraints and reference in the database, the order in which the tables are created and the order in which the records are inserted in those tables does not matter.
Assumptions:
For simplicity, assume nesting of JSON documents only at the top level.
Assume that
_idis a varchar always.
We will not deal with the auto generation of
_id. It will be a varchar, and will be a randomly generated UUID for foreign tables._idfor main table comes from the_idfield of mongo oplog itself.
_idof foreign table needs to be created by the program automatically. This value then must be used for any reference to other related tables.
All the foreign tables/associated table will have the reference of primary key from main table. And the primary table will not have any referencing key from associated tables.
There’s no need to create actual foreign key references in SQL. The SQL JOINS will happen just by soft-references (i.e. values in referenced columns), not via actual foreign key constraints in database.
Expectations:
You’ll have to create associated tables with primary key as
_idwhich will be a randomly generated UUID
You’ll have to create soft-foreign keys accordingly, so that all the details of employees can be fetched using JOINS.
The SQL should be generated such that it can be run on any relational DB without any modifications. For example, note the order of
create tableabove. We generate the phone table first, and then the employees table. The phone table has reference to the employee table usingemployee__idcolumn.
For simplicity, for a single nested object (e.g. phone), always generate a 1:M relationship. Note that the phone table has a foreign key reference to employee table using employee__id.
Story 8 (reading oplogs from a file)
In the above stories, we have parsed different types of MongoDB oplogs and generated equivalent SQL statements. Now, let’s modify our program to parse multiple oplogs by reading them from a file. Later, we’ll update the program to read directly from MongoDB’s oplog collection. For this story, we have to read oplogs, one at a time, from a JSON file containing MongoDB oplogs and convert them into equivalent SQL statements. You can use the example-input.json file as the input file.
The program should write the generated SQL statements in an output file, say output.sql. You can compare your output file with example-output.sql to verify your program.
Assumptions:
Your program should accept the filename containing the oplogs as an argument.
The program should also accept the output filename. The program should create this file and write all SQL output to that file.
Expectations:
Make sure while reading the file we process one oplog at a time. This will be helpful to you in the next story where you would be getting data infinitely in a stream.
Write the generated SQL statements in a file. You can compare your output file with our example-output.sql.
Story 9 (reading oplogs from MongoDB)
In story 8, we are reading MongoDB oplogs from a JSON file. Now, we expect you to read the oplogs directly from MongoDB and generate equivalent SQL statements. The program should connect to MongoDB, read and parse the oplogs one at a time, convert them into SQL statements and finally, execute these SQL statements directyl on a relational database (PostgreSQL, in our case).
Assumptions:
The program should read oplogs from MongoDB, convert those into equivalent SQL statements (DDL and DML statements), and execute these statements on PostgreSQL.
Expectations:
The program should process one oplog at a time. It should work in a streaming fashion, where it reads a single oplog entry from MongoDB, converts that oplog entry into multiple SQL statements and executes these statements on PostgreSQL.
The program should terminate only after receiving a SIGTERM signal (Ctrl + c). If there are no oplogs to read, the program should be waiting for more oplogs (it should not terminate). In other words, this program should run as a service.
Write the program such that it can handle input from either an oplog file or directly from MongoDB. Also, the program should be able to send output to an SQL file or execute the SQL statements directly on PostgreSQL.
JSON File (input) → SQL File (output)
JSON File (input) → Relational database (output)
MongoDB (input) → SQL File (output)
MongoDB (input) → Relational database (output)
Story 10 (Bookmarking Support - nice to have)
In Story 9, we successfully implemented the functionality to read MongoDB oplogs directly from a live MongoDB instance and generate equivalent SQL statements. However, one critical aspect of the parser's functionality is missing: bookmarking support. Bookmarking ensures that the parser can keep track of the last processed oplog, allowing it to resume processing from the correct point after a restart. This enhancement will prevent duplicate oplogs from being processed, maintaining data consistency, and improving the overall reliability of the parser.
Assumptions:
Bookmarking support will be added to the parser to keep track of the last processed oplog.
This mechanism will enable the parser to store the position of the last processed oplog, ensuring it can pick up from the correct point after a restart.
By using bookmarking, the parser will avoid reprocessing oplogs that have already been successfully executed, thus preventing duplicate data insertion and updates.
Expectations:
To achieve the objective of adding bookmarking support, we expect the following:
The program should implement a bookmarking mechanism to store the position of the last processed oplog. This information will be crucial for resuming processing from the correct point in the MongoDB oplog after a restart.
With bookmarking support, the parser should maintain data consistency by avoiding duplicate data processing, even in the case of unexpected interruptions or system restarts.
In the event of a program restart, the parser should utilize the bookmarked information to continue processing oplogs from where it left off, ensuring a seamless data migration process.
By adding bookmarking support, the parser will be better equipped to handle real-world scenarios, ensuring data integrity and reliability during data migration from MongoDB to PostgreSQL.
Story 11 (Distributed Execution - nice to have)
It would be beneficial to implement support for running the parser in a distributed manner across multiple machines. As data volume increase, there is a need to distribute the parser's execution across multiple machines to achieve higher performance and scalability. The addition of distributed execution will enable the parser to efficiently handle large datasets and effectively utilize available resources.
Overall Instructions
Create a new Git repo in the language of your choice (Java, Go, etc).
Commit and push code + tests for each of the stories.
Implement graceful shutdown in the program. Handle interrupt signals (e.g. Ctrl+C) to initiate a clean shutdown of the parser and close all external connections to MongoDB and PostgreSQL.
Feel free to make suitable assumptions if something is unclear. Document your assumptions and reasons behind them in the readme.