Data engineering mystery - rerouting large data in Kafka

You're a tech lead handling a large-scale data pipeline. One day, your colleague (C) pulls you into an issue related to Kafka. Here's how the conversation goes.

Colleague (C): Hey, we have a problem. The messages from one of the tenants are being published to a different Kafka cluster due to a misconfiguration.

You: Umm, okay. How many messages? And how did you find out about this issue?

C: I need to check the actual message count, but I'm guessing it's 100 million messages or even more. I found out about the issue while configuring my Kafka consumer.

You: Whoa! We need to figure out why this happened and how to fix it.

C: Yeah, I haven't worked much with Kafka; I need your help.

You: Alright, tell me: is this still ongoing, or have you already updated the configuration?

C: After I noticed it, I updated the configuration to point to the correct Kafka cluster. Thankfully, the producer publishes messages to Kafka only when it has incoming data, and as I understand from the business, no new data is expected. But we'll still need to rerun the pipelines to publish all the previous messages to the correct Kafka cluster.

You: Alright, let's find out how many messages we need to reroute.

(some time passes ⏳)

C: There are about 400 million messages that need to be rerouted.

You: Let's think about our options for rerouting the messages.

C: We need to reprocess the entire pipeline from the source. After all messages have been published on the correct Kafka cluster, we need to delete messages from the wrong cluster. I asked another colleague to estimate this task. According to him:

It will take about 1 month for us to process ALL 400 million messages.

You: This won't work, we need to think of some other option.

C: I can't think of any other option. How can we insert messages into the Kafka cluster without running the pipeline?

You: Tell me a bit about the data flow in the current pipeline.

(You sketch the data flow in a diagram to visualize the problem better. A couple of hours later...)

You: So there's a way to copy data from one Kafka cluster to another: kcat (formerly kafkacat). I checked the documentation and tried it out locally; we can use it to copy messages from one cluster to the other, and then delete them from the wrong cluster. This should be much, much faster. If my calculations are right, it should take a couple of hours for 400 million messages.
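The core trick is to pipe kcat's consumer mode straight into its producer mode. Here's a minimal sketch of the idea; the broker addresses and the tenant-events topic name are made-up placeholders, and a real script would need to cover multiple topics and failure cases:

```sh
# Consume everything from the wrong cluster, starting at the earliest offset
# (-o beginning) and exiting once the end of the topic is reached (-e).
# -K $'\t' emits each message as key<TAB>value, and the producer side (-P)
# uses the same delimiter to split the key back out before publishing.
kcat -C -b kafka-wrong.internal:9092 -t tenant-events -o beginning -e -K $'\t' \
  | kcat -P -b kafka-correct.internal:9092 -t tenant-events -K $'\t'
```

One caveat worth knowing: this pipe treats each line as one message, so payloads that contain newlines or the delimiter character would need a more careful approach.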

(a day passes, where you create a proper production script to copy data)

You: Let's run the script.

(About 38 minutes later, the script is done. You verify that all messages were copied correctly, roughly as sketched below, and then delete the messages from the wrong cluster.)
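One way to do that verification, as a spot check rather than a proof: compare message counts on both clusters. Printing a single character per message via kcat's format string avoids miscounting payloads that contain newlines (broker and topic names are again hypothetical):

```sh
# -f '.' prints one '.' per message, -q suppresses informational output,
# and -e exits at the end of the topic, so `wc -c` yields the message count.
kcat -C -b kafka-wrong.internal:9092   -t tenant-events -o beginning -e -q -f '.' | wc -c
kcat -C -b kafka-correct.internal:9092 -t tenant-events -o beginning -e -q -f '.' | wc -c
```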

C: Wow! We went from 1 month to less than an hour! How do I learn to solve issues like you do?

Lessons:

You: Well, there's no rocket science or shortcut to this. Here's the process I followed for this problem:

  • Understand the problem better by visualizing the data flow. This led to the insight that we could potentially just copy messages across clusters, so search for whether such a tool exists.

  • Read the documentation, try the tool out locally, and understand its configuration options.

  • Think through potential challenges and infra constraints for running the tool: the size of each message, the volume of data to be moved, how and where to run kcat, how to handle a failed or broken copy operation, etc.

  • Write a script that handles most of these scenarios, and try it on a large enough dataset both locally and on staging. (A rough sketch of such a script follows this list.)
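As an illustration of those last two points, here's a minimal sketch of what a more production-minded wrapper around the kcat pipe might look like. All the specifics here (broker addresses, topic names, log path) are hypothetical placeholders, not the actual script:

```sh
#!/usr/bin/env bash
# Hypothetical wrapper: copy several topics from the wrong cluster to the
# correct one, fail loudly if either side of the pipe breaks, and log progress.
set -euo pipefail

SRC=kafka-wrong.internal:9092        # assumption: source (wrong) cluster
DST=kafka-correct.internal:9092      # assumption: destination (correct) cluster
TOPICS=(tenant-events tenant-audit)  # assumption: topics that need rerouting
LOG=/var/log/kafka-reroute.log

for topic in "${TOPICS[@]}"; do
  echo "$(date -Is) copying ${topic}" | tee -a "$LOG"
  # With pipefail set, the pipeline fails if either the consumer or the
  # producer side dies, instead of silently dropping the tail of the topic.
  kcat -C -b "$SRC" -t "$topic" -o beginning -e -K $'\t' \
    | kcat -P -b "$DST" -t "$topic" -K $'\t'
  echo "$(date -Is) done ${topic}" | tee -a "$LOG"
done
```

A real run would also have to consider idempotency if the copy is interrupted mid-topic (for example, by recording the last copied offset and resuming from there), which the plain kcat pipe alone doesn't give you.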

I write such stories about software engineering. There's no specific frequency, since I don't make these up.

If you liked this one, you might love - 📦The Mystery of Failing Database Writes

Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!

Subscribe for more such content

Stay updated with the latest insights and best practices in software engineering and site reliability engineering by subscribing to our content.
