Shooting ourselves in the foot by enabling Spring Kafka batching

There are no silver bullets in system architecture, but batching your I/O is advice that comes closer than most. However, when you don’t understand what’s happening under the hood, you might end up causing more harm than good.

This article will go over our journey of enabling batch listeners in Spring Kafka, the neat ways Kafka optimizes for performance out of the box and why you should always RTFM.

Note: The library is officially called Spring for Apache Kafka but that’s a mouthful so I will be referring to it as Spring Kafka throughout this article.

Setting the context

Our use case was as textbook as it gets for adopting Kafka. Millions of messages come in per day (~14M) with occasional bursts, but there is no per-message latency requirement. As long as all messages are processed before the next day, we are good to go.

There could be certain messages that fail transiently and have to be retried. One of the advantages of using Spring Kafka is that it comes with "non-blocking retries" out of the box. The official docs cover how this pattern works in detail but TL;DR - we don't want a failed message to block the rest of the messages in its partition while it gets retried.

What is of interest to us, though, is that you cannot have non-blocking retries AND batch listeners (Ref. Also, try to think about why this is the case). We were exploring various options to improve the performance of our system and decided to enable batch listeners, at the cost of losing non-blocking retries. We saw a marginal improvement in performance, but the system would now be much slower if (or when, if you believe in Murphy's law) the error rate went up. However, at the time we decided the tradeoff was worth it and moved on.

A lesson on understanding the tools you work with

Months after this was done, I was going over Kafka's design docs. Each decision taken there deserves a blog post of its own, but one thing caught my eye: the section on "End to end batch compression".

The way compression works in Kafka is that, on the producer side, a bunch of messages are grouped together, compressed, and sent as one unit. Batching before compressing leads to better compression ratios because a lot of information tends to be redundant across messages, e.g., the keys in JSON messages, which can be exploited by compressing a batch together. Each batch of records/messages is called a RecordBatch. (Very original indeed.)

For example, a producer might group a few hundred small JSON messages into a single RecordBatch, compress it once, and send that unit over the wire.
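To see why batching before compressing pays off, here is a small, self-contained sketch. It uses plain Java gzip with no Kafka involved, and the message shape is made up; it only demonstrates the general effect of compressing redundant JSON keys per message versus per batch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BatchCompressionDemo {

    // Gzip a byte array and return the compressed bytes.
    static byte[] gzip(byte[] input) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(input);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {total size compressing each message alone,
    //          size compressing the whole batch as one unit}.
    public static int[] compare(List<String> messages) {
        int individual = 0;
        StringBuilder batch = new StringBuilder();
        for (String m : messages) {
            individual += gzip(m.getBytes()).length;
            batch.append(m);
        }
        return new int[]{individual, gzip(batch.toString().getBytes()).length};
    }

    public static void main(String[] args) {
        // JSON keys repeat across messages, so the batch compresses far better.
        List<String> msgs = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            msgs.add("{\"userId\":" + i + ",\"event\":\"click\",\"page\":\"/home\"}");
        }
        int[] sizes = compare(msgs);
        System.out.println("per-message total: " + sizes[0]
                + " bytes, batched: " + sizes[1] + " bytes");
    }
}
```

Running this, the batched size comes out dramatically smaller than the sum of the per-message sizes, since gzip can deduplicate the repeated keys across the whole batch.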

Then, the broker just validates each RecordBatch sent by the producer (e.g., ensuring that the number of records matches what the header states) before sending it in the same format to the consumers. The RecordBatch is the smallest unit sent over the wire, which makes sense as the consumer cannot make use of half a RecordBatch.

Image: Rough sketch of the message format being sent

Note that setting a different compression type at the topic level would mean that each RecordBatch would be recompressed by the broker using the topic level compression algorithm. However, the messages in each batch would still be the same and the number of messages is decided by the producer and not the broker or consumer. This is the point that is significant for us. Also, having a different compression type at the topic level is usually undesirable as it adds unnecessary load on the broker side with no benefit for most use cases. This doc goes over every combination of the compression configs and explains what it does.

Does max.poll.records affect how Kafka batches messages?

These consumer configs sound like they would mess up our ideal batching design so let's go over them.

  • max.poll.records is an application-level config that only affects how many records are returned on each call of poll(). It does not affect the actual fetching of messages from the broker to the consumer. The Kafka consumer fetches the batch(es) from the broker, buffers them in memory, and hands your application at most max.poll.records messages per poll() call.

What does fetch.max.bytes actually control?

  • fetch.max.bytes does affect the actual fetching from the broker to the consumer. However, it is not treated as an absolute maximum: if the first batch is larger than this value, it is still returned so that the consumer can make progress. This config is typically used to fetch multiple batches in one go rather than to break batches into fragments. In combination with fetch.min.bytes, it can be used to "batch" the batches more or less aggressively from the consumer side.

The key point is that these configs control how consumers receive and buffer batches, but they do not change how producers group messages into RecordBatches in the first place.
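As a rough illustration of the configs above, a consumer's fetch behavior might be tuned like this. The values here are arbitrary placeholders, not recommendations:

```java
import java.util.Properties;

public class ConsumerFetchConfig {

    // Illustrative consumer fetch settings; tune for your own workload.
    public static Properties build() {
        Properties props = new Properties();
        // Upper bound on records handed to the application per poll() call;
        // purely client-side, does not change what is fetched from the broker.
        props.put("max.poll.records", "500");
        // Soft cap on data returned per fetch; the first batch is returned
        // even if it exceeds this, so the consumer can always make progress.
        props.put("fetch.max.bytes", String.valueOf(50 * 1024 * 1024));
        // The broker waits until at least this many bytes are available
        // before answering a fetch request.
        props.put("fetch.min.bytes", "1");
        return props;
    }
}
```

Raising fetch.min.bytes trades latency for fewer, fuller fetch round-trips; none of these settings change how the producer formed the RecordBatches.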

Now, confident that Kafka keeps a batch of messages intact from the producer, through the broker, until the consumer finally reads them, we are left with just one unanswered question --

If Kafka does batching out of the box, what do batch listeners in Spring Kafka even solve?

When you configure @KafkaListener with factory.setBatchListener(true), Spring Kafka delivers a List of records instead of a single record per invocation. This is only worth doing when your consumer's processing logic actually benefits from batching.

If your consumer mostly does per-message side effects (for example, calling external APIs per event), batch listeners may add complexity without giving you much in return.

One simple and common use case is reading a bunch of messages, transforming them in some manner, and then writing the results to a database. Instead of doing a DB write for each record, you can do one batch insert, which is much more efficient.
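A minimal sketch of that idea: collapse N records into one multi-row INSERT instead of N single-row writes. The table and column names are hypothetical, and real code should use a parameterized JDBC batch (PreparedStatement.addBatch) rather than string concatenation; this only shows the shape of the optimization:

```java
import java.util.List;
import java.util.stream.Collectors;

public class BatchInsertSketch {

    // Collapse a batch of record payloads into a single multi-row INSERT.
    // Hypothetical table/columns; illustrative only -- use a parameterized
    // JDBC batch in production to avoid SQL injection.
    public static String buildInsert(List<String> payloads) {
        String rows = payloads.stream()
                .map(p -> "('" + p + "')")
                .collect(Collectors.joining(", "));
        return "INSERT INTO events (payload) VALUES " + rows;
    }
}
```

One round-trip to the database per listener invocation instead of one per record is where batch listeners earn their keep.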

Our use case -- running regex matches on each message -- was different. Since each message had to be processed individually, there was no batching benefit on the consumer side.

The maximum size of the list handed to a batch listener can be configured via the max.poll.records consumer config we discussed.

Wrapping up

For our specific use case, batch listeners were not worth losing non-blocking retries. Our bottleneck was producer throughput: the producer was not sending data fast enough. The consumer's processing logic, which was CPU-bound and inherently per-record, was not the problem. Batch listeners help when you can batch the processing itself (e.g., bulk DB writes). They do not help when your business logic needs to run on each record individually.

Instead, we configured our producers to batch more aggressively. We increased batch.size, which specifies the maximum batch size in bytes per partition per request. A higher value increases throughput at the cost of a bigger memory footprint and higher latency, which was an acceptable tradeoff for us. Since Kafka does not wait forever for batches to fill up, batch.size is an upper bound, not a guarantee. We also configured linger.ms, which controls how long the producer waits before sending a batch: the producer sends a batch when either the linger duration elapses or the batch reaches batch.size, whichever comes first.
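For reference, that producer-side tuning looks roughly like this. The values are illustrative, not the ones we actually shipped:

```java
import java.util.Properties;

public class ProducerBatchingConfig {

    // Illustrative producer batching settings.
    public static Properties build() {
        Properties props = new Properties();
        // Upper bound, in bytes, on a batch per partition per request;
        // an upper bound, not a guarantee.
        props.put("batch.size", String.valueOf(64 * 1024));
        // How long the producer waits for a batch to fill
        // before sending it anyway.
        props.put("linger.ms", "50");
        // Each RecordBatch is compressed once on the producer side.
        props.put("compression.type", "lz4");
        return props;
    }
}
```

With a nonzero linger.ms, the producer accumulates records for up to that many milliseconds, sending earlier only if the batch fills to batch.size first.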

This also meant we could re-enable non-blocking retries, which made error handling simpler and more performant. The real lesson was simpler than the fix: before enabling a feature, understand what problem it actually solves. The docs were there all along.

If you enjoy this kind of “we broke it, then fixed it” story, you might also like our posts on real-world Kafka and reliability incidents. Check out Gitops for Kafka in the real world and How Queueing Theory Makes Systems Reliable.

