Deploying a scalable NATS Cluster Part 1: Core Architecture and Considerations

Jun 4, 2025 | 5 min read
Introduction

In distributed systems, services need to communicate quickly and reliably. NATS.io is a lightweight, high-performance messaging system built to do exactly that.

This is a two-part series. This first post gives an overview of the NATS architecture, how it works, and the design considerations that should shape your deployment. In the second post, we will implement a NATS cluster, add observability, and look into disaster recovery strategies.

Understanding Core Architecture

NATS operates primarily on a publish-subscribe messaging pattern, supporting various messaging models:

  1. Subject-Based Messaging

  2. Publish-Subscribe

  3. Request-Reply

  4. Queue Groups

Additionally, NATS includes JetStream, a powerful built-in persistence engine enabling message storage and replay, providing an "at-least-once" delivery mechanism along with durable subscriptions.
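To make these patterns concrete, here is a minimal sketch using the official Go client (github.com/nats-io/nats.go). It assumes a server is reachable at nats://localhost:4222; subject names like orders.created are illustrative.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a single local server (assumed to be running on the default port).
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Publish-subscribe: every subscriber on the subject receives the message.
	if _, err := nc.Subscribe("orders.created", func(m *nats.Msg) {
		fmt.Printf("received: %s\n", m.Data)
	}); err != nil {
		log.Fatal(err)
	}
	if err := nc.Publish("orders.created", []byte(`{"id":1}`)); err != nil {
		log.Fatal(err)
	}

	// Queue group: members of "validators" share the work, one member per message.
	if _, err := nc.QueueSubscribe("orders.validate", "validators", func(m *nats.Msg) {
		m.Respond([]byte("ok")) // the reply half of request-reply
	}); err != nil {
		log.Fatal(err)
	}

	// Request-reply: block until a responder answers or the timeout expires.
	resp, err := nc.Request("orders.validate", []byte(`{"id":1}`), time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reply: %s\n", resp.Data)
}
```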

Architecture Topologies

Depending on your use case, NATS can be deployed in several topologies:

  1. Single Server Instance: Suitable for development or simple workloads.

  2. Cluster: Ideal for achieving high availability and scalability.

  3. Super Cluster: Multiple clusters interconnected via gateway nodes, optimized for multi-region deployments.

  4. Leaf Nodes: Lightweight edge servers connected to a core cluster, ideal for edge computing scenarios.

Cluster Behavior

NATS clusters form a full-mesh network, allowing dynamic scaling and self-healing without manual reconfiguration. Each server node in the cluster requires:

  • Client Port (default 4222): used for client connections.

  • Cluster Port (typically 6222): used for inter-server communication.

Clients connecting to any server in the cluster automatically discover all members, ensuring seamless failover and load balancing.
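As a sketch of how this looks from the client side, the Go client below is seeded with two cluster URLs (the hostnames are placeholders); discovery gossip fills in the remaining members, and the reconnect options govern failover behaviour.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Seed the client with a subset of the cluster (placeholder hostnames);
	// server gossip fills in the remaining members automatically.
	nc, err := nats.Connect(
		"nats://nats-0.example.internal:4222,nats://nats-1.example.internal:4222",
		nats.Name("orders-service"),       // shows up in server monitoring
		nats.MaxReconnects(-1),            // keep retrying forever
		nats.ReconnectWait(2*time.Second), // wait between reconnect attempts
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to %s", c.ConnectedUrl())
		}),
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// After connecting, the client knows every member of the full mesh.
	log.Printf("known servers: %v", nc.Servers())
}
```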

JetStream Clustering

JetStream enhances NATS by providing durable message storage and processing. It employs the RAFT consensus algorithm to ensure consistency across the cluster, requiring:

  • Unique server identifiers

  • Persistent storage for logs

  • A quorum, calculated as floor(cluster size / 2) + 1

💡 Recommendation: For optimal reliability and performance, deploy the JetStream cluster with 3 or 5 servers. The maximum replication factor per stream is 5.

JetStream Meta Group

All JetStream-enabled servers participate in a special RAFT group called the Meta Group, which manages JetStream API operations. The Meta Group elects a leader responsible for:

  • Handling API interactions

  • Making decisions about stream and consumer placements

💡 Important: Without an active Meta Group leader, JetStream API and clustering operations cannot proceed.

Streams

Streams are durable message stores in JetStream. They persist messages published on specified subjects. Each stream forms its own RAFT group, with a leader managing data synchronization and acknowledgements.

💡 Note: Stream placement is centrally managed by the Meta Group leader.

Consumers

Consumers enable applications to process messages stored within streams. Each consumer forms its own RAFT group to handle acknowledgements and maintain delivery state information.

💡 Note: This independent RAFT group formation ensures reliable message processing and state management for each consumer.

Subjects

Subjects are hierarchical, dot-separated names (for example, orders.us.created) that publishers and subscribers use to address messages. Wildcards (* and >) let a single subscription match groups of related subjects.

Server Synchronization

The RAFT groups described above (the Meta Group, streams, and consumers) are what keep JetStream state synchronized across the servers; core NATS routing itself relies on the full mesh of routes rather than RAFT.

💡 Important: Any disruption in RAFT leadership election across streams, consumers, or the Meta Group can adversely impact the overall performance and message processing capabilities of NATS.

Considerations

Sizing Your NATS Cluster

Determining the right size for your NATS Cluster is crucial for achieving optimal performance, reliability, and scalability. Here are key considerations to guide your sizing decisions:

High Availability and Fault Tolerance

Deploying multiple nodes enhances the cluster’s ability to handle failures without service disruption. For instance, a 3-node cluster can tolerate the failure of one node while maintaining consensus and operational continuity, so the cluster remains functional even during partial outages.

💡 Quorum Requirement: NATS uses a quorum-based consensus mechanism. A majority of nodes, calculated as floor(cluster_size / 2) + 1, must agree to commit operations. In a 3-node cluster, two nodes form the required quorum (and in a 5-node cluster, three), ensuring consistency and resilience.

Implementing Dead Letter Queues (DLQs)

While NATS doesn’t natively support DLQs, they can be implemented using advisory streams. When a message exceeds the maximum attempts without acknowledgement, an advisory message ($JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES) is published. By subscribing to this advisory, you can capture undelivered messages and route them to a dedicated “dead letter” stream for further analysis or reprocessing.

💡 Practical Insight: In our experience processing high-throughput event streams, setting up a dedicated DLQ allowed us to isolate and inspect problematic messages without affecting the main processing pipeline.
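A minimal sketch of this pattern with the Go client is shown below. The stream and consumer names (ORDERS, the dlq.orders subject, and an ORDERS_DLQ stream that captures it) are illustrative, and the advisory payload fields reflect our reading of the JetStream advisory schema, so verify them against your server version.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// Shape of the MAX_DELIVERIES advisory we care about. Field names follow
// the JetStream advisory schema as we understand it; verify on your version.
type maxDeliveriesAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Listen for messages that exhausted their delivery attempts on any
	// consumer of the ORDERS stream (stream and consumer names illustrative).
	subject := "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.*"
	if _, err := nc.Subscribe(subject, func(m *nats.Msg) {
		var adv maxDeliveriesAdvisory
		if err := json.Unmarshal(m.Data, &adv); err != nil {
			log.Printf("bad advisory: %v", err)
			return
		}
		// Look up the original message by sequence and republish it to a
		// dead-letter subject captured by a dedicated ORDERS_DLQ stream.
		raw, err := js.GetMsg(adv.Stream, adv.StreamSeq)
		if err != nil {
			log.Printf("lookup failed: %v", err)
			return
		}
		if _, err := js.Publish("dlq.orders", raw.Data); err != nil {
			log.Printf("dlq publish failed: %v", err)
		}
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive
}
```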

Replication for Durability

Replication enhances message durability and fault tolerance. By configuring the replication factor, you determine how many nodes each message is stored on:

  • Recommended: A replication factor of 3 balances resource usage with fault tolerance, allowing the system to withstand the failure of one node.

  • Maximum: A replication factor of 5 offers higher resilience, tolerating up to two node failures, suitable for critical workloads.

💡 Best Practice: Deploy clusters with an odd number of nodes (e.g., 3 or 5) to prevent split-brain scenarios and ensure a clear majority for quorum decisions.
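As a sketch, creating a replicated, file-backed stream with the Go client looks like this; the stream name and subjects are placeholders.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Every message on orders.> is stored on 3 nodes, so the stream keeps
	// serving if any single node fails.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Replicas: 3,                // replication factor (maximum is 5)
		Storage:  nats.FileStorage, // durable on-disk storage
	}); err != nil {
		log.Fatal(err)
	}
}
```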

Selecting Infrastructure for NATS

Choosing the right infrastructure is crucial for optimal NATS performance. Here are key considerations to guide your setup:

CPU and Memory Recommendations

For production environments, especially when utilizing JetStream, it’s advisable to allocate:

  • CPU: At least 2 cores

  • Memory: At least 8 GiB

These recommendations ensure that the NATS server can handle high-throughput scenarios effectively.

Storage Considerations

NATS benefits from high-performance storage:

  • Type: SSD or high-performance block storage

  • IOPS: Aim for at least 3000 IOPS per node

This setup supports the low-latency requirements of NATS operations.

Memory Management with GOMEMLIMIT

In containerized environments, the Go runtime does not automatically detect the memory limit of the container or instance it runs in. To manage memory usage effectively:

💡 Configure GOMEMLIMIT to roughly 75-90% of the instance’s memory limit. For instance, on an 8 GiB machine, set GOMEMLIMIT=6GiB. This keeps the Go garbage collector operating within the desired memory constraints and prevents out-of-memory kills.

Server Tags for Enhanced Management

NATS allows the use of server_tags to label servers with arbitrary key-value pairs, such as region:us-east or az:1. These tags help with profiling, monitoring, and operational grouping.

They can also be leveraged for selective targeting during cluster operations, especially when combined with selectors like --name or --cluster.

💡 In our deployments, we have used server tags to control placement of normal and DLQ streams on specific instances. This approach has improved our system’s resilience and made NATS nodes easier to manage.
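Building on that, a stream can be pinned to tagged servers from the client side. The sketch below assumes the target servers were started with a matching server_tags entry; the dlq tag and stream names are examples.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Pin the dead-letter stream to servers that advertise a matching tag.
	// Assumes those servers were started with server_tags containing "dlq".
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS_DLQ",
		Subjects: []string{"dlq.orders"},
		Replicas: 3,
		Placement: &nats.Placement{
			Tags: []string{"dlq"},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```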

Understanding Payload Handling

Managing payload sizes in NATS is crucial for maintaining optimal performance and system stability. Challenges with large payloads include:

  • Increased Latency: Larger messages take longer to transmit, affecting real-time processing.

  • Memory Consumption: Big payloads consume more memory, potentially leading to resource exhaustion.

  • Throughput Reduction: Handling sizable messages can decrease overall message throughput.

  • Potential Errors: Exceeding the maximum payload size results in publish errors.

Recommended Practices:

  • Default Limit: NATS sets a default maximum payload size of 1 MB.

  • Optimal Size: For best performance, keep payloads below 1 MB.

  • Maximum Configurable Limit: While it’s possible to increase the limit up to 64 MB, it’s generally advised to avoid doing so unless absolutely necessary.

  • Alternative Approaches:

    • External Storage: Store large data (e.g., files, images) in external storage systems and transmit only references or metadata through NATS.

    • Chunking: Break down large messages into smaller chunks that comply with the payload size limit.

💡 Best Practice: Always assess the necessity of transmitting large payloads. If possible, redesign message structures to be more efficient, leveraging external systems for heavy data and keeping NATS messages lightweight.
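One way to apply the external-storage approach is to publish only a small reference envelope while the heavy object lives elsewhere. The sketch below uses a hypothetical envelope; the field names, bucket/key layout, and subject are ours, not part of NATS.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// largePayloadRef points at data stored outside NATS. The field names and
// the bucket/key layout are illustrative, not part of NATS itself.
type largePayloadRef struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
	Size   int64  `json:"size"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// The large object is uploaded to external storage out of band;
	// NATS only carries the lightweight reference.
	ref := largePayloadRef{Bucket: "invoices", Key: "2025/06/inv-42.pdf", Size: 18 << 20}
	data, err := json.Marshal(ref)
	if err != nil {
		log.Fatal(err)
	}
	if err := nc.Publish("invoices.ready", data); err != nil {
		log.Fatal(err)
	}
}
```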

Consumers and Message Processing

Selecting the Right Consumer Type

  • Pull Consumers: Ideal for batch processing and high-throughput scenarios.

  • Push Consumers: Better for real-time, continuous streams.

Important Consumer Configurations

  • Optimizations for high throughput (a pull-consumer sketch follows this list):

    • Use batch sizes of 100-1000 messages for better throughput.

    • Set max-ack-pending to 1000-10000 based on resources.

    • Configure flow control, idle heartbeat, and max waiting appropriately.

    • Prefer AckAll acknowledgment strategy for bulk processing.
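Here is the pull-consumer sketch referenced above, using the Go client. The durable name, batch size, and max-ack-pending values are illustrative, it assumes the ORDERS stream from the earlier sketches exists, and it sticks with the default explicit ack policy (check your client and server versions before switching a pull consumer to AckAll).

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer tuned for throughput: a large window of
	// unacknowledged messages and batched fetches.
	sub, err := js.PullSubscribe("orders.>", "order-worker",
		nats.BindStream("ORDERS"), // assumes the ORDERS stream exists
		nats.MaxAckPending(5000),  // how many messages may be in flight unacked
	)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch in batches rather than one message at a time.
		msgs, err := sub.Fetch(500, nats.MaxWait(5*time.Second))
		if errors.Is(err, nats.ErrTimeout) {
			continue // nothing pending right now, poll again
		}
		if err != nil {
			log.Fatal(err)
		}
		for _, m := range msgs {
			// ... process m.Data here ...
			if err := m.Ack(); err != nil { // explicit per-message acknowledgement
				log.Printf("ack failed: %v", err)
			}
		}
	}
}
```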

With these foundational concepts in place, let's move on to practical deployment, best practices, and hands-on demonstrations to ensure your NATS cluster is robust, scalable, and performant.

Wrapping up

Over the course of this blog, we built a solid conceptual foundation for running NATS in production. We went through:

  1. Right-sizing the cluster

  2. Tag-aware node placement

  3. Designing a Dead Letter Queue

  4. Payload size configuration

  5. Tuning the Go garbage collector with GOMEMLIMIT

These practices will let you design clusters that self-heal, isolate failure domains, and keep latency predictable even during traffic spikes.

What’s in Part 2?

In Part 2, we will do a hands-on demo:

  1. Spin up a four-node NATS cluster

  2. Walk through a full Go client demo

  3. Stress-test the cluster with the NATS CLI

  4. Look into troubleshooting

  5. Cover backup and disaster recovery

So grab your IDE and install the NATS CLI; in Part 2 we will turn the concepts covered today into a production-ready NATS cluster.

For further reading, as well as handy tools and dashboards to supercharge your setup, visit the official NATS documentation.

