Deploying a scalable NATS Cluster Part 1: Core Architecture and Considerations

Jun 4, 2025 | 5 min read
Introduction

In distributed systems, services need to communicate quickly and reliably. NATS.io is a lightweight, high-performance messaging system built to do exactly that.

This is a two-part series. This first post gives an overview of the NATS architecture, how it works, and the design considerations that should shape your deployment. In the second post, we will implement a NATS cluster, add observability, and look into disaster recovery strategies.

Understanding Core Architecture

NATS operates primarily on a publish-subscribe messaging pattern, supporting various messaging models:

  1. Subject-Based Messaging

  2. Publish-Subscribe

  3. Request-Reply

  4. Queue Groups

Additionally, NATS includes JetStream, a powerful built-in persistence engine enabling message storage and replay, providing an "at-least-once" delivery mechanism along with durable subscriptions.
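To make these patterns concrete, here is a minimal sketch using the official Go client (github.com/nats-io/nats.go). It assumes a server is reachable at nats://localhost:4222; subject names like orders.created are illustrative.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a single local server (assumed to be running on the default port).
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Publish-subscribe: every subscriber on the subject receives the message.
	if _, err := nc.Subscribe("orders.created", func(m *nats.Msg) {
		fmt.Printf("received: %s\n", m.Data)
	}); err != nil {
		log.Fatal(err)
	}
	if err := nc.Publish("orders.created", []byte(`{"id":1}`)); err != nil {
		log.Fatal(err)
	}

	// Queue group: members of "validators" share the work, one member per message.
	if _, err := nc.QueueSubscribe("orders.validate", "validators", func(m *nats.Msg) {
		m.Respond([]byte("ok")) // the reply half of request-reply
	}); err != nil {
		log.Fatal(err)
	}

	// Request-reply: block until a responder answers or the timeout expires.
	resp, err := nc.Request("orders.validate", []byte(`{"id":1}`), time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reply: %s\n", resp.Data)
}
```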

Architecture Topologies

Depending on your use case, NATS can be deployed in several topologies:

  1. Single Server Instance: Suitable for development or simple workloads.

  2. Cluster: Ideal for achieving high availability and scalability.

  3. Super Cluster: Multiple clusters interconnected via gateway nodes, optimized for multi-region deployments.

  4. Leaf Nodes: Lightweight edge servers connected to a core cluster, ideal for edge computing scenarios.

Cluster Behavior

NATS clusters form a full-mesh network, allowing dynamic scaling and self-healing without manual reconfiguration. Each server node in the cluster requires:

  • Client Port (default 4222): used for client connections.

  • Cluster Port (typically 6222): used for inter-server communication.

Clients connecting to any server in the cluster automatically discover all members, ensuring seamless failover and load balancing.
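As a sketch of how this looks from the client side, the Go client below is seeded with two cluster URLs (the hostnames are placeholders); discovery gossip fills in the remaining members, and the reconnect options govern failover behaviour.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Seed the client with a subset of the cluster (placeholder hostnames);
	// server gossip fills in the remaining members automatically.
	nc, err := nats.Connect(
		"nats://nats-0.example.internal:4222,nats://nats-1.example.internal:4222",
		nats.Name("orders-service"),       // shows up in server monitoring
		nats.MaxReconnects(-1),            // keep retrying forever
		nats.ReconnectWait(2*time.Second), // wait between reconnect attempts
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to %s", c.ConnectedUrl())
		}),
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// After connecting, the client knows every member of the full mesh.
	log.Printf("known servers: %v", nc.Servers())
}
```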

JetStream Clustering

JetStream enhances NATS by providing durable message storage and processing. It employs the RAFT consensus algorithm to ensure consistency across the cluster, requiring:

  • Unique server identifiers

  • Persistent storage for logs

  • A quorum, calculated as floor(cluster size / 2) + 1

💡 Recommendation: For optimal reliability and performance, deploy the JetStream cluster with 3 or 5 servers. The maximum replication factor per stream is 5.

JetStream Meta Group

All JetStream-enabled servers participate in a special RAFT group called the Meta Group, which manages JetStream API operations. The Meta Group elects a leader responsible for:

  • Handling API interactions

  • Making decisions about stream and consumer placements

💡 Important: Without an active Meta Group leader, JetStream API and clustering operations cannot proceed.

Streams

Streams are durable message stores in JetStream. They persist messages published on specified subjects. Each stream forms its own RAFT group, with a leader managing data synchronization and acknowledgements.

💡 Note: Stream placement is centrally managed by the Meta Group leader.

Consumers

Consumers enable applications to process messages stored within streams. Each consumer forms its own RAFT group to handle acknowledgements and maintain delivery state information.

💡 Note: This independent RAFT group formation ensures reliable message processing and state management for each consumer.

Subjects

Subjects are hierarchical, dot-separated names (for example, orders.us.created) that publishers and subscribers use to address messages. Wildcards (* and >) let a single subscription match groups of related subjects.

Server Synchronization

The RAFT groups described above (the Meta Group, streams, and consumers) are what keep JetStream state synchronized across the servers; core NATS routing itself relies on the full mesh of routes rather than RAFT.

💡 Important: Any disruption in RAFT leadership election across streams, consumers, or the Meta Group can adversely impact the overall performance and message processing capabilities of NATS.

Considerations

Sizing Your NATS Cluster

Determining the right size for your NATS Cluster is crucial for achieving optimal performance, reliability, and scalability. Here are key considerations to guide your sizing decisions:

High Availability and Fault Tolerance

Deploying multiple nodes enhances the cluster’s ability to handle failures without service disruption. For instance, a 3-node cluster can tolerate the failure of one node while maintaining consensus and operational continuity, so the cluster remains functional even during partial outages.

💡 Quorum Requirement: NATS uses a quorum-based consensus mechanism. A majority of nodes, calculated as floor(cluster_size / 2) + 1, must agree to commit operations. In a 3-node cluster, two nodes form the required quorum (and in a 5-node cluster, three), ensuring consistency and resilience.

Implementing Dead Letter Queues (DLQs)

While NATS doesn’t natively support DLQs, they can be implemented using advisory streams. When a message exceeds the maximum attempts without acknowledgement, an advisory message ($JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES) is published. By subscribing to this advisory, you can capture undelivered messages and route them to a dedicated “dead letter” stream for further analysis or reprocessing.

💡 Practical Insight: In our experience processing high-throughput event streams, setting up a dedicated DLQ allowed us to isolate and inspect problematic messages without affecting the main processing pipeline.
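A minimal sketch of this pattern with the Go client is shown below. The stream and consumer names (ORDERS, the dlq.orders subject, and an ORDERS_DLQ stream that captures it) are illustrative, and the advisory payload fields reflect our reading of the JetStream advisory schema, so verify them against your server version.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// Shape of the MAX_DELIVERIES advisory we care about. Field names follow
// the JetStream advisory schema as we understand it; verify on your version.
type maxDeliveriesAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Listen for messages that exhausted their delivery attempts on any
	// consumer of the ORDERS stream (stream and consumer names illustrative).
	subject := "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.*"
	if _, err := nc.Subscribe(subject, func(m *nats.Msg) {
		var adv maxDeliveriesAdvisory
		if err := json.Unmarshal(m.Data, &adv); err != nil {
			log.Printf("bad advisory: %v", err)
			return
		}
		// Look up the original message by sequence and republish it to a
		// dead-letter subject captured by a dedicated ORDERS_DLQ stream.
		raw, err := js.GetMsg(adv.Stream, adv.StreamSeq)
		if err != nil {
			log.Printf("lookup failed: %v", err)
			return
		}
		if _, err := js.Publish("dlq.orders", raw.Data); err != nil {
			log.Printf("dlq publish failed: %v", err)
		}
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive
}
```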

Replication for Durability

Replication enhances message durability and fault tolerance. By configuring the replication factor, you determine how many nodes each message is stored on:

  • Recommended: A replication factor of 3 balances resource usage with fault tolerance, allowing the system to withstand the failure of one node.

  • Maximum: A replication factor of 5 offers higher resilience, tolerating up to two node failures, suitable for critical workloads.

💡 Best Practice: Deploy clusters with an odd number of nodes (e.g., 3 or 5) to prevent split-brain scenarios and ensure a clear majority for quorum decisions.
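As a sketch, creating a replicated, file-backed stream with the Go client looks like this; the stream name and subjects are placeholders.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Every message on orders.> is stored on 3 nodes, so the stream keeps
	// serving if any single node fails.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Replicas: 3,                // replication factor (maximum is 5)
		Storage:  nats.FileStorage, // durable on-disk storage
	}); err != nil {
		log.Fatal(err)
	}
}
```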

Selecting Infrastructure for NATS

Choosing the right infrastructure is crucial for optimal NATS performance. Here are key considerations to guide your setup:

CPU and Memory Recommendations

For production environments, especially when utilizing JetStream, it’s advisable to allocate:

  • CPU: At least 2 cores

  • Memory: At least 8 GiB

These recommendations ensure that the NATS server can handle high-throughput scenarios effectively.

Storage Considerations

NATS benefits from high-performance storage:

  • Type: SSD or high-performance block storage

  • IOPS: Aim for at least 3000 IOPS per node

This setup supports the low-latency requirements of NATS operations.

Memory Management with GOMEMLIMIT

In containerized environments, the Go runtime does not automatically detect the memory limit of the container or instance it runs in. To manage memory usage effectively:

💡 Configure GOMEMLIMIT to roughly 75-90% of the instance’s memory limit. For instance, on an 8 GiB machine, set GOMEMLIMIT=6GiB. This keeps the Go garbage collector operating within the desired memory constraints and prevents out-of-memory kills.

Server Tags for Enhanced Management

NATS allows the use of server_tags to label servers with arbitrary key-value pairs, such as region:us-east or az:1. These tags help with profiling, monitoring, and operational grouping.

They can also be leveraged for selective targeting during cluster operations, especially when combined with selectors like --name or --cluster.

💡 In our deployments, we have used server tags to control placement of normal and DLQ streams on specific instances. This approach has improved our system’s resilience and made NATS nodes easier to manage.
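Building on that, a stream can be pinned to tagged servers from the client side. The sketch below assumes the target servers were started with a matching server_tags entry; the dlq tag and stream names are examples.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Pin the dead-letter stream to servers that advertise a matching tag.
	// Assumes those servers were started with server_tags containing "dlq".
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS_DLQ",
		Subjects: []string{"dlq.orders"},
		Replicas: 3,
		Placement: &nats.Placement{
			Tags: []string{"dlq"},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```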

Understanding Payload Handling

Managing payload sizes in NATS is crucial for maintaining optimal performance and system stability. Challenges with large payloads include:

  • Increased Latency: Larger messages take longer to transmit, affecting real-time processing.

  • Memory Consumption: Big payloads consume more memory, potentially leading to resource exhaustion.

  • Throughput Reduction: Handling sizable messages can decrease overall message throughput.

  • Potential Errors: Exceeding the maximum payload size results in publish errors.

Recommended Practices:

  • Default Limit: NATS sets a default maximum payload size of 1 MB.

  • Optimal Size: For best performance, keep payloads below 1 MB.

  • Maximum Configurable Limit: While it’s possible to increase the limit up to 64 MB, it’s generally advised to avoid doing so unless absolutely necessary.

  • Alternative Approaches:

    • External Storage: Store large data (e.g., files, images) in external storage systems and transmit only references or metadata through NATS.

    • Chunking: Break down large messages into smaller chunks that comply with the payload size limit.

💡 Best Practice: Always assess the necessity of transmitting large payloads. If possible, redesign message structures to be more efficient, leveraging external systems for heavy data and keeping NATS messages lightweight.
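One way to apply the external-storage approach is to publish only a small reference envelope while the heavy object lives elsewhere. The sketch below uses a hypothetical envelope; the field names, bucket/key layout, and subject are ours, not part of NATS.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// largePayloadRef points at data stored outside NATS. The field names and
// the bucket/key layout are illustrative, not part of NATS itself.
type largePayloadRef struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
	Size   int64  `json:"size"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// The large object is uploaded to external storage out of band;
	// NATS only carries the lightweight reference.
	ref := largePayloadRef{Bucket: "invoices", Key: "2025/06/inv-42.pdf", Size: 18 << 20}
	data, err := json.Marshal(ref)
	if err != nil {
		log.Fatal(err)
	}
	if err := nc.Publish("invoices.ready", data); err != nil {
		log.Fatal(err)
	}
}
```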

Consumers and Message Processing

Selecting the Right Consumer Type

  • Pull Consumers: Ideal for batch processing and high-throughput scenarios.

  • Push Consumers: Better for real-time, continuous streams.

Important Consumer Configurations

  • Optimizations for high throughput (a pull-consumer sketch follows this list):

    • Use batch sizes of 100-1000 messages for better throughput.

    • Set max-ack-pending to 1000-10000 based on resources.

    • Configure flow control, idle heartbeat, and max waiting appropriately.

    • Prefer AckAll acknowledgment strategy for bulk processing.
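Here is the pull-consumer sketch referenced above, using the Go client. The durable name, batch size, and max-ack-pending values are illustrative, it assumes the ORDERS stream from the earlier sketches exists, and it sticks with the default explicit ack policy (check your client and server versions before switching a pull consumer to AckAll).

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer tuned for throughput: a large window of
	// unacknowledged messages and batched fetches.
	sub, err := js.PullSubscribe("orders.>", "order-worker",
		nats.BindStream("ORDERS"), // assumes the ORDERS stream exists
		nats.MaxAckPending(5000),  // how many messages may be in flight unacked
	)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch in batches rather than one message at a time.
		msgs, err := sub.Fetch(500, nats.MaxWait(5*time.Second))
		if errors.Is(err, nats.ErrTimeout) {
			continue // nothing pending right now, poll again
		}
		if err != nil {
			log.Fatal(err)
		}
		for _, m := range msgs {
			// ... process m.Data here ...
			if err := m.Ack(); err != nil { // explicit per-message acknowledgement
				log.Printf("ack failed: %v", err)
			}
		}
	}
}
```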

With these foundational concepts in place, let's move on to practical deployment, best practices, and hands-on demonstrations to ensure your NATS cluster is robust, scalable, and performant.

Wrapping up

Over the course of this blog, we built a solid conceptual foundation for running NATS in production. We went through:

  1. Right-sizing the cluster

  2. Tag-aware node placement

  3. Designing a Dead Letter Queue

  4. Payload size configuration

  5. Tuning the Go garbage collector with GOMEMLIMIT

These practices will let you design clusters that self-heal, isolate failure domains, and keep latency predictable even during traffic spikes.

What’s in Part 2?

In Part 2, we will do a hands-on demo:

  1. Spin up a four-node NATS cluster

  2. Walk through a full Go client demo

  3. Stress-test the cluster with the NATS CLI

  4. Look into troubleshooting

  5. Cover backup and disaster recovery

So grab your IDE and install the NATS CLI; in Part 2 we will turn the concepts covered today into a production-ready NATS cluster.

For further reading, as well as handy tools and dashboards to supercharge your setup, visit the official NATS documentation.

