Introduction
In distributed systems, services need to communicate quickly and reliably. NATS.io is a lightweight, high-performance messaging system that helps achieve this.
This is the first of two blogs in the series. It gives you an overview of the NATS architecture, how it works, and the design considerations you should keep in mind. In the second blog, we will implement a NATS cluster, add observability, and look into disaster recovery strategies.
Understanding Core Architecture
NATS operates primarily on a publish-subscribe messaging pattern, supporting various messaging models:
Subject-Based Messaging
Publish-Subscribe
Request-Reply
Queue Groups
Additionally, NATS includes JetStream, a powerful built-in persistence engine enabling message storage and replay, providing an "at-least-once" delivery mechanism along with durable subscriptions.
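Before diving into JetStream, here is a minimal sketch of the core patterns listed above using the official Go client (github.com/nats-io/nats.go). The subject names and the local server URL are illustrative placeholders, not part of any particular setup.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local server; replace with your own URL(s).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// Publish-subscribe: every plain subscriber on the subject gets a copy.
	nc.Subscribe("orders.created", func(m *nats.Msg) {
		fmt.Printf("subscriber received: %s\n", m.Data)
	})

	// Queue group: messages are load-balanced across members of "order-workers".
	nc.QueueSubscribe("orders.created", "order-workers", func(m *nats.Msg) {
		fmt.Printf("worker received: %s\n", m.Data)
	})

	nc.Publish("orders.created", []byte(`{"id": 42}`))

	// Request-reply: a responder answers, the requester waits for a single reply.
	nc.Subscribe("time.now", func(m *nats.Msg) {
		m.Respond([]byte(time.Now().Format(time.RFC3339)))
	})
	if resp, err := nc.Request("time.now", nil, time.Second); err == nil {
		fmt.Printf("reply: %s\n", resp.Data)
	}
}
```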
Architecture Topologies
Depending on your use case, NATS can be deployed in several topologies:
Single Server Instance: Suitable for development or simple workloads.
Cluster: Ideal for achieving high availability and scalability.
Super Cluster: Multiple clusters interconnected via gateway nodes, optimized for multi-region deployments.
Leaf Nodes: Lightweight edge servers connected to a core cluster, ideal for edge computing scenarios.
Cluster Behavior
NATS clusters form a full mesh network, allowing dynamic scaling and self-healing without manual reconfiguration. Each server node in the cluster requires:
Client Port (default 4222): used for client connections.
Cluster Port (typically 6222): used for inter-server communication.
Clients connecting to any server in the cluster automatically discover all members, ensuring seamless failover and load balancing.
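A minimal server configuration sketch for one node of such a cluster might look like the following; the host names and the cluster name are assumptions for illustration.

```
# nats-server.conf (sketch)
port: 4222                  # client connections
cluster {
  name: my_cluster          # illustrative cluster name
  port: 6222                # inter-server (route) communication
  routes: [
    nats://nats-1:6222
    nats://nats-2:6222
    nats://nats-3:6222
  ]
}
```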
JetStream Clustering
JetStream enhances NATS by providing durable message storage and processing. It employs the RAFT consensus algorithm to ensure consistency across the cluster, requiring:
Unique server identifiers
Persistent storage for logs
A quorum, typically calculated as (cluster size / 2) + 1
💡 Recommendation: For optimal reliability and performance, deploy a JetStream cluster with 3 or 5 servers. The maximum replication factor per stream is 5.
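A sketch of the JetStream-related portion of a server configuration, reflecting the requirements above; the server name, storage path, and limits are illustrative assumptions.

```
# per-server JetStream settings (sketch)
server_name: nats-1              # unique identifier for this server
jetstream {
  store_dir: /data/jetstream     # persistent storage for streams and RAFT logs
  max_memory_store: 1G
  max_file_store: 50G
}
```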
JetStream Meta Group
All JetStream-enabled servers participate in a special RAFT group called the Meta Group, which manages JetStream API operations. The Meta Group elects a leader responsible for:
Handling API interactions
Making decisions about stream and consumer placements
💡 Important: Without an active Meta Group leader, JetStream clustering and server operations cannot proceed.
Streams and Their Types
Streams are durable message stores in JetStream. They persist messages published on specified subjects. Each stream forms its own RAFT group, with a leader managing data synchronization and acknowledgements.
💡 Note: Stream placement is centrally managed by the Meta Group leader.
Consumers
Consumers enable applications to process messages stored within streams. Each consumer forms its own RAFT group to handle acknowledgements and maintain delivery state information.
💡 Note: This independent RAFT group formation ensures reliable message processing and state management for each consumer.
Subjects
Subjects act as named channels through which publishers and subscribers discover each other and exchange messages efficiently.
Server Synchronization
NATS core servers themselves form their own RAFT group to maintain synchronization across the cluster.
💡 Important: Any disruption in RAFT leadership election across streams, consumers, or the Meta Group can adversely impact the overall performance and message processing capabilities of NATS.
Considerations
Sizing Your NATS Cluster
Determining the right size for your NATS Cluster is crucial for achieving optimal performance, reliability, and scalability. Here are key considerations to guide your sizing decisions:
High Availability and Fault Tolerance
Deploying multiple nodes enhances the cluster's ability to handle failures without service disruption. For instance, a 3-node cluster can tolerate the failure of one node while maintaining consensus and operational continuity. This setup ensures that the cluster remains functional even during partial outages.
💡 Quorum Requirement: NATS uses a quorum-based consensus mechanism. A majority of nodes (calculated as (cluster_size / 2) + 1) must agree to commit operations. In a 3-node cluster, this means two nodes form the required quorum, ensuring consistency and resilience.
Implementing Dead Letter Queues (DLQs)
While NATS doesn’t natively support DLQs, they can be implemented using advisory streams. When a message exceeds the maximum delivery attempts without acknowledgement, an advisory message ($JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES) is published. By subscribing to this advisory, you can capture undelivered messages and route them to a dedicated “dead letter” stream for further analysis or reprocessing.
💡 Practical Insight: In our experience processing high-throughput event streams, setting up a dedicated DLQ allowed us to isolate and inspect problematic messages without affecting the main processing pipeline.
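A hedged sketch of this pattern with the Go client is shown below. It assumes a stream named ORDERS with a consumer named worker, and a separate dead-letter stream that captures the dlq.> subjects; all of these names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/nats-io/nats.go"
)

// Minimal subset of the max-deliveries advisory payload used here.
type maxDeliveriesAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()
	js, _ := nc.JetStream()

	// Advisory emitted when a message on ORDERS/worker exceeds max deliveries.
	advisorySubject := "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.worker"

	nc.Subscribe(advisorySubject, func(m *nats.Msg) {
		var adv maxDeliveriesAdvisory
		if err := json.Unmarshal(m.Data, &adv); err != nil {
			return
		}
		// Look up the original message by its stream sequence and republish it
		// to a subject owned by the dead-letter stream.
		raw, err := js.GetMsg(adv.Stream, adv.StreamSeq)
		if err != nil {
			return
		}
		js.Publish(fmt.Sprintf("dlq.%s", adv.Stream), raw.Data)
	})

	select {} // keep the process alive
}
```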
Replication for Durability
Replication enhances message durability and fault tolerance. By configuring the replication factor, you determine how many nodes each message is stored on:
Recommended: A replication factor of 3 balances resource usage with fault tolerance, allowing the system to withstand the failure of one node.
Maximum: A replication factor of 5 offers higher resilience, tolerating up to two node failures, suitable for critical workloads.
💡 Best Practice: Deploying clusters with an odd number of nodes (e.g., 3 or 5) is advisable to prevent split-brain scenarios and ensure a clear majority for quorum decisions.
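As a concrete, hedged sketch, here is how a stream with a replication factor of 3 could be created with the Go client; the stream name and subjects are illustrative.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, _ := nc.JetStream()
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",             // illustrative stream name
		Subjects: []string{"orders.>"}, // subjects persisted by this stream
		Replicas: 3,                    // each message is stored on 3 nodes
		Storage:  nats.FileStorage,     // persist to disk
	})
	if err != nil {
		log.Fatal(err)
	}
}
```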
Selecting Infrastructure for NATS
Choosing the right infrastructure is crucial for optimal NATS performance. Here are key considerations to guide your setup:
CPU and Memory Recommendations
For production environments, especially when utilizing JetStream, it’s advisable to allocate:
CPU: At least 2 cores
Memory: At least 8 GiB
These recommendations ensure that the NATS server can handle high-throughput scenarios effectively.
Storage Considerations
NATS benefits from high-performance storage solutions:
Type: SSD or high-performance block storage
IOPS: Aim for at least 3000 IOPS per node
This setup supports the low-latency requirements of NATS operations.
Memory Management with GOMEMLIMIT
In containerized environments, the Go runtime doesn’t automatically respect the memory limits of the running instance. To manage memory usage effectively:
💡 Configure GOMEMLIMIT to 80-90% of the instance’s memory limit. For instance, on an 8 GiB machine, set GOMEMLIMIT=6GiB. This ensures the Go garbage collector operates within the desired memory constraints, preventing potential out-of-memory issues.
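As an illustration, in a Kubernetes-style deployment this could be wired up roughly as follows; the image, limits, and values are assumptions based on the example above.

```yaml
# container spec excerpt (sketch)
containers:
  - name: nats
    image: nats:latest
    env:
      - name: GOMEMLIMIT
        value: "6GiB"        # soft limit for the Go GC, below the hard limit
    resources:
      limits:
        memory: 8Gi          # hard container memory limit
```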
Server Tags for Enhanced Management
NATS allows the use of server_tags to label servers with arbitrary key-value pairs, such as region:us-east or az:1. These tags help with profiling, monitoring, and operational grouping.
They can be leveraged for selective targeting during cluster operations, especially when combined with selectors like --name or --cluster.
💡 In our deployments, we have used server tags to manage normal and DLQ stream placements on specific instances. This approach has improved our system’s resilience and helped in better management of NATS nodes.
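A hedged sketch of tag-aware placement with the Go client is shown below. It assumes the target servers were started with a matching entry in their server_tags configuration (for example workload:dlq); the stream, subject, and tag names are illustrative.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, _ := nc.JetStream()
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS_DLQ",
		Subjects: []string{"dlq.orders.>"},
		Replicas: 3,
		Placement: &nats.Placement{
			// Only servers carrying this tag are eligible to host the stream.
			Tags: []string{"workload:dlq"},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```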
Understanding Payload Handling
Managing payload sizes in NATS is crucial for maintaining optimal performance and system stability. Challenges with large payloads include:
Increased Latency: Larger messages lead to higher transmission times, affecting real-time processing.
Memory Consumption: Big payloads consume more memory, potentially leading to resource exhaustion.
Throughput Reduction: Handling sizable messages can decrease the overall message throughput.
Potential Errors: Exceeding the maximum payload size results in errors.
Recommended Practices:
Default Limit: NATS sets a default maximum payload size of 1 MB.
Optimal Size: For best performance, keep payloads below 1 MB.
Maximum Configurable Limit: While it’s possible to increase the limit up to 64 MB, it’s generally advised to avoid doing so unless absolutely necessary.
Alternative Approaches:
External Storage: Store large data (e.g., files, images) in external storage systems and transmit only references or metadata through NATS.
Chunking: Break down large messages into smaller chunks that comply with the payload size limit.
💡 Best Practice: Always assess the necessity of transmitting large payloads. If possible, redesign message structures to be more efficient, leveraging external systems for heavy data and keeping NATS messages lightweight.
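As a hedged sketch of the chunking approach above, the helper below splits a payload into pieces that stay under the default 1 MB limit and publishes them in order; the subject naming and header-based framing are assumptions for illustration.

```go
package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

const chunkSize = 900 * 1024 // stay safely below the 1 MB default max_payload

// publishChunked splits data into chunks and publishes them with headers that
// let a consumer reassemble the original payload in order.
func publishChunked(js nats.JetStreamContext, subject string, data []byte) error {
	total := (len(data) + chunkSize - 1) / chunkSize
	for i := 0; i < total; i++ {
		end := (i + 1) * chunkSize
		if end > len(data) {
			end = len(data)
		}
		msg := nats.NewMsg(subject)
		msg.Header.Set("Chunk-Index", fmt.Sprint(i))
		msg.Header.Set("Chunk-Total", fmt.Sprint(total))
		msg.Data = data[i*chunkSize : end]
		if _, err := js.PublishMsg(msg); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, _ := nc.JetStream()
	// Publish a ~3 MB dummy payload as chunks on an illustrative subject.
	if err := publishChunked(js, "uploads.report", make([]byte, 3*1024*1024)); err != nil {
		panic(err)
	}
}
```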
Consumers and Message Processing
Selecting the Right Consumer Type
Pull Consumers: Ideal for batch processing and high-throughput scenarios.
Push Consumers: Better for real-time, continuous streams.
Important Consumer Configurations
Optimizations for high throughput:
Use batch sizes of 100-1000 messages for better throughput.
Set max-ack-pending to 1000-10000 based on available resources.
Configure flow control, idle heartbeat, and max waiting appropriately.
Prefer the AckAll acknowledgment strategy for bulk processing.
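A hedged sketch of such a pull consumer with the Go client follows; the stream, durable, and subject names, as well as the specific numbers, are illustrative choices within the ranges above.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, _ := nc.JetStream()

	// Durable pull consumer tuned for bulk processing.
	sub, err := js.PullSubscribe(
		"orders.>",               // subject filter, illustrative
		"worker",                 // durable consumer name
		nats.AckAll(),            // acking a message acks everything before it
		nats.MaxAckPending(5000), // within the 1000-10000 guideline
	)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Fetch a batch of up to 500 messages, waiting at most 2 seconds.
		msgs, err := sub.Fetch(500, nats.MaxWait(2*time.Second))
		if err != nil || len(msgs) == 0 {
			continue
		}
		for _, m := range msgs {
			_ = m.Data // process the message here
		}
		// With AckAll, acknowledging the last message covers the whole batch.
		msgs[len(msgs)-1].Ack()
	}
}
```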
With these foundational concepts, let's move ahead to practical deployment, best practices, and hands-on demonstrations to ensure your NATS cluster is robust, scalable, and performant.
Wrapping up
Over the course of this blog, we built a solid conceptual foundation for running NATS in production. We went through:
Right-sizing the cluster
Tag-aware node placement
Designing dead letter queues
Payload size configuration
Configuring the Go garbage collector via GOMEMLIMIT
These practices will let you design clusters that self-heal, isolate failure domains, and keep latency predictable even during traffic spikes.
What’s in Part 2?
In Part 2, we will do a hands-on demo:
Spin up a four-node NATS cluster
Walk through a full Go client demo
Stress test the cluster with the NATS CLI
Look into troubleshooting
Cover backup and disaster recovery
So grab your IDE and install the NATS CLI; we will turn the concepts we learnt today into a production-ready NATS cluster.
Useful Links & Resources
For handy tools, dashboards, and further reading, visit the official NATS documentation.