1. What Is NATS?
NATS is a high-performance messaging system designed for cloud-native applications, microservices, and distributed systems. Originally created by Derek Collison at VMware in 2010, NATS has evolved from a simple pub/sub system into a comprehensive messaging platform that powers some of the world's largest distributed architectures.
The name "NATS" originally stood for "Neural Autonomic Transport System," though today it is treated simply as a name and rarely expanded. What started as an internal project at VMware became a Cloud Native Computing Foundation (CNCF) incubating project in 2018, cementing its place in the cloud-native ecosystem.
Core Design Principles
NATS was built with specific principles that differentiate it from traditional message brokers:
Simplicity: NATS has a simple, text-based protocol that's easy to understand and implement. The server is a single binary with no external dependencies. You can be up and running in minutes, not hours.
Performance: Written in Go, NATS is designed for speed. It can handle millions of messages per second with minimal latency. The server uses efficient memory management and lock-free data structures to maximize throughput.
Availability: NATS follows an "always on" philosophy. Messages are delivered to currently available subscribers; if no subscriber exists for a subject, the message is dropped. This prevents resource exhaustion and keeps the system responsive.
Scalability: NATS scales horizontally through clustering. You can add nodes to handle more connections and throughput. The protocol is designed to minimize chatter between nodes, keeping cluster overhead low.
Security: Modern security features including TLS, authentication (token, username/password, NKEYS, JWT), and authorization are built-in. NATS supports decentralized security through JWTs and account-based isolation.
Messaging Patterns
NATS supports multiple messaging patterns:
Publish-Subscribe (Pub/Sub): Publishers send messages to subjects; all subscribers receive a copy. This enables broadcast patterns and event notification systems.
Request-Reply: A synchronous pattern where a client sends a request and waits for a response. NATS handles the reply addressing automatically through inbox subjects.
Queue Groups: Multiple subscribers form a queue group; messages are distributed among them. This provides load balancing and horizontal scaling of consumers.
Subject Hierarchies: Subjects use dot notation (e.g., orders.us.east) with wildcards for flexible routing. * matches a single token; > matches one or more tokens.
- orders.*.created matches orders.us.created and orders.eu.created, but not orders.us.east.created.
- orders.> matches orders.us, orders.us.east, and orders.us.east.nyc.
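The matching rules above can be sketched as a small standalone matcher. This is a simplified illustration of the semantics, not the server's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a subject matches a pattern using NATS
// wildcard rules: "*" matches exactly one token, ">" matches one or more
// trailing tokens. Simplified sketch only.
func subjectMatches(pattern, subject string) bool {
	pt := strings.Split(pattern, ".")
	st := strings.Split(subject, ".")
	for i, p := range pt {
		if p == ">" {
			return i < len(st) // ">" must match at least one token
		}
		if i >= len(st) {
			return false
		}
		if p != "*" && p != st[i] {
			return false
		}
	}
	return len(pt) == len(st)
}

func main() {
	fmt.Println(subjectMatches("orders.*.created", "orders.us.created"))      // true
	fmt.Println(subjectMatches("orders.*.created", "orders.us.east.created")) // false
	fmt.Println(subjectMatches("orders.>", "orders.us.east.nyc"))             // true
	fmt.Println(subjectMatches("orders.>", "orders"))                         // false
}
```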
2. NATS Core vs JetStream
NATS comes in two flavors: Core NATS and JetStream. Understanding the difference is crucial for choosing the right solution.
NATS Core
Core NATS is the original, in-memory messaging system. It's designed for speed and simplicity:
- At-most-once delivery: Messages are delivered zero or one time
- No persistence: Messages exist only in memory
- Fire-and-forget: Publishers don't wait for acknowledgments
- Best effort: If no subscriber exists, the message is dropped
Core NATS is perfect for real-time data, telemetry, and scenarios where losing occasional messages is acceptable.
JetStream
JetStream is NATS's persistence layer, added to support enterprise messaging patterns:
- At-least-once delivery: Messages are persisted and acknowledged
- Stream persistence: Messages stored on disk with configurable retention
- Consumer groups: Durable subscriptions with replay capabilities
- Exactly-once semantics: Via publish deduplication (the Nats-Msg-Id header) combined with double acknowledgments on the consumer side
JetStream adds the durability and reliability needed for event sourcing, CQRS, and mission-critical messaging.
| Feature | Core NATS | JetStream |
|---|---|---|
| Persistence | None (in-memory) | Disk-based with replication |
| Delivery guarantee | At-most-once | At-least-once |
| Message replay | No | Yes (by time or sequence) |
| Consumer durability | Ephemeral only | Durable with state |
| Retention policies | N/A | Limits, interest, work queue |
| Use case | Real-time, telemetry | Events, logs, commands |
3. Architecture and Topology
Single Server
The simplest NATS deployment is a single server:
# Start NATS server
nats-server
# With JetStream enabled
nats-server -js
This is suitable for development and small deployments. The server handles all connections, routing, and (with JetStream) persistence.
Cluster Mode
For production, run NATS in cluster mode:
# Server 1
nats-server --cluster_name my_cluster \
  --cluster nats://0.0.0.0:6222 \
  --routes nats://nats-2:6222,nats://nats-3:6222
# Server 2
nats-server --cluster_name my_cluster \
  --cluster nats://0.0.0.0:6222 \
  --routes nats://nats-1:6222,nats://nats-3:6222
# Server 3
nats-server --cluster_name my_cluster \
  --cluster nats://0.0.0.0:6222 \
  --routes nats://nats-1:6222,nats://nats-2:6222
In cluster mode:
- Clients connect to any server
- Messages are routed between servers
- Servers share connection state
- No single point of failure (with 3+ nodes)
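The same topology is usually expressed in a config file rather than flags. A sketch of one node's config (hostnames are placeholders; each node's route list points at its peers):

```
# nats-1.conf
cluster {
  name: my_cluster
  listen: 0.0.0.0:6222
  routes: [
    nats://nats-2:6222
    nats://nats-3:6222
  ]
}
```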
Superclusters
For geographically distributed deployments, NATS supports superclusters—federated clusters connected through gateway connections:
┌──────────────┐ Gateway ┌──────────────┐
│ Cluster │ <────────────────> │ Cluster │
│ (US-East) │ (TLS) │ (EU-West) │
└──────────────┘ └──────────────┘
│ │
│ Gateway │ Gateway
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Cluster │ <────────────────> │ Cluster │
│ (US-West) │ │ (APAC) │
└──────────────┘ └──────────────┘
Superclusters enable:
- Geographic distribution of workloads
- Disaster recovery across regions
- Data sovereignty compliance
- Reduced latency for local consumers
Leaf Nodes
Leaf nodes extend NATS to edge locations. They're lightweight NATS servers that connect to a cluster or supercluster:
┌─────────────────────────────────────┐
│ NATS Cluster │
│ ┌─────────┐ ┌─────────┐ │
│ │ Server │<────>│ Server │ │
│ │ 1 │ │ 2 │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ └────────────────┘ │
│ │ │
└──────────────┼───────────────────────┘
│ Leaf Node Connection
┌───────┴───────┐
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Leaf Node │ │ Leaf Node │
│ (Factory) │ │ (Store) │
└─────────────┘ └─────────────┘
Leaf nodes are ideal for:
- IoT gateways collecting sensor data
- Retail stores syncing with headquarters
- Mobile edge computing
- Air-gapped environments with periodic sync
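A minimal leaf node setup takes two config fragments, assuming the hub accepts leaf connections on the conventional port 7422 (the hostname is a placeholder):

```
# hub side: accept leaf connections
leafnodes {
  port: 7422
}

# edge side (leaf.conf): connect up to the hub
leafnodes {
  remotes = [
    { url: "nats://hub.example.com:7422" }
  ]
}
```

Clients at the edge connect to the leaf node as if it were any NATS server; the leaf transparently bridges traffic to the hub when connected.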
4. Getting Started with NATS
Installation
# macOS
brew install nats-server nats
# Linux (using official binaries)
curl -sf https://get-nats.io | sh
# Docker
docker run -p 4222:4222 nats:latest -js
Basic Pub/Sub in Go
package main
import (
"fmt"
"log"
"time"
"github.com/nats-io/nats.go"
)
func main() {
// Connect to NATS
nc, err := nats.Connect(nats.DefaultURL)
if err != nil {
log.Fatal(err)
}
defer nc.Close()
// Subscribe to subject
sub, err := nc.Subscribe("updates", func(m *nats.Msg) {
fmt.Printf("Received: %s\n", string(m.Data))
})
if err != nil {
log.Fatal(err)
}
defer sub.Unsubscribe()
// Publish messages
for i := 0; i < 5; i++ {
msg := fmt.Sprintf("Message %d", i)
nc.Publish("updates", []byte(msg))
time.Sleep(time.Second)
}
time.Sleep(2 * time.Second)
}
Request-Reply Pattern
// Server (responder)
nc.Subscribe("help.request", func(m *nats.Msg) {
// Process request
response := fmt.Sprintf("Help for: %s", string(m.Data))
// Reply to the reply subject (inbox)
m.Respond([]byte(response))
})
// Client (requestor)
msg, err := nc.Request("help.request", []byte("How do I..."), 5*time.Second)
if err != nil {
log.Fatal("Request failed:", err)
}
fmt.Printf("Response: %s\n", string(msg.Data))
Queue Groups
// Multiple subscribers in the same queue group
// Messages are distributed among them (load balancing)
// Worker 1
nc.QueueSubscribe("tasks", "workers", func(m *nats.Msg) {
fmt.Printf("Worker 1 processing: %s\n", string(m.Data))
})
// Worker 2
nc.QueueSubscribe("tasks", "workers", func(m *nats.Msg) {
fmt.Printf("Worker 2 processing: %s\n", string(m.Data))
})
// Worker 3
nc.QueueSubscribe("tasks", "workers", func(m *nats.Msg) {
fmt.Printf("Worker 3 processing: %s\n", string(m.Data))
})
// Messages sent to "tasks" are distributed across the group (roughly evenly, not strict round-robin)
Wildcards
// Subscribe to all US orders
nc.Subscribe("orders.us.*", func(m *nats.Msg) {
// Matches: orders.us.east, orders.us.west, orders.us.central
// Does NOT match: orders.us.east.nyc
})
// Subscribe to all orders recursively
nc.Subscribe("orders.>", func(m *nats.Msg) {
// Matches: orders.us, orders.us.east, orders.us.east.nyc
// Matches: orders.eu, orders.eu.germany, etc.
})
// Combine wildcards
nc.Subscribe("*.orders.>", func(m *nats.Msg) {
// Matches: us.orders.created, eu.orders.updated.processed
})
5. JetStream Deep Dive
Streams
A stream is a persistent message log. Messages are appended and can be replayed:
// Create a JetStream context
js, err := nc.JetStream()
if err != nil {
	log.Fatal(err)
}
// Create a stream
_, err = js.AddStream(&nats.StreamConfig{
	Name:      "ORDERS",
	Subjects:  []string{"orders.*"},
	Storage:   nats.FileStorage,  // or MemoryStorage
	Replicas:  3,                 // Replica count (requires a 3+ node cluster)
	Retention: nats.LimitsPolicy, // Discard old messages once limits are hit
	MaxMsgs:   1000000,           // Keep last 1M messages
	MaxBytes:  10 * 1024 * 1024 * 1024, // 10GB
})
if err != nil {
	log.Fatal(err)
}
// Publish to stream
ack, err := js.Publish("orders.us", []byte(`{"id": "123", "total": 99.99}`))
if err != nil {
log.Fatal(err)
}
fmt.Printf("Published to stream: %s, sequence: %d\n", ack.Stream, ack.Sequence)
Consumers
Consumers read from streams. They can be ephemeral or durable:
// Create a durable push consumer (subscribe by subject; "orders.*" binds to the ORDERS stream)
sub, err := js.Subscribe("orders.*", func(m *nats.Msg) {
	// Process message
	fmt.Printf("Order: %s\n", string(m.Data))
	// Acknowledge successful processing
	m.Ack()
},
	nats.Durable("order-processor"), // Durable name
	nats.ManualAck(),                // Manual acknowledgment
	nats.MaxDeliver(3),              // At most 3 delivery attempts
	nats.AckWait(30*time.Second))    // Redeliver if not acked within 30s
if err != nil {
	log.Fatal(err)
}
// Create a pull consumer (batch processing)
sub, err = js.PullSubscribe("orders.*", "batch-processor")
if err != nil {
	log.Fatal(err)
}
// Fetch a batch of messages
msgs, err := sub.Fetch(100)
if err != nil {
	log.Fatal(err)
}
for _, msg := range msgs {
	// Process
	msg.Ack()
}
Acknowledgment Policies
| Policy | Behavior | Use Case |
|---|---|---|
| AckExplicit | Client must explicitly ack each message | Reliable processing |
| AckAll | Acking message N acks all up to N | Batch processing |
| AckNone | No acknowledgment needed | Fire-and-forget |
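The difference between AckExplicit and AckAll comes down to bookkeeping over sequence numbers; a plain-Go sketch of the two behaviors, no server involved (JetStream tracks this state server-side):

```go
package main

import "fmt"

// ackState tracks which stream sequences a consumer has acknowledged,
// under either explicit or cumulative ("ack all") semantics.
// Illustrative only.
type ackState struct {
	acked map[uint64]bool
}

func newAckState() *ackState { return &ackState{acked: make(map[uint64]bool)} }

// ackExplicit acknowledges exactly one sequence.
func (a *ackState) ackExplicit(seq uint64) { a.acked[seq] = true }

// ackAll acknowledges every sequence up to and including seq.
func (a *ackState) ackAll(seq uint64) {
	for s := uint64(1); s <= seq; s++ {
		a.acked[s] = true
	}
}

func main() {
	explicit := newAckState()
	explicit.ackExplicit(3)
	fmt.Println(explicit.acked[1], explicit.acked[3]) // false true

	cumulative := newAckState()
	cumulative.ackAll(3)
	fmt.Println(cumulative.acked[1], cumulative.acked[3]) // true true
}
```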
Replay Policies
// Replay from the beginning
sub, _ := js.Subscribe("orders.*", handler,
	nats.ReplayInstant(), // Default: deliver as fast as possible
	nats.DeliverAll())    // Start from the first message (the default deliver policy)
// Replay from a specific time
sub, _ = js.Subscribe("orders.*", handler,
	nats.StartTime(time.Now().Add(-24*time.Hour)))
// Replay from a specific sequence
sub, _ = js.Subscribe("orders.*", handler,
	nats.StartSequence(1000))
// Only new messages
sub, _ = js.Subscribe("orders.*", handler,
	nats.DeliverNew())
Retention Policies
// LimitsPolicy: Keep messages until limits reached (default)
js.AddStream(&nats.StreamConfig{
Name: "EVENTS",
Retention: nats.LimitsPolicy,
MaxMsgs: 1000000,
MaxBytes: 1 * 1024 * 1024 * 1024, // 1GB
})
// InterestPolicy: Delete when all consumers have acked
js.AddStream(&nats.StreamConfig{
Name: "COMMANDS",
Retention: nats.InterestPolicy,
})
// WorkQueuePolicy: Delete after successful processing
// Only one consumer can process each message
js.AddStream(&nats.StreamConfig{
Name: "JOBS",
Retention: nats.WorkQueuePolicy,
})
6. Real-World Use Cases
Microservices Communication
NATS serves as the nervous system for microservices:
// Service Discovery
nc.Subscribe("discovery.register", func(m *nats.Msg) {
var svc ServiceInfo
json.Unmarshal(m.Data, &svc)
registry.Register(svc)
})
// Load Balanced RPC
nc.QueueSubscribe("payments.process", "payment-workers", func(m *nats.Msg) {
var payment PaymentRequest
json.Unmarshal(m.Data, &payment)
result := processPayment(payment)
response, _ := json.Marshal(result)
m.Respond(response)
})
// Event Broadcasting
nc.Publish("inventory.updated", eventData)
// Multiple services react: cache invalidation, search index update, analytics
IoT and Telemetry
NATS handles millions of IoT devices:
// Device telemetry (Core NATS - fire and forget)
deviceID := "sensor-001"
nc.Publish(fmt.Sprintf("telemetry.%s", deviceID), []byte(`{
"temperature": 23.5,
"humidity": 65,
"timestamp": "2026-03-17T10:00:00Z"
}`))
// Device commands (Request-Reply)
response, err := nc.Request(
fmt.Sprintf("commands.%s", deviceID),
[]byte(`{"action": "reboot"}`),
5*time.Second)
// Alert processing (JetStream for durability)
js.Publish("alerts.critical", []byte(`{
"device": "sensor-001",
"alert": "temperature_exceeded",
"value": 85.2
}`))
Event Sourcing and CQRS
// Command side: Store events
func (s *OrderService) CreateOrder(cmd CreateOrderCommand) error {
event := OrderCreatedEvent{
OrderID: uuid.New(),
CustomerID: cmd.CustomerID,
Items: cmd.Items,
Timestamp: time.Now(),
}
data, _ := json.Marshal(event)
_, err := s.js.Publish("events.order.created", data)
return err
}
// Query side: Project read models
sub, _ := s.js.Subscribe("events.order.>", func(m *nats.Msg) {
	var event Event
	json.Unmarshal(m.Data, &event)
	switch event.Type {
	case "OrderCreated":
		projection.HandleOrderCreated(event)
	case "OrderShipped":
		projection.HandleOrderShipped(event)
	}
	m.Ack()
}, nats.Durable("order-projection"), nats.ManualAck())
// Replay to rebuild projections
sub, _ = s.js.Subscribe("events.order.>", handler,
nats.DeliverAll(), // Replay all events
nats.ReplayOriginal()) // Maintain original timing
Distributed Transactions (Saga Pattern)
// Saga orchestrator
type OrderSaga struct {
steps []SagaStep
}
func (s *OrderSaga) Execute(order Order) error {
	// Step 1: Reserve inventory (nothing to undo if this fails)
	if err := s.reserveInventory(order); err != nil {
		return fmt.Errorf("saga failed at step 1: %w", err)
	}
	// Step 2: Process payment (undo step 1 on failure)
	if err := s.processPayment(order); err != nil {
		return s.compensate(order, 0)
	}
	// Step 3: Ship order (undo steps 2 and 1 on failure)
	if err := s.shipOrder(order); err != nil {
		return s.compensate(order, 1)
	}
	return nil
}
func (s *OrderSaga) compensate(order Order, lastCompleted int) error {
	// Undo completed steps in reverse order
	for i := lastCompleted; i >= 0; i-- {
		s.steps[i].Compensate(order)
	}
	return fmt.Errorf("saga failed after step %d", lastCompleted+1)
}
// Services communicate via NATS for saga coordination
nc.Subscribe("saga.inventory.reserve", handleReserveInventory)
nc.Subscribe("saga.inventory.release", handleReleaseInventory) // Compensation
nc.Subscribe("saga.payment.process", handleProcessPayment)
nc.Subscribe("saga.payment.refund", handleRefundPayment) // Compensation
7. NATS vs Kafka vs RabbitMQ
| Aspect | NATS | Kafka | RabbitMQ |
|---|---|---|---|
| Protocol | Text-based, simple | Binary (TCP) | AMQP |
| Throughput | Millions msg/sec | Millions msg/sec | Hundreds of thousands |
| Latency | Sub-millisecond | Low (ms) | Low (ms) |
| Persistence | Optional (JetStream) | Always (disk) | Optional (queues) |
| Deployment | Single binary | ZooKeeper/KRaft + Brokers | Single binary or cluster |
| Resource usage | Low | High (JVM) | Medium |
| Multi-tenancy | Accounts (JWT) | Limited | Virtual hosts |
| Geo-replication | Superclusters | MirrorMaker | Federation/Shovel |
| Best for | Cloud-native, IoT, control plane | Big data, event sourcing | Enterprise messaging |
When to Choose Each
Choose NATS when:
- You need a lightweight, fast messaging system
- You're building cloud-native microservices
- You want simple operations (single binary)
- You need flexible deployment (edge to cloud)
- You want both pub/sub and persistence in one system
Choose Kafka when:
- You're processing massive data streams (TB/day)
- You need long-term message retention
- You have a data engineering team
- You need stream processing (Kafka Streams, ksqlDB)
Choose RabbitMQ when:
- You need complex routing (exchanges, bindings)
- You have existing AMQP infrastructure
- You need advanced queue features (priority, TTL, dead letter)
- You're in an enterprise environment with AMQP expertise
8. Production Deployment
Docker Compose Setup
# docker-compose.yml
version: '3.8'
services:
nats-1:
image: nats:2.10-alpine
command: >
--name nats-1
--cluster_name my_cluster
--cluster nats://0.0.0.0:6222
--routes nats://nats-2:6222,nats://nats-3:6222
--js
--store_dir /data
--http_port 8222
volumes:
- nats-1-data:/data
ports:
- "4222:4222"
- "8222:8222"
networks:
- nats
nats-2:
image: nats:2.10-alpine
command: >
--name nats-2
--cluster_name my_cluster
--cluster nats://0.0.0.0:6222
--routes nats://nats-1:6222,nats://nats-3:6222
--js
--store_dir /data
volumes:
- nats-2-data:/data
networks:
- nats
nats-3:
image: nats:2.10-alpine
command: >
--name nats-3
--cluster_name my_cluster
--cluster nats://0.0.0.0:6222
--routes nats://nats-1:6222,nats://nats-2:6222
--js
--store_dir /data
volumes:
- nats-3-data:/data
networks:
- nats
volumes:
nats-1-data:
nats-2-data:
nats-3-data:
networks:
nats:
driver: bridge
Kubernetes Deployment
# nats-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: nats
spec:
serviceName: nats
replicas: 3
selector:
matchLabels:
app: nats
template:
metadata:
labels:
app: nats
spec:
containers:
- name: nats
image: nats:2.10-alpine
command:
- nats-server
- --config
- /etc/nats/nats.conf
ports:
- containerPort: 4222
name: client
- containerPort: 6222
name: cluster
- containerPort: 8222
name: monitor
volumeMounts:
- name: config
mountPath: /etc/nats
- name: data
mountPath: /data
volumes:
- name: config
configMap:
name: nats-config
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
name: nats-config
data:
nats.conf: |
listen: 0.0.0.0:4222
http: 0.0.0.0:8222
jetstream {
store_dir: /data
max_memory_store: 1GB
max_file_store: 10GB
}
cluster {
name: my_cluster
listen: 0.0.0.0:6222
routes: [
nats://nats-0.nats:6222
nats://nats-1.nats:6222
nats://nats-2.nats:6222
]
}
tls {
cert_file: /etc/nats/certs/server.crt
key_file: /etc/nats/certs/server.key
ca_file: /etc/nats/certs/ca.crt
}
---
apiVersion: v1
kind: Service
metadata:
name: nats
spec:
selector:
app: nats
ports:
- name: client
port: 4222
targetPort: 4222
- name: cluster
port: 6222
targetPort: 6222
clusterIP: None # Headless for StatefulSet
TLS Configuration
// nats.conf with TLS
tls {
cert_file: "/etc/nats/certs/server.crt"
key_file: "/etc/nats/certs/server.key"
ca_file: "/etc/nats/certs/ca.crt"
verify: true
verify_and_map: true # Map certificate to user
}
// Client connection with TLS
nc, err := nats.Connect("tls://nats.example.com:4222",
nats.ClientCert("/certs/client.crt", "/certs/client.key"),
nats.RootCAs("/certs/ca.crt"),
nats.Secure())
if err != nil {
log.Fatal(err)
}
// mTLS with client certificate plus JWT user credentials
nc, err = nats.Connect("tls://nats.example.com:4222",
	nats.ClientCert("/certs/client.crt", "/certs/client.key"),
	nats.RootCAs("/certs/ca.crt"),
	nats.Secure(),
	nats.UserCredentials("/creds/user.creds"))
9. Monitoring and Observability
Built-in Monitoring
NATS exposes metrics via HTTP endpoint:
# Enable monitoring
nats-server -m 8222
# Access metrics
curl http://localhost:8222/varz
curl http://localhost:8222/connz
curl http://localhost:8222/routez
curl http://localhost:8222/subsz
# JetStream specific
curl http://localhost:8222/jsz
Prometheus Integration
# prometheus.yml — scrape the NATS Prometheus exporter, not the server directly
scrape_configs:
  - job_name: 'nats'
    static_configs:
      - targets: ['nats-exporter:7777'] # exporter's default port
# Run the NATS Prometheus exporter against the server's monitoring endpoint
# https://github.com/nats-io/prometheus-nats-exporter
prometheus-nats-exporter -varz -connz -routez -subz -jsz=all http://nats:8222
Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| nats_connections | Active client connections | > 10000 |
| nats_subscriptions | Total subscriptions | > 100000 |
| nats_in_msgs | Messages received | Baseline + 50% |
| nats_slow_consumers | Slow consumers (dropped messages) | > 0 |
| jetstream_storage_used_bytes | JetStream disk usage | > 80% |
| jetstream_consumer_pending | Pending messages per consumer | > 10000 |
Distributed Tracing
// Instrument NATS with OpenTelemetry
import (
	"context"

	"github.com/nats-io/nats.go"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)
// Publish with trace context
func publishWithTrace(ctx context.Context, nc *nats.Conn, subject string, data []byte) error {
	tracer := otel.Tracer("nats-publisher")
	ctx, span := tracer.Start(ctx, "nats.publish")
	defer span.End()
	// Inject trace context into message headers
	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)
	header := nats.Header{}
	for k, v := range carrier {
		header.Set(k, v)
	}
	msg := &nats.Msg{
		Subject: subject,
		Data:    data,
		Header:  header,
	}
	return nc.PublishMsg(msg)
}
}
// Subscribe and extract trace context
nc.Subscribe("orders.>", func(m *nats.Msg) {
	// nats.Header and http.Header share the map[string][]string shape,
	// so HeaderCarrier can wrap the message headers directly
	ctx := otel.GetTextMapPropagator().Extract(context.Background(), propagation.HeaderCarrier(m.Header))
	tracer := otel.Tracer("nats-subscriber")
	_, span := tracer.Start(ctx, "nats.process")
	defer span.End()
	// Process message with trace context
	processOrder(m.Data)
})
NATS combines simplicity, performance, and flexibility in a way that makes it ideal for modern distributed systems. Whether you need fire-and-forget messaging for IoT or durable streams for event sourcing, NATS delivers with minimal operational overhead. Its cloud-native design, from single binary deployment to supercluster federation, makes it a future-proof choice for your messaging infrastructure.