NATS Connection Management and Resilience
A detailed guide to managing NATS connections in production TypeScript applications, covering reconnection strategies, cluster awareness, connection draining, and error handling patterns.
A messaging system is only as reliable as its connections. In distributed systems, network partitions happen, servers restart, and DNS entries change. The difference between a fragile application and a resilient one often comes down to how well it handles these inevitable disruptions. The node-nats client library provides sophisticated connection management capabilities that, when configured correctly, make your applications remarkably tolerant of infrastructure instability.
This article covers the full lifecycle of a NATS connection: establishing it, keeping it alive, recovering from failures, and shutting it down gracefully. Each section includes practical TypeScript examples and the reasoning behind configuration choices that matter in production.
Establishing Connections
The simplest NATS connection requires only a server address, but production configurations need more thought. The connect function accepts a rich set of options that control everything from authentication to buffer sizes:
import { connect, NatsConnection } from "nats";
async function createConnection(): Promise<NatsConnection> {
const nc = await connect({
servers: [
"nats://nats-1.internal:4222",
"nats://nats-2.internal:4222",
"nats://nats-3.internal:4222",
],
name: "order-service-prod",
maxReconnectAttempts: -1, // Unlimited reconnection attempts
reconnectTimeWait: 1000, // 1 second between attempts
reconnectJitter: 500, // Add 0-500ms random jitter
reconnectJitterTLS: 1000, // More jitter for TLS connections
timeout: 5000, // Connection timeout
pingInterval: 30_000, // Ping every 30 seconds
maxPingOut: 3, // Disconnect after 3 missed pongs
token: process.env.NATS_TOKEN,
});
console.log(`Connected to ${nc.getServer()}`);
return nc;
}
Providing multiple server addresses is the first layer of resilience. The client will attempt to connect to each server in order, and if the first one is unavailable, it moves to the next. During reconnection, the client cycles through all known servers, including any that were discovered through cluster gossip.
The name option identifies this connection in NATS server monitoring tools and logs. In a system with dozens of services connecting to the same NATS cluster, meaningful connection names are invaluable for debugging. We recommend including the service name and environment: order-service-prod, payment-gateway-staging.
The maxReconnectAttempts: -1 setting tells the client to retry indefinitely. In a production environment where the NATS cluster is expected to be available, this is usually the right choice. The alternative, a finite number of attempts, risks your service permanently disconnecting from the messaging infrastructure due to a transient issue.
Connection Events and Monitoring
The node-nats client emits a series of lifecycle events that your application should listen to for monitoring and operational awareness:
import { DebugEvents, Events, NatsConnection } from "nats";
async function monitorConnection(nc: NatsConnection) {
// Status iterator provides all connection lifecycle events
const statusIterator = nc.status();
(async () => {
for await (const status of statusIterator) {
switch (status.type) {
case Events.Reconnect:
console.log(`Reconnected to ${status.data}`);
metrics.increment("nats.reconnect");
break;
case Events.Disconnect:
console.warn(`Disconnected from NATS server`);
metrics.increment("nats.disconnect");
alertOps("NATS disconnected", { service: "order-service" });
break;
case Events.Update:
console.log(`Server list updated: ${JSON.stringify(status.data)}`);
break;
case Events.LDM:
console.warn("Server is entering lame duck mode, will disconnect");
metrics.increment("nats.lame_duck");
break;
case Events.Error:
console.error("NATS error:", status.data);
metrics.increment("nats.error");
break;
case DebugEvents.Reconnecting:
console.log("Attempting to reconnect...");
break;
case DebugEvents.StaleConnection:
console.warn("Connection is stale, reconnecting");
break;
default:
console.log(`NATS status: ${status.type}`, status.data);
}
}
})();
}
The LDM (Lame Duck Mode) event deserves special attention. When a NATS server enters lame duck mode, typically during a rolling restart, it signals connected clients to migrate to other servers. The node-nats client handles this automatically by reconnecting to another server in the cluster. Your application receives the LDM event as an informational signal, not as something you need to act on. This mechanism enables zero-downtime NATS cluster upgrades.
The Disconnect event fires when the client loses its connection. The Reconnecting debug event fires on each reconnection attempt. The Reconnect event fires when a connection is successfully re-established. Logging these events and exposing them as metrics gives your operations team visibility into the health of the messaging layer.
Reconnection Strategies
The default reconnection behavior in node-nats is already good, but understanding the configuration options lets you tune it for your specific environment:
import { connect, NatsConnection } from "nats";
async function createResilientConnection(): Promise<NatsConnection> {
const nc = await connect({
servers: "nats://nats.internal:4222",
// Reconnection timing
reconnectTimeWait: 2000, // Base wait between attempts
reconnectJitter: 1000, // Random jitter added to base wait
reconnectJitterTLS: 2000, // Extra jitter for TLS handshakes
maxReconnectAttempts: -1, // Never give up
// Buffer management during disconnections
reconnect: true, // Enable reconnection (default: true)
// Ping-based health detection
pingInterval: 20_000, // Check server health every 20s
maxPingOut: 2, // Mark stale after 2 missed pongs
// Connection identification
name: "payment-service",
verbose: false,
noEcho: true, // Don't receive our own published messages
});
return nc;
}
The reconnection jitter is crucial in clustered environments. Without jitter, all clients would attempt to reconnect simultaneously after a server restart, creating a thundering herd that could overwhelm the recovering server. The jitter spreads reconnection attempts over a time window, allowing the server to accept connections gradually.
The pingInterval and maxPingOut settings form a dead-connection detection mechanism. The client sends a PING to the server at the configured interval. If the number of unanswered PINGs exceeds maxPingOut, the client considers the connection stale and initiates reconnection. Lower values detect failures faster but generate more network traffic. For most production deployments, a 20-30 second ping interval with 2-3 maximum outstanding pings strikes a good balance.
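As a back-of-the-envelope check on these settings, the effective timings can be sketched as pure functions. This is illustrative only; the helpers below are not part of the nats API, and the client computes its own internal delays:

```typescript
// Illustrative sketch of the timing math behind the options above; these
// helper functions are assumptions for explanation, not nats client code.

// Each reconnect attempt waits the base time plus a uniformly random jitter,
// so a fleet of clients restarting at once spreads out over the jitter window.
function reconnectDelayMs(reconnectTimeWait: number, reconnectJitter: number): number {
  return reconnectTimeWait + Math.floor(Math.random() * reconnectJitter);
}

// A dead connection is detected after at most pingInterval * maxPingOut ms:
// with a 20s interval and maxPingOut of 2, roughly 40 seconds worst case.
function worstCaseDetectionMs(pingInterval: number, maxPingOut: number): number {
  return pingInterval * maxPingOut;
}
```

With reconnectTimeWait: 2000 and reconnectJitter: 1000, each attempt waits somewhere between 2 and 3 seconds, which is why the herd thins out instead of hammering the recovering server in lockstep.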
When the connection is down, the client buffers published messages up to a configurable limit. This means short disconnections are completely transparent to your publishing code. However, if the buffer fills up (for example, during an extended outage while publishing at high volume), subsequent publish calls will fail. You should handle this in your publishing logic:
async function safePublish(
nc: NatsConnection,
subject: string,
data: Uint8Array
): Promise<boolean> {
try {
nc.publish(subject, data);
return true;
} catch (error) {
if (error instanceof Error && error.message.includes("CONNECTION_CLOSED")) {
console.error("Cannot publish: connection is closed");
metrics.increment("nats.publish.failed.connection_closed");
return false;
}
throw error;
}
}
Cluster-Aware Connections
NATS clusters support automatic server discovery through gossip. When you connect to one server in a cluster, it informs the client about all other servers. If the initial server goes down, the client can reconnect to any other server in the cluster, even ones that were not in the original connection configuration:
import { connect, NatsConnection } from "nats";
async function connectToCluster(): Promise<NatsConnection> {
const nc = await connect({
// Start with a seed server; cluster gossip will reveal others
servers: "nats://nats-seed.internal:4222",
// Allow connections to servers discovered through gossip
noRandomize: false, // Randomize server selection (default)
ignoreClusterUpdates: false, // Accept new servers from gossip (default)
});
console.log(`Connected to: ${nc.getServer()}`);
console.log(`Known servers: ${JSON.stringify(nc.info?.connect_urls)}`);
return nc;
}
The noRandomize: false setting (the default) ensures that the client selects servers randomly during reconnection. This distributes clients evenly across cluster members rather than having all clients pile onto the same server. In a three-node cluster, this means roughly one-third of your clients connect to each node.
For geographically distributed clusters, you might want to prefer local servers. While the node-nats client does not have built-in geographic awareness, you can achieve this by ordering the servers array with local servers first and setting noRandomize: true. The client will attempt servers in order and stick with the first successful connection:
const nc = await connect({
servers: [
// Local servers first
"nats://nats-eu-west-1.internal:4222",
"nats://nats-eu-west-2.internal:4222",
// Remote servers as fallback
"nats://nats-us-east-1.internal:4222",
"nats://nats-us-east-2.internal:4222",
],
noRandomize: true,
});
Graceful Shutdown with Drain
Abruptly closing a NATS connection can lead to lost messages, particularly for JetStream consumers that have fetched messages but not yet acknowledged them. The drain method provides a graceful shutdown sequence that ensures all in-flight work completes:
import { connect, NatsConnection, Subscription } from "nats";
let nc: NatsConnection;
async function startService() {
nc = await connect({ servers: "nats://localhost:4222" });
// Set up subscriptions
const sub = nc.subscribe("tasks.>", { queue: "workers" });
processMessages(sub);
// Handle shutdown signals
const shutdown = async () => {
console.log("Shutting down gracefully...");
// drain() does the following in order:
// 1. Unsubscribes all subscriptions (stops receiving new messages)
// 2. Waits for all pending message handlers to complete
// 3. Flushes any buffered outgoing messages
// 4. Closes the connection
await nc.drain();
console.log("NATS connection drained and closed");
process.exit(0);
};
process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
}
async function processMessages(sub: Subscription) {
for await (const msg of sub) {
// This handler will complete even during drain
const data = JSON.parse(new TextDecoder().decode(msg.data));
await processTask(data);
if (msg.reply) {
msg.respond(new TextEncoder().encode(JSON.stringify({ status: "done" })));
}
}
// The for-await loop exits when the subscription is drained
console.log("Subscription drained");
}
The drain sequence is particularly important in Kubernetes environments where pods receive a SIGTERM signal before being terminated. By calling nc.drain() in your SIGTERM handler, you ensure that the pod completes all in-flight message processing before shutting down. Configure your Kubernetes terminationGracePeriodSeconds to be longer than your longest expected message processing time.
You can also drain individual subscriptions without closing the entire connection. This is useful when you want to stop processing one type of work while continuing others:
async function scaleDownOrderProcessing(sub: Subscription) {
// Stop receiving new orders but finish processing current ones
await sub.drain();
console.log("Order processing stopped, other subscriptions still active");
}
Error Handling Patterns
Robust error handling is non-negotiable in production messaging. The node-nats client surfaces errors through several mechanisms, and your application should handle all of them:
import { connect, ErrorCode, NatsConnection, NatsError } from "nats";
async function robustConnection() {
let nc: NatsConnection;
try {
nc = await connect({
servers: "nats://nats.internal:4222",
maxReconnectAttempts: 10,
});
} catch (error) {
// Initial connection failed after all retries
console.error("Failed to connect to NATS:", error);
process.exit(1);
}
// Handle connection closure
nc.closed().then((err) => {
if (err) {
console.error("NATS connection closed with error:", err);
process.exit(1);
}
console.log("NATS connection closed cleanly");
});
// Handle subscription errors
const sub = nc.subscribe("tasks.>");
(async () => {
for await (const msg of sub) {
try {
await processMessage(msg);
} catch (error) {
// Application-level error -- don't let it kill the subscription
console.error("Error processing message:", error);
metrics.increment("messages.processing_error");
}
}
})();
// Handle JetStream publish errors
const js = nc.jetstream();
try {
await js.publish("events.important", new TextEncoder().encode("data"));
} catch (error) {
if (error instanceof NatsError) {
if (error.code === ErrorCode.NoResponders) {
console.error("No JetStream stream is capturing this subject");
} else if (error.code === ErrorCode.Timeout) {
console.error("JetStream publish acknowledgment timed out");
}
}
throw error;
}
}
The nc.closed() promise is your last line of defense. It resolves when the connection is permanently closed, either through an explicit close() or drain() call, or because the client exhausted its reconnection attempts. If it resolves with an error, something went wrong that the client could not recover from. In most production deployments, this should trigger an alert and a process restart.
Practical Tips for Production Connections
Always set the connection name to something meaningful. When you have 50 services connecting to a NATS cluster, nats server list --sort=name becomes your best diagnostic tool.
Monitor the reconnect and disconnect metrics closely. A steady trickle of reconnections might indicate network instability. A sudden spike across all services points to a NATS server issue.
Use noEcho: true when your service publishes and subscribes to the same subjects. Without it, the service receives its own messages, which is rarely the desired behavior and wastes processing time.
Set maxReconnectAttempts: -1 in production but use a finite value (like 10) in development. In development, you want fast failure feedback; in production, you want infinite persistence.
Test your drain handling under load. Publish messages at a high rate, trigger a SIGTERM, and verify that all in-flight messages complete processing and no messages are lost. This test has caught subtle bugs in almost every messaging application we have built.
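The environment-dependent maxReconnectAttempts policy above can be encoded in one place. This is a sketch that assumes NODE_ENV distinguishes your environments; the helper name is hypothetical:

```typescript
// Hypothetical helper encoding the per-environment reconnect policy;
// using NODE_ENV as the switch is an assumption about your deployment.
function reconnectAttemptsFor(env: string | undefined): number {
  // -1 means retry forever; 10 attempts gives fast failure feedback locally.
  return env === "production" ? -1 : 10;
}
```

You would then pass maxReconnectAttempts: reconnectAttemptsFor(process.env.NODE_ENV) in the connect options, so the policy lives next to the rest of your connection configuration.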
Conclusion
Connection management is the unglamorous but essential foundation of reliable NATS-based systems. The node-nats client provides all the primitives you need: automatic reconnection with jitter, cluster-aware server discovery, graceful drain shutdown, and comprehensive event monitoring. By configuring these capabilities correctly and handling errors at every level, you build applications that survive the inevitable disruptions of distributed infrastructure. The investment in robust connection management pays dividends every time a server restarts, a network flaps, or a cluster performs a rolling upgrade, and your application continues processing messages without missing a beat.
Related Articles
Operating NATS in Production: Monitoring and Scaling
A practical operations guide for running NATS in production environments, covering monitoring strategies, capacity planning, scaling patterns, upgrade procedures, and incident response for engineering and platform teams.
Messaging Architecture for Fintech Systems
A strategic guide to designing messaging architectures for financial technology systems, covering regulatory requirements, data consistency patterns, auditability, and the role of NATS in building compliant, resilient fintech infrastructure.
Securing NATS: Authentication and Authorization
A comprehensive guide to securing NATS deployments with authentication mechanisms, fine-grained authorization, TLS encryption, and account-based multi-tenancy, with practical TypeScript client configuration examples.