NATS Connection Management and Resilience
A detailed guide to managing NATS connections in production TypeScript applications, covering reconnection strategies, cluster awareness, connection draining, and error handling patterns.
A messaging system is only as reliable as its connections. In distributed systems, network partitions happen, servers restart, and DNS entries change. The difference between a fragile application and a resilient one often comes down to how well it handles these inevitable disruptions. The node-nats client library provides sophisticated connection management capabilities that, when configured correctly, make your applications remarkably tolerant of infrastructure instability.
This article covers the full lifecycle of a NATS connection: establishing it, keeping it alive, recovering from failures, and shutting it down gracefully. Each section includes practical TypeScript examples and the reasoning behind configuration choices that matter in production.
Establishing Connections
The simplest NATS connection requires only a server address, but production configurations need more thought. The connect function accepts a rich set of options that control everything from authentication to buffer sizes:
import { connect, NatsConnection } from "nats";
async function createConnection(): Promise<NatsConnection> {
const nc = await connect({
servers: [
"nats://nats-1.internal:4222",
"nats://nats-2.internal:4222",
"nats://nats-3.internal:4222",
],
name: "order-service-prod",
maxReconnectAttempts: -1, // Unlimited reconnection attempts
reconnectTimeWait: 1000, // 1 second between attempts
reconnectJitter: 500, // Add 0-500ms random jitter
reconnectJitterTLS: 1000, // More jitter for TLS connections
timeout: 5000, // Connection timeout
pingInterval: 30_000, // Ping every 30 seconds
maxPingOut: 3, // Disconnect after 3 missed pongs
token: process.env.NATS_TOKEN,
});
console.log(`Connected to ${nc.getServer()}`);
return nc;
}
Providing multiple server addresses is the first layer of resilience. The client will attempt to connect to each server in order, and if the first one is unavailable, it moves to the next. During reconnection, the client cycles through all known servers, including any that were discovered through cluster gossip.
The name option identifies this connection in NATS server monitoring tools and logs. In a system with dozens of services connecting to the same NATS cluster, meaningful connection names are invaluable for debugging. We recommend including the service name and environment: order-service-prod, payment-gateway-staging.
The maxReconnectAttempts: -1 setting tells the client to retry indefinitely. In a production environment where the NATS cluster is expected to be available, this is usually the right choice. The alternative, a finite number of attempts, risks your service permanently disconnecting from the messaging infrastructure due to a transient issue.
Connection Events and Monitoring
The node-nats client emits a series of lifecycle events that your application should listen to for monitoring and operational awareness:
import { DebugEvents, Events, NatsConnection } from "nats";
async function monitorConnection(nc: NatsConnection) {
// Status iterator provides all connection lifecycle events
const statusIterator = nc.status();
(async () => {
for await (const status of statusIterator) {
switch (status.type) {
case Events.Reconnect:
console.log(`Reconnected to ${status.data}`);
metrics.increment("nats.reconnect");
break;
case Events.Disconnect:
console.warn(`Disconnected from NATS server`);
metrics.increment("nats.disconnect");
alertOps("NATS disconnected", { service: "order-service" });
break;
case Events.Update:
console.log(`Server list updated: ${JSON.stringify(status.data)}`);
break;
case Events.LDM:
console.warn("Server is entering lame duck mode, will disconnect");
metrics.increment("nats.lame_duck");
break;
case Events.Error:
console.error("NATS error:", status.data);
metrics.increment("nats.error");
break;
case DebugEvents.Reconnecting:
console.log("Attempting to reconnect...");
break;
case DebugEvents.StaleConnection:
console.warn("Connection is stale, reconnecting");
break;
default:
console.log(`NATS status: ${status.type}`, status.data);
}
}
})();
}
The LDM (Lame Duck Mode) event deserves special attention. When a NATS server enters lame duck mode, typically during a rolling restart, it signals connected clients to migrate to other servers. The node-nats client handles this automatically by reconnecting to another server in the cluster. Your application receives the LDM event as an informational signal, not as something you need to act on. This mechanism enables zero-downtime NATS cluster upgrades.
The Disconnect event fires when the client loses its connection. The Reconnecting debug event fires on each reconnection attempt. The Reconnect event fires when a connection is successfully re-established. Logging these events and exposing them as metrics gives your operations team visibility into the health of the messaging layer.
Reconnection Strategies
The default reconnection behavior in node-nats is already good, but understanding the configuration options lets you tune it for your specific environment:
import { connect, NatsConnection } from "nats";
async function createResilientConnection(): Promise<NatsConnection> {
const nc = await connect({
servers: "nats://nats.internal:4222",
// Reconnection timing
reconnectTimeWait: 2000, // Base wait between attempts
reconnectJitter: 1000, // Random jitter added to base wait
reconnectJitterTLS: 2000, // Extra jitter for TLS handshakes
maxReconnectAttempts: -1, // Never give up
// Buffer management during disconnections
reconnect: true, // Enable reconnection (default: true)
// Ping-based health detection
pingInterval: 20_000, // Check server health every 20s
maxPingOut: 2, // Mark stale after 2 missed pongs
// Connection identification
name: "payment-service",
verbose: false,
noEcho: true, // Don't receive our own published messages
});
return nc;
}
The reconnection jitter is crucial in clustered environments. Without jitter, all clients would attempt to reconnect simultaneously after a server restart, creating a thundering herd that could overwhelm the recovering server. The jitter spreads reconnection attempts over a time window, allowing the server to accept connections gradually.
The pingInterval and maxPingOut settings form a dead-connection detection mechanism. The client sends a PING to the server at the configured interval. If the number of unanswered PINGs exceeds maxPingOut, the client considers the connection stale and initiates reconnection. Lower values detect failures faster but generate more network traffic. For most production deployments, a 20-30 second ping interval with 2-3 maximum outstanding pings strikes a good balance.
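As a back-of-the-envelope check on these settings, the effective timings can be sketched as pure functions. This is illustrative only; the helpers below are not part of the nats API, and the client computes its own internal delays:

```typescript
// Illustrative sketch of the timing math behind the options above; these
// helper functions are assumptions for explanation, not nats client code.

// Each reconnect attempt waits the base time plus a uniformly random jitter,
// so a fleet of clients restarting at once spreads out over the jitter window.
function reconnectDelayMs(reconnectTimeWait: number, reconnectJitter: number): number {
  return reconnectTimeWait + Math.floor(Math.random() * reconnectJitter);
}

// A dead connection is detected after at most pingInterval * maxPingOut ms:
// with a 20s interval and maxPingOut of 2, roughly 40 seconds worst case.
function worstCaseDetectionMs(pingInterval: number, maxPingOut: number): number {
  return pingInterval * maxPingOut;
}
```

With reconnectTimeWait: 2000 and reconnectJitter: 1000, each attempt waits somewhere between 2 and 3 seconds, which is why the herd thins out instead of hammering the recovering server in lockstep.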
When the connection is down, the client buffers published messages up to a configurable limit. This means short disconnections are completely transparent to your publishing code. However, if the buffer fills up (for example, during an extended outage while publishing at high volume), subsequent publish calls will fail. You should handle this in your publishing logic:
async function safePublish(
nc: NatsConnection,
subject: string,
data: Uint8Array
): Promise<boolean> {
try {
nc.publish(subject, data);
return true;
} catch (error) {
if (error instanceof Error && error.message.includes("CONNECTION_CLOSED")) {
console.error("Cannot publish: connection is closed");
metrics.increment("nats.publish.failed.connection_closed");
return false;
}
throw error;
}
}
Cluster-Aware Connections
NATS clusters support automatic server discovery through gossip. When you connect to one server in a cluster, it informs the client about all other servers. If the initial server goes down, the client can reconnect to any other server in the cluster, even ones that were not in the original connection configuration:
import { connect, NatsConnection } from "nats";
async function connectToCluster(): Promise<NatsConnection> {
const nc = await connect({
// Start with a seed server; cluster gossip will reveal others
servers: "nats://nats-seed.internal:4222",
// Allow connections to servers discovered through gossip
noRandomize: false, // Randomize server selection (default)
ignoreClusterUpdates: false, // Accept new servers from gossip (default)
});
console.log(`Connected to: ${nc.getServer()}`);
console.log(`Known servers: ${JSON.stringify(nc.info?.connect_urls)}`);
return nc;
}
The noRandomize: false setting (the default) ensures that the client selects servers randomly during reconnection. This distributes clients evenly across cluster members rather than having all clients pile onto the same server. In a three-node cluster, this means roughly one-third of your clients connect to each node.
For geographically distributed clusters, you might want to prefer local servers. While the node-nats client does not have built-in geographic awareness, you can achieve this by ordering the servers array with local servers first and setting noRandomize: true. The client will attempt servers in order and stick with the first successful connection:
const nc = await connect({
servers: [
// Local servers first
"nats://nats-eu-west-1.internal:4222",
"nats://nats-eu-west-2.internal:4222",
// Remote servers as fallback
"nats://nats-us-east-1.internal:4222",
"nats://nats-us-east-2.internal:4222",
],
noRandomize: true,
});
Graceful Shutdown with Drain
Abruptly closing a NATS connection can lead to lost messages, particularly for JetStream consumers that have fetched messages but not yet acknowledged them. The drain method provides a graceful shutdown sequence that ensures all in-flight work completes:
import { connect, NatsConnection, Subscription } from "nats";
let nc: NatsConnection;
async function startService() {
nc = await connect({ servers: "nats://localhost:4222" });
// Set up subscriptions
const sub = nc.subscribe("tasks.>", { queue: "workers" });
processMessages(sub);
// Handle shutdown signals
const shutdown = async () => {
console.log("Shutting down gracefully...");
// drain() does the following in order:
// 1. Unsubscribes all subscriptions (stops receiving new messages)
// 2. Waits for all pending message handlers to complete
// 3. Flushes any buffered outgoing messages
// 4. Closes the connection
await nc.drain();
console.log("NATS connection drained and closed");
process.exit(0);
};
process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
}
async function processMessages(sub: Subscription) {
for await (const msg of sub) {
// This handler will complete even during drain
const data = JSON.parse(new TextDecoder().decode(msg.data));
await processTask(data);
if (msg.reply) {
msg.respond(new TextEncoder().encode(JSON.stringify({ status: "done" })));
}
}
// The for-await loop exits when the subscription is drained
console.log("Subscription drained");
}
The drain sequence is particularly important in Kubernetes environments where pods receive a SIGTERM signal before being terminated. By calling nc.drain() in your SIGTERM handler, you ensure that the pod completes all in-flight message processing before shutting down. Configure your Kubernetes terminationGracePeriodSeconds to be longer than your longest expected message processing time.
You can also drain individual subscriptions without closing the entire connection. This is useful when you want to stop processing one type of work while continuing others:
async function scaleDownOrderProcessing(sub: Subscription) {
// Stop receiving new orders but finish processing current ones
await sub.drain();
console.log("Order processing stopped, other subscriptions still active");
}
Error Handling Patterns
Robust error handling is non-negotiable in production messaging. The node-nats client surfaces errors through several mechanisms, and your application should handle all of them:
import { connect, ErrorCode, NatsConnection, NatsError } from "nats";
async function robustConnection() {
let nc: NatsConnection;
try {
nc = await connect({
servers: "nats://nats.internal:4222",
maxReconnectAttempts: 10,
});
} catch (error) {
// Initial connection failed after all retries
console.error("Failed to connect to NATS:", error);
process.exit(1);
}
// Handle connection closure
nc.closed().then((err) => {
if (err) {
console.error("NATS connection closed with error:", err);
process.exit(1);
}
console.log("NATS connection closed cleanly");
});
// Handle subscription errors
const sub = nc.subscribe("tasks.>");
(async () => {
for await (const msg of sub) {
try {
await processMessage(msg);
} catch (error) {
// Application-level error -- don't let it kill the subscription
console.error("Error processing message:", error);
metrics.increment("messages.processing_error");
}
}
})();
// Handle JetStream publish errors
const js = nc.jetstream();
try {
await js.publish("events.important", new TextEncoder().encode("data"));
} catch (error) {
if (error instanceof NatsError) {
if (error.code === ErrorCode.NoResponders) {
console.error("No JetStream stream is capturing this subject");
} else if (error.code === ErrorCode.Timeout) {
console.error("JetStream publish acknowledgment timed out");
}
}
throw error;
}
}
The nc.closed() promise is your last line of defense. It resolves when the connection is permanently closed, either through an explicit close() or drain() call, or because the client exhausted its reconnection attempts. If it resolves with an error, something went wrong that the client could not recover from. In most production deployments, this should trigger an alert and a process restart.
Practical Tips for Production Connections
Always set the connection name to something meaningful. When you have 50 services connecting to a NATS cluster, nats server list --sort=name becomes your best diagnostic tool.
Monitor the reconnect and disconnect metrics closely. A steady trickle of reconnections might indicate network instability. A sudden spike across all services points to a NATS server issue.
Use noEcho: true when your service publishes and subscribes to the same subjects. Without it, the service receives its own messages, which is rarely the desired behavior and wastes processing time.
Set maxReconnectAttempts: -1 in production but use a finite value (like 10) in development. In development, you want fast failure feedback; in production, you want infinite persistence.
Test your drain handling under load. Publish messages at a high rate, trigger a SIGTERM, and verify that all in-flight messages complete processing and no messages are lost. This test has caught subtle bugs in almost every messaging application we have built.
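The environment-dependent maxReconnectAttempts policy above can be encoded in one place. This is a sketch that assumes NODE_ENV distinguishes your environments; the helper name is hypothetical:

```typescript
// Hypothetical helper encoding the per-environment reconnect policy;
// using NODE_ENV as the switch is an assumption about your deployment.
function reconnectAttemptsFor(env: string | undefined): number {
  // -1 means retry forever; 10 attempts gives fast failure feedback locally.
  return env === "production" ? -1 : 10;
}
```

You would then pass maxReconnectAttempts: reconnectAttemptsFor(process.env.NODE_ENV) in the connect options, so the policy lives next to the rest of your connection configuration.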
Conclusion
Connection management is the unglamorous but essential foundation of reliable NATS-based systems. The node-nats client provides all the primitives you need: automatic reconnection with jitter, cluster-aware server discovery, graceful drain shutdown, and comprehensive event monitoring. By configuring these capabilities correctly and handling errors at every level, you build applications that survive the inevitable disruptions of distributed infrastructure. The investment in robust connection management pays dividends every time a server restarts, a network flaps, or a cluster performs a rolling upgrade, and your application continues processing messages without missing a beat.
Related Articles
Operating NATS in Production: Monitoring and Scaling
A practical operations guide for running NATS in production environments, covering monitoring strategies, capacity planning, scaling patterns, upgrade procedures, and incident response for engineering and platform teams.
Messaging Architecture for Fintech Systems
A strategic guide to designing messaging architectures for financial technology systems, covering regulatory requirements, data consistency patterns, auditability, and the role of NATS in building compliant, resilient fintech infrastructure.
Securing NATS: Authentication and Authorization
A comprehensive guide to securing NATS deployments with authentication mechanisms, fine-grained authorization, TLS encryption, and account-based multi-tenancy, with practical TypeScript client configuration examples.