Operating NATS in Production: Monitoring and Scaling

A practical operations guide for running NATS in production environments, covering monitoring strategies, capacity planning, scaling patterns, upgrade procedures, and incident response for engineering and platform teams.

business · 12 min read · By Klivvr Engineering

Deploying NATS to production is straightforward. Keeping it running reliably at scale, month after month, requires operational discipline: knowing what to monitor, when to scale, how to upgrade without downtime, and what to do when things go wrong. This article is a practical guide for platform teams and engineering leaders responsible for NATS infrastructure. We cover the monitoring, scaling, and operational patterns that keep NATS clusters healthy in production, drawn from our experience operating NATS for fintech workloads.

Monitoring Fundamentals

NATS exposes monitoring data through HTTP endpoints on a dedicated monitoring port (default 8222). These endpoints provide real-time metrics in JSON format, suitable for ingestion by Prometheus, Datadog, or any metrics system that can scrape HTTP endpoints.

The essential monitoring endpoints are:

GET /varz    - Server information, connections, message rates
GET /connz   - Active connection details
GET /subsz   - Subscription routing information
GET /routez  - Cluster route information
GET /jsz     - JetStream account and usage information
GET /healthz - Health check endpoint
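These endpoints are plain HTTP, so a lightweight poller can consume them directly. The sketch below (helper names are ours, assuming the default monitoring port 8222) derives per-second message rates from two successive /varz samples; in_msgs, out_msgs, and now are real /varz fields.

```typescript
// Shape of the /varz fields we use (a small subset of the full payload).
interface VarzSample {
  in_msgs: number;
  out_msgs: number;
  now: string; // RFC 3339 timestamp reported by the server
}

// Fetch /varz from the monitoring port (default 8222).
async function fetchVarz(host: string): Promise<VarzSample> {
  const res = await fetch(`http://${host}:8222/varz`);
  if (!res.ok) throw new Error(`varz request failed: ${res.status}`);
  return (await res.json()) as VarzSample;
}

// Derive per-second rates from two samples taken some interval apart.
function messageRates(prev: VarzSample, curr: VarzSample) {
  const seconds = (Date.parse(curr.now) - Date.parse(prev.now)) / 1000;
  return {
    inPerSec: (curr.in_msgs - prev.in_msgs) / seconds,
    outPerSec: (curr.out_msgs - prev.out_msgs) / seconds,
  };
}
```

Poll on a fixed interval and feed the derived rates into your metrics system; the counters in /varz are cumulative, so rates must always be computed from deltas.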

For Prometheus-based monitoring, the nats-exporter sidecar translates these endpoints into Prometheus metrics:

# docker-compose.yml excerpt
nats-exporter:
  image: natsio/prometheus-nats-exporter:latest
  command:
    - "-varz"
    - "-connz"
    - "-jsz=all"
    - "-port=7777"
    - "http://nats:8222"
  ports:
    - "7777:7777"

The metrics you should track fall into four categories: throughput, latency, resource utilization, and error rates.

Throughput metrics tell you how much work the system is doing:

  • nats_varz_messages_in / nats_varz_messages_out: Messages received and sent per second. Track these as rates and alert on sudden drops (service outage) or spikes (traffic surge or message storm).
  • nats_varz_bytes_in / nats_varz_bytes_out: Bytes transferred. This reveals message size trends that throughput counts alone miss.
  • JetStream stream message counts and byte sizes: Track per-stream to understand growth patterns and predict storage needs.
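As one way to act on these, a Prometheus alerting rule can compare the current rate against the same window an hour earlier to catch sudden drops. The rule below is a sketch; the exact metric names depend on your exporter version, and we reuse the naming convention from the list above.

```yaml
groups:
  - name: nats-throughput
    rules:
      - alert: NatsMessageRateDrop
        # Fires when the 5m inbound rate falls below 20% of its value 1h ago.
        expr: |
          rate(nats_varz_messages_in[5m])
            < 0.2 * rate(nats_varz_messages_in[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "NATS inbound message rate dropped sharply"
```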

Latency metrics require client-side instrumentation because the NATS server does not track end-to-end latency. The node-nats client provides the building blocks:

import { connect, JSONCodec, NatsConnection } from "nats";

const jc = JSONCodec();

// Minimal interface for the injected metrics client -- any wrapper
// (StatsD, Prometheus, Datadog) with these two methods will do.
interface MetricsClient {
  histogram(name: string, value: number, labels?: Record<string, string>): void;
  increment(name: string, labels?: Record<string, string>): void;
}

class InstrumentedClient {
  constructor(
    private nc: NatsConnection,
    private metrics: MetricsClient
  ) {}
 
  async request(
    subject: string,
    data: unknown,
    timeout: number = 5000
  ): Promise<unknown> {
    const start = performance.now();
    const labels = { subject };
 
    try {
      const response = await this.nc.request(
        subject,
        jc.encode(data),
        { timeout }
      );
 
      const duration = performance.now() - start;
      this.metrics.histogram("nats.request.duration_ms", duration, labels);
      this.metrics.increment("nats.request.success", labels);
 
      return jc.decode(response.data);
    } catch (error) {
      const duration = performance.now() - start;
      this.metrics.histogram("nats.request.duration_ms", duration, labels);
      this.metrics.increment("nats.request.error", {
        ...labels,
        error_type: error instanceof Error ? error.constructor.name : "unknown",
      });
      throw error;
    }
  }
 
  async publishJetStream(
    subject: string,
    data: unknown
  ): Promise<void> {
    const js = this.nc.jetstream();
    const start = performance.now();
 
    try {
      const ack = await js.publish(subject, jc.encode(data));
      const duration = performance.now() - start;
      this.metrics.histogram("nats.js.publish.duration_ms", duration, { subject });
      this.metrics.increment("nats.js.publish.success", {
        subject,
        stream: ack.stream,
      });
    } catch (error) {
      const duration = performance.now() - start;
      this.metrics.histogram("nats.js.publish.duration_ms", duration, { subject });
      this.metrics.increment("nats.js.publish.error", { subject });
      throw error;
    }
  }
}

Resource utilization metrics track the health of the NATS servers themselves:

  • CPU utilization: NATS is CPU-efficient, but high message rates or complex subject routing can drive CPU usage up.
  • Memory usage: Track nats_varz_mem and compare against available RAM. Memory spikes often indicate a subscription backlog or JetStream consumer catchup.
  • Disk I/O and utilization: Critical for JetStream. Track disk latency, throughput, and free space. JetStream performance degrades sharply when disks are overloaded.
  • Connection count: Track nats_varz_connections against configured limits. Alert when approaching the maximum.

Error metrics signal problems before they become outages:

  • Slow consumer warnings: NATS drops messages to slow consumers that cannot keep up. Each drop triggers a warning. If these increase, a consumer needs optimization or more instances.
  • Authentication failures: A spike indicates misconfigured clients or potential security issues.
  • JetStream consumer pending counts: The number of unprocessed messages per consumer. Growing pending counts mean a consumer is falling behind.

JetStream-Specific Monitoring

JetStream introduces stateful components that require their own monitoring discipline. The JetStream monitoring endpoint (/jsz) exposes account-level and stream-level metrics:

// Programmatic monitoring of JetStream health
async function monitorJetStream(nc: NatsConnection) {
  const jsm = await nc.jetstreamManager();
 
  // Check all streams
  const streams = await jsm.streams.list().next();
  for (const info of streams) {
 
    console.log(`Stream: ${info.config.name}`);
    console.log(`  Messages: ${info.state.messages}`);
    console.log(`  Bytes: ${info.state.bytes}`);
    console.log(`  Consumer count: ${info.state.consumer_count}`);
    console.log(`  First seq: ${info.state.first_seq}`);
    console.log(`  Last seq: ${info.state.last_seq}`);
 
    // Check stream storage against limits
    if (info.config.max_bytes > 0) {
      const utilization = info.state.bytes / info.config.max_bytes;
      if (utilization > 0.8) {
        console.warn(
          `Stream ${info.config.name} is at ${(utilization * 100).toFixed(1)}% storage capacity`
        );
      }
    }
 
    // Check consumer health
    const consumers = await jsm.consumers.list(info.config.name).next();
    for (const consumer of consumers) {
      const pending = consumer.num_pending;
      const ackPending = consumer.num_ack_pending;
      const redelivered = consumer.num_redelivered;
 
      console.log(`  Consumer: ${consumer.name}`);
      console.log(`    Pending: ${pending}`);
      console.log(`    Ack pending: ${ackPending}`);
      console.log(`    Redelivered: ${redelivered}`);
 
      if (pending > 10000) {
        console.warn(
          `Consumer ${consumer.name} has ${pending} pending messages -- may be falling behind`
        );
      }
 
      if (redelivered > 100) {
        console.warn(
          `Consumer ${consumer.name} has ${redelivered} redeliveries -- check for poison messages`
        );
      }
    }
  }
}

The most critical JetStream metric is consumer lag: the difference between the stream's last sequence number and the consumer's acknowledged sequence. A growing lag means the consumer is processing messages slower than they arrive. The response depends on the magnitude: small lag may self-resolve during traffic spikes; persistent and growing lag requires adding consumer instances or optimizing processing logic.
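The lag arithmetic itself is simple. A minimal sketch, assuming the stream state and consumer delivered-info shapes reported by the client (for filtered consumers, num_pending is the more accurate measure; this shows the raw sequence math):

```typescript
// Approximate consumer lag: the stream's head sequence minus the
// consumer's last delivered stream sequence.
interface StreamState {
  last_seq: number;
}

interface DeliveredInfo {
  stream_seq: number;
}

function consumerLag(state: StreamState, delivered: DeliveredInfo): number {
  // Clamp at zero: a consumer can never be ahead of the stream.
  return Math.max(0, state.last_seq - delivered.stream_seq);
}
```

Track this value over time rather than alerting on a single reading; the trend, not the snapshot, tells you whether the consumer is keeping up.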

Capacity Planning

NATS capacity planning starts with three questions: How many messages per second? How large are the messages? How long must they be retained?

Core NATS (without JetStream) is primarily CPU and network bound. A single NATS server on modern hardware (8 cores, 10Gbps networking) can route 10+ million small messages per second. The bottleneck is typically network bandwidth, not CPU. For most applications, a three-node cluster provides far more capacity than needed, and the cluster exists for availability rather than capacity.

JetStream adds disk I/O as a constraint. The formula for storage capacity is straightforward:

Daily storage = messages_per_day * average_message_size_bytes
Total storage = daily_storage * retention_days * replication_factor

For example: 1 million messages/day, 500 bytes average, 30-day retention, replication factor 3:

Daily:  1,000,000 * 500 = 500 MB/day
Total:  500 MB * 30 * 3 = 45 GB

Add a 50% overhead for metadata, indexes, and growth headroom, bringing the total to approximately 68 GB. This is the minimum disk allocation per NATS server (since each replica stores the data independently).
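The formula above is worth encoding once and reusing in capacity reviews. A small helper (the function name is ours), using decimal gigabytes to match the example:

```typescript
// Estimate total JetStream disk per the capacity formula above.
// overhead = 0.5 adds the recommended 50% headroom for metadata,
// indexes, and growth.
function estimateStorageGB(
  messagesPerDay: number,
  avgMessageBytes: number,
  retentionDays: number,
  replicationFactor: number,
  overhead = 0.5
): number {
  const dailyBytes = messagesPerDay * avgMessageBytes;
  const totalBytes =
    dailyBytes * retentionDays * replicationFactor * (1 + overhead);
  return totalBytes / 1e9; // decimal GB
}
```

With the example inputs, estimateStorageGB(1_000_000, 500, 30, 3) returns 67.5, matching the ~68 GB figure above.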

JetStream throughput depends on disk performance. NVMe SSDs support 100,000+ publishes per second per stream with replication. Network-attached storage (EBS, Persistent Disks) typically supports 10,000-50,000 publishes per second depending on the provisioned IOPS. Size your disk I/O capacity to match your peak publish rate, not your average.

Scaling Patterns

NATS scales differently depending on whether you need more messaging capacity or more JetStream storage.

Scaling messaging capacity: Add servers to the cluster. NATS clusters are a full mesh: every server connects to every other server. Clients are distributed across servers (either by DNS round-robin or explicit configuration), and each server routes messages to others as needed. Adding a server immediately absorbs new client connections and their associated message routing load.

The practical limit is cluster size. NATS clusters work well up to approximately 30 servers. Beyond that, the full mesh topology generates excessive inter-server traffic. For larger deployments, use super clusters with gateway connections between regional clusters.
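For reference, cluster membership is defined in each server's configuration. A minimal sketch (cluster name and hostnames are placeholders):

```
# nats-server.conf -- minimal cluster block
cluster {
  name: "prod-cluster"
  listen: "0.0.0.0:6222"
  routes: [
    "nats-route://nats-1:6222"
    "nats-route://nats-2:6222"
    "nats-route://nats-3:6222"
  ]
}
```

The routes list only needs to seed the mesh; servers discover the full membership from each other, so a new server can list any existing members.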

Scaling JetStream storage: JetStream storage is distributed across the cluster. When you create a stream with num_replicas: 3, the data is replicated to three servers. Adding servers to the cluster provides more aggregate storage capacity and more servers eligible for new stream placements.

Existing streams do not automatically rebalance when new servers are added. You can move a stream by updating its placement configuration, and you can steer where new streams land by tagging servers and requesting those tags at stream creation:

# nats-server.conf on each server
jetstream {
  store_dir: "/data/jetstream"
  max_mem: 4GB
  max_file: 500GB
}

server_tags: ["region:eu-west", "storage:nvme"]

// Place streams on servers with specific tags
await jsm.streams.add({
  name: "HIGH_THROUGHPUT",
  subjects: ["ht.>"],
  placement: {
    tags: ["storage:nvme"],
  },
  num_replicas: 3,
});

Scaling consumers: JetStream consumers scale horizontally through queue groups (for push consumers) or by running multiple pull consumer instances with the same durable name. Adding instances is a zero-configuration operation: start a new process with the same consumer configuration, and JetStream automatically distributes messages among all active instances.
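For example, every instance can share a single durable definition (stream and consumer names here are illustrative); starting another process with the same definition adds capacity with no other changes:

```typescript
// Shared durable consumer definition. Every instance that binds to this
// durable on the same stream joins the same consumer and shares its
// messages -- nothing else needs to change to add capacity.
const ORDER_CONSUMER = {
  durable_name: "order-processor",
  ack_policy: "explicit",
  max_ack_pending: 1000,
};

// Each instance then binds to the durable and pulls its share, roughly:
//   const js = nc.jetstream();
//   const consumer = await js.consumers.get("ORDERS", ORDER_CONSUMER.durable_name);
//   for await (const m of await consumer.consume()) { /* process */ m.ack(); }
```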

Zero-Downtime Upgrades

NATS supports zero-downtime cluster upgrades through Lame Duck Mode. The procedure is:

  1. Signal the first server to enter Lame Duck Mode (nats-server --signal ldm).
  2. The server stops accepting new connections and signals existing clients to reconnect to other servers.
  3. Wait for connections to drain. After a short grace period (default 10 seconds) the server begins closing remaining client connections, spread over the lame duck duration (default 2 minutes); both are configurable.
  4. The server shuts down.
  5. Upgrade the binary and restart.
  6. Repeat for each server in the cluster.

Clients using the node-nats library handle this automatically. When a server enters Lame Duck Mode, the client receives the LDM event and reconnects to another server in the cluster:

// Client-side: Lame Duck Mode is handled automatically.
// This status listener is for observability, not for taking action.
import { Events } from "nats";

for await (const status of nc.status()) {
  if (status.type === Events.LDM) {
    console.log("Server entering lame duck mode, reconnecting automatically");
    metrics.increment("nats.lame_duck_migration");
  }
}

For JetStream streams with replicas, the upgrade procedure is even more seamless. When one replica goes offline during an upgrade, the remaining replicas continue serving reads and writes. When the upgraded replica rejoins, it catches up automatically.

The key requirement for zero-downtime upgrades is that clients are configured with multiple server addresses or rely on cluster gossip for server discovery. A client connected to a single server with no alternatives will fail during that server's upgrade.
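In node-nats this means listing several seed servers in the connection options (the addresses below are placeholders); gossip then fills in the remaining cluster members:

```typescript
// Seed the client with more than one server so it can reconnect
// elsewhere while any single server is being upgraded.
const opts = {
  servers: ["nats-1:4222", "nats-2:4222", "nats-3:4222"],
  maxReconnectAttempts: -1, // retry indefinitely rather than giving up
};

// const nc = await connect(opts);
```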

Incident Response Playbook

When things go wrong, having a structured response plan prevents panic-driven decisions. Here are the most common NATS incidents and their resolution patterns.

High consumer lag: A consumer is falling behind the stream. First, check the consumer's processing time per message. If it has increased, the root cause is likely in the consumer's application logic or its downstream dependencies. If processing time is normal but message volume has increased, add consumer instances. As a temporary measure, you can increase the consumer's max_ack_pending to allow more in-flight messages.

Slow consumer disconnections: NATS disconnects consumers that cannot keep up to protect the system. Check the server logs for slow consumer warnings. The fix is either optimizing the consumer, adding more consumer instances, or increasing the server's max_pending buffer for that connection.

Disk space exhaustion: JetStream streams are approaching disk limits. Options: increase disk size, reduce retention periods for non-critical streams, or add servers to the cluster for more aggregate storage. In an emergency, you can purge specific subjects from a stream using the management API:

// Emergency: purge old messages from a specific subject
async function emergencyPurge(nc: NatsConnection) {
  const jsm = await nc.jetstreamManager();
 
  await jsm.streams.purge("OPERATIONAL_EVENTS", {
    filter: "ops.debug.>",
    keep: 1000, // Keep only the most recent 1000 messages
  });
}

Split brain / network partition: If cluster servers lose connectivity to each other, they continue operating independently, potentially serving stale data. NATS clusters are designed to prioritize consistency over availability in partition scenarios: a server that cannot reach a quorum of peers for JetStream operations will refuse writes. Resolution involves restoring network connectivity and allowing the cluster to reconverge.

Connection storm after outage: When a NATS cluster recovers from an outage, all clients reconnect simultaneously. The reconnection jitter configured in node-nats mitigates this, but very large deployments (thousands of clients) may need additional protection. Consider implementing a staged reconnection strategy where services reconnect in priority order: critical services first, then secondary services.
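One way to sketch staged reconnection with node-nats is a custom reconnectDelayHandler whose base delay scales with a service priority tier (the tiers and delays below are illustrative, not a recommendation):

```typescript
// Priority tier for this service: 0 = critical, higher = less urgent.
// In practice this would come from configuration or the environment.
const SERVICE_PRIORITY = 2;

// Base delay grows with the tier so critical services reconnect first;
// random jitter spreads reconnects within each tier.
function staggeredReconnectDelay(priority: number): number {
  const baseMs = 1000 + priority * 5000;
  const jitterMs = Math.random() * 2000;
  return baseMs + jitterMs;
}

// const nc = await connect({
//   servers: ["nats-1:4222", "nats-2:4222"],
//   reconnectDelayHandler: () => staggeredReconnectDelay(SERVICE_PRIORITY),
// });
```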

Operational Best Practices

Maintain a NATS configuration as code. Store server configurations in version control and deploy them through your standard infrastructure pipeline. Configuration drift between cluster members causes subtle, hard-to-diagnose issues.

Run a separate monitoring NATS cluster (or at minimum, a separate account) for your monitoring and alerting traffic. If your primary cluster is experiencing issues, you do not want your monitoring to be affected by the same issues.

Establish baseline metrics during normal operation. Know what "normal" looks like for message rates, consumer lag, disk utilization, and connection counts. Alert on deviations from the baseline rather than on absolute thresholds. A 50% increase in message rate might be normal during business hours but abnormal at midnight.

Practice failure scenarios regularly. Kill a NATS server during peak traffic and verify that clients reconnect, consumers resume processing, and no messages are lost. Run this exercise at least quarterly.

Document your stream configurations and their business purpose. When an on-call engineer is paged about the "TRANSACTION_LEDGER" stream at 3 AM, they should be able to quickly understand what the stream is for, who produces to it, who consumes from it, and what the impact of an outage would be.

Conclusion

Operating NATS in production is less complex than operating most distributed infrastructure, but it still requires disciplined monitoring, thoughtful capacity planning, and prepared incident response. The monitoring fundamentals (throughput, latency, resource utilization, and error rates) are the same as for any production system, but NATS's built-in HTTP monitoring endpoints and the JetStream management API make these metrics readily accessible. Scaling follows predictable patterns: add servers for capacity, add consumer instances for throughput, add disk for JetStream storage. Zero-downtime upgrades through Lame Duck Mode and replica redundancy keep the messaging layer available during maintenance. With these operational practices in place, NATS becomes an infrastructure component that platform teams can run confidently and application teams can build on without worrying about the messaging layer beneath them.
