NATS in Microservices: Service Discovery and Communication
A practical guide to using NATS as the communication backbone for microservice architectures, covering service discovery patterns, inter-service communication, and resilience strategies with TypeScript examples.
Microservice architectures replace monolithic function calls with network communication. Every method that once lived in the same process now requires a network hop, and with that hop comes a set of challenges: service discovery, load balancing, timeout handling, circuit breaking, and protocol negotiation. HTTP-based service meshes solve these problems, but they add significant infrastructure complexity. NATS offers a different approach: a lightweight messaging layer that provides service discovery, load balancing, and resilient communication as inherent properties of the protocol, not as bolted-on infrastructure.
This article demonstrates how to use NATS as the communication backbone for a microservice architecture, using the nats.js client library (the nats package on npm) and TypeScript throughout. We cover the patterns that make NATS particularly effective for service-to-service communication and the practical considerations for running them in production.
Service Discovery Through Subjects
In a traditional microservice architecture, service discovery is a separate concern. Services register themselves with a registry (Consul, etcd, Kubernetes DNS), and clients query the registry to find available instances. This works, but it requires operating a discovery infrastructure and handling the inevitable consistency issues (stale registrations, DNS caching, health check races).
NATS eliminates dedicated service discovery by making it a natural consequence of subscriptions. A service that subscribes to services.users.get is, by definition, discoverable. Any client that publishes to that subject reaches the service. If multiple instances subscribe with the same queue group, NATS automatically distributes requests across them. When an instance crashes, its subscriptions disappear, and NATS stops routing to it. No registry updates, no health check configurations, no TTL expirations.
import { connect, JSONCodec, NatsConnection, Msg } from "nats";
const jc = JSONCodec();
interface ServiceHandler<TReq, TRes> {
subject: string;
queue: string;
handler: (request: TReq) => Promise<TRes>;
}
class MicroserviceHost {
private handlers: Array<{ subject: string; queue: string }> = [];
constructor(
private nc: NatsConnection,
private serviceName: string
) {}
register<TReq, TRes>(config: ServiceHandler<TReq, TRes>): void {
const sub = this.nc.subscribe(config.subject, {
queue: config.queue,
});
this.handlers.push({ subject: config.subject, queue: config.queue });
(async () => {
for await (const msg of sub) {
try {
const request = jc.decode(msg.data) as TReq;
const response = await config.handler(request);
if (msg.reply) {
msg.respond(jc.encode(response));
}
} catch (error) {
if (msg.reply) {
msg.respond(
jc.encode({
error: error instanceof Error ? error.message : "Unknown error",
service: this.serviceName,
})
);
}
}
}
})();
console.log(
`${this.serviceName}: registered handler for ${config.subject} (queue: ${config.queue})`
);
}
protected getRegisteredSubjects(): string[] {
  return this.handlers.map((h) => h.subject);
}
async shutdown(): Promise<void> {
await this.nc.drain();
console.log(`${this.serviceName}: shutdown complete`);
}
}

This pattern gives you a microservice framework in about 50 lines of code. Each service registers handlers for the subjects it serves, and the queue group ensures that multiple instances share the load automatically.
Building a Service Client
The counterpart to the service host is a client that makes requests in a type-safe manner:
interface ServiceError {
error: string;
service: string;
}
class ServiceClient {
constructor(
private nc: NatsConnection,
private defaultTimeout: number = 5000
) {}
async call<TReq, TRes>(
subject: string,
request: TReq,
timeout?: number
): Promise<TRes> {
try {
const response = await this.nc.request(
subject,
jc.encode(request),
{ timeout: timeout || this.defaultTimeout }
);
const decoded = jc.decode(response.data) as TRes | ServiceError;
if (decoded && typeof decoded === "object" && "error" in decoded) {
throw new Error(
`Service error from ${(decoded as ServiceError).service}: ${(decoded as ServiceError).error}`
);
}
return decoded as TRes;
} catch (error) {
// nats.js signals "no responders" with error code 503
if (error instanceof Error && (error as { code?: string }).code === "503") {
throw new Error(
`No service available for ${subject}. Is the service running?`
);
}
throw error;
}
}
async callWithRetry<TReq, TRes>(
subject: string,
request: TReq,
options: { maxRetries: number; backoffMs: number; timeout?: number }
): Promise<TRes> {
let lastError: Error | undefined;
for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
try {
return await this.call<TReq, TRes>(
subject,
request,
options.timeout
);
} catch (error) {
lastError = error instanceof Error ? error : new Error(String(error));
if (attempt < options.maxRetries) {
const delay = options.backoffMs * Math.pow(2, attempt);
const jitter = Math.random() * options.backoffMs;
await new Promise((resolve) => setTimeout(resolve, delay + jitter));
}
}
}
throw lastError;
}
}

The callWithRetry method implements exponential backoff with jitter, which is essential for resilient service communication. When a service is temporarily overloaded, immediate retries make the problem worse. Exponential backoff reduces the retry rate, and jitter prevents multiple clients from retrying at the same time.
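The delay schedule is easier to reason about when factored into a small pure helper. The sketch below mirrors the arithmetic inside callWithRetry; retryDelay and its injectable jitterFn parameter are hypothetical additions for illustration, not part of the client above.

```typescript
// Hypothetical helper mirroring callWithRetry's schedule: exponential base
// delay (backoffMs * 2^attempt) plus uniform jitter in [0, backoffMs).
function retryDelay(
  attempt: number,
  backoffMs: number,
  jitterFn: () => number = Math.random
): number {
  const delay = backoffMs * Math.pow(2, attempt);
  const jitter = jitterFn() * backoffMs;
  return delay + jitter;
}
```

With backoffMs set to 100, successive attempts wait roughly 100, 200, 400, 800 ms, each plus up to 100 ms of jitter.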
A Complete Microservice Example
Let us build a small but realistic microservice system: a user service, an order service that depends on it, and an API gateway that orchestrates them:
// user-service.ts
async function startUserService() {
const nc = await connect({ servers: "nats://localhost:4222", name: "user-service" });
const host = new MicroserviceHost(nc, "user-service");
host.register<{ userId: string }, { id: string; name: string; email: string }>({
subject: "services.users.get",
queue: "user-service",
handler: async (req) => {
const user = await db.users.findById(req.userId);
if (!user) throw new Error(`User ${req.userId} not found`);
return { id: user.id, name: user.name, email: user.email };
},
});
host.register<
{ email: string; name: string },
{ userId: string; created: boolean }
>({
subject: "services.users.create",
queue: "user-service",
handler: async (req) => {
const user = await db.users.create({ email: req.email, name: req.name });
// Publish event for other services (fire and forget)
nc.publish(
"events.users.created",
jc.encode({ userId: user.id, email: user.email, name: user.name })
);
return { userId: user.id, created: true };
},
});
console.log("User service started");
return host;
}
// order-service.ts
async function startOrderService() {
const nc = await connect({ servers: "nats://localhost:4222", name: "order-service" });
const host = new MicroserviceHost(nc, "order-service");
const client = new ServiceClient(nc);
host.register<
{ userId: string; items: Array<{ productId: string; quantity: number }> },
{ orderId: string; total: number; userName: string }
>({
subject: "services.orders.create",
queue: "order-service",
handler: async (req) => {
// Call user service to validate and get user info
const user = await client.call<
{ userId: string },
{ id: string; name: string; email: string }
>("services.users.get", { userId: req.userId });
// Calculate total
const total = await calculateTotal(req.items);
// Create order
const order = await db.orders.create({
userId: req.userId,
items: req.items,
total,
});
// Publish event
nc.publish(
"events.orders.created",
jc.encode({
orderId: order.id,
userId: req.userId,
total,
itemCount: req.items.length,
})
);
return { orderId: order.id, total, userName: user.name };
},
});
console.log("Order service started");
return host;
}

Notice how the order service calls the user service through NATS, not through HTTP. There is no URL to configure, no service registry to query, and no DNS to resolve. The subject services.users.get is the only coupling point, and it is explicit in the code.
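One way to keep that single coupling point honest is to centralize subject names in a shared constants module. The Subjects map below is a hypothetical sketch, not part of the services above: both hosts and clients would import it, so a typo in a subject name becomes a compile error rather than a request timeout.

```typescript
// Hypothetical shared module of subject constants for the services above
const Subjects = {
  users: {
    get: "services.users.get",
    create: "services.users.create",
  },
  orders: {
    create: "services.orders.create",
  },
} as const;

// Union of the concrete subject strings, usable to constrain call sites
type KnownSubject =
  | typeof Subjects.users.get
  | typeof Subjects.users.create
  | typeof Subjects.orders.create;
```

A call site then reads client.call(Subjects.users.get, ...) and cannot drift from what the host registered.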
Health Checking and Service Introspection
NATS provides a built-in mechanism for service health checking through its micro framework conventions. You can also build lightweight health checking using request/reply:
interface ServiceInfo {
name: string;
version: string;
uptime: number;
handlers: string[];
instanceId: string;
}
class HealthCheckableService extends MicroserviceHost {
private startTime = Date.now();
private instanceId = crypto.randomUUID();
constructor(nc: NatsConnection, serviceName: string, version: string) {
super(nc, serviceName);
// Register health check endpoint
this.register<Record<string, never>, ServiceInfo>({
subject: `services.${serviceName}.health`,
queue: `${serviceName}-health`,
handler: async () => ({
name: serviceName,
version,
uptime: Date.now() - this.startTime,
handlers: this.getRegisteredSubjects(),
instanceId: this.instanceId,
}),
});
// Register ping for discovery
this.register<Record<string, never>, { name: string; instanceId: string }>({
subject: "services.discovery.ping",
queue: `${serviceName}-discovery`,
handler: async () => ({
name: serviceName,
instanceId: this.instanceId,
}),
});
}
}
// Discover all running services
async function discoverServices(nc: NatsConnection): Promise<string[]> {
  const services: string[] = [];
  // Subscribe to a unique inbox and advertise it as the reply subject,
  // so every responder knows where to send its answer
  const inbox = `_INBOX.${crypto.randomUUID()}`;
  const sub = nc.subscribe(inbox);
  nc.publish("services.discovery.ping", jc.encode({}), { reply: inbox });
  // Stop collecting after 2 seconds; unsubscribing ends the iterator
  setTimeout(() => sub.unsubscribe(), 2000);
  for await (const msg of sub) {
    const info = jc.decode(msg.data) as { name: string; instanceId: string };
    services.push(`${info.name} (${info.instanceId})`);
  }
  return services;
}

Because every service instance subscribes to services.discovery.ping with a queue group, exactly one instance of each service responds. If you want all instances to respond (for a complete cluster inventory), remove the queue group option.
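The collect-for-a-fixed-window step generalizes to any scatter-gather call. Below is a hypothetical, transport-agnostic sketch of the same idea: drain an async iterable until a deadline and return whatever arrived. A NATS subscription is an async iterable, so the same shape works for inbox collection.

```typescript
// Hypothetical helper: consume an async iterable until `ms` elapses
async function collectUntil<T>(source: AsyncIterable<T>, ms: number): Promise<T[]> {
  const results: T[] = [];
  const deadline = new Promise<"deadline">((resolve) =>
    setTimeout(() => resolve("deadline"), ms)
  );
  const it = source[Symbol.asyncIterator]();
  while (true) {
    // Race the next item against the deadline
    const next = await Promise.race([it.next(), deadline]);
    if (next === "deadline" || next.done) break;
    results.push(next.value);
  }
  return results;
}
```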
Circuit Breaker Pattern
When a downstream service is failing, continuing to send requests wastes resources and can cascade the failure. A circuit breaker stops sending requests after a threshold of failures:
enum CircuitState {
Closed = "closed",
Open = "open",
HalfOpen = "half-open",
}
class CircuitBreaker {
  private state: CircuitState = CircuitState.Closed;
  private failureCount = 0;
  private lastFailureTime = 0;
  private halfOpenCalls = 0;
  constructor(
    private client: ServiceClient,
    private options: {
      failureThreshold: number;
      resetTimeMs: number;
      halfOpenMaxCalls: number;
    }
  ) {}
  async call<TReq, TRes>(subject: string, request: TReq): Promise<TRes> {
    if (this.state === CircuitState.Open) {
      if (Date.now() - this.lastFailureTime > this.options.resetTimeMs) {
        // Let a limited number of probe calls through
        this.state = CircuitState.HalfOpen;
        this.halfOpenCalls = 0;
      } else {
        throw new Error(
          `Circuit breaker is open for ${subject}. Service is unavailable.`
        );
      }
    }
    if (this.state === CircuitState.HalfOpen) {
      if (this.halfOpenCalls >= this.options.halfOpenMaxCalls) {
        throw new Error(
          `Circuit breaker is half-open for ${subject}. Probe limit reached.`
        );
      }
      this.halfOpenCalls++;
    }
    try {
      const result = await this.client.call<TReq, TRes>(subject, request);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  private onSuccess(): void {
    this.failureCount = 0;
    this.state = CircuitState.Closed;
  }
  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // A failed probe while half-open reopens the circuit immediately
    if (
      this.state === CircuitState.HalfOpen ||
      this.failureCount >= this.options.failureThreshold
    ) {
      this.state = CircuitState.Open;
      console.warn(`Circuit breaker opened after ${this.failureCount} failures`);
    }
  }
}
// Usage
const userServiceBreaker = new CircuitBreaker(client, {
failureThreshold: 5,
resetTimeMs: 30_000,
halfOpenMaxCalls: 1,
});
try {
const user = await userServiceBreaker.call<
{ userId: string },
{ id: string; name: string }
>("services.users.get", { userId: "usr_123" });
} catch (error) {
// Handle circuit open or actual failure
console.error("User service unavailable:", error);
}

Event-Driven Sagas
Microservices often need to coordinate multi-step workflows that span several services. NATS's combination of request/reply and pub/sub makes the saga pattern natural to implement:
async function processOrderSaga(
nc: NatsConnection,
client: ServiceClient,
orderRequest: { userId: string; items: Array<{ productId: string; quantity: number }> }
) {
const sagaId = crypto.randomUUID();
try {
// Step 1: Validate user
const user = await client.call<{ userId: string }, { id: string; name: string }>(
"services.users.get",
{ userId: orderRequest.userId }
);
// Step 2: Reserve inventory
const reservation = await client.call<
{ items: typeof orderRequest.items; sagaId: string },
{ reservationId: string; total: number }
>("services.inventory.reserve", {
items: orderRequest.items,
sagaId,
});
// Step 3: Process payment
const payment = await client.call<
{ userId: string; amount: number; sagaId: string },
{ paymentId: string; status: string }
>("services.payments.charge", {
userId: orderRequest.userId,
amount: reservation.total,
sagaId,
});
// Step 4: Confirm order
const order = await client.call<
{ userId: string; items: typeof orderRequest.items; paymentId: string },
{ orderId: string }
>("services.orders.confirm", {
userId: orderRequest.userId,
items: orderRequest.items,
paymentId: payment.paymentId,
});
// Publish success event
nc.publish(
"events.sagas.order.completed",
jc.encode({ sagaId, orderId: order.orderId })
);
return order;
} catch (error) {
// Compensate: publish rollback event
nc.publish(
"events.sagas.order.failed",
jc.encode({
sagaId,
error: error instanceof Error ? error.message : "Unknown error",
})
);
throw error;
}
}

Each service listens for the saga failure event and reverses its own step (releasing inventory, refunding payment). The saga coordinator does not need to know the compensation logic; it only needs to signal that the saga failed.
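The per-service side of that contract reduces to a pure decision: given the steps a saga completed locally, which actions undo them. The planner below is a hypothetical sketch (CompensationAction and compensationsFor are not part of the services above); a service would run these actions when it receives events.sagas.order.failed, reversing steps in the opposite order they were applied.

```typescript
// Hypothetical compensation planner: maps completed saga steps to the
// actions that undo them, newest step first
type CompensationAction =
  | { kind: "refund-payment"; paymentId: string }
  | { kind: "release-inventory"; reservationId: string };

function compensationsFor(completed: {
  reservationId?: string;
  paymentId?: string;
}): CompensationAction[] {
  const actions: CompensationAction[] = [];
  // Payment happened after the reservation, so it is reversed first
  if (completed.paymentId) {
    actions.push({ kind: "refund-payment", paymentId: completed.paymentId });
  }
  if (completed.reservationId) {
    actions.push({ kind: "release-inventory", reservationId: completed.reservationId });
  }
  return actions;
}
```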
Practical Tips for NATS Microservices
Establish a subject naming convention early and enforce it. We use services.{name}.{operation} for request/reply and events.{domain}.{action} for pub/sub events. This convention makes it easy to set up NATS authorization rules and monitoring filters.
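A convention is only useful if it is checked. Below is a hypothetical validator you could run in CI or at handler-registration time; the segment grammar (lowercase alphanumeric tokens, an optional extra qualifier on events such as events.sagas.order.failed) is an assumption, not something the article prescribes.

```typescript
// Hypothetical validator for the subject conventions described above:
// services.{name}.{operation} and events.{domain}.{action}[.{qualifier}]
const SERVICE_SUBJECT = /^services\.[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*$/;
const EVENT_SUBJECT = /^events\.[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*(\.[a-z][a-z0-9-]*)?$/;

function isValidSubject(subject: string): boolean {
  return SERVICE_SUBJECT.test(subject) || EVENT_SUBJECT.test(subject);
}
```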
Always set explicit timeouts on service calls. The nats.js client applies a short default timeout to nc.request, but relying on it hides your latency assumptions. We default to 5 seconds and adjust based on the expected operation latency.
Use queue groups for every service handler. Even if you currently run a single instance, the queue group costs nothing and makes horizontal scaling a zero-code-change operation.
Monitor NATS message throughput as a service health signal. If traffic on a service's subjects suddenly drops to zero, the service is likely down. If throughput doubles unexpectedly, something upstream changed. The NATS server's monitoring endpoints (varz, connz, subsz) expose message and subscription statistics you can sample to build these signals.
Test service-to-service communication with chaos engineering. Kill service instances during load testing and verify that requests are rerouted to healthy instances. Introduce artificial latency and verify that timeouts and circuit breakers activate correctly.
Conclusion
NATS provides a uniquely lightweight foundation for microservice communication. Service discovery becomes a natural property of subscriptions rather than a separate infrastructure component. Load balancing is built into queue groups. Resilience patterns like retries and circuit breakers compose naturally with the request/reply protocol. For teams building TypeScript microservices, the nats.js client library delivers these capabilities through an ergonomic, type-safe API that requires no configuration servers, no service registries, and no sidecar proxies. The result is a microservice architecture that is simpler to operate, easier to debug, and faster to develop than traditional HTTP-based alternatives.