NATS in Microservices: Service Discovery and Communication
A practical guide to using NATS as the communication backbone for microservice architectures, covering service discovery patterns, inter-service communication, and resilience strategies with TypeScript examples.
Microservice architectures replace monolithic function calls with network communication. Every method that once lived in the same process now requires a network hop, and with that hop comes a set of challenges: service discovery, load balancing, timeout handling, circuit breaking, and protocol negotiation. HTTP-based service meshes solve these problems, but they add significant infrastructure complexity. NATS offers a different approach: a lightweight messaging layer that provides service discovery, load balancing, and resilient communication as inherent properties of the protocol, not as bolted-on infrastructure.
This article demonstrates how to use NATS as the communication backbone for a microservice architecture, using the nats.js client library (the nats package on npm) and TypeScript throughout. We cover the patterns that make NATS particularly effective for service-to-service communication and the practical considerations for running them in production.
Service Discovery Through Subjects
In a traditional microservice architecture, service discovery is a separate concern. Services register themselves with a registry (Consul, etcd, Kubernetes DNS), and clients query the registry to find available instances. This works, but it requires operating a discovery infrastructure and handling the inevitable consistency issues (stale registrations, DNS caching, health check races).
NATS eliminates dedicated service discovery by making it a natural consequence of subscriptions. A service that subscribes to services.users.get is, by definition, discoverable. Any client that publishes to that subject reaches the service. If multiple instances subscribe with the same queue group, NATS automatically distributes requests across them. When an instance crashes, its subscriptions disappear, and NATS stops routing to it. No registry updates, no health check configurations, no TTL expirations.
import { connect, JSONCodec, NatsConnection, Msg } from "nats";
const jc = JSONCodec();
interface ServiceHandler<TReq, TRes> {
subject: string;
queue: string;
handler: (request: TReq) => Promise<TRes>;
}
class MicroserviceHost {
private handlers: Array<{ subject: string; queue: string }> = [];
constructor(
private nc: NatsConnection,
private serviceName: string
) {}
register<TReq, TRes>(config: ServiceHandler<TReq, TRes>): void {
const sub = this.nc.subscribe(config.subject, {
queue: config.queue,
});
this.handlers.push({ subject: config.subject, queue: config.queue });
(async () => {
for await (const msg of sub) {
try {
const request = jc.decode(msg.data) as TReq;
const response = await config.handler(request);
if (msg.reply) {
msg.respond(jc.encode(response));
}
} catch (error) {
if (msg.reply) {
msg.respond(
jc.encode({
error: error instanceof Error ? error.message : "Unknown error",
service: this.serviceName,
})
);
}
}
}
})();
console.log(
`${this.serviceName}: registered handler for ${config.subject} (queue: ${config.queue})`
);
}
protected getRegisteredSubjects(): string[] {
  return this.handlers.map((h) => h.subject);
}
async shutdown(): Promise<void> {
await this.nc.drain();
console.log(`${this.serviceName}: shutdown complete`);
}
}

This pattern gives you a microservice framework in about 50 lines of code. Each service registers handlers for the subjects it serves, and the queue group ensures that multiple instances share the load automatically.
Building a Service Client
The counterpart to the service host is a client that makes requests in a type-safe manner:
interface ServiceError {
error: string;
service: string;
}
class ServiceClient {
constructor(
private nc: NatsConnection,
private defaultTimeout: number = 5000
) {}
async call<TReq, TRes>(
subject: string,
request: TReq,
timeout?: number
): Promise<TRes> {
try {
const response = await this.nc.request(
subject,
jc.encode(request),
{ timeout: timeout || this.defaultTimeout }
);
const decoded = jc.decode(response.data) as TRes | ServiceError;
if (decoded && typeof decoded === "object" && "error" in decoded) {
throw new Error(
`Service error from ${(decoded as ServiceError).service}: ${(decoded as ServiceError).error}`
);
}
return decoded as TRes;
} catch (error) {
// nats.js signals "no responders" with error code 503
if (error instanceof Error && (error as { code?: string }).code === "503") {
throw new Error(
`No service available for ${subject}. Is the service running?`
);
}
throw error;
}
}
async callWithRetry<TReq, TRes>(
subject: string,
request: TReq,
options: { maxRetries: number; backoffMs: number; timeout?: number }
): Promise<TRes> {
let lastError: Error | undefined;
for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
try {
return await this.call<TReq, TRes>(
subject,
request,
options.timeout
);
} catch (error) {
lastError = error instanceof Error ? error : new Error(String(error));
if (attempt < options.maxRetries) {
const delay = options.backoffMs * Math.pow(2, attempt);
const jitter = Math.random() * options.backoffMs;
await new Promise((resolve) => setTimeout(resolve, delay + jitter));
}
}
}
throw lastError;
}
}

The callWithRetry method implements exponential backoff with jitter, which is essential for resilient service communication. When a service is temporarily overloaded, immediate retries make the problem worse. Exponential backoff reduces the retry rate, and jitter prevents multiple clients from retrying at the same time.
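The delay schedule is easier to reason about when factored into a small pure helper. The sketch below mirrors the arithmetic inside callWithRetry; retryDelay and its injectable jitterFn parameter are hypothetical additions for illustration, not part of the client above.

```typescript
// Hypothetical helper mirroring callWithRetry's schedule: exponential base
// delay (backoffMs * 2^attempt) plus uniform jitter in [0, backoffMs).
function retryDelay(
  attempt: number,
  backoffMs: number,
  jitterFn: () => number = Math.random
): number {
  const delay = backoffMs * Math.pow(2, attempt);
  const jitter = jitterFn() * backoffMs;
  return delay + jitter;
}
```

With backoffMs set to 100, successive attempts wait roughly 100, 200, 400, 800 ms, each plus up to 100 ms of jitter.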
A Complete Microservice Example
Let us build a small but realistic microservice system: a user service, an order service that depends on it, and an API gateway that orchestrates them:
// user-service.ts
async function startUserService() {
const nc = await connect({ servers: "nats://localhost:4222", name: "user-service" });
const host = new MicroserviceHost(nc, "user-service");
host.register<{ userId: string }, { id: string; name: string; email: string }>({
subject: "services.users.get",
queue: "user-service",
handler: async (req) => {
const user = await db.users.findById(req.userId);
if (!user) throw new Error(`User ${req.userId} not found`);
return { id: user.id, name: user.name, email: user.email };
},
});
host.register<
{ email: string; name: string },
{ userId: string; created: boolean }
>({
subject: "services.users.create",
queue: "user-service",
handler: async (req) => {
const user = await db.users.create({ email: req.email, name: req.name });
// Publish event for other services (fire and forget)
nc.publish(
"events.users.created",
jc.encode({ userId: user.id, email: user.email, name: user.name })
);
return { userId: user.id, created: true };
},
});
console.log("User service started");
return host;
}
// order-service.ts
async function startOrderService() {
const nc = await connect({ servers: "nats://localhost:4222", name: "order-service" });
const host = new MicroserviceHost(nc, "order-service");
const client = new ServiceClient(nc);
host.register<
{ userId: string; items: Array<{ productId: string; quantity: number }> },
{ orderId: string; total: number; userName: string }
>({
subject: "services.orders.create",
queue: "order-service",
handler: async (req) => {
// Call user service to validate and get user info
const user = await client.call<
{ userId: string },
{ id: string; name: string; email: string }
>("services.users.get", { userId: req.userId });
// Calculate total
const total = await calculateTotal(req.items);
// Create order
const order = await db.orders.create({
userId: req.userId,
items: req.items,
total,
});
// Publish event
nc.publish(
"events.orders.created",
jc.encode({
orderId: order.id,
userId: req.userId,
total,
itemCount: req.items.length,
})
);
return { orderId: order.id, total, userName: user.name };
},
});
console.log("Order service started");
return host;
}

Notice how the order service calls the user service through NATS, not through HTTP. There is no URL to configure, no service registry to query, and no DNS to resolve. The subject services.users.get is the only coupling point, and it is explicit in the code.
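One way to keep that single coupling point honest is to centralize subject names in a shared constants module. The Subjects map below is a hypothetical sketch, not part of the services above: both hosts and clients would import it, so a typo in a subject name becomes a compile error rather than a request timeout.

```typescript
// Hypothetical shared module of subject constants for the services above
const Subjects = {
  users: {
    get: "services.users.get",
    create: "services.users.create",
  },
  orders: {
    create: "services.orders.create",
  },
} as const;

// Union of the concrete subject strings, usable to constrain call sites
type KnownSubject =
  | typeof Subjects.users.get
  | typeof Subjects.users.create
  | typeof Subjects.orders.create;
```

A call site then reads client.call(Subjects.users.get, ...) and cannot drift from what the host registered.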
Health Checking and Service Introspection
NATS provides a built-in mechanism for service health checking through its micro framework conventions. You can also build lightweight health checking using request/reply:
interface ServiceInfo {
name: string;
version: string;
uptime: number;
handlers: string[];
instanceId: string;
}
class HealthCheckableService extends MicroserviceHost {
private startTime = Date.now();
private instanceId = crypto.randomUUID();
constructor(nc: NatsConnection, serviceName: string, version: string) {
super(nc, serviceName);
// Register health check endpoint
this.register<Record<string, never>, ServiceInfo>({
subject: `services.${serviceName}.health`,
queue: `${serviceName}-health`,
handler: async () => ({
name: serviceName,
version,
uptime: Date.now() - this.startTime,
handlers: this.getRegisteredSubjects(),
instanceId: this.instanceId,
}),
});
// Register ping for discovery
this.register<Record<string, never>, { name: string; instanceId: string }>({
subject: "services.discovery.ping",
queue: `${serviceName}-discovery`,
handler: async () => ({
name: serviceName,
instanceId: this.instanceId,
}),
});
}
}
// Discover all running services
async function discoverServices(nc: NatsConnection): Promise<string[]> {
  const services: string[] = [];
  // Subscribe to a unique inbox and advertise it as the reply subject,
  // so every responder knows where to send its answer
  const inbox = `_INBOX.${crypto.randomUUID()}`;
  const sub = nc.subscribe(inbox);
  nc.publish("services.discovery.ping", jc.encode({}), { reply: inbox });
  // Stop collecting after 2 seconds; unsubscribing ends the iterator
  setTimeout(() => sub.unsubscribe(), 2000);
  for await (const msg of sub) {
    const info = jc.decode(msg.data) as { name: string; instanceId: string };
    services.push(`${info.name} (${info.instanceId})`);
  }
  return services;
}

Because every service instance subscribes to services.discovery.ping with a queue group, exactly one instance of each service responds. If you want all instances to respond (for a complete cluster inventory), remove the queue group option.
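The collect-for-a-fixed-window step generalizes to any scatter-gather call. Below is a hypothetical, transport-agnostic sketch of the same idea: drain an async iterable until a deadline and return whatever arrived. A NATS subscription is an async iterable, so the same shape works for inbox collection.

```typescript
// Hypothetical helper: consume an async iterable until `ms` elapses
async function collectUntil<T>(source: AsyncIterable<T>, ms: number): Promise<T[]> {
  const results: T[] = [];
  const deadline = new Promise<"deadline">((resolve) =>
    setTimeout(() => resolve("deadline"), ms)
  );
  const it = source[Symbol.asyncIterator]();
  while (true) {
    // Race the next item against the deadline
    const next = await Promise.race([it.next(), deadline]);
    if (next === "deadline" || next.done) break;
    results.push(next.value);
  }
  return results;
}
```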
Circuit Breaker Pattern
When a downstream service is failing, continuing to send requests wastes resources and can cascade the failure. A circuit breaker stops sending requests after a threshold of failures:
enum CircuitState {
Closed = "closed",
Open = "open",
HalfOpen = "half-open",
}
class CircuitBreaker {
  private state: CircuitState = CircuitState.Closed;
  private failureCount = 0;
  private lastFailureTime = 0;
  private halfOpenCalls = 0;
  constructor(
    private client: ServiceClient,
    private options: {
      failureThreshold: number;
      resetTimeMs: number;
      halfOpenMaxCalls: number;
    }
  ) {}
  async call<TReq, TRes>(subject: string, request: TReq): Promise<TRes> {
    if (this.state === CircuitState.Open) {
      if (Date.now() - this.lastFailureTime > this.options.resetTimeMs) {
        // Let a limited number of probe calls through
        this.state = CircuitState.HalfOpen;
        this.halfOpenCalls = 0;
      } else {
        throw new Error(
          `Circuit breaker is open for ${subject}. Service is unavailable.`
        );
      }
    }
    if (this.state === CircuitState.HalfOpen) {
      if (this.halfOpenCalls >= this.options.halfOpenMaxCalls) {
        throw new Error(
          `Circuit breaker is half-open for ${subject}. Probe limit reached.`
        );
      }
      this.halfOpenCalls++;
    }
    try {
      const result = await this.client.call<TReq, TRes>(subject, request);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  private onSuccess(): void {
    this.failureCount = 0;
    this.state = CircuitState.Closed;
  }
  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // A failed probe while half-open reopens the circuit immediately
    if (
      this.state === CircuitState.HalfOpen ||
      this.failureCount >= this.options.failureThreshold
    ) {
      this.state = CircuitState.Open;
      console.warn(`Circuit breaker opened after ${this.failureCount} failures`);
    }
  }
}
// Usage
const userServiceBreaker = new CircuitBreaker(client, {
failureThreshold: 5,
resetTimeMs: 30_000,
halfOpenMaxCalls: 1,
});
try {
const user = await userServiceBreaker.call<
{ userId: string },
{ id: string; name: string }
>("services.users.get", { userId: "usr_123" });
} catch (error) {
// Handle circuit open or actual failure
console.error("User service unavailable:", error);
}

Event-Driven Sagas
Microservices often need to coordinate multi-step workflows that span several services. NATS's combination of request/reply and pub/sub makes the saga pattern natural to implement:
async function processOrderSaga(
nc: NatsConnection,
client: ServiceClient,
orderRequest: { userId: string; items: Array<{ productId: string; quantity: number }> }
) {
const sagaId = crypto.randomUUID();
try {
// Step 1: Validate user
const user = await client.call<{ userId: string }, { id: string; name: string }>(
"services.users.get",
{ userId: orderRequest.userId }
);
// Step 2: Reserve inventory
const reservation = await client.call<
{ items: typeof orderRequest.items; sagaId: string },
{ reservationId: string; total: number }
>("services.inventory.reserve", {
items: orderRequest.items,
sagaId,
});
// Step 3: Process payment
const payment = await client.call<
{ userId: string; amount: number; sagaId: string },
{ paymentId: string; status: string }
>("services.payments.charge", {
userId: orderRequest.userId,
amount: reservation.total,
sagaId,
});
// Step 4: Confirm order
const order = await client.call<
{ userId: string; items: typeof orderRequest.items; paymentId: string },
{ orderId: string }
>("services.orders.confirm", {
userId: orderRequest.userId,
items: orderRequest.items,
paymentId: payment.paymentId,
});
// Publish success event
nc.publish(
"events.sagas.order.completed",
jc.encode({ sagaId, orderId: order.orderId })
);
return order;
} catch (error) {
// Compensate: publish rollback event
nc.publish(
"events.sagas.order.failed",
jc.encode({
sagaId,
error: error instanceof Error ? error.message : "Unknown error",
})
);
throw error;
}
}

Each service listens for the saga failure event and reverses its own step (releasing inventory, refunding payment). The saga coordinator does not need to know the compensation logic; it only needs to signal that the saga failed.
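The per-service side of that contract reduces to a pure decision: given the steps a saga completed locally, which actions undo them. The planner below is a hypothetical sketch (CompensationAction and compensationsFor are not part of the services above); a service would run these actions when it receives events.sagas.order.failed, reversing steps in the opposite order they were applied.

```typescript
// Hypothetical compensation planner: maps completed saga steps to the
// actions that undo them, newest step first
type CompensationAction =
  | { kind: "refund-payment"; paymentId: string }
  | { kind: "release-inventory"; reservationId: string };

function compensationsFor(completed: {
  reservationId?: string;
  paymentId?: string;
}): CompensationAction[] {
  const actions: CompensationAction[] = [];
  // Payment happened after the reservation, so it is reversed first
  if (completed.paymentId) {
    actions.push({ kind: "refund-payment", paymentId: completed.paymentId });
  }
  if (completed.reservationId) {
    actions.push({ kind: "release-inventory", reservationId: completed.reservationId });
  }
  return actions;
}
```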
Practical Tips for NATS Microservices
Establish a subject naming convention early and enforce it. We use services.{name}.{operation} for request/reply and events.{domain}.{action} for pub/sub events. This convention makes it easy to set up NATS authorization rules and monitoring filters.
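A convention is only useful if it is checked. Below is a hypothetical validator you could run in CI or at handler-registration time; the segment grammar (lowercase alphanumeric tokens, an optional extra qualifier on events such as events.sagas.order.failed) is an assumption, not something the article prescribes.

```typescript
// Hypothetical validator for the subject conventions described above:
// services.{name}.{operation} and events.{domain}.{action}[.{qualifier}]
const SERVICE_SUBJECT = /^services\.[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*$/;
const EVENT_SUBJECT = /^events\.[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*(\.[a-z][a-z0-9-]*)?$/;

function isValidSubject(subject: string): boolean {
  return SERVICE_SUBJECT.test(subject) || EVENT_SUBJECT.test(subject);
}
```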
Always set explicit timeouts on service calls. The nats.js client applies a short default timeout to nc.request, but relying on it hides your latency assumptions. We default to 5 seconds and adjust based on the expected operation latency.
Use queue groups for every service handler. Even if you currently run a single instance, the queue group costs nothing and makes horizontal scaling a zero-code-change operation.
Monitor NATS message throughput as a service health signal. If traffic on a service's subjects suddenly drops to zero, the service is likely down. If throughput doubles unexpectedly, something upstream changed. The NATS server's monitoring endpoints (varz, connz, subsz) expose message and subscription statistics you can sample to build these signals.
Test service-to-service communication with chaos engineering. Kill service instances during load testing and verify that requests are rerouted to healthy instances. Introduce artificial latency and verify that timeouts and circuit breakers activate correctly.
Conclusion
NATS provides a uniquely lightweight foundation for microservice communication. Service discovery becomes a natural property of subscriptions rather than a separate infrastructure component. Load balancing is built into queue groups. Resilience patterns like retries and circuit breakers compose naturally with the request/reply protocol. For teams building TypeScript microservices, the nats.js client library delivers these capabilities through an ergonomic, type-safe API that requires no configuration servers, no service registries, and no sidecar proxies. The result is a microservice architecture that is simpler to operate, easier to debug, and faster to develop than traditional HTTP-based alternatives.