gRPC and NATS: Building Resilient Service Communication

How Andromeda combines gRPC for synchronous service-to-service calls with NATS for asynchronous event-driven messaging, and the patterns that make this combination robust.

technical · 8 min read · By Klivvr Engineering

Microservice architectures live and die by their communication layer. Choose poorly and you end up with a brittle, tightly coupled system that is harder to operate than the monolith it replaced. In Andromeda, we settled on a dual-protocol strategy early: gRPC for synchronous, request-response interactions where a caller needs an immediate answer, and NATS for asynchronous, event-driven flows where services need to react to changes without blocking each other. This pairing has proven to be one of the best architectural decisions we made, giving us the strong typing and performance of gRPC alongside the decoupling and resilience of a message broker.

This article explains how we integrate both protocols, the patterns we follow, and the lessons we learned running this combination in production.

Why Two Protocols

A reasonable first question is why not pick one and stick with it. gRPC supports streaming, so it can handle some asynchronous patterns. NATS supports request-reply, so it can handle synchronous calls. The answer is that each protocol excels in its primary mode and introduces friction when forced into the other.

gRPC gives us strongly typed contracts via Protocol Buffers, automatic code generation for both client and server, built-in deadline propagation, and excellent tooling for load balancing and health checking. For calls like "fetch user profile" or "validate payment method," where the caller is blocked until it gets a response, gRPC is ideal.

NATS, on the other hand, gives us fire-and-forget publishing, fan-out to multiple subscribers, durable message streams via JetStream, and natural decoupling between producers and consumers. For events like "order placed," "KYC check completed," or "balance updated," where multiple services need to react independently, NATS is the right tool.

Using both means each service has two well-defined communication surfaces: a gRPC server for queries and commands that need responses, and a NATS subscriber for events it cares about.

gRPC Service Definitions and Generation

Every gRPC service in Andromeda starts with a Protocol Buffer definition in the proto/ directory. We follow a strict naming convention:

// proto/accounts/v1/accounts.proto
syntax = "proto3";
 
package accounts.v1;
 
option go_package = "github.com/klivvr/andromeda/gen/accounts/v1;accountsv1";
 
service AccountsService {
    rpc GetAccount(GetAccountRequest) returns (GetAccountResponse);
    rpc ListAccounts(ListAccountsRequest) returns (ListAccountsResponse);
    rpc CreateAccount(CreateAccountRequest) returns (CreateAccountResponse);
}
 
message GetAccountRequest {
    string account_id = 1;
}
 
message GetAccountResponse {
    Account account = 1;
}
 
message Account {
    string id = 1;
    string owner_id = 2;
    string currency = 3;
    int64 balance_minor_units = 4;
    string status = 5;
    string created_at = 6;
}
 
message ListAccountsRequest {
    string owner_id = 1;
    int32 page_size = 2;
    string page_token = 3;
}
 
message ListAccountsResponse {
    repeated Account accounts = 1;
    string next_page_token = 2;
}
 
message CreateAccountRequest {
    string owner_id = 1;
    string currency = 2;
}
 
message CreateAccountResponse {
    Account account = 1;
}

Code generation is handled by a Makefile target that invokes protoc with the Go and gRPC plugins. The generated code lands in a gen/ directory that is committed to the repository. We commit generated code because it makes builds faster, eliminates the need for protoc on every developer machine, and ensures that code review captures any contract changes.

On the server side, implementing the generated interface is straightforward:

// internal/accounts/ports/grpc.go
package ports
 
import (
    "context"
 
    accountsv1 "github.com/klivvr/andromeda/gen/accounts/v1"
    "github.com/klivvr/andromeda/internal/accounts/app"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)
 
type GRPCServer struct {
    accountsv1.UnimplementedAccountsServiceServer
    service *app.AccountService
}
 
func NewGRPCServer(svc *app.AccountService) *GRPCServer {
    return &GRPCServer{service: svc}
}
 
func (s *GRPCServer) GetAccount(
    ctx context.Context,
    req *accountsv1.GetAccountRequest,
) (*accountsv1.GetAccountResponse, error) {
    if req.AccountId == "" {
        return nil, status.Error(codes.InvalidArgument, "account_id is required")
    }
 
    account, err := s.service.GetAccount(ctx, req.AccountId)
    if err != nil {
        return nil, mapDomainError(err)
    }
 
    return &accountsv1.GetAccountResponse{
        Account: toProtoAccount(account),
    }, nil
}
 
func mapDomainError(err error) error {
    switch {
    case app.IsNotFound(err):
        return status.Error(codes.NotFound, err.Error())
    case app.IsValidation(err):
        return status.Error(codes.InvalidArgument, err.Error())
    default:
        return status.Error(codes.Internal, "internal error")
    }
}

The pattern is deliberate: the gRPC layer is thin. It validates the request, calls the application service, maps domain errors to gRPC status codes, and converts domain types to protobuf types. No business logic lives here.

NATS Event Publishing and Subscribing

For asynchronous communication, we use NATS JetStream. JetStream adds persistence, at-least-once delivery, and consumer groups to core NATS, making it suitable for events that must not be lost.

Events are published as serialized Protocol Buffers on well-known subjects. We use a hierarchical subject namespace:

// pkg/natsutil/subjects.go
package natsutil
 
const (
    SubjectAccountCreated   = "events.accounts.created"
    SubjectAccountUpdated   = "events.accounts.updated"
    SubjectPaymentCompleted = "events.payments.completed"
    SubjectPaymentFailed    = "events.payments.failed"
    SubjectKYCApproved      = "events.kyc.approved"
    SubjectKYCRejected      = "events.kyc.rejected"
)
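Durable consumers attach to a stream, and the subscriber code later in this article assumes an `events` stream covering this subject namespace. Provisioning it is a one-time setup step; a minimal sketch using the `nats.go` jetstream API, with illustrative (not production) retention settings:

```go
package infra

import (
    "context"
    "time"

    "github.com/nats-io/nats.go/jetstream"
)

// ensureEventsStream provisions the "events" stream, capturing every
// subject under the events.> hierarchy. The storage and retention
// settings here are illustrative, not our production values.
func ensureEventsStream(ctx context.Context, js jetstream.JetStream) error {
    _, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
        Name:     "events",
        Subjects: []string{"events.>"},
        Storage:  jetstream.FileStorage,
        MaxAge:   30 * 24 * time.Hour, // keep events for 30 days
    })
    return err
}
```

CreateOrUpdateStream is idempotent, so each service can call this at startup without coordinating who owns stream creation.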

Publishing an event from the accounts service looks like this:

// internal/accounts/infra/publisher.go
package infra
 
import (
    "context"
    "time"
 
    "github.com/klivvr/andromeda/pkg/natsutil"
    "github.com/nats-io/nats.go/jetstream"
    "google.golang.org/protobuf/proto"
 
    eventsv1 "github.com/klivvr/andromeda/gen/events/v1"
)
 
type EventPublisher struct {
    js jetstream.JetStream
}
 
func NewEventPublisher(js jetstream.JetStream) *EventPublisher {
    return &EventPublisher{js: js}
}
 
func (p *EventPublisher) AccountCreated(ctx context.Context, accountID, ownerID, currency string) error {
    event := &eventsv1.AccountCreatedEvent{
        AccountId: accountID,
        OwnerId:   ownerID,
        Currency:  currency,
        Timestamp: time.Now().UTC().Format(time.RFC3339),
    }
 
    data, err := proto.Marshal(event)
    if err != nil {
        return err
    }
 
    _, err = p.js.Publish(ctx, natsutil.SubjectAccountCreated, data)
    return err
}

On the consuming side, we create durable consumers so that messages are not lost if a service restarts:

// internal/notifications/infra/subscriber.go
package infra
 
import (
    "context"
    "log/slog"
 
    "github.com/klivvr/andromeda/pkg/natsutil"
    "github.com/nats-io/nats.go/jetstream"
    "google.golang.org/protobuf/proto"
 
    eventsv1 "github.com/klivvr/andromeda/gen/events/v1"
)
 
type AccountEventHandler struct {
    logger   *slog.Logger
    notifSvc NotificationSender
}
 
func (h *AccountEventHandler) Start(ctx context.Context, js jetstream.JetStream) error {
    consumer, err := js.CreateOrUpdateConsumer(ctx, "events", jetstream.ConsumerConfig{
        Durable:       "notifications-account-created",
        FilterSubject: natsutil.SubjectAccountCreated,
        AckPolicy:     jetstream.AckExplicitPolicy,
        MaxDeliver:    5,
    })
    if err != nil {
        return err
    }
 
    iter, err := consumer.Messages()
    if err != nil {
        return err
    }
 
    go func() {
        for {
            msg, err := iter.Next()
            if err != nil {
                h.logger.Error("fetching message", "error", err)
                return
            }
            h.handleAccountCreated(ctx, msg)
        }
    }()
 
    go func() {
        <-ctx.Done()
        iter.Stop()
    }()
 
    return nil
}
 
func (h *AccountEventHandler) handleAccountCreated(ctx context.Context, msg jetstream.Msg) {
    var event eventsv1.AccountCreatedEvent
    if err := proto.Unmarshal(msg.Data(), &event); err != nil {
        h.logger.Error("unmarshaling event", "error", err)
        msg.Term()
        return
    }
 
    if err := h.notifSvc.SendWelcome(ctx, event.OwnerId); err != nil {
        h.logger.Error("sending welcome", "error", err, "owner_id", event.OwnerId)
        msg.Nak()
        return
    }
 
    msg.Ack()
}

The explicit acknowledge and negative-acknowledge calls give us fine-grained control over retry behavior. If a message cannot be processed because of a transient error, Nak() asks the server to redeliver it. If the message is fundamentally malformed, Term() stops redelivery permanently.

Patterns for Resilience

Running two communication protocols in production requires attention to failure modes. Here are the patterns we rely on:

Deadlines everywhere. Every gRPC call carries a context deadline. We set service-level defaults and allow callers to override with shorter deadlines. A missing deadline is a bug.

func (c *Client) GetAccount(ctx context.Context, id string) (*Account, error) {
    ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
    defer cancel()
 
    resp, err := c.grpc.GetAccount(ctx, &accountsv1.GetAccountRequest{
        AccountId: id,
    })
    if err != nil {
        return nil, fmt.Errorf("get account %s: %w", id, err)
    }
    return fromProto(resp.Account), nil
}

Circuit breaking for gRPC. We wrap gRPC client calls with a circuit breaker that opens after a configurable number of consecutive failures. When the circuit is open, calls fail immediately without hitting the network, giving the downstream service time to recover.

Idempotent event handlers. Because NATS provides at-least-once delivery, every event handler must be idempotent. We achieve this by including a unique event ID in every published message and using a deduplication table in the database.
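The dedup guard itself is a single conditional insert. In production the processed-events table lives in the database, where "insert if absent" is one atomic statement; this in-memory sketch (hypothetical names) shows the shape of the check:

```go
package main

import (
    "fmt"
    "sync"
)

// DedupStore records which event IDs have been processed. The
// production version is a table keyed by event ID, where the insert
// is atomic (e.g. INSERT ... ON CONFLICT DO NOTHING).
type DedupStore struct {
    mu   sync.Mutex
    seen map[string]bool
}

func NewDedupStore() *DedupStore {
    return &DedupStore{seen: make(map[string]bool)}
}

// MarkProcessed returns true the first time an event ID is seen and
// false for redeliveries, which the handler acks without side effects.
func (d *DedupStore) MarkProcessed(eventID string) bool {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.seen[eventID] {
        return false
    }
    d.seen[eventID] = true
    return true
}

func main() {
    store := NewDedupStore()
    for _, id := range []string{"evt-1", "evt-1", "evt-2"} {
        if store.MarkProcessed(id) {
            fmt.Println("processing", id)
        } else {
            fmt.Println("duplicate, acking without side effects:", id)
        }
    }
}
```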

Dead letter subjects. When a message exceeds its maximum delivery count, JetStream stops redelivering it and publishes an advisory on a well-known subject. We monitor those advisories with alerts so that permanently failed messages are investigated rather than silently dropped.
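As of current NATS server releases, these advisories arrive on subjects of the form $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.&lt;stream&gt;.&lt;consumer&gt;. A small helper of our own (not part of nats.go) that extracts the stream and consumer so the alert can name the failing durable:

```go
package main

import (
    "fmt"
    "strings"
)

// maxDeliveriesPrefix is the advisory subject prefix the NATS server
// uses when a message exceeds a consumer's MaxDeliver setting.
const maxDeliveriesPrefix = "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES."

// parseMaxDeliveries extracts the stream and consumer name from a
// max-deliveries advisory subject. Stream and consumer names cannot
// contain dots, so a single split is sufficient.
func parseMaxDeliveries(subject string) (stream, consumer string, ok bool) {
    if !strings.HasPrefix(subject, maxDeliveriesPrefix) {
        return "", "", false
    }
    parts := strings.SplitN(strings.TrimPrefix(subject, maxDeliveriesPrefix), ".", 2)
    if len(parts) != 2 {
        return "", "", false
    }
    return parts[0], parts[1], true
}

func main() {
    s, c, ok := parseMaxDeliveries(
        "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.events.notifications-account-created")
    fmt.Println(s, c, ok)
}
```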

Bridging Sync and Async

Some operations start as a synchronous gRPC call but trigger asynchronous downstream processing. For example, creating a new account is a synchronous gRPC call that returns the created account to the caller. But it also publishes an AccountCreated event so that the notifications service can send a welcome email and the analytics service can record the event.

The key pattern here is that the gRPC handler commits the database transaction first, then publishes the event. If the event publish fails, we log the failure and rely on a background reconciliation process that scans for unpublished events. This avoids the complexity of distributed transactions while providing eventual consistency.

func (s *AccountService) CreateAccount(ctx context.Context, ownerID, currency string) (*Account, error) {
    account, err := s.repo.Create(ctx, ownerID, currency)
    if err != nil {
        return nil, err
    }
 
    if err := s.publisher.AccountCreated(ctx, account.ID, ownerID, currency); err != nil {
        s.logger.Error("publishing account created event",
            "error", err,
            "account_id", account.ID,
        )
        // Event will be picked up by the reconciler
    }
 
    return account, nil
}

Conclusion

The combination of gRPC and NATS gives Andromeda a communication layer that is both performant and resilient. gRPC handles the synchronous, latency-sensitive path with strong typing and excellent tooling. NATS handles the asynchronous, event-driven path with durable delivery and natural decoupling. The patterns described here (thin gRPC adapters, protobuf-serialized events, explicit acknowledgment, deadlines, circuit breaking, idempotency, and sync-to-async bridging) form a coherent system that has served us well through rapid growth and evolving requirements.

The most important lesson is to let each protocol do what it does best. Do not force gRPC into an event bus role, and do not use NATS for request-response calls that need strong typing and deadline propagation. Use both, define clear boundaries, and invest in the resilience patterns that make the combination production-ready.
