Idiomatic Error Handling in Go Microservices

A comprehensive guide to error handling patterns in Go microservices, covering error wrapping, sentinel errors, typed errors, and strategies for propagating errors across service boundaries.

technical · 8 min read · By Klivvr Engineering

Error handling in Go is famously verbose. The if err != nil pattern appears on nearly every other line, and newcomers often wonder whether there is a better way. After two years of building Andromeda, our answer is nuanced: the verbosity is the feature, not the bug. Explicit error handling forces you to think about failure at every step, which produces more reliable software. But verbosity without strategy produces a different kind of mess: errors that are swallowed, errors that lose context, and errors that cross service boundaries carrying internal implementation details.

This article covers the error handling patterns we have settled on in Andromeda. These patterns are not novel individually, but their combination creates a coherent system that works well across a monorepo of microservices.

Error Wrapping with Context

The most important error handling pattern in Go is wrapping errors with context using fmt.Errorf and the %w verb. Every function that receives an error from a callee should wrap it with enough context to understand where the error originated without reading a stack trace.

func (s *PaymentService) ProcessPayment(ctx context.Context, req ProcessPaymentRequest) error {
    account, err := s.accountRepo.FindByID(ctx, req.AccountID)
    if err != nil {
        return fmt.Errorf("finding account %s: %w", req.AccountID, err)
    }
 
    if err := account.Debit(req.Amount); err != nil {
        return fmt.Errorf("debiting account %s: %w", req.AccountID, err)
    }
 
    if err := s.accountRepo.Save(ctx, account); err != nil {
        return fmt.Errorf("saving account %s after debit: %w", req.AccountID, err)
    }
 
    if err := s.ledger.RecordTransaction(ctx, req.TransactionID, req.AccountID, req.Amount); err != nil {
        return fmt.Errorf("recording transaction %s: %w", req.TransactionID, err)
    }
 
    return nil
}

When this function fails, the error message reads like a breadcrumb trail: recording transaction tx_123: ledger unavailable: connection refused. Each layer adds its context, and the original error is preserved for programmatic inspection via errors.Is and errors.As.

We follow two rules for wrapping:

  1. Include identifying information. Do not wrap with just "finding account: %w". Include the account ID, the transaction ID, or whatever identifier helps an operator find the specific record that caused the failure.

  2. Use %w for errors the caller might need to inspect. Use %v for errors that should be opaque. If the caller needs to check errors.Is(err, domain.ErrNotFound), the error must be wrapped with %w. If the error is purely for logging and should not influence control flow upstream, %v breaks the chain intentionally.

Sentinel Errors and Typed Errors

Sentinel errors are package-level variables that represent specific, well-known failure conditions. They are the Go equivalent of named exceptions:

// internal/accounts/domain/errors.go
package domain
 
import "errors"
 
var (
    ErrNotFound          = errors.New("not found")
    ErrInsufficientFunds = errors.New("insufficient funds")
    ErrAccountFrozen     = errors.New("account is frozen")
    ErrDuplicateAccount  = errors.New("duplicate account")
)

Callers check for sentinel errors using errors.Is:

account, err := s.repo.FindByID(ctx, id)
if errors.Is(err, domain.ErrNotFound) {
    return nil, status.Error(codes.NotFound, "account not found")
}
if err != nil {
    return nil, status.Error(codes.Internal, "internal error")
}

Sentinel errors work well for simple conditions. For errors that carry structured data, we use typed errors:

// internal/payments/domain/errors.go
package domain
 
import "fmt"
 
type ValidationError struct {
    Field   string
    Message string
}
 
func (e *ValidationError) Error() string {
    return fmt.Sprintf("validation: %s: %s", e.Field, e.Message)
}
 
type LimitExceededError struct {
    Limit   int64
    Current int64
    Period  string
}
 
func (e *LimitExceededError) Error() string {
    return fmt.Sprintf("limit exceeded: %d/%d in %s", e.Current, e.Limit, e.Period)
}

Callers check for typed errors using errors.As:

var validationErr *domain.ValidationError
if errors.As(err, &validationErr) {
    return nil, status.Errorf(codes.InvalidArgument,
        "invalid %s: %s", validationErr.Field, validationErr.Message)
}
 
var limitErr *domain.LimitExceededError
if errors.As(err, &limitErr) {
    return nil, status.Errorf(codes.ResourceExhausted,
        "transaction limit exceeded: %d/%d per %s",
        limitErr.Current, limitErr.Limit, limitErr.Period)
}

Error Classification at Service Boundaries

In a microservice architecture, errors cross service boundaries. A gRPC handler receives an error from the application layer and must decide which gRPC status code to return. A NATS event handler receives an error and must decide whether to ack, nak, or terminate the message. These decisions require a classification system.

We use a centralized error mapper in each service's port layer:

// internal/accounts/ports/errors.go
package ports
 
import (
    "errors"
 
    "github.com/klivvr/andromeda/internal/accounts/domain"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)
 
func mapToGRPCError(err error) error {
    if err == nil {
        return nil
    }
 
    // Check sentinel errors first
    switch {
    case errors.Is(err, domain.ErrNotFound):
        return status.Error(codes.NotFound, sanitize(err))
    case errors.Is(err, domain.ErrInsufficientFunds):
        return status.Error(codes.FailedPrecondition, sanitize(err))
    case errors.Is(err, domain.ErrAccountFrozen):
        return status.Error(codes.FailedPrecondition, sanitize(err))
    case errors.Is(err, domain.ErrDuplicateAccount):
        return status.Error(codes.AlreadyExists, sanitize(err))
    }
 
    // Check typed errors
    var validationErr *domain.ValidationError
    if errors.As(err, &validationErr) {
        return status.Error(codes.InvalidArgument, validationErr.Error())
    }
 
    var limitErr *domain.LimitExceededError
    if errors.As(err, &limitErr) {
        return status.Error(codes.ResourceExhausted, limitErr.Error())
    }
 
    // Default: internal error, do not leak details
    return status.Error(codes.Internal, "internal error")
}
 
// sanitize returns a user-safe error message.
// It strips wrapping context that might contain internal details.
func sanitize(err error) string {
    // Unwrap to the root sentinel error
    for {
        unwrapped := errors.Unwrap(err)
        if unwrapped == nil {
            break
        }
        err = unwrapped
    }
    return err.Error()
}

The sanitize function is critical. Without it, an error like finding account acc_123 in shard 7: not found would leak internal details (shard numbers, account ID formats) to the client. By unwrapping to the root sentinel error, we return just not found.

Logging Errors with Structure

When an error reaches the top of the call stack, it must be logged with enough detail for debugging. We use log/slog for structured logging:

func (s *GRPCServer) GetAccount(
    ctx context.Context,
    req *accountsv1.GetAccountRequest,
) (*accountsv1.GetAccountResponse, error) {
    account, err := s.service.GetAccount(ctx, req.AccountId)
    if err != nil {
        grpcErr := mapToGRPCError(err)
 
        // Log the full error chain for debugging
        code := status.Code(grpcErr)
        if code == codes.Internal {
            // Internal errors are unexpected; log at Error level
            slog.ErrorContext(ctx, "GetAccount failed",
                "account_id", req.AccountId,
                "error", err.Error(),
                "grpc_code", code.String(),
            )
        } else {
            // Domain errors are expected; log at Info level
            slog.InfoContext(ctx, "GetAccount returned error",
                "account_id", req.AccountId,
                "error", err.Error(),
                "grpc_code", code.String(),
            )
        }
 
        return nil, grpcErr
    }
 
    return &accountsv1.GetAccountResponse{
        Account: toProtoAccount(account),
    }, nil
}

The distinction between Error and Info log levels is important. Domain errors (not found, validation failures, insufficient funds) are expected conditions that do not indicate a bug. They should be logged at Info or Warn level so that dashboards and alerts are not polluted. Internal errors indicate something genuinely wrong and should be logged at Error level to trigger alerts.

Error Handling in Concurrent Code

Go's concurrency primitives introduce additional error handling considerations. When multiple goroutines perform work, their errors must be collected and surfaced:

func (s *ReconciliationService) ReconcileAll(ctx context.Context, accountIDs []string) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(10) // limit concurrency
 
    for _, id := range accountIDs {
        id := id // capture loop variable (no longer needed as of Go 1.22)
        g.Go(func() error {
            if err := s.reconcileOne(ctx, id); err != nil {
                return fmt.Errorf("reconciling %s: %w", id, err)
            }
            return nil
        })
    }
 
    return g.Wait()
}

The errgroup package from golang.org/x/sync is our standard tool for this. It cancels the context on the first error, collects the error, and waits for all goroutines to finish. For cases where we want to continue despite individual failures and collect all errors, we use a custom multi-error type:

type MultiError struct {
    errors []error
}
 
func (m *MultiError) Add(err error) {
    if err != nil {
        m.errors = append(m.errors, err)
    }
}
 
func (m *MultiError) Err() error {
    if len(m.errors) == 0 {
        return nil
    }
    return m
}

// Unwrap exposes the collected errors so errors.Is and errors.As
// can traverse them (multi-error unwrapping, Go 1.20+).
func (m *MultiError) Unwrap() []error {
    return m.errors
}
 
func (m *MultiError) Error() string {
    msgs := make([]string, len(m.errors))
    for i, err := range m.errors {
        msgs[i] = err.Error()
    }
    return fmt.Sprintf("%d errors: [%s]", len(m.errors), strings.Join(msgs, "; "))
}

Panic Recovery

We have a strict rule in Andromeda: no panics in library code. Panics are reserved for truly unrecoverable situations (programmer errors, corrupted state) and should never be used for expected error conditions. That said, panics happen, and we install recovery middleware in every entry point:

func RecoveryInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (resp interface{}, err error) {
    defer func() {
        if r := recover(); r != nil {
            slog.ErrorContext(ctx, "panic recovered",
                "method", info.FullMethod,
                "panic", fmt.Sprintf("%v", r),
                "stack", string(debug.Stack()),
            )
            err = status.Error(codes.Internal, "internal error")
        }
    }()
 
    return handler(ctx, req)
}

This interceptor catches panics, logs them with a full stack trace, and returns a safe gRPC error to the client. The service stays up and continues serving subsequent requests.

Conclusion

Error handling in Go microservices is a system, not a collection of individual if err != nil checks. The system starts with consistent wrapping that preserves context and the error chain. It continues with sentinel and typed errors that enable programmatic classification. At service boundaries, errors are mapped to transport-appropriate representations while internal details are stripped. Logging distinguishes between expected domain errors and unexpected internal errors. Concurrent code uses errgroup or multi-error types to collect failures. And panic recovery ensures that even unexpected crashes do not take down the service.

The investment in a coherent error handling strategy pays off every time an operator investigates a production issue. Clear, contextual error messages reduce mean time to diagnosis. Proper classification reduces alert fatigue. And consistent patterns across all services mean that any engineer can debug any service, even one they have never worked on before.
