Observability in Go: Tracing, Metrics, and Logging
A practical guide to implementing observability in Go backend services using OpenTelemetry for tracing, Prometheus for metrics, and structured logging with log/slog.
You cannot improve what you cannot see. In a microservice architecture, where a single user request traverses multiple services, databases, and message brokers, observability is not a luxury. It is a prerequisite for operating the system reliably. Andromeda's observability stack rests on three pillars: distributed tracing with OpenTelemetry, metrics with Prometheus, and structured logging with Go's standard log/slog package. Together, these pillars provide the visibility we need to debug issues, monitor performance, and plan capacity.
This article covers how we instrument Go services in Andromeda, the patterns we follow, and the pitfalls we have learned to avoid.
Distributed Tracing with OpenTelemetry
In a system where a gRPC call from the gateway service triggers calls to the accounts service and the payments service, which in turn publishes a NATS event consumed by the notifications service, understanding the full lifecycle of a request requires distributed tracing. Each service creates spans that are linked by a shared trace ID, forming a tree that visualizes the entire request flow.
We use OpenTelemetry (OTel) as our tracing standard. OTel provides a vendor-neutral API, SDKs for Go, and exporters for backends like Jaeger, Tempo, and Datadog.
Setting up the tracer provider is the first step:
// pkg/observability/tracer.go
package observability

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// InitTracer configures the global tracer provider and returns a shutdown
// function that flushes any buffered spans.
func InitTracer(ctx context.Context, serviceName, version string) (func(context.Context) error, error) {
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("creating OTLP exporter: %w", err)
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String(version),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("creating resource: %w", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)

	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return tp.Shutdown, nil
}

The sampler is set to 10% (TraceIDRatioBased(0.1)) with parent-based sampling: if a parent span was sampled, all of its child spans are sampled too, ensuring complete traces. For new traces without a parent, only 10% are sampled, which keeps overhead manageable at high traffic.
For gRPC services, OTel provides interceptors that automatically create spans for every RPC:
import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func newGRPCServer() *grpc.Server {
	return grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)
}

func newGRPCClient(addr string) (*grpc.ClientConn, error) {
	// grpc.NewClient requires transport credentials to be set explicitly;
	// substitute real TLS credentials in production.
	return grpc.NewClient(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
	)
}

For NATS, there is no built-in OTel integration, so we propagate trace context manually through message headers:
// pkg/natsutil/tracing.go
package natsutil

import (
	"context"

	"github.com/nats-io/nats.go"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// InjectTraceContext adds trace context to NATS message headers.
func InjectTraceContext(ctx context.Context, msg *nats.Msg) {
	if msg.Header == nil {
		msg.Header = nats.Header{}
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(msg.Header))
}

// ExtractTraceContext retrieves trace context from NATS message headers.
func ExtractTraceContext(ctx context.Context, msg *nats.Msg) context.Context {
	if msg.Header == nil {
		return ctx
	}
	return otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(msg.Header))
}

When publishing an event, we inject the trace context. When consuming, we extract it and use it as the parent context for the handler's span. This creates a continuous trace that flows through both gRPC calls and NATS events.
Custom Spans for Business Logic
The automatic gRPC spans cover the transport layer, but the most valuable tracing information comes from custom spans in business logic:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("payments/app")

func (s *PaymentService) ProcessPayment(ctx context.Context, req ProcessPaymentRequest) error {
	ctx, span := tracer.Start(ctx, "PaymentService.ProcessPayment")
	defer span.End()

	span.SetAttributes(
		attribute.String("payment.account_id", req.AccountID),
		attribute.Int64("payment.amount", req.Amount),
		attribute.String("payment.currency", req.Currency),
	)

	// Debit the account.
	if err := s.debitAccount(ctx, req); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	// Record the transaction.
	if err := s.recordTransaction(ctx, req); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.SetStatus(codes.Ok, "")
	return nil
}

func (s *PaymentService) debitAccount(ctx context.Context, req ProcessPaymentRequest) error {
	ctx, span := tracer.Start(ctx, "PaymentService.debitAccount")
	defer span.End()
	// ... debit logic
	return nil
}

Each child function creates its own span, building a detailed tree that shows exactly where time is spent. When a payment takes longer than expected, the trace immediately reveals whether the delay is in the account debit, the transaction recording, or the event publishing.
Metrics with Prometheus
While traces show you the behavior of individual requests, metrics show you the aggregate behavior of the system. We use Prometheus-style metrics exposed via an HTTP endpoint and scraped by our monitoring stack.
// pkg/observability/metrics.go
package observability

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	RequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "grpc_request_duration_seconds",
			Help:    "Duration of gRPC requests in seconds.",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
		},
		[]string{"method", "status"},
	)

	ActiveConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "grpc_active_connections",
			Help: "Number of active gRPC connections.",
		},
	)

	EventsPublished = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "nats_events_published_total",
			Help: "Total number of NATS events published.",
		},
		[]string{"subject"},
	)

	EventsConsumed = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "nats_events_consumed_total",
			Help: "Total number of NATS events consumed.",
		},
		[]string{"subject", "status"},
	)
)

func MetricsHandler() http.Handler {
	return promhttp.Handler()
}

We instrument gRPC calls with a metrics interceptor:
import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

func MetricsInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaserverInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	duration := time.Since(start).Seconds()

	// status.FromError yields an OK status for a nil error, so successes
	// and failures both land in the histogram with the right label.
	st, _ := status.FromError(err)
	RequestDuration.WithLabelValues(info.FullMethod, st.Code().String()).Observe(duration)
	return resp, err
}

The metrics we consider essential for every service are:
- Request rate (counter): how many requests per second the service handles.
- Error rate (counter): how many requests fail, broken down by error type.
- Duration (histogram): how long requests take, with percentile breakdowns.
- Saturation (gauge): how close the service is to its capacity limits (connection pool usage, goroutine count, memory).
The first three make up the RED method (Rate, Errors, Duration); adding saturation as a fourth signal gives a complete picture of service health.
Structured Logging with log/slog
Go 1.21 introduced log/slog, a structured logging package in the standard library. We adopted it across Andromeda, replacing our previous use of third-party logging libraries.
// pkg/observability/logging.go
package observability

import (
	"log/slog"
	"os"
)

func InitLogger(serviceName, environment string) *slog.Logger {
	var handler slog.Handler
	if environment == "production" {
		handler = slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
			Level: slog.LevelInfo,
		})
	} else {
		handler = slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
			Level: slog.LevelDebug,
		})
	}

	logger := slog.New(handler).With(
		slog.String("service", serviceName),
		slog.String("env", environment),
	)
	slog.SetDefault(logger)
	return logger
}

In production, logs are JSON for machine parsing. In development, they are human-readable text. The service name and environment are attached to every log entry as default attributes.
We correlate logs with traces by adding the trace ID to every log entry. A custom slog.Handler wrapper extracts the trace ID from the context:
import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

type traceHandler struct {
	inner slog.Handler
}

func (h *traceHandler) Handle(ctx context.Context, r slog.Record) error {
	span := trace.SpanFromContext(ctx)
	if span.SpanContext().IsValid() {
		r.AddAttrs(
			slog.String("trace_id", span.SpanContext().TraceID().String()),
			slog.String("span_id", span.SpanContext().SpanID().String()),
		)
	}
	return h.inner.Handle(ctx, r)
}

func (h *traceHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.inner.Enabled(ctx, level)
}

func (h *traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &traceHandler{inner: h.inner.WithAttrs(attrs)}
}

func (h *traceHandler) WithGroup(name string) slog.Handler {
	return &traceHandler{inner: h.inner.WithGroup(name)}
}

With this handler, every log entry written through a context-aware call such as InfoContext includes the trace and span IDs, making it trivial to jump from a log line to the full trace in our tracing backend.
Alerting Philosophy
Collecting telemetry is only half the battle. The other half is acting on it. Our alerting philosophy in Andromeda follows two principles:
Alert on symptoms, not causes. We alert when the error rate exceeds a threshold or when latency degrades, not when CPU usage is high or a specific dependency is slow. Symptom-based alerts reduce noise because they fire only when users are actually affected.
Every alert must be actionable. If an alert fires and the on-call engineer's response is "I don't know what to do about this," the alert is broken. Every alert links to a runbook that describes the diagnostic steps and potential remediations.
Conclusion
Observability in Go is not about installing a framework and hoping for the best. It is about intentionally instrumenting the code, choosing the right level of detail, and connecting the three pillars (traces, metrics, and logs) into a coherent system. OpenTelemetry provides the distributed tracing standard. Prometheus provides the metrics model. And Go's log/slog provides structured logging that correlates with traces.
The investment in observability pays for itself the first time you debug a production issue. Instead of guessing, you look at the trace. Instead of reading raw logs, you search by trace ID. Instead of wondering whether the system is healthy, you check the dashboard. Observability turns a black box into a glass box, and in a microservice architecture, that visibility is essential.