API Monitoring and Alerting Best Practices

A comprehensive guide to monitoring API gateways in production, covering the four golden signals, structured logging, distributed tracing, and actionable alerting strategies.

By Klivvr Engineering · 11 min read

An API gateway processes every request your platform handles. When it degrades, everything degrades. When it goes down, everything goes down. This central position makes monitoring and alerting not just important, but existential. You need to know within seconds when something is wrong, within minutes what is causing it, and within hours how to prevent it from happening again. This article describes the monitoring and alerting practices we have built around Dispatch, our edge-deployed API gateway, and the lessons we have learned from running it in production for fintech clients where downtime is measured in lost transactions and regulatory scrutiny.

The Four Golden Signals

Google's Site Reliability Engineering handbook defines four golden signals for monitoring any distributed system: latency, traffic, errors, and saturation. For an API gateway, these translate into specific, measurable metrics.

Latency is the time a request spends in the gateway. We measure this at three points: total request duration (client to client), gateway processing time (excluding upstream wait), and upstream response time. The distinction matters because a latency spike could indicate a problem in the gateway itself, a slow upstream service, or a network issue between the two.

import { Hono } from 'hono'

const app = new Hono()
interface GatewayMetrics {
  totalDuration: number
  gatewayProcessing: number
  upstreamDuration: number
  route: string
  method: string
  statusCode: number
  upstream: string
}
 
app.use('*', async (c, next) => {
  const totalStart = performance.now()
  let upstreamStart = 0
  let upstreamEnd = 0
 
  // Instrument upstream calls by wrapping fetch for the duration of this
  // request. This captures only the most recent upstream call and assumes
  // a single upstream fetch per request.
  const originalFetch = globalThis.fetch
  globalThis.fetch = async (...args) => {
    upstreamStart = performance.now()
    const result = await originalFetch(...args)
    upstreamEnd = performance.now()
    return result
  }
 
  try {
    await next()
  } finally {
    const totalEnd = performance.now()
    globalThis.fetch = originalFetch
 
    const metrics: GatewayMetrics = {
      totalDuration: totalEnd - totalStart,
      gatewayProcessing:
        totalEnd - totalStart - (upstreamEnd - upstreamStart),
      upstreamDuration: upstreamEnd - upstreamStart,
      route: c.req.routePath || c.req.path,
      method: c.req.method,
      statusCode: c.res?.status || 500,
      upstream: c.get('upstreamService') || 'unknown',
    }
 
    // reportMetrics (defined elsewhere) ships the record to the metrics
    // backend without blocking response delivery
    c.executionCtx.waitUntil(reportMetrics(metrics))
  }
})

Traffic is the volume of requests the gateway handles. We track this as requests per second, broken down by route, method, and client. Traffic patterns reveal usage trends, identify abuse, and provide the baseline for anomaly detection. A sudden spike in traffic to a single endpoint might indicate a DDoS attack. A gradual increase over weeks signals organic growth that may require capacity planning.
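
The per-route aggregation behind these breakdowns can be kept in fixed one-second buckets before being shipped to a time-series backend. A minimal in-memory sketch, where the bucket granularity and key shape are illustrative rather than Dispatch's actual implementation:

```typescript
// Counts requests in one-second buckets, keyed by method and route
class TrafficCounter {
  private buckets = new Map<string, Map<number, number>>()

  record(route: string, method: string, nowMs: number = Date.now()): void {
    const key = `${method} ${route}`
    const second = Math.floor(nowMs / 1000)
    const series = this.buckets.get(key) ?? new Map<number, number>()
    series.set(second, (series.get(second) ?? 0) + 1)
    this.buckets.set(key, series)
  }

  // Requests per second, averaged over the trailing window
  rps(
    route: string,
    method: string,
    windowSeconds: number,
    nowMs: number = Date.now()
  ): number {
    const series = this.buckets.get(`${method} ${route}`)
    if (!series) return 0
    const nowSecond = Math.floor(nowMs / 1000)
    let total = 0
    for (let s = nowSecond - windowSeconds + 1; s <= nowSecond; s++) {
      total += series.get(s) ?? 0
    }
    return total / windowSeconds
  }
}
```

In a multi-isolate edge deployment, each isolate would flush its buckets periodically and the backend would sum across isolates before computing the global rate.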

Errors encompass both HTTP error responses (4xx and 5xx) and internal gateway failures. We distinguish between client errors (validation failures, authentication failures) and server errors (upstream timeouts, gateway bugs). The error rate -- errors as a percentage of total traffic -- is a more useful metric than the error count because it normalizes for traffic volume.
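
The split and the rate calculation can be sketched as follows; `classifyStatus` and `serverErrorRate` are illustrative helpers rather than Dispatch internals, and the alert rules later in this article key on the 5xx rate specifically:

```typescript
type ErrorClass = 'success' | 'client_error' | 'server_error'

// Client errors (4xx) usually indicate caller mistakes such as validation
// or authentication failures; server errors (5xx) indicate upstream
// timeouts or gateway bugs
function classifyStatus(status: number): ErrorClass {
  if (status >= 500) return 'server_error'
  if (status >= 400) return 'client_error'
  return 'success'
}

// Server-error rate: 5xx responses as a fraction of total traffic,
// which normalizes for traffic volume
function serverErrorRate(statusCodes: number[]): number {
  if (statusCodes.length === 0) return 0
  const errors = statusCodes.filter((s) => s >= 500).length
  return errors / statusCodes.length
}
```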

Saturation in an edge-deployed gateway is less about CPU or memory (those are managed by the runtime) and more about rate limit utilization, upstream connection pool usage, and KV storage operation counts. When a client approaches their rate limit, they are close to saturation from the gateway's perspective.
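
Rate-limit utilization is simply quota consumed over quota available within the current window. A small sketch, with thresholds that are illustrative rather than Dispatch's actual values:

```typescript
interface RateLimitState {
  used: number
  limit: number
}

// Fraction of the client's quota consumed in the current window
function saturation(state: RateLimitState): number {
  return state.limit === 0 ? 1 : state.used / state.limit
}

// Illustrative thresholds: warn as a client nears its quota, escalate
// when rejections are imminent
function saturationLevel(state: RateLimitState): 'ok' | 'warning' | 'critical' {
  const s = saturation(state)
  if (s >= 0.95) return 'critical'
  if (s >= 0.8) return 'warning'
  return 'ok'
}
```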

Structured Logging for Edge Environments

Traditional application logging -- writing to stdout and shipping to a log aggregation service -- does not work well at the edge. Edge functions have limited execution time, and synchronous log writes would block response delivery. Dispatch uses structured, asynchronous logging that captures the information needed for debugging without impacting request latency.

interface LogEntry {
  timestamp: string
  level: 'debug' | 'info' | 'warn' | 'error'
  message: string
  requestId: string
  route: string
  method: string
  statusCode: number
  duration: number
  clientIp: string
  userAgent: string
  userId?: string
  error?: {
    name: string
    message: string
    stack?: string
  }
  metadata?: Record<string, unknown>
}
 
class GatewayLogger {
  private buffer: LogEntry[] = []
 
  log(entry: LogEntry): void {
    this.buffer.push(entry)
  }
 
  async flush(ctx: ExecutionContext): Promise<void> {
    if (this.buffer.length === 0) return
 
    const entries = [...this.buffer]
    this.buffer = []
 
    ctx.waitUntil(
      fetch('https://logs.klivvr.com/ingest', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${LOG_API_KEY}`,
        },
        body: JSON.stringify({ entries }),
      })
    )
  }
}
 
// Usage in middleware
app.use('*', async (c, next) => {
  const logger = new GatewayLogger()
  c.set('logger', logger)
 
  const start = performance.now()
 
  try {
    await next()
  } catch (error) {
    logger.log({
      timestamp: new Date().toISOString(),
      level: 'error',
      message: 'Unhandled error in request pipeline',
      requestId: c.get('requestId') || 'unknown',
      route: c.req.routePath || c.req.path,
      method: c.req.method,
      statusCode: 500,
      duration: performance.now() - start,
      clientIp: c.req.header('CF-Connecting-IP') || '',
      userAgent: c.req.header('User-Agent') || '',
      error: {
        name: (error as Error).name,
        message: (error as Error).message,
        stack: (error as Error).stack,
      },
    })
    throw error
  } finally {
    const duration = performance.now() - start
    const status = c.res?.status ?? 500
 
    logger.log({
      timestamp: new Date().toISOString(),
      level: status >= 500 ? 'error' : status >= 400 ? 'warn' : 'info',
      message: `${c.req.method} ${c.req.path} ${status}`,
      requestId: c.get('requestId') || 'unknown',
      route: c.req.routePath || c.req.path,
      method: c.req.method,
      statusCode: status,
      duration,
      clientIp: c.req.header('CF-Connecting-IP') || '',
      userAgent: c.req.header('User-Agent') || '',
      userId: c.get('authenticatedUser')?.id,
    })
 
    logger.flush(c.executionCtx)
  }
})

Every log entry includes the request ID, which serves as the correlation key across the entire request lifecycle. When investigating an incident, an engineer can search for a request ID and see the complete chain of events: the gateway receiving the request, each middleware execution, the upstream call, and the response delivery.

Distributed Tracing Across the Gateway

Logs tell you what happened. Traces tell you how long each step took and in what order. For an API gateway that invokes multiple middleware layers and one or more upstream services, distributed tracing provides the detailed timing breakdown needed to diagnose performance issues.

Dispatch implements W3C Trace Context propagation, ensuring that traces span from the client through the gateway to upstream services:

interface TraceSpan {
  traceId: string
  spanId: string
  parentSpanId?: string
  operationName: string
  startTime: number
  duration: number
  status: 'ok' | 'error'
  tags: Record<string, string>
}
 
function generateSpanId(): string {
  return crypto.randomUUID().replace(/-/g, '').substring(0, 16)
}
 
function parseTraceParent(header: string | undefined): {
  traceId: string
  parentSpanId: string
} | null {
  if (!header) return null
  const parts = header.split('-')
  // version-traceId-spanId-flags, per W3C Trace Context
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) return null
  return { traceId: parts[1], parentSpanId: parts[2] }
}
 
app.use('*', async (c, next) => {
  const incoming = parseTraceParent(c.req.header('traceparent'))
  const traceId = incoming?.traceId || crypto.randomUUID().replace(/-/g, '')
  const spanId = generateSpanId()
  const parentSpanId = incoming?.parentSpanId
 
  c.set('traceId', traceId)
  c.set('spanId', spanId)
 
  // Build the traceparent for downstream propagation. Note that c.header
  // only sets it on the response to the client; the proxy layer must also
  // attach the stored value to outgoing upstream requests.
  const traceparent = `00-${traceId}-${spanId}-01`
  c.set('traceparent', traceparent)
  c.header('traceparent', traceparent)
 
  const spans: TraceSpan[] = []
  c.set('traceSpans', spans)
 
  const start = performance.now()
 
  try {
    await next()
 
    spans.push({
      traceId,
      spanId,
      parentSpanId,
      operationName: `gateway.${c.req.method.toLowerCase()}.${c.req.routePath}`,
      startTime: start,
      duration: performance.now() - start,
      status: (c.res?.status ?? 500) < 500 ? 'ok' : 'error',
      tags: {
        'http.method': c.req.method,
        'http.route': c.req.routePath || c.req.path,
        'http.status_code': String(c.res?.status || 500),
        'gateway.edge_location': c.req.header('CF-Ray')?.split('-')[1] || 'unknown',
      },
    })
  } finally {
    c.executionCtx.waitUntil(exportSpans(spans))
  }
})

When a performance issue is reported, the trace view shows exactly where time was spent: 2ms in authentication middleware, 1ms in rate limiting, 0.5ms in request validation, 145ms waiting for the upstream service, and 1ms in response processing. This immediately directs investigation to the upstream service rather than the gateway.

Alerting That Drives Action

The goal of alerting is to notify the right person at the right time with enough context to take action. Bad alerting -- too many alerts, vague descriptions, wrong recipients -- leads to alert fatigue, where engineers ignore notifications because they are overwhelmed with noise.

Dispatch's alerting strategy follows three principles: alert on symptoms rather than causes, require actionable runbooks, and escalate based on severity.

Alert on symptoms means we alert on user-facing impact, not internal metrics. We do not alert when CPU usage is high; we alert when the p95 latency exceeds the SLA threshold. We do not alert when an upstream is slow; we alert when the error rate for a specific API exceeds the baseline.

interface AlertRule {
  name: string
  condition: string
  threshold: number
  window: string
  severity: 'critical' | 'warning' | 'info'
  runbookUrl: string
  notifyChannels: string[]
}
 
const alertRules: AlertRule[] = [
  {
    name: 'gateway_high_error_rate',
    condition: 'rate(gateway_requests_total{status=~"5.."}[5m]) / rate(gateway_requests_total[5m]) > 0.05',
    threshold: 0.05,
    window: '5m',
    severity: 'critical',
    runbookUrl: 'https://runbooks.klivvr.com/gateway/high-error-rate',
    notifyChannels: ['pagerduty-platform', 'slack-incidents'],
  },
  {
    name: 'gateway_p95_latency_breach',
    condition: 'histogram_quantile(0.95, gateway_request_duration_seconds[5m]) > 0.1',
    threshold: 0.1,
    window: '5m',
    severity: 'warning',
    runbookUrl: 'https://runbooks.klivvr.com/gateway/latency-breach',
    notifyChannels: ['slack-platform-alerts'],
  },
  {
    name: 'upstream_circuit_open',
    condition: 'gateway_circuit_breaker_state{state="open"} > 0',
    threshold: 1,
    window: '1m',
    severity: 'critical',
    runbookUrl: 'https://runbooks.klivvr.com/gateway/circuit-breaker-open',
    notifyChannels: ['pagerduty-platform', 'slack-incidents'],
  },
  {
    name: 'rate_limit_saturation',
    condition: 'rate(gateway_rate_limit_rejected_total[5m]) / rate(gateway_requests_total[5m]) > 0.1',
    threshold: 0.1,
    window: '5m',
    severity: 'warning',
    runbookUrl: 'https://runbooks.klivvr.com/gateway/rate-limit-saturation',
    notifyChannels: ['slack-platform-alerts'],
  },
]

Require actionable runbooks means every alert links to a document that describes the symptom, likely causes, diagnostic steps, and remediation actions. When an engineer is paged at 3 AM, they should not need to reverse-engineer the system from scratch. The runbook provides a decision tree that gets them from alert to resolution as quickly as possible.

Escalate based on severity means critical alerts (service outages, data integrity issues) page on-call engineers immediately, while warnings (latency degradation, elevated error rates) post to Slack channels for next-business-day investigation. Information alerts are purely diagnostic and never notify anyone.
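
Expressed as code, the escalation policy is a pure mapping from severity to notification channels. The channel names mirror the alert rules above but are illustrative:

```typescript
type Severity = 'critical' | 'warning' | 'info'

// Critical pages on-call immediately; warnings post to Slack for
// next-business-day investigation; info alerts never notify anyone
function routeAlert(severity: Severity): string[] {
  switch (severity) {
    case 'critical':
      return ['pagerduty-platform', 'slack-incidents']
    case 'warning':
      return ['slack-platform-alerts']
    case 'info':
      return []
  }
}
```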

Health Check Endpoints and Synthetic Monitoring

Dispatch exposes multiple health check endpoints, each testing a different layer of the system:

app.get('/health', (c) => {
  // Shallow health check: is the gateway responding?
  return c.json({ status: 'ok', version: GATEWAY_VERSION })
})
 
app.get('/health/deep', async (c) => {
  const checks: Record<string, { status: string; latency: number }> = {}
 
  // Check KV connectivity
  const kvStart = performance.now()
  try {
    await c.env.CONFIG.get('health-check-key')
    checks.kv = { status: 'ok', latency: performance.now() - kvStart }
  } catch (e) {
    checks.kv = { status: 'error', latency: performance.now() - kvStart }
  }
 
  // Check each upstream service
  const upstreamServices = ['users', 'payments', 'accounts']
  for (const service of upstreamServices) {
    const start = performance.now()
    try {
      const res = await fetch(`${getUpstreamUrl(service)}/health`, {
        signal: AbortSignal.timeout(2000),
      })
      checks[service] = {
        status: res.ok ? 'ok' : 'degraded',
        latency: performance.now() - start,
      }
    } catch {
      checks[service] = {
        status: 'error',
        latency: performance.now() - start,
      }
    }
  }
 
  const allHealthy = Object.values(checks).every((check) => check.status === 'ok')
  const anyError = Object.values(checks).some((check) => check.status === 'error')
 
  return c.json(
    {
      status: anyError ? 'unhealthy' : allHealthy ? 'healthy' : 'degraded',
      checks,
      timestamp: new Date().toISOString(),
    },
    anyError ? 503 : 200
  )
})

The shallow health check (/health) is used by load balancers and uptime monitors. It should return in under 1ms and never fail unless the gateway process itself is broken. The deep health check (/health/deep) tests connectivity to all dependencies and is used by monitoring dashboards and incident investigation tools.

Dispatch also runs synthetic monitoring -- automated requests that simulate real user flows at regular intervals. Every 30 seconds, a synthetic monitor executes the following sequence: authenticate, fetch user profile, list transactions, and initiate a test payment. Each step is timed and checked for correctness. If any step fails or exceeds latency thresholds, an alert fires before real users are affected.
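
One way to structure such a flow is as an ordered list of steps, each with its own latency budget, aborting on the first failure. This is a sketch under assumptions: the step names, URLs, and budgets are stand-ins, and the fetch function is injected so the flow can be exercised outside the edge runtime:

```typescript
interface SyntheticStep {
  name: string
  url: string
  maxLatencyMs: number
}

interface StepResult {
  name: string
  ok: boolean
  latencyMs: number
}

// Runs each step in order, timing it and checking both success and the
// latency budget; stops at the first failing step
async function runSyntheticFlow(
  steps: SyntheticStep[],
  doFetch: (url: string) => Promise<{ ok: boolean }>
): Promise<{ passed: boolean; results: StepResult[] }> {
  const results: StepResult[] = []
  for (const step of steps) {
    const start = performance.now()
    let ok = false
    try {
      ok = (await doFetch(step.url)).ok
    } catch {
      ok = false
    }
    const latencyMs = performance.now() - start
    results.push({ name: step.name, ok: ok && latencyMs <= step.maxLatencyMs, latencyMs })
    if (!results[results.length - 1].ok) break
  }
  return {
    passed: results.length === steps.length && results.every((r) => r.ok),
    results,
  }
}
```

A scheduled worker would run a flow like this every 30 seconds against the production gateway and fire an alert whenever `passed` is false.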

Dashboards That Tell a Story

Good dashboards are organized around questions, not metrics. The Dispatch monitoring dashboard answers four questions:

"Is the gateway healthy right now?" The top row shows real-time traffic, error rate, and p95 latency, color-coded green/yellow/red against SLA thresholds.

"What changed recently?" The second row shows deployments, configuration changes, and incident markers overlaid on the traffic graph. This immediately correlates changes with impact.

"Which services are affected?" The third row breaks down error rates and latency by upstream service, making it obvious which backend is causing degradation.

"What is the user impact?" The bottom row shows business metrics: transaction completion rates, API call success rates by client type (mobile, web, internal), and geographic distribution of errors.

This structure means an on-call engineer can assess the situation in under 30 seconds by scanning the dashboard top-to-bottom, without clicking into detailed views or running queries.

Conclusion

Monitoring and alerting for an API gateway is not a set-and-forget exercise. It requires continuous refinement as traffic patterns change, new services are onboarded, and new failure modes are discovered. The practices described here -- golden signal monitoring, structured logging, distributed tracing, actionable alerting, and synthetic monitoring -- provide the foundation. But the real differentiator is the operational culture around these tools: treating alerts as precious signals rather than noise, maintaining runbooks as living documents, and investing in dashboards that serve the people who use them. In fintech, where a gateway outage can halt transactions across an entire platform, this investment is not optional. It is the cost of operating at the reliability level your users and regulators expect.
