API Performance Optimization: From 200ms to 20ms
A practical guide to optimizing API gateway performance, covering the specific techniques that took Dispatch's p95 latency from 200ms to under 20ms.
In fintech, API latency is not an abstract metric -- it is the time your users stare at a loading spinner, the delay between tapping "Send" and seeing a confirmation, the milliseconds that determine whether a payment completes before a session times out. When we first deployed Dispatch, our API gateway, in production, the p95 response time was 200ms. That sounds fast by many standards, but for a gateway that adds overhead to every single API call, 200ms was unacceptable. Six months later, our p95 sits at 18ms. This article documents the specific techniques that got us there, and the business impact of each improvement.
Measuring What Matters
Before optimizing anything, we needed to understand where time was actually being spent. The first mistake teams make with performance optimization is guessing. The second mistake is measuring the wrong thing. Average latency is nearly meaningless for API gateways because it hides the tail -- the 5% of requests that take 500ms or more and represent your worst user experience.
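To make the mean-versus-tail point concrete, here is a small nearest-rank percentile helper with a synthetic sample; the function and the numbers are illustrative, not Dispatch's production code:

```typescript
// Nearest-rank percentile: sort ascending, then index by rank
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))]
}

// 94 fast requests (10ms) and 6 slow ones (500ms): the mean looks
// tolerable, while p95 exposes the tail the mean hides
const latencies = [
  ...Array.from({ length: 94 }, () => 10),
  ...Array.from({ length: 6 }, () => 500),
]
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length
console.log(mean)                      // 39.4
console.log(percentile(latencies, 50)) // 10
console.log(percentile(latencies, 95)) // 500
```

A mean of 39.4ms would pass most dashboards, yet one request in twenty takes half a second; that is the experience percentiles are designed to catch.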
We instrumented Dispatch with four key metrics: p50 (median), p95, p99, and maximum latency, broken down by route, HTTP method, and upstream service. The instrumentation itself needed to be low-overhead:
import { Hono } from 'hono'
const app = new Hono()
app.use('*', async (c, next) => {
const start = performance.now()
const route = c.req.routePath || c.req.path
try {
await next()
} finally {
const duration = performance.now() - start
const status = c.res?.status || 500
// Non-blocking metric submission
c.executionCtx.waitUntil(
submitMetric({
name: 'gateway.request.duration',
value: duration,
tags: {
route,
method: c.req.method,
status: String(status),
upstream: c.get('upstreamService') || 'none',
},
})
)
}
})
The waitUntil pattern is critical for edge performance. It tells the runtime to keep the worker alive to complete the metric submission but does not block the response from being sent to the client. Without waitUntil, every request would pay the latency cost of the monitoring write.
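The submitMetric helper is not shown above. One minimal shape it could take is a fire-and-forget HTTP POST; the endpoint URL and payload format below are illustrative assumptions, not Dispatch's actual metrics backend:

```typescript
interface Metric {
  name: string
  value: number
  tags: Record<string, string>
}

// Illustrative only: METRICS_URL and the payload shape are assumptions
const METRICS_URL = 'https://metrics.example.com/v1/series'

async function submitMetric(metric: Metric): Promise<void> {
  try {
    await fetch(METRICS_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ ...metric, timestamp: Date.now() }),
    })
  } catch {
    // Metrics are best-effort: a monitoring failure must never
    // surface as a request error
  }
}
```

Swallowing errors is deliberate here: because the call runs inside waitUntil, a rejected promise would otherwise surface as an unhandled error after the response has already been sent.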
Our initial measurements revealed a clear breakdown of the 200ms p95: 15ms for gateway middleware (authentication, rate limiting, validation), 10ms for request serialization and upstream connection setup, 150ms for the upstream service response, and 25ms for response processing and logging. The upstream service time was largely outside our control, so we focused on the 50ms of gateway overhead.
Eliminating Cold Start Overhead
Our first major win came from addressing cold starts. On Cloudflare Workers, a cold start occurs when a request hits an edge node that does not have a warm isolate for your worker. The runtime must compile your JavaScript, initialize module scope, and execute the first request -- all in sequence.
Dispatch's initial bundle was 850KB, inflated by schema validation libraries and logging utilities that we did not actually need in every request path. We applied several techniques:
// BEFORE: Importing the entire Zod library (120KB)
import { z } from 'zod'
// AFTER: Using Hono's built-in validator for simple cases
import { validator } from 'hono/validator'
// For simple validation, Hono's built-in validator avoids the Zod overhead
app.post(
'/api/v1/simple-endpoint',
validator('json', (value, c) => {
if (!value.email || typeof value.email !== 'string') {
return c.json({ error: 'Invalid email' }, 400)
}
return value as { email: string }
}),
(c) => {
const { email } = c.req.valid('json')
return c.json({ received: email })
}
)
We kept Zod for complex schemas (discriminated unions, refinements, transforms) but switched to Hono's lightweight built-in validator for simple type checks. This reduced our bundle by 95KB. We also tree-shook our logging library, replacing a full-featured logger with a minimal structured logger that compiled to 3KB. The combined effect reduced cold start time from 45ms to 12ms.
The second cold start optimization was more architectural. We moved configuration loading from module scope to a cached lazy-initialization pattern:
let configCache: GatewayConfig | null = null
let configLoadedAt = 0
const CONFIG_TTL = 60_000 // Refresh config every 60 seconds
async function getConfig(env: Bindings, ctx?: ExecutionContext): Promise<GatewayConfig> {
const now = Date.now()
if (configCache && now - configLoadedAt < CONFIG_TTL) {
return configCache
}
// Refresh off the request path: the KV read never blocks a response
const refresh = env.CONFIG.get<GatewayConfig>('gateway-config', 'json')
.then((loaded) => {
configCache = loaded || getDefaultConfig()
configLoadedAt = Date.now()
})
.catch(() => {
// Keep serving the current (or default) config if the read fails
})
ctx?.waitUntil(refresh)
if (!configCache) {
// First request after a cold start: serve defaults immediately
configCache = getDefaultConfig()
configLoadedAt = now
}
return configCache
}
This change eliminated the KV read from the cold start path entirely. The first request after a cold start uses default configuration, and the actual configuration loads in the background for subsequent requests. The default configuration is conservative (stricter rate limits, all features enabled), so the brief window of default configuration does not create a security gap.
Connection and Fetch Optimization
The next area of focus was upstream request handling. Each request to an upstream service involves DNS resolution, TCP connection establishment, TLS handshake, and data transfer. On the edge, these steps can be surprisingly expensive if not managed carefully.
// BEFORE: Naive fetch with no optimization
const response = await fetch(upstreamUrl, {
method: c.req.method,
headers: c.req.raw.headers,
body: c.req.raw.body,
})
// AFTER: Optimized fetch with connection reuse and streaming
async function proxyRequest(c: Context, upstreamUrl: string): Promise<Response> {
const headers = new Headers(c.req.raw.headers)
// Remove hop-by-hop headers that should not be forwarded
headers.delete('connection')
headers.delete('keep-alive')
headers.delete('transfer-encoding')
headers.delete('host')
// Add gateway headers
headers.set('X-Forwarded-For', c.get('clientIp') || '')
headers.set('X-Forwarded-Proto', 'https')
headers.set('X-Request-ID', c.get('requestId'))
const response = await fetch(upstreamUrl, {
method: c.req.method,
headers,
body: c.req.raw.body,
signal: AbortSignal.timeout(5000),
// @ts-expect-error Cloudflare-specific option
cf: {
cacheTtl: 0, // Disable CF cache for API responses
cacheEverything: false,
},
})
// Stream the response body instead of buffering it
return new Response(response.body, {
status: response.status,
headers: response.headers,
})
}
Two changes made the biggest impact. First, streaming the response body instead of buffering it. The fetch API returns a ReadableStream body, and passing it directly to a new Response avoids loading the entire response into memory. For large payloads, this reduces both memory usage and time-to-first-byte. Second, setting explicit timeouts with AbortSignal.timeout() prevents slow upstreams from holding connections open indefinitely.
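When the timeout fires, AbortSignal.timeout() aborts the fetch with a TimeoutError DOMException, which the gateway can translate into a proper 504. The wrapper below is an illustrative sketch, not part of the original proxy code:

```typescript
// Translate an upstream timeout into 504 Gateway Timeout instead of
// letting the abort error propagate as a 500
async function fetchWithTimeout(
  url: string,
  init: RequestInit,
  timeoutMs: number
): Promise<Response> {
  try {
    return await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) })
  } catch (err) {
    if (err instanceof DOMException && err.name === 'TimeoutError') {
      return new Response(JSON.stringify({ error: 'Upstream timed out' }), {
        status: 504,
        headers: { 'Content-Type': 'application/json' },
      })
    }
    throw err
  }
}
```

Distinguishing timeouts from other failures matters for clients: a 504 is retryable by definition, while a generic 500 forces the client to guess.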
Caching at the Gateway Layer
Not every API response needs to be fresh. For data that changes infrequently -- user profiles, configuration data, exchange rates -- caching at the gateway can eliminate the upstream call entirely:
interface CacheConfig {
ttlSeconds: number
staleWhileRevalidateSeconds: number
varyBy: string[]
}
function cacheMiddleware(config: CacheConfig): MiddlewareHandler {
return async (c, next) => {
// Only cache GET requests
if (c.req.method !== 'GET') {
await next()
return
}
// Build cache key from path and vary-by headers/params
const varyValues = config.varyBy.map((v) => c.req.header(v) || c.req.query(v) || '')
const cacheKey = `cache:${c.req.path}:${varyValues.join(':')}`
// Check cache
const cached = await c.env.CACHE.get(cacheKey, 'json') as {
body: unknown
status: number
headers: Record<string, string>
cachedAt: number
} | null
if (cached) {
const age = Math.floor((Date.now() - cached.cachedAt) / 1000)
const isStale = age > config.ttlSeconds
if (!isStale) {
// Fresh cache hit: set headers before building the response,
// since c.json() snapshots prepared headers at call time
c.header('X-Cache', 'HIT')
c.header('Age', String(age))
return c.json(cached.body as object, cached.status as any)
}
if (age < config.ttlSeconds + config.staleWhileRevalidateSeconds) {
// Stale but within revalidation window: serve stale, refresh in background
c.executionCtx.waitUntil(revalidateCache(c, cacheKey, config, next))
c.header('X-Cache', 'STALE')
c.header('Age', String(age))
return c.json(cached.body as object, cached.status as any)
}
}
// Cache miss: fetch from upstream and cache the response
await next()
if (c.res.ok) {
const body = await c.res.clone().json()
c.executionCtx.waitUntil(
c.env.CACHE.put(
cacheKey,
JSON.stringify({
body,
status: c.res.status,
headers: Object.fromEntries(c.res.headers),
cachedAt: Date.now(),
}),
{ expirationTtl: config.ttlSeconds + config.staleWhileRevalidateSeconds + 60 }
)
)
c.header('X-Cache', 'MISS')
}
}
}
// Apply to appropriate routes
app.use(
'/api/v1/exchange-rates/*',
cacheMiddleware({
ttlSeconds: 300,
staleWhileRevalidateSeconds: 60,
varyBy: ['Accept-Language'],
})
)
The stale-while-revalidate pattern is particularly valuable for API gateways. The client receives a response immediately from cache (even if slightly stale), while the gateway refreshes the cache in the background. This eliminates latency spikes when cache entries expire.
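The middleware above delegates to a revalidateCache helper that is not shown. One possible shape, assuming the same KV entry layout, is sketched below; the Context type is loosened to any for brevity, and re-invoking next here is acceptable only because it runs inside waitUntil, after the stale response has already been sent:

```typescript
// Hypothetical revalidateCache: re-run the downstream handler and
// overwrite the KV entry, entirely off the critical path
async function revalidateCache(
  c: any, // Hono Context, typed loosely in this sketch
  cacheKey: string,
  config: { ttlSeconds: number; staleWhileRevalidateSeconds: number },
  next: () => Promise<void>
): Promise<void> {
  try {
    await next() // fetch a fresh response from the upstream
    if (!c.res?.ok) return
    const body = await c.res.clone().json()
    await c.env.CACHE.put(
      cacheKey,
      JSON.stringify({
        body,
        status: c.res.status,
        headers: Object.fromEntries(c.res.headers),
        cachedAt: Date.now(),
      }),
      { expirationTtl: config.ttlSeconds + config.staleWhileRevalidateSeconds + 60 }
    )
  } catch {
    // Revalidation is best-effort; the stale entry remains until it expires
  }
}
```

On failure the helper deliberately does nothing: the stale entry keeps serving until it ages out of the revalidation window, at which point the next request falls through to the upstream.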
Caching reduced our p95 for cacheable endpoints from 150ms (full upstream round trip) to 3ms (KV read). Since approximately 40% of Dispatch's traffic hits cacheable endpoints, the aggregate impact was substantial.
Parallel Middleware Execution
Our default middleware chain executed sequentially: authentication, then rate limiting, then validation, then routing. But some of these steps are independent. Authentication and rate limiting do not depend on each other -- they both read from the request and write to context variables. We introduced parallel execution for independent middleware:
function parallel(...middlewares: MiddlewareHandler[]): MiddlewareHandler {
return async (c, next) => {
// Run every middleware concurrently with a no-op next. Settling on the
// middleware promises themselves means a middleware that short-circuits
// (never calls next) cannot leave the chain hanging.
const results = await Promise.allSettled(
middlewares.map((mw) => mw(c, async () => {}))
)
for (const result of results) {
if (result.status === 'rejected') {
throw result.reason
}
// A middleware that returned its own Response (e.g. a 401) ends the request
if (result.value instanceof Response) {
return result.value
}
}
await next()
}
}
// Authentication and rate limiting run in parallel
app.use(
'/api/v1/*',
parallel(
authMiddleware({ secret: AUTH_SECRET }),
rateLimiter({ windowMs: 60000, maxRequests: 100 })
)
)
This reduced middleware execution time from 15ms (sequential) to 8ms (parallel), because authentication (JWT verification, ~5ms) and rate limiting (KV lookup, ~3ms) now overlap.
The Business Impact
The cumulative effect of these optimizations was dramatic. Our p95 dropped from 200ms to 18ms. But more importantly, the business metrics improved:
Mobile app session completion rates increased by 12%. Faster API responses meant fewer users abandoning flows mid-transaction. Payment success rates improved by 3.2%, because fewer requests timed out during the multi-step payment process. Infrastructure costs decreased by 40%, because caching eliminated millions of upstream requests per day, reducing load on backend services and their databases.
The lesson is that API performance is not just an engineering concern. Every millisecond of gateway overhead multiplies across every request, every user, and every transaction. For a fintech platform processing millions of transactions daily, a 180ms reduction in gateway latency translates directly to revenue, user satisfaction, and operational efficiency.
Conclusion
API performance optimization is systematic work, not guesswork. Measure precisely, identify the bottlenecks, and apply targeted fixes. For Dispatch, the path from 200ms to 18ms involved cold start elimination, response streaming, intelligent caching, and parallel middleware execution. None of these techniques are exotic -- they are standard engineering practices applied methodically. The key is prioritizing by impact: fix the biggest bottleneck first, measure again, and repeat. In a gateway that handles every request your platform processes, even small improvements compound into significant business outcomes.