Rate Limiting at the Edge: Strategies and Implementation

A deep technical guide to implementing distributed rate limiting at the edge, covering sliding window algorithms, token buckets, and consistency challenges in globally distributed systems.

technical · 9 min read · By Klivvr Engineering

Rate limiting is one of the first capabilities any API gateway must provide. It protects backend services from abuse, ensures fair resource allocation among clients, and prevents runaway costs from misconfigured integrations. At the edge, rate limiting becomes significantly more complex: your limiter runs in hundreds of locations worldwide, and those locations must agree on how many requests a client has made -- without a shared database or network round trip on every request. This article explores how Dispatch implements rate limiting at the edge, the tradeoffs between different algorithms, and the consistency challenges inherent in distributed limiting.

Why Rate Limiting at the Edge

Traditional rate limiting runs on a centralized API server or a shared Redis instance. Every request increments a counter in the same data store, and the counter accurately reflects the total request count. This works well for single-region deployments, but it fails at the edge for two reasons.

First, the centralized counter becomes a latency bottleneck. If your gateway runs in Cairo but your Redis is in Frankfurt, every rate limit check adds 30-50ms of network latency. This defeats the purpose of edge deployment, which aims to keep latency under 10ms.

Second, a centralized counter is a single point of failure. If Redis goes down, your gateway either stops rate limiting entirely (allowing abuse) or rejects all requests (causing an outage). Neither outcome is acceptable for a production fintech gateway.

Dispatch implements rate limiting directly at the edge, using a combination of local counters for approximate limiting and Cloudflare Durable Objects for precise limiting when required.

Fixed Window Rate Limiting

The simplest rate limiting algorithm divides time into fixed windows and counts requests per window. It is easy to implement, easy to understand, and works well for most use cases:

import type { Context, MiddlewareHandler } from 'hono'
 
interface FixedWindowConfig {
  windowMs: number
  maxRequests: number
  keyExtractor: (c: Context) => string
}
 
function fixedWindowLimiter(
  kv: KVNamespace,
  config: FixedWindowConfig
): MiddlewareHandler {
  return async (c, next) => {
    const key = config.keyExtractor(c)
    const windowId = Math.floor(Date.now() / config.windowMs)
    const kvKey = `ratelimit:${key}:${windowId}`
 
    // Read current count
    const current = parseInt((await kv.get(kvKey)) || '0', 10)
 
    if (current >= config.maxRequests) {
      const resetAt = (windowId + 1) * config.windowMs
      const retryAfter = Math.ceil((resetAt - Date.now()) / 1000)
 
      c.header('X-RateLimit-Limit', String(config.maxRequests))
      c.header('X-RateLimit-Remaining', '0')
      c.header('X-RateLimit-Reset', String(Math.ceil(resetAt / 1000)))
      c.header('Retry-After', String(retryAfter))
 
      return c.json(
        { error: { code: 429, message: 'Rate limit exceeded', retryAfter } },
        429
      )
    }
 
    // Increment counter with TTL matching the window duration
    await kv.put(kvKey, String(current + 1), {
      expirationTtl: Math.ceil(config.windowMs / 1000) + 60,
    })
 
    c.header('X-RateLimit-Limit', String(config.maxRequests))
    c.header('X-RateLimit-Remaining', String(config.maxRequests - current - 1))
 
    await next()
  }
}

The fixed window algorithm has a well-known weakness: boundary spikes. A client can send maxRequests at the end of one window and maxRequests at the start of the next, effectively doubling the rate over a short period. For many use cases this is acceptable, but for fintech APIs where strict rate enforcement matters, we need a better approach.
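The boundary spike is easy to demonstrate with a toy in-memory version of the same counter logic (a standalone sketch for illustration, not the KV-backed middleware above):

```typescript
// In-memory stand-in for the KV counter: window id -> request count.
const counts = new Map<number, number>()
const windowMs = 60_000
const maxRequests = 100

function allow(nowMs: number): boolean {
  const windowId = Math.floor(nowMs / windowMs)
  const current = counts.get(windowId) ?? 0
  if (current >= maxRequests) return false
  counts.set(windowId, current + 1)
  return true
}

// 100 requests in the final second of window 0...
let passed = 0
for (let i = 0; i < 100; i++) if (allow(59_000 + i)) passed++
// ...and 100 more in the first second of window 1.
for (let i = 0; i < 100; i++) if (allow(60_000 + i)) passed++

console.log(passed) // 200 -- double the nominal limit in about two seconds
```

Every request lands in a window with spare capacity, so all 200 pass even though the client never waited a full minute.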

Sliding Window Rate Limiting

The sliding window algorithm eliminates boundary spikes by interpolating between the current and previous window counts. Instead of a hard cutover between windows, it calculates a weighted request count based on how far into the current window we are:

interface SlidingWindowConfig {
  windowMs: number
  maxRequests: number
  keyExtractor: (c: Context) => string
}
 
function slidingWindowLimiter(
  kv: KVNamespace,
  config: SlidingWindowConfig
): MiddlewareHandler {
  return async (c, next) => {
    const key = config.keyExtractor(c)
    const now = Date.now()
    const currentWindowId = Math.floor(now / config.windowMs)
    const previousWindowId = currentWindowId - 1
    const windowProgress = (now % config.windowMs) / config.windowMs
 
    const currentKey = `ratelimit:${key}:${currentWindowId}`
    const previousKey = `ratelimit:${key}:${previousWindowId}`
 
    // Fetch both windows in parallel
    const [currentCount, previousCount] = await Promise.all([
      kv.get(currentKey).then((v) => parseInt(v || '0', 10)),
      kv.get(previousKey).then((v) => parseInt(v || '0', 10)),
    ])
 
    // Weighted count: full current window + proportion of previous window
    const estimatedCount =
      currentCount + Math.floor(previousCount * (1 - windowProgress))
 
    if (estimatedCount >= config.maxRequests) {
      const retryAfter = Math.ceil(
        (config.windowMs - (now % config.windowMs)) / 1000
      )
      c.header('Retry-After', String(retryAfter))
      return c.json(
        { error: { code: 429, message: 'Rate limit exceeded', retryAfter } },
        429
      )
    }
 
    // Increment current window
    await kv.put(currentKey, String(currentCount + 1), {
      expirationTtl: Math.ceil((config.windowMs * 2) / 1000) + 60,
    })
 
    c.header('X-RateLimit-Limit', String(config.maxRequests))
    c.header('X-RateLimit-Remaining', String(config.maxRequests - estimatedCount - 1))
 
    await next()
  }
}

The sliding window provides a much smoother rate limit experience. The tradeoff is two KV reads instead of one, but since Cloudflare KV reads from the nearest edge cache, the added latency is typically under 1ms.
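The interpolation is easiest to see with concrete numbers. This standalone sketch reproduces the weighted-count formula from the middleware above (the counts and progress values are chosen purely for illustration):

```typescript
// Weighted estimate, as computed in the middleware above: the current count
// plus the still-relevant fraction of the previous window's count.
function estimate(
  previousCount: number,
  currentCount: number,
  windowProgress: number // 0 = window just started, 1 = window about to end
): number {
  return currentCount + Math.floor(previousCount * (1 - windowProgress))
}

// Previous window saw 80 requests; the current one has 30 so far.
console.log(estimate(80, 30, 0.25)) // 90: early on, most of the previous 80 still counts
console.log(estimate(80, 30, 0.75)) // 50: later, only a quarter of it does
```

As the window progresses, the previous window's contribution decays linearly to zero, which is what removes the hard cliff at the window boundary.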

Token Bucket for Burst Handling

Some APIs need to allow short bursts of traffic while enforcing a steady-state rate. A notification service, for example, might receive 50 requests in a second when a batch of events fires, but averages only 5 requests per second. The token bucket algorithm handles this naturally:

interface TokenBucketConfig {
  bucketSize: number       // Maximum burst size
  refillRate: number       // Tokens added per second
  keyExtractor: (c: Context) => string
}
 
interface BucketState {
  tokens: number
  lastRefill: number
}
 
function tokenBucketLimiter(
  storage: DurableObjectStub,
  config: TokenBucketConfig
): MiddlewareHandler {
  return async (c, next) => {
    const key = config.keyExtractor(c)
 
    const response = await storage.fetch(
      new Request('https://internal/consume', {
        method: 'POST',
        body: JSON.stringify({
          key,
          bucketSize: config.bucketSize,
          refillRate: config.refillRate,
        }),
      })
    )
 
    const result = await response.json<{
      allowed: boolean
      remaining: number
      retryAfter?: number
    }>()
 
    if (!result.allowed) {
      c.header('Retry-After', String(result.retryAfter || 1))
      return c.json(
        {
          error: {
            code: 429,
            message: 'Rate limit exceeded',
            retryAfter: result.retryAfter,
          },
        },
        429
      )
    }
 
    c.header('X-RateLimit-Remaining', String(result.remaining))
    await next()
  }
}
 
// Durable Object implementation for token bucket
export class TokenBucketDO {
  private buckets: Map<string, BucketState> = new Map()
 
  async fetch(request: Request): Promise<Response> {
    const { key, bucketSize, refillRate } = await request.json<{
      key: string
      bucketSize: number
      refillRate: number
    }>()
 
    const now = Date.now() / 1000
    let bucket = this.buckets.get(key) || { tokens: bucketSize, lastRefill: now }
 
    // Refill tokens based on elapsed time
    const elapsed = now - bucket.lastRefill
    bucket.tokens = Math.min(bucketSize, bucket.tokens + elapsed * refillRate)
    bucket.lastRefill = now
 
    if (bucket.tokens >= 1) {
      bucket.tokens -= 1
      this.buckets.set(key, bucket)
      return Response.json({ allowed: true, remaining: Math.floor(bucket.tokens) })
    }
 
    const retryAfter = Math.ceil((1 - bucket.tokens) / refillRate)
    this.buckets.set(key, bucket)
    return Response.json({ allowed: false, remaining: 0, retryAfter })
  }
}

The token bucket is implemented as a Durable Object because it requires atomic read-modify-write operations. KV's eventual consistency model would lead to race conditions where two concurrent requests both see available tokens and both proceed, exceeding the limit.
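The refill arithmetic itself can be exercised in isolation. The sketch below replays the same logic as TokenBucketDO against a plain in-memory bucket (bucket size 10, refill rate 2 tokens/second; the numbers are illustrative):

```typescript
interface Bucket {
  tokens: number
  lastRefill: number // seconds
}

// Same refill-then-consume logic as TokenBucketDO.fetch, minus the transport.
function consume(
  bucket: Bucket,
  now: number,
  bucketSize: number,
  refillRate: number
): boolean {
  const elapsed = now - bucket.lastRefill
  bucket.tokens = Math.min(bucketSize, bucket.tokens + elapsed * refillRate)
  bucket.lastRefill = now
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1
    return true
  }
  return false
}

const b: Bucket = { tokens: 10, lastRefill: 0 }

// A burst of 12 requests at t=0: the first 10 drain the bucket, 2 are rejected.
let allowed = 0
for (let i = 0; i < 12; i++) if (consume(b, 0, 10, 2)) allowed++
console.log(allowed) // 10

// One second later, two tokens have refilled.
console.log(consume(b, 1, 10, 2), consume(b, 1, 10, 2), consume(b, 1, 10, 2))
// true true false
```

The burst is absorbed up to the bucket size, after which the client is held to the steady-state refill rate; this is exactly the behavior the notification-service example above needs.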

Distributed Consistency Challenges

The fundamental challenge of edge rate limiting is consistency. When a client sends 10 requests simultaneously, and those requests land on 5 different edge nodes, each node might see only 2 requests and allow all of them, resulting in 10 requests passing through a limit of 5.

Dispatch uses a tiered consistency model to balance accuracy and performance:

type ConsistencyLevel = 'eventual' | 'strong' | 'best-effort'
 
interface RateLimitTier {
  consistency: ConsistencyLevel
  algorithm: 'fixed-window' | 'sliding-window' | 'token-bucket'
  config: Record<string, unknown>
}
 
// Tier 1: Best-effort limiting for high-volume, low-sensitivity endpoints
// Uses local KV (eventually consistent) -- fast but may allow small overages
const publicApiLimiter: RateLimitTier = {
  consistency: 'best-effort',
  algorithm: 'fixed-window',
  config: { windowMs: 60_000, maxRequests: 1000 },
}
 
// Tier 2: Eventual consistency for standard API endpoints
// Uses sliding window with KV -- accurate within seconds
const standardApiLimiter: RateLimitTier = {
  consistency: 'eventual',
  algorithm: 'sliding-window',
  config: { windowMs: 60_000, maxRequests: 100 },
}
 
// Tier 3: Strong consistency for financial operations
// Uses Durable Objects -- globally consistent but adds ~10ms latency
const financialApiLimiter: RateLimitTier = {
  consistency: 'strong',
  algorithm: 'token-bucket',
  config: { bucketSize: 10, refillRate: 1 },
}

For most API endpoints, eventual consistency is sufficient. A client that is legitimately using the API will not hit the edge cases where eventual consistency allows overages. For financial endpoints where overages could mean duplicate transactions, we accept the latency cost of strong consistency through Durable Objects.
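One way to wire the tiers up is a routing rule that maps a request path to a consistency level before selecting a limiter. The path prefixes below are hypothetical, chosen to illustrate the shape of such a rule, not Dispatch's actual route table:

```typescript
// Hypothetical tier selection by path prefix. Returns the ConsistencyLevel
// defined earlier: 'eventual' | 'strong' | 'best-effort'.
function consistencyFor(path: string): 'eventual' | 'strong' | 'best-effort' {
  if (path.startsWith('/api/v1/payments') || path.startsWith('/api/v1/transfers')) {
    return 'strong' // financial operations: overages are unacceptable
  }
  if (path.startsWith('/api/v1/')) {
    return 'eventual' // standard endpoints: accurate within seconds is fine
  }
  return 'best-effort' // public, high-volume: small overages tolerated
}

console.log(consistencyFor('/api/v1/payments/123')) // strong
console.log(consistencyFor('/api/v1/accounts'))     // eventual
console.log(consistencyFor('/health'))              // best-effort
```

Centralizing the decision in one function keeps the latency/accuracy tradeoff explicit and auditable, rather than scattered across route registrations.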

Multi-Dimensional Rate Limiting

Real-world rate limiting often needs to enforce multiple limits simultaneously. A client might be limited to 100 requests per minute, 1000 per hour, and 10 requests per second for burst protection. Dispatch supports stacking multiple limiters:

function multiLimiter(limiters: MiddlewareHandler[]): MiddlewareHandler {
  return async (c, next) => {
    // Check every limit before allowing the request. Each limiter runs with
    // a no-op next(): a limiter that blocks returns its 429 Response without
    // calling next, and that response short-circuits the chain.
    for (const limiter of limiters) {
      let allowed = false
      const response = await limiter(c, async () => {
        allowed = true
      })
      if (!allowed) {
        return response instanceof Response ? response : c.res
      }
    }

    await next()
  }
}
 
// Apply multiple limits to payment endpoints
app.use(
  '/api/v1/payments/*',
  multiLimiter([
    fixedWindowLimiter(kv, {
      windowMs: 1_000,
      maxRequests: 5,
      keyExtractor: (c) => `burst:${c.get('userId')}`,
    }),
    slidingWindowLimiter(kv, {
      windowMs: 60_000,
      maxRequests: 60,
      keyExtractor: (c) => `minute:${c.get('userId')}`,
    }),
    slidingWindowLimiter(kv, {
      windowMs: 3_600_000,
      maxRequests: 500,
      keyExtractor: (c) => `hour:${c.get('userId')}`,
    }),
  ])
)

Communicating Limits to Clients

Clear rate limit communication is as important as the limiting itself. Dispatch follows the IETF draft standard for rate limit header fields (draft-ietf-httpapi-ratelimit-headers), providing clients with all the information they need to self-throttle:

function setRateLimitHeaders(
  c: Context,
  config: {
    limit: number
    remaining: number
    resetTimestamp: number
    policy: string
  }
) {
  c.header('RateLimit-Limit', String(config.limit))
  c.header('RateLimit-Remaining', String(Math.max(0, config.remaining)))
  c.header('RateLimit-Reset', String(config.resetTimestamp))
  c.header('RateLimit-Policy', config.policy)
 
  // Legacy headers for backward compatibility
  c.header('X-RateLimit-Limit', String(config.limit))
  c.header('X-RateLimit-Remaining', String(Math.max(0, config.remaining)))
}

When a client is rate limited, the 429 response includes a Retry-After header with the number of seconds until the limit resets. Well-behaved clients respect this header, falling back to exponential backoff when it is absent, which reduces load on both the gateway and upstream services.
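From the client's side, that retry logic might look like the following helper, which prefers the server-provided wait and falls back to capped, jittered exponential backoff when the header is missing (a hypothetical sketch, not part of Dispatch):

```typescript
// Compute how long to wait before retrying a 429 response.
// retryAfterHeader is the raw Retry-After value (seconds), or null if absent.
function retryDelayMs(retryAfterHeader: string | null, attempt: number): number {
  if (retryAfterHeader !== null) {
    const seconds = Number(retryAfterHeader)
    // Trust the server's wait time when it parses as a non-negative number.
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000
  }
  // Fallback: exponential backoff capped at 30s, with jitter in [50%, 100%)
  // of the base delay so that retries from many clients do not synchronize.
  const base = Math.min(30_000, 1000 * 2 ** attempt)
  return base / 2 + Math.random() * (base / 2)
}

console.log(retryDelayMs('7', 0)) // 7000: the server-specified wait wins
// With no header, attempt 2 yields a jittered delay between 2000 and 4000 ms.
```

The jitter matters: without it, every client that was limited in the same window retries at the same instant, recreating the spike the limiter just absorbed.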

Conclusion

Rate limiting at the edge is a distributed systems problem disguised as a simple counter. The algorithm you choose -- fixed window, sliding window, or token bucket -- depends on your tolerance for boundary effects and burst behavior. The consistency level -- eventual via KV or strong via Durable Objects -- depends on the sensitivity of the endpoint being protected. And the implementation details -- parallel limit checks, standard headers, multi-dimensional limits -- determine whether your rate limiting is a seamless guardrail or a frustrating obstacle for legitimate clients. In Dispatch, we treat rate limiting as a first-class feature, not an afterthought, because in fintech, the difference between a well-limited API and a poorly-limited one can be measured in dollars.
