Retry Strategies: Exponential Backoff and Beyond
An in-depth exploration of retry strategies for distributed workflows, covering exponential backoff, jitter, circuit breakers, and adaptive retry policies in TypeScript.
Every network call can fail. DNS resolution times out. TCP connections reset. HTTP servers return 503. Databases hit connection pool limits. In a workflow that makes dozens of remote calls across its lifetime, transient failures are not exceptional events but routine occurrences. The question is not whether to retry, but how.
A naive retry strategy, hammering the failing service in a tight loop, turns a transient hiccup into a sustained outage. It overwhelms the struggling service, delays its recovery, and can cascade into failures across your entire system. Thoughtful retry strategies protect both your workflow and the services it depends on.
This article covers the retry strategies available in Alfred, from simple fixed delays to adaptive policies that learn from the failure patterns they observe.
Fixed Delay and Exponential Backoff
The simplest retry strategy waits a fixed amount of time between each attempt. While easy to reason about, fixed delay retries have a critical flaw: when many workflow instances retry simultaneously, they all hit the target service at the same moment, creating a thundering herd.
import { RetryPolicy } from '@alfred/retry';

// Simple fixed delay: wait 2 seconds between each attempt
const fixedRetry = RetryPolicy.fixed({
  maxAttempts: 3,
  delay: 2000,
});

// Exponential backoff: 1s, 2s, 4s, 8s, 16s...
const exponentialRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
});

Exponential backoff partially addresses the thundering herd by spreading retries over increasingly longer intervals. The first retry happens after 1 second, the second after 2 seconds, the third after 4 seconds, and so on. This gives the target service progressively more breathing room to recover.
However, pure exponential backoff still has a synchronization problem. If 1000 workflow instances all fail at the same time, they will all retry at exactly 1 second, then all retry again at exactly 2 seconds. The retries are spread over time, but they still arrive in synchronized bursts.
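The synchronization is easy to see in a standalone sketch (plain TypeScript, not Alfred's API): simulate many instances that failed at the same moment and compare their computed delays with and without full jitter.

```typescript
// Standalone sketch (plain TypeScript, not Alfred's API): the delay a single
// instance computes for a given attempt under pure exponential backoff.
function backoffDelay(attempt: number, initialDelay: number, multiplier: number): number {
  return initialDelay * Math.pow(multiplier, attempt);
}

// Simulate 1000 instances that all failed at the same moment and are
// scheduling their second retry (attempt index 1).
const instances = 1000;
const pure = Array.from({ length: instances }, () => backoffDelay(1, 1000, 2));
const jittered = Array.from({ length: instances }, () => Math.random() * backoffDelay(1, 1000, 2));

// Without jitter, every instance picks the identical delay: one synchronized burst.
const uniquePure = new Set(pure).size;
// With full jitter, the delays spread across the whole [0, 2000ms) window.
const uniqueJittered = new Set(jittered).size;
```

Every instance in the first array computes exactly the same 2000ms delay, while the jittered delays are effectively all distinct, which is precisely the difference between a burst and a spread-out trickle of retries.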
Adding Jitter
Jitter breaks the synchronization by adding randomness to the delay. Instead of retrying at exactly the calculated backoff time, each instance retries at a random point within a window. This distributes the retry load more evenly over time.
import { RetryPolicy, JitterStrategy } from '@alfred/retry';

// Full jitter: random delay between 0 and the calculated backoff
const fullJitterRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.full(),
});

// Equal jitter: half the backoff plus a random portion of the other half
const equalJitterRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.equal(),
});

// Decorrelated jitter: each delay is random between base and 3x the previous delay
const decorrelatedJitterRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.decorrelated(),
});

Research from AWS and others has shown that decorrelated jitter typically produces the best distribution of retry attempts. Full jitter is a close second and simpler to implement. Equal jitter provides a middle ground by guaranteeing a minimum delay (half the backoff) while still adding randomness.
Here is how Alfred calculates delays for each jitter strategy:
function calculateDelay(
  attempt: number,
  initialDelay: number,
  multiplier: number,
  maxDelay: number,
  jitter: JitterStrategy,
  previousDelay?: number
): number {
  const baseDelay = Math.min(initialDelay * Math.pow(multiplier, attempt), maxDelay);
  switch (jitter.type) {
    case 'none':
      return baseDelay;
    case 'full':
      // Random between 0 and baseDelay
      return Math.random() * baseDelay;
    case 'equal':
      // Half the base plus random up to half the base
      return baseDelay / 2 + Math.random() * (baseDelay / 2);
    case 'decorrelated': {
      // Random between initialDelay and 3x the previous delay
      const prev = previousDelay ?? initialDelay;
      return Math.min(maxDelay, Math.random() * (prev * 3 - initialDelay) + initialDelay);
    }
    default:
      return baseDelay;
  }
}

In practice, we recommend starting with full jitter for most use cases. It is simple, effective, and well-understood. Switch to decorrelated jitter if you observe retry clustering in your metrics.
Circuit Breakers
Retries handle transient failures, but what about sustained failures? If a downstream service is completely down, retrying repeatedly wastes resources and delays the feedback loop to the caller. Circuit breakers address this by short-circuiting requests when a failure threshold is crossed.
import { CircuitBreaker, CircuitBreakerState } from '@alfred/retry';

const paymentCircuitBreaker = new CircuitBreaker({
  name: 'payment-service',
  failureThreshold: 5, // Open circuit after 5 consecutive failures
  successThreshold: 3, // Close circuit after 3 consecutive successes
  timeout: 30000, // Try half-open after 30 seconds
  halfOpenMaxAttempts: 1, // Allow 1 attempt in half-open state
  onStateChange: (from: CircuitBreakerState, to: CircuitBreakerState) => {
    metrics.gauge('circuit_breaker.state', to === 'open' ? 1 : 0, {
      service: 'payment-service',
    });
    if (to === 'open') {
      alertService.warn(`Circuit breaker opened for payment-service`);
    }
  },
});

// Using the circuit breaker with a workflow step
const processPayment = async (ctx: OrderContext): Promise<StepResult<OrderContext>> => {
  try {
    const result = await paymentCircuitBreaker.execute(async () => {
      return await paymentService.charge(ctx.orderId, ctx.amount);
    });
    return StepResult.success({ ...ctx, paymentId: result.id });
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit is open, fail fast without hitting the service
      return StepResult.defer(ctx, { reason: 'payment-service-unavailable', retryAfter: 30000 });
    }
    throw error;
  }
};

The circuit breaker has three states. In the closed state, requests flow normally and failures are counted. When the failure count crosses the threshold, the circuit transitions to open. In the open state, all requests fail immediately without contacting the downstream service. After a timeout period, the circuit moves to half-open, allowing a limited number of test requests through. If those succeed, the circuit closes and normal operation resumes. If they fail, the circuit reopens.
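That state machine fits in a few dozen lines. The sketch below is an illustrative model of the transitions, not Alfred's actual CircuitBreaker; it takes the current time as a parameter so the transitions are easy to follow and test.

```typescript
// Illustrative three-state circuit breaker (closed -> open -> half-open).
// Not Alfred's implementation: time is passed in explicitly for clarity.
type BreakerState = 'closed' | 'open' | 'half-open';

class SimpleBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private successThreshold: number,
    private timeout: number,
  ) {}

  getState(now: number): BreakerState {
    // After the timeout elapses, an open circuit starts admitting test requests.
    if (this.state === 'open' && now - this.openedAt >= this.timeout) {
      this.state = 'half-open';
      this.successes = 0;
    }
    return this.state;
  }

  recordSuccess(now: number): void {
    if (this.getState(now) === 'half-open') {
      this.successes += 1;
      // Enough successful probes close the circuit again.
      if (this.successes >= this.successThreshold) {
        this.state = 'closed';
        this.failures = 0;
      }
    } else {
      this.failures = 0; // Any success in the closed state resets the count
    }
  }

  recordFailure(now: number): void {
    const state = this.getState(now);
    if (state === 'half-open') {
      // A failed probe reopens the circuit immediately.
      this.state = 'open';
      this.openedAt = now;
    } else if (state === 'closed') {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = now;
      }
    }
  }
}
```

With thresholds of 3 failures, 2 successes, and a 30-second timeout, three consecutive failures open the circuit, it becomes half-open 30 seconds later, and two successful probes close it again.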
Alfred integrates circuit breakers directly into its retry framework, so you can combine them with backoff policies.
import { RetryPolicy, CircuitBreaker, JitterStrategy } from '@alfred/retry';

const resilientPolicy = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.full(),
  circuitBreaker: new CircuitBreaker({
    name: 'inventory-service',
    failureThreshold: 3,
    successThreshold: 2,
    timeout: 15000,
  }),
});

// When the circuit is open, retries are skipped entirely.
// The step is deferred for later execution instead of wasting retry attempts.

Selective Retry with Error Classification
Not all errors are worth retrying. A 404 Not Found will return the same result no matter how many times you try. A 400 Bad Request means your payload is wrong, not that the server had a temporary problem. Retrying non-transient errors wastes time and resources.
Alfred's retry framework uses error classifiers to distinguish between transient and permanent failures.
import { RetryPolicy, JitterStrategy, ErrorClassifier, ErrorClass } from '@alfred/retry';

const httpErrorClassifier: ErrorClassifier = (error: unknown): ErrorClass => {
  if (error instanceof HttpError) {
    // Server errors are transient
    if (error.status >= 500) return ErrorClass.TRANSIENT;
    // Rate limiting is transient
    if (error.status === 429) return ErrorClass.TRANSIENT;
    // Request timeout is transient
    if (error.status === 408) return ErrorClass.TRANSIENT;
    // Remaining client errors are permanent
    if (error.status >= 400) return ErrorClass.PERMANENT;
  }
  if (error instanceof NetworkError) {
    // Connection resets, DNS failures, timeouts are transient
    return ErrorClass.TRANSIENT;
  }
  if (error instanceof ValidationError) {
    // Business logic validation failures are permanent
    return ErrorClass.PERMANENT;
  }
  // Unknown errors default to transient (retry to be safe)
  return ErrorClass.TRANSIENT;
};

const smartRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.full(),
  errorClassifier: httpErrorClassifier,
  onPermanentError: async (error, context) => {
    // Permanent errors skip retries and go straight to failure handling
    await logger.error('Permanent error encountered, skipping retries', {
      error: error.message,
      stepName: context.currentStep,
      workflowId: context.workflowId,
    });
  },
});

The error classifier runs before each retry decision. If it returns PERMANENT, Alfred skips all remaining retry attempts and immediately propagates the failure to the workflow's error handler. If it returns TRANSIENT, the normal retry policy applies.
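The gating logic itself is small. Here is a standalone sketch of a classifier-gated retry loop (backoff delays omitted for brevity; this is illustrative, not Alfred's internal loop):

```typescript
// Standalone sketch of classifier-gated retries. Delays between attempts are
// omitted for brevity; this is not Alfred's internal retry loop.
type ErrClass = 'TRANSIENT' | 'PERMANENT';
type Classify = (error: unknown) => ErrClass;

async function retryWithClassifier<T>(
  fn: () => Promise<T>,
  classify: Classify,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // A permanent error short-circuits all remaining attempts.
      if (classify(error) === 'PERMANENT') throw error;
    }
  }
  throw lastError;
}
```

A call that always throws a permanent error fails after exactly one attempt, while a transient failure keeps being retried until it succeeds or the attempt budget runs out.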
You can also implement custom classifiers for specific services. For example, a payment gateway might return specific error codes that indicate whether a failure is retryable.
const paymentErrorClassifier: ErrorClassifier = (error: unknown): ErrorClass => {
  if (error instanceof PaymentGatewayError) {
    switch (error.code) {
      case 'INSUFFICIENT_FUNDS':
      case 'CARD_EXPIRED':
      case 'INVALID_CARD':
        return ErrorClass.PERMANENT;
      case 'GATEWAY_TIMEOUT':
      case 'PROCESSOR_UNAVAILABLE':
      case 'RATE_LIMITED':
        return ErrorClass.TRANSIENT;
      case 'DUPLICATE_TRANSACTION':
        // Already processed, treat as success (idempotency)
        return ErrorClass.DUPLICATE;
      default:
        return ErrorClass.TRANSIENT;
    }
  }
  return ErrorClass.TRANSIENT;
};

Adaptive Retry Policies
Static retry policies work well for most scenarios, but some systems benefit from policies that adapt to observed conditions. Alfred's adaptive retry policy adjusts its behavior based on recent success and failure rates.
import { RetryPolicy, AdaptiveRetryConfig } from '@alfred/retry';

const adaptiveRetry = RetryPolicy.adaptive({
  // Baseline configuration
  baseMaxAttempts: 5,
  baseInitialDelay: 1000,
  // Observation window
  windowSize: 100, // Track last 100 attempts
  windowDuration: 60000, // Over a 60-second window
  // Adaptation rules
  rules: [
    {
      // When failure rate exceeds 50%, reduce max attempts and increase delay
      condition: (stats) => stats.failureRate > 0.5,
      adjust: (config) => ({
        ...config,
        maxAttempts: 2,
        initialDelay: config.initialDelay * 3,
      }),
    },
    {
      // When failure rate is below 10%, use aggressive retry
      condition: (stats) => stats.failureRate < 0.1,
      adjust: (config) => ({
        ...config,
        maxAttempts: 5,
        initialDelay: 500,
      }),
    },
    {
      // When p99 latency exceeds 5s, back off more aggressively
      condition: (stats) => stats.p99Latency > 5000,
      adjust: (config) => ({
        ...config,
        initialDelay: config.initialDelay * 2,
        multiplier: 3,
      }),
    },
  ],
  // Safety bounds: adaptation cannot exceed these limits
  bounds: {
    minDelay: 100,
    maxDelay: 120000,
    minAttempts: 1,
    maxAttempts: 10,
  },
});

The adaptive policy collects statistics about recent retry attempts, including success rate, failure rate, and latency percentiles. It evaluates the adaptation rules against these statistics and adjusts the retry parameters accordingly. Safety bounds prevent the policy from becoming too aggressive or too conservative.
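The windowed bookkeeping behind those statistics can be modeled as a simple bounded buffer. This sketch is illustrative only; it tracks just the failure rate, whereas Alfred's policy also tracks latency percentiles.

```typescript
// Illustrative sliding-window outcome tracker (count-based window only).
// Not Alfred's implementation, which also tracks latency percentiles and
// a time-based window.
class OutcomeWindow {
  private outcomes: boolean[] = []; // true = success, false = failure

  constructor(private windowSize: number) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    // Keep only the most recent windowSize outcomes.
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  failureRate(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((ok) => !ok).length;
    return failures / this.outcomes.length;
  }
}
```

Because old outcomes age out of the window, the failure rate responds quickly to changing conditions: a burst of failures pushes it up, and a run of successes pulls it back down, which is exactly what the adaptation rules key off.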
This approach is particularly valuable in systems where the downstream service's behavior changes over time. During normal operation, the policy uses a standard configuration. During degraded performance, it automatically backs off to reduce load on the struggling service. When the service recovers, the policy returns to normal parameters.
Practical Tips
When configuring retry policies for your workflows, keep these principles in mind.
Always set a maximum number of attempts. An unbounded retry loop can run indefinitely, consuming resources and blocking workflow progress. Even with exponential backoff, set a hard cap.
Always set a maximum delay. Without a cap, exponential backoff starting at 1 second with a multiplier of 2 exceeds eight minutes by the tenth retry and 17 minutes by the eleventh. In most business workflows, waiting that long between retries is unacceptable. Cap the delay at a reasonable value, typically 30 to 60 seconds.
Monitor your retry metrics. Track the number of retries per step, the success rate on each attempt number, and the total time spent retrying. If a step consistently needs 4 retries to succeed, the underlying service has a reliability problem that retries are masking.
Use different retry policies for different steps. A step that calls a high-availability internal service might use an aggressive retry policy with short delays. A step that calls a rate-limited external API needs a conservative policy with longer delays and fewer attempts.
Consider the end-to-end timeout. Each step's retry policy contributes to the total workflow duration. If a workflow has 10 steps and each can retry 5 times with up to 30 seconds between retries, the worst-case workflow duration is enormous. Set workflow-level timeouts in addition to step-level retry policies.
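The worst case implied by those numbers is easy to compute. This back-of-the-envelope sketch counts only the capped waits between attempts, ignoring execution time and jitter:

```typescript
// Worst-case time a step spends waiting between retries:
// (maxAttempts - 1) waits, each capped at maxDelay.
function worstCaseStepWaitSeconds(maxAttempts: number, maxDelaySeconds: number): number {
  return (maxAttempts - 1) * maxDelaySeconds;
}

const perStep = worstCaseStepWaitSeconds(5, 30); // 4 waits x 30s = 120s per step
const workflowTotal = 10 * perStep;              // 1200s = 20 minutes of waiting alone
```

Twenty minutes of pure retry waiting, before a single step has executed, is why a workflow-level deadline belongs alongside per-step retry policies.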
Conclusion
Retry strategies are a fundamental building block of resilient distributed systems. Simple exponential backoff is a good starting point, but production systems benefit from jitter to prevent thundering herds, circuit breakers to handle sustained failures, error classification to avoid retrying permanent failures, and adaptive policies to respond to changing conditions.
Alfred's retry framework provides all of these capabilities through a composable TypeScript API. By combining the right retry strategy with proper error classification and circuit breakers, you can build workflows that gracefully handle the full spectrum of failure modes in distributed systems. The key is to think of retries not as a way to ignore failures, but as a structured approach to giving transient failures a chance to resolve while protecting both your workflows and the services they depend on.