Retry Strategies: Exponential Backoff and Beyond
An in-depth exploration of retry strategies for distributed workflows, covering exponential backoff, jitter, circuit breakers, and adaptive retry policies in TypeScript.
Every network call can fail. DNS resolution times out. TCP connections reset. HTTP servers return 503. Databases hit connection pool limits. In a workflow that makes dozens of remote calls across its lifetime, transient failures are not exceptional events but routine occurrences. The question is not whether to retry, but how.
A naive retry strategy, hammering the failing service in a tight loop, turns a transient hiccup into a sustained outage. It overwhelms the struggling service, delays its recovery, and can cascade into failures across your entire system. Thoughtful retry strategies protect both your workflow and the services it depends on.
This article covers the retry strategies available in Alfred, from simple fixed delays to adaptive policies that learn from the failure patterns they observe.
Fixed Delay and Exponential Backoff
The simplest retry strategy waits a fixed amount of time between each attempt. While easy to reason about, fixed delay retries have a critical flaw: when many workflow instances retry simultaneously, they all hit the target service at the same moment, creating a thundering herd.
import { RetryPolicy } from '@alfred/retry';

// Simple fixed delay: wait 2 seconds between each attempt
const fixedRetry = RetryPolicy.fixed({
  maxAttempts: 3,
  delay: 2000,
});

// Exponential backoff: 1s, 2s, 4s, 8s, 16s...
const exponentialRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
});

Exponential backoff partially addresses the thundering herd by spreading retries over increasingly longer intervals. The first retry happens after 1 second, the second after 2 seconds, the third after 4 seconds, and so on. This gives the target service progressively more breathing room to recover.
However, pure exponential backoff still has a synchronization problem. If 1000 workflow instances all fail at the same time, they will all retry at exactly 1 second, then all retry again at exactly 2 seconds. The retries are spread over time, but they still arrive in synchronized bursts.
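The synchronization is easy to see in a standalone sketch (plain TypeScript, not Alfred's API): simulate many instances that failed at the same moment and compare their computed delays with and without full jitter.

```typescript
// Standalone sketch (plain TypeScript, not Alfred's API): the delay a single
// instance computes for a given attempt under pure exponential backoff.
function backoffDelay(attempt: number, initialDelay: number, multiplier: number): number {
  return initialDelay * Math.pow(multiplier, attempt);
}

// Simulate 1000 instances that all failed at the same moment and are
// scheduling their second retry (attempt index 1).
const instances = 1000;
const pure = Array.from({ length: instances }, () => backoffDelay(1, 1000, 2));
const jittered = Array.from({ length: instances }, () => Math.random() * backoffDelay(1, 1000, 2));

// Without jitter, every instance picks the identical delay: one synchronized burst.
const uniquePure = new Set(pure).size;
// With full jitter, the delays spread across the whole [0, 2000ms) window.
const uniqueJittered = new Set(jittered).size;
```

Every instance in the first array computes exactly the same 2000ms delay, while the jittered delays are effectively all distinct, which is precisely the difference between a burst and a spread-out trickle of retries.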
Adding Jitter
Jitter breaks the synchronization by adding randomness to the delay. Instead of retrying at exactly the calculated backoff time, each instance retries at a random point within a window. This distributes the retry load more evenly over time.
import { RetryPolicy, JitterStrategy } from '@alfred/retry';

// Full jitter: random delay between 0 and the calculated backoff
const fullJitterRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.full(),
});

// Equal jitter: half the backoff plus a random portion of the other half
const equalJitterRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.equal(),
});

// Decorrelated jitter: each delay is random between base and 3x the previous delay
const decorrelatedJitterRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.decorrelated(),
});

Research from AWS and others has shown that decorrelated jitter typically produces the best distribution of retry attempts. Full jitter is a close second and simpler to implement. Equal jitter provides a middle ground by guaranteeing a minimum delay (half the backoff) while still adding randomness.
Here is how Alfred calculates delays for each jitter strategy:
function calculateDelay(
  attempt: number,
  initialDelay: number,
  multiplier: number,
  maxDelay: number,
  jitter: JitterStrategy,
  previousDelay?: number
): number {
  const baseDelay = Math.min(initialDelay * Math.pow(multiplier, attempt), maxDelay);
  switch (jitter.type) {
    case 'none':
      return baseDelay;
    case 'full':
      // Random between 0 and baseDelay
      return Math.random() * baseDelay;
    case 'equal':
      // Half the base plus random up to half the base
      return baseDelay / 2 + Math.random() * (baseDelay / 2);
    case 'decorrelated': {
      // Random between initialDelay and 3x the previous delay
      const prev = previousDelay ?? initialDelay;
      return Math.min(maxDelay, Math.random() * (prev * 3 - initialDelay) + initialDelay);
    }
    default:
      return baseDelay;
  }
}

In practice, we recommend starting with full jitter for most use cases. It is simple, effective, and well-understood. Switch to decorrelated jitter if you observe retry clustering in your metrics.
Circuit Breakers
Retries handle transient failures, but what about sustained failures? If a downstream service is completely down, retrying repeatedly wastes resources and delays the feedback loop to the caller. Circuit breakers address this by short-circuiting requests when a failure threshold is crossed.
import { CircuitBreaker, CircuitBreakerState } from '@alfred/retry';

const paymentCircuitBreaker = new CircuitBreaker({
  name: 'payment-service',
  failureThreshold: 5, // Open circuit after 5 consecutive failures
  successThreshold: 3, // Close circuit after 3 consecutive successes
  timeout: 30000, // Try half-open after 30 seconds
  halfOpenMaxAttempts: 1, // Allow 1 attempt in half-open state
  onStateChange: (from: CircuitBreakerState, to: CircuitBreakerState) => {
    metrics.gauge('circuit_breaker.state', to === 'open' ? 1 : 0, {
      service: 'payment-service',
    });
    if (to === 'open') {
      alertService.warn(`Circuit breaker opened for payment-service`);
    }
  },
});

// Using the circuit breaker with a workflow step
const processPayment = async (ctx: OrderContext): Promise<StepResult<OrderContext>> => {
  try {
    const result = await paymentCircuitBreaker.execute(async () => {
      return await paymentService.charge(ctx.orderId, ctx.amount);
    });
    return StepResult.success({ ...ctx, paymentId: result.id });
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit is open, fail fast without hitting the service
      return StepResult.defer(ctx, { reason: 'payment-service-unavailable', retryAfter: 30000 });
    }
    throw error;
  }
};

The circuit breaker has three states. In the closed state, requests flow normally and failures are counted. When the failure count crosses the threshold, the circuit transitions to open. In the open state, all requests fail immediately without contacting the downstream service. After a timeout period, the circuit moves to half-open, allowing a limited number of test requests through. If those succeed, the circuit closes and normal operation resumes. If they fail, the circuit reopens.
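That state machine fits in a few dozen lines. The sketch below is an illustrative model of the transitions, not Alfred's actual CircuitBreaker; it takes the current time as a parameter so the transitions are easy to follow and test.

```typescript
// Illustrative three-state circuit breaker (closed -> open -> half-open).
// Not Alfred's implementation: time is passed in explicitly for clarity.
type BreakerState = 'closed' | 'open' | 'half-open';

class SimpleBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private successThreshold: number,
    private timeout: number,
  ) {}

  getState(now: number): BreakerState {
    // After the timeout elapses, an open circuit starts admitting test requests.
    if (this.state === 'open' && now - this.openedAt >= this.timeout) {
      this.state = 'half-open';
      this.successes = 0;
    }
    return this.state;
  }

  recordSuccess(now: number): void {
    if (this.getState(now) === 'half-open') {
      this.successes += 1;
      // Enough successful probes close the circuit again.
      if (this.successes >= this.successThreshold) {
        this.state = 'closed';
        this.failures = 0;
      }
    } else {
      this.failures = 0; // Any success in the closed state resets the count
    }
  }

  recordFailure(now: number): void {
    const state = this.getState(now);
    if (state === 'half-open') {
      // A failed probe reopens the circuit immediately.
      this.state = 'open';
      this.openedAt = now;
    } else if (state === 'closed') {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = now;
      }
    }
  }
}
```

With thresholds of 3 failures, 2 successes, and a 30-second timeout, three consecutive failures open the circuit, it becomes half-open 30 seconds later, and two successful probes close it again.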
Alfred integrates circuit breakers directly into its retry framework, so you can combine them with backoff policies.
import { RetryPolicy, CircuitBreaker, JitterStrategy } from '@alfred/retry';

const resilientPolicy = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.full(),
  circuitBreaker: new CircuitBreaker({
    name: 'inventory-service',
    failureThreshold: 3,
    successThreshold: 2,
    timeout: 15000,
  }),
});

// When the circuit is open, retries are skipped entirely.
// The step is deferred for later execution instead of wasting retry attempts.

Selective Retry with Error Classification
Not all errors are worth retrying. A 404 Not Found will return the same result no matter how many times you try. A 400 Bad Request means your payload is wrong, not that the server had a temporary problem. Retrying non-transient errors wastes time and resources.
Alfred's retry framework uses error classifiers to distinguish between transient and permanent failures.
import { RetryPolicy, JitterStrategy, ErrorClassifier, ErrorClass } from '@alfred/retry';

const httpErrorClassifier: ErrorClassifier = (error: unknown): ErrorClass => {
  if (error instanceof HttpError) {
    // Server errors are transient
    if (error.status >= 500) return ErrorClass.TRANSIENT;
    // Rate limiting is transient
    if (error.status === 429) return ErrorClass.TRANSIENT;
    // Request timeout is transient
    if (error.status === 408) return ErrorClass.TRANSIENT;
    // Remaining client errors are permanent
    if (error.status >= 400) return ErrorClass.PERMANENT;
  }
  if (error instanceof NetworkError) {
    // Connection resets, DNS failures, timeouts are transient
    return ErrorClass.TRANSIENT;
  }
  if (error instanceof ValidationError) {
    // Business logic validation failures are permanent
    return ErrorClass.PERMANENT;
  }
  // Unknown errors default to transient (retry to be safe)
  return ErrorClass.TRANSIENT;
};

const smartRetry = RetryPolicy.exponential({
  maxAttempts: 5,
  initialDelay: 1000,
  multiplier: 2,
  maxDelay: 30000,
  jitter: JitterStrategy.full(),
  errorClassifier: httpErrorClassifier,
  onPermanentError: async (error, context) => {
    // Permanent errors skip retries and go straight to failure handling
    await logger.error('Permanent error encountered, skipping retries', {
      error: error.message,
      stepName: context.currentStep,
      workflowId: context.workflowId,
    });
  },
});

The error classifier runs before each retry decision. If it returns PERMANENT, Alfred skips all remaining retry attempts and immediately propagates the failure to the workflow's error handler. If it returns TRANSIENT, the normal retry policy applies.
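The gating logic itself is small. Here is a standalone sketch of a classifier-gated retry loop (backoff delays omitted for brevity; this is illustrative, not Alfred's internal loop):

```typescript
// Standalone sketch of classifier-gated retries. Delays between attempts are
// omitted for brevity; this is not Alfred's internal retry loop.
type ErrClass = 'TRANSIENT' | 'PERMANENT';
type Classify = (error: unknown) => ErrClass;

async function retryWithClassifier<T>(
  fn: () => Promise<T>,
  classify: Classify,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // A permanent error short-circuits all remaining attempts.
      if (classify(error) === 'PERMANENT') throw error;
    }
  }
  throw lastError;
}
```

A call that always throws a permanent error fails after exactly one attempt, while a transient failure keeps being retried until it succeeds or the attempt budget runs out.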
You can also implement custom classifiers for specific services. For example, a payment gateway might return specific error codes that indicate whether a failure is retryable.
const paymentErrorClassifier: ErrorClassifier = (error: unknown): ErrorClass => {
  if (error instanceof PaymentGatewayError) {
    switch (error.code) {
      case 'INSUFFICIENT_FUNDS':
      case 'CARD_EXPIRED':
      case 'INVALID_CARD':
        return ErrorClass.PERMANENT;
      case 'GATEWAY_TIMEOUT':
      case 'PROCESSOR_UNAVAILABLE':
      case 'RATE_LIMITED':
        return ErrorClass.TRANSIENT;
      case 'DUPLICATE_TRANSACTION':
        // Already processed, treat as success (idempotency)
        return ErrorClass.DUPLICATE;
      default:
        return ErrorClass.TRANSIENT;
    }
  }
  return ErrorClass.TRANSIENT;
};

Adaptive Retry Policies
Static retry policies work well for most scenarios, but some systems benefit from policies that adapt to observed conditions. Alfred's adaptive retry policy adjusts its behavior based on recent success and failure rates.
import { RetryPolicy, AdaptiveRetryConfig } from '@alfred/retry';

const adaptiveRetry = RetryPolicy.adaptive({
  // Baseline configuration
  baseMaxAttempts: 5,
  baseInitialDelay: 1000,
  // Observation window
  windowSize: 100, // Track last 100 attempts
  windowDuration: 60000, // Over a 60-second window
  // Adaptation rules
  rules: [
    {
      // When failure rate exceeds 50%, reduce max attempts and increase delay
      condition: (stats) => stats.failureRate > 0.5,
      adjust: (config) => ({
        ...config,
        maxAttempts: 2,
        initialDelay: config.initialDelay * 3,
      }),
    },
    {
      // When failure rate is below 10%, use aggressive retry
      condition: (stats) => stats.failureRate < 0.1,
      adjust: (config) => ({
        ...config,
        maxAttempts: 5,
        initialDelay: 500,
      }),
    },
    {
      // When p99 latency exceeds 5s, back off more aggressively
      condition: (stats) => stats.p99Latency > 5000,
      adjust: (config) => ({
        ...config,
        initialDelay: config.initialDelay * 2,
        multiplier: 3,
      }),
    },
  ],
  // Safety bounds: adaptation cannot exceed these limits
  bounds: {
    minDelay: 100,
    maxDelay: 120000,
    minAttempts: 1,
    maxAttempts: 10,
  },
});

The adaptive policy collects statistics about recent retry attempts, including success rate, failure rate, and latency percentiles. It evaluates the adaptation rules against these statistics and adjusts the retry parameters accordingly. Safety bounds prevent the policy from becoming too aggressive or too conservative.
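The windowed bookkeeping behind those statistics can be modeled as a simple bounded buffer. This sketch is illustrative only; it tracks just the failure rate, whereas Alfred's policy also tracks latency percentiles.

```typescript
// Illustrative sliding-window outcome tracker (count-based window only).
// Not Alfred's implementation, which also tracks latency percentiles and
// a time-based window.
class OutcomeWindow {
  private outcomes: boolean[] = []; // true = success, false = failure

  constructor(private windowSize: number) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    // Keep only the most recent windowSize outcomes.
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  failureRate(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((ok) => !ok).length;
    return failures / this.outcomes.length;
  }
}
```

Because old outcomes age out of the window, the failure rate responds quickly to changing conditions: a burst of failures pushes it up, and a run of successes pulls it back down, which is exactly what the adaptation rules key off.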
This approach is particularly valuable in systems where the downstream service's behavior changes over time. During normal operation, the policy uses a standard configuration. During degraded performance, it automatically backs off to reduce load on the struggling service. When the service recovers, the policy returns to normal parameters.
Practical Tips
When configuring retry policies for your workflows, keep these principles in mind.
Always set a maximum number of attempts. An unbounded retry loop can run indefinitely, consuming resources and blocking workflow progress. Even with exponential backoff, set a hard cap.
Always set a maximum delay. Without a cap, exponential backoff starting at 1 second with a multiplier of 2 exceeds eight minutes by the tenth retry and 17 minutes by the eleventh. In most business workflows, waiting that long between retries is unacceptable. Cap the delay at a reasonable value, typically 30 to 60 seconds.
Monitor your retry metrics. Track the number of retries per step, the success rate on each attempt number, and the total time spent retrying. If a step consistently needs 4 retries to succeed, the underlying service has a reliability problem that retries are masking.
Use different retry policies for different steps. A step that calls a high-availability internal service might use an aggressive retry policy with short delays. A step that calls a rate-limited external API needs a conservative policy with longer delays and fewer attempts.
Consider the end-to-end timeout. Each step's retry policy contributes to the total workflow duration. If a workflow has 10 steps and each can retry 5 times with up to 30 seconds between retries, the worst-case workflow duration is enormous. Set workflow-level timeouts in addition to step-level retry policies.
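The worst case implied by those numbers is easy to compute. This back-of-the-envelope sketch counts only the capped waits between attempts, ignoring execution time and jitter:

```typescript
// Worst-case time a step spends waiting between retries:
// (maxAttempts - 1) waits, each capped at maxDelay.
function worstCaseStepWaitSeconds(maxAttempts: number, maxDelaySeconds: number): number {
  return (maxAttempts - 1) * maxDelaySeconds;
}

const perStep = worstCaseStepWaitSeconds(5, 30); // 4 waits x 30s = 120s per step
const workflowTotal = 10 * perStep;              // 1200s = 20 minutes of waiting alone
```

Twenty minutes of pure retry waiting, before a single step has executed, is why a workflow-level deadline belongs alongside per-step retry policies.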
Conclusion
Retry strategies are a fundamental building block of resilient distributed systems. Simple exponential backoff is a good starting point, but production systems benefit from jitter to prevent thundering herds, circuit breakers to handle sustained failures, error classification to avoid retrying permanent failures, and adaptive policies to respond to changing conditions.
Alfred's retry framework provides all of these capabilities through a composable TypeScript API. By combining the right retry strategy with proper error classification and circuit breakers, you can build workflows that gracefully handle the full spectrum of failure modes in distributed systems. The key is to think of retries not as a way to ignore failures, but as a structured approach to giving transient failures a chance to resolve while protecting both your workflows and the services they depend on.