Implementing the Saga Pattern in TypeScript
Learn how to implement the saga pattern in TypeScript to manage distributed transactions across microservices, with practical examples of both choreography and orchestration approaches.
Traditional database transactions give us ACID guarantees: atomicity, consistency, isolation, and durability. But when a single business operation spans multiple services, each with its own database, a traditional transaction is no longer possible. You cannot hold a distributed lock across three microservices, an external payment API, and a shipping provider without grinding your entire system to a halt.
The saga pattern solves this problem by breaking a distributed transaction into a sequence of local transactions, each with a corresponding compensating action that undoes its effect. If any step in the saga fails, the compensating actions for all previously completed steps execute in reverse order, returning the system to a consistent state. A saga gives up strict isolation in exchange for availability, since no service holds locks while it waits on another, and that is the right trade-off for most distributed business processes.
In this article, we walk through a production-quality implementation of the saga pattern in TypeScript using Alfred's orchestration engine.
Understanding Saga Semantics
Before writing code, it is important to understand what a saga guarantees and what it does not. A saga provides atomicity at the business level: either all steps complete successfully or all completed steps are compensated. It does not provide isolation. Other operations can see intermediate states where some steps have completed and others have not.
This has practical implications. Consider an order processing saga where step one reserves inventory and step two charges the customer's credit card. Between these two steps, there is a window where inventory is reserved but payment has not been collected. Another process could observe this intermediate state. Your system must tolerate these windows of inconsistency.
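One common way to tolerate this window is to make the intermediate state explicit, for example by marking a reservation as provisional until payment succeeds. The sketch below illustrates the idea with an in-memory inventory. The `Inventory` class, its method names, and the `pending`/`confirmed` statuses are assumptions made for illustration and are not part of Alfred.

```typescript
type ReservationStatus = 'pending' | 'confirmed';

interface Reservation {
  orderId: string;
  sku: string;
  quantity: number;
  status: ReservationStatus;
}

// Hypothetical in-memory inventory, used only to illustrate the idea.
class Inventory {
  private reservations: Reservation[] = [];
  private stock: Map<string, number>;

  constructor(stock: Map<string, number>) {
    this.stock = stock;
  }

  // Saga step one: hold stock, but flag the hold as provisional.
  reserve(orderId: string, sku: string, quantity: number): void {
    if (this.available(sku) < quantity) {
      throw new Error(`Insufficient stock for ${sku}`);
    }
    this.reservations.push({ orderId, sku, quantity, status: 'pending' });
  }

  // Called once the payment step has succeeded.
  confirm(orderId: string): void {
    for (const r of this.reservations) {
      if (r.orderId === orderId) r.status = 'confirmed';
    }
  }

  // Compensation: drop the order's provisional holds only.
  release(orderId: string): void {
    this.reservations = this.reservations.filter(
      (r) => !(r.orderId === orderId && r.status === 'pending'),
    );
  }

  // Pending holds count against availability, so stock is not oversold.
  available(sku: string): number {
    const held = this.reservations
      .filter((r) => r.sku === sku)
      .reduce((sum, r) => sum + r.quantity, 0);
    return (this.stock.get(sku) ?? 0) - held;
  }

  // Readers that need settled data can ignore provisional holds.
  confirmedQuantity(sku: string): number {
    return this.reservations
      .filter((r) => r.sku === sku && r.status === 'confirmed')
      .reduce((sum, r) => sum + r.quantity, 0);
  }
}
```

Other processes can still observe the pending reservation, but they can tell it apart from a settled one and decide how to treat it.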
There are two flavors of the saga pattern. In choreography-based sagas, each service publishes events and other services react to them; there is no central coordinator. In orchestration-based sagas, a central orchestrator tells each service what to do and when. Alfred implements the orchestration approach because it keeps the entire workflow visible in one place, which makes debugging and monitoring easier.
// The fundamental building block: a saga step with its compensation
interface SagaStep<TContext> {
  name: string;
  execute: (ctx: TContext) => Promise<TContext>;
  compensate: (ctx: TContext) => Promise<void>;
}
Every saga step is a pair: an execute function that performs the forward action and a compensate function that undoes it. The compensate function receives the same context that was produced by the execute function, so it has all the information it needs to reverse the operation.
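To show how these pairs fit together, here is a minimal sketch of an orchestrator loop built on the `SagaStep` interface above. It is not Alfred's implementation: `runSaga` and `SagaOutcome` are hypothetical names, and the sketch assumes compensations themselves succeed (a failing compensation simply rejects the returned promise).

```typescript
interface SagaStep<TContext> {
  name: string;
  execute: (ctx: TContext) => Promise<TContext>;
  compensate: (ctx: TContext) => Promise<void>;
}

// Hypothetical outcome type for this sketch.
type SagaOutcome<TContext> =
  | { status: 'completed'; context: TContext }
  | { status: 'compensated'; context: TContext; failedStep: string; compensated: string[] };

// Runs steps in order. On failure, compensates the steps that already
// completed, newest first, and reports which ones were undone.
async function runSaga<TContext>(
  steps: SagaStep<TContext>[],
  initial: TContext,
): Promise<SagaOutcome<TContext>> {
  const completed: SagaStep<TContext>[] = [];
  let ctx = initial;

  for (const step of steps) {
    try {
      ctx = await step.execute(ctx);
      completed.push(step);
    } catch {
      const compensated: string[] = [];
      // The failed step never completed, so only earlier steps are undone.
      for (const done of [...completed].reverse()) {
        await done.compensate(ctx);
        compensated.push(done.name);
      }
      return { status: 'compensated', context: ctx, failedStep: step.name, compensated };
    }
  }
  return { status: 'completed', context: ctx };
}
```

The context passed to each compensation is the last one successfully produced, so it contains every identifier recorded by the steps being undone.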
Building a Saga Orchestrator
Alfred's saga builder provides a fluent API for defining sagas. Each step registers both its forward action and its compensating transaction. The orchestrator manages execution order, error handling, and compensation sequencing.
import { SagaBuilder, SagaContext, SagaResult } from '@alfred/saga';

interface TravelBookingContext extends SagaContext {
  tripId: string;
  userId: string;
  flightId?: string;
  hotelReservationId?: string;
  carRentalId?: string;
  paymentTransactionId?: string;
}

const travelBookingSaga = new SagaBuilder<TravelBookingContext>('travel-booking')
  .step('book-flight')
    .execute(async (ctx) => {
      const flight = await flightService.book({
        userId: ctx.userId,
        tripId: ctx.tripId,
      });
      return { ...ctx, flightId: flight.id };
    })
    .compensate(async (ctx) => {
      if (ctx.flightId) {
        await flightService.cancel(ctx.flightId);
      }
    })
  .step('reserve-hotel')
    .execute(async (ctx) => {
      const reservation = await hotelService.reserve({
        userId: ctx.userId,
        tripId: ctx.tripId,
      });
      return { ...ctx, hotelReservationId: reservation.id };
    })
    .compensate(async (ctx) => {
      if (ctx.hotelReservationId) {
        await hotelService.cancelReservation(ctx.hotelReservationId);
      }
    })
  .step('rent-car')
    .execute(async (ctx) => {
      const rental = await carRentalService.book({
        userId: ctx.userId,
        tripId: ctx.tripId,
      });
      return { ...ctx, carRentalId: rental.id };
    })
    .compensate(async (ctx) => {
      if (ctx.carRentalId) {
        await carRentalService.cancel(ctx.carRentalId);
      }
    })
  .step('charge-payment')
    .execute(async (ctx) => {
      const transaction = await paymentService.charge({
        userId: ctx.userId,
        tripId: ctx.tripId,
        amount: await calculateTotalCost(ctx),
      });
      return { ...ctx, paymentTransactionId: transaction.id };
    })
    .compensate(async (ctx) => {
      if (ctx.paymentTransactionId) {
        await paymentService.refund(ctx.paymentTransactionId);
      }
    })
  .build();
When you execute this saga, Alfred runs each step in order. If the rent-car step fails, Alfred automatically runs the compensating actions for reserve-hotel and then book-flight in reverse order. The charge-payment compensation is not run because that step never executed.
const result: SagaResult<TravelBookingContext> = await travelBookingSaga.execute({
  tripId: 'trip-12345',
  userId: 'user-67890',
});

if (result.status === 'completed') {
  console.log('Booking confirmed:', result.context);
} else if (result.status === 'compensated') {
  console.log('Booking failed, all actions reversed:', result.compensationLog);
} else if (result.status === 'compensation-failed') {
  console.error('CRITICAL: Compensation failed, manual intervention required');
  await alertOpsTeam(result);
}
Notice the three possible outcomes. The happy path is completed. The sad-but-handled path is compensated, meaning something failed but all previous steps were successfully reversed. The dangerous path is compensation-failed, where a compensating action itself failed, leaving the system in an inconsistent state that requires manual intervention.
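Because each outcome needs different handling, it can help to model the result as a discriminated union so the compiler flags any status you forget to handle. The sketch below assumes a result shape: `BookingResult`, `failedStep`, and `describeResult` are hypothetical, and the field names only mirror the example above rather than Alfred's actual type definitions.

```typescript
// Hypothetical result shape, not taken from Alfred's type definitions.
type BookingResult<TContext> =
  | { status: 'completed'; context: TContext }
  | { status: 'compensated'; context: TContext; compensationLog: string[] }
  | { status: 'compensation-failed'; context: TContext; failedStep: string };

function describeResult<TContext>(result: BookingResult<TContext>): string {
  switch (result.status) {
    case 'completed':
      return 'Booking confirmed';
    case 'compensated':
      return `Booking failed, reversed: ${result.compensationLog.join(', ')}`;
    case 'compensation-failed':
      return `CRITICAL: could not undo ${result.failedStep}`;
    default: {
      // If a new status is added to the union, this assignment stops compiling.
      const unreachable: never = result;
      return unreachable;
    }
  }
}
```

The `never` assignment in the default branch is a standard TypeScript exhaustiveness check: it costs nothing at runtime and turns a forgotten outcome into a compile error.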
Handling Compensation Failures
Compensation failures are the hardest problem in saga implementations. If the cancel-flight compensating action fails because the flight service is down, you cannot simply give up. The system is in an inconsistent state: the hotel reservation was cancelled, but the flight booking still exists.
Alfred addresses this with a multi-layered retry and escalation strategy for compensating actions.
const resilientSaga = new SagaBuilder<TravelBookingContext>('travel-booking')
  .step('book-flight')
    .execute(async (ctx) => {
      const flight = await flightService.book({
        userId: ctx.userId,
        tripId: ctx.tripId,
      });
      return { ...ctx, flightId: flight.id };
    })
    .compensate(async (ctx) => {
      if (ctx.flightId) {
        await flightService.cancel(ctx.flightId);
      }
    })
    .compensationRetry({
      maxAttempts: 5,
      backoff: 'exponential',
      initialDelay: 1000,
      maxDelay: 60000,
    })
    .onCompensationExhausted(async (ctx, error) => {
      // All retries failed. Store for manual resolution.
      await deadLetterQueue.enqueue({
        sagaId: ctx.sagaId,
        stepName: 'book-flight',
        action: 'compensate',
        context: ctx,
        error: error.message,
        timestamp: Date.now(),
      });
      await alertService.page({
        severity: 'critical',
        message: `Failed to cancel flight ${ctx.flightId} after 5 attempts`,
        runbook: 'https://wiki.internal/runbooks/saga-compensation-failure',
      });
    })
  // ... remaining steps
  .build();
The compensationRetry configuration tells Alfred to retry the compensating action with exponential backoff before giving up. If all retries are exhausted, the onCompensationExhausted handler fires, which should write the failed compensation to a dead letter queue and alert your operations team.
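For intuition about what such a policy does, here is a minimal sketch of a retry loop with exponential backoff. It assumes a plain doubling scheme with a cap and no jitter; Alfred's actual backoff formula is not documented in this article and may differ. `RetryPolicy`, `backoffDelay`, and `retryCompensation` are hypothetical names.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialDelay: number; // milliseconds
  maxDelay: number; // milliseconds
}

// Delay after failed attempt `attempt` (1-based):
// initialDelay * 2^(attempt - 1), capped at maxDelay.
function backoffDelay(policy: RetryPolicy, attempt: number): number {
  return Math.min(policy.initialDelay * 2 ** (attempt - 1), policy.maxDelay);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries `action` until it succeeds or attempts run
// out, then hands the last error to `onExhausted`. Returns true on success.
async function retryCompensation(
  action: () => Promise<void>,
  policy: RetryPolicy,
  onExhausted: (error: Error) => Promise<void>,
  wait: (ms: number) => Promise<void> = sleep, // injectable for tests
): Promise<boolean> {
  let lastError = new Error('compensation was never attempted');
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      await action();
      return true;
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      // No point waiting after the final attempt.
      if (attempt < policy.maxAttempts) {
        await wait(backoffDelay(policy, attempt));
      }
    }
  }
  await onExhausted(lastError);
  return false;
}
```

Under this doubling scheme, the policy from the example (5 attempts, 1000 ms initial delay) waits 1000, 2000, 4000, and 8000 ms between attempts, so the 60000 ms cap only matters for policies with more attempts.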
This dead letter queue is critical infrastructure. You should have a dashboard that shows all pending compensation failures and a process for resolving them manually. In practice, these events are rare, but when they happen, they require immediate attention.
Saga State Persistence
For sagas that span minutes or hours, in-memory state is not sufficient. Alfred persists the saga state at every step boundary, recording which steps have completed, the current context, and enough information to resume or compensate at any point.
import { SagaBuilder, PostgresSagaStore } from '@alfred/saga';

const sagaStore = new PostgresSagaStore({
  connectionString: process.env.DATABASE_URL,
  tableName: 'saga_instances',
  schemaName: 'alfred',
});

const persistentSaga = new SagaBuilder<TravelBookingContext>('travel-booking')
  .withStore(sagaStore)
  .withTimeout('30m')
  .step('book-flight')
    .execute(async (ctx) => {
      // ...flight booking logic
      return { ...ctx, flightId: 'FL-123' };
    })
    .compensate(async (ctx) => {
      // ...cancellation logic
    })
  // ... remaining steps
  .build();

// The saga can be resumed after a process restart
const sagaInstance = await sagaStore.findById('saga-instance-id');
if (sagaInstance && sagaInstance.status === 'in-progress') {
  await persistentSaga.resume(sagaInstance);
}
The saga store records a complete audit trail: which steps executed, when they completed, what the context looked like at each step boundary, and whether compensations were triggered. This audit trail is invaluable for debugging production issues and for compliance in regulated industries.
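To make checkpointing at step boundaries concrete, here is a minimal sketch that saves progress after every completed step and resumes from the last checkpoint. It uses an in-memory map as a stand-in for a durable store. `Step`, `SagaRecord`, `InMemorySagaStore`, and `runFrom` are hypothetical names for illustration and do not describe how `PostgresSagaStore` works internally.

```typescript
interface Step<TContext> {
  name: string;
  execute: (ctx: TContext) => Promise<TContext>;
}

// Hypothetical persisted record: just enough to resume after a crash.
interface SagaRecord<TContext> {
  id: string;
  status: 'in-progress' | 'completed';
  nextStep: number; // index of the first step that has not completed
  context: TContext;
}

// Stand-in for a durable store; a real one would write to a database.
class InMemorySagaStore<TContext> {
  private records = new Map<string, SagaRecord<TContext>>();

  async save(record: SagaRecord<TContext>): Promise<void> {
    // Copy so later changes by the caller cannot alter the checkpoint.
    this.records.set(record.id, { ...record });
  }

  async findById(id: string): Promise<SagaRecord<TContext> | undefined> {
    const found = this.records.get(id);
    return found ? { ...found } : undefined;
  }
}

// Runs the steps starting at record.nextStep and checkpoints after each one.
async function runFrom<TContext>(
  steps: Step<TContext>[],
  store: InMemorySagaStore<TContext>,
  record: SagaRecord<TContext>,
): Promise<SagaRecord<TContext>> {
  let current = record;
  for (const step of steps.slice(record.nextStep)) {
    const context = await step.execute(current.context);
    current = { ...current, context, nextStep: current.nextStep + 1 };
    await store.save(current);
  }
  current = { ...current, status: 'completed' };
  await store.save(current);
  return current;
}
```

Note the gap in this design: if the process dies after a step's side effect but before its checkpoint is saved, the step runs again on resume. Forward steps therefore need to be idempotent, just like compensations.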
Alfred also supports a recovery daemon that periodically scans for stale saga instances, those that have been in-progress for longer than their timeout period, and either resumes them or triggers compensation. This ensures that no saga is forgotten, even if the original process that started it crashes without a trace.
import { SagaRecoveryDaemon } from '@alfred/saga';

const recoveryDaemon = new SagaRecoveryDaemon({
  store: sagaStore,
  pollInterval: 30000, // check every 30 seconds
  sagas: [travelBookingSaga, orderFulfillmentSaga, onboardingSaga],
  onRecovery: async (instance, action) => {
    console.log(`Recovering saga ${instance.id}: ${action}`);
    await metricsService.increment('saga.recovery', { saga: instance.name, action });
  },
});

recoveryDaemon.start();
Choreography vs. Orchestration: When to Use Each
While Alfred focuses on orchestration-based sagas, it is worth understanding when choreography might be a better fit.
Choreography works well when the participating services are owned by different teams, the coupling between services should be minimal, and the workflow is simple with few steps. In a choreography, each service publishes domain events and subscribes to events from other services. There is no single point of failure and no central coordinator to maintain.
// Choreography approach (not using Alfred's orchestrator)
// Each service handles its own piece independently

// In the Order Service:
orderEventBus.on('order.created', async (event) => {
  await inventoryService.reserve(event.orderId, event.items);
  await eventBus.publish('inventory.reserved', { orderId: event.orderId });
});

// In the Payment Service:
paymentEventBus.on('inventory.reserved', async (event) => {
  await paymentService.charge(event.orderId);
  await eventBus.publish('payment.charged', { orderId: event.orderId });
});

// In the Shipping Service:
shippingEventBus.on('payment.charged', async (event) => {
  await shippingService.ship(event.orderId);
  await eventBus.publish('order.shipped', { orderId: event.orderId });
});
The problem becomes apparent when you need to answer questions like "what is the current state of order 12345?" or "why did order 12345 fail?". With choreography, the answer is scattered across multiple service logs. With orchestration, it is in one place.
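The choreography handlers above cover only the forward path. In a choreographed saga, compensation is also event-driven: a service publishes a failure event, and the services that already did work subscribe to it and undo their part. The sketch below shows this with a tiny in-process event bus standing in for a message broker. The `EventBus` class, the `payment.failed` event name, and `wireServices` are assumptions made for illustration.

```typescript
type OrderEvent = { orderId: string };
type Handler = (payload: OrderEvent) => Promise<void>;

// Minimal in-process event bus, a stand-in for a real message broker.
class EventBus {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler): void {
    this.handlers.set(event, [...(this.handlers.get(event) ?? []), handler]);
  }

  async publish(event: string, payload: OrderEvent): Promise<void> {
    for (const handler of this.handlers.get(event) ?? []) {
      await handler(payload);
    }
  }
}

// Wires two hypothetical services; `log` records what each one did.
function wireServices(bus: EventBus, chargeSucceeds: boolean, log: string[]): void {
  // Inventory service: reserve on order creation...
  bus.on('order.created', async (e) => {
    log.push('inventory reserved');
    await bus.publish('inventory.reserved', e);
  });
  // ...and compensate when payment fails.
  bus.on('payment.failed', async () => {
    log.push('inventory released');
  });

  // Payment service: reports failure as an event instead of throwing.
  bus.on('inventory.reserved', async (e) => {
    if (chargeSucceeds) {
      log.push('payment charged');
      await bus.publish('payment.charged', e);
    } else {
      log.push('payment failed');
      await bus.publish('payment.failed', e);
    }
  });
}
```

Every new step adds another failure event that earlier services must subscribe to, which is one reason choreography gets harder to follow as workflows grow.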
Orchestration is the better choice when you need visibility into the end-to-end process, when the workflow has many steps or complex branching, when you need to enforce ordering guarantees, and when you want centralized error handling and compensation logic.
In our experience, orchestration provides the right balance of control, visibility, and reliability for most business-critical workflows. That is why Alfred is built around the orchestration model.
Practical Tips
When implementing sagas, keep these guidelines in mind. First, compensating actions must be idempotent. A compensation might be retried after a transient failure, and running it twice should produce the same result as running it once. Use idempotency keys or check-before-act patterns in every compensating function.
Second, design compensations at the same time as forward actions. It is tempting to add compensations later, but by then you may have lost the design context. If you cannot define a compensating action for a step, that is a signal that the step does too much and should be split.
Third, test your compensations as rigorously as your forward path. In production, the compensation path runs infrequently, which means bugs in compensating actions hide until the worst possible moment. Write integration tests that deliberately fail at each step and verify that compensation produces the expected state.
Fourth, monitor your dead letter queue. A saga that fails to compensate is a ticking time bomb. The longer a compensation sits unresolved, the harder it becomes to fix because related state may have changed.
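As an illustration of the first guideline, here is a minimal sketch of an idempotent refund keyed by an idempotency key. `RefundService` is a hypothetical in-memory stand-in for a payment provider that deduplicates requests on the server side, and `compensationKey` is an assumed naming scheme, not an Alfred API.

```typescript
// Hypothetical refund service that remembers which idempotency keys it
// has already processed and returns the original result for repeats.
class RefundService {
  private processed = new Map<string, string>(); // idempotency key -> refund id
  private counter = 0;

  async refund(transactionId: string, idempotencyKey: string): Promise<string> {
    const existing = this.processed.get(idempotencyKey);
    if (existing !== undefined) {
      return existing; // repeat call: no second refund is issued
    }
    this.counter += 1;
    const refundId = `refund-${this.counter}-${transactionId}`;
    this.processed.set(idempotencyKey, refundId);
    return refundId;
  }

  refundsIssued(): number {
    return this.counter;
  }
}

// Derive the key from the saga and step so every retry of the same
// compensation reuses it.
function compensationKey(sagaId: string, stepName: string): string {
  return `${sagaId}:${stepName}:compensate`;
}
```

Because the key depends only on the saga instance and the step, a compensation retried after a timeout or a process restart maps to the same refund instead of creating a new one.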
Conclusion
The saga pattern is essential for managing distributed transactions in microservice architectures. By pairing each forward action with a compensating transaction, sagas provide eventual consistency without the scalability limitations of distributed locks or two-phase commit.
Alfred's TypeScript implementation of the saga pattern gives you a type-safe, composable API for defining sagas, automatic compensation sequencing when failures occur, configurable retry policies for both forward and compensating actions, durable state persistence for long-running sagas, and a recovery daemon for handling process crashes. The pattern is not without its complexities, particularly around compensation failures and intermediate state visibility. But with careful design and the right tooling, sagas are a proven approach to building reliable distributed workflows at scale.