Compensating Transactions in Distributed Systems

A practical guide to designing and implementing compensating transactions for distributed workflows, including semantic rollback strategies and failure handling in TypeScript.

Technical | 11 min read | By Klivvr Engineering

In a traditional monolithic application backed by a single database, rolling back a failed operation is straightforward: issue a ROLLBACK command and the database undoes everything. In a distributed system where a single business operation spans multiple services, each with its own database, there is no global ROLLBACK. Once a local transaction in one service has committed, it cannot be magically undone by another service.

Compensating transactions fill this gap. A compensating transaction is an operation that semantically reverses the effect of a previously committed transaction. It does not restore the exact previous state in the way a database rollback does. Instead, it performs a new operation whose business effect cancels out the original. Cancelling a flight booking is a compensating transaction for creating that booking. Issuing a refund is a compensating transaction for charging a payment.

This distinction matters. A database rollback is invisible to the outside world. A compensating transaction is a visible business operation with its own effects, audit trail, and potential for failure. Understanding and embracing this difference is key to building robust distributed workflows.

Designing Compensating Actions

Not every operation has a natural compensating action. The first step in designing compensations is to categorize your operations by their reversibility.

// Category 1: Fully reversible operations
// The compensation perfectly undoes the original action
interface FullyReversible {
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
}
 
// Example: Inventory reservation
const reserveInventory: FullyReversible = {
  execute: async () => {
    await inventoryService.reserve(orderId, items);
  },
  compensate: async () => {
    await inventoryService.release(orderId, items);
  },
};
 
// Category 2: Partially reversible operations
// The compensation approximates undoing the original action
// but some effects cannot be fully reversed
interface PartiallyReversible {
  execute: () => Promise<void>;
  compensate: () => Promise<void>;
  residualEffects: string[];
}
 
// Example: Payment charge (refund does not undo processing fees)
const chargePayment: PartiallyReversible = {
  execute: async () => {
    await paymentService.charge(orderId, amount);
  },
  compensate: async () => {
    await paymentService.refund(orderId, amount);
    // Note: payment processor fees are not refunded
  },
  residualEffects: ['Payment processing fees incurred'],
};
 
// Category 3: Irreversible operations
// No compensation is possible; must use alternative strategies
interface Irreversible {
  execute: () => Promise<void>;
  mitigate: () => Promise<void>;
  mitigation: string;
}
 
// Example: Sending an email (cannot be unsent)
const sendNotification: Irreversible = {
  execute: async () => {
    await emailService.send(customerId, 'order-confirmed', orderDetails);
  },
  mitigate: async () => {
    // Cannot unsend, but can send a correction
    await emailService.send(customerId, 'order-cancelled', cancellationDetails);
  },
  mitigation: 'Send correction email',
};

Understanding these categories helps you design workflows that sequence operations appropriately. Irreversible operations should be placed as late as possible in the workflow, after all reversible operations have succeeded. This minimizes the chance that you need to "undo" something that cannot be undone.
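The sequencing rule above can be checked mechanically. The following is a minimal sketch, assuming each step is tagged with a reversibility category; the `PlannedStep` type and `irreversibleStepsPlacedTooEarly` function are hypothetical names used for illustration and are not part of Alfred.

```typescript
// Hypothetical types for illustration; they are not part of Alfred's API.
type Reversibility = 'full' | 'partial' | 'irreversible';

interface PlannedStep {
  name: string;
  reversibility: Reversibility;
}

// Returns the names of irreversible steps that run before at least one
// fully or partially reversible step.
function irreversibleStepsPlacedTooEarly(steps: PlannedStep[]): string[] {
  let lastReversibleIndex = -1;
  steps.forEach((step, index) => {
    if (step.reversibility !== 'irreversible') lastReversibleIndex = index;
  });
  return steps
    .filter((step, index) => step.reversibility === 'irreversible' && index < lastReversibleIndex)
    .map((step) => step.name);
}

const checkout: PlannedStep[] = [
  { name: 'reserve-inventory', reversibility: 'full' },
  { name: 'send-confirmation-email', reversibility: 'irreversible' },
  { name: 'charge-payment', reversibility: 'partial' },
];

// The email is sent before the payment is charged, so it is flagged.
const misplaced = irreversibleStepsPlacedTooEarly(checkout);
```

A check like this could run in a unit test for each workflow definition, so that a misplaced irreversible step is caught at review time rather than in production.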

Implementing Compensation in Alfred

Alfred's compensation framework associates each workflow step with its compensating action and manages the compensation sequence automatically when a failure occurs.

import { CompensatingWorkflow, CompensationLog } from '@alfred/compensation';
 
interface TransferContext {
  transferId: string;
  sourceAccountId: string;
  destinationAccountId: string;
  amount: number;
  sourceDebitId?: string;
  destinationCreditId?: string;
  notificationSent?: boolean;
}
 
const moneyTransfer = new CompensatingWorkflow<TransferContext>('money-transfer')
  .step({
    name: 'validate-accounts',
    execute: async (ctx) => {
      const source = await accountService.get(ctx.sourceAccountId);
      const destination = await accountService.get(ctx.destinationAccountId);
 
      if (!source || !destination) {
        throw new PermanentError('One or both accounts do not exist');
      }
      if (source.balance < ctx.amount) {
        throw new PermanentError('Insufficient funds');
      }
      return ctx;
    },
    // Validation has no side effects, so no compensation needed
    compensate: async () => {},
  })
  .step({
    name: 'debit-source',
    execute: async (ctx) => {
      const debit = await accountService.debit(
        ctx.sourceAccountId,
        ctx.amount,
        ctx.transferId
      );
      return { ...ctx, sourceDebitId: debit.transactionId };
    },
    compensate: async (ctx) => {
      if (ctx.sourceDebitId) {
        await accountService.credit(
          ctx.sourceAccountId,
          ctx.amount,
          `compensation-${ctx.transferId}`
        );
      }
    },
  })
  .step({
    name: 'credit-destination',
    execute: async (ctx) => {
      const credit = await accountService.credit(
        ctx.destinationAccountId,
        ctx.amount,
        ctx.transferId
      );
      return { ...ctx, destinationCreditId: credit.transactionId };
    },
    compensate: async (ctx) => {
      if (ctx.destinationCreditId) {
        await accountService.debit(
          ctx.destinationAccountId,
          ctx.amount,
          `compensation-${ctx.transferId}`
        );
      }
    },
  })
  .step({
    name: 'send-confirmation',
    execute: async (ctx) => {
      await notificationService.sendTransferConfirmation(
        ctx.sourceAccountId,
        ctx.destinationAccountId,
        ctx.amount
      );
      return { ...ctx, notificationSent: true };
    },
    compensate: async (ctx) => {
      if (ctx.notificationSent) {
        // Cannot unsend, but can send a reversal notification
        await notificationService.sendTransferReversalNotice(
          ctx.sourceAccountId,
          ctx.destinationAccountId,
          ctx.amount
        );
      }
    },
  })
  .build();

When the workflow executes and credit-destination fails, Alfred automatically runs the compensation for debit-source, crediting the amount back to the source account. Nothing needs to be undone for validate-accounts, because the step has no side effects and its compensation is a no-op. The compensation for credit-destination does not run because that step never completed.
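To make the mechanics concrete, here is a minimal sketch of a runner that executes steps in order and, on failure, compensates the completed steps in reverse. It is not Alfred's implementation: `Step`, `RunResult`, and `runWithCompensation` are hypothetical names, there are no retries or persistence, and the sketch invokes the compensation of every completed step, including no-ops.

```typescript
// Hypothetical step shape, mirroring the execute/compensate pairs above.
interface Step<C> {
  name: string;
  execute: (ctx: C) => Promise<C>;
  compensate: (ctx: C) => Promise<void>;
}

interface RunResult<C> {
  status: 'completed' | 'compensated';
  context: C;
  compensated: string[]; // Step names, in the order they were compensated.
}

async function runWithCompensation<C>(steps: Step<C>[], initial: C): Promise<RunResult<C>> {
  const completed: Step<C>[] = [];
  let ctx = initial;

  for (const step of steps) {
    try {
      ctx = await step.execute(ctx);
      completed.push(step);
    } catch {
      // The failed step never completed, so only earlier steps are undone,
      // most recent first.
      const compensated: string[] = [];
      for (const done of completed.slice().reverse()) {
        await done.compensate(ctx);
        compensated.push(done.name);
      }
      return { status: 'compensated', context: ctx, compensated };
    }
  }

  return { status: 'completed', context: ctx, compensated: [] };
}
```

Each compensation receives the latest context, which is why the forward steps above record identifiers such as sourceDebitId as they go. A production runner would also need to persist progress so that compensation can resume after a crash.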

Compensation Ordering and Dependencies

By default, Alfred runs compensations in reverse order of execution. This is correct for most workflows because later steps often depend on the results of earlier steps. However, some scenarios require more nuanced ordering.

import { CompensatingWorkflow, CompensationStrategy } from '@alfred/compensation';
 
const complexWorkflow = new CompensatingWorkflow<ComplexContext>('complex-process')
  .step({
    name: 'step-a',
    execute: async (ctx) => { /* ... */ return ctx; },
    compensate: async (ctx) => { /* reverse A */ },
  })
  .step({
    name: 'step-b',
    execute: async (ctx) => { /* ... */ return ctx; },
    compensate: async (ctx) => { /* reverse B */ },
  })
  .step({
    name: 'step-c',
    execute: async (ctx) => { /* ... */ return ctx; },
    compensate: async (ctx) => { /* reverse C */ },
  })
  .compensationStrategy(CompensationStrategy.custom({
    // Step B and C can be compensated in parallel
    // but Step A must be compensated last
    order: [
      { parallel: ['step-c', 'step-b'] },
      { sequential: ['step-a'] },
    ],
  }))
  .build();

Parallel compensation can significantly reduce the total compensation time when the compensating actions are independent. In the example above, reversing steps B and C simultaneously before reversing step A takes about two compensation durations instead of three, cutting compensation time by roughly a third compared to sequential reversal, assuming each compensation takes about the same time.

However, parallel compensation introduces a new failure mode: if one parallel compensation fails while another succeeds, you are left with a partial rollback. Alfred handles this by tracking each compensation independently and retrying only the failed ones.
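The idea of tracking parallel compensations independently can be sketched in a few lines. This is an illustration under stated assumptions rather than Alfred's internals: `NamedCompensation` and `compensateInParallel` are hypothetical names, and retries happen in simple rounds with no delay between them.

```typescript
interface NamedCompensation {
  name: string;
  run: () => Promise<void>;
}

interface ParallelOutcome {
  succeeded: string[];
  failed: string[];
}

async function compensateInParallel(
  compensations: NamedCompensation[],
  maxRounds: number,
): Promise<ParallelOutcome> {
  const succeeded: string[] = [];
  let pending = compensations;

  for (let round = 0; round < maxRounds && pending.length > 0; round++) {
    // Start every pending compensation at once and record each outcome
    // separately, so one failure does not hide another's success.
    const outcomes = await Promise.all(
      pending.map((compensation) =>
        compensation.run().then(
          () => ({ compensation, ok: true }),
          () => ({ compensation, ok: false }),
        ),
      ),
    );
    for (const outcome of outcomes) {
      if (outcome.ok) succeeded.push(outcome.compensation.name);
    }
    // Only the failed compensations are carried into the next round.
    pending = outcomes.filter((o) => !o.ok).map((o) => o.compensation);
  }

  return { succeeded, failed: pending.map((c) => c.name) };
}
```

Because a compensation that already succeeded is never run again, this approach stays safe even when an individual compensation is not perfectly idempotent, although idempotency remains the better defense.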

Handling Compensation Failures

The hardest problem in compensating transactions is what happens when a compensation itself fails. You are already in an error recovery path, and now the recovery has failed. Alfred uses a tiered approach to handle this scenario.

import {
  CompensatingWorkflow,
  CompensationFailureHandler,
  DeadLetterStore,
} from '@alfred/compensation';
 
const deadLetterStore = new DeadLetterStore({
  connectionString: process.env.DATABASE_URL,
  tableName: 'compensation_dead_letters',
});
 
const robustWorkflow = new CompensatingWorkflow<TransferContext>('robust-transfer')
  .step({
    name: 'debit-source',
    execute: async (ctx) => {
      const debit = await accountService.debit(
        ctx.sourceAccountId,
        ctx.amount,
        ctx.transferId
      );
      return { ...ctx, sourceDebitId: debit.transactionId };
    },
    compensate: async (ctx) => {
      if (ctx.sourceDebitId) {
        await accountService.credit(
          ctx.sourceAccountId,
          ctx.amount,
          `compensation-${ctx.transferId}`
        );
      }
    },
    compensationPolicy: {
      // Tier 1: Immediate retries with backoff
      retries: {
        maxAttempts: 3,
        backoff: 'exponential',
        initialDelay: 1000,
        maxDelay: 10000,
      },
      // Tier 2: Delayed retries
      delayedRetries: {
        maxAttempts: 5,
        delays: [60000, 300000, 900000, 3600000, 7200000], // 1m, 5m, 15m, 1h, 2h
      },
      // Tier 3: Dead letter queue with manual resolution
      // (assumes Alfred adds workflowId to the context; TransferContext does not declare it)
      onExhausted: async (ctx, error, stepName) => {
        await deadLetterStore.enqueue({
          workflowId: ctx.workflowId,
          stepName,
          action: 'compensate',
          context: ctx,
          error: {
            message: error.message,
            stack: error.stack,
          },
          createdAt: new Date(),
        });
 
        await alertService.critical({
          title: `Compensation failure: ${stepName}`,
          description: `Failed to compensate step "${stepName}" in workflow ${ctx.workflowId}`,
          runbook: 'https://wiki.internal/runbooks/compensation-failure',
          metadata: {
            transferId: ctx.transferId,
            sourceAccountId: ctx.sourceAccountId,
            amount: ctx.amount,
          },
        });
      },
    },
  })
  // ... additional steps
  .build();

The three tiers provide escalating responses to compensation failures. Immediate retries handle transient network issues. Delayed retries handle temporary service outages that last minutes to hours. The dead letter queue captures truly stubborn failures for manual resolution.
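To get a feel for how long a failing compensation stays in the first two tiers, the retry schedule can be computed directly. The sketch below assumes that the exponential policy doubles the delay on each retry, caps it at maxDelay, and waits before every retry; Alfred's exact semantics may differ. `exponentialDelays` and `retryThroughTiers` are hypothetical helpers, and `sleep` is injected so the logic can be exercised without real waiting.

```typescript
interface BackoffPolicy {
  maxAttempts: number;
  initialDelay: number; // milliseconds
  maxDelay: number; // milliseconds
}

// Delay before each retry: initialDelay * 2^attempt, capped at maxDelay.
function exponentialDelays(policy: BackoffPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    delays.push(Math.min(policy.initialDelay * Math.pow(2, attempt), policy.maxDelay));
  }
  return delays;
}

// Retries an operation that has already failed once, waiting for each
// scheduled delay before the next attempt.
async function retryThroughTiers(
  operation: () => Promise<void>,
  delays: number[],
  sleep: (ms: number) => Promise<void>,
): Promise<boolean> {
  for (const delay of delays) {
    await sleep(delay);
    try {
      await operation();
      return true;
    } catch {
      // Swallow the error and move on to the next scheduled retry.
    }
  }
  return false; // Schedule exhausted: the caller should dead-letter the work.
}

// Tier 1 delays followed by the tier 2 delays from the policy above.
const schedule = exponentialDelays({ maxAttempts: 3, initialDelay: 1000, maxDelay: 10000 })
  .concat([60000, 300000, 900000, 3600000, 7200000]);
const worstCaseMs = schedule.reduce((sum, delay) => sum + delay, 0);
```

Under these assumptions the tier 1 delays are 1, 2, and 4 seconds, and the full schedule adds up to 12,067,000 ms, or roughly 3 hours and 21 minutes of waiting before an entry reaches the dead letter queue. That figure is worth keeping in mind when you set alerting expectations.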

The dead letter queue is not just a storage mechanism. It should be backed by a resolution dashboard where operations staff can see pending compensation failures, understand their context, and manually execute or skip compensations.

// Dead letter resolution API
import { DeadLetterEntry, DeadLetterStore, ResolutionAction } from '@alfred/compensation';
 
class CompensationResolutionService {
  constructor(private store: DeadLetterStore) {}
 
  async listPending(filters?: { workflowId?: string; stepName?: string }): Promise<DeadLetterEntry[]> {
    return this.store.query({ status: 'pending', ...filters });
  }
 
  async resolve(entryId: string, action: ResolutionAction): Promise<void> {
    const entry = await this.store.get(entryId);
 
    switch (action.type) {
      case 'retry':
        // Re-attempt the compensation (executeCompensation is omitted from this listing)
        await this.executeCompensation(entry);
        await this.store.markResolved(entryId, 'retried');
        break;
 
      case 'skip':
        // Mark as intentionally skipped (with required justification)
        await this.store.markResolved(entryId, 'skipped', action.justification);
        await auditLog.record({
          action: 'compensation-skipped',
          entryId,
          justification: action.justification,
          resolvedBy: action.resolvedBy,
        });
        break;
 
      case 'manual':
        // Record that the compensation was performed manually
        await this.store.markResolved(entryId, 'manually-resolved', action.notes);
        await auditLog.record({
          action: 'compensation-manual',
          entryId,
          notes: action.notes,
          resolvedBy: action.resolvedBy,
        });
        break;
    }
  }
}

Semantic Compensation vs. Exact Reversal

Compensating transactions do not restore the system to its exact previous state. They achieve semantic equivalence: the business effect is undone, but the system state may differ from what it was before.

// Original operation: Create a shipment
async function createShipment(orderId: string, items: Item[]): Promise<Shipment> {
  const shipment = await shippingService.create({
    orderId,
    items,
    createdAt: new Date(), // Timestamp is recorded
  });
  await auditLog.record({ action: 'shipment-created', shipmentId: shipment.id });
  return shipment;
}
 
// Compensating transaction: Cancel the shipment
// This does NOT delete the shipment record or remove the audit log entry
// It creates a NEW cancellation record
async function cancelShipment(shipmentId: string): Promise<void> {
  await shippingService.cancel(shipmentId);
  // The shipment record still exists with status 'cancelled'
  // The audit log now has both 'shipment-created' and 'shipment-cancelled' entries
  // This is correct: we have a complete audit trail
  await auditLog.record({ action: 'shipment-cancelled', shipmentId });
}

This distinction has practical implications for your data model. Your domain entities need to support the full lifecycle, including cancelled and reversed states, not just the happy path. An order that was placed and then compensated should be distinguishable from an order that was never placed. Both are "not active," but they have different histories.

Design your compensations to be additive rather than destructive. Create cancellation records rather than deleting original records. Mark entries as reversed rather than removing them. This preserves the audit trail and makes debugging much easier.
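One way to keep compensations additive is to treat an entity's history as an append-only list of events and derive its current status from that list. The sketch below assumes a simple in-memory event array; `OrderEvent`, `OrderStatus`, and `orderStatus` are hypothetical names chosen for illustration.

```typescript
type OrderEvent =
  | { type: 'order-placed'; orderId: string; at: Date }
  | { type: 'order-cancelled'; orderId: string; at: Date; reason: string };

type OrderStatus = 'never-placed' | 'active' | 'cancelled';

// Derives the current status by replaying events; nothing is ever deleted.
function orderStatus(events: OrderEvent[], orderId: string): OrderStatus {
  let status: OrderStatus = 'never-placed';
  for (const event of events) {
    if (event.orderId !== orderId) continue;
    status = event.type === 'order-placed' ? 'active' : 'cancelled';
  }
  return status;
}

const events: OrderEvent[] = [];
events.push({ type: 'order-placed', orderId: 'order-1', at: new Date() });
events.push({ type: 'order-placed', orderId: 'order-2', at: new Date() });

// Compensation appends a cancellation instead of removing the placement.
events.push({
  type: 'order-cancelled',
  orderId: 'order-1',
  at: new Date(),
  reason: 'payment step failed',
});
```

With this model, order-1 reads as cancelled while an unknown order reads as never-placed, so the two "not active" cases stay distinguishable and the full history remains available for audits.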

Testing Compensating Transactions

Testing compensations requires a different mindset than testing the happy path. You need to verify that every step's compensation correctly reverses its effect and that the entire compensation chain restores business consistency.

import { CompensationTestHarness } from '@alfred/testing';
 
describe('Money Transfer Compensation', () => {
  let harness: CompensationTestHarness<TransferContext>;
 
  beforeEach(() => {
    harness = new CompensationTestHarness(moneyTransfer);
  });
 
  it('should restore source account balance when credit-destination fails', async () => {
    const initialSourceBalance = await accountService.getBalance('source-account');
 
    // Configure the harness to fail at a specific step
    harness.failAt('credit-destination', new Error('Destination account frozen'));
 
    const result = await harness.execute({
      transferId: 'test-transfer-1',
      sourceAccountId: 'source-account',
      destinationAccountId: 'dest-account',
      amount: 500,
    });
 
    expect(result.status).toBe('compensated');
    expect(result.compensationLog).toEqual([
      { step: 'debit-source', status: 'compensated' },
    ]);
 
    // Verify the source account balance is restored
    const finalSourceBalance = await accountService.getBalance('source-account');
    expect(finalSourceBalance).toBe(initialSourceBalance);
  });
 
  it('should compensate all completed steps in reverse order when the last step fails', async () => {
    harness.failAt('send-confirmation', new Error('Email service down'));
 
    const result = await harness.execute({
      transferId: 'test-transfer-2',
      sourceAccountId: 'source-account',
      destinationAccountId: 'dest-account',
      amount: 1000,
    });
 
    expect(result.status).toBe('compensated');
    expect(result.compensationLog).toHaveLength(2);
    expect(result.compensationLog[0].step).toBe('credit-destination');
    expect(result.compensationLog[1].step).toBe('debit-source');
  });
 
  it('should handle compensation failure gracefully', async () => {
    harness.failAt('credit-destination', new Error('Service timeout'));
    harness.failCompensationAt('debit-source', new Error('Account service down'));
 
    const result = await harness.execute({
      transferId: 'test-transfer-3',
      sourceAccountId: 'source-account',
      destinationAccountId: 'dest-account',
      amount: 250,
    });
 
    expect(result.status).toBe('compensation-failed');
    expect(result.deadLetterEntries).toHaveLength(1);
    expect(result.deadLetterEntries[0].stepName).toBe('debit-source');
  });
});

The CompensationTestHarness lets you inject failures at specific steps and even in specific compensations, allowing you to test every combination of forward and backward failure paths.

Practical Tips

Design compensations at the same time as forward actions. If you defer compensation design, you will find that critical context information is not being captured, making compensation impossible after the fact.

Always check preconditions in compensating actions. Before refunding a payment, check that the payment exists and has not already been refunded. This makes compensations idempotent and safe to retry.
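A precondition check of this kind might look like the following sketch. The in-memory `payments` map stands in for a real payment service, and `compensateCharge` and `RefundOutcome` are hypothetical names; none of this is taken from Alfred.

```typescript
type PaymentState = 'charged' | 'refunded';

interface Payment {
  amount: number;
  state: PaymentState;
}

type RefundOutcome = 'refunded' | 'already-refunded' | 'nothing-to-refund';

// In-memory stand-in for a payment service; a real one would be remote.
const payments = new Map<string, Payment>();

function compensateCharge(orderId: string): RefundOutcome {
  const payment = payments.get(orderId);
  if (payment === undefined) return 'nothing-to-refund'; // The charge never happened.
  if (payment.state === 'refunded') return 'already-refunded'; // An earlier attempt succeeded.
  payment.state = 'refunded';
  return 'refunded';
}

payments.set('order-42', { amount: 120, state: 'charged' });
const firstAttempt = compensateCharge('order-42'); // 'refunded'
const retry = compensateCharge('order-42'); // 'already-refunded'
```

Against a remote service, a check followed by an action can still race with a concurrent retry. In practice the check is usually paired with an idempotency key that the downstream service enforces, so that a duplicate refund request is rejected at the source.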

Log compensation actions at the same level of detail as forward actions. When an issue arises in production, the compensation logs are often the most important diagnostic tool.

Consider the time dimension. A compensation that runs immediately after a forward action may behave differently than one that runs hours later. State may have changed, dependencies may have shifted. Design compensations that are robust to timing variations.
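As an illustration of the time dimension, consider reversing a credit on a destination account. Immediately after the transfer the funds are still there, but hours later the account holder may have spent them. The sketch below assumes balances are plain numbers; `reverseCredit` and `ReversalOutcome` are hypothetical names.

```typescript
type ReversalOutcome =
  | { kind: 'reversed'; newBalance: number }
  | { kind: 'escalated'; shortfall: number };

// Reverses a credit only if the current balance still covers it;
// otherwise reports the shortfall so the case can be escalated.
function reverseCredit(currentBalance: number, amount: number): ReversalOutcome {
  if (currentBalance >= amount) {
    return { kind: 'reversed', newBalance: currentBalance - amount };
  }
  return { kind: 'escalated', shortfall: amount - currentBalance };
}

// Right after the transfer the full amount is still available.
const immediate = reverseCredit(500, 500);

// Hours later most of the money has been spent.
const hoursLater = reverseCredit(120, 500);
```

Returning an explicit escalation outcome, rather than throwing or overdrawing the account, gives the workflow a defined path for the case where state has drifted since the forward action ran.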

Place irreversible operations as late as possible in your workflow. The later a step executes, the fewer subsequent steps remain that could fail and force you to mitigate it. A workflow that never has to mitigate an irreversible step is more resilient.

Conclusion

Compensating transactions are the undo mechanism of the distributed world. Unlike database rollbacks, they are visible business operations that must be carefully designed, rigorously tested, and robustly implemented. Alfred provides the framework for defining compensations alongside forward actions, managing compensation sequences automatically, handling compensation failures through retries and dead letter queues, and testing compensation paths systematically.

The key insight is that compensating transactions are not an afterthought. They are a first-class part of your workflow design, deserving the same care and attention as the forward path. In a distributed system, the ability to reliably undo what you have done is just as important as the ability to do it in the first place.
