Error Recovery Patterns in Workflow Engines
Explore the error recovery patterns used in production workflow engines, from simple retries to complex human-in-the-loop escalation strategies, with a focus on business continuity.
In the world of workflow orchestration, errors are not exceptional events. They are a fundamental part of the operating reality. Network connections drop. External APIs return unexpected responses. Databases hit capacity limits. Third-party services go offline for maintenance. Human approvers go on vacation. Data arrives in formats nobody anticipated.
The question that separates a reliable workflow engine from a fragile one is not whether errors occur, but what happens when they do. A well-designed error recovery strategy keeps business processes moving even when individual components fail, minimizes the need for manual intervention, and provides clear escalation paths when automated recovery is not possible.
This article examines the error recovery patterns that Alfred implements, from simple automatic retries to sophisticated human-in-the-loop recovery mechanisms, all viewed through the lens of business continuity rather than pure technical resilience.
Understanding Error Categories
The first step in effective error recovery is understanding what kind of error you are dealing with. Different error categories require fundamentally different recovery strategies.
Transient errors are temporary failures that resolve on their own. A network timeout, a database connection pool exhaustion, or a rate-limited API response are all transient. The correct response is to wait briefly and try again.
Persistent errors are failures in external systems that last longer than a brief retry window but do eventually resolve. A third-party service undergoing scheduled maintenance or a payment processor experiencing an outage falls into this category. The correct response is to defer the work and try again later.
Data errors are failures caused by invalid, inconsistent, or unexpected data. A customer record missing a required field, an order with a negative quantity, or a payment amount that does not match the order total are data errors. Retrying will not help because the same data will produce the same error. The correct response is to fix the data or route the work to someone who can.
Logic errors are bugs in the workflow itself. A step that makes an invalid assumption, a state transition that should not be possible, or a calculation that overflows are logic errors. The correct response is to halt the workflow, alert the development team, and fix the code.
import { ErrorClassifier, ErrorCategory, RecoveryStrategy } from '@alfred/recovery';
// NetworkTimeoutError, RateLimitError, and the other error classes matched
// below are application-defined; dependencyHealth is an application service.
const workflowErrorClassifier: ErrorClassifier = (error: unknown): ErrorCategory => {
// Transient: retry immediately
if (error instanceof NetworkTimeoutError) return ErrorCategory.TRANSIENT;
if (error instanceof ConnectionPoolExhaustedError) return ErrorCategory.TRANSIENT;
if (error instanceof RateLimitError) return ErrorCategory.TRANSIENT;
// Persistent: defer and retry later
if (error instanceof ServiceUnavailableError) return ErrorCategory.PERSISTENT;
if (error instanceof MaintenanceWindowError) return ErrorCategory.PERSISTENT;
// Data: route to human for correction
if (error instanceof ValidationError) return ErrorCategory.DATA;
if (error instanceof MissingFieldError) return ErrorCategory.DATA;
if (error instanceof DataInconsistencyError) return ErrorCategory.DATA;
// Logic: halt and alert developers
if (error instanceof AssertionError) return ErrorCategory.LOGIC;
if (error instanceof TypeError) return ErrorCategory.LOGIC;
if (error instanceof RangeError) return ErrorCategory.LOGIC;
// Unknown: treat as transient for first few retries, then escalate
return ErrorCategory.UNKNOWN;
};
const recoveryRouter: Record<ErrorCategory, RecoveryStrategy> = {
[ErrorCategory.TRANSIENT]: RecoveryStrategy.retryWithBackoff({
maxAttempts: 5,
initialDelay: 1000,
maxDelay: 30000,
}),
[ErrorCategory.PERSISTENT]: RecoveryStrategy.deferAndRetry({
checkInterval: 60000, // Check every minute
maxDeferTime: 3600000, // Max 1 hour
healthCheck: async (stepName) => {
return await dependencyHealth.check(stepName);
},
}),
[ErrorCategory.DATA]: RecoveryStrategy.escalateToHuman({
queue: 'data-correction',
priority: 'normal',
timeout: '24h',
}),
[ErrorCategory.LOGIC]: RecoveryStrategy.haltAndAlert({
severity: 'critical',
team: 'engineering',
channel: '#workflow-alerts',
}),
[ErrorCategory.UNKNOWN]: RecoveryStrategy.tieredRecovery([
RecoveryStrategy.retryWithBackoff({ maxAttempts: 3, initialDelay: 2000 }),
RecoveryStrategy.escalateToHuman({ queue: 'unknown-errors', priority: 'high', timeout: '4h' }),
]),
};

This classification and routing approach ensures that each error type receives the appropriate response. Transient errors do not bother humans. Data errors do not waste retry attempts. Logic errors do not silently retry and produce inconsistent results.
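The retryWithBackoff schedule configured above can be sketched as a small delay function. This is a minimal illustration, not the @alfred/recovery implementation; the function name and the "full jitter" choice are assumptions:

```typescript
// Compute the delay before the nth retry attempt (0-indexed).
// The delay doubles each attempt and is capped at maxDelay; "full jitter"
// randomizes it so many workflows retrying at once do not synchronize.
function backoffDelay(
  attempt: number,
  initialDelay: number,
  maxDelay: number,
  jitter: boolean = true
): number {
  const exponential = Math.min(initialDelay * 2 ** attempt, maxDelay);
  return jitter ? Math.random() * exponential : exponential;
}

// Without jitter, initialDelay=1000 and maxDelay=30000 yield the schedule:
// 1000, 2000, 4000, 8000, 16000, 30000, 30000, ...
```

Capping the delay matters: without maxDelay, a workflow on its tenth attempt would wait overseventeen minutes between tries, long past the point where a transient error should have been reclassified as persistent.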
Checkpoint and Resume
Long-running workflows that fail midway through should not start over from the beginning. Alfred uses checkpointing to record the workflow's progress at each step boundary, allowing it to resume from the last successful checkpoint after recovery.
import { WorkflowBuilder, CheckpointStore, PostgresCheckpointStore, StepResult } from '@alfred/core';
interface ETLContext {
jobId: string;
sourceTable: string;
totalRecords: number;
processedRecords: number;
failedRecords: string[];
lastProcessedId?: string;
outputTable: string;
}
const dataIngestionWorkflow = new WorkflowBuilder<ETLContext>('data-ingestion')
.withCheckpointing({
store: new PostgresCheckpointStore(process.env.DATABASE_URL!),
frequency: 'every-step', // Also supports 'every-n-steps' and 'on-demand'
})
.addStep('count-source-records', async (ctx) => {
const count = await sourceDatabase.count(ctx.sourceTable);
return StepResult.success({ ...ctx, totalRecords: count });
})
.addStep('process-records', async (ctx, { checkpoint }) => {
const batchSize = 1000;
let lastId = ctx.lastProcessedId;
let processed = ctx.processedRecords;
const failed: string[] = [...ctx.failedRecords];
while (processed < ctx.totalRecords) {
const batch = await sourceDatabase.fetchBatch(
ctx.sourceTable,
batchSize,
lastId
);
if (batch.length === 0) break;
for (const record of batch) {
try {
const transformed = await transformRecord(record);
await destinationDatabase.upsert(ctx.outputTable, transformed);
processed++;
} catch (error) {
failed.push(record.id);
// Continue processing other records
}
lastId = record.id;
}
// Save progress checkpoint within the step
await checkpoint.save({
...ctx,
processedRecords: processed,
failedRecords: failed,
lastProcessedId: lastId,
});
}
return StepResult.success({
...ctx,
processedRecords: processed,
failedRecords: failed,
lastProcessedId: lastId,
});
})
.addStep('generate-report', async (ctx) => {
await reportService.create({
jobId: ctx.jobId,
totalRecords: ctx.totalRecords,
processedRecords: ctx.processedRecords,
failedRecords: ctx.failedRecords.length,
failedRecordIds: ctx.failedRecords,
});
return StepResult.success(ctx);
})
.build();

When this workflow resumes after a failure, it does not re-process the records that were already handled. The checkpoint contains the lastProcessedId and processedRecords count, so the workflow picks up exactly where it left off. For a data ingestion job processing millions of records, this is the difference between a five-minute recovery and a five-hour restart.
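Under the hood, a checkpoint store needs only two operations: persist the latest context at a step boundary and load it on resume. The interface shape below is an assumption for illustration, not the documented @alfred/core contract; an in-memory implementation like this is handy for tests, while PostgresCheckpointStore would presumably persist the same data to a table:

```typescript
interface SimpleCheckpointStore<C> {
  save(instanceId: string, stepName: string, context: C): Promise<void>;
  load(instanceId: string): Promise<{ stepName: string; context: C } | null>;
}

// In-memory sketch. Only the most recent checkpoint per workflow instance
// is kept: resume always starts from the last successful step boundary.
class InMemoryCheckpointStore<C> implements SimpleCheckpointStore<C> {
  private checkpoints = new Map<string, { stepName: string; context: C }>();

  async save(instanceId: string, stepName: string, context: C): Promise<void> {
    this.checkpoints.set(instanceId, { stepName, context });
  }

  async load(instanceId: string): Promise<{ stepName: string; context: C } | null> {
    return this.checkpoints.get(instanceId) ?? null;
  }
}
```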
Human-in-the-Loop Recovery
Some errors cannot be resolved by software alone. A customer provided an address that does not validate against any known format. A payment was declined but the customer insists they have sufficient funds. A document is unreadable and cannot be processed by OCR. These situations require human judgment.
Alfred's human-in-the-loop recovery creates a structured interaction between the automated workflow and human operators. The workflow pauses, creates a task for a human, and resumes when the human provides a resolution.
import { WorkflowBuilder, HumanTaskQueue, PostgresTaskStore, StepResult } from '@alfred/core';
interface ApplicationContext {
applicationId: string;
applicantName: string;
applicantEmail: string;
documents: Array<{ id: string; type: string; url: string }>;
extractedData?: Record<string, unknown>;
humanCorrections?: Record<string, unknown>;
verificationStatus?: 'verified' | 'rejected';
needsHumanVerification?: boolean;
lowConfidenceFields?: string[];
escalated?: boolean;
verifiedBy?: string;
}
const humanTaskQueue = new HumanTaskQueue({
store: new PostgresTaskStore(process.env.DATABASE_URL!),
notificationChannel: 'slack',
slackWebhook: process.env.SLACK_WEBHOOK_URL,
});
const applicationProcessing = new WorkflowBuilder<ApplicationContext>('application-review')
.addStep('extract-document-data', async (ctx) => {
try {
const extracted = await ocrService.extract(ctx.documents);
const confidence = calculateOverallConfidence(extracted);
if (confidence >= 0.95) {
return StepResult.success({ ...ctx, extractedData: extracted.data });
}
// Low confidence: route to human for verification
return StepResult.success({
...ctx,
extractedData: extracted.data,
needsHumanVerification: true,
lowConfidenceFields: extracted.lowConfidenceFields,
});
} catch (error) {
if (error instanceof UnreadableDocumentError) {
// Cannot be retried, needs human intervention
return StepResult.escalate(ctx, {
reason: 'Document unreadable by OCR',
taskType: 'manual-data-entry',
});
}
throw error; // Other errors follow normal retry policy
}
})
.addStep('human-verification', async (ctx, { waitForEvent }) => {
if (!ctx.needsHumanVerification && !ctx.escalated) {
return StepResult.success(ctx); // Skip if not needed
}
const task = await humanTaskQueue.create({
type: ctx.escalated ? 'manual-data-entry' : 'data-verification',
title: `Review application ${ctx.applicationId}`,
description: ctx.escalated
? `OCR could not read documents for ${ctx.applicantName}. Please enter data manually.`
: `Low confidence OCR results for ${ctx.applicantName}. Please verify highlighted fields.`,
data: {
applicationId: ctx.applicationId,
documents: ctx.documents,
extractedData: ctx.extractedData,
lowConfidenceFields: ctx.lowConfidenceFields,
},
priority: ctx.escalated ? 'high' : 'normal',
assignTo: { team: 'document-review' },
sla: { resolution: '4h', response: '30m' },
});
// Wait for human to complete the task
const resolution = await waitForEvent(`task.${task.id}.completed`, {
timeout: '24h',
onTimeout: async () => {
await humanTaskQueue.escalate(task.id, {
reason: 'SLA breach: task not completed within 24 hours',
escalateTo: { role: 'team-lead' },
});
},
});
return StepResult.success({
...ctx,
extractedData: { ...ctx.extractedData, ...resolution.corrections },
humanCorrections: resolution.corrections,
verifiedBy: resolution.operatorId,
});
})
.addStep('process-application', async (ctx) => {
// Continue with verified data
await applicationService.process(ctx.applicationId, ctx.extractedData!);
return StepResult.success(ctx);
})
.build();

The human task includes SLA enforcement. If the task is not resolved within the specified time, it automatically escalates to a team lead. This prevents workflows from being stuck indefinitely waiting for human action.
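Duration strings like '30m' and '4h' in the SLA configuration have to become concrete deadlines somewhere. A minimal sketch of that conversion, assuming a simple number-plus-unit format; parseDuration and slaDeadline are hypothetical helpers, not Alfred APIs:

```typescript
// Parse a duration string such as "30m", "4h", or "24h" into milliseconds.
// Supports seconds, minutes, hours, and days; throws on anything else.
function parseDuration(duration: string): number {
  const match = /^(\d+)(s|m|h|d)$/.exec(duration);
  if (!match) throw new Error(`Unsupported duration: ${duration}`);
  const unitMs = { s: 1000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return Number(match[1]) * unitMs[match[2] as keyof typeof unitMs];
}

// The moment at which an unresolved task breaches its SLA.
function slaDeadline(createdAt: Date, resolution: string): Date {
  return new Date(createdAt.getTime() + parseDuration(resolution));
}
```

An escalation scheduler would compare slaDeadline against the clock and fire the escalation handler when the deadline passes without a completion event.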
Graceful Degradation
Sometimes the right response to an error is not to fix it and continue, but to provide a reduced level of service. Graceful degradation allows a workflow to complete with partial results rather than failing entirely.
import { WorkflowBuilder, DegradationPolicy, StepResult } from '@alfred/core';
interface EnrichmentContext {
customerId: string;
customerEmail: string;
basicProfile: CustomerProfile;
creditScore?: number;
socialMediaPresence?: SocialMediaData;
companyInfo?: CompanyInfo;
enrichmentLevel: 'full' | 'partial' | 'basic';
skippedEnrichments: string[];
}
const customerEnrichment = new WorkflowBuilder<EnrichmentContext>('customer-enrichment')
.addStep('fetch-basic-profile', async (ctx) => {
// This step is required; failure here should stop the workflow
const profile = await customerService.getProfile(ctx.customerId);
return StepResult.success({ ...ctx, basicProfile: profile, enrichmentLevel: 'full' });
})
.addStep('fetch-credit-score', async (ctx) => {
try {
const score = await creditBureau.getScore(ctx.customerId);
return StepResult.success({ ...ctx, creditScore: score });
} catch (error) {
// Degraded: continue without credit score
return StepResult.success({
...ctx,
enrichmentLevel: 'partial',
skippedEnrichments: [...ctx.skippedEnrichments, 'credit-score'],
});
}
})
.addStep('fetch-social-media', async (ctx) => {
try {
const social = await socialMediaService.lookup(ctx.customerEmail);
return StepResult.success({ ...ctx, socialMediaPresence: social });
} catch (error) {
return StepResult.success({
...ctx,
enrichmentLevel: ctx.enrichmentLevel === 'full' ? 'partial' : ctx.enrichmentLevel,
skippedEnrichments: [...ctx.skippedEnrichments, 'social-media'],
});
}
})
.addStep('fetch-company-info', async (ctx) => {
try {
const company = await companyDataService.lookup(ctx.customerEmail);
return StepResult.success({ ...ctx, companyInfo: company });
} catch (error) {
return StepResult.success({
...ctx,
enrichmentLevel: ctx.enrichmentLevel === 'full' ? 'partial' : ctx.enrichmentLevel,
skippedEnrichments: [...ctx.skippedEnrichments, 'company-info'],
});
}
})
.addStep('finalize-profile', async (ctx) => {
const profile = await customerService.updateEnrichedProfile(ctx.customerId, {
creditScore: ctx.creditScore,
socialMedia: ctx.socialMediaPresence,
companyInfo: ctx.companyInfo,
enrichmentLevel: ctx.enrichmentLevel,
skippedEnrichments: ctx.skippedEnrichments,
});
if (ctx.skippedEnrichments.length > 0) {
// Schedule a retry for the failed enrichments
await scheduler.schedule('customer-enrichment-retry', {
customerId: ctx.customerId,
enrichmentsToRetry: ctx.skippedEnrichments,
}, { delay: '1h' });
}
return StepResult.success(ctx);
})
.build();

The workflow completes even if some enrichment sources are unavailable. The customer profile is created with whatever data could be gathered, the enrichment level is tracked so downstream consumers know what data is available, and a background job is scheduled to fill in the gaps later.
This pattern is particularly important for customer-facing workflows. A customer would rather see their profile with partial data than wait indefinitely for all data sources to become available. The partial data can be backfilled transparently when the failing services recover.
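The level bookkeeping in the workflow above repeats the same ternary in every step; it can instead be derived once from the skipped list. A small helper, assuming the three optional enrichments shown in the workflow:

```typescript
type EnrichmentLevel = "full" | "partial" | "basic";

// The optional enrichment sources from the workflow above.
const OPTIONAL_ENRICHMENTS = ["credit-score", "social-media", "company-info"];

// Derive the level from which optional enrichments were skipped:
// none skipped -> full, some skipped -> partial, all skipped -> basic.
function enrichmentLevel(skipped: string[]): EnrichmentLevel {
  const skippedOptional = OPTIONAL_ENRICHMENTS.filter((e) => skipped.includes(e));
  if (skippedOptional.length === 0) return "full";
  if (skippedOptional.length === OPTIONAL_ENRICHMENTS.length) return "basic";
  return "partial";
}
```

Computing the level in the finalize step from skippedEnrichments keeps the degradation logic in one place, so adding a fourth enrichment source does not require touching the other steps.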
Dead Letter Queues and Manual Resolution
When all automated recovery mechanisms are exhausted, the workflow enters the dead letter queue. This is the last line of defense: a durable store of failed workflows and steps that require human investigation and resolution.
import { DeadLetterQueue, DeadLetterEntry, InvestigationReport, PostgresStore, ResolutionAction, ResolutionWorkbench, SuggestedAction } from '@alfred/recovery';
const deadLetterQueue = new DeadLetterQueue({
store: new PostgresStore(process.env.DATABASE_URL!),
retentionPeriod: '90d',
alertOnEntry: true,
alertChannels: ['pagerduty', 'slack'],
});
// The resolution workbench provides tools for operators to investigate and resolve
const workbench = new ResolutionWorkbench({
deadLetterQueue,
workflowEngine,
});
// Operator workflow for resolving dead letter entries
class DeadLetterResolutionService {
async investigate(entryId: string): Promise<InvestigationReport> {
const entry = await deadLetterQueue.get(entryId);
return {
entry,
workflowHistory: await workflowEngine.getHistory(entry.workflowInstanceId),
relatedLogs: await logService.search({
correlationId: entry.correlationId,
timeRange: {
start: entry.originalTimestamp,
end: entry.failedTimestamp,
},
}),
relatedTraces: await traceService.search({
traceId: entry.traceId,
}),
suggestedActions: await this.suggestResolution(entry),
};
}
private async suggestResolution(entry: DeadLetterEntry): Promise<SuggestedAction[]> {
const suggestions: SuggestedAction[] = [];
// Analyze the error and suggest resolution actions
if (entry.errorCategory === 'data') {
suggestions.push({
action: 'fix-data',
description: `Fix the data issue in field "${entry.errorDetails.field}" and retry`,
steps: [
`Review the current value: ${entry.errorDetails.currentValue}`,
`Update to a valid value using the correction API`,
`Retry the workflow from step "${entry.failedStep}"`,
],
});
}
if (entry.retryCount >= entry.maxRetries) {
suggestions.push({
action: 'manual-retry',
description: 'All automated retries exhausted. Verify the external service is healthy before retrying.',
steps: [
`Check service health: ${entry.errorDetails.serviceUrl}`,
`If healthy, retry from step "${entry.failedStep}"`,
`If unhealthy, wait for service recovery and then retry`,
],
});
}
return suggestions;
}
async resolve(entryId: string, action: ResolutionAction): Promise<void> {
switch (action.type) {
case 'retry-from-step':
await workbench.retryFromStep(entryId, action.stepName);
break;
case 'retry-from-beginning':
await workbench.retryFromBeginning(entryId);
break;
case 'skip-step':
await workbench.skipStep(entryId, action.stepName, action.defaultContext);
break;
case 'manually-complete':
await workbench.markComplete(entryId, action.result, action.notes);
break;
case 'abandon':
await workbench.abandon(entryId, action.reason);
break;
}
}
}

The resolution workbench gives operators multiple options: retry from the failed step, retry from the beginning, skip the failed step with a manually provided context, mark the workflow as manually completed, or abandon it entirely. Each action is logged with the operator's identity and justification for audit purposes.
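The audit trail mentioned above can be as simple as one structured record per action, with the justification made mandatory. The field names here are illustrative, not a documented Alfred type:

```typescript
interface ResolutionAuditEntry {
  entryId: string;        // Dead letter entry being resolved
  action: string;         // e.g. "retry-from-step", "abandon"
  operatorId: string;     // Who performed the resolution
  justification: string;  // Why: required, never empty
  timestamp: string;      // ISO 8601
}

// Build an audit record, refusing actions that carry no justification.
function buildAuditEntry(
  entryId: string,
  action: string,
  operatorId: string,
  justification: string,
  now: Date = new Date()
): ResolutionAuditEntry {
  if (!justification.trim()) {
    throw new Error("A justification is required for every resolution action");
  }
  return { entryId, action, operatorId, justification, timestamp: now.toISOString() };
}
```

Rejecting empty justifications at construction time is a deliberate choice: an audit log full of blank reasons is barely better than no audit log at all.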
Building a Recovery Culture
Error recovery is not just a technical concern. It is an organizational practice that requires clear ownership, defined processes, and regular attention.
Assign clear ownership for each workflow's error recovery. Someone should be responsible for monitoring the dead letter queue, investigating failures, and driving resolution. Rotate this responsibility to prevent burnout and spread knowledge across the team.
Conduct regular review sessions for recurring errors. If the same step fails with the same error pattern every week, the right response is not to keep recovering from it. It is to fix the underlying issue. Use your error metrics to identify these patterns and drive root cause resolution.
Document your recovery procedures. When a new team member is on call at two in the morning and a critical workflow has entered the dead letter queue, they need a clear, step-by-step guide for investigating and resolving the issue. Runbooks should be linked directly from alert notifications.
Practice recovery. Run game days where you deliberately inject failures into your workflows and practice the recovery process. This builds muscle memory and reveals gaps in your procedures before a real incident exposes them.
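For game days, failure injection can be as simple as wrapping a step handler so it throws with a configured probability. This is a sketch, not production chaos tooling; in a real system the probability would be gated behind an environment flag so injection can never reach production by accident:

```typescript
class InjectedFailureError extends Error {}

// Wrap an async step handler so it fails with the given probability.
// Probability 0 disables injection; 1 makes every invocation fail.
function withFailureInjection<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  probability: number,
  random: () => number = Math.random
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    if (random() < probability) {
      throw new InjectedFailureError("Injected failure for game-day exercise");
    }
    return fn(...args);
  };
}
```

Injecting a distinct error class is useful: the error classifier from the first section can route InjectedFailureError however the exercise requires, letting you rehearse the transient, persistent, and dead-letter paths on demand.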
Practical Tips
Set up alerts with appropriate urgency levels. Not every dead letter entry is a critical alert. A failed customer enrichment can wait until morning. A failed payment processing workflow needs immediate attention. Configure alert severity based on the business impact of the affected workflow.
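One way to encode this rule is a per-workflow severity map consulted when a dead letter entry is created. The workflow names and severity levels below are illustrative, not part of Alfred:

```typescript
type AlertSeverity = "page-now" | "business-hours" | "daily-digest";

// Severity reflects business impact, not technical error type.
const workflowSeverity: Record<string, AlertSeverity> = {
  "payment-processing": "page-now",       // Revenue at risk: wake someone up
  "order-fulfillment": "business-hours",  // Important, but can wait for morning
  "customer-enrichment": "daily-digest",  // Backfills transparently later
};

function alertSeverity(workflowName: string): AlertSeverity {
  // Unregistered workflows default to business-hours rather than paging;
  // tune this default to your tolerance for missed alerts.
  return workflowSeverity[workflowName] ?? "business-hours";
}
```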
Include business context in error reports. A technical error message like "HTTP 503 from service-xyz" is less useful than "Failed to process payment of $1,234.56 for order ORD-789 because the payment gateway is unavailable." The latter tells the operator what is at stake and what they might need to do.
Track your mean time to recovery (MTTR) for workflow failures. This metric tells you how effective your recovery processes are. A high MTTR suggests that your operators lack the tools or information they need to resolve issues quickly.
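MTTR falls directly out of the dead letter data: the mean of resolvedAt minus failedAt over resolved entries. The field names below are assumptions about the entry schema:

```typescript
interface ResolvedEntry {
  failedAt: number;   // Epoch milliseconds when the workflow entered the DLQ
  resolvedAt: number; // Epoch milliseconds when an operator resolved it
}

// Mean time to recovery in milliseconds; null when nothing has been resolved
// yet, so callers cannot mistake "no data" for "instant recovery".
function meanTimeToRecovery(entries: ResolvedEntry[]): number | null {
  if (entries.length === 0) return null;
  const total = entries.reduce((sum, e) => sum + (e.resolvedAt - e.failedAt), 0);
  return total / entries.length;
}
```

Segmenting this metric per workflow is usually more actionable than one global number: a rising MTTR on a single workflow points at a specific runbook or tooling gap.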
Never silently swallow errors. Even if you choose graceful degradation, log the error and track it in your metrics. Silent failures accumulate and eventually produce business outcomes that nobody can explain.
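A thin wrapper makes the "degrade, but never silently" rule hard to forget. The metrics and logger parameters are stand-ins for whatever your observability stack provides:

```typescript
// Run an optional enrichment; on failure, return null instead of throwing.
// The failure is never invisible: it is counted in metrics and logged.
async function degradeGracefully<T>(
  name: string,
  fn: () => Promise<T>,
  metrics: { increment: (metric: string) => void },
  logger: { warn: (msg: string, err: unknown) => void }
): Promise<T | null> {
  try {
    return await fn();
  } catch (error) {
    metrics.increment(`enrichment.skipped.${name}`);
    logger.warn(`Enrichment "${name}" degraded`, error);
    return null;
  }
}
```

The enrichment steps in the graceful degradation section could each delegate to a helper like this, guaranteeing that every skipped enrichment leaves a trace in the metrics.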
Conclusion
Error recovery is what separates a production-grade workflow engine from a prototype. Alfred provides a layered recovery architecture, from automatic retries for transient failures to human-in-the-loop resolution for data errors, from graceful degradation for non-critical enrichments to dead letter queues for truly stuck workflows.
But the technical mechanisms are only half the story. Effective error recovery requires organizational practices: clear ownership, documented procedures, regular reviews, and a culture that treats errors as opportunities to improve system resilience. When automated recovery handles the routine failures and human operators efficiently resolve the exceptions, your business processes run reliably even in a world where individual components are inherently unreliable.