Observability for Long-Running Workflows
How to instrument, monitor, and debug long-running distributed workflows using structured logging, distributed tracing, and custom metrics in TypeScript.
A workflow that takes ten milliseconds to complete is easy to debug. When it fails, you look at the error message, check the input, and fix the bug. A workflow that takes ten days to complete, spanning dozens of steps across multiple services with human approval gates and external API calls, is a different beast entirely. When it fails, the error might be a consequence of something that happened three days ago. The relevant logs are scattered across half a dozen services. The context that explains the failure has been serialized, deserialized, and transformed multiple times.
Observability is what makes long-running workflows manageable. It is the combination of structured logging, distributed tracing, and metrics that lets you answer three questions at any point: what is happening right now, what happened in the past, and what is likely to happen next. Alfred builds observability into the workflow engine itself, so you get deep visibility without instrumenting every step by hand.
Structured Logging for Workflows
Traditional log lines like "Processing order" are nearly useless for debugging long-running workflows. You need structured logs that carry context: which workflow instance, which step, which attempt, what the input was, and what the output was.
import { WorkflowBuilder, StepResult } from '@alfred/core';
import type { WorkflowLogger } from '@alfred/observability';
// Alfred automatically injects a contextual logger into every step
const orderFulfillment = new WorkflowBuilder<OrderContext>('order-fulfillment')
.addStep('process-payment', async (ctx, { logger }) => {
logger.info('Initiating payment processing', {
orderId: ctx.orderId,
amount: ctx.totalAmount,
paymentMethod: ctx.paymentMethod,
});
try {
const result = await paymentService.charge(ctx.orderId, ctx.totalAmount);
logger.info('Payment processed successfully', {
transactionId: result.transactionId,
processingTime: result.processingTimeMs,
});
return StepResult.success({ ...ctx, paymentId: result.transactionId });
} catch (error) {
logger.error('Payment processing failed', {
error: error instanceof Error ? error.message : String(error),
errorCode: error instanceof PaymentError ? error.code : undefined,
willRetry: error instanceof TransientError,
});
throw error;
}
})
  .build();

Every log entry produced by Alfred's logger automatically includes the workflow ID, step name, attempt number, correlation ID, and timestamp. This context is injected not by the step author but by the framework, ensuring consistency across all steps.
// What the structured log output looks like
{
"timestamp": "2025-04-16T14:23:45.123Z",
"level": "info",
"message": "Payment processed successfully",
"workflow": {
"id": "wf-abc123",
"name": "order-fulfillment",
"instanceId": "inst-def456",
"correlationId": "order-ORD-789"
},
"step": {
"name": "process-payment",
"attempt": 1,
"startedAt": "2025-04-16T14:23:44.891Z"
},
"data": {
"transactionId": "txn-ghi012",
"processingTime": 232
}
}

This structure lets you query your log aggregator effectively. Need all logs for a specific workflow instance? Filter by workflow.instanceId. Want to see all payment failures across all workflows? Filter by step.name and level. Need to trace a customer's order through the entire system? Filter by workflow.correlationId.
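As a concrete illustration of those queries, here is a hypothetical helper (not part of Alfred) that applies the same filters to a batch of parsed log entries; the WorkflowLogEntry shape simply mirrors the JSON output above:

```typescript
// Hypothetical type mirroring the structured log output shown above
interface WorkflowLogEntry {
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error';
  message: string;
  workflow: { id: string; name: string; instanceId: string; correlationId: string };
  step: { name: string; attempt: number; startedAt: string };
  data?: Record<string, unknown>;
}

// All logs for one workflow instance
const byInstance = (logs: WorkflowLogEntry[], instanceId: string) =>
  logs.filter((l) => l.workflow.instanceId === instanceId);

// All payment failures across all workflows
const paymentFailures = (logs: WorkflowLogEntry[]) =>
  logs.filter((l) => l.step.name === 'process-payment' && l.level === 'error');

// A customer's order traced through the entire system
const byCorrelation = (logs: WorkflowLogEntry[], correlationId: string) =>
  logs.filter((l) => l.workflow.correlationId === correlationId);
```

In a real log aggregator these become index queries rather than in-memory filters, but the field paths are the same.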
Distributed Tracing
Structured logs tell you what happened within a single step. Distributed tracing tells you how steps relate to each other and how work flows across service boundaries. Alfred integrates with OpenTelemetry to provide end-to-end traces for every workflow execution.
import { WorkflowBuilder, StepResult } from '@alfred/core';
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';
// Alfred's tracing middleware automatically creates spans for each step
// You can add custom spans within a step for finer-grained tracing
const orderFulfillment = new WorkflowBuilder<OrderContext>('order-fulfillment')
.withTracing({
serviceName: 'alfred-order-service',
exporters: ['jaeger', 'datadog'],
samplingRate: 1.0, // Sample all workflow traces
propagation: ['w3c-tracecontext', 'baggage'],
})
.addStep('process-payment', async (ctx, { logger, tracer }) => {
// The step already has a span. Add custom attributes.
tracer.setAttributes({
'order.id': ctx.orderId,
'order.amount': ctx.totalAmount,
'payment.method': ctx.paymentMethod,
});
// Create a child span for the external API call
return tracer.withSpan('payment-gateway-call', { kind: SpanKind.CLIENT }, async (span) => {
span.setAttributes({
'http.method': 'POST',
'http.url': 'https://api.payment-provider.com/v1/charges',
});
try {
const result = await paymentService.charge(ctx.orderId, ctx.totalAmount);
span.setStatus({ code: SpanStatusCode.OK });
return StepResult.success({ ...ctx, paymentId: result.transactionId });
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error instanceof Error ? error.message : 'Unknown error',
});
span.recordException(error as Error);
throw error;
}
});
})
  .build();

The trace hierarchy for a workflow execution looks like this:
workflow: order-fulfillment (inst-def456)
|-- step: validate-inventory (attempt 1)
| |-- db: SELECT FROM inventory
|-- step: reserve-inventory (attempt 1)
| |-- db: UPDATE inventory
|-- step: process-payment (attempt 1) [FAILED]
| |-- http: POST payment-gateway [503 Service Unavailable]
|-- step: process-payment (attempt 2) [SUCCESS]
| |-- http: POST payment-gateway [200 OK]
|-- step: create-shipment (attempt 1)
|-- http: POST shipping-api [201 Created]
This trace tells a complete story. You can see that the payment step failed on the first attempt with a 503 error, was retried, and succeeded on the second attempt. You can see the latency of each step and each external call. And because trace context is propagated across service boundaries, you can follow the trace into the payment service and shipping service to see what happened on their side.
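The cross-boundary propagation mentioned above rests on the W3C Trace Context standard, which Alfred's configuration enables via the 'w3c-tracecontext' propagator. As a rough sketch of what travels on the wire (independent of Alfred's internals), the traceparent header encodes the trace ID, parent span ID, and sampling flag:

```typescript
// W3C traceparent format: version-traceId-spanId-flags (lowercase hex)
interface TraceContext {
  traceId: string; // 16 bytes as 32 hex chars
  spanId: string;  // 8 bytes as 16 hex chars
  sampled: boolean;
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 0x01) === 1 };
}
```

In practice the OpenTelemetry SDK injects and extracts this header for you; the sketch just shows why a downstream service can attach its spans to the same trace.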
Custom Metrics for Workflow Health
Logs and traces are essential for debugging individual workflow instances. Metrics give you the aggregate view: how is the system performing overall? Alfred exposes a comprehensive set of workflow metrics and allows you to define custom ones.
import { WorkflowMetrics, MetricsRegistry } from '@alfred/observability';
const metrics = new MetricsRegistry({
prefix: 'alfred',
defaultLabels: {
service: 'order-service',
environment: process.env.NODE_ENV ?? 'development',
},
});
// Alfred's built-in metrics (automatically collected)
// alfred_workflow_started_total{workflow="order-fulfillment"}
// alfred_workflow_completed_total{workflow="order-fulfillment", status="success|failed|compensated"}
// alfred_workflow_duration_seconds{workflow="order-fulfillment", status="success|failed"}
// alfred_step_duration_seconds{workflow="order-fulfillment", step="process-payment"}
// alfred_step_retries_total{workflow="order-fulfillment", step="process-payment"}
// alfred_step_failure_total{workflow="order-fulfillment", step="process-payment", error_class="transient|permanent"}
// alfred_active_workflows{workflow="order-fulfillment"}
// alfred_waiting_workflows{workflow="order-fulfillment", wait_reason="human-approval|external-event"}
// Custom business metrics
const orderMetrics = {
orderValue: metrics.histogram(
'order_value_dollars',
'Distribution of order values',
{ buckets: [10, 50, 100, 250, 500, 1000, 5000] }
),
paymentRetries: metrics.counter(
'payment_retries_total',
'Number of payment retry attempts',
{ labels: ['payment_method', 'error_type'] }
),
fulfillmentLeadTime: metrics.histogram(
'fulfillment_lead_time_seconds',
'Time from order submission to shipment',
{ buckets: [3600, 7200, 14400, 28800, 43200, 86400] }
),
};
// Instrumenting a workflow step with custom metrics
const processPayment = async (ctx: OrderContext, { logger, tracer }: StepUtils) => {
orderMetrics.orderValue.observe(ctx.totalAmount);
try {
const result = await paymentService.charge(ctx.orderId, ctx.totalAmount);
return StepResult.success({ ...ctx, paymentId: result.transactionId });
} catch (error) {
orderMetrics.paymentRetries.inc({
payment_method: ctx.paymentMethod,
error_type: error instanceof TransientError ? 'transient' : 'permanent',
});
throw error;
}
};

From these metrics, you can build dashboards that answer critical operational questions. What is the current throughput of workflow completions? What is the p95 latency for each workflow type? Which steps have the highest failure rates? How many workflows are currently waiting for external events? Are retry rates increasing, suggesting a downstream service is degrading?
Workflow Dashboards and Alerting
Raw metrics and logs are the foundation. Dashboards and alerts turn them into actionable operational intelligence.
import { AlertRule, AlertSeverity, DashboardBuilder } from '@alfred/observability';
// Define alert rules based on workflow metrics
const alertRules: AlertRule[] = [
{
name: 'high-workflow-failure-rate',
expression: 'rate(alfred_workflow_completed_total{status="failed"}[5m]) / rate(alfred_workflow_completed_total[5m]) > 0.05',
duration: '5m',
severity: AlertSeverity.WARNING,
summary: 'Workflow failure rate exceeds 5%',
description: 'The {{ $labels.workflow }} workflow has a failure rate of {{ $value | humanizePercentage }} over the last 5 minutes.',
runbook: 'https://wiki.internal/runbooks/workflow-failure-rate',
},
{
name: 'compensation-queue-growing',
expression: 'alfred_compensation_dead_letter_pending > 0',
duration: '1m',
severity: AlertSeverity.CRITICAL,
summary: 'Unresolved compensation failures detected',
description: '{{ $value }} compensation failures are pending manual resolution.',
runbook: 'https://wiki.internal/runbooks/compensation-failure',
},
{
name: 'workflow-stuck',
    expression: 'histogram_quantile(0.99, sum by (workflow, step, le) (rate(alfred_step_duration_seconds_bucket[5m]))) > 300',
duration: '10m',
severity: AlertSeverity.WARNING,
summary: 'Workflow step taking unusually long',
description: 'The {{ $labels.step }} step in {{ $labels.workflow }} has a p99 duration exceeding 5 minutes.',
},
{
name: 'active-workflows-spike',
expression: 'alfred_active_workflows > 10000',
duration: '5m',
severity: AlertSeverity.WARNING,
summary: 'Unusually high number of active workflows',
description: '{{ $value }} active {{ $labels.workflow }} workflows. Normal range is 1000-5000.',
},
];
// Programmatic dashboard definition
const workflowDashboard = new DashboardBuilder('alfred-workflow-health')
.addRow('Overview', [
{ type: 'stat', title: 'Active Workflows', query: 'sum(alfred_active_workflows)' },
{ type: 'stat', title: 'Completions/min', query: 'sum(rate(alfred_workflow_completed_total{status="success"}[5m])) * 60' },
{ type: 'stat', title: 'Failure Rate', query: 'sum(rate(alfred_workflow_completed_total{status="failed"}[5m])) / sum(rate(alfred_workflow_completed_total[5m]))' },
{ type: 'stat', title: 'Pending Compensations', query: 'sum(alfred_compensation_dead_letter_pending)' },
])
.addRow('Throughput', [
{ type: 'timeseries', title: 'Workflow Completions', query: 'sum by (workflow, status) (rate(alfred_workflow_completed_total[5m])) * 60' },
{ type: 'timeseries', title: 'Step Retries', query: 'sum by (workflow, step) (rate(alfred_step_retries_total[5m])) * 60' },
])
.addRow('Latency', [
{ type: 'heatmap', title: 'Workflow Duration', query: 'sum by (le) (rate(alfred_workflow_duration_seconds_bucket[5m]))' },
{ type: 'timeseries', title: 'Step Duration p95', query: 'histogram_quantile(0.95, sum by (step, le) (rate(alfred_step_duration_seconds_bucket[5m])))' },
])
  .build();

Workflow Instance Inspector
Beyond aggregate dashboards, Alfred provides a workflow instance inspector that lets you drill into a single workflow execution and understand exactly what happened.
import { WorkflowInspector } from '@alfred/observability';
const inspector = new WorkflowInspector({
store: workflowStore,
logProvider: elasticsearchClient,
traceProvider: jaegerClient,
});
// Get the complete execution history of a workflow instance
const history = await inspector.getHistory('inst-def456');
// Returns a structured timeline:
// {
// workflowId: 'wf-abc123',
// instanceId: 'inst-def456',
// correlationId: 'order-ORD-789',
// startedAt: '2025-04-16T14:23:44.000Z',
// currentState: 'completed',
// steps: [
// {
// name: 'validate-inventory',
// status: 'completed',
// attempt: 1,
// startedAt: '2025-04-16T14:23:44.100Z',
// completedAt: '2025-04-16T14:23:44.350Z',
// duration: 250,
// inputContext: { ... },
// outputContext: { ... },
// },
// {
// name: 'process-payment',
// status: 'completed',
// attempt: 2, // Had to retry
// attempts: [
// { attempt: 1, status: 'failed', error: 'Gateway timeout', duration: 5002 },
// { attempt: 2, status: 'completed', duration: 232 },
// ],
// startedAt: '2025-04-16T14:23:44.350Z',
// completedAt: '2025-04-16T14:23:50.584Z',
// duration: 6234,
// inputContext: { ... },
// outputContext: { ... },
// },
// ],
// traceId: 'abc123def456',
// traceUrl: 'https://jaeger.internal/trace/abc123def456',
// }
// Search for workflow instances matching criteria
const stuckWorkflows = await inspector.search({
workflow: 'order-fulfillment',
state: 'in-progress',
startedBefore: new Date(Date.now() - 3600000), // More than 1 hour ago
currentStep: 'process-payment',
});

The inspector combines data from the workflow store, log aggregator, and tracing backend to provide a unified view of each workflow instance. This is the tool you reach for when a customer reports that their order is stuck and you need to find out why.
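When triaging a history like the one above, a quick summary often answers the question before you open the trace. This hypothetical post-processing helper (not an Alfred API) works over a simplified version of the step records the inspector returns:

```typescript
// Simplified step record, matching fields from the inspector's timeline above
interface StepRecord {
  name: string;
  status: 'completed' | 'failed' | 'in-progress';
  attempt: number;    // final attempt number
  duration: number;   // total milliseconds across attempts
}

// Identify where the time went and which steps needed retries
function summarize(steps: StepRecord[]) {
  const slowest = steps.reduce((a, b) => (b.duration > a.duration ? b : a));
  const retried = steps.filter((s) => s.attempt > 1).map((s) => s.name);
  const totalMs = steps.reduce((sum, s) => sum + s.duration, 0);
  return { slowest: slowest.name, retried, totalMs };
}
```

For the example instance above, this immediately points at process-payment as both the slowest and the retried step.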
Practical Tips
Invest in correlation IDs from day one. Every workflow should have a business-meaningful correlation ID, such as an order ID, customer ID, or request ID, that threads through every log entry, trace span, and metric label. Without correlation, connecting the dots across services is painfully manual.
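One practical way to thread a correlation ID through async code without passing it as a parameter everywhere is Node's AsyncLocalStorage. This is a minimal sketch, not Alfred's mechanism, but the framework's automatic context injection could be built on the same primitive:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Holds the correlation ID for the duration of an async call chain
const correlationStore = new AsyncLocalStorage<string>();

// Run a unit of work with a business-meaningful correlation ID in scope
function withCorrelationId<T>(id: string, fn: () => T): T {
  return correlationStore.run(id, fn);
}

// Any logger, span attribute, or metric label can pull the ID
// without it being threaded through every function signature
function currentCorrelationId(): string {
  return correlationStore.getStore() ?? 'uncorrelated';
}
```

A logger wrapper that calls currentCorrelationId() on every write then stamps the ID onto all entries emitted anywhere inside the withCorrelationId scope.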
Sample traces judiciously. In high-throughput systems, tracing every workflow execution is expensive. Use head-based or tail-based sampling to capture a representative subset. Alfred supports tail-based sampling that preferentially captures traces for failed or slow workflow instances, which are the ones you actually want to debug.
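The decision rule behind tail-based sampling can be sketched in a few lines. This is an illustrative function, not Alfred's implementation; the thresholds and baseline rate are assumptions you would tune:

```typescript
interface CompletedTrace {
  status: 'success' | 'failed' | 'compensated';
  durationMs: number;
}

// Tail-based sampling decides AFTER the workflow finishes: keep every
// failure and every slow execution, plus a small random baseline of
// healthy traces for comparison.
function shouldKeepTrace(
  t: CompletedTrace,
  slowThresholdMs = 30_000,
  baselineRate = 0.01,
  rand: () => number = Math.random,
): boolean {
  if (t.status !== 'success') return true;          // failures and compensations: always keep
  if (t.durationMs > slowThresholdMs) return true;  // slow outliers: always keep
  return rand() < baselineRate;                     // sample the healthy majority
}
```

The catch, and the reason tail-based sampling needs collector support, is that every span of a trace must be buffered until the decision can be made.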
Set up alerts on compensation failures before anything else. A failed workflow is a problem. A failed compensation is a crisis. Make sure your team is notified immediately when a compensation cannot complete.
Store workflow execution history for at least as long as your compliance requirements demand, and preferably longer. The ability to reconstruct exactly what happened in a workflow execution six months ago is invaluable for root cause analysis and auditing.
Monitor the age distribution of active workflows. A sudden increase in old active workflows usually indicates a systemic problem: a downstream service outage, a configuration error, or a bug that causes workflows to hang.
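In metric form this is typically a histogram over workflow age, but the idea reduces to simple bucketing. A hypothetical helper (bucket boundaries are an assumption, not an Alfred convention):

```typescript
// Bucket active workflows by age; a sudden shift toward the older
// buckets is the systemic-problem signal described above.
function ageDistribution(
  startedAt: Date[],
  now: Date = new Date(),
): Record<'<1h' | '1-24h' | '>24h', number> {
  const buckets = { '<1h': 0, '1-24h': 0, '>24h': 0 };
  for (const start of startedAt) {
    const ageHours = (now.getTime() - start.getTime()) / 3_600_000;
    if (ageHours < 1) buckets['<1h']++;
    else if (ageHours < 24) buckets['1-24h']++;
    else buckets['>24h']++;
  }
  return buckets;
}
```

Plotting these counts over time turns "workflows seem stuck" into a visible drift in the distribution.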
Conclusion
Observability transforms long-running workflows from opaque black boxes into transparent, debuggable processes. Alfred's built-in observability stack, comprising structured logging with automatic context injection, distributed tracing with OpenTelemetry integration, and comprehensive workflow metrics, gives you the tools to understand what your workflows are doing at every level of abstraction.
From aggregate dashboards that show system-wide health to the workflow instance inspector that reveals the exact sequence of events for a single execution, observability is what makes it possible to operate complex workflow systems in production with confidence. Build it in from the start, because retrofitting observability onto a running system is far harder than including it from the beginning.
Related Articles
Testing Complex Workflows: Strategies and Tools
A comprehensive guide to testing multi-step distributed workflows, covering unit testing individual steps, integration testing complete flows, chaos testing, and time-travel debugging.
Error Recovery Patterns in Workflow Engines
Explore the error recovery patterns used in production workflow engines, from simple retries to complex human-in-the-loop escalation strategies, with a focus on business continuity.
Business Process Automation: Strategy and Implementation
A strategic guide to automating complex business processes with workflow orchestration, covering process discovery, prioritization, and phased implementation with real-world examples.