Microservice Saga Orchestration

Two flavors of saga — events that trigger each other, or a central coordinator that drives them

A microservice saga is a long-running transaction split into local steps with compensating undos. Choreography uses peer events; orchestration uses a central coordinator like Temporal or Step Functions to drive each step.

  • Two flavors: Choreography vs Orchestration
  • Choreography: Events trigger next step, no controller
  • Orchestration: Central coordinator drives flow
  • Orchestrators: Temporal, Step Functions, Camunda
  • Step Functions: 2.5T state transitions/year (2024)
  • Used in: Uber, Snap, Coinbase, banks (BPMN)

Why sagas exist

Distributed transactions across microservices need a different model than ACID transactions in one database. Two-phase commit holds locks across all participants for the duration of the transaction. Across five services on a shaky network, those locks become a bottleneck and the 2PC coordinator becomes a single point of failure. The saga pattern was introduced by Garcia-Molina & Salem in 1987 and revived for microservices around 2015 to replace 2PC with a sequence of local transactions, each with a compensating undo.
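The core idea (local transactions, each paired with a compensating undo) can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not a production design: the names SagaStep and run_saga and the lambda "services" are hypothetical, and a real saga would persist its progress so that compensation survives a crash.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # the local transaction
    compensate: Callable[[], None]  # the undo for that transaction


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_reserve() -> None:
    # Hypothetical failure used to trigger the compensation path.
    raise RuntimeError("out of stock")


steps = [
    SagaStep("order", lambda: log.append("create_order"),
             lambda: log.append("cancel_order")),
    SagaStep("payment", lambda: log.append("charge_card"),
             lambda: log.append("refund_payment")),
    SagaStep("inventory", fail_reserve,
             lambda: log.append("release_inventory")),
]

ok = run_saga(steps)
print(ok, log)
```

The failed inventory step is never compensated because it never committed; only the two steps that succeeded are undone, newest first.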

This article is about the control model — how the saga's steps get executed in order. Two answers exist: choreography and orchestration. They sit on a spectrum, both are deployed at scale, and you should know the trade-offs before reaching for one.

Choreography — events trigger the next step

In choreography, there is no central controller. Each service publishes a success event when its step completes; the next service subscribes and acts. The global flow lives entirely in event subscriptions on a shared bus (Kafka, RabbitMQ, EventBridge).

E-commerce checkout in choreography across Order, Payment, Inventory, Shipping:

  1. OrderService writes the order locally → publishes OrderCreated.
  2. PaymentService consumes OrderCreated → charges card → publishes PaymentSucceeded.
  3. InventoryService consumes PaymentSucceeded → decrements stock → publishes InventoryReserved.
  4. ShippingService consumes InventoryReserved → books carrier → publishes ShipmentBooked.
  5. OrderService consumes ShipmentBooked → marks the order COMPLETE.

If Inventory fails (out of stock), it publishes InventoryFailed. Payment subscribes to InventoryFailed, runs RefundPayment, and publishes RefundIssued. Order subscribes to RefundIssued and marks the order CANCELLED. The compensation chain is implicit in event subscriptions.
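The failure path above can be traced with a toy in-memory bus. This is a sketch under stated assumptions: EventBus is a hypothetical stand-in for Kafka, RabbitMQ, or EventBridge, the handlers collapse each service into one function, and stock is set to zero to force the InventoryFailed branch.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List


class EventBus:
    """Toy synchronous bus; a real broker is asynchronous and durable."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.queue: deque = deque()
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.queue.append((event, payload))

    def run(self) -> None:
        # Deliver events in publication order until the queue drains.
        while self.queue:
            event, payload = self.queue.popleft()
            self.history.append(event)
            for handler in self.subscribers[event]:
                handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}
stock = {"widget": 0}  # out of stock, to force the failure path


def payment_on_order_created(p: dict) -> None:
    bus.publish("PaymentSucceeded", p)  # card charged


def inventory_on_payment_succeeded(p: dict) -> None:
    if stock.get(p["item"], 0) > 0:
        stock[p["item"]] -= 1
        bus.publish("InventoryReserved", p)
    else:
        bus.publish("InventoryFailed", p)


def payment_on_inventory_failed(p: dict) -> None:
    bus.publish("RefundIssued", p)  # compensation: refund the charge


def order_on_refund_issued(p: dict) -> None:
    orders[p["order_id"]] = "CANCELLED"


bus.subscribe("OrderCreated", payment_on_order_created)
bus.subscribe("PaymentSucceeded", inventory_on_payment_succeeded)
bus.subscribe("InventoryFailed", payment_on_inventory_failed)
bus.subscribe("RefundIssued", order_on_refund_issued)

orders["o-1"] = "PENDING"
bus.publish("OrderCreated", {"order_id": "o-1", "item": "widget"})
bus.run()
print(bus.history, orders)
```

Note that no single function describes the whole flow: the only way to recover it is to read every subscribe call, which is the trade-off the cons below describe.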

Pros: No single point of failure, loose coupling, and new services can be added by subscribing to existing events.

Cons: The global flow exists nowhere in code — you trace it via distributed tracing. Beyond ~4 service hops, choreography becomes hard to reason about; cross-cutting concerns (timeouts, retries, audit) are duplicated in each service.

Orchestration — a central coordinator drives

In orchestration, a central component (the saga orchestrator) maintains the saga's state machine, calls each service in turn (sync RPC or async events), receives the response, and decides what to do next.

Same e-commerce checkout in orchestration:

  1. Orchestrator receives PlaceOrder request.
  2. Calls OrderService.create() → success.
  3. Calls PaymentService.charge() → success.
  4. Calls InventoryService.reserve() → fails.
  5. Orchestrator triggers compensations in reverse: PaymentService.refund(), OrderService.cancel().
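The same sequence can be sketched as a coordinator that owns the state machine and records every transition. The class name CheckoutOrchestrator, the state names, and the stub services are all hypothetical; the point is only that the whole flow, including compensation, lives in one place.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation)
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


class CheckoutOrchestrator:
    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps
        self.state = "PENDING"
        self.transitions: List[str] = []  # doubles as an audit log

    def _move(self, new_state: str) -> None:
        self.transitions.append(f"{self.state}->{new_state}")
        self.state = new_state

    def place_order(self, ctx: dict) -> str:
        done: List[Step] = []
        for step in self.steps:
            name, action, _ = step
            try:
                action(ctx)
            except Exception:
                self._move(f"{name}_FAILED")
                # Undo every committed step, newest first.
                for undo_name, _, compensate in reversed(done):
                    compensate(ctx)
                    self._move(f"{undo_name}_COMPENSATED")
                self._move("CANCELLED")
                return self.state
            done.append(step)
            self._move(f"{name}_DONE")
        self._move("COMPLETE")
        return self.state


def noop(ctx: dict) -> None:
    pass


def reserve_fails(ctx: dict) -> None:
    raise RuntimeError("out of stock")


orchestrator = CheckoutOrchestrator([
    ("order", noop, noop),
    ("payment", noop, noop),
    ("inventory", reserve_fails, noop),
])
final_state = orchestrator.place_order({"order_id": "o-1"})
print(final_state)
print(orchestrator.transitions)
```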

Pros: Explicit state machine (Camunda gives you BPMN diagrams), easy to add timeouts and retries per step, single place for cross-cutting concerns (audit, monitoring), trivially observable.

Cons: Orchestrator is a deployment unit and a scaling concern; services know they're called by an orchestrator (tighter coupling than pure event-driven).

Production orchestrators

  • AWS Step Functions. Managed serverless, JSON-defined state machines. Sagas are built from the Retry and Catch fields on Task states: a Catch routes a failed step to ordinary states that run the compensations (there is no dedicated Compensate state type). ~2.5 trillion state transitions/year customer-wide (2024). Pricing per transition (~$25 per million). Deep AWS integration: any Lambda, ECS task, or service event is a valid step. Best for AWS-native pipelines, ML training, ETL orchestration.
  • Temporal. Open-source successor to Uber's Cadence. Pioneer of "durable execution" — workflow code in Go/Java/Python/TypeScript that automatically survives process crashes via event-history checkpointing. Workflows can run for hours, days, years. Powers Uber, Snap, DoorDash, Coinbase, Datadog — billions of workflow executions/month.
  • Camunda 8 (Zeebe). BPMN-based. The visual diagram IS the deployable artifact. 100,000+ workflow instances/sec on a 5-broker Zeebe cluster. Banks, insurers, telcos use it for regulated workflows (KYC, claims, onboarding) that span days and 15+ steps.
  • Netflix Conductor. Open-source orchestrator built for Netflix's content licensing pipelines (24-72 hour workflows). JSON workflow definitions, language-agnostic workers.
  • Uber Cadence. The predecessor to Temporal; still in use at Uber for legacy workflows.
  • Microsoft Dapr Workflows. Durable-execution actor model, similar in spirit to Temporal, baked into the Dapr sidecar.
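To make the Step Functions entry concrete, here is a sketch of a checkout saga as an Amazon States Language definition, built as a Python dict and serialized to JSON. The state names and Lambda ARNs are placeholders, not real resources. Retry and Catch are fields on Task states, and compensation is expressed by pointing a Catch at ordinary Task states that undo earlier work. The helper undefined_targets is a hypothetical sanity check, not part of any AWS SDK.

```python
import json


def task(function_name: str, **fields) -> dict:
    # Placeholder ARN; REGION and ACCOUNT_ID are not real values.
    arn = f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}"
    return {"Type": "Task", "Resource": arn, **fields}


def catch_all(next_state: str) -> list:
    # Route any error to a compensation state, keeping the error in the payload.
    return [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": next_state}]


state_machine = {
    "Comment": "Checkout saga sketch",
    "StartAt": "CreateOrder",
    "States": {
        "CreateOrder": task("create-order", Next="ChargeCard"),
        "ChargeCard": task(
            "charge-card",
            Retry=[{"ErrorEquals": ["States.Timeout"], "IntervalSeconds": 2,
                    "MaxAttempts": 3, "BackoffRate": 2.0}],
            Catch=catch_all("CancelOrder"),
            Next="ReserveInventory",
        ),
        "ReserveInventory": task("reserve-inventory",
                                 Catch=catch_all("RefundPayment"), End=True),
        # Compensation states, chained in reverse order of the forward path.
        "RefundPayment": task("refund-payment", Next="CancelOrder"),
        "CancelOrder": task("cancel-order", Next="SagaFailed"),
        "SagaFailed": {"Type": "Fail", "Error": "CheckoutFailed",
                       "Cause": "A step failed and was compensated"},
    },
}


def undefined_targets(definition: dict) -> list:
    """Return any Next/Catch targets that are not defined states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


definition_json = json.dumps(state_machine, indent=2)
print(definition_json)
print(undefined_targets(state_machine))
```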

Choreography vs Orchestration — comparison

| Aspect | Choreography | Orchestration |
| --- | --- | --- |
| Control flow | Distributed across event subscriptions | Central coordinator |
| Where flow lives | Nowhere explicit; distributed tracing only | Single state machine (code or BPMN) |
| Coupling | Loose (services share event schema only) | Tighter (services know they're called) |
| Failure modes | No single point; cascade failure possible | Orchestrator outage = workflow stalls |
| Cross-cutting (timeouts, audit) | Duplicated per service | Centralized |
| Observability | Distributed traces required | Built-in (state machine logs) |
| Best for | 2-4 step flows, fully async | 5+ step flows, complex error handling |
| Adding a new step | Subscribe to existing event | Modify workflow definition |
| Regulatory audit | Hard (no explicit flow) | Easy (every transition logged) |
| Throughput | Bound by bus (1M+ msg/sec) | Bound by orchestrator (~100k workflows/sec) |

A concrete worked example: order saga in Temporal

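from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# create_order, charge_card, reserve_inventory, ship_order, refund_payment,
# release_inventory, and cancel_order are @activity.defn functions defined
# elsewhere and imported into this module. release_inventory is an assumed
# name: the original listing had no compensation for the inventory step.

@workflow.defn
class OrderSaga:
    @workflow.run
    async def run(self, order):
        order_id = None
        payment_id = None
        inventory_reserved = False
        try:
            order_id = await workflow.execute_activity(
                create_order, order, start_to_close_timeout=timedelta(seconds=10),
            )
            payment_id = await workflow.execute_activity(
                charge_card,
                args=[order.card, order.total],  # multiple arguments go through args=
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3),
            )
            await workflow.execute_activity(
                reserve_inventory, order.items,
                start_to_close_timeout=timedelta(seconds=10),
            )
            inventory_reserved = True
            await workflow.execute_activity(
                ship_order, order_id,
                start_to_close_timeout=timedelta(seconds=60),
            )
            return {"order_id": order_id, "status": "complete"}

        except Exception:
            # Compensate in reverse order of what succeeded.
            # The 30-second timeout is assumed; the original elided these arguments.
            undo_timeout = timedelta(seconds=30)
            if inventory_reserved:
                await workflow.execute_activity(
                    release_inventory, order.items, start_to_close_timeout=undo_timeout,
                )
            if payment_id:
                await workflow.execute_activity(
                    refund_payment, payment_id, start_to_close_timeout=undo_timeout,
                )
            if order_id:
                await workflow.execute_activity(
                    cancel_order, order_id, start_to_close_timeout=undo_timeout,
                )
            raise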

This looks like normal sequential code. Temporal makes it survive crashes invisibly: the worker process can die at any line, restart, and resume from where it left off. The compensation logic is explicit in the except block, with no event subscriptions to wire up and no global flow scattered across services.

The hybrid model

Most large systems mix both. Pattern: an orchestrator drives the high-level business workflow (order lifecycle, KYC onboarding), but individual steps publish events that downstream services consume choreographically.

Example: the OrderSaga orchestrator drives Order → Pay → Ship, but each step also publishes a domain event:

  • OrderCreated — analytics, recommendation engine subscribe.
  • PaymentSucceeded — fraud detection, audit log, finance dashboards subscribe.
  • ShipmentBooked — customer notification, ETA-prediction model subscribe.

The orchestrator has explicit control over the business-critical path (where compensation matters). Downstream services fan out for free over the bus (where eventual consistency is fine). This hybrid is the most common production pattern in modern microservices stacks.
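A rough sketch of the hybrid shape: the orchestrator walks the critical path and, after each step, publishes a domain event that it otherwise ignores. Everything here is illustrative. The publish function is a hypothetical in-process stand-in for a real bus, the step bodies are stubs, and compensation is left out to keep the fan-out visible.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]
subscribers: Dict[str, List[Handler]] = defaultdict(list)


def publish(event: str, payload: dict) -> None:
    """Fan out to subscribers; a subscriber failure never touches the saga."""
    for handler in subscribers[event]:
        try:
            handler(payload)
        except Exception:
            pass  # a real bus would retry or dead-letter instead


seen: List[str] = []
subscribers["OrderCreated"].append(lambda p: seen.append("analytics"))
subscribers["PaymentSucceeded"].append(lambda p: seen.append("fraud-detection"))
subscribers["PaymentSucceeded"].append(lambda p: seen.append("audit-log"))
subscribers["ShipmentBooked"].append(lambda p: seen.append("notification"))


def broken_dashboard(payload: dict) -> None:
    raise RuntimeError("dashboard down")


subscribers["PaymentSucceeded"].append(broken_dashboard)


def run_order_saga(order: dict) -> str:
    # The orchestrator owns only the business-critical path.
    critical_path = [
        ("create_order", "OrderCreated"),
        ("charge_card", "PaymentSucceeded"),
        ("book_shipment", "ShipmentBooked"),
    ]
    for step_name, event in critical_path:
        order.setdefault("completed", []).append(step_name)  # stand-in for the real call
        publish(event, order)  # downstream consumers are unknown to the saga
    return "COMPLETE"


order = {"order_id": "o-1"}
status = run_order_saga(order)
print(status, seen)
```

The broken subscriber does not affect the order: eventual consistency is acceptable for the fan-out, so its failure is isolated from the critical path.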

Durable execution — Temporal's killer feature

Pre-Temporal, orchestrators required developers to manually checkpoint state ("we're in step 3 of 5, here's the partial result"). Step Functions does this implicitly via its JSON state machine.

Temporal goes further: workflow code is regular code (Go, Java, Python, TypeScript) that automatically survives process crashes. Every external call — service invocation, sleep, timer, signal — is intercepted by the Temporal SDK and recorded as an event in the workflow's event history. On crash, when the worker restarts, Temporal replays the event history to reconstruct the workflow's exact state up to the crash point, then continues from the next instruction.

Practical impact: a workflow can call await workflow.sleep(timedelta(days=30)) and Temporal will resume it 30 days later in a completely different process on a completely different host. No state in your code, no schedulers, no cron jobs. Workflows that span days or months become first-class.
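The replay idea can be illustrated with a toy model. This is not Temporal's implementation or API, only a sketch of the mechanism under simplifying assumptions: the "event history" is a Python list, WorkflowContext and WorkerCrash are hypothetical names, and the workflow code must be deterministic so that replay makes the same calls in the same order.

```python
from typing import Any, Callable, List


class WorkerCrash(Exception):
    """Simulates the worker process dying mid-workflow."""


class WorkflowContext:
    """Records each side-effect result; on replay, returns recorded results."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history  # stands in for the durable event history
        self.position = 0       # number of calls made so far in this run

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.history):
            result = self.history[self.position]  # replay: skip the side effect
        else:
            result = fn()                 # first execution: do the work
            self.history.append(result)   # and persist its result
        self.position += 1
        return result


executed: List[str] = []  # which side effects actually ran


def activity(name: str) -> Callable[[], str]:
    def run() -> str:
        executed.append(name)
        return f"{name}-ok"
    return run


def order_workflow(ctx: WorkflowContext, crash_after_payment: bool) -> List[str]:
    order = ctx.call(activity("create_order"))
    payment = ctx.call(activity("charge_card"))
    if crash_after_payment:
        raise WorkerCrash("worker died")
    shipment = ctx.call(activity("ship_order"))
    return [order, payment, shipment]


history: List[Any] = []
try:
    order_workflow(WorkflowContext(history), crash_after_payment=True)
except WorkerCrash:
    pass

# A fresh "process" replays the same history and continues past the crash.
result = order_workflow(WorkflowContext(history), crash_after_payment=False)
print(result)
print(executed)
```

The card is charged exactly once even though the workflow function ran twice: the second run reads the first two results from history and only executes the step that had not happened yet.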

Real-world saga deployments

  • Uber. Trips, payments, fare splitting on Cadence/Temporal — millions of sagas/day, each spanning hours (rider waits, trip completes, billing finalizes).
  • Snap. Identity, content moderation, ads pipelines on Temporal — billions of workflow executions/month.
  • Coinbase. Cryptocurrency transaction sagas on Temporal — irreversible on-chain steps demand correct compensation logic.
  • AWS Step Functions. 2.5 trillion state transitions/year customer-wide; pricing per transition encourages keeping state in workflows.
  • Camunda 8 (banks). ING, Allianz, Goldman Sachs run loan, claims, and onboarding workflows on BPMN — 15+ steps over days, 100k+ instances/sec.
  • Netflix Conductor. Content licensing workflows (24-72 hour runtimes) coordinating ~50 microservices.

Common misconceptions and traps

  • "Choreography is always more loosely coupled." Past 4-5 services, the implicit global flow is its own form of coupling — services bound by event-name conventions, with no place to look up "what happens after PaymentSucceeded?"
  • "Orchestration is a single point of failure." Modern orchestrators (Step Functions, Temporal, Camunda) are clustered and durable. Workflow state survives any single-node failure.
  • "I'll just use Kafka and choreography." Works for 3-4 steps. At 10+ steps with complex error handling, the debugging cost becomes prohibitive. Move to orchestration.
  • "Step Functions can do anything Temporal can." Step Functions excels at AWS-native pipelines; Temporal excels at long-running code-first workflows with complex local state. Different tools for different jobs.
  • "BPMN is overkill." For regulated industries (banking, insurance, healthcare), BPMN's visual auditability is a compliance requirement, not a luxury.
  • "Orchestrator means synchronous." No — Temporal, Step Functions, Camunda all support async activities. The orchestrator dispatches work and waits asynchronously.
  • "I need to pick one and stick with it." Hybrid is the production reality. Orchestrate the critical-path workflow; let downstream services fan out via choreography.

Performance characteristics

  • Choreography latency: sum of step latencies + bus hop latencies (~10-50 ms/hop on Kafka). 5 steps ≈ 200-500 ms, of which the five bus hops alone account for 50-250 ms.
  • Orchestration latency: sum of step latencies + orchestrator decision time (~5-20 ms). Comparable to choreography.
  • Orchestrator throughput: Step Functions 4k-100k workflows/sec; Temporal scales to similar numbers per cluster; Camunda Zeebe 100k+ instances/sec on 5-broker cluster.
  • State storage: ~1 KB per saga step persisted; Step Functions in AWS-managed storage, Temporal on Cassandra, PostgreSQL, or MySQL, Camunda Zeebe in its own replicated log (with Elasticsearch commonly used for exported history).
  • Recovery from crash: <1 second after orchestrator process restart, resuming from last checkpoint.
  • Cost: Step Functions ~$25/M transitions; Temporal Cloud ~$200/M actions; self-hosted Temporal is open-source.
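The arithmetic behind these figures is simple enough to check by hand. The sketch below uses the article's per-hop and per-transition numbers; the per-step latencies (30 and 50 ms) and the count of six transitions per saga are assumptions chosen for illustration, not measurements.

```python
from typing import List


def choreography_latency_ms(step_ms: List[float], hops: int, hop_ms: float) -> float:
    """End-to-end latency: step execution plus one bus hop per handoff."""
    return sum(step_ms) + hops * hop_ms


def step_functions_cost_usd(sagas: int, transitions_per_saga: int,
                            usd_per_million: float = 25.0) -> float:
    """Cost at a flat per-transition price (the article's ~$25 per million)."""
    return sagas * transitions_per_saga * usd_per_million / 1_000_000


# Assumed step latencies of 30 ms (fast case) and 50 ms (slow case).
fast = choreography_latency_ms([30] * 5, hops=5, hop_ms=10)
slow = choreography_latency_ms([50] * 5, hops=5, hop_ms=50)

# Assumed workload: one million sagas at six state transitions each.
monthly_cost = step_functions_cost_usd(1_000_000, 6)

print(fast, slow, monthly_cost)
```

Under these assumptions the latency range matches the 200-500 ms quoted above, and the cost scales linearly with the number of states a workflow passes through.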

Frequently asked questions

What's the difference between choreography and orchestration?

Choreography: each service publishes a 'step done' event and the next service subscribes and reacts — no central controller, the flow lives entirely in event subscriptions. Orchestration: a central saga coordinator maintains the state machine, calls each service in turn (over RPC or events), receives the response, and decides what to do next. Choreography pros: loose coupling, no single point of failure. Cons: global flow is invisible — debugging needs distributed tracing. Orchestration pros: explicit state machine, easy timeouts, observable. Cons: orchestrator is a deployment unit and a scaling concern. Rule of thumb: choreography for 2-4 service hops; orchestration for 5+ steps or complex error handling.

When should I use Temporal vs AWS Step Functions vs Camunda?

Temporal (and its predecessor Uber Cadence): pioneer of 'durable execution' — your workflow code is regular Go/Java/Python that automatically survives process crashes via checkpointing. Best for: complex workflows, long-running (hours-days), need code-first ergonomics. AWS Step Functions: managed serverless, JSON state machine definition, deep AWS integration. Best for: AWS-native pipelines, ML training orchestration, ETL. ~$25 per million state transitions; 2.5 trillion transitions/year customer-wide. Camunda 8 (Zeebe): BPMN-based — visual workflow modeling for business stakeholders. Best for: regulated industries (banks, insurers) that want BPMN audit trails and 100k+ workflow instances/sec. Netflix Conductor: open-source orchestrator built for media licensing pipelines.

How does orchestration handle a step failure?

The orchestrator catches the failure and decides whether to retry (transient) or compensate (terminal). On compensation, it runs the compensating action for every previously committed step, in REVERSE order. Reserve inventory → Charge card → Ship. If Ship fails terminally: Refund card, then Release inventory. Step Functions models this with Retry and Catch fields on Task states, where a Catch routes to the states that run the compensations. Temporal models it as a workflow with try/catch and explicit compensation handlers. Camunda BPMN uses compensation events on each step. All three persist orchestrator state durably (AWS-managed storage for Step Functions, a database such as Cassandra or PostgreSQL for Temporal, a replicated log for Zeebe), so a crashed orchestrator resumes from its last checkpoint within seconds.
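The retry-or-compensate decision can be sketched as follows. TransientError, TerminalError, and call_with_retry are hypothetical names for this illustration, and the backoff delay a real orchestrator would apply between attempts is omitted so the example runs instantly.

```python
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Worth retrying: timeouts, throttling, brief outages."""


class TerminalError(Exception):
    """Not worth retrying: the step can never succeed as requested."""


def call_with_retry(action: Callable[[], None], max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError as exc:
            if attempt == max_attempts:
                raise TerminalError("retries exhausted") from exc


def run_saga(steps: List[Tuple[Callable[[], None], Callable[[], None]]]) -> str:
    compensations: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            call_with_retry(action)
        except TerminalError:
            for undo in reversed(compensations):
                undo()
            return "COMPENSATED"
        compensations.append(compensate)
    return "COMPLETE"


log: List[str] = []
attempts = {"charge": 0}


def charge_card() -> None:
    attempts["charge"] += 1
    if attempts["charge"] < 3:
        raise TransientError("gateway timeout")  # succeeds on the third try
    log.append("charge_card")


def ship() -> None:
    raise TerminalError("address undeliverable")


outcome = run_saga([
    (lambda: log.append("reserve_inventory"), lambda: log.append("release_inventory")),
    (charge_card, lambda: log.append("refund_card")),
    (ship, lambda: log.append("cancel_shipment")),
])
print(outcome, log)
```

The flaky charge is absorbed by retries, while the terminal shipping failure triggers the refund and then the inventory release, matching the order described above.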

Can I mix choreography and orchestration?

Yes, and most large systems do. The pattern: orchestrators drive the high-level business workflow (order lifecycle, KYC onboarding), but the individual steps within the workflow may publish events that downstream services consume choreographically. For example, an Order orchestrator drives Order → Pay → Ship, but the 'Pay' step publishes a PaymentSucceeded event that analytics, fraud detection, and the audit log all consume independently. The orchestrator doesn't know or care about those subscribers. This hybrid gives you explicit control where you need it and free fan-out where you don't.

What is BPMN and why do enterprises like it?

Business Process Model and Notation (BPMN 2.0) is an ISO-standardized visual notation for workflows: boxes for activities, diamonds for gateways, arrows for sequence flows. Camunda 8 / Zeebe executes BPMN diagrams directly — the diagram IS the deployment artifact. Enterprises like it because (1) business analysts can author and review the diagram alongside developers; (2) auditors and regulators understand BPMN; (3) every step is observable in the BPMN execution log. Banks use it for loan approval (15+ steps over days), insurers for claims, manufacturers for supply chains. Limitation: BPMN encodes structure but not full code — for complex logic you still drop into Java service tasks.

How does Temporal's 'durable execution' work?

Your workflow code (regular Go/Java/Python/TypeScript) runs inside the Temporal worker process. Every external call (service invocation, sleep, timer) is intercepted and recorded as an event in Temporal's event history. If the worker crashes, when it restarts Temporal replays the event history to reconstruct the workflow's exact state up to the crash, then continues. The programmer writes code that looks synchronous and stateful; Temporal makes it survive failures invisibly. Workflows can run for hours, days, or years without holding any process-level state. Temporal currently powers Uber, Snap, DoorDash, Coinbase, and Datadog's saga-style orchestration at scale — billions of workflow executions per month across deployments.