Distributed Systems

Saga Pattern

Q: How does saga differ from two-phase commit?

2PC holds locks across all participants for the duration of the transaction (prepare phase + commit phase) — every participant blocks waiting for the coordinator. If the coordinator fails after prepare and before commit, locks are held indefinitely. Saga has no global locks; each step is a local transaction that commits immediately. If a later step fails, compensations undo earlier steps. 2PC: ACID across services, blocking, fragile. Saga: eventual consistency, non-blocking, resilient. For microservices crossing 5+ services, 2PC's hold time becomes prohibitive — saga is the practical choice.

Q: What is a compensating transaction?

A compensating transaction (compensation) is the semantic inverse of a forward transaction — it undoes the business effect, not the database row. ReserveSeat → CancelReservation. ChargeCard → IssueRefund. SendEmail → SendCancellationEmail (you can't unsend). Compensations must be idempotent (safe to retry), commutative (order with concurrent operations doesn't matter for the same actor), and complete (cover the full business effect). A naive 'restore old DB state' fails because the world moved on — other transactions touched the same rows.

Q: Choreography vs orchestration — which to choose?

Choreography: each service publishes events when its step finishes; downstream services subscribe and act. No central controller. Pros: loose coupling, no single point of failure. Cons: cross-cutting flow is invisible — debugging needs distributed tracing. Best for 2-4 service hops. Orchestration: a central saga orchestrator (AWS Step Functions, Camunda, Temporal) drives the flow, calling each service in sequence. Pros: explicit state machine, observable, easy to add timeouts. Cons: orchestrator becomes a deployment unit. Best for 5+ steps, complex error handling, regulatory audit trails.

Q: How are sagas idempotent?

Every step (forward and compensation) must accept retries safely. Standard technique: assign a unique saga_id + step_id, store it in the local DB on first execution, and reject duplicate (saga_id, step_id) pairs. Stripe's idempotency-key header is a textbook example — same key on a retry returns the original result without double-charging. Compensations need the same protection: receiving CancelReservation twice for the same booking should cancel exactly once. Without idempotency, message retries (which network failures force) double-execute steps and corrupt state.

Q: What about isolation in a saga?

Sagas explicitly sacrifice isolation. While a saga is in flight, partial state is visible to other transactions — order created but not yet paid, seat reserved but flight not booked. Garcia-Molina (1987) called this 'lack of isolation' the fundamental cost. Mitigations: semantic locks (mark records as 'pending' so other sagas avoid them), commutative updates (order doesn't matter), pessimistic resource hold (reserve before committing externally), or accept eventual visibility with explicit UI states ('Pending payment').

Q: When does a saga fail (and compensation fails too)?

If a compensation itself fails (network down, downstream service permanently broken), the saga enters a stuck state. Standard recovery: retry compensation with exponential backoff (Step Functions: up to 24 hours), escalate to dead-letter queue, alert humans. For irreversible actions (sending physical mail, transferring money to external bank), saga steps should run them last — minimize compensations needed. Or model them as semantic compensations: 'send refund check' instead of 'reverse the original transfer.' Compensation failure is the saga model's hardest open problem; production systems wrap it in alerting and human escalation.

A sequence of local transactions with compensating undo actions — the microservices alternative to two-phase commit

A saga is a sequence of local transactions across multiple services where each transaction has a corresponding compensating transaction that semantically "undoes" it. If any local transaction fails, previously-committed transactions are rolled back by running their compensations in reverse order. First defined by Garcia-Molina and Salem in 1987 for long-running database transactions; revived for microservices to replace 2PC's blocking and tight coupling. Two implementations: choreography (each service publishes events; others react) and orchestration (a central saga orchestrator drives the flow). Used in Uber's order processing, AWS Step Functions, Camunda BPMN.

Patternlocal transactions + compensations
OriginatedGarcia-Molina & Salem 1987
Variantschoreography, orchestration
Compensation rulesidempotent, semantic
Used inUber, k8s, Step Functions, Camunda
Beats2PC for microservices

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

Why saga matters

Microservices. An order checkout touches inventory, payment, shipping, notifications — five services. 2PC across all five would lock for hundreds of milliseconds; saga commits each independently.
Event-driven systems. Kafka, EventBridge, RabbitMQ — sagas naturally fit these by treating each step as an event published to a topic.
Long-running workflows. Onboarding a new customer can take hours (manual KYC). 2PC can't hold locks that long; saga proceeds in steps.
Ride-sharing. Uber's order pipeline — match driver, charge rider, notify both, finalize trip — is a textbook saga across 5+ services.
Booking and travel. Reserve flight + hotel + car. Each provider exposes its own API; you can't 2PC across Expedia + Marriott + Hertz.
SaaS provisioning. Spin up DB + storage + DNS + billing record. Some succeed, some fail; saga rolls back the partial provisioning.
Regulatory audit trails. Each step produces an event log entry, giving compliance a chronological trace of every state change.

Origin: Garcia-Molina and Salem 1987

The 1987 paper "Sagas" addressed long-lived transactions (LLTs) in single-database systems — bank reconciliation jobs that ran for hours or days. ACID transactions of that era held locks for the entire duration, blocking any concurrent activity. Garcia-Molina and Salem proposed splitting an LLT into a sequence T1, T2, ..., Tn of short transactions, each commitable on its own, with compensations C1, C2, ..., Cn-1 to undo prior steps if Tn failed.

The key invariants:

Each Ti commits as a normal short transaction.
If T1, ..., Tj all commit, the saga succeeds.
If Tj+1 fails, run Cj, Cj-1, ..., C1 in reverse order to compensate.
Compensations themselves cannot fail (or must be retried until success — modern interpretation).

The pattern was niche for two decades — relational DBs got better at concurrency, and most apps fit in one database. Microservices revived it around 2015 as Netflix, Uber, and Airbnb hit the limits of distributed 2PC.

Choreography saga

In choreography, there's no central coordinator. Each service:

Listens for an event indicating it's its turn.
Performs its local transaction.
Publishes a success event (next service picks up) or failure event (triggers compensation chain).

Example: e-commerce checkout flowing through Order, Payment, Inventory, Shipping services on Kafka.

OrderService writes the order, publishes OrderCreated.
PaymentService consumes OrderCreated, charges card, publishes PaymentSucceeded.
InventoryService consumes PaymentSucceeded, decrements stock, publishes InventoryReserved.
ShippingService consumes InventoryReserved, books carrier, publishes ShipmentBooked.
OrderService consumes ShipmentBooked, marks order COMPLETE.

If InventoryService fails (out of stock), it publishes InventoryFailed. PaymentService consumes it and runs RefundPayment. OrderService consumes RefundIssued and marks the order CANCELLED. The compensation chain is implicit in event subscriptions.

Pros: services don't know each other; new services can be added by subscribing to existing events. Cons: the global flow exists nowhere — you trace it only through distributed tracing (Jaeger, OpenTelemetry). Beyond ~4 services, choreography becomes hard to reason about.

Orchestration saga

In orchestration, a central component (the saga orchestrator) drives the flow. The orchestrator maintains the saga's state machine, calls each service in sequence (synchronous RPC or async events), and decides when to compensate.

Same checkout in orchestration:

Orchestrator receives PlaceOrder request.
Calls OrderService.create() → success.
Calls PaymentService.charge() → success.
Calls InventoryService.reserve() → fails.
Orchestrator triggers compensations in reverse: PaymentService.refund(), OrderService.cancel().

Pros: explicit state machine, easy to visualize (Camunda gives you BPMN diagrams), easy to add timeouts and retries per step, single place for cross-cutting concerns (audit, monitoring). Cons: orchestrator is a deployment unit and scaling concern; tighter coupling — services know they're called by an orchestrator.

Production orchestrators: AWS Step Functions, Temporal (open source, durable execution), Camunda 8 (Zeebe), Netflix Conductor, Uber Cadence. Temporal in particular pioneered "durable execution" — workflow code that survives crashes by checkpointing state automatically.

Designing compensations

Compensations are the hardest part of saga design. Five rules:

Semantic, not physical. CancelReservation, IssueRefund, SendCancellationEmail — not "restore old DB row." The world moved on; physical undo is meaningless.
Idempotent. Receiving the same compensation twice (network retries) must cancel exactly once. Use idempotency keys.
Commutative with concurrent ops. If two sagas concurrently target the same resource, compensation order shouldn't corrupt state.
Complete. Cover all side effects, including indirect ones (notifications, cache invalidation, audit logs).
Always succeeds (eventually). If compensation can fail, design retry + escalation. A stuck saga is the worst failure mode.

Some actions are irreversible: physical mail, IRL meetings, money sent to an external bank. Two strategies: (1) sequence them last so few or no compensations are needed if they fail; (2) model semantic compensations — "send refund check" instead of "reverse the wire transfer."

The isolation problem

Sagas explicitly give up isolation. Mid-saga state is visible to concurrent reads and writes. An order shows as PENDING_PAYMENT for the user; an inventory item shows as RESERVED but not yet allocated. Three mitigation patterns:

Semantic locks. Add a "saga_id" or "pending" column; other sagas read this and skip. Released on saga end.
Commutative operations. Use atomic counters (DECR inventory by 1) instead of read-modify-write. Concurrent sagas don't conflict.
Pessimistic view. Reserve resources up front (mark seat as held), commit externally only after all reservations succeed.
Re-read. Critical decisions (charge card) re-read state at execution time to catch concurrent changes.
Status-aware UI. Show "pending" / "processing" states explicitly so users know about the eventual consistency.

Real-world saga deployments

Uber Cadence/Temporal: rides, payments, fare splitting — millions of sagas/day, each potentially spanning hours (rider waits, trip completes, billing finalizes).
AWS Step Functions: 2.5 trillion state transitions/year (2024 numbers), pricing per transition, native saga support via Catch + Compensate state types.
Netflix Conductor: open-sourced 2017, used internally for content licensing workflows that run 24-72 hours.
Camunda 8 (Zeebe): 100,000+ workflow instances/sec on a 5-broker cluster; banks and insurers use it for claims and onboarding.
Microsoft Dapr: exposes saga as a first-class building block for cloud-native apps; chosen for portability across clouds.

Common misconceptions

"Saga gives you ACID across services." No — saga sacrifices isolation, accepts eventual consistency, and only guarantees atomicity through compensations.
"Saga is easy to implement." Compensation logic is the hardest part of distributed systems. Reasoning about partial failure × retry × concurrency is genuinely hard.
"Saga always replaces 2PC." Within a single relational DB, 2PC (or the underlying single-node transaction) is still the right answer — saga adds overhead with no benefit there.
"Choreography always loosely couples better." Beyond 4-5 services, the hidden global flow becomes worse coupling — services bound by event-name conventions.
"Compensations always work." Sending physical mail, IRL meetings, money to external banks — sometimes irreversible. Design ordering carefully.
"Idempotency is automatic." Every saga step needs an explicit idempotency key + dedup logic. Without it, message retries double-execute.
"Sagas need a workflow engine." Temporal/Step Functions help, but a small saga can be 50 lines of Python with retries. Pick tooling proportional to complexity.

Performance characteristics

Latency: sum of step latencies (sequential) — not all-at-once like 2PC. A 5-step saga at 50ms/step = 250ms end-to-end.
Throughput: bound by the slowest service; no global lock contention so 10x higher than 2PC at scale.
Failure rate: compensation success ≥ 99.9% in well-designed systems; the remaining 0.1% needs human intervention.
State storage: orchestration sagas need a durable state store (Step Functions: DynamoDB; Temporal: Cassandra/Postgres). ~1KB per saga step.
Recovery time: after orchestrator crash, Temporal/Step Functions resume in <1 second from last checkpoint.

Frequently asked questions

How does saga differ from two-phase commit?

2PC holds locks across all participants for the duration of the transaction (prepare phase + commit phase) — every participant blocks waiting for the coordinator. If the coordinator fails after prepare and before commit, locks are held indefinitely. Saga has no global locks; each step is a local transaction that commits immediately. If a later step fails, compensations undo earlier steps. 2PC: ACID across services, blocking, fragile. Saga: eventual consistency, non-blocking, resilient. For microservices crossing 5+ services, 2PC's hold time becomes prohibitive — saga is the practical choice.

What is a compensating transaction?

A compensating transaction (compensation) is the semantic inverse of a forward transaction — it undoes the business effect, not the database row. ReserveSeat → CancelReservation. ChargeCard → IssueRefund. SendEmail → SendCancellationEmail (you can't unsend). Compensations must be idempotent (safe to retry), commutative (order with concurrent operations doesn't matter for the same actor), and complete (cover the full business effect). A naive 'restore old DB state' fails because the world moved on — other transactions touched the same rows.

Choreography vs orchestration — which to choose?

Choreography: each service publishes events when its step finishes; downstream services subscribe and act. No central controller. Pros: loose coupling, no single point of failure. Cons: cross-cutting flow is invisible — debugging needs distributed tracing. Best for 2-4 service hops. Orchestration: a central saga orchestrator (AWS Step Functions, Camunda, Temporal) drives the flow, calling each service in sequence. Pros: explicit state machine, observable, easy to add timeouts. Cons: orchestrator becomes a deployment unit. Best for 5+ steps, complex error handling, regulatory audit trails.

How are sagas idempotent?

Every step (forward and compensation) must accept retries safely. Standard technique: assign a unique saga_id + step_id, store it in the local DB on first execution, and reject duplicate (saga_id, step_id) pairs. Stripe's idempotency-key header is a textbook example — same key on a retry returns the original result without double-charging. Compensations need the same protection: receiving CancelReservation twice for the same booking should cancel exactly once. Without idempotency, message retries (which network failures force) double-execute steps and corrupt state.

What about isolation in a saga?

Sagas explicitly sacrifice isolation. While a saga is in flight, partial state is visible to other transactions — order created but not yet paid, seat reserved but flight not booked. Garcia-Molina (1987) called this 'lack of isolation' the fundamental cost. Mitigations: semantic locks (mark records as 'pending' so other sagas avoid them), commutative updates (order doesn't matter), pessimistic resource hold (reserve before committing externally), or accept eventual visibility with explicit UI states ('Pending payment').

When does a saga fail (and compensation fails too)?

If a compensation itself fails (network down, downstream service permanently broken), the saga enters a stuck state. Standard recovery: retry compensation with exponential backoff (Step Functions: up to 24 hours), escalate to dead-letter queue, alert humans. For irreversible actions (sending physical mail, transferring money to external bank), saga steps should run them last — minimize compensations needed. Or model them as semantic compensations: 'send refund check' instead of 'reverse the original transfer.' Compensation failure is the saga model's hardest open problem; production systems wrap it in alerting and human escalation.