Saga Pattern: Coordinating Distributed Transactions with Compensation
Model long-running workflows with compensating actions instead of fragile global transactions.
Abstract Algorithms
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: A Saga replaces fragile distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction. Use orchestration when workflow control needs a single brain; use choreography when services must stay loosely coupled.
Why Distributed Transactions Break Microservices
A ride-booking app charged a user's credit card, then failed to dispatch a driver, with no automatic refund. The payment service completed its transaction successfully. The dispatch service threw a transient exception. No compensating action fired. The user was billed for a ride that never happened, and customer support had to issue a manual refund three hours later.
This is not a theoretical edge case. It is what happens in any distributed system when you treat a multi-step business operation as a series of independent API calls with no explicit rollback plan.
The Saga pattern is the engineering answer. Instead of a single atomic transaction spanning multiple services (which microservices cannot support), you design a sequence of local transactions where every step has a pre-defined compensating transaction: a new, forward-moving action that reverses the business effect of a prior step if a later step fails.
Step 1: Charge card → success
Step 2: Dispatch driver → failure
Compensate Step 1: Refund charge → fires automatically
With a Saga, the user gets their refund. Without one, they don't, until someone notices.
In a microservices architecture, checkout, inventory, payment, and dispatch are separate services with separate databases. No single transaction coordinator owns all of them.
Two-Phase Commit (2PC), the classical distributed transaction protocol, solves this with a coordinator that asks every participant to prepare, then has them hold their locks until the global commit or abort decision arrives. At microservice scale, this creates three serious problems:
- Lock contention. Every participant holds database locks while waiting for the global commit decision, a latency multiplier across unrelated services.
- Coordinator as a single point of failure. If the coordinator crashes mid-protocol, participants are left in limbo holding locks indefinitely.
- Tight coupling. Every service must implement the 2PC protocol, making independent deployments and polyglot databases nearly impossible.
A Saga is the practical alternative: a sequence of local transactions, each owned by exactly one service, where every step has a corresponding compensating transaction that undoes its business effect if a later step fails.
Think of booking a trip: flight, hotel, and car rental are confirmed separately. If the car rental fails, you compensate by cancelling hotel and flight explicitly, in reverse order.
Sagas accept eventual consistency: the system may be temporarily inconsistent between steps, but it converges to a consistent state as long as every compensation is idempotent and retried until it succeeds.
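The trip-booking analogy can be sketched as a tiny in-process runner. This is a minimal illustration under stated assumptions, not how any particular framework works: `SagaStep`, `TripSagaRunner`, and `TripSagaDemo` are hypothetical names, everything runs in one JVM, and nothing is persisted between steps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One saga step: a forward action plus the compensation that undoes its business effect. */
final class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

/** Hypothetical in-process runner; a real saga would persist its progress between steps. */
final class TripSagaRunner {
    /** Runs steps in order; on failure, compensates completed steps newest first. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step); // the most recent step sits on top of the stack
            } catch (RuntimeException failure) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}

final class TripSagaDemo {
    /** Books flight and hotel, fails on the car, and returns a log of everything that ran. */
    static List<String> bookTripWithFailingCar() {
        List<String> log = new ArrayList<>();
        List<SagaStep> steps = List.of(
            new SagaStep("flight", () -> log.add("book flight"), () -> log.add("cancel flight")),
            new SagaStep("hotel", () -> log.add("book hotel"), () -> log.add("cancel hotel")),
            new SagaStep("car",
                () -> { throw new IllegalStateException("no cars available"); },
                () -> log.add("cancel car")));
        TripSagaRunner.run(steps);
        return log;
    }

    public static void main(String[] args) {
        // prints [book flight, book hotel, cancel hotel, cancel flight]
        System.out.println(bookTripWithFailingCar());
    }
}
```

The failed step itself is never compensated, because it never completed; only the steps that succeeded before it are undone.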
Two Flavors: Orchestration vs Choreography
There are two fundamentally different styles for coordinating saga participants:
| Dimension | Orchestration | Choreography |
| --- | --- | --- |
| Who decides the next step | Central saga class sends commands | Each service reacts to events it cares about |
| Coupling | Services coupled to orchestrator via command contracts | Services coupled only to shared event schemas |
| Workflow visibility | Full workflow visible in one place | Workflow is emergent; harder to trace end-to-end |
| Compensation control | Orchestrator explicitly issues compensation commands in order | Each service publishes failure events; others react |
| Best for | Complex conditional workflows, auditability, long-running processes | High-throughput pipelines, independently-owned services |
| Debugging | Easier: saga state is centralized and queryable | Requires distributed tracing across every participant |
Neither style is universally better. Choose orchestration when the workflow has conditional branches, per-step retry policies, or must be auditable. Choose choreography when services are owned by different teams and maximum decoupling is the priority.
How the Checkout Saga Flows Step by Step
The running example for this post: Checkout orchestrates inventory reservation, payment authorization, and shipment creation, then compensates inventory and payment on downstream failure.
```mermaid
flowchart TD
    A[OrderPlacedEvent] --> B[ReserveInventory]
    B -->|InventoryReservedEvent| C[AuthorizePayment]
    C -->|PaymentAuthorizedEvent| D[CreateShipment]
    D --> E[SagaEnd Order Confirmed]
    B -->|InventoryReservationFailedEvent| F[CancelOrder]
    F --> G[SagaEnd Cancelled: No Stock]
    C -->|PaymentFailedEvent| H[ReleaseInventory]
    H --> I[CancelOrder]
    I --> J[SagaEnd Cancelled: Payment Declined]
```
Every arrow in the diagram is either a command (orchestrator → service) or an event (service → orchestrator). The orchestrator never performs business logic; it routes, compensates, and terminates.
| Step | Command sent | Success event | Failure event | Compensation action |
| --- | --- | --- | --- | --- |
| 1. Inventory | ReserveInventoryCommand | InventoryReservedEvent | InventoryReservationFailedEvent | CancelOrderCommand |
| 2. Payment | AuthorizePaymentCommand | PaymentAuthorizedEvent | PaymentFailedEvent | ReleaseInventoryCommand → CancelOrderCommand |
| 3. Shipment | CreateShipmentCommand | (saga ends successfully) | n/a | n/a |
Notice that compensation for Step 2 must happen in reverse order: release inventory first, then cancel the order. If you cancel first, downstream reads of order state may see an inconsistency before the reservation is released.
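The step table can be read as a pure routing function from an incoming event to the commands the orchestrator should dispatch, in order. The sketch below is a framework-free illustration: `CheckoutEvent` and `CheckoutRouting` are hypothetical names, and commands are plain strings standing in for real command objects.

```java
import java.util.List;

/** Events the orchestrator can receive; names mirror the step table, trimmed for the sketch. */
enum CheckoutEvent {
    ORDER_PLACED,
    INVENTORY_RESERVED,
    INVENTORY_RESERVATION_FAILED,
    PAYMENT_AUTHORIZED,
    PAYMENT_FAILED
}

/** Hypothetical routing table: maps an event to the commands to dispatch, in order. */
final class CheckoutRouting {
    static List<String> commandsFor(CheckoutEvent event) {
        switch (event) {
            case ORDER_PLACED:
                return List.of("ReserveInventoryCommand");
            case INVENTORY_RESERVED:
                return List.of("AuthorizePaymentCommand");
            case PAYMENT_AUTHORIZED:
                return List.of("CreateShipmentCommand");
            case INVENTORY_RESERVATION_FAILED:
                // Nothing was reserved, so cancelling the order is the only compensation.
                return List.of("CancelOrderCommand");
            case PAYMENT_FAILED:
                // Reverse order: undo the reservation before the order is marked cancelled.
                return List.of("ReleaseInventoryCommand", "CancelOrderCommand");
            default:
                throw new IllegalArgumentException("Unhandled event: " + event);
        }
    }
}
```

Keeping the routing in one side-effect-free function makes the compensation order easy to unit test without a broker or a saga store.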
Compensation Flow: State Transitions and Event Routing
Each saga transition is driven by a single event or command. The diagram below captures every state the CheckoutSaga can reach, including both compensation paths, so the full workflow is traceable from one place rather than scattered across services.
```mermaid
flowchart TD
    S([Start]) --> INV[ReserveInventory]
    INV -->|InventoryReservedEvent| PAY[AuthorizePayment]
    INV -->|InventoryReservationFailedEvent| CANCEL1[CancelOrder]
    PAY -->|PaymentAuthorizedEvent| SHIP[CreateShipment]
    PAY -->|PaymentFailedEvent| RELV[ReleaseInventory]
    RELV --> CANCEL2[CancelOrder]
    SHIP --> END1([Confirmed])
    CANCEL1 --> END2([Cancelled: No Stock])
    CANCEL2 --> END3([Cancelled: Payment Failed])
```
Deep Dive: Saga Internals and Failure Semantics
Compensating Transactions Are Not Rollbacks
A compensating transaction is a new, forward-moving transaction that reverses the business effect of a prior step; it is not a database rollback. This distinction has real implications:
- ReleaseInventoryCommand tells the inventory service "release reservation R-4892." The reservation record is not deleted; it is marked released with a timestamp and reason.
- Compensations must be idempotent: if ReleaseInventoryCommand is delivered twice (broker retry, network hiccup), the second delivery must be a safe no-op, not a double-release.
Design rule: every command carries a stable business key (orderId, reservationId, paymentId). The receiving service checks whether the command has already been applied before executing side effects.
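One way to apply the design rule is to record each business key the first time its command is applied and treat any repeat as a no-op. The sketch below is an assumption-laden illustration: `InventoryReleaseHandler` is a hypothetical class, state lives in memory, and a real service would store the processed key and the stock change in the same local database transaction.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical receiver for ReleaseInventoryCommand that deduplicates by reservationId. */
final class InventoryReleaseHandler {
    private final Map<String, Integer> availableBySku = new HashMap<>();
    private final Set<String> releasedReservations = new HashSet<>();

    /**
     * Returns stock to the pool once per reservation.
     * Returns true if the release was applied, false if it was a duplicate delivery.
     */
    synchronized boolean handleRelease(String reservationId, String sku, int quantity) {
        if (!releasedReservations.add(reservationId)) {
            return false; // already applied: redelivery is a safe no-op
        }
        availableBySku.merge(sku, quantity, Integer::sum);
        return true;
    }

    synchronized int available(String sku) {
        return availableBySku.getOrDefault(sku, 0);
    }
}
```

Delivering the release for reservation R-4892 twice changes the available count only once, which is the behaviour the double-release bullet asks for.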
Performance Analysis
Sagas trade atomic rollback for latency decomposition. Each step adds at minimum one round-trip: command dispatch → service processing → event publication → orchestrator wakeup. For a 3-step checkout saga with 20 ms service latency per step:
| Scenario | Estimated end-to-end latency |
| --- | --- |
| Happy path (3 steps, no retries) | 60–90 ms saga latency + broker overhead |
| Compensation path (2 steps back out) | 100–130 ms total |
| Step 2 retried once before failing | p95 easily reaches 300–500 ms |
When p99 spikes without elevated error rates, cold-start delays or GC pauses on the saga orchestrator host are the primary suspects.
Java Frameworks for Saga Orchestration
| Framework | Programming model | Sweet spot | Compensation support |
| --- | --- | --- | --- |
| Axon Framework | CQRS/Event Sourcing, @Saga annotation | DDD-style microservices, audit-heavy domains | Built-in: saga state persisted via JPA/JDBC; dead letter queue integration out of the box |
| Camunda / Zeebe | BPMN workflow engine | Long-running sagas, human approval tasks, ops dashboards | BPMN compensation events; visual process explorer for non-engineers |
| Temporal (Java SDK) | Workflow-as-code, durable execution | Polyglot teams, complex retry policies, durable workflow execution (activities run at least once, so they still need idempotency) | try/catch in workflow code; each activity's compensation is explicit Java logic |
Eventuate Tram: Saga Coordination Without a Dedicated Framework
Eventuate Tram (a lightweight saga/transactional messaging library by Chris Richardson) implements sagas as SagaDefinition objects with explicit step sequences and compensating actions, without requiring an event-sourcing infrastructure like Axon Server.
```java
import io.eventuate.tram.commands.consumer.CommandWithDestination;
import io.eventuate.tram.sagas.orchestration.SagaDefinition;
import io.eventuate.tram.sagas.simpledsl.SimpleSaga;

import static io.eventuate.tram.commands.consumer.CommandWithDestinationBuilder.send;

public class CheckoutSagaDefinition implements SimpleSaga<CheckoutSagaData> {

    // Each step pairs a forward command with the compensation that undoes it.
    private SagaDefinition<CheckoutSagaData> sagaDefinition =
        step()
            .invokeParticipant(this::reserveInventory)
            .withCompensation(this::releaseInventory)
        .step()
            .invokeParticipant(this::authorizePayment)
            .withCompensation(this::refundPayment)
        .step()
            .invokeParticipant(this::createShipment)
        .build();

    @Override
    public SagaDefinition<CheckoutSagaData> getSagaDefinition() { return sagaDefinition; }

    private CommandWithDestination reserveInventory(CheckoutSagaData data) {
        return send(new ReserveInventoryCommand(data.getOrderId(), data.getItems()))
            .to("inventoryService").build();
    }

    private CommandWithDestination releaseInventory(CheckoutSagaData data) {
        return send(new ReleaseInventoryCommand(data.getOrderId(), data.getReservationId()))
            .to("inventoryService").build();
    }

    // authorizePayment, refundPayment, createShipment follow the same pattern
}
```
SagaDefinition chains steps with .withCompensation(); Eventuate Tram executes compensations in reverse step order automatically on failure. You do not hand-write an orchestrator wakeup loop; the framework manages message dispatch and state persistence via JDBC.
For a full deep-dive on Eventuate Tram saga state management and transactional messaging patterns, a dedicated follow-up post is planned.
This post uses Axon Framework for examples. Add the starter to your pom.xml:
```xml
<dependency>
    <groupId>org.axonframework</groupId>
    <artifactId>axon-spring-boot-starter</artifactId>
    <version>4.9.3</version>
</dependency>
```
Orchestration in Code: The Axon Checkout Saga
The saga class below is the single source of truth for the checkout workflow. Axon persists its fields between events, so the saga survives service restarts mid-flight without losing state.
```java
import javax.inject.Inject; // jakarta.inject.Inject on Jakarta-based stacks
import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.modelling.saga.SagaEventHandler;
import org.axonframework.modelling.saga.SagaLifecycle;
import org.axonframework.modelling.saga.StartSaga;
import org.axonframework.spring.stereotype.Saga;

@Saga
public class CheckoutSaga {

    @Inject
    private transient CommandGateway commandGateway; // transient = not serialized with saga state

    private String orderId;
    private String reservationId;
    private boolean paymentAuthorized;

    @StartSaga
    @SagaEventHandler(associationProperty = "orderId")
    public void on(OrderPlacedEvent event) {
        this.orderId = event.orderId();
        commandGateway.send(new ReserveInventoryCommand(event.orderId(), event.items()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservedEvent event) {
        this.reservationId = event.reservationId();
        commandGateway.send(new AuthorizePaymentCommand(event.orderId(), event.totalCents()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentAuthorizedEvent event) {
        this.paymentAuthorized = true;
        commandGateway.send(new CreateShipmentCommand(event.orderId()));
        SagaLifecycle.end();
    }

    // Compensation: inventory step failed, nothing reserved yet, just cancel the order
    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservationFailedEvent event) {
        commandGateway.send(new CancelOrderCommand(event.orderId(), "inventory_unavailable"));
        SagaLifecycle.end();
    }

    // Compensation: payment failed, release reservation first, then cancel order
    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentFailedEvent event) {
        // send() is asynchronous; sendAndWait() blocks until the release has been
        // handled, so the cancel command cannot overtake it.
        commandGateway.sendAndWait(new ReleaseInventoryCommand(orderId, reservationId));
        commandGateway.send(new CancelOrderCommand(orderId, "payment_failed"));
        SagaLifecycle.end();
    }
}
```
Key annotations: @StartSaga creates a new saga instance per trigger event. @SagaEventHandler(associationProperty = "orderId") routes events to the right instance by orderId. SagaLifecycle.end() removes the completed saga from the active store.
Orchestration Saga: Order Flow
```mermaid
sequenceDiagram
    participant O as Orchestrator
    participant A as InventoryService
    participant B as PaymentService
    participant C as ShipmentService
    O->>A: ReserveInventory
    A-->>O: InventoryReserved
    O->>B: AuthorizePayment
    B-->>O: PaymentFailed
    O->>A: ReleaseInventory
    A-->>O: InventoryReleased
    Note over O: Saga compensated
```
The sequence diagram shows orchestration-style compensation in action: the orchestrator drives every step, so when PaymentFailed returns it knows exactly which prior step (inventory reservation) must be reversed and sends the compensation command directly. The "Note over O: Saga compensated" marker signals a legitimate terminal state reached through explicit business-level undo operations (not a database rollback, but a series of reversals that each service handles independently). The key advantage visible here is that the full compensation plan is centrally visible, making debugging and auditing straightforward.
Choreography in Code: Spring Kafka Payment Participant
In the choreography style there is no central saga class. Each service subscribes to events and publishes its own outcome. The workflow emerges from the chain of reactions across services.
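```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class PaymentSagaParticipant {

    private final KafkaTemplate<String, Object> kafka;
    private final PaymentService paymentService; // was used below but never declared

    public PaymentSagaParticipant(KafkaTemplate<String, Object> kafka,
                                  PaymentService paymentService) {
        this.kafka = kafka;
        this.paymentService = paymentService;
    }

    // Payment reacts once inventory is reserved, matching the event chain in this post;
    // the "inventory-reserved" topic name is an assumption.
    @KafkaListener(topics = "inventory-reserved")
    public void onInventoryReserved(InventoryReservedEvent event) {
        try {
            paymentService.authorize(event.orderId(), event.totalCents());
            kafka.send("payment-authorized", new PaymentAuthorizedEvent(event.orderId()));
        } catch (PaymentDeclinedException e) {
            kafka.send("payment-failed",
                new PaymentFailedEvent(event.orderId(), e.getReason()));
        }
    }
}
```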
The inventory service similarly listens on the payment-failed topic and releases its reservation. The compensation chain is fully event-driven: no orchestrator wakes up, no shared state. The trade-off: to understand the full saga flow you must trace events across multiple services. A distributed tracing tool (Jaeger, Zipkin, OpenTelemetry) is not optional for choreography sagas beyond two steps.
Choreography Saga: Event Chain
```mermaid
sequenceDiagram
    participant A as OrderService
    participant B as InventoryService
    participant C as PaymentService
    A->>B: OrderPlaced event
    B->>C: InventoryReserved event
    C->>A: PaymentFailed event
    A->>B: OrderCancelled event
    B-->>A: InventoryReleased event
    Note over A,C: Compensation via events
```
This sequence shows the fully decentralized choreography model: no orchestrator exists; each service reacts to the event it receives and emits the next event in the chain. When PaymentFailed arrives, OrderService publishes OrderCancelled, which triggers InventoryService to release its reservation autonomously without any central coordinator. The "Note over A,C: Compensation via events" annotation captures the key operational challenge: to understand why a saga compensated, you must trace events across every participating service, making distributed tracing a requirement rather than an optional convenience.
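The event chain in the sequence diagram can be reproduced with a small in-process stand-in for the broker. This is only a sketch: `InProcessBus` and `ChoreographyDemo` are hypothetical, delivery is synchronous and single-threaded, the topic names are assumed, and the payment service is hard-wired to decline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical synchronous bus standing in for Kafka topics. */
final class InProcessBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final List<String> trace = new ArrayList<>(); // every topic published, in order

    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void publish(String topic, String orderId) {
        trace.add(topic);
        for (Consumer<String> handler : subscribers.getOrDefault(topic, List.of())) {
            handler.accept(orderId);
        }
    }
}

final class ChoreographyDemo {
    /** Wires three services together and returns the topics published for one order. */
    static List<String> runWithDeclinedPayment() {
        InProcessBus bus = new InProcessBus();
        // InventoryService: reserve on order-placed, release on order-cancelled
        bus.subscribe("order-placed", id -> bus.publish("inventory-reserved", id));
        bus.subscribe("order-cancelled", id -> bus.publish("inventory-released", id));
        // PaymentService: this sketch always declines
        bus.subscribe("inventory-reserved", id -> bus.publish("payment-failed", id));
        // OrderService: cancel the order when payment fails
        bus.subscribe("payment-failed", id -> bus.publish("order-cancelled", id));

        bus.publish("order-placed", "order-42");
        return bus.trace;
    }

    public static void main(String[] args) {
        System.out.println(runWithDeclinedPayment());
    }
}
```

No class in the sketch knows the whole workflow; the sequence only becomes visible in the trace, which is the role distributed tracing plays in a real deployment.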
Real-World Applications
Sagas appear wherever a multi-step business process spans service boundaries:
- Travel booking: flight → hotel → car rental. If the car is unavailable, compensate by cancelling hotel and flight in reverse order.
- Bank money transfer: debit source → credit destination → notify parties. Compensation reverses the debit if the credit write fails.
Trade-offs & Failure Modes: The Failure Modes That Bite First
| Failure mode | Symptom | Root cause | First mitigation |
| --- | --- | --- | --- |
| Non-idempotent compensation | Inventory released twice; duplicate refund issued | Missing deduplication key on compensation command | Add reservationId / paymentId as idempotency key in each receiving service |
| Saga stuck mid-flight | Order stays "pending" indefinitely | Expected event never arrives (broker outage, service crash) | Implement saga timeout with automatic dead-letter escalation |
| Compensation also fails | Inventory released, order never cancelled | Downstream service down during compensation | Route failed compensation to DLQ; alert data-integrity team for manual review |
| Lost in-flight saga state | Orchestrator restart drops active sagas | Saga state kept in JVM heap only | Use Axon's JPA saga store or equivalent; never rely on in-memory state across restarts |
No architecture pattern is free. Sagas trade atomic rollback for eventual consistency and explicit operational complexity.
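The "compensation also fails" row can be sketched as a bounded retry that parks the command for manual review once attempts run out. `CompensationDispatcher` is a hypothetical class, the dead-letter queue is a plain in-memory list, and backoff between attempts is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical dispatcher: retries a compensation, then dead-letters it. */
final class CompensationDispatcher {
    final List<String> deadLetters = new ArrayList<>(); // stand-in for a durable DLQ
    private final int maxAttempts;

    CompensationDispatcher(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * Runs the compensation up to maxAttempts times.
     * Returns true on success; otherwise records commandId in deadLetters and returns false.
     */
    boolean dispatch(String commandId, Runnable compensation) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                compensation.run();
                return true;
            } catch (RuntimeException e) {
                // A real dispatcher would log the failure and back off before retrying.
            }
        }
        deadLetters.add(commandId);
        return false;
    }
}
```

Because the compensation may run several times before it succeeds, this only works safely when the receiving handler is idempotent.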
Saga Transaction States
```mermaid
stateDiagram-v2
    [*] --> Started
    Started --> Step1Done : inventory reserved
    Step1Done --> Step2Done : payment authorized
    Step2Done --> Completed : shipment scheduled
    Step1Done --> Compensating : payment failed
    Step2Done --> Compensating : shipment failed
    Compensating --> Compensated : steps undone
    Compensated --> [*]
    Completed --> [*]
```
The state diagram captures the full saga lifecycle from Started through sequential step completions to either Completed or the Compensating branch. The two compensation entry points (from Step1Done and Step2Done) show that compensation can be triggered at any stage of the forward path, not only at the final step. The critical insight is that Compensated is a legitimate terminal state, not a failure: a saga that compensates correctly has maintained distributed consistency even when the business happy path was not achievable.
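The same lifecycle can be encoded as an explicit transition table, so that an illegal move such as Completed to Compensating is rejected rather than silently applied. `SagaState` and `SagaTransitions` are hypothetical names for this sketch; the states and edges are copied from the diagram.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum SagaState { STARTED, STEP1_DONE, STEP2_DONE, COMPLETED, COMPENSATING, COMPENSATED }

/** Hypothetical guard for saga state changes, mirroring the state diagram. */
final class SagaTransitions {
    private static final Map<SagaState, Set<SagaState>> ALLOWED = new EnumMap<>(SagaState.class);

    static {
        ALLOWED.put(SagaState.STARTED, EnumSet.of(SagaState.STEP1_DONE));
        ALLOWED.put(SagaState.STEP1_DONE, EnumSet.of(SagaState.STEP2_DONE, SagaState.COMPENSATING));
        ALLOWED.put(SagaState.STEP2_DONE, EnumSet.of(SagaState.COMPLETED, SagaState.COMPENSATING));
        ALLOWED.put(SagaState.COMPENSATING, EnumSet.of(SagaState.COMPENSATED));
        ALLOWED.put(SagaState.COMPLETED, EnumSet.noneOf(SagaState.class));
        ALLOWED.put(SagaState.COMPENSATED, EnumSet.noneOf(SagaState.class));
    }

    /** True if the diagram contains an edge from one state to the other. */
    static boolean canMove(SagaState from, SagaState to) {
        return ALLOWED.get(from).contains(to);
    }

    /** Terminal states have no outgoing edges: COMPLETED and COMPENSATED. */
    static boolean isTerminal(SagaState state) {
        return ALLOWED.get(state).isEmpty();
    }
}
```

Both COMPLETED and COMPENSATED report as terminal here, which matches the point that a compensated saga has finished correctly.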
Decision Guide: When to Reach for Orchestration vs Choreography
| Situation | Recommendation |
| --- | --- |
| Workflow has 4+ steps with conditional branches | Orchestration: a central saga class prevents spaghetti event routing |
| Services are owned by separate teams with no shared state | Choreography: each team owns only its event listener and emitter |
| Full workflow audit trail and replay are required | Orchestration with Axon Server or Camunda; workflow state is queryable from a dashboard |
| Throughput is the primary constraint and latency SLO is loose | Choreography with Kafka: no orchestrator wakeup overhead per event |
| Compensation logic is conditional or order-dependent | Orchestration: the saga class makes compensation sequence explicit and testable |
Operator Field Note: Stuck Sagas and Dead Letters
Alert on saga active duration: a checkout saga open for more than five minutes is almost certainly stuck. Axon can be configured with a Dead Letter Queue (DLQ) for events whose handling keeps failing; whichever DLQ your stack uses, inspect the entry with its orderId, replay if the failure was transient, or escalate for manual data correction.
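The duration alert can be sketched as a scan over active sagas that flags any older than a threshold. `StuckSagaDetector` is a hypothetical helper, and the map of saga id to start time stands in for a query against whatever saga store is in use.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical check to run on a schedule against the active-saga store. */
final class StuckSagaDetector {
    /** Returns the ids of sagas that started more than maxAge before now. */
    static List<String> findStuck(Map<String, Instant> startedAtBySagaId,
                                  Instant now,
                                  Duration maxAge) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : startedAtBySagaId.entrySet()) {
            Duration age = Duration.between(entry.getValue(), now);
            if (age.compareTo(maxAge) > 0) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the method, keeps the check deterministic in tests; with a five-minute `maxAge`, a saga started seven minutes ago is flagged and one started two minutes ago is not.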
Lessons Learned
- Design compensations before happy paths. If you cannot name the compensating transaction for each step at design time, the saga is not production-ready.
- Idempotency is not optional. Brokers that guarantee at-least-once delivery will sooner or later redeliver an event or command. Every compensation handler must be safe to run twice with the same input.
- Choreography hides complexity; it does not eliminate it. Distributed tracing is mandatory for any choreography saga beyond two steps.
- Saga state is production data. Treat it like a database table β back it up, monitor it, and never discard it silently on a service restart.
TLDR: Summary & Key Takeaways
- A Saga replaces distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction that reverses the business effect. It is not a database rollback.
- Orchestration centralizes workflow control in one saga class; choreography distributes it across event-driven participants. Neither is universally superior.
- The Axon @Saga annotation, @SagaEventHandler, and SagaLifecycle.end() give orchestrated sagas durable state with minimal boilerplate, but that state must be backed by a persistent saga store, never kept in memory.
- Compensation order matters: release inventory before cancelling the order, or downstream reads see inconsistent state in the window between steps.
- Stuck sagas and failed compensations are operational realities. Monitor DLQ depth and saga active duration from day one.