Saga Pattern: Coordinating Distributed Transactions with Compensation

Model long-running workflows with compensating actions instead of fragile global transactions.

Architecture Patterns for Production Systems

Abstract Algorithms

·Mar 13, 2026·14 min read

AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.

TLDR: A Saga replaces fragile distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction. Use orchestration when workflow control needs a single brain; use choreography when services must stay loosely coupled.

📖 Why Distributed Transactions Break Microservices

A ride-booking app charged a user's credit card, then failed to dispatch a driver — with no automatic refund. The payment service completed its transaction successfully. The dispatch service threw a transient exception. No compensating action fired. The user was billed for a ride that never happened, and customer support had to issue a manual refund three hours later.

This is not a theoretical edge case. It is what happens in any distributed system when you treat a multi-step business operation as a series of independent API calls with no explicit rollback plan.

The Saga pattern is the engineering answer. Instead of a single atomic transaction spanning multiple services, which independently owned databases cannot provide without a distributed commit protocol, you design a sequence of local transactions where every step has a pre-defined compensating transaction: a new, forward-moving action that reverses the business effect of a prior step if a later step fails.

Step 1: Charge card      → success ✅
Step 2: Dispatch driver  → failure ❌
Compensate Step 1: Refund charge → fires automatically ✅

With a Saga, the user gets their refund. Without one, they don't — until someone notices.

In a microservices architecture, checkout, inventory, payment, and dispatch are separate services with separate databases. No single transaction coordinator owns all of them.

Two-Phase Commit (2PC) — the classical distributed transaction alternative — solves this with a coordinator that locks resources across every participant until they all vote "ready." At microservice scale, this creates three serious problems:

Lock contention. Every participant holds database locks while waiting for the global commit decision — a latency multiplier across unrelated services.
Coordinator as a single point of failure. If the coordinator crashes mid-protocol, participants are left in limbo holding locks indefinitely.
Tight coupling. Every service must implement the 2PC protocol, making independent deployments and polyglot databases nearly impossible.

A Saga avoids all three problems. Each local transaction is owned by exactly one service, commits immediately, and releases its locks, and a corresponding compensating transaction undoes its business effect if a later step fails.

💡 Think of booking a trip: flight, hotel, and car rental are confirmed separately. If the car rental fails, you compensate by cancelling hotel and flight explicitly, in reverse order.

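Sagas accept eventual consistency: the system may be temporarily inconsistent between steps, and it converges only if every required compensation eventually succeeds. That is why compensations must be retried, monitored, and designed as carefully as the forward steps.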
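To make the core mechanic concrete, here is a minimal in-memory sketch of the idea described above: run local steps in order and, when one fails, run the compensations of the already completed steps in reverse. The names `SagaStep`, `SimpleSagaRunner`, and `TripSagaDemo` are hypothetical and exist only for illustration; this is not how any particular framework implements sagas, and it deliberately ignores persistence and compensation failures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical types for illustration only; not a library API.
final class SagaStep {
    final String name;
    final Runnable action;        // the local transaction
    final Runnable compensation;  // business-level undo for that transaction

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

final class SimpleSagaRunner {
    /** Runs steps in order; on failure, compensates completed steps in reverse. Returns true on success. */
    static boolean run(List<SagaStep> steps, List<String> log) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                log.add("done:" + step.name);
                completed.push(step); // most recently completed step sits on top
            } catch (RuntimeException e) {
                log.add("failed:" + step.name);
                while (!completed.isEmpty()) {
                    SagaStep undo = completed.pop();
                    undo.compensation.run();
                    log.add("compensated:" + undo.name);
                }
                return false;
            }
        }
        return true;
    }
}

final class TripSagaDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<SagaStep> steps = List.of(
            new SagaStep("flight", () -> {}, () -> {}),
            new SagaStep("hotel", () -> {}, () -> {}),
            new SagaStep("car", () -> { throw new IllegalStateException("no cars"); }, () -> {}));
        boolean ok = SimpleSagaRunner.run(steps, log);
        System.out.println(ok + " " + log);
    }
}
```

In the trip example the car step fails, so the hotel is compensated before the flight. A production saga would also have to persist progress between steps and decide what happens when a compensation itself throws, which this sketch does not attempt.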

🔍 Two Flavors: Orchestration vs Choreography

There are two fundamentally different styles for coordinating saga participants:

Dimension	Orchestration	Choreography
Who decides the next step	Central saga class sends commands	Each service reacts to events it cares about
Coupling	Services coupled to orchestrator via command contracts	Services coupled only to shared event schemas
Workflow visibility	Full workflow visible in one place	Workflow is emergent — harder to trace end-to-end
Compensation control	Orchestrator explicitly issues compensation commands in order	Each service publishes failure events; others react
Best for	Complex conditional workflows, auditability, long-running processes	High-throughput pipelines, independently-owned services
Debugging	Easier — saga state is centralized and queryable	Requires distributed tracing across every participant

Neither style is universally better. Choose orchestration when the workflow has conditional branches, per-step retry policies, or must be auditable. Choose choreography when services are owned by different teams and maximum decoupling is the priority.

⚙️ How the Checkout Saga Flows Step by Step

The running example for this post: Checkout orchestrates inventory reservation, payment authorization, and shipment creation, then compensates inventory and payment on downstream failure.

flowchart TD
    A[OrderPlacedEvent] --> B[ReserveInventory]
    B -->|InventoryReservedEvent| C[AuthorizePayment]
    C -->|PaymentAuthorizedEvent| D[CreateShipment]
    D --> E[SagaEnd Order Confirmed]

    B -->|InventoryReservationFailedEvent| F[CancelOrder]
    F --> G[SagaEnd Cancelled: No Stock]

    C -->|PaymentFailedEvent| H[ReleaseInventory]
    H --> I[CancelOrder]
    I --> J[SagaEnd Cancelled: Payment Declined]

Every arrow in the diagram is either a command (orchestrator → service) or an event (service → orchestrator). The orchestrator never performs business logic — it routes, compensates, and terminates.

Step	Command sent	Success event	Failure event	Compensation action
1 — Inventory	`ReserveInventoryCommand`	`InventoryReservedEvent`	`InventoryReservationFailedEvent`	`CancelOrderCommand`
2 — Payment	`AuthorizePaymentCommand`	`PaymentAuthorizedEvent`	`PaymentFailedEvent`	`ReleaseInventoryCommand` → `CancelOrderCommand`
3 — Shipment	`CreateShipmentCommand`	(saga ends successfully)	—	—

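Notice that compensation for Step 2 runs in the reverse order of the forward steps: release inventory first, then cancel the order. If you cancel first, downstream reads may see a cancelled order that still holds a reservation. The example treats shipment creation as the final step with no failure path; a production saga would also need a shipment-failure event that reverses the payment and releases the inventory.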

📊 Compensation Flow: State Transitions and Event Routing

Each saga transition is driven by a single event or command. The diagram below captures every state the CheckoutSaga can reach, including both compensation paths, so the full workflow is traceable from one place rather than scattered across services.

flowchart TD
    S([Start]) --> INV[ReserveInventory]
    INV -->|InventoryReservedEvent| PAY[AuthorizePayment]
    INV -->|InventoryReservationFailedEvent| CANCEL1[CancelOrder]
    PAY -->|PaymentAuthorizedEvent| SHIP[CreateShipment]
    PAY -->|PaymentFailedEvent| RELV[ReleaseInventory]
    RELV --> CANCEL2[CancelOrder]
    SHIP --> END1([Confirmed])
    CANCEL1 --> END2([Cancelled: No Stock])
    CANCEL2 --> END3([Cancelled: Payment Failed])
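The routing in the diagram above can be written down as a plain lookup table, which is a useful way to review it before committing to a framework. The sketch below is an assumption-laden illustration: the event and command names come from the checkout example, while the `CheckoutRouting` class and its string-based representation are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical routing table mirroring the diagram; only the event and command
// names come from the checkout example.
final class CheckoutRouting {
    private static final Map<String, List<String>> NEXT_COMMANDS = Map.of(
        "OrderPlacedEvent", List.of("ReserveInventoryCommand"),
        "InventoryReservedEvent", List.of("AuthorizePaymentCommand"),
        "PaymentAuthorizedEvent", List.of("CreateShipmentCommand"),
        "InventoryReservationFailedEvent", List.of("CancelOrderCommand"),
        // Compensation path: release the reservation before cancelling the order
        "PaymentFailedEvent", List.of("ReleaseInventoryCommand", "CancelOrderCommand"));

    /** Returns the commands to dispatch, in order, for an incoming event. */
    static List<String> commandsFor(String eventName) {
        List<String> commands = NEXT_COMMANDS.get(eventName);
        if (commands == null) {
            throw new IllegalArgumentException("Unrouted event: " + eventName);
        }
        return commands;
    }
}
```

Rejecting unknown events loudly, rather than ignoring them, makes a missing route show up in tests instead of as a saga that silently never advances.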

🧠 Deep Dive: Saga Internals and Failure Semantics

Compensating Transactions Are Not Rollbacks

A compensating transaction is a new, forward-moving transaction that reverses the business effect of a prior step — it is not a database rollback. This distinction has real implications:

ReleaseInventoryCommand tells the inventory service "release reservation R-4892." The reservation record is not deleted; it is marked released with a timestamp and reason.
Compensations must be idempotent: if ReleaseInventoryCommand is delivered twice (broker retry, network hiccup), the second delivery must be a safe no-op, not a double-release.

Design rule: every command carries a stable business key (orderId, reservationId, paymentId). The receiving service checks whether the command has already been applied before executing side effects.
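As a rough illustration of that design rule, the sketch below deduplicates a release command by its `reservationId` before touching stock. `InventoryParticipant` and its in-memory set are hypothetical; a real service would most likely record the processed key in the same database transaction as the stock update so the check survives restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory participant, for illustration only.
final class InventoryParticipant {
    private final Set<String> releasedReservations = ConcurrentHashMap.newKeySet();
    private final AtomicInteger availableUnits;

    InventoryParticipant(int availableUnits) {
        this.availableUnits = new AtomicInteger(availableUnits);
    }

    /** Applies the release once per reservationId; returns false for a duplicate delivery. */
    boolean handleRelease(String reservationId, int units) {
        // add() returns false when the key is already present, so a redelivery is a no-op
        if (!releasedReservations.add(reservationId)) {
            return false;
        }
        availableUnits.addAndGet(units);
        return true;
    }

    int availableUnits() {
        return availableUnits.get();
    }
}
```

Delivering the release for reservation R-4892 twice adds the units back only once; the second call returns false and leaves the stock count unchanged.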

Performance Analysis

Sagas trade atomic rollback for latency decomposition. Each step adds at minimum one round-trip: command dispatch → service processing → event publication → orchestrator wakeup. For a 3-step checkout saga with 20 ms service latency per step:

Scenario	Estimated end-to-end latency
Happy path (3 steps, no retries)	60–90 ms saga latency + broker overhead
Compensation path (2 steps back out)	100–130 ms total
Step 2 retried once before failing	p95 can reach 300–500 ms, depending on the timeout and retry backoff

When p99 spikes without elevated error rates, look at the saga orchestrator host first: cold-start delays and GC pauses are common suspects, along with consumer lag on the broker.

🛠️ Java Frameworks for Saga Orchestration

Framework	Programming model	Sweet spot	Compensation support
Axon Framework	CQRS/Event Sourcing, `@Saga` annotation	DDD-style microservices, audit-heavy domains	Built-in: saga state persisted via JPA/JDBC; dead letter queue integration out of the box
Camunda / Zeebe	BPMN workflow engine	Long-running sagas, human approval tasks, ops dashboards	BPMN compensation events; visual process explorer for non-engineers
Temporal (Java SDK)	Workflow-as-code, durable execution	Polyglot teams, complex retry policies, durable workflow state with automatic activity retries	`try/catch` in workflow code; each activity's compensation is explicit Java logic

Eventuate Tram: Saga Coordination Without a Dedicated Framework

Eventuate Tram (a lightweight saga/transactional messaging library by Chris Richardson) implements sagas as SagaDefinition objects with explicit step sequences and compensating actions, without requiring an event-sourcing infrastructure like Axon Server.

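import io.eventuate.tram.commands.consumer.CommandWithDestination;
import io.eventuate.tram.sagas.orchestration.SagaDefinition;
import io.eventuate.tram.sagas.simpledsl.SimpleSaga;

import static io.eventuate.tram.commands.consumer.CommandWithDestinationBuilder.send;

// CheckoutSagaData and the command classes are application types, not shown here.
public class CheckoutSagaDefinition implements SimpleSaga<CheckoutSagaData> {

    // Steps run top to bottom; compensations of completed steps run bottom to top on failure.
    private final SagaDefinition<CheckoutSagaData> sagaDefinition =
        step()
            .invokeParticipant(this::reserveInventory)
            .withCompensation(this::releaseInventory)
        .step()
            .invokeParticipant(this::authorizePayment)
            .withCompensation(this::refundPayment)
        .step()
            .invokeParticipant(this::createShipment)  // last step: nothing after it can fail, so no compensation
        .build();

    @Override
    public SagaDefinition<CheckoutSagaData> getSagaDefinition() { return sagaDefinition; }

    // Forward action for step 1: ask the inventory service to reserve the items.
    private CommandWithDestination reserveInventory(CheckoutSagaData data) {
        return send(new ReserveInventoryCommand(data.getOrderId(), data.getItems()))
            .to("inventoryService").build();
    }

    // Compensation for step 1: release the reservation recorded in the saga data.
    private CommandWithDestination releaseInventory(CheckoutSagaData data) {
        return send(new ReleaseInventoryCommand(data.getOrderId(), data.getReservationId()))
            .to("inventoryService").build();
    }
    // authorizePayment, refundPayment, createShipment follow the same pattern
}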

SagaDefinition chains steps with .withCompensation(). On failure, Eventuate Tram executes the compensations of the completed steps in reverse step order. You do not write the orchestration loop yourself; the framework manages message dispatch and persists saga state via JDBC.

For a full deep-dive on Eventuate Tram saga state management and transactional messaging patterns, a dedicated follow-up post is planned.

This post uses Axon Framework for examples. Add the starter to your pom.xml:

<dependency>
  <groupId>org.axonframework</groupId>
  <artifactId>axon-spring-boot-starter</artifactId>
  <version>4.9.3</version>
</dependency>

🧪 Orchestration in Code: The Axon Checkout Saga

The saga class below is the single source of truth for the checkout workflow. Axon persists its fields between events, so the saga survives service restarts mid-flight without losing state.

import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.modelling.saga.SagaEventHandler;
import org.axonframework.modelling.saga.SagaLifecycle;
import org.axonframework.modelling.saga.StartSaga;
import org.axonframework.spring.stereotype.Saga;
import org.springframework.beans.factory.annotation.Autowired;

@Saga
public class CheckoutSaga {

    // Injected by Spring each time the saga is loaded; transient = not serialized with saga state
    @Autowired
    private transient CommandGateway commandGateway;

    // Fields below are serialized into the saga store between events
    private String orderId;
    private String reservationId;
    private boolean paymentAuthorized;

    @StartSaga
    @SagaEventHandler(associationProperty = "orderId")
    public void on(OrderPlacedEvent event) {
        this.orderId = event.orderId();
        commandGateway.send(new ReserveInventoryCommand(event.orderId(), event.items()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservedEvent event) {
        this.reservationId = event.reservationId();
        commandGateway.send(new AuthorizePaymentCommand(event.orderId(), event.totalCents()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentAuthorizedEvent event) {
        this.paymentAuthorized = true;
        commandGateway.send(new CreateShipmentCommand(event.orderId()));
        // The saga ends here, so a later shipment failure is not compensated in this example
        SagaLifecycle.end();
    }

    // Compensation: inventory step failed, nothing reserved yet, just cancel the order
    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservationFailedEvent event) {
        commandGateway.send(new CancelOrderCommand(event.orderId(), "inventory_unavailable"));
        SagaLifecycle.end();
    }

    // Compensation: payment failed, release reservation first, then cancel order
    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentFailedEvent event) {
        // send() is asynchronous and gives no ordering guarantee between two commands,
        // so block on the release before dispatching the cancel
        commandGateway.sendAndWait(new ReleaseInventoryCommand(orderId, reservationId));
        commandGateway.send(new CancelOrderCommand(orderId, "payment_failed"));
        SagaLifecycle.end();
    }
}

Key annotations: @StartSaga creates a new saga instance per trigger event. @SagaEventHandler(associationProperty = "orderId") routes events to the right instance by orderId. SagaLifecycle.end() removes the completed saga from the active store.

📊 Orchestration Saga: Order Flow

sequenceDiagram
    participant O as Orchestrator
    participant A as InventoryService
    participant B as PaymentService
    participant C as ShipmentService
    O->>A: ReserveInventory
    A-->>O: InventoryReserved
    O->>B: AuthorizePayment
    B-->>O: PaymentFailed
    O->>A: ReleaseInventory
    A-->>O: InventoryReleased
    Note over O: Saga compensated

The sequence diagram shows orchestration-style compensation in action: the orchestrator drives every step, so when PaymentFailed returns it knows exactly which prior step (inventory reservation) must be reversed and sends the compensation command directly. The Note over O: Saga compensated marker signals a legitimate terminal state reached through explicit business-level undo operations—not a database rollback, but a series of reversals that each service handles independently. The key advantage visible here is that the full compensation plan is centrally visible, making debugging and auditing straightforward.

🧪 Choreography in Code: Spring Kafka Payment Participant

In the choreography style there is no central saga class. Each service subscribes to events and publishes its own outcome. The workflow emerges from the chain of reactions across services.

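import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class PaymentSagaParticipant {

    private final KafkaTemplate<String, Object> kafka;
    private final PaymentService paymentService;  // application service, not shown here

    public PaymentSagaParticipant(KafkaTemplate<String, Object> kafka,
                                  PaymentService paymentService) {
        this.kafka = kafka;
        this.paymentService = paymentService;
    }

    // Payment reacts only after stock is reserved, matching the checkout step order
    @KafkaListener(topics = "inventory-reserved")
    public void onInventoryReserved(InventoryReservedEvent event) {
        try {
            paymentService.authorize(event.orderId(), event.totalCents());
            kafka.send("payment-authorized", new PaymentAuthorizedEvent(event.orderId()));
        } catch (PaymentDeclinedException e) {
            // Publishing the failure is what triggers compensation in the other services
            kafka.send("payment-failed",
                new PaymentFailedEvent(event.orderId(), e.getReason()));
        }
    }
}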

The inventory service similarly listens on the payment-failed topic and releases its reservation. The compensation chain is fully event-driven — no orchestrator wakes up, no shared state. The trade-off: to understand the full saga flow you must trace events across multiple services. A distributed tracing tool (Jaeger, Zipkin, OpenTelemetry) is not optional for choreography sagas beyond two steps.

📊 Choreography Saga: Event Chain

sequenceDiagram
    participant A as OrderService
    participant B as InventoryService
    participant C as PaymentService
    A->>B: OrderPlaced event
    B->>C: InventoryReserved event
    C->>A: PaymentFailed event
    A->>B: OrderCancelled event
    B-->>A: InventoryReleased event
    Note over A,C: Compensation via events

This sequence shows the fully decentralized choreography model: no orchestrator exists—each service reacts to the event it receives and emits the next event in the chain. When PaymentFailed arrives, OrderService publishes OrderCancelled, which triggers InventoryService to release its reservation autonomously without any central coordinator. The Note over A,C: Compensation via events annotation captures the key operational challenge: to understand why a saga compensated, you must trace events across every participating service, making distributed tracing a requirement rather than an optional convenience.
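The event chain in the diagram above can be replayed with a tiny in-memory stand-in for the broker, which helps show why tracing matters: no single class contains the whole flow. Everything here is a hypothetical sketch. `InMemoryBus` and `ChoreographyDemo` are illustrative names, the topic names are assumptions, and payment is hard-wired to fail so that the compensation path runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical synchronous stand-in for a broker: topics map to subscriber callbacks.
final class InMemoryBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final List<String> published = new ArrayList<>(); // trace of topics, in publish order

    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void publish(String topic, String orderId) {
        published.add(topic);
        for (Consumer<String> handler : subscribers.getOrDefault(topic, List.of())) {
            handler.accept(orderId);
        }
    }
}

final class ChoreographyDemo {
    /** Wires the three services from the diagram; payment always declines in this sketch. */
    static InMemoryBus wire() {
        InMemoryBus bus = new InMemoryBus();
        // InventoryService: reserves on order-placed, releases on order-cancelled
        bus.subscribe("order-placed", id -> bus.publish("inventory-reserved", id));
        bus.subscribe("order-cancelled", id -> bus.publish("inventory-released", id));
        // PaymentService: attempts authorization once stock is reserved
        bus.subscribe("inventory-reserved", id -> bus.publish("payment-failed", id));
        // OrderService: cancels the order when payment fails
        bus.subscribe("payment-failed", id -> bus.publish("order-cancelled", id));
        return bus;
    }

    public static void main(String[] args) {
        InMemoryBus bus = wire();
        bus.publish("order-placed", "o-1");
        System.out.println(bus.published);
    }
}
```

The `published` list plays the role a distributed trace plays in production: it is the only place where the full order of events is visible. A real broker delivers asynchronously and at least once, so the handlers would also need the idempotency checks discussed earlier.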

🌍 Real-World Applications

Sagas appear wherever a multi-step business process spans service boundaries:

Travel booking: flight → hotel → car rental. If the car is unavailable, compensate by cancelling hotel and flight in reverse order.
Bank money transfer: debit source → credit destination → notify parties. Compensation reverses the debit if the credit write fails.

⚖️ Trade-offs & Failure Modes: The Failure Modes That Bite First

Failure mode	Symptom	Root cause	First mitigation
Non-idempotent compensation	Inventory released twice; duplicate refund issued	Missing deduplication key on compensation command	Add `reservationId` / `paymentId` as idempotency key in each receiving service
Saga stuck mid-flight	Order stays "pending" indefinitely	Expected event never arrives (broker outage, service crash)	Implement saga timeout with automatic dead-letter escalation
Compensation also fails	Inventory released, order never cancelled	Downstream service down during compensation	Route failed compensation to DLQ; alert data-integrity team for manual review
Lost in-flight saga state	Orchestrator restart drops active sagas	Saga state kept in JVM heap only	Use Axon's JPA saga store or equivalent; never rely on in-memory state across restarts

No architecture pattern is free. Sagas trade atomic rollback for eventual consistency and explicit operational complexity.

📊 Saga Transaction States

stateDiagram-v2
    [*] --> Started
    Started --> Step1Done : inventory reserved
    Step1Done --> Step2Done : payment authorized
    Step2Done --> Completed : shipment scheduled
    Step1Done --> Compensating : payment failed
    Step2Done --> Compensating : shipment failed
    Compensating --> Compensated : steps undone
    Compensated --> [*]
    Completed --> [*]

The state diagram captures the full saga lifecycle from Started through sequential step completions to either Completed or the Compensating branch. The two compensation entry points—from Step1Done and Step2Done—show that compensation can be triggered at any stage of the forward path, not only at the final step. The critical insight is that Compensated is a legitimate terminal state, not a failure: a saga that compensates correctly has maintained distributed consistency even when the business happy path was not achievable.
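One way to keep an implementation honest against the state diagram above is to encode the allowed transitions and reject everything else. The sketch below assumes the six states from the diagram; the `SagaState` enum and `SagaLifecycleRules` class are hypothetical names, unrelated to Axon's `SagaLifecycle`.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// States taken from the diagram; the enum and class names are illustrative.
enum SagaState { STARTED, STEP1_DONE, STEP2_DONE, COMPENSATING, COMPLETED, COMPENSATED }

final class SagaLifecycleRules {
    private static final Map<SagaState, Set<SagaState>> ALLOWED = new EnumMap<>(SagaState.class);
    static {
        ALLOWED.put(SagaState.STARTED, EnumSet.of(SagaState.STEP1_DONE));
        ALLOWED.put(SagaState.STEP1_DONE, EnumSet.of(SagaState.STEP2_DONE, SagaState.COMPENSATING));
        ALLOWED.put(SagaState.STEP2_DONE, EnumSet.of(SagaState.COMPLETED, SagaState.COMPENSATING));
        ALLOWED.put(SagaState.COMPENSATING, EnumSet.of(SagaState.COMPENSATED));
        // Both end states are terminal: no outgoing transitions
        ALLOWED.put(SagaState.COMPLETED, EnumSet.noneOf(SagaState.class));
        ALLOWED.put(SagaState.COMPENSATED, EnumSet.noneOf(SagaState.class));
    }

    /** True if the diagram contains an edge from one state to the other. */
    static boolean canTransition(SagaState from, SagaState to) {
        return ALLOWED.get(from).contains(to);
    }

    /** True for states with no outgoing edges. */
    static boolean isTerminal(SagaState state) {
        return ALLOWED.get(state).isEmpty();
    }
}
```

Treating `COMPENSATED` as terminal in the same way as `COMPLETED` reflects the point made above: a compensated saga has finished correctly, it just did not finish on the happy path.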

🧭 Decision Guide: When to Reach for Orchestration vs Choreography

Situation	Recommendation
Workflow has 4+ steps with conditional branches	Orchestration — a central saga class prevents spaghetti event routing
Services are owned by separate teams with no shared state	Choreography — each team owns only its event listener and emitter
Full workflow audit trail and replay are required	Orchestration with Axon Server or Camunda; workflow state is queryable from a dashboard
Throughput is the primary constraint and latency SLO is loose	Choreography with Kafka — no orchestrator wakeup overhead per event
Compensation logic is conditional or order-dependent	Orchestration — the saga class makes compensation sequence explicit and testable

🚨 Operator Field Note: Stuck Sagas and Dead Letters

Alert on saga active duration: a checkout saga open for more than five minutes is almost certainly stuck. Configure retries so that messages which exhaust them land in a dead-letter queue (DLQ) instead of being dropped. Then inspect the entry by its orderId, replay it if the failure was transient, or escalate for manual data correction.
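The duration alert can be approximated with a periodic scan over active sagas. The sketch below is a hypothetical helper, not an Axon feature: `StuckSagaDetector` and its input map of orderId to start time are assumptions, and in practice the start times would come from your saga store or metrics pipeline.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical monitor, for illustration only.
final class StuckSagaDetector {
    private final Duration threshold;

    StuckSagaDetector(Duration threshold) {
        this.threshold = threshold;
    }

    /** Returns the orderIds of active sagas that have been open longer than the threshold. */
    List<String> findStuck(Map<String, Instant> activeSagaStartTimes, Instant now) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : activeSagaStartTimes.entrySet()) {
            Duration open = Duration.between(entry.getValue(), now);
            if (open.compareTo(threshold) > 0) {
                stuck.add(entry.getKey());
            }
        }
        stuck.sort(String::compareTo); // deterministic order for alerts and tests
        return stuck;
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant.now()` inside the method, keeps the check easy to test with fixed timestamps.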

📚 Lessons Learned

Design compensations before happy paths. If you cannot name the compensating transaction for each step at design time, the saga is not production-ready.
Idempotency is not optional. With at-least-once delivery, which is what durable message brokers typically provide, events and commands can be redelivered at any time. Every compensation handler must be safe to run twice with the same input.
Choreography hides complexity, it does not eliminate it. Distributed tracing is mandatory for any choreography saga beyond two steps.
Saga state is production data. Treat it like a database table — back it up, monitor it, and never discard it silently on a service restart.

📌 TLDR: Summary & Key Takeaways

A Saga replaces distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction that reverses the business effect — not a database rollback.
Orchestration centralizes workflow control in one saga class; choreography distributes it across event-driven participants. Neither is universally superior.
The Axon @Saga annotation, @SagaEventHandler, and SagaLifecycle.end() give orchestrated sagas durable state with minimal boilerplate, but only when that state is backed by a persistent saga store rather than kept in memory.
Compensation order matters: release inventory before cancelling the order, or downstream reads see inconsistent state in the window between steps.
Stuck sagas and failed compensations are operational realities. Monitor DLQ depth and saga active duration from day one.
