
Saga Pattern: Coordinating Distributed Transactions with Compensation

Model long-running workflows with compensating actions instead of fragile global transactions.

Abstract Algorithms · 14 min read

AI-assisted content.

TLDR: A Saga replaces fragile distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction. Use orchestration when workflow control needs a single brain; use choreography when services must stay loosely coupled.

📖 Why Distributed Transactions Break Microservices

A ride-booking app charged a user's credit card, then failed to dispatch a driver, with no automatic refund. The payment service completed its transaction successfully. The dispatch service threw a transient exception. No compensating action fired. The user was billed for a ride that never happened, and customer support had to issue a manual refund three hours later.

This is not a theoretical edge case. It is what happens in any distributed system when you treat a multi-step business operation as a series of independent API calls with no explicit rollback plan.

The Saga pattern is the engineering answer. Instead of a single atomic transaction spanning multiple services (which microservices with separate databases cannot support), you design a sequence of local transactions where every step has a pre-defined compensating transaction: a new, forward-moving action that reverses the business effect of a prior step if a later step fails.

Step 1: Charge card      → success ✅
Step 2: Dispatch driver  → failure ❌
Compensate Step 1: Refund charge → fires automatically ✅

With a Saga, the user gets their refund. Without one, they don't, at least not until someone notices.

In a microservices architecture, checkout, inventory, payment, and dispatch are separate services with separate databases. No single transaction coordinator owns all of them.

Two-Phase Commit (2PC), the classical distributed transaction protocol, solves this with a coordinator that locks resources across every participant until they all vote "ready." At microservice scale, this creates three serious problems:

  1. Lock contention. Every participant holds database locks while waiting for the global commit decision, which multiplies latency across otherwise unrelated services.
  2. Coordinator as a single point of failure. If the coordinator crashes mid-protocol, participants are left in limbo holding locks indefinitely.
  3. Tight coupling. Every service must implement the 2PC protocol, making independent deployments and polyglot databases nearly impossible.

A Saga is the practical alternative: a sequence of local transactions, each owned by exactly one service, where every step has a corresponding compensating transaction that undoes its business effect if a later step fails.

💡 Think of booking a trip: flight, hotel, and car rental are confirmed separately. If the car rental fails, you compensate by cancelling hotel and flight explicitly, in reverse order.

Sagas accept eventual consistency: the system may be temporarily inconsistent between steps, but well-designed compensation brings it back to a consistent state, provided every compensation eventually succeeds.
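The reverse-order compensation in the trip example can be sketched in plain Java. This is a minimal in-memory illustration, not a production saga engine: `SimpleSagaRunner` and `SagaStep` are hypothetical names, and a real implementation would persist progress and retry failed compensations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical names for illustration; not taken from any framework.
final class SimpleSagaRunner {
    /** Runs steps in order; on failure, compensates completed steps in reverse. Returns true on success. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step); // most recently completed step sits on top
            } catch (RuntimeException failure) {
                // The failed step never completed, so only earlier steps are undone.
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<SagaStep> trip = List.of(
            new SagaStep(() -> log.add("book flight"), () -> log.add("cancel flight")),
            new SagaStep(() -> log.add("book hotel"), () -> log.add("cancel hotel")),
            new SagaStep(() -> { throw new IllegalStateException("no cars available"); },
                         () -> log.add("cancel car")));
        boolean ok = run(trip);
        // prints: false [book flight, book hotel, cancel hotel, cancel flight]
        System.out.println(ok + " " + log);
    }
}

final class SagaStep {
    final Runnable action;        // local transaction owned by one service
    final Runnable compensation;  // business-level undo for that transaction

    SagaStep(Runnable action, Runnable compensation) {
        this.action = action;
        this.compensation = compensation;
    }
}
```

The stack is the whole trick: pushing each completed step and popping on failure yields reverse order without any extra bookkeeping.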

🔍 Two Flavors: Orchestration vs Choreography

There are two fundamentally different styles for coordinating saga participants:

| Dimension | Orchestration | Choreography |
| --- | --- | --- |
| Who decides the next step | Central saga class sends commands | Each service reacts to events it cares about |
| Coupling | Services coupled to orchestrator via command contracts | Services coupled only to shared event schemas |
| Workflow visibility | Full workflow visible in one place | Workflow is emergent and harder to trace end-to-end |
| Compensation control | Orchestrator explicitly issues compensation commands in order | Each service publishes failure events; others react |
| Best for | Complex conditional workflows, auditability, long-running processes | High-throughput pipelines, independently-owned services |
| Debugging | Easier: saga state is centralized and queryable | Requires distributed tracing across every participant |

Neither style is universally better. Choose orchestration when the workflow has conditional branches, per-step retry policies, or must be auditable. Choose choreography when services are owned by different teams and maximum decoupling is the priority.

⚙️ How the Checkout Saga Flows Step by Step

The running example for this post: Checkout orchestrates inventory reservation, payment authorization, and shipment creation, then compensates inventory and payment on downstream failure.

flowchart TD
    A[OrderPlacedEvent] --> B[ReserveInventory]
    B -->|InventoryReservedEvent| C[AuthorizePayment]
    C -->|PaymentAuthorizedEvent| D[CreateShipment]
    D --> E[SagaEnd - Order Confirmed]

    B -->|InventoryReservationFailedEvent| F[CancelOrder]
    F --> G[SagaEnd - Cancelled: No Stock]

    C -->|PaymentFailedEvent| H[ReleaseInventory]
    H --> I[CancelOrder]
    I --> J[SagaEnd - Cancelled: Payment Declined]

Every arrow in the diagram is either a command (orchestrator → service) or an event (service → orchestrator). The orchestrator never performs business logic; it routes, compensates, and terminates.

| Step | Command sent | Success event | Failure event | Compensation action |
| --- | --- | --- | --- | --- |
| 1: Inventory | ReserveInventoryCommand | InventoryReservedEvent | InventoryReservationFailedEvent | CancelOrderCommand |
| 2: Payment | AuthorizePaymentCommand | PaymentAuthorizedEvent | PaymentFailedEvent | ReleaseInventoryCommand → CancelOrderCommand |
| 3: Shipment | CreateShipmentCommand | (saga ends successfully) | - | - |

Notice that compensation for Step 2 must happen in reverse order: release inventory first, then cancel the order. If you cancel first, downstream reads of order state may see an inconsistency before the reservation is released.
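The routing in the table can be written down as a pure function from an incoming event to the ordered list of commands the orchestrator sends next. The sketch below is only an illustration: `CheckoutRouting` is a hypothetical helper, and events and commands are represented by their names as plain strings rather than real message classes.

```java
import java.util.List;

// Hypothetical helper: maps each event from the table to the commands sent in response.
final class CheckoutRouting {
    static List<String> nextCommands(String eventName) {
        switch (eventName) {
            case "OrderPlacedEvent":
                return List.of("ReserveInventoryCommand");
            case "InventoryReservedEvent":
                return List.of("AuthorizePaymentCommand");
            case "PaymentAuthorizedEvent":
                return List.of("CreateShipmentCommand");
            case "InventoryReservationFailedEvent":
                // Nothing was reserved, so only the order itself is cancelled.
                return List.of("CancelOrderCommand");
            case "PaymentFailedEvent":
                // Reverse order: undo the reservation before cancelling the order.
                return List.of("ReleaseInventoryCommand", "CancelOrderCommand");
            default:
                throw new IllegalArgumentException("Unknown event: " + eventName);
        }
    }

    public static void main(String[] args) {
        // prints: [ReleaseInventoryCommand, CancelOrderCommand]
        System.out.println(nextCommands("PaymentFailedEvent"));
    }
}
```

Keeping the routing free of side effects makes the compensation order easy to unit test without a broker or a framework.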

📊 Compensation Flow: State Transitions and Event Routing

Each saga transition is driven by a single event or command. The diagram below captures every state the CheckoutSaga can reach, including both compensation paths, so the full workflow is traceable from one place rather than scattered across services.

flowchart TD
    S([Start]) --> INV[ReserveInventory]
    INV -->|InventoryReservedEvent| PAY[AuthorizePayment]
    INV -->|InventoryReservationFailedEvent| CANCEL1[CancelOrder]
    PAY -->|PaymentAuthorizedEvent| SHIP[CreateShipment]
    PAY -->|PaymentFailedEvent| RELV[ReleaseInventory]
    RELV --> CANCEL2[CancelOrder]
    SHIP --> END1([Confirmed])
    CANCEL1 --> END2([Cancelled: No Stock])
    CANCEL2 --> END3([Cancelled: Payment Failed])

🧠 Deep Dive: Saga Internals and Failure Semantics

Compensating Transactions Are Not Rollbacks

A compensating transaction is a new, forward-moving transaction that reverses the business effect of a prior step; it is not a database rollback. This distinction has real implications:

  • ReleaseInventoryCommand tells the inventory service "release reservation R-4892." The reservation record is not deleted; it is marked released with a timestamp and reason.
  • Compensations must be idempotent: if ReleaseInventoryCommand is delivered twice (broker retry, network hiccup), the second delivery must be a safe no-op, not a double-release.

Design rule: every command carries a stable business key (orderId, reservationId, paymentId). The receiving service checks whether the command has already been applied before executing side effects.
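The design rule can be illustrated with a small in-memory inventory service that deduplicates on the reservation ID. This is a sketch under stated assumptions: `IdempotentInventory` is a hypothetical class, and a real service would keep this state in its own database and update it in the same local transaction as the stock change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory inventory service keyed by reservationId.
final class IdempotentInventory {
    private final Map<String, Integer> quantityByReservationId = new HashMap<>();
    private final Map<String, String> releaseReasonByReservationId = new HashMap<>();
    private int available;

    IdempotentInventory(int available) {
        this.available = available;
    }

    void reserve(String reservationId, int quantity) {
        if (quantityByReservationId.containsKey(reservationId)) {
            return; // duplicate delivery: already applied
        }
        if (quantity > available) {
            throw new IllegalStateException("insufficient stock");
        }
        available -= quantity;
        quantityByReservationId.put(reservationId, quantity);
    }

    /** Returns true if this call released stock, false if it was a safe no-op. */
    boolean release(String reservationId, String reason) {
        Integer quantity = quantityByReservationId.get(reservationId);
        if (quantity == null || releaseReasonByReservationId.containsKey(reservationId)) {
            return false; // unknown reservation, or already released
        }
        // The reservation record is kept and marked released, not deleted.
        releaseReasonByReservationId.put(reservationId, reason);
        available += quantity;
        return true;
    }

    int available() {
        return available;
    }

    public static void main(String[] args) {
        IdempotentInventory inventory = new IdempotentInventory(10);
        inventory.reserve("R-4892", 3);
        System.out.println(inventory.release("R-4892", "payment_failed")); // true
        System.out.println(inventory.release("R-4892", "payment_failed")); // false
        System.out.println(inventory.available());                         // 10
    }
}
```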

Performance Analysis

Sagas trade atomic rollback for a workflow whose latency is the sum of its steps. Each step adds at minimum one round-trip: command dispatch → service processing → event publication → orchestrator wakeup. For a 3-step checkout saga with 20 ms service latency per step:

| Scenario | Estimated end-to-end latency |
| --- | --- |
| Happy path (3 steps, no retries) | 60–90 ms saga latency + broker overhead |
| Compensation path (2 steps back out) | 100–130 ms total |
| Step 2 retried once before failing | p95 easily reaches 300–500 ms |

When p99 spikes without elevated error rates, cold-start delays or GC pauses on the saga orchestrator host are the primary suspects.

🛠️ Java Frameworks for Saga Orchestration

| Framework | Programming model | Sweet spot | Compensation support |
| --- | --- | --- | --- |
| Axon Framework | CQRS/Event Sourcing, @Saga annotation | DDD-style microservices, audit-heavy domains | Built-in: saga state persisted via JPA/JDBC; dead-letter queue support for event processors |
| Camunda / Zeebe | BPMN workflow engine | Long-running sagas, human approval tasks, ops dashboards | BPMN compensation events; visual process explorer for non-engineers |
| Temporal (Java SDK) | Workflow-as-code, durable execution | Polyglot teams, complex retry policies, exactly-once semantics | try/catch in workflow code; each activity's compensation is explicit Java logic |

Eventuate Tram: Saga Coordination Without a Dedicated Framework

Eventuate Tram (a lightweight saga/transactional messaging library by Chris Richardson) implements sagas as SagaDefinition objects with explicit step sequences and compensating actions, without requiring an event-sourcing infrastructure like Axon Server.

import io.eventuate.tram.commands.consumer.CommandWithDestination;
import io.eventuate.tram.sagas.orchestration.SagaDefinition;
import io.eventuate.tram.sagas.simpledsl.SimpleSaga;

import static io.eventuate.tram.commands.consumer.CommandWithDestinationBuilder.send;

// CheckoutSagaData and the command classes are application types defined elsewhere.
public class CheckoutSagaDefinition implements SimpleSaga<CheckoutSagaData> {

    // Steps run top to bottom; on failure, compensations of completed steps run in reverse.
    private final SagaDefinition<CheckoutSagaData> sagaDefinition =
        step()
            .invokeParticipant(this::reserveInventory)
            .withCompensation(this::releaseInventory)
        .step()
            .invokeParticipant(this::authorizePayment)
            .withCompensation(this::refundPayment)
        .step()
            .invokeParticipant(this::createShipment)
        .build();

    @Override
    public SagaDefinition<CheckoutSagaData> getSagaDefinition() { return sagaDefinition; }

    private CommandWithDestination reserveInventory(CheckoutSagaData data) {
        return send(new ReserveInventoryCommand(data.getOrderId(), data.getItems()))
            .to("inventoryService").build();
    }

    private CommandWithDestination releaseInventory(CheckoutSagaData data) {
        return send(new ReleaseInventoryCommand(data.getOrderId(), data.getReservationId()))
            .to("inventoryService").build();
    }
    // authorizePayment, refundPayment, createShipment follow the same pattern
}

SagaDefinition chains steps with .withCompensation(); Eventuate Tram executes the compensations of completed steps in reverse order automatically on failure. You do not write an orchestrator event loop yourself: the framework manages message dispatch and state persistence via JDBC.

For a full deep-dive on Eventuate Tram saga state management and transactional messaging patterns, a dedicated follow-up post is planned.

This post uses Axon Framework for examples. Add the starter to your pom.xml:

<dependency>
  <groupId>org.axonframework</groupId>
  <artifactId>axon-spring-boot-starter</artifactId>
  <version>4.9.3</version>
</dependency>

🧪 Orchestration in Code: The Axon Checkout Saga

The saga class below is the single source of truth for the checkout workflow. Axon persists its fields between events, so the saga survives service restarts mid-flight without losing state.

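import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.modelling.saga.SagaEventHandler;
import org.axonframework.modelling.saga.SagaLifecycle;
import org.axonframework.modelling.saga.StartSaga;
import org.axonframework.spring.stereotype.Saga;
// @Inject comes from javax.inject or jakarta.inject, depending on your Spring Boot version.

@Saga
public class CheckoutSaga {

    @Inject
    private transient CommandGateway commandGateway;  // transient = not serialized with saga state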

    private String orderId;
    private String reservationId;
    private boolean paymentAuthorized;

    @StartSaga
    @SagaEventHandler(associationProperty = "orderId")
    public void on(OrderPlacedEvent event) {
        this.orderId = event.orderId();
        commandGateway.send(new ReserveInventoryCommand(event.orderId(), event.items()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservedEvent event) {
        this.reservationId = event.reservationId();
        commandGateway.send(new AuthorizePaymentCommand(event.orderId(), event.totalCents()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentAuthorizedEvent event) {
        this.paymentAuthorized = true;
        commandGateway.send(new CreateShipmentCommand(event.orderId()));
        SagaLifecycle.end();
    }

    // Compensation: inventory step failed, so nothing is reserved yet; just cancel the order
    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservationFailedEvent event) {
        commandGateway.send(new CancelOrderCommand(event.orderId(), "inventory_unavailable"));
        SagaLifecycle.end();
    }

    // Compensation: payment failed, so release the reservation first, then cancel the order.
    // Note: send() is asynchronous, so this fixes the dispatch order, not the completion order.
    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentFailedEvent event) {
        commandGateway.send(new ReleaseInventoryCommand(orderId, reservationId));
        commandGateway.send(new CancelOrderCommand(orderId, "payment_failed"));
        SagaLifecycle.end();
    }
}

Key annotations: @StartSaga creates a new saga instance per trigger event. @SagaEventHandler(associationProperty = "orderId") routes events to the right instance by orderId. SagaLifecycle.end() removes the completed saga from the active store.

📊 Orchestration Saga: Order Flow

sequenceDiagram
    participant O as Orchestrator
    participant A as InventoryService
    participant B as PaymentService
    participant C as ShipmentService
    O->>A: ReserveInventory
    A-->>O: InventoryReserved
    O->>B: AuthorizePayment
    B-->>O: PaymentFailed
    O->>A: ReleaseInventory
    A-->>O: InventoryReleased
    Note over O: Saga compensated

The sequence diagram shows orchestration-style compensation in action: the orchestrator drives every step, so when PaymentFailed returns it knows exactly which prior step (inventory reservation) must be reversed and sends the compensation command directly. The Note over O: Saga compensated marker signals a legitimate terminal state reached through explicit business-level undo operations: not a database rollback, but a series of reversals that each service handles independently. The key advantage visible here is that the full compensation plan is centrally visible, making debugging and auditing straightforward.

🧪 Choreography in Code: Spring Kafka Payment Participant

In the choreography style there is no central saga class. Each service subscribes to events and publishes its own outcome. The workflow emerges from the chain of reactions across services.

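import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// PaymentService, PaymentDeclinedException, and the event classes are application types defined elsewhere.
@Component
public class PaymentSagaParticipant {

    private final KafkaTemplate<String, Object> kafka;
    private final PaymentService paymentService;  // local payment logic owned by this service

    public PaymentSagaParticipant(KafkaTemplate<String, Object> kafka, PaymentService paymentService) {
        this.kafka = kafka;
        this.paymentService = paymentService;
    }

    @KafkaListener(topics = "order-placed")
    public void onOrderPlaced(OrderPlacedEvent event) {
        try {
            paymentService.authorize(event.orderId(), event.totalCents());
            kafka.send("payment-authorized", new PaymentAuthorizedEvent(event.orderId()));
        } catch (PaymentDeclinedException e) {
            // Publish the failure so other participants can compensate.
            kafka.send("payment-failed",
                new PaymentFailedEvent(event.orderId(), e.getReason()));
        }
    }
}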

The inventory service similarly listens on the payment-failed topic and releases its reservation. The compensation chain is fully event-driven: no orchestrator wakes up, and there is no shared state. The trade-off: to understand the full saga flow you must trace events across multiple services. A distributed tracing tool (Jaeger, Zipkin, OpenTelemetry) is not optional for choreography sagas beyond two steps.

📊 Choreography Saga: Event Chain

sequenceDiagram
    participant A as OrderService
    participant B as InventoryService
    participant C as PaymentService
    A->>B: OrderPlaced event
    B->>C: InventoryReserved event
    C->>A: PaymentFailed event
    A->>B: OrderCancelled event
    B-->>A: InventoryReleased event
    Note over A,C: Compensation via events

This sequence shows the fully decentralized choreography model: no orchestrator exists, and each service reacts to the event it receives and emits the next event in the chain. When PaymentFailed arrives, OrderService publishes OrderCancelled, which triggers InventoryService to release its reservation autonomously without any central coordinator. The Note over A,C: Compensation via events annotation captures the key operational challenge: to understand why a saga compensated, you must trace events across every participating service, making distributed tracing a requirement rather than an optional convenience.
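The event chain in the sequence diagram can be reproduced with a tiny in-memory publish/subscribe bus. This is only a sketch that follows the diagram's ordering: `InMemoryBus` and `ChoreographyDemo` are hypothetical names standing in for a real broker such as Kafka, the topic names `inventory-reserved`, `order-cancelled`, and `inventory-released` are assumed, and delivery is synchronous, which a real broker is not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class ChoreographyDemo {
    /** Wires three services to the bus and returns the topics published, in order. */
    static List<String> run(boolean paymentSucceeds) {
        InMemoryBus bus = new InMemoryBus();
        // InventoryService: reserves on order-placed, releases on order-cancelled.
        bus.subscribe("order-placed", id -> bus.publish("inventory-reserved", id));
        bus.subscribe("order-cancelled", id -> bus.publish("inventory-released", id));
        // PaymentService: reacts to the reservation and reports its own outcome.
        bus.subscribe("inventory-reserved",
            id -> bus.publish(paymentSucceeds ? "payment-authorized" : "payment-failed", id));
        // OrderService: cancels the order when payment fails.
        bus.subscribe("payment-failed", id -> bus.publish("order-cancelled", id));

        bus.publish("order-placed", "order-1");
        return bus.trace;
    }

    public static void main(String[] args) {
        // prints: [order-placed, inventory-reserved, payment-failed, order-cancelled, inventory-released]
        System.out.println(run(false));
    }
}

// Hypothetical stand-in for a broker: topic name -> subscribers, delivered synchronously.
final class InMemoryBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final List<String> trace = new ArrayList<>(); // every published topic, in order

    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void publish(String topic, String orderId) {
        trace.add(topic);
        for (Consumer<String> handler : subscribers.getOrDefault(topic, List.of())) {
            handler.accept(orderId);
        }
    }
}
```

No single class in this sketch contains the workflow; it only becomes visible in the recorded trace, which is the role distributed tracing plays in production.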

🌍 Real-World Applications

Sagas appear wherever a multi-step business process spans service boundaries:

  • Travel booking: flight → hotel → car rental. If the car is unavailable, compensate by cancelling hotel and flight in reverse order.
  • Bank money transfer: debit source → credit destination → notify parties. Compensation reverses the debit if the credit write fails.

⚖️ Saga Trade-offs and the Failure Modes That Bite First

| Failure mode | Symptom | Root cause | First mitigation |
| --- | --- | --- | --- |
| Non-idempotent compensation | Inventory released twice; duplicate refund issued | Missing deduplication key on compensation command | Add reservationId / paymentId as idempotency key in each receiving service |
| Saga stuck mid-flight | Order stays "pending" indefinitely | Expected event never arrives (broker outage, service crash) | Implement saga timeout with automatic dead-letter escalation |
| Compensation also fails | Inventory released, order never cancelled | Downstream service down during compensation | Route failed compensation to DLQ; alert data-integrity team for manual review |
| Lost in-flight saga state | Orchestrator restart drops active sagas | Saga state kept in JVM heap only | Use Axon's JPA saga store or equivalent; never rely on in-memory state across restarts |

No architecture pattern is free. Sagas trade atomic rollback for eventual consistency and explicit operational complexity.
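The "compensation also fails" row can be made concrete with a bounded retry that parks the command for manual review instead of dropping it. This is a minimal sketch: `CompensationDispatcher` is a hypothetical class, the dead-letter store is just an in-memory list, and a real system would back off between attempts and persist the parked entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: retries a compensation, then dead-letters it if every attempt fails.
final class CompensationDispatcher {
    final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    CompensationDispatcher(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if the compensation succeeded within maxAttempts. */
    boolean dispatch(String commandId, Runnable compensation) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                compensation.run();
                return true;
            } catch (RuntimeException e) {
                // A real system would log the cause and wait before the next attempt.
            }
        }
        deadLetters.add(commandId); // parked for manual review, never dropped silently
        return false;
    }

    public static void main(String[] args) {
        CompensationDispatcher dispatcher = new CompensationDispatcher(3);
        boolean ok = dispatcher.dispatch("cancel-order-1", () -> {
            throw new IllegalStateException("order service unavailable");
        });
        // prints: false [cancel-order-1]
        System.out.println(ok + " " + dispatcher.deadLetters);
    }
}
```

Because the compensation may run several times before it succeeds, this only works when the receiving handler is idempotent.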

📊 Saga Transaction States

stateDiagram-v2
    [*] --> Started
    Started --> Step1Done : inventory reserved
    Step1Done --> Step2Done : payment authorized
    Step2Done --> Completed : shipment scheduled
    Step1Done --> Compensating : payment failed
    Step2Done --> Compensating : shipment failed
    Compensating --> Compensated : steps undone
    Compensated --> [*]
    Completed --> [*]

The state diagram captures the full saga lifecycle from Started through sequential step completions to either Completed or the Compensating branch. The two compensation entry points, from Step1Done and Step2Done, show that compensation can be triggered at any stage of the forward path, not only at the final step. The critical insight is that Compensated is a legitimate terminal state, not a failure: a saga that compensates correctly has maintained distributed consistency even when the business happy path was not achievable.
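The transitions in the state diagram can be enforced with a small state machine that rejects any move the diagram does not allow. This is a sketch only: `SagaStateMachine` is a hypothetical class, the state names are adapted from the diagram, and a real saga would persist the current state after every transition.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical state machine mirroring the diagram's states and arrows.
final class SagaStateMachine {
    enum State { STARTED, STEP1_DONE, STEP2_DONE, COMPLETED, COMPENSATING, COMPENSATED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.STARTED, EnumSet.of(State.STEP1_DONE));
        ALLOWED.put(State.STEP1_DONE, EnumSet.of(State.STEP2_DONE, State.COMPENSATING));
        ALLOWED.put(State.STEP2_DONE, EnumSet.of(State.COMPLETED, State.COMPENSATING));
        ALLOWED.put(State.COMPENSATING, EnumSet.of(State.COMPENSATED));
        ALLOWED.put(State.COMPLETED, EnumSet.noneOf(State.class));   // terminal
        ALLOWED.put(State.COMPENSATED, EnumSet.noneOf(State.class)); // terminal
    }

    private State current = State.STARTED;

    State current() {
        return current;
    }

    boolean isTerminal() {
        return ALLOWED.get(current).isEmpty();
    }

    void transitionTo(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not allowed");
        }
        current = next;
    }

    public static void main(String[] args) {
        SagaStateMachine saga = new SagaStateMachine();
        saga.transitionTo(State.STEP1_DONE);   // inventory reserved
        saga.transitionTo(State.COMPENSATING); // payment failed
        saga.transitionTo(State.COMPENSATED);  // steps undone
        System.out.println(saga.current() + " terminal=" + saga.isTerminal());
    }
}
```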

🧭 Decision Guide: When to Reach for Orchestration vs Choreography

| Situation | Recommendation |
| --- | --- |
| Workflow has 4+ steps with conditional branches | Orchestration: a central saga class prevents spaghetti event routing |
| Services are owned by separate teams with no shared state | Choreography: each team owns only its event listener and emitter |
| Full workflow audit trail and replay are required | Orchestration with Axon Server or Camunda; workflow state is queryable from a dashboard |
| Throughput is the primary constraint and latency SLO is loose | Choreography with Kafka: no orchestrator wakeup overhead per event |
| Compensation logic is conditional or order-dependent | Orchestration: the saga class makes compensation sequence explicit and testable |

🚨 Operator Field Note: Stuck Sagas and Dead Letters

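Alert on saga active duration: a checkout saga open for more than five minutes is almost certainly stuck. Park messages that exhaust their retries in a Dead Letter Queue (DLQ). Axon's built-in dead-letter queue covers events that fail inside an event processor, so commands that exhaust retries need equivalent handling on the sending side. Inspect each entry with its orderId, replay it if the failure was transient, or escalate for manual data correction.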
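The duration alert can be sketched as a periodic check over the start times of active sagas. This is an illustration under stated assumptions: `StuckSagaDetector` is a hypothetical class that tracks sagas in memory, whereas a real check would query the persistent saga store or a metrics system.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical monitor: reports sagas that have been active longer than a threshold.
final class StuckSagaDetector {
    private final Map<String, Instant> startedAtByOrderId = new LinkedHashMap<>();
    private final Duration threshold;

    StuckSagaDetector(Duration threshold) {
        this.threshold = threshold;
    }

    void sagaStarted(String orderId, Instant at) {
        startedAtByOrderId.put(orderId, at);
    }

    void sagaEnded(String orderId) {
        startedAtByOrderId.remove(orderId);
    }

    /** Returns the orderIds of sagas still active for longer than the threshold at the given time. */
    List<String> stuckAt(Instant now) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : startedAtByOrderId.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(threshold) > 0) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        StuckSagaDetector detector = new StuckSagaDetector(Duration.ofMinutes(5));
        detector.sagaStarted("order-1", t0);
        detector.sagaStarted("order-2", t0.plus(Duration.ofMinutes(4)));
        // prints: [order-1]
        System.out.println(detector.stuckAt(t0.plus(Duration.ofMinutes(6))));
    }
}
```

Passing the current time in as a parameter keeps the check deterministic and easy to test.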

📚 Lessons Learned

  • Design compensations before happy paths. If you cannot name the compensating transaction for each step at design time, the saga is not production-ready.
  • Idempotency is not optional. Durable message brokers typically guarantee at-least-once delivery, so event and command redelivery will happen. Every compensation handler must be safe to run twice with the same input.
  • Choreography hides complexity; it does not eliminate it. Distributed tracing is mandatory for any choreography saga beyond two steps.
  • Saga state is production data. Treat it like a database table β€” back it up, monitor it, and never discard it silently on a service restart.

📌 TLDR: Summary & Key Takeaways

  • A Saga replaces distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction that reverses the business effect rather than performing a database rollback.
  • Orchestration centralizes workflow control in one saga class; choreography distributes it across event-driven participants. Neither is universally superior.
  • The Axon @Saga annotation, @SagaEventHandler, and SagaLifecycle.end() give orchestrated sagas durable, persisted state with minimal boilerplate, but saga state must be backed by a persistent store, never kept in memory.
  • Compensation order matters: release inventory before cancelling the order, or downstream reads see inconsistent state in the window between steps.
  • Stuck sagas and failed compensations are operational realities. Monitor DLQ depth and saga active duration from day one.