Home/Blog/Java/Zero-Downtime Schema Migrations: The Expand-Contract Pattern for Live Production DBs
JavaIntermediateβ€’11 min readβ€’

Zero-Downtime Schema Migrations: The Expand-Contract Pattern for Live Production DBs

Learn how to evolve relational database schemas under live traffic using the Expand-Contract pattern in Spring Boot.

Abstract Algorithms

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: Altering database schemas in a high-traffic production system without downtime requires the Expand-Contract pattern. Instead of changing a column in place, we add a new column (Expand), double-write to both columns via the application (Transition), backfill historical data, and finally remove the old column (Contract). This guide details this lifecycle with full Spring Boot and Liquibase snippets.


πŸ“– Concept: Zero-Downtime Schema Migrations

In modern distributed systems, deploying updates to application servers is routinely managed using rolling updates or blue-green deployments. However, databases present a unique challenge because state cannot be easily duplicated or discarded. Running a standard ALTER TABLE RENAME COLUMN statement on a table containing millions of rows requires acquiring an exclusive table lock, causing active queries to block, requests to time out, and leading to service outages.

To prevent outages during database changes, we must ensure that any schema modification remains compatible with both the currently running version of the application and the new version being deployed. This compatibility is achieved using the Expand-Contract pattern (sometimes called the Parallel Change pattern).

The core idea is to split a single breaking database alteration into multiple small, backwards-compatible stages. The application and the database evolve in step, ensuring that at no point does a running application server read from or write to a database schema it does not understand.


βš™οΈ Mechanics: The Three-Phase Migration Lifecycle

Let us trace the evolution of a schema during a column rename operation. Suppose we want to rename the column phone_number to contact_phone in our users table.

We execute the change through these structured phases:

  1. Phase 1: Expand (Database): We add the new column contact_phone to the database schema as a nullable field.
  2. Phase 2: Transition (Application Deployments):
    • V1 Deploy (Double-Write): We deploy a version of the application that writes to both phone_number and contact_phone but reads only from phone_number. This ensures that new data is captured in both formats.
    • Backfill (Data Migration): We execute an asynchronous background job to copy all existing historical values from phone_number to contact_phone.
    • V2 Deploy (Read-New): We deploy a version of the application that reads only from contact_phone and continues to write to both columns (to allow for safe rollback).
  3. Phase 3: Contract (Database): Once we are sure the new application version is stable, we stop writing to the old column and execute a database migration to drop phone_number from the database.

πŸ“Š Flow: High-Level Migration Sequence

The diagram below visualizes the transition states of the database and the active application versions during the migration:

graph TD
    Start["State 0: Legacy Schema
App V0 reads/writes 'phone_number'"] -->|Expand: Add Column| Phase1["State 1: Expanded Schema
App V1 double-writes to both, reads 'phone_number'"] Phase1 -->|Data Backfill| Backfill["State 2: Backfilled Schema
All historical values copied to 'contact_phone'"] Backfill -->|Read Switch| Phase2["State 3: Transition Schema
App V2 double-writes to both, reads 'contact_phone'"] Phase2 -->|Contract: Drop Old Column| Phase3["State 4: Contracted Schema
App V3 reads/writes 'contact_phone' only"]

The table below summarizes the operational compatibility matrices at each stage:

StageDatabase StateApplication ReadsApplication WritesRollback Capability
0. Legacyphone_numberphone_numberphone_numberBaseline
1. Expandphone_number, contact_phone (null)phone_numberphone_number & contact_phoneEasy (drop new column)
2. Transitionphone_number, contact_phone (populated)contact_phonephone_number & contact_phoneSafe (old column still active)
3. Contractcontact_phonecontact_phonecontact_phoneHard (requires database recovery)

🧠 Deep Dive: Structural Evolution & Performance Profiling

Understanding how the database engine executes these operations reveals why the Expand-Contract pattern is necessary.

Database Engine Lock Internals

In relational database engines like PostgreSQL or MySQL (InnoDB), schema changes require obtaining a Metadata Lock (MDL).

  • In older database versions, any ALTER TABLE operation required rebuilding the entire table on disk, acquiring an Exclusive (Access Exclusive) lock. This blocks all read and write queries on the target table.
  • Modern engines support INSTANT or ONLINE operations for adding nullable columns. For example, in PostgreSQL, ALTER TABLE ADD COLUMN contact_phone VARCHAR only updates the database catalog (system tables) and returns in milliseconds without blocking concurrent queries. However, renaming a column still requires catalog updates that require short Access Exclusive locks. Performing a rename under heavy concurrent transaction load can block the query planner queue, leading to pool exhaustion.

Mathematical Model of Multi-Version Data Consistency

During Phase 2 (the Transition phase), we must guarantee that reads are consistent regardless of which application instance handles the request.

Let $R_a(t)$ and $Wa(t)$ denote the read and write operations of application version $a \in {0, 1, 2}$ at time $t$. Let $D(t) = [C{old}, C_{new}]$ represent the database state.

  • For version $V0$ (Legacy):

    $$ W_0(t) \to C_{old} \quad \text{and} \quad R_0(t) \gets C_{old} $$

  • For version $V1$ (Double-Write, Read-Old):

    $$ W_1(t) \to [C_{old}, C_{new}] \quad \text{and} \quad R_1(t) \gets C_{old} $$

  • For version $V2$ (Double-Write, Read-New):

    $$ W_2(t) \to [C_{old}, C_{new}] \quad \text{and} \quad R_2(t) \gets C_{new} $$

Because both $V1$ and $V2$ write to both columns, any write $W(t)$ is replicated to both fields: $C{old} = C{new}$. Consequently, if an instance of $V1$ reads $C{old}$ and an instance of $V2$ reads $C{new}$, they retrieve the same data, satisfying: $$ R_1(t) = R_2(t) $$ This mathematical equivalence guarantees read-write consistency across mixed application cluster states.

Performance Analysis of Double Writing

Double writing increases the network payload size and requires slightly more CPU cycles during serialization. However, on modern hardware, the impact of double-writing two string fields in a single query is negligible (less than 1% increase in query latency).

The primary performance risk is the Data Backfill process. Running a naive query like: UPDATE users SET contact_phone = phone_number WHERE contact_phone IS NULL; on a table with 10 million rows will result in a long-running transaction. This transaction will bloat the Undo Log (rollback segment), lock rows, and exhaust connection pools.

To mitigate this, backfilling must be executed in small, throttled batches (e.g., updating 1,000 rows at a time with a short sleep between batches).


πŸ—οΈ Advanced Concepts: Handling Large-Scale Table Alterations

When working with tables that exceed hundreds of gigabytes, even adding a nullable column can sometimes hit system boundaries (e.g., maximum row size limits or disk space limits during page splits).

To handle these scenarios at scale, we use toolkits like GitHub's gh-ost or Percona's pt-online-schema-change. These tools create a shadow copy of the table, apply migrations to the shadow table, set up database triggers or parse binary logs to stream incoming writes from the live table to the shadow table, and then execute a fast rename operation to swap the tables once they are synchronized.


🌍 Applications: Real-World SaaS Migrations

  1. Stripe Ledger Evolutions: Stripe uses a multi-phase column-mapping layer to move fields between database servers without dropping API requests.
  2. E-Commerce Account Merges: Moving customer shipping profiles from separate legacy databases into a unified schema during corporate mergers.
  3. High-Frequency Audits: Adding tracking metadata or audit hashes to core transaction records under live load.

βš–οΈ Trade-offs and Failure Modes

  • Operational Overhead: What would have been a single SQL script is split into multiple code deployments, data backfills, and database cleanup scripts.
  • Orphaned Writes: If the backfill script executes while older application instances (V0) are still writing to the database, some updates might only write to the old column and never propagate to the new column.
  • Mitigation: Ensure that all application servers are successfully upgraded to the double-writing version (V1) before triggering the backfill script.

🧭 Decision Guide: Expand-Contract vs. Locking Maintenance

ScenarioRecommended StrategyAlternative
High-traffic user-facing database (e.g. checkout, auth)Expand-Contract PatternHighly recommended to preserve SLAs.
Internal reporting DB or low-traffic admin panelScheduled Maintenance WindowExecute a simple locking ALTER during off-peak hours to save engineering time.
Very large tables (> 500GB) under continuous write loadOnline Schema Change Tool (gh-ost)Bypasses direct ALTER statements to prevent catalog locks.

πŸ§ͺ Practical Implementation: Liquibase & Spring Boot Code

Let us implement the database migrations and application code for Phase 2 (Transition Phase) of our column rename operation.

1. Database Migration Scripts (Liquibase XML)

Step 1: Expand Phase (Add Column)

<databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
                   http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-4.3.xsd">

    <changeSet id="20260618-add-contact-phone-column" author="abstract-algorithms">
        <addColumn tableName="users">
            <column name="contact_phone" type="varchar(255)">
                <constraints nullable="true"/>
            </column>
        </addColumn>
    </changeSet>
</databaseChangeLog>

Step 2: Contract Phase (Drop Column)

<databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
                   http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-4.3.xsd">

    <changeSet id="20260618-drop-old-phone-number-column" author="abstract-algorithms">
        <dropColumn tableName="users" columnName="phone_number"/>
    </changeSet>
</databaseChangeLog>

2. Spring Boot JPA Transition Code

During the transition phase, the application entity double-writes to both columns. We use JPA lifecycle callbacks (@PrePersist and @PreUpdate) to ensure that any change to the model updates both fields automatically.

import jakarta.persistence.*;

@Entity
@Table(name = "users")
public class User {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "username", nullable = false)
    private String username;

    // Legacy column (to be removed in Contract phase)
    @Column(name = "phone_number")
    private String phoneNumber;

    // New column (added in Expand phase)
    @Column(name = "contact_phone")
    private String contactPhone;

    // Field getter/setter routing for transition phase
    public String getActivePhone() {
        // V2 Read-New logic: Read from contactPhone if populated, fallback to legacy
        return this.contactPhone != null ? this.contactPhone : this.phoneNumber;
    }

    public void setActivePhone(String newPhone) {
        this.phoneNumber = newPhone;
        this.contactPhone = newPhone;
    }

    // JPA lifecycle hook to guarantee double-writing on insert/update
    @PrePersist
    @PreUpdate
    protected void synchronizePhoneFields() {
        if (this.contactPhone != null && this.phoneNumber == null) {
            this.phoneNumber = this.contactPhone;
        } else if (this.phoneNumber != null && this.contactPhone == null) {
            this.contactPhone = this.phoneNumber;
        } else if (this.phoneNumber != null && !this.phoneNumber.equals(this.contactPhone)) {
            // Replicate the latest change to both columns
            this.contactPhone = this.phoneNumber;
        }
    }

    // Boilerplate Getters and Setters
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getUsername() { return username; }
    public void setUsername(String username) { this.username = username; }
    public String getPhoneNumber() { return phoneNumber; }
    public void setPhoneNumber(String phoneNumber) { this.phoneNumber = phoneNumber; }
    public String getContactPhone() { return contactPhone; }
    public void setContactPhone(String contactPhone) { this.contactPhone = contactPhone; }
}

3. Throttled Batch Backfill Script (SQL / Java Execution)

This repository method executes the backfill in throttled batches to prevent table locking and Undo Log overflow.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BackfillService {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    public void executeBackfill(int batchSize, long throttleDelayMs) {
        boolean hasMore = true;

        while (hasMore) {
            // Update a chunk of rows where the new column is null but old column contains data
            int updatedRows = executeBatchUpdate(batchSize);

            if (updatedRows == 0) {
                hasMore = false;
            } else {
                try {
                    // Rest the database connection pool between batches
                    Thread.sleep(throttleDelayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Backfill process interrupted", e);
                }
            }
        }
    }

    @Transactional
    protected int executeBatchUpdate(int batchSize) {
        String query = "UPDATE users " +
                       "SET contact_phone = phone_number " +
                       "WHERE contact_phone IS NULL AND phone_number IS NOT NULL " +
                       "LIMIT ?";

        return jdbcTemplate.update(query, batchSize);
    }
}

πŸ“š Lessons Learned: Common Migration Pitfalls

  1. Running Backfill Before Deploying Double-Writes: If the backfill job runs while application instances are still writing only to the legacy column, any database records modified after they are backfilled will contain stale values in the new column. Double-writing must be active across the cluster before the backfill starts.
  2. Missing Constraints on the New Column: If the legacy column had a NOT NULL constraint, we cannot directly enforce NOT NULL on the new column during the Expand phase because existing records will not have values yet. The new column must remain nullable until the backfill is complete. A final database migration must add the constraint after data is populated.
  3. Rollback Verification Failures: Always verify that your double-writing application can run alongside both the old and new schemas. This ensures that if the new deployment (V2) fails, you can safely roll back to V1 without database corruption.

πŸ“Œ Summary: The Zero-Downtime Migration Cheatsheet

  • Pattern Core: Expand (add new database field), Transition (application double-writes & copies history), Contract (drop old database field).
  • Backfill Safety: Never run a single global UPDATE query on large tables. Run updates in small, throttled batches.
  • Nullability: New columns must be created as nullable. Enforce constraints only after the backfill completes.
  • JPA Hooks: Leverage @PrePersist and @PreUpdate callbacks in Hibernate to ensure clean double-writes.
  • Compatibility Rule: The database schema must remain compatible with at least $N-1$ versions of the running application.

AI-generated article quiz

Test your understanding

🧠

Ready to test what you just learned?

Generate four focused questions from this article. Answers include immediate explanations.

Guided series path

Architecture Patterns for Production Systems

View all lessons β†’
Lesson 12 of 24

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.

Sign in to save your rating.