Home/Blog/Java/Tarjan’s Algorithm: Finding Strongly Connected Components (SCC) step-by-step
JavaIntermediate11 min read

Tarjan’s Algorithm: Finding Strongly Connected Components (SCC) step-by-step

Master graph partitioning by deriving and implementing Tarjan's linear-time SCC algorithm in Java.

Abstract Algorithms

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: In directed graphs, a Strongly Connected Component (SCC) is a maximal subset of vertices where every vertex is reachable from any other vertex in the subset. Finding these components is crucial for resolving circular dependencies. Tarjan's algorithm finds all SCCs in a single Depth-First Search (DFS) pass, achieving an optimal $O(V + E)$ time complexity. This guide details its mechanics, mathematical models, and provides a full Java implementation.


📖 Concept: Graph Reachability and Strongly Connected Components

Graph reachability is a fundamental concept in computer science. In a directed graph, we say a vertex $v$ is reachable from a vertex $u$ if there exists a directed path starting at $u$ and ending at $v$. Reachability is not always symmetric; in a directed graph, the existence of a path from $u$ to $v$ does not guarantee a return path from $v$ to $u$.

However, when mutual reachability does exist between a set of vertices, they form a Strongly Connected Component (SCC). Formally, an SCC of a directed graph $G = (V, E)$ is a maximal subgraph $G' = (V', E')$ such that for every pair of vertices $u, v \in V'$, there is a directed path from $u$ to $v$ and a directed path from $v$ to $u$.

Identifying SCCs is essential for analyzing networks and solving software engineering challenges. For instance:

  • Circular Dependency Resolution: In build systems (like Maven or Gradle) or module bundlers (like Webpack), finding circular references between packages or files is equivalent to finding SCCs containing more than one vertex.
  • Social Network Subgroups: Identifying highly cohesive communities where information can flow mutually between all members.
  • Simplifying Complex Networks: We can compress any directed graph into a Directed Acyclic Graph (DAG) by collapsing each SCC into a single super-vertex, simplifying downstream topological routing.

⚙️ Mechanics: DFS Traversal and Stack Constraints

To find SCCs, we run a Depth-First Search (DFS) traversal. The key challenge is to identify the "entry point" of each SCC during our traversal. We call this entry point the root of the strongly connected component.

Tarjan's algorithm achieves linear time execution by tracking two integer values for each node during a single DFS pass:

  1. DFS Index (ids[u]): A unique, incrementing ID assigned to vertex $u$ in the order it is first discovered by DFS.
  2. Low-Link Value (low[u]): The smallest DFS index reachable from $u$, including $u$ itself, through a path of zero or more tree edges followed by at most one back-edge.

As we traverse the graph, we push nodes onto an auxiliary stack. When the DFS returns from traversing the neighbors of a vertex $u$, if its low-link value equals its DFS index (low[u] == ids[u]), it means $u$ is the root of an SCC. We then pop all nodes off the stack down to and including $u$. All these popped nodes belong to the same SCC.


📊 Flow: Graph Partitioning Sequence

The flowchart below visualizes how Tarjan's algorithm traverses a directed graph, updates node attributes, and extracts SCCs using the stack:

flowchart TD
    Start[Start DFS on unvisited Node u] -->|1. Assign ids and low-link| Assign[ids u = counter, low u = ids u]
    Assign -->|2. Push to Stack| Push[Push u to stack]
    Push -->|3. For each neighbor v| Loop[Check neighbor state]
    Loop -->|v not visited| Recurse[Recurse DFS v]
    Recurse -->|On return| UpdateUnvisited[low u = min low u, low v]
    Loop -->|v on stack| UpdateOnStack[low u = min low u, ids v]
    Loop -->|Loop finished| RootCheck{low u == ids u?}
    RootCheck -->|Yes| PopStack[Pop stack until u retrieved]
    PopStack -->|Collect SCC| SaveSCC[Save popped nodes as SCC]
    RootCheck -->|No| Return[Return to parent DFS]

To see this in action, let us trace a query on a directed graph with 4 nodes: 0 -> 1 -> 2 -> 0 (forming a cycle) and an edge 2 -> 3.

The table below traces the state modifications of each node during execution:

Traversal StepNodeOperationStack StateIDs TableLow-Link TableAction
10Visit[0][0, -, -, -][0, -, -, -]Recurse to 1
21Visit[0, 1][0, 1, -, -][0, 1, -, -]Recurse to 2
32Visit[0, 1, 2][0, 1, 2, -][0, 1, 2, -]Recurse to 3
43Visit[0, 1, 2, 3][0, 1, 2, 3][0, 1, 2, 3]No neighbors. low[3] == ids[3]
53Pop[0, 1, 2][0, 1, 2, 3][0, 1, 2, 3]SCC 1 collected: {3}
62Check 2 -> 0[0, 1, 2][0, 1, 2, 3][0, 1, 0, 3]0 is on stack. low[2] = min(2, ids[0]) = 0
71Return[0, 1, 2][0, 1, 2, 3][0, 0, 0, 3]low[1] = min(1, low[2]) = 0
80Return[0, 1, 2][0, 1, 2, 3][0, 0, 0, 3]low[0] = min(0, low[1]) = 0. low[0] == ids[0]
90Pop[][0, 1, 2, 3][0, 0, 0, 3]SCC 2 collected: {2, 1, 0}

🧠 Deep Dive: Tree Back-Edges and Cycle Identification

Implementing Tarjan's algorithm efficiently requires understanding how back-edges detect cycles.

The low-link value of a node is updated dynamically based on whether a neighbor is currently on the stack.

  • If a neighbor $v$ has not been visited, we recursively traverse it. When the traversal returns, we update the parent's low-link value:

    $$ \text{low}[u] = \min(\text{low}[u], \text{low}[v]) $$

    This propagates low-link values back up the DFS tree.
  • If a neighbor $v$ has already been visited and is on the stack, it means we have found a back-edge (a cycle). We update the parent's low-link value using the neighbor's DFS index:

    $$ \text{low}[u] = \min(\text{low}[u], \text{ids}[v]) $$

Mathematical Model of Tarjan's Equivalence Classes

Let $G = (V, E)$ be a directed graph. The mutual reachability relation $\sim$ is defined on $V$ as: $$ u \sim v \iff (u \to^ v) \land (v \to^ u) $$ where $\to^$ denotes a directed path of length $\ge 0$. The relation $\sim$ is an *equivalence relation because it satisfies:

  1. Reflexivity: $u \sim u$ (a node can always reach itself via a path of length 0).
  2. Symmetry: If $u \sim v$, then $v \sim u$.
  3. Transitivity: If $u \sim v$ and $v \sim w$, then $u \sim w$.

Therefore, the mutual reachability relation partitions the vertex set $V$ into disjoint equivalence classes. These equivalence classes are exactly the strongly connected components of the graph. Tarjan's algorithm calculates these equivalence classes in linear time by mapping them to subtree roots in the DFS execution tree.

Performance Analysis of Linear Time DFS

The time complexity of Tarjan's algorithm is $O(V + E)$, where $V$ is the number of vertices and $E$ is the number of edges. This is optimal because the algorithm visits each vertex and examines each edge exactly once during the DFS traversal.

The space complexity is $O(V)$ to store the array states (ids, low, and stack flags) and the recursion stack, matching the memory requirements of standard DFS.


🏗️ Advanced Concepts: Cross-Edges and Stack Containment Rules

During DFS, we categorize edges into four types: tree edges, forward edges, back-edges, and cross-edges.

  • Back-Edges: Point from a node to one of its ancestors in the DFS tree. These edges always represent cycles, so we use them to update low-link values.
  • Cross-Edges: Point from a node to a previously visited node that is not its ancestor.

A cross-edge $u \to v$ can connect different subtrees. If the target node $v$ has already been processed and popped from the stack, it belongs to a completed SCC. In this case, we must ignore the edge and not update the low-link value of $u$, because there is no path back from $v$ to $u$.

This is why we maintain a boolean onStack[] array. We only update low-link values if the target node is currently on the stack (onStack[v] == true).


🌍 Applications: From Circular Dependencies to Social Graphs

  1. Dependency Analysis in Package Managers: Checking for dependency cycles in package structures (e.g. npm dependencies) to prevent compilation loops.
  2. Garbage Collection Reference Tracing: Modern runtimes use graph reachability to find cyclic object references and clean up unreferenced memory groups.
  3. Optimizing Database Queries: Breaking recursive SQL queries down into independent DAG components to parallelize database executions.

⚖️ Trade-offs and Failure Modes

  • Recursion Depth Limits: For massive graphs (e.g., millions of nodes in a social network), a standard recursive DFS can exhaust the JVM call stack, causing a StackOverflowError.
  • Mitigation: Implement Tarjan's algorithm using an iterative DFS structure. This replaces the recursive JVM stack with a custom heap-allocated array stack.

🧭 Decision Guide: Tarjan vs. Kosaraju vs. Gabow

Use this table to choose the right SCC algorithm for your graph application:

AlgorithmSingle DFS Pass?Space ComplexityPractical Advantage
Tarjan'sYes$O(V)$Fast execution; finds components in a single DFS pass.
Kosaraju'sNo (requires 2 passes)$O(V)$Easier to conceptualize; requires transposing the graph.
Gabow'sYes$O(V)$Bypasses low-link integer tracking, using a second stack to track path limits.

🧪 Practical Implementation: Java Reference Code

Here is a complete, production-ready Java implementation of Tarjan's Strongly Connected Components algorithm.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Stack;

public class TarjansSCC {
    private int idCounter = 0;
    private int[] ids;
    private int[] low;
    private boolean[] onStack;
    private Stack<Integer> stack;
    private List<List<Integer>> sccs;

    public List<List<Integer>> findSCCs(int numVertices, List<List<Integer>> adjacencyList) {
        if (numVertices <= 0 || adjacencyList == null) {
            return new ArrayList<>();
        }

        ids = new int[numVertices];
        low = new int[numVertices];
        onStack = new boolean[numVertices];
        stack = new Stack<>();
        sccs = new ArrayList<>();

        // Initialize discovery IDs with a sentinel value (-1)
        Arrays.fill(ids, -1);

        for (int i = 0; i < numVertices; i++) {
            if (ids[i] == -1) {
                dfs(i, adjacencyList);
            }
        }

        return sccs;
    }

    private void dfs(int u, List<List<Integer>> adj) {
        ids[u] = idCounter;
        low[u] = idCounter;
        idCounter++;

        stack.push(u);
        onStack[u] = true;

        for (int v : adj.get(u)) {
            if (ids[v] == -1) {
                // Neighbor not visited yet: recurse
                dfs(v, adj);
                low[u] = Math.min(low[u], low[v]);
            } else if (onStack[v]) {
                // Back-edge found: update low-link value using neighbor's index
                low[u] = Math.min(low[u], ids[v]);
            }
            // If ids[v] != -1 and !onStack[v], this is a cross-edge to a completed SCC. Ignore it.
        }

        // If u is the root of a strongly connected component, collect the nodes
        if (low[u] == ids[u]) {
            List<Integer> component = new ArrayList<>();
            while (true) {
                int node = stack.pop();
                onStack[node] = false;
                component.add(node);
                if (node == u) {
                    break;
                }
            }
            sccs.add(component);
        }
    }
}

📚 Lessons Learned: Common Graph Traversal Bugs

  1. Forgetting to check the onStack constraint: If you update the low-link value of a node using a cross-edge to a completed SCC (where onStack[v] is false), you will corrupt the low-link values, leading to incorrect SCC partitions. Always check onStack[v] before updating low[u].
  2. Resetting idCounter inside DFS loops: The idCounter must be a global variable that increments continuously across all DFS calls. Resetting it for each unvisited root component will cause duplicate IDs, breaking the low-link verification logic.
  3. Array index out of bounds on Graph Representation: When mapping nodes to arrays, ensure the vertex IDs are normalized to the range [0, numVertices - 1]. If vertex IDs are arbitrary integers, use a Map<Integer, Integer> to map them to contiguous indices before running the algorithm.

📌 Summary: The Tarjan's Algorithm Cheatsheet

  • Definition: An SCC is a maximal subset of vertices where every vertex can reach every other vertex.
  • Complexity: Time complexity is $O(V + E)$; space complexity is $O(V)$.
  • Internal State Variables:
    • ids[u]: The order index in which vertex u was first visited.
    • low[u]: The lowest DFS index reachable from u.
  • SCC Root Identification: Occurs during traversal backtracking when low[u] == ids[u].
  • Stack Collection: Pop nodes off the stack down to and including the root node to collect all members of the SCC.

AI-generated article quiz

Test your understanding

🧠

Ready to test what you just learned?

Generate four focused questions from this article. Answers include immediate explanations.

Guided series path

Data Structures and Algorithms

View all lessons →
Lesson 4 of 27

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.

Sign in to save your rating.