
What are Hash Tables? Basics Explained


Abstract Algorithms · 11 min read

AI-assisted content.

TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handled, and the table must be resized as it fills up.


📖 The Magic Mailroom Stamp

Imagine a mailroom with 100 drawers. A magic stamp converts any employee name to a number 1–100. Want John's mail? Stamp "John" → you get 42 → open Drawer 42. No search needed.

That stamp is a hash function. The row of drawers is the array (the backing store). The combination is a hash table.


🔍 Hash Table Fundamentals

A hash table combines two simple ideas into a data structure with near-constant-time performance:

  1. An array provides O(1) access by index: array[42] is instant.
  2. A hash function converts any key (string, integer, object) to an index: h("john") → 42.

Together: given any key, compute its index in one step, then access the array slot directly. No iteration. No comparison loop.

The three fundamental operations and their complexity:

| Operation | Average case | Worst case | Why worst case occurs |
|---|---|---|---|
| Lookup | O(1) | O(N) | All keys hash to the same slot |
| Insert | O(1) amortized | O(N) | Rehash triggered on a full table |
| Delete | O(1) | O(N) | All keys collide into the same bucket |

Worst case requires a pathologically bad hash function (or deliberate hash flooding). In practice, well-designed hash functions achieve average O(1) for all three operations.

What makes a good hash function?

  • Deterministic: Same key always maps to the same slot.
  • Uniform: Keys distribute evenly across all slots to minimize collisions.
  • Fast: Must compute in O(1); the hash computation cannot become the bottleneck.
  • Avalanche effect: Similar keys produce very different hash values (prevents clustering).
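
These properties can be probed empirically. A minimal sketch (the naive character-sum hash and the bucket count of 8 are illustrative choices, not from any library): similar keys land in consecutive slots under the naive hash, showing its weak avalanche behavior.

```python
# Compare slot assignments of a naive hash against Python's built-in hash.
# Similar keys ("key0".."key7") expose the naive hash's lack of avalanche.

def naive_hash(key: str, size: int) -> int:
    # Sum of character codes: changing one character barely moves the hash.
    return sum(ord(c) for c in key) % size

SIZE = 8
keys = [f"key{i}" for i in range(8)]

naive_slots = [naive_hash(k, SIZE) for k in keys]
builtin_slots = [hash(k) % SIZE for k in keys]

print("naive:  ", naive_slots)    # [1, 2, 3, 4, 5, 6, 7, 0] -- consecutive slots
print("builtin:", builtin_slots)  # randomized per process; no visible pattern
```

The naive slots are perfectly predictable from the input, which is exactly what the avalanche property is meant to prevent.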

🔢 Deep Dive: Array + Hash Function

flowchart LR
    Key["Key: 'cat'"] --> HashFn["Hash function: h('cat') = 2"]
    HashFn --> Array["Array: slot 2 holds ('cat', 'meow')"]
    Array --> Value["Value: 'meow'"]

A hash function must be:

  1. Deterministic: Same input always produces the same output.
  2. Fast: O(1) computation.
  3. Uniform: Keys distribute evenly to minimize collisions.

Simple hash function (NOT production-quality):

def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size
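
A consequence worth seeing: because the character sum ignores order, any two anagrams collide under this function. (simple_hash is redefined here so the snippet runs standalone.)

```python
def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size

# Anagrams always collide: addition is order-insensitive.
print(simple_hash("cat", 16))   # 8
print(simple_hash("act", 16))   # 8 -- same slot, guaranteed collision
```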

Python's built-in hash() for strings uses SipHash, a keyed hash function with per-interpreter randomization, to prevent hash-flooding attacks.


๐Ÿ“Š How a Hash Table Handles a Lookup

The lookup flow for key "alice" in a hash table with separate chaining:

flowchart TD
    K["Key: 'alice'"] --> HF["Hash function: h('alice') = 7"]
    HF --> IDX["Array index 7"]
    IDX --> BK["Bucket 7: ('alice', 25), ('charlie', 30)"]
    BK --> SCAN{"Scan bucket: key == 'alice'?"}
    SCAN -->|Match found| RET["Return value: 25"]
    SCAN -->|No match| MISS["Return None / KeyError"]
    style RET fill:#90EE90
    style MISS fill:#FFB6C1

Three scenarios during lookup:

  1. Direct hit: array[index] contains exactly the target key → O(1).
  2. Collision chain: array[index] has multiple entries → scan the short chain → O(k) where k = bucket length (typically 1–2 with a good load factor).
  3. Miss: The slot is empty → key not in table → O(1).

For a table with load factor 0.75 and a good hash function, average bucket length is < 2, making even chained lookups extremely fast in practice.
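
The three scenarios above can be traced with a minimal chained table (an illustrative sketch: fixed capacity, no resizing, Python's built-in hash standing in for h):

```python
class ChainedTable:
    """Tiny hash table with separate chaining; fixed capacity, no resizing."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]  # one list per slot

    def put(self, key, value):
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: append to the chain

    def get(self, key, default=None):
        bucket = self.buckets[hash(key) % self.capacity]
        for k, v in bucket:               # scan the (usually short) chain
            if k == key:
                return v                  # direct hit or chain hit
        return default                    # miss: empty slot or key absent

table = ChainedTable()
table.put("alice", 25)
table.put("charlie", 30)
print(table.get("alice"))    # 25
print(table.get("bob"))      # None
```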


โš™๏ธ Collisions: When Two Keys Land in the Same Slot

No hash function is perfect. Given a finite array, two different keys will eventually map to the same slot.

Example:

  • hash('cat') = 2 → Stored at index 2.
  • hash('bird') = 2 → Collision at index 2!

Two collision resolution strategies:

Separate Chaining

Each slot holds a linked list of key-value pairs. On collision, append to the list.

index 2: [('cat', 'meow') → ('bird', 'tweet')]

O(1) average, O(n) worst case if all keys hash to the same slot.

Open Addressing (Linear Probing)

No linked lists. On collision, step forward until you find an empty slot.

hash('cat') = 2 → store at 2
hash('bird') = 2 (full) → probe 3 → store at 3

Pros: Better cache performance (all data in one array). Cons: Clustering, where runs of consecutive full slots slow probing.
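
Linear probing fits in a few lines. A minimal sketch (illustrative: fixed capacity of 8, no deletion support, and it assumes the table never fills completely):

```python
CAPACITY = 8
slots = [None] * CAPACITY   # each slot holds a (key, value) pair or None

def probe_put(key, value):
    idx = hash(key) % CAPACITY
    # Step forward past occupied slots (wrapping around) until we find
    # an empty slot or the key itself.
    while slots[idx] is not None and slots[idx][0] != key:
        idx = (idx + 1) % CAPACITY
    slots[idx] = (key, value)

def probe_get(key):
    idx = hash(key) % CAPACITY
    while slots[idx] is not None:
        if slots[idx][0] == key:
            return slots[idx][1]
        idx = (idx + 1) % CAPACITY  # follow the same probe sequence as put
    return None                     # hit an empty slot: key was never stored

probe_put("cat", "meow")
probe_put("bird", "tweet")
print(probe_get("bird"))   # tweet
print(probe_get("dog"))    # None
```

Note that lookup must follow the exact probe sequence used by insertion, which is why deletion in open addressing needs tombstones rather than simply clearing a slot.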

| Strategy | Space | Lookup (avg) | Cache friendly | Used in |
|---|---|---|---|---|
| Separate chaining | Extra memory for lists | O(1) | No | Java HashMap (tree bins since JDK 8) |
| Linear probing | Compact array | O(1) | Yes | Python dict (with perturbation) |
| Quadratic probing | Compact array | O(1) | Partial | Some open-addressing tables |
| Robin Hood hashing | Compact array | O(1) | Yes | Rust HashMap before 1.36 (SwissTable since) |

📊 Open Addressing: Collision Probe Sequence

sequenceDiagram
    participant K as Key "bird"
    participant HF as Hash Function
    participant AR as Array

    K->>HF: h(bird) = 2
    HF->>AR: check slot 2
    AR-->>HF: slot 2 occupied by "cat"
    HF->>AR: probe slot 3 (linear)
    AR-->>HF: slot 3 empty
    HF-->>K: insert "bird" at slot 3
    Note over AR: No linked list needed

This diagram traces a collision in open addressing step by step. When "bird" hashes to slot 2, which is already occupied by "cat", the algorithm probes forward to slot 3 instead of allocating a separate linked list. The key takeaway is that open addressing keeps all data in a single contiguous array, which is friendlier to CPU cache lines but requires careful load-factor management to prevent long probe chains.


🧠 Deep Dive: Load Factor and Rehashing

Load factor = (number of entries) / (array capacity).

As the table fills up (load factor → 1.0), collisions increase and performance degrades.

Standard practice: rehash when load factor exceeds 0.75.

Rehashing:

  1. Create a new array 2× the size.
  2. Re-insert every existing key into the new array (re-hash all keys).
  3. Old array is garbage collected.

Cost: O(N) for one rehash. But because capacity doubles each time, rehashes occur at geometrically growing intervals, so the amortized cost per insert is O(1).
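
The resize trigger can be sketched with a chaining table that doubles once the load factor would exceed 0.75 (illustrative: the initial capacity of 4 is arbitrary, and deletion is omitted):

```python
class ResizingTable:
    """Chaining hash table that doubles capacity past a 0.75 load factor."""

    def __init__(self):
        self.capacity = 4
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    def put(self, key, value):
        if (self.size + 1) / self.capacity > 0.75:
            self._rehash()                       # O(N), but infrequent
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)         # update existing key
                return
        bucket.append((key, value))
        self.size += 1

    def _rehash(self):
        old = self.buckets
        self.capacity *= 2                       # 1. allocate a 2x array
        self.buckets = [[] for _ in range(self.capacity)]
        for bucket in old:                       # 2. re-insert every key
            for key, value in bucket:
                self.buckets[hash(key) % self.capacity].append((key, value))
        # 3. the old array is garbage collected once `old` goes out of scope

t = ResizingTable()
for i in range(10):
    t.put(f"k{i}", i)
print(t.capacity)  # 16 -- doubled twice (4 -> 8 -> 16) while absorbing 10 keys
```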

📊 Load Factor Decision: Keep or Resize

flowchart TD
    A[Insert new key] --> B{Load factor > 0.75?}
    B -- No --> C[Insert into current array]
    C --> D["O(1) average"]
    B -- Yes --> E[Allocate 2x array]
    E --> F[Re-hash all existing keys]
    F --> G[Insert new key]
    G --> H[Old array GC'd]
    H --> I["Amortized O(1) per insert"]

This flowchart captures the decision Java's HashMap makes on every insert. Below the 0.75 threshold the operation is plain O(1); above it, the table allocates a new backing array and rehashes every existing key, an O(N) operation that happens infrequently enough that the amortized cost remains O(1). Sizing your hash table upfront with new HashMap<>(expectedSize, 0.75f) avoids this rehash entirely in batch-loading scenarios.


โš™๏ธ Why Python Dicts Are Fast

Python dict uses open addressing with compact arrays (since Python 3.6). Entries are stored in an incrementally-growing compact array indexed by insertion order, with a separate hash table of indices pointing into it. This gives:

  • Fast iteration (compact memory, no pointer chasing).
  • Memory-efficient (no linked list overhead).
  • Iteration order guaranteed to match insertion order (a language guarantee since Python 3.7).
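
The ordering guarantee is easy to check (since Python 3.7 it is part of the language specification, not just a CPython implementation detail):

```python
d = {}
d["cherry"] = 3
d["apple"] = 1
d["banana"] = 2

# Iteration follows insertion order, not key order or hash order.
print(list(d))   # ['cherry', 'apple', 'banana']
```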

โš–๏ธ Trade-offs & Failure Modes: Trade-offs, Failure Modes & Decision Guide: Hash Tables vs Others

| Operation | Hash table | BST (e.g., TreeMap) | Array (sorted) |
|---|---|---|---|
| Lookup by key | O(1) avg | O(log n) | O(log n) binary search |
| Insert | O(1) avg | O(log n) | O(n) shift |
| Delete | O(1) avg | O(log n) | O(n) shift |
| Range query | ❌ Not supported | ✅ O(log n + k) | ✅ O(log n + k) |
| Ordered iteration | ❌ No order | ✅ In-order | ✅ Sorted |

Choose hash table when: You need fast key-value lookup and don't need ordering or range queries. Choose BST when: You need sorted iteration or range queries (e.g., "find all keys between 10 and 50").
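
The range-query gap above, sketched with the standard library: a sorted list plus bisect stands in for the BST, while the dict must scan every key.

```python
import bisect

data = {7: "a", 42: "b", 13: "c", 99: "d", 25: "e"}

# Hash table: O(1) point lookup, but a range query must scan all N keys.
in_range_scan = sorted(k for k in data if 10 <= k <= 50)

# Sorted array: binary search finds the window in O(log n + k).
keys = sorted(data)                      # [7, 13, 25, 42, 99]
lo = bisect.bisect_left(keys, 10)        # first key >= 10
hi = bisect.bisect_right(keys, 50)       # first key > 50
in_range_bisect = keys[lo:hi]

print(in_range_scan)     # [13, 25, 42]
print(in_range_bisect)   # [13, 25, 42]
```

Both give the same answer; the difference is that the scan touches every key while the binary search touches O(log n + k).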


๐ŸŒ Real-World Application: Hash Tables in Production

Hash tables are one of the most widely used data structures in existence:

| Application | Hash table used | Why |
|---|---|---|
| Programming language runtimes | Python dict, JavaScript object, Java HashMap | Core key-value primitive for all dynamic lookups |
| Database query execution | Hash join, hash aggregate | Join two tables by hashing the join key for O(1) matching |
| Caches (Redis, Memcached) | Hash table over key-value pairs | O(1) cache-hit lookup by key |
| DNS resolution | Cached DNS records | Map hostname → IP in O(1) after first resolution |
| Compiler symbol tables | Variable name → type/location | O(1) lookup during type-checking and code generation |
| Counting frequencies | Word count, histogram | count[word] += 1 pattern, O(1) per word |
| Deduplication | Seen-set during ETL | O(1) check whether a record was already processed |

In coding interviews, hash tables solve a huge class of problems in O(N):

  • Two Sum / Four Sum (O(N) with hash lookup vs O(N²) brute force)
  • Anagram detection (frequency map comparison)
  • Longest subarray with constraint (sliding window + hash)
  • Graph adjacency representation (adjacency list as hash map)
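
Two Sum, the first pattern above, shows the jump from O(N²) to O(N): a single pass that checks for each number's complement in a hash map before storing the number itself.

```python
def two_sum(nums, target):
    """Return indices (i, j) with nums[i] + nums[j] == target, else None."""
    seen = {}                        # value -> index, O(1) lookups
    for i, n in enumerate(nums):
        if target - n in seen:       # complement already seen earlier?
            return seen[target - n], i
        seen[n] = i                  # remember this value for later pairs
    return None

print(two_sum([2, 7, 11, 15], 9))   # (0, 1) -- because 2 + 7 == 9
```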

🧪 Practical: Building a Frequency Counter

The most common hash table interview pattern: count occurrences and use them to answer queries.

Problem: Given two strings, determine if one is an anagram of the other in O(N) time.

def is_anagram(s: str, t: str) -> bool:
    if len(s) != len(t):
        return False

    freq = {}

    # Count characters in s
    for char in s:
        freq[char] = freq.get(char, 0) + 1

    # Subtract characters in t
    for char in t:
        if char not in freq or freq[char] == 0:
            return False
        freq[char] -= 1

    return True

# Examples
print(is_anagram("anagram", "nagaram"))  # True
print(is_anagram("rat", "car"))          # False

Why this is O(N): Building the frequency map is one pass over s (O(N)). Validating against t is one pass over t (O(N)). Total: O(2N) = O(N). No sorting required (sorting would be O(N log N)).

Python shortcut:

from collections import Counter

def is_anagram_v2(s: str, t: str) -> bool:
    return Counter(s) == Counter(t)

collections.Counter is Python's built-in frequency hash table: a dict subclass with the same O(1) insert and lookup.


๐Ÿ› ๏ธ Java HashMap & ConcurrentHashMap: Hash Table Mechanics Under the Hood

Java's HashMap is a production-quality hash table that uses separate chaining internally (each bucket is a linked list that upgrades to a red-black tree once it holds 8 or more entries and the table has at least 64 slots) and automatically rehashes at a 0.75 load factor, exactly the mechanics described in the collision resolution and rehashing sections above.

ConcurrentHashMap extends the same design for multithreaded environments: instead of locking the entire map, it uses CAS operations to install entries into empty buckets and a small per-bucket lock otherwise, so writers touching different buckets rarely contend.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashTableDemo {

    // ── Custom hash table with separate chaining (illustrative) ──────────────
    static class ChainedHashTable<K, V> {
        private static final int SIZE = 16;
        @SuppressWarnings("unchecked")
        private final java.util.LinkedList<Map.Entry<K,V>>[] buckets =
            new java.util.LinkedList[SIZE];

        public void put(K key, V value) {
            int idx = Math.floorMod(key.hashCode(), SIZE);  // never negative, even for Integer.MIN_VALUE
            if (buckets[idx] == null) buckets[idx] = new java.util.LinkedList<>();
            buckets[idx].removeIf(e -> e.getKey().equals(key)); // update existing
            buckets[idx].add(Map.entry(key, value));
        }

        public V get(K key) {
            int idx = Math.floorMod(key.hashCode(), SIZE);  // must match put's index computation
            if (buckets[idx] == null) return null;
            return buckets[idx].stream()
                               .filter(e -> e.getKey().equals(key))
                               .map(Map.Entry::getValue)
                               .findFirst().orElse(null);
        }
    }

    public static void main(String[] args) {
        // ── HashMap: O(1) average, word frequency counter (mirrors the 🧪 section)
        Map<String, Integer> freq = new HashMap<>();
        for (String word : new String[]{"apple","banana","apple","cherry","apple"}) {
            freq.merge(word, 1, Integer::sum);  // insert 1, or add 1 to the existing count
        }
        System.out.println(freq);  // {apple=3, banana=1, cherry=1}

        // ── ConcurrentHashMap: thread-safe, no full-map lock ──────────────────
        ConcurrentHashMap<String, Integer> concurrent = new ConcurrentHashMap<>();
        concurrent.merge("apple", 1, Integer::sum);  // CAS-based atomic increment
        System.out.println("ConcurrentHashMap: " + concurrent);
    }
}

| Class | Thread-safe | Null keys/values | Ordered | Use when |
|---|---|---|---|---|
| HashMap | ❌ | ✅ / ✅ | ❌ | Single-threaded or externally synchronized |
| LinkedHashMap | ❌ | ✅ / ✅ | Insertion order | Need insertion-order iteration (e.g., LRU cache) |
| TreeMap | ❌ | ❌ / ✅ | Sorted by key | Need sorted keys or range queries (red-black tree internally) |
| ConcurrentHashMap | ✅ | ❌ / ❌ | ❌ | High-concurrency, write-heavy, multi-threaded |

For a full deep-dive on Java HashMap internals (treeification, resize strategy) and ConcurrentHashMap's locking scheme, a dedicated follow-up post is planned.


📚 What Hash Tables Teach You About Trade-offs

  • O(1) average is not O(1) guaranteed. Every hash table has a worst case. Security-sensitive systems (web servers) must use keyed hash functions such as SipHash to prevent deliberate hash-flooding attacks that force O(N) collisions.
  • Space and time have a direct relationship. A larger backing array → fewer collisions → faster lookups. Load factor is the knob that controls this trade-off.
  • Ordering is sacrificed for speed. A hash table scrambles keys by design. If you need to iterate keys in sorted order, you need a BST or a sorted array instead.
  • Rehashing is amortized, not free. Individual rehash events cost O(N). Latency-sensitive systems sometimes pre-size hash tables to avoid mid-operation rehashing and jitter.

📌 TLDR: Summary & Key Takeaways

  • Hash function maps key → array index in O(1). Must be deterministic and uniform.
  • Collisions are inevitable; Separate Chaining uses linked lists; Linear Probing uses the array itself.
  • Rehash at load factor 0.75 to keep average O(1); amortized cost is O(1) per insert.
  • Hash tables have no order โ€” use a BST when you need sorted iteration or range queries.

Written by Abstract Algorithms (@abstractalgorithms)