What are Hash Tables? Basics Explained
TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handled, and a full hash table must be resized.
📬 The Magic Mailroom Stamp
Imagine a mailroom with 100 drawers. A magic stamp converts any employee name to a number 1–100. Want John's mail? Stamp "John" → you get 42 → open Drawer 42. No search needed.
That stamp is a hash function. The row of drawers is the array (the backing store). The combination is a hash table.
📚 Hash Table Fundamentals
A hash table combines two simple ideas into a data structure with near-constant-time performance:
- An array provides O(1) access by index: `array[42]` is instant.
- A hash function converts any key (string, integer, object) to an index: `h("john") → 42`.
Together: given any key, compute its index in one step, then access the array slot directly. No iteration. No comparison loop.
The three fundamental operations and their complexity:
| Operation | Average case | Worst case | Why worst case occurs |
|---|---|---|---|
| Lookup | O(1) | O(N) | All keys hash to the same slot |
| Insert | O(1) amortized | O(N) | Rehash triggered on a full table |
| Delete | O(1) | O(N) | All keys collide into the same bucket |
Worst case requires a pathologically bad hash function (or deliberate hash flooding). In practice, well-designed hash functions achieve average O(1) for all three operations.
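To see the worst case concretely, here's a minimal sketch where a degenerate hash forces every key into one bucket (`BadKey` is a hypothetical class invented purely for illustration):

```python
# A deliberately bad hash turns dict lookups into linear scans.
class BadKey:
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 1  # every key lands in the same bucket
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

d = {BadKey(i): i for i in range(1000)}
# Every access must compare against colliding keys one by one,
# so lookups degrade from O(1) toward O(N).
print(d[BadKey(500)])  # 500
```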
What makes a good hash function?
- Deterministic: Same key always maps to the same slot.
- Uniform: Keys distribute evenly across all slots to minimize collisions.
- Fast: Must compute in O(1); the hash computation cannot become the bottleneck.
- Avalanche effect: Similar keys produce very different hash values (prevents clustering).
🔢 Deep Dive: Array + Hash Function

```mermaid
flowchart LR
    Key["Key: 'cat'"] --> HashFn["Hash Function: h('cat') = 2"]
    HashFn --> Array["Array: [0][1][2: 'cat' → 'meow'][3]..."]
    Array --> Value["Value: 'meow'"]
```
A hash function must be:
- Deterministic: Same input always produces the same output.
- Fast: O(1) computation.
- Uniform: Keys distribute evenly to minimize collisions.
Simple hash function (NOT production-quality):

```python
def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size
```
Python's built-in hash() uses SipHash (a keyed hash designed to resist hash flooding) for strings and bytes, which prevents attackers from precomputing colliding keys.
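To see why the toy hash above fails the avalanche test: a character sum is order-insensitive, so anagrams always collide. A quick sketch (repeating `simple_hash` so the snippet is self-contained):

```python
def simple_hash(key: str, size: int) -> int:
    # Toy hash from above: sum of character codes, modulo table size.
    return sum(ord(c) for c in key) % size

# No avalanche effect: reordering characters doesn't change the sum,
# so every anagram pair is a guaranteed collision.
print(simple_hash("cat", 100))  # 12
print(simple_hash("act", 100))  # 12, same slot, guaranteed collision
```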
🔍 How a Hash Table Handles a Lookup
The lookup flow for key "alice" in a hash table with separate chaining:

```mermaid
flowchart TD
    K["Key: 'alice'"] --> HF["Hash Function: h('alice') = 7"]
    HF --> IDX["Array index 7"]
    IDX --> BK["Bucket 7: [('alice', 25), ('charlie', 30)]"]
    BK --> SCAN{"Scan bucket: key == 'alice'?"}
    SCAN -->|Match found| RET["Return value: 25"]
    SCAN -->|No match| MISS["Return None / KeyError"]
    style RET fill:#90EE90
    style MISS fill:#FFB6C1
```
Three scenarios during lookup:
- Direct hit: `array[index]` contains exactly the target key → O(1).
- Collision chain: `array[index]` has multiple entries → scan the short chain → O(k), where k = bucket length (typically 1–2 with a good load factor).
- Miss: The slot is empty → key not in table → O(1).
For a table with load factor 0.75 and a good hash function, average bucket length is < 2, making even chained lookups extremely fast in practice.
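As a sketch, the lookup flow above fits in a few lines of Python, assuming plain lists as buckets (illustrative only; not how CPython's dict actually works):

```python
# Separate-chaining lookup: hash to a slot, then scan its short chain.
def bucket_lookup(buckets, key):
    index = hash(key) % len(buckets)   # 1. hash the key to a slot index
    for k, v in buckets[index]:        # 2. scan the (short) chain
        if k == key:
            return v                   # 3a. match found
    raise KeyError(key)                # 3b. miss

buckets = [[] for _ in range(8)]
for k, v in [("alice", 25), ("charlie", 30)]:
    buckets[hash(k) % 8].append((k, v))

print(bucket_lookup(buckets, "alice"))  # 25
```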
⚔️ Collisions: When Two Keys Land in the Same Slot
No hash function is perfect. Given a finite array, two different keys will eventually map to the same slot.
Example:
- `hash('cat') = 2` → stored at index 2.
- `hash('bird') = 2` → collision at index 2!
Two collision resolution strategies:
Separate Chaining
Each slot holds a linked list of key-value pairs. On collision, append to the list.
index 2: [('cat', 'meow') → ('bird', 'tweet')]
O(1) average, O(n) worst case if all keys hash to the same slot.
Open Addressing (Linear Probing)
No linked lists. On collision, step forward until you find an empty slot.
hash('cat') = 2 → store at 2
hash('bird') = 2 (full) → probe 3 → store at 3
Pros: Better cache performance (all data in one array). Cons: clustering, where runs of consecutive full slots slow down probing.
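A minimal linear-probing insert can be sketched as follows. The hash function is passed in as a parameter, and the constant `h = lambda k: 2` is a deliberately contrived stand-in that recreates the cat/bird collision from the example above:

```python
# Linear probing: on collision, step forward until an empty slot is found.
def probe_insert(table, key, value, h):
    index = h(key) % len(table)
    while table[index] is not None and table[index][0] != key:
        index = (index + 1) % len(table)   # step forward on collision
    table[index] = (key, value)
    return index

h = lambda k: 2                  # contrived: every key "hashes" to slot 2
table = [None] * 8
print(probe_insert(table, "cat", "meow", h))    # 2
print(probe_insert(table, "bird", "tweet", h))  # 3  (probed past "cat")
```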
| Strategy | Space | Lookup (avg) | Cache Friendly | Used In |
|---|---|---|---|---|
| Separate Chaining | Extra memory for lists | O(1) | No | Java HashMap |
| Linear Probing | Compact array | O(1) | Yes | CPython dict (open addressing, perturbed probing variant) |
| Quadratic Probing | Compact array | O(1) | Partial | Some open-addressing tables |
| Robin Hood Hashing | Compact array | O(1) | Yes | Rust HashMap (pre-hashbrown) |
🔁 Open Addressing: Collision Probe Sequence

```mermaid
sequenceDiagram
    participant K as Key "bird"
    participant HF as Hash Function
    participant AR as Array
    K->>HF: h(bird) = 2
    HF->>AR: check slot 2
    AR-->>HF: slot 2 occupied by "cat"
    HF->>AR: probe slot 3 (linear)
    AR-->>HF: slot 3 empty
    HF-->>K: insert "bird" at slot 3
    Note over AR: No linked list needed
```

This diagram traces a collision in open addressing step by step. When "bird" hashes to slot 2, which is already occupied by "cat", the algorithm probes forward to slot 3 instead of allocating a separate linked list. The key takeaway is that open addressing keeps all data in a single contiguous array, which is friendlier to CPU cache lines but requires careful load-factor management to prevent long probe chains.
🔧 Deep Dive: Load Factor and Rehashing
Load factor = (number of entries) / (array capacity).
As the table fills up (load factor → 1.0), collisions increase and performance degrades.
Standard practice: rehash when load factor exceeds 0.75.
Rehashing:
- Create a new array 2ร the size.
- Re-insert every existing key into the new array (re-hash all keys).
- Old array is garbage collected.
Cost: O(N) for one rehash. But since the capacity doubles each time, rehashes happen at exponentially growing intervals, so the amortized cost per insert is O(1).
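The resize rule can be sketched as a tiny open-addressing table (illustrative; `TinyTable` is an invented name, and real implementations tune thresholds and growth differently):

```python
# Doubles the backing array and re-inserts every key once the
# load factor would exceed 0.75.
class TinyTable:
    def __init__(self):
        self.capacity = 8
        self.slots = [None] * self.capacity
        self.count = 0

    def _insert(self, slots, key, value):
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)     # linear probing
        slots[i] = (key, value)

    def put(self, key, value):
        if (self.count + 1) / self.capacity > 0.75:
            self.capacity *= 2           # 1. allocate a 2x array
            new_slots = [None] * self.capacity
            for entry in self.slots:     # 2. re-hash every existing key
                if entry is not None:
                    self._insert(new_slots, *entry)
            self.slots = new_slots       # 3. old array is garbage collected
        self._insert(self.slots, key, value)
        self.count += 1

t = TinyTable()
for i in range(20):
    t.put(f"k{i}", i)
print(t.capacity)  # 32, doubled twice (8 -> 16 -> 32)
```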
📈 Load Factor Decision: Keep or Resize

```mermaid
flowchart TD
    A[Insert new key] --> B{Load factor > 0.75?}
    B -- No --> C[Insert into current array]
    C --> D["O(1) average"]
    B -- Yes --> E[Allocate 2x array]
    E --> F[Re-hash all existing keys]
    F --> G[Insert new key]
    G --> H[Old array GC'd]
    H --> I["Amortized O(1) per insert"]
```

This flowchart captures the split-second decision the Java runtime (and, by extension, Kotlin's) makes on every insert. Below the 0.75 threshold the operation is pure O(1); above it, the table allocates a new backing array and rehashes every existing key, an O(N) operation that happens infrequently enough that the amortized cost remains O(1). Pre-sizing the map with `new HashMap<>(initialCapacity, 0.75f)` (choosing a capacity of at least expectedSize / 0.75) avoids this rehash entirely in batch-loading scenarios.
⚡ Why Python Dicts Are Fast
Python dict uses open addressing with compact arrays (since CPython 3.6). Entries are stored in an incrementally growing compact array in insertion order, with a separate hash table of indices pointing into it. This gives:
- Fast iteration (compact memory, no pointer chasing).
- Memory efficiency (no linked-list overhead).
- Iteration order that matches insertion order (an implementation detail in CPython 3.6, a language guarantee since Python 3.7).
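A quick check of the ordering behavior (guaranteed by the language since Python 3.7):

```python
d = {}
d["banana"] = 2
d["apple"] = 1
d["cherry"] = 3
# Keys come back in insertion order, not sorted order and not hash order:
print(list(d))  # ['banana', 'apple', 'cherry']
```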
⚖️ Trade-offs, Failure Modes & Decision Guide: Hash Tables vs Others

| Operation | Hash Table | BST (e.g., TreeMap) | Array (sorted) |
|---|---|---|---|
| Lookup by key | O(1) avg | O(log n) | O(log n) binary search |
| Insert | O(1) avg | O(log n) | O(n) shift |
| Delete | O(1) avg | O(log n) | O(n) shift |
| Range query | ❌ Not supported | ✅ O(log n + k) | ✅ O(log n + k) |
| Ordered iteration | ❌ No order | ✅ In-order | ✅ Sorted |
Choose hash table when: You need fast key-value lookup and don't need ordering or range queries. Choose BST when: You need sorted iteration or range queries (e.g., "find all keys between 10 and 50").
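To make the range-query gap concrete, here's a sketch using the stdlib `bisect` module over a sorted list as a stand-in for a BST; a dict can only answer the same question by scanning every key:

```python
import bisect

keys = sorted([3, 17, 42, 8, 99, 25])    # [3, 8, 17, 25, 42, 99]

# Sorted structure: O(log n + k) range query via binary search.
lo = bisect.bisect_left(keys, 10)        # first index with key >= 10
hi = bisect.bisect_right(keys, 50)       # first index with key > 50
print(keys[lo:hi])                       # [17, 25, 42]

# Hash table: the only option is an O(n) scan of every key.
d = dict.fromkeys(keys)
print(sorted(k for k in d if 10 <= k <= 50))  # [17, 25, 42]
```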
🌍 Real-World Application: Hash Tables in Production
Hash tables are one of the most widely used data structures in existence:

| Application | Hash table used | Why |
|---|---|---|
| Programming language runtimes | Python dict, JavaScript object, Java HashMap | Core key-value primitive for all dynamic lookups |
| Database query execution | Hash join, hash aggregate | Join two tables by hashing the join key for O(1) match |
| Caches (Redis, Memcached) | Hash table over key-value pairs | O(1) cache-hit lookup by key |
| DNS resolution | Cached DNS records | Map hostname → IP in O(1) after first resolution |
| Compiler symbol tables | Variable name → type/location | O(1) lookup during type-checking and code generation |
| Counting frequencies | Word count, histogram | `count[word] += 1` pattern: O(1) per word |
| Deduplication | Seen-set during ETL | O(1) check if a record was already processed |
In coding interviews, hash tables solve a huge class of problems in O(N):
- Two Sum (O(N) with a hash lookup vs O(N²) brute force)
- Anagram detection (frequency map comparison)
- Longest subarray with constraint (sliding window + hash)
- Graph adjacency representation (adjacency list as hash map)
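For instance, the classic Two Sum solution is a single pass with a hash map (a common sketch; the function name and return convention are illustrative):

```python
# Two Sum: find indices of two numbers that add up to target, O(N) time.
def two_sum(nums, target):
    seen = {}                      # value -> index where it was seen
    for i, n in enumerate(nums):
        if target - n in seen:     # O(1) lookup replaces the inner loop
            return seen[target - n], i
        seen[n] = i
    return None

print(two_sum([2, 7, 11, 15], 9))  # (0, 1)
```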
🧪 Practical: Building a Frequency Counter
The most common hash table interview pattern: count occurrences and use them to answer queries.
Problem: Given two strings, determine if one is an anagram of the other in O(N) time.
```python
def is_anagram(s: str, t: str) -> bool:
    if len(s) != len(t):
        return False
    freq = {}
    # Count characters in s
    for char in s:
        freq[char] = freq.get(char, 0) + 1
    # Subtract characters in t
    for char in t:
        if char not in freq or freq[char] == 0:
            return False
        freq[char] -= 1
    return True

# Examples
print(is_anagram("anagram", "nagaram"))  # True
print(is_anagram("rat", "car"))          # False
```
Why this is O(N): Building the frequency map is one pass over s (O(N)). Validating against t is one pass over t (O(N)). Total: O(2N) = O(N). No sorting required (sorting would be O(N log N)).
Python shortcut:
```python
from collections import Counter

def is_anagram_v2(s: str, t: str) -> bool:
    return Counter(s) == Counter(t)
```
collections.Counter is Python's built-in frequency hash table: a dict subclass with the same O(1) insert and lookup.
🛠️ Java HashMap & ConcurrentHashMap: Hash Table Mechanics Under the Hood
Java's HashMap is a production-quality hash table that uses separate chaining internally (a linked list that upgrades to a red-black tree when a bucket's length reaches 8 and the table has at least 64 slots) and automatically rehashes at a 0.75 load factor: exactly the mechanics described in the collision resolution and rehashing sections above.
ConcurrentHashMap adapts the same design for multithreaded environments: instead of locking the entire map, it uses CAS to install the first node of an empty bucket and synchronizes only on individual bucket heads, so writers touching different buckets rarely contend.
```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashTableDemo {

    // ── Custom hash table with separate chaining (illustrative) ──────────
    static class ChainedHashTable<K, V> {
        private static final int SIZE = 16;

        @SuppressWarnings("unchecked")
        private final java.util.LinkedList<Map.Entry<K, V>>[] buckets =
                new java.util.LinkedList[SIZE];

        public void put(K key, V value) {
            int idx = Math.abs(key.hashCode() % SIZE);
            if (buckets[idx] == null) buckets[idx] = new java.util.LinkedList<>();
            buckets[idx].removeIf(e -> e.getKey().equals(key)); // update existing
            buckets[idx].add(Map.entry(key, value));
        }

        public V get(K key) {
            int idx = Math.abs(key.hashCode() % SIZE);
            if (buckets[idx] == null) return null;
            return buckets[idx].stream()
                    .filter(e -> e.getKey().equals(key))
                    .map(Map.Entry::getValue)
                    .findFirst().orElse(null);
        }
    }

    public static void main(String[] args) {
        // ── HashMap word frequency counter, O(1) average (mirrors the 🧪 section)
        Map<String, Integer> freq = new HashMap<>();
        for (String word : new String[]{"apple", "banana", "apple", "cherry", "apple"}) {
            freq.merge(word, 1, Integer::sum); // put-or-update in one call
        }
        System.out.println(freq); // {apple=3, banana=1, cherry=1}

        // ── ConcurrentHashMap: thread-safe, no full-map lock ─────────────
        ConcurrentHashMap<String, Integer> concurrent = new ConcurrentHashMap<>();
        concurrent.merge("apple", 1, Integer::sum); // atomic increment on CHM
        System.out.println("ConcurrentHashMap: " + concurrent);
    }
}
```
| Class | Thread-safe | Null keys/values | Ordered | Use when |
|---|---|---|---|---|
| HashMap | ❌ | ✅ / ✅ | ❌ | Single-threaded or externally synchronized |
| LinkedHashMap | ❌ | ✅ / ✅ | Insertion order | Need insertion-order iteration (e.g., LRU cache) |
| TreeMap | ❌ | ❌ / ✅ | Sorted by key | Need sorted keys or range queries (BST internally) |
| ConcurrentHashMap | ✅ | ❌ / ❌ | ❌ | High-concurrency, write-heavy, multi-threaded |
For a full deep-dive on Java HashMap internals (treeification, resize strategy) and ConcurrentHashMap's locking strategy, a dedicated follow-up post is planned.
🎓 What Hash Tables Teach You About Trade-offs
- O(1) average is not O(1) guaranteed. Every hash table has a worst case. Security-sensitive systems (e.g., web servers) should use keyed hash functions such as SipHash to prevent deliberate hash-flooding attacks that force O(N) collisions.
- Space and time have a direct relationship. A larger backing array โ fewer collisions โ faster lookups. Load factor is the knob that controls this trade-off.
- Ordering is sacrificed for speed. A hash table scrambles keys by design. If you need to iterate keys in sorted order, you need a BST or a sorted array instead.
- Rehashing is amortized, not free. Individual rehash events cost O(N). Latency-sensitive systems sometimes pre-size hash tables to avoid mid-operation rehashing and jitter.
📌 TLDR: Summary & Key Takeaways
- Hash function maps key → array index in O(1). Must be deterministic and uniform.
- Collisions are inevitable; separate chaining uses linked lists, linear probing uses the array itself.
- Rehash at load factor 0.75 to keep average O(1); amortized cost is O(1) per insert.
- Hash tables have no order: use a BST when you need sorted iteration or range queries.
Written by
Abstract Algorithms
@abstractalgorithms