# Python Data Structures: Lists, Dicts, Sets, and Tuples
Python's built-in collections aren't just syntax — their internals determine whether your code runs in O(1) or O(n)
**TLDR:** Python's four built-in collections are not interchangeable — their internals are fundamentally different. `list` is a dynamic array: fast at the end, slow for membership. `dict` is a hash table: O(1) key lookup, insertion-order-preserving since Python 3.7. `set` is a hash set: O(1) membership, no duplicates, unordered. `tuple` is immutable and hashable: safe as a dict key, cheaper to create than a list. Pick the right one and a 30-minute batch job stays 30 minutes instead of turning into a 14-hour overnight disaster.
## 📖 The O(n) Bug That Turned a 30-Minute Batch Job into an Overnight Disaster
David works on a data pipeline that processes customer transactions at a fintech startup. Every night, the pipeline loads one million already-processed transaction IDs from a database, then iterates over each incoming event to check whether its ID is already in that loaded collection — a "have we processed this before?" deduplication guard.
David's code looked completely reasonable. He had been writing Python for about a year and was comfortable with lists:
```python
# Load one million processed transaction IDs from the database
processed_ids = []
for row in db.fetch_all_processed():
    processed_ids.append(row["transaction_id"])

# Now check each incoming event — the hidden performance bomb
for event in incoming_events:
    if event["id"] in processed_ids:  # <-- O(n) per check
        continue
    process(event)
```
On a dataset of 1,000 incoming events against 1,000 stored IDs, this ran in under a second. On staging with 10,000 events and 50,000 stored IDs, it was a bit slow but nobody flagged it. Then it hit production: 500,000 incoming events, 1,000,000 stored IDs.
The batch job that completed in 32 minutes on staging ran for 14 hours in production.
The culprit is the single line `if event["id"] in processed_ids`. With a Python list, the `in` operator performs a linear scan — it starts at index 0 and compares each element one by one until it finds a match or exhausts the list. For a list of 1,000,000 items, each membership check is O(n). With 500,000 incoming events, that is 500,000 × 1,000,000 comparisons in the worst case — O(n²) total. That is not a performance concern. That is an algorithm that cannot scale.
The fix is a change to just two lines:
```python
# Change [] to set() — the rest stays the same
processed_ids = set()
for row in db.fetch_all_processed():
    processed_ids.add(row["transaction_id"])

# Now the same check is O(1) regardless of how many IDs are stored
for event in incoming_events:
    if event["id"] in processed_ids:  # O(1) hash lookup
        continue
    process(event)
```
With a set, the in operator runs a hash lookup — O(1) regardless of collection size. The batch job returned to 31 minutes.
This is not a beginner mistake caused by ignorance of Python syntax. It is a mental model gap: if you have never thought about what in does internally for each collection type, you will reach for whichever collection looks most list-like and ship code that behaves correctly in development but catastrophically at production scale.
This post closes that gap. By the end you will understand what each of Python's four core collections — list, dict, set, and tuple — actually does under the hood, when to use each one, and what the performance tradeoffs look like with real numbers.
## 🔍 Python's Four Core Collections at a Glance
Before going deeper, here is the single most important reference table for choosing a collection. Every row describes a behavioral guarantee — not a style preference.
| Collection | Ordered | Mutable | Allows Duplicates | O(1) Lookup | Hashable | Typical Use |
|---|---|---|---|---|---|---|
| `list` | Yes | Yes | Yes | No (O(n)) | No | Sequences, stacks, ordered data |
| `dict` | Yes (3.7+) | Yes | Keys: No / Values: Yes | Yes, by key | No | Key-value pairs, counters, caches |
| `set` | No | Yes | No | Yes | No | Deduplication, fast membership tests |
| `tuple` | Yes | No | Yes | No (O(n)) | Yes | Immutable records, dict keys, named fields |
A few entries deserve immediate clarification.
**Ordered for `dict`:** Since Python 3.7, dictionaries preserve insertion order as a language guarantee. If you add keys `"a"`, `"b"`, `"c"`, iterating over the dict will always yield them in that order. This is no longer just a CPython implementation detail — it is guaranteed behavior.

**O(1) Lookup:** For `list` and `tuple`, the `in` operator scans from start to finish — O(n). For `dict` and `set`, lookup uses a hash table and is O(1) amortized. This difference is the single biggest source of accidental performance bugs in Python.

**Hashable:** An object is hashable if it has a stable `__hash__` value that never changes during its lifetime. Only hashable objects can be used as dict keys or set members. Tuples are hashable when all their elements are hashable. Lists, dicts, and sets are not hashable because they are mutable — their contents can change, which would change their hash and break the hash table.
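The recursive nature of hashability is easy to verify in a REPL. A quick sketch:

```python
# Hashability is recursive: a tuple is hashable only if
# everything inside it is hashable too.
print(hash((1, "a", (2, 3))))   # works: all elements are hashable

try:
    hash((1, [2, 3]))           # a tuple containing a list
except TypeError as e:
    print(f"unhashable: {e}")   # the inner mutable list breaks hashability
```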
## ⚙️ Under the Hood: How Lists, Dicts, Sets, and Tuples Actually Work

### Lists: A Resizable Array With Amortized O(1) Append
Python's list is a dynamic array — a contiguous block of memory holding object references (pointers to Python objects). When you create a list, Python allocates an initial capacity. As you append items, Python fills those pre-allocated slots. When capacity runs out, Python allocates a new, larger block (typically 1.125× to 2× the current capacity), copies all existing references, and frees the old block. This occasional resize is expensive but rare enough that the average cost per append stays constant.
```python
import sys

lst = []
for i in range(8):
    lst.append(i)
    print(f"len={len(lst)}, bytes={sys.getsizeof(lst)}")

# Typical output on CPython 3.12 (exact sizes vary by version):
# len=1, bytes=88   (resize: capacity grew to 4 slots)
# len=2, bytes=88
# len=3, bytes=88
# len=4, bytes=88   (all pre-allocated slots now used)
# len=5, bytes=120  (resize: capacity grew to 8 slots)
# len=6, bytes=120
# len=7, bytes=120
# len=8, bytes=120
```
Because most append calls fill a pre-allocated slot without any reallocation, appending is O(1) amortized. However, insert(0, value) — adding to the beginning — is O(n) because every existing element shifts one position right to make room. Removing the first element with pop(0) is equally O(n). If you need fast insertions and removals at both ends of a sequence, use collections.deque.
Key list operation complexities:

| Operation | Time Complexity | Note |
|---|---|---|
| `append(x)` | O(1) amortized | Occasional resize |
| `pop()` from end | O(1) | Fast |
| `insert(0, x)` | O(n) | Shifts all elements |
| `pop(0)` | O(n) | Shifts all elements |
| `x in list` | O(n) | Linear scan |
| `list[i]` | O(1) | Direct memory offset |
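To see the front-versus-back asymmetry in practice, here is a small self-contained timing sketch; absolute numbers depend on your machine and Python version, only the gap matters:

```python
from collections import deque
import time

N = 50_000

# Drain a list from the front: each pop(0) shifts every remaining element
lst = list(range(N))
start = time.perf_counter()
while lst:
    lst.pop(0)
list_secs = time.perf_counter() - start

# Drain a deque from the front: popleft() is O(1)
dq = deque(range(N))
start = time.perf_counter()
while dq:
    dq.popleft()
deque_secs = time.perf_counter() - start

print(f"list.pop(0) drain: {list_secs:.4f}s  deque.popleft() drain: {deque_secs:.4f}s")
```

On a typical laptop the deque drain finishes in milliseconds while the list drain takes orders of magnitude longer, because the list version does quadratic work overall.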
### Dicts: Hash Tables With Insertion-Order Preservation

Python's `dict` is a hash table. When you store a key-value pair, Python:

1. Calls `hash(key)` to produce an integer
2. Takes that integer modulo the table capacity to find a slot index
3. Stores the key-value pair at that slot

Lookup works the same way: hash the key, jump directly to that slot. No scanning required — hence O(1).
```python
config = {
    "host": "localhost",
    "port": 5432,
    "database": "myapp",
}

# O(1) lookup — Python hashes "port" and jumps directly to its slot
print(config["port"])  # 5432

# Only hashable types can be dict keys
valid_key = (1, 2, 3)
config[valid_key] = "tuple key works"  # OK — tuples are hashable

bad_key = [1, 2, 3]
# config[bad_key] = "fails"  # TypeError: unhashable type: 'list'
```
The load factor is the ratio of filled slots to total capacity. CPython resizes the table when it reaches roughly two-thirds full, keeping collision rates acceptably low. When two keys hash to the same slot — a collision — CPython uses open addressing with a pseudorandom probing sequence to find the next available slot.
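The resize behavior is observable from Python, since `sys.getsizeof` reports the dict's current allocation. A small sketch; the exact byte counts and thresholds are CPython implementation details and vary by version:

```python
import sys

# Watch the hash table grow: getsizeof jumps each time CPython
# resizes the underlying table (sizes shown vary by version)
d = {}
prev = sys.getsizeof(d)
print(f"len=0: {prev} bytes")
for i in range(12):
    d[i] = i
    size = sys.getsizeof(d)
    if size != prev:  # a resize just happened
        print(f"resize after inserting key {i}: {prev} -> {size} bytes (len={len(d)})")
        prev = size
```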
### Sets: Hash Sets for O(1) Membership and Set Algebra
A set is a dict with only keys and no values. The same hash table machinery delivers O(1) add, discard, and in operations.
```python
fruits = {"apple", "banana", "cherry"}
print("banana" in fruits)  # True — O(1)
print("mango" in fruits)   # False — O(1)

# Set algebra built in
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(a | b)  # Union: {1, 2, 3, 4, 5, 6}
print(a & b)  # Intersection: {3, 4}
print(a - b)  # Difference: {1, 2}
print(a ^ b)  # Symmetric diff: {1, 2, 5, 6}
```
Sets are unordered — you cannot rely on any particular iteration order. Sets only accept hashable objects as elements: no lists, dicts, or sets as members. Use frozenset when you need a set that is itself hashable (for example, as a dict key).
### Tuples: Immutable Sequences With a Hashability Superpower
A tuple is like a list but immutable — once created its contents cannot change. This single constraint enables two practical advantages. First, tuples are hashable when all their elements are hashable, making them valid dict keys and set members. Second, Python can allocate tuples more cheaply than lists because it knows the size is fixed — no over-allocation, no resizing bookkeeping.
```python
point = (3, 7)
print(hash(point))  # a stable integer for the tuple's entire lifetime

# Tuples as composite dict keys (lists cannot do this)
distances = {
    (0, 0): 0,
    (3, 4): 5,  # sqrt(3^2 + 4^2) = 5
    (1, 1): 1.41,
}
print(distances[(3, 4)])  # 5

# namedtuple: a tuple with named fields — attribute access without a full class
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(x=3, y=7)
print(p.x, p.y)  # 3 7
print(p[0])      # 3 — still indexable as a regular tuple
```
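The cheaper-allocation claim is visible directly in memory footprints. A tiny comparison sketch; exact byte counts vary by CPython version, so compare the two numbers rather than memorizing either:

```python
import sys

# Same three elements, two containers: the tuple header is smaller
# and a tuple never carries spare capacity for future growth
t_size = sys.getsizeof((1, 2, 3))
l_size = sys.getsizeof([1, 2, 3])
print(f"tuple: {t_size} bytes, list: {l_size} bytes")
```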
Tuples are the right choice for function return values that logically belong together, for dictionary keys that are composite (such as (user_id, session_id)), and for configuration records that should never be mutated at runtime.
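A minimal sketch of both idioms; the `min_max` helper and the session-store dict are illustrative names invented for this example, not APIs from the post:

```python
def min_max(values):
    """Return the smallest and largest value as one logical record."""
    return min(values), max(values)  # packed into a tuple automatically

lo, hi = min_max([7, 2, 9, 4])  # tuple unpacking at the call site
print(lo, hi)  # 2 9

# Composite dict key: index sessions by (user_id, session_id)
sessions = {}
sessions[("u42", "s1")] = {"started": "2024-01-01"}
print(("u42", "s1") in sessions)  # True — one O(1) lookup, no nested dicts
```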
## 🧠 How Python's `dict` and `set` Actually Work Inside CPython

### The Internals of Python Dict
Before Python 3.6, CPython's dict was a standard open-addressing hash table with no ordering. Python 3.6 introduced a compact hash table design that CPython uses today, and Python 3.7 enshrined insertion-order preservation as a language guarantee.
The compact design uses two separate data structures:

- An **indices array** — a compact array that maps hash slot positions to integer offsets into the entries array
- An **entries array** — stores `(hash, key, value)` triples in strict insertion order
When you write `d["name"] = "Alice"`, CPython:

1. Calls `hash("name")` and truncates it to the current table size to get a slot index `s`
2. Looks at `indices[s]`
3. If the slot is empty, writes the next free entries-array index into `indices[s]`, then appends `(hash("name"), "name", "Alice")` to the entries array
4. If the slot is occupied by a different key (a collision), probes further using `i = (i * 5 + perturb + 1) % size`, where `perturb` starts as the original hash and is shifted on each probe, until an empty slot is found
The ordering guarantee emerges naturally from this design: new entries are always appended to the entries array, and iteration walks the entries array in order. Because the entries array never reorders, insertion order is always preserved.
**Key requirements for dict keys:** Keys must implement both `__hash__` (returning a stable integer) and `__eq__` (comparing keys for equality). Strings, integers, floats, frozensets, and tuples of hashable types all qualify. Lists, sets, and dicts do not — they are mutable, so their hash would change when their contents change, making any existing hash-table entry unreachable.
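The entries-array design has one observable consequence worth knowing: deleting a key and re-adding it counts as a fresh insertion, so the key moves to the end, while overwriting an existing key keeps its position. A quick sketch:

```python
d = {"a": 1, "b": 2, "c": 3}
del d["a"]      # the entry becomes a dummy; remaining order is unchanged
d["a"] = 1      # re-insertion appends a fresh entry at the end
print(list(d))  # ['b', 'c', 'a']

# Overwriting an existing key updates the value in place:
# the entry keeps its original position
d["b"] = 99
print(list(d))  # ['b', 'c', 'a']
```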
### Performance Analysis
The table below compares the four core collections on the operations that matter most in day-to-day code:

| Operation | `list` | `dict` | `set` | `tuple` |
|---|---|---|---|---|
| `x in coll` | O(n) | O(1) by key | O(1) | O(n) |
| Index/key access | O(1) by index | O(1) by key | N/A | O(1) by index |
| Append / add | O(1) amortized | O(1) amortized | O(1) amortized | N/A (immutable) |
| Delete by value | O(n) | O(1) by key | O(1) | N/A |
| Iteration | O(n) | O(n) | O(n) | O(n) |
| Memory (1M integers) | ~8 MB | ~85 MB | ~32 MB | ~8 MB |
A concrete membership-test benchmark on random integers (CPython 3.12, averaged over 10 runs) illustrates the real-world difference at several collection sizes:

| Collection size | `list` `in` (item at middle) | `set` `in` | `dict` `in` |
|---|---|---|---|
| 1,000 items | ~0.03 ms | ~0.00006 ms | ~0.00006 ms |
| 100,000 items | ~1.8 ms | ~0.00007 ms | ~0.00007 ms |
| 1,000,000 items | ~18 ms | ~0.00007 ms | ~0.00007 ms |
At one million items, set membership is roughly 250,000× faster than list membership. The set and dict times are essentially flat because hash lookup does not depend on collection size. The list time grows linearly. This is the core lesson of the opening production story.
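A minimal script in the same spirit, if you want to reproduce the shape of these numbers yourself (absolute timings will differ by machine; the flat set curve versus the growing list curve is the point):

```python
import timeit

def bench(n: int):
    """Time 100 membership checks against a list and a set of n integers."""
    data = list(range(n))
    as_set = set(data)
    target = n // 2  # sits mid-list, so each list check scans ~n/2 items
    t_list = timeit.timeit(lambda: target in data, number=100)
    t_set = timeit.timeit(lambda: target in as_set, number=100)
    return t_list, t_set

for n in (1_000, 100_000, 1_000_000):
    t_list, t_set = bench(n)
    print(f"n={n:>9,}: list {t_list:.6f}s  set {t_set:.6f}s")
```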
## 📊 Choosing the Right Collection: A Decision Flowchart
The questions in this flowchart are ordered to eliminate wrong options as fast as possible. Starting from the top, each question cuts the candidate pool in half. The leaf node tells you exactly which collection to use and why.
```mermaid
graph TD
    A[Start: Pick a Collection] --> B{Need fast membership test or key lookup?}
    B -->|Yes| C{Need to associate a value with each key?}
    B -->|No| D{Need to preserve insertion order?}
    C -->|Yes| E[Use dict]
    C -->|No| F{Does the collection itself need to be hashable?}
    F -->|Yes| G[Use frozenset]
    F -->|No| H[Use set]
    D -->|Yes| I{Will the collection change after creation?}
    D -->|No| J[Consider set for automatic dedup]
    I -->|Yes| K[Use list]
    I -->|No| L[Use tuple]
```
Reading the flowchart: if you need fast `in` checks or key lookups, you immediately route to the hash-based types — `dict`, `set`, or `frozenset`. The decisive question is whether you need key-value pairs (`dict`) or just membership (`set`). If you do not need fast lookups, the next question is ordering: if order matters and the data will change, use `list`; if the record should never change, use `tuple`. The `frozenset` branch handles the edge case where you need a hashable set to use as a dict key or as an element of another set.
---
## 🌍 Four Real-World Patterns Using Python Collections in Production
These patterns cover the most common production scenarios. Each one shows why a specific collection is the right tool — not just by convention, but because of its internal behavior.
### Pattern 1 — LRU Cache Backed by dict
A Least-Recently-Used cache needs O(1) key lookup and O(1) eviction. A plain `dict` handles lookup. `collections.OrderedDict` adds the `move_to_end()` method needed to track usage order.
```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order = access order

    def get(self, key: str):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # promote to most-recently-used
        return self.cache[key]

    def put(self, key: str, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least-recently-used

cache = LRUCache(3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
cache.get("a")     # promotes "a"
cache.put("d", 4)  # evicts "b" (least recently used)
print(list(cache.cache.keys()))  # ['c', 'a', 'd']
```
### Pattern 2 — Deduplication Pipeline With `set`
Webhook retry logic and double-delivery from message queues mean the same event often arrives multiple times. A set is the cleanest dedup mechanism — O(1) per lookup, zero configuration.
```python
def deduplicate_events(raw_events: list) -> list:
    """Remove duplicate events by id, preserving first-occurrence order."""
    seen_ids = set()
    unique = []
    for event in raw_events:
        if event["id"] not in seen_ids:  # O(1)
            seen_ids.add(event["id"])
            unique.append(event)
    return unique

events = [
    {"id": "evt_001", "amount": 99.99},
    {"id": "evt_002", "amount": 14.99},
    {"id": "evt_001", "amount": 99.99},  # duplicate webhook retry
    {"id": "evt_003", "amount": 49.99},
]
clean = deduplicate_events(events)
print(len(clean))  # 3 — one duplicate removed
```
### Pattern 3 — Immutable Config With `namedtuple`
Application configuration that is set once at startup and should never change at runtime is a perfect fit for namedtuple. It provides named attribute access, is immutable, and is hashable.
```python
from collections import namedtuple

DatabaseConfig = namedtuple("DatabaseConfig", ["host", "port", "name", "pool_size"])
db_cfg = DatabaseConfig(host="prod-db.internal", port=5432, name="myapp", pool_size=10)

print(db_cfg.host)  # prod-db.internal
print(db_cfg.port)  # 5432

# Immutability prevents accidental mutation
try:
    db_cfg.host = "wrong-host"  # raises AttributeError — by design
except AttributeError as e:
    print(f"Config is immutable: {e}")
```
### Pattern 4 — Stack and Queue With `list` and `deque`
A list works well as a stack (LIFO) because append and pop at the right end are both O(1). But list.pop(0) is O(n), making it a poor queue. Use collections.deque whenever you need O(1) at both ends.
```python
from collections import deque

# Stack with list — O(1) push and pop from the right end only
call_stack = []
call_stack.append("main()")
call_stack.append("connect()")
call_stack.append("authenticate()")
print(call_stack.pop())  # "authenticate()" — O(1)

# Queue with deque — O(1) at both ends
task_queue = deque()
task_queue.append("task_1")  # enqueue right — O(1)
task_queue.append("task_2")
task_queue.append("task_3")
print(task_queue.popleft())  # "task_1" — O(1), vs O(n) for list.pop(0)
```
## ⚖️ The Alternatives You Should Know: `defaultdict`, `deque`, and `frozenset`

### `list` vs `deque` for Queue Workloads
Using a list as a FIFO queue — appending to the right and removing from the left with pop(0) — is one of the most common accidental O(n) patterns in Python. Each pop(0) shifts all remaining elements one position left: O(n) per operation. For a queue of 50,000 items, draining the entire queue is an O(n²) operation. A collections.deque uses a doubly-linked sequence of fixed-size blocks and delivers O(1) appends and pops from either end.
Use list when you only need stack-style access (right end only). Use deque when you need queue-style access (both ends) or a fixed-size sliding window.
### `dict` vs `defaultdict` vs `Counter`
Plain dict requires explicit handling of missing keys, leading to repetitive if key not in d guards. collections.defaultdict eliminates this by accepting a factory function that produces the default value when a missing key is first accessed. collections.Counter goes further: it is a defaultdict(int) with arithmetic operators and a .most_common() method built in.
```python
from collections import defaultdict, Counter

text = "the quick brown fox jumps over the lazy dog the"
words = text.split()

# Plain dict — verbose
freq = {}
for w in words:
    if w not in freq:
        freq[w] = 0
    freq[w] += 1

# defaultdict — cleaner
freq2 = defaultdict(int)
for w in words:
    freq2[w] += 1

# Counter — idiomatic
freq3 = Counter(words)
print(freq3.most_common(3))  # [('the', 3), ('quick', 1), ('brown', 1)]
```
Use Counter for frequency analysis. Use defaultdict for grouping items into lists or accumulating arbitrary values. Use plain dict when you need full explicit control over missing-key behavior.
### `set` vs `frozenset`
set is mutable and unhashable — it cannot be used as a dict key or placed inside another set. frozenset is the immutable, hashable sibling. Use frozenset when the set itself needs to be a dict key, when you need to store sets as elements of other sets, or when you want to communicate clearly that a collection of items will never change.
```python
# frozenset as dict key — impossible with regular set
role_map = {
    frozenset({"read", "write"}): "editor",
    frozenset({"read"}): "viewer",
    frozenset({"read", "write", "admin"}): "owner",
}

user_perms = frozenset({"read", "write"})
print(role_map[user_perms])  # "editor"
```
## 🧭 Matching Use Case to Collection: A Quick Reference
Use this table as a decision checklist when you are about to introduce a new collection variable. Match the row that best describes your access pattern and start from the "Best Collection" column.
| Use Case | Best Collection | Reason |
|---|---|---|
| Membership test in a loop | `set` | O(1) per check |
| Count item frequencies | `Counter` | Built-in tallying and `.most_common()` |
| Map arbitrary keys to values | `dict` | O(1) key lookup |
| Ordered mutable sequence | `list` | Index access O(1), append O(1) amortized |
| Immutable record / composite key | `tuple` | Hashable, lightweight |
| High-throughput FIFO queue | `deque` | O(1) `popleft` |
| LIFO stack | `list` | O(1) append and pop at end |
| Unique items with set algebra | `set` | Union, intersection, difference built in |
| Read-only named configuration | `namedtuple` | Attribute access, immutable, hashable |
| Deduplicate while preserving order | `dict` (keys only) | Keys are unique, insertion order preserved |
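The ordered-dedup row deserves one concrete line of code, since `dict.fromkeys` does the whole job in a single expression:

```python
# Deduplicate while preserving first-seen order: dict keys are unique
# and insertion-ordered, so dict.fromkeys does both jobs at once
events = ["login", "click", "login", "purchase", "click"]
unique_in_order = list(dict.fromkeys(events))
print(unique_in_order)  # ['login', 'click', 'purchase']
```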
## 🧪 Four Worked Python Examples: From Word Counters to Dedup Pipelines
The following four examples are independent programs you can run immediately. Each one demonstrates a different collection choice and explains exactly why that choice matters. Pay attention to where the collection's internal behavior — not just its syntax — drives the outcome.
### Example 1: Word Frequency Counter
This example shows how to count word frequencies using three progressively cleaner approaches. The goal is to illustrate the progression from verbose-but-obvious to idiomatic Python. The Counter approach is not just shorter — it signals intent to every reader without requiring them to trace a for-loop to understand what is being accumulated.
```python
from collections import defaultdict, Counter
import re

text = """
to be or not to be that is the question
whether tis nobler in the mind to suffer
the slings and arrows of outrageous fortune
or to take arms against a sea of troubles
"""
words = re.findall(r'\w+', text.lower())

# Approach 1: plain dict (explicit but verbose)
freq_v1 = {}
for word in words:
    freq_v1[word] = freq_v1.get(word, 0) + 1

# Approach 2: defaultdict (cleaner, no get() needed)
freq_v2 = defaultdict(int)
for word in words:
    freq_v2[word] += 1

# Approach 3: Counter (idiomatic, one line)
freq_v3 = Counter(words)

# All three produce the same counts (ties in most_common keep
# first-encountered order, so 'be' comes before 'or')
print(freq_v3.most_common(5))
# [('to', 4), ('the', 3), ('be', 2), ('or', 2), ('of', 2)]
```
### Example 2: Graph Adjacency List With `set`
A graph represented as a dict-of-sets gives O(1) neighbor lookup and O(1) edge existence checks. A dict-of-lists would make "does edge (A, B) exist?" an O(degree) scan. When a highly-connected node has thousands of neighbors, the set version keeps the check instant.
```python
from collections import defaultdict, deque

# Build a directed graph as dict of sets
graph = defaultdict(set)
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
for src, dst in edges:
    graph[src].add(dst)

# O(1) edge existence check
print("C" in graph["A"])  # True — O(1) hash lookup
print("E" in graph["A"])  # False — O(1) hash lookup

# BFS traversal using deque for O(1) queue operations
def bfs(graph: dict, start: str) -> list:
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()  # O(1) with deque
        if node in visited:     # O(1) with set
            continue
        visited.add(node)
        order.append(node)
        queue.extend(sorted(graph[node]))  # sorted for determinism
    return order

print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D', 'E']
```
### Example 3: O(1) Cache Lookup With `dict`
This example builds a memoization cache manually to make the O(1) dict lookup visible. A manual cache is useful when you need fine-grained control over cache size or eviction. Notice how the cache lookup check if n in fib_cache is O(1) regardless of how many results are stored.
```python
import time

# Manual memoization cache using dict
fib_cache: dict = {}

def fib(n: int) -> int:
    if n in fib_cache:  # O(1) lookup — no scanning
        return fib_cache[n]
    if n <= 1:
        return n
    result = fib(n - 1) + fib(n - 2)
    fib_cache[n] = result  # O(1) insert
    return result

start = time.perf_counter()
answer = fib(45)
elapsed = time.perf_counter() - start
print(f"fib(45) = {answer}, computed in {elapsed:.6f}s")
# fib(45) = 1134903170, computed in well under a millisecond on a typical machine

# Python's standard library provides functools.lru_cache for the same pattern
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_std(n: int) -> int:
    if n <= 1:
        return n
    return fib_std(n - 1) + fib_std(n - 2)

print(fib_std(50))  # 12586269025
```
### Example 4: Data Deduplication Pipeline With `set` and `Counter`
This example builds a full deduplication pipeline that removes duplicate records, reports how many duplicates were found per source system, and returns both the clean dataset and the audit report. It combines a set for O(1) dedup checks and a Counter for per-source reporting.
```python
from collections import Counter

def dedup_pipeline(records: list) -> dict:
    """
    Remove duplicate records by 'id'.
    Returns unique records plus a duplicate report by source.
    """
    seen_ids: set = set()
    unique_records: list = []
    duplicate_sources: list = []

    for record in records:
        if record["id"] in seen_ids:  # O(1) membership check
            duplicate_sources.append(record["source"])
        else:
            seen_ids.add(record["id"])  # O(1) insert
            unique_records.append(record)

    duplicate_report = Counter(duplicate_sources)
    return {
        "unique": unique_records,
        "total_duplicates": len(records) - len(unique_records),
        "duplicates_by_source": dict(duplicate_report),
    }

sample_records = [
    {"id": "r001", "amount": 99.99, "source": "kafka"},
    {"id": "r002", "amount": 14.99, "source": "webhook"},
    {"id": "r001", "amount": 99.99, "source": "kafka"},    # duplicate
    {"id": "r003", "amount": 49.99, "source": "batch"},
    {"id": "r002", "amount": 14.99, "source": "webhook"},  # duplicate
    {"id": "r004", "amount": 29.99, "source": "batch"},
]

result = dedup_pipeline(sample_records)
print(f"Unique records: {len(result['unique'])}")         # 4
print(f"Duplicates found: {result['total_duplicates']}")  # 2
print(f"By source: {result['duplicates_by_source']}")     # {'kafka': 1, 'webhook': 1}
```
## 🛠️ Python's `collections` Module: The Standard Library Upgrades

Python ships a `collections` module that extends the four core types with specialized containers. These are not third-party dependencies — they are part of the standard library and available in every Python environment. Each one solves a specific, recurring pattern more clearly and efficiently than rolling your own with a plain `dict` or `list`.
**`defaultdict(factory)`** — A `dict` subclass that automatically initializes missing keys by calling `factory()`. Use it to build frequency maps, group records by key, and accumulate values without `if key not in d` guards cluttering every update.
```python
from collections import defaultdict

by_letter = defaultdict(list)
for word in ["apple", "ant", "bat", "bear", "cherry"]:
    by_letter[word[0]].append(word)

print(dict(by_letter))
# {'a': ['apple', 'ant'], 'b': ['bat', 'bear'], 'c': ['cherry']}
```
**`Counter(iterable)`** — A `dict` subclass for counting hashable objects. Supports arithmetic on counts with the `+`, `-`, `&`, `|` operators and provides `.most_common(n)` in O(n log n) time.
```python
from collections import Counter

votes = Counter(["Alice", "Bob", "Alice", "Alice", "Bob", "Charlie"])
print(votes.most_common(2))  # [('Alice', 3), ('Bob', 2)]
```
**`deque(maxlen=N)`** — A double-ended queue with O(1) append and pop from both ends. The optional `maxlen` argument creates a fixed-size sliding window that automatically discards the oldest item when a new one is added — no manual length management needed.
```python
from collections import deque

recent = deque(maxlen=3)  # keeps only the last 3 items
for n in range(6):
    recent.append(n)
print(list(recent))  # [3, 4, 5]
```
**`OrderedDict`** — A `dict` subclass that retains insertion order even on Python versions before 3.7. Mostly superseded by plain `dict` in modern Python, but still useful for its `move_to_end(key)` method, which the LRU cache pattern in this post relies on.
**`namedtuple(typename, fields)`** — Creates a new tuple subclass with named fields. Provides attribute-style access, keeps the collection immutable, and incurs almost no memory overhead compared to a plain tuple. For more complex data structures, Python 3.7+ also offers `dataclasses.dataclass` with optional mutability and type annotations.
```python
from collections import namedtuple

Color = namedtuple("Color", ["red", "green", "blue"])
white = Color(255, 255, 255)
print(white.red)              # 255
print(white._asdict())        # {'red': 255, 'green': 255, 'blue': 255}
print(white._replace(red=128))  # Color(red=128, green=255, blue=255)
```
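As a sketch of the `dataclasses` alternative mentioned above (the `ServerConfig` class and its field names are hypothetical, invented for illustration), a frozen dataclass gives the same immutability plus type annotations:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class ServerConfig:
    host: str
    port: int = 8080

cfg = ServerConfig(host="example.internal")
print(cfg.host, cfg.port)  # example.internal 8080

try:
    cfg.port = 9090        # frozen dataclasses reject mutation at runtime
except FrozenInstanceError as e:
    print(f"immutable: {e}")

# Equal field values -> equal generated hash, so instances work as dict keys
print(hash(cfg) == hash(ServerConfig(host="example.internal")))  # True
```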
For a full deep-dive on the collections module and its advanced use cases in production systems, see the Python standard library documentation or a dedicated follow-up post in this series.
## 📚 Lessons Learned from Choosing the Wrong Collection
Every one of these lessons was paid for by a real production incident or a code review that went on far longer than it needed to.
- **The O(n) membership trap is subtle.** `if x in my_list` reads innocuously. It becomes a disaster when the list grows past a few thousand items and the check sits inside a loop. The rule is simple: if you need `in` in a loop, your container is almost always wrong if it is a `list` or `tuple`. Switch to `set` or `dict` keys.
- **"Works on my machine" disguises O(n²) bugs.** Development and staging datasets are small. Production data is not. A function that completes in 0.1 seconds on 1,000 items may need 10,000 seconds on 1,000,000 items if the underlying algorithm is quadratic. Always trace nested loops and ask what each inner operation costs.
- **Tuples as composite dict keys unlock powerful lookups.** If you need to look up data by two or more fields simultaneously — `(user_id, date)` → transaction list, `(source_ip, port)` → connection state — a `dict` keyed by tuples is simpler, faster, and more readable than nested dicts.
- **`Counter` and `defaultdict` signal intent.** Reaching for `Counter(words)` instead of writing a for-loop over a plain `dict` is not just a style choice. It tells every future reader exactly what the code is doing without requiring them to trace the accumulation logic.
- **Dict ordering is a language guarantee since Python 3.7.** You can rely on dicts preserving insertion order in all compliant Python 3.7+ implementations. `OrderedDict` is no longer needed for basic ordered mappings — reserve it for its `move_to_end()` method, which plain `dict` does not provide.
- **Immutability is a feature, not a restriction.** When you use a `tuple` or `namedtuple` for a record that should never change, you turn an accidental mutation into an immediate `AttributeError`. That is a bug caught at the moment it happens, not hours later when an unexpected value silently propagates through your pipeline.
## 📌 TLDR: One Principle for Every Collection Decision

**TLDR:** Python's four core collections have fundamentally different internals. `list` is a dynamic array — fast at the end, O(n) for membership. `dict` is a compact hash table — O(1) key lookup, guaranteed insertion order since 3.7. `set` is a hash set — O(1) membership, automatic deduplication, unordered. `tuple` is immutable and hashable — use it wherever a record must never change or needs to be a dict key. The one rule that prevents most performance bugs: if you need `in` inside a loop, your collection is almost certainly wrong if it is a `list`. Switch to `set` or `dict` keys, and the O(n) becomes O(1).
## 📝 Practice Quiz
Work through these questions before moving on. Questions 1–5 have definitive answers. The open-ended challenge asks you to apply the concepts to a realistic design decision.
1. You have a list of 2,000,000 email addresses loaded from a database. For each incoming HTTP request, you must check whether the request's email address is already in that list. What collection should you use and why?
2. What is the time complexity of `x in my_list` for a Python list with one million elements?
3. Why can a `tuple` be used as a `dict` key but a `list` cannot?
4. You want to count how many times each word appears in a 200 MB text file. Which standard library type is purpose-built for this, and what method returns the top N most frequent words?
5. A colleague writes `queue = []; queue.append(item)` for enqueuing and `queue.pop(0)` for dequeuing. What is the time complexity of `pop(0)` on a list, and what should they use instead?
**Correct Answer (Q1):** Use a `set`. The `in` operator on a set is O(1) because sets use a hash table internally. On a list with 2,000,000 entries, `in` is O(n) — every HTTP request would scan up to 2 million entries, making the service unusable under concurrent load.

**Correct Answer (Q2):** O(n). Python must start at index 0 and scan each element in order until it finds a match or reaches the end. There is no shortcut for unsorted data.

**Correct Answer (Q3):** `tuple` objects are immutable, so their hash value is stable for their entire lifetime. A stable hash is required for an object to live in a hash table. `list` objects are mutable — their contents can change after creation, which would change their hash and leave any hash-table entry that used the old hash permanently unreachable.

**Correct Answer (Q4):** `collections.Counter`. Call `Counter(words)` to build the frequency map in one pass, then call `.most_common(n)` to retrieve the top n entries sorted by count descending.
**Open-ended challenge:** A social network stores each user's list of friends as a Python list. The product team added a "Do you know each other?" feature that checks whether User A appears in User B's friend list and vice versa. With an average of 200 friends per user, each check is fast. The app now targets influencers with 50,000+ followers. Redesign the data structure: what collection replaces list, what happens to the time complexity of the "do you know each other?" check, and how do the add-friend and remove-friend operations change? Consider whether any ordering or duplicate constraints change in the process.
## 🔗 Related Posts

- **Python Basics: Variables, Types, and Control Flow** — The mental model for how Python variables work as labels on objects, truthiness, and the control flow constructs that pair with collection iteration.
- **Python Functions: Parameters, Return Values, and Scope** — How to write clean Python functions that accept and return collections safely, with coverage of the mutable default argument gotcha that trips up every Python developer at least once.
- **Hash Tables Explained** — A deep dive into the hash table data structure that powers Python's `dict` and `set` — collision resolution strategies, load factor tuning, and when chaining beats open addressing.

*Written by Abstract Algorithms (@abstractalgorithms)*