
List Comprehensions, Generators, and Lazy Evaluation in Python

How Python's lazy evaluation turns infinite sequences and billion-row pipelines into single-line code

Abstract Algorithms · 25 min read

AI-assisted content. This post may have been written or enhanced with the help of AI tools. While efforts are made to ensure accuracy, the content may contain errors or inaccuracies. Please verify critical information independently.


📖 The MemoryError That Launched a Thousand Generators

Meet Priya. She is a data engineer at a logistics company, tasked with crunching a 10 GB CSV of shipping events. She opens her laptop, writes what feels like perfectly reasonable Python, and hits run:

# Priya's first attempt — loads the entire file into RAM
import csv

def get_delayed_shipments(filepath):
    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        rows = list(reader)          # <-- reads every row into memory at once

    return [r for r in rows if r["status"] == "delayed"]

delayed = get_delayed_shipments("shipments_2024.csv")
print(f"Found {len(delayed)} delayed shipments")

Three seconds later:

MemoryError: unable to allocate 9.8 GiB for an array

Her laptop has 16 GB of RAM. Python tried to load 10 GB of CSV rows into a single list, and even if that had fit, the comprehension on the next line would have allocated a second, filtered copy. The machine ran out of headroom before it processed a single row.

Here is the same task rewritten with a generator — tested on the same machine, same file:

import csv

def get_delayed_shipments_lazy(filepath):
    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row["status"] == "delayed":
                yield row           # <-- produces one row at a time, then pauses

for shipment in get_delayed_shipments_lazy("shipments_2024.csv"):
    print(shipment["tracking_id"])

Memory usage: under 4 KB. The file never lives in RAM all at once. Python reads one row, tests it, yields it to the caller, and moves on — resuming from exactly where it left off on the next iteration.

This is lazy evaluation: Python defers computation until the caller actually asks for the next value. The generator function does not run to completion; it runs to the next yield, pauses, hands the value to the caller, and waits. The call stack is frozen in place — local variables, loop position, file handle, everything — until next() is called again.
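You can watch the pause-and-resume behaviour by driving a generator manually with next(). Below is a self-contained toy version of the reader; delayed_rows and the inline three-row CSV are illustrative stand-ins for the real file:

```python
import csv
import io

def delayed_rows(lines):
    """Yield only the rows whose status is 'delayed', one at a time."""
    for row in csv.DictReader(lines):
        if row["status"] == "delayed":
            yield row

# Three rows of fake data standing in for the 10 GB file
data = io.StringIO("id,status\n1,on_time\n2,delayed\n3,delayed\n")

gen = delayed_rows(data)   # nothing is read yet -- just a generator object
first = next(gen)          # reads rows until the first match, then pauses
print(first["id"])         # '2' -- row 3 has not been touched yet
```

The second matching row is only parsed when the caller asks for it with another next().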

Understanding why this works, and when to reach for it, is what this post is about.


🔁 Building Sequences the Pythonic Way — Comprehensions and Generator Expressions

Before diving into yield, let's cover the simpler forms of lazy-leaning syntax that Python developers use every day: comprehensions and generator expressions.

List Comprehensions

A list comprehension is a concise, readable way to build a new list from an iterable. The pattern is:

result = [expression for item in iterable if condition]

Compare the classic loop approach with the comprehension:

# Classic loop
squares = []
for x in range(10):
    squares.append(x ** 2)

# List comprehension — same result, one line
squares = [x ** 2 for x in range(10)]
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

You can also nest comprehensions to flatten 2D structures:

# Flatten a 3x3 matrix into a single list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [num for row in matrix for num in row]
# [1, 2, 3, 4, 5, 6, 7, 8, 9]

Dictionary and Set Comprehensions

The same syntax works for dict and set literals:

# Dict comprehension: word → length
words = ["apple", "banana", "cherry"]
word_lengths = {word: len(word) for word in words}
# {'apple': 5, 'banana': 6, 'cherry': 6}

# Set comprehension: unique first letters
first_letters = {word[0] for word in words}
# {'a', 'b', 'c'}

Generator Expressions — Parentheses Instead of Brackets

Here is the subtle change that makes a huge difference. Swap [...] for (...) and you get a generator expression instead of a list:

import sys

# List comprehension — ALL values computed and stored in memory immediately
squares_list = [x ** 2 for x in range(1_000_000)]
print(sys.getsizeof(squares_list))    # ~8,697,464 bytes (~8.3 MB)

# Generator expression — no values stored; each is computed on demand
squares_gen = (x ** 2 for x in range(1_000_000))
print(sys.getsizeof(squares_gen))     # 104 bytes

The generator expression object is just a recipe — a suspended computation. The 104 bytes represent the generator state object itself, not the million values. Those values are computed only when you iterate over squares_gen.

The practical takeaway: if you are immediately passing a sequence to a function that iterates it once (like sum, max, any, all), use a generator expression. If you need to iterate it multiple times, index into it, or check its length, use a list comprehension.

# Idiomatic: generator expression directly inside sum() — no extra list created
total = sum(x ** 2 for x in range(1_000_000))

# Also valid: max of filtered values without allocating the intermediate list
largest_even = max(x for x in range(1000) if x % 2 == 0)

โš™๏ธ How yield Turns Functions Into Lazy Factories

List comprehensions and generator expressions are convenient, but they have limits. For complex multi-step logic, conditional branching, or sequences that depend on external state (like a file handle or a database cursor), you need a generator function — any function that contains the yield keyword.

The yield Keyword — Pause, Return, Resume

When Python encounters yield inside a function, it does something remarkable: it suspends the entire function frame — local variables, the current line number, open loops, everything — packages it into a generator object, and hands the yielded value back to the caller. The function is not done; it is paused. The next time the caller asks for a value (by calling next() on the generator), execution resumes from the exact line after yield.

def count_up(start, stop):
    current = start
    while current < stop:
        print(f"  [generator] about to yield {current}")
        yield current
        print(f"  [generator] resumed after yield, incrementing")
        current += 1
    print("  [generator] loop finished — StopIteration coming")

gen = count_up(1, 4)

print("Calling next() the first time:")
val = next(gen)       # Runs until the first yield, then pauses
print(f"Got: {val}")

print("Calling next() the second time:")
val = next(gen)       # Resumes from after yield, runs until next yield
print(f"Got: {val}")

print("Calling next() the third time:")
val = next(gen)
print(f"Got: {val}")

print("Calling next() a fourth time — will raise StopIteration:")
try:
    next(gen)
except StopIteration:
    print("StopIteration raised — generator is exhausted")

Output:

Calling next() the first time:
  [generator] about to yield 1
Got: 1
Calling next() the second time:
  [generator] resumed after yield, incrementing
  [generator] about to yield 2
Got: 2
Calling next() the third time:
  [generator] resumed after yield, incrementing
  [generator] about to yield 3
Got: 3
Calling next() a fourth time — will raise StopIteration:
  [generator] loop finished — StopIteration coming
StopIteration raised — generator is exhausted

The function body runs in fragments, interleaved with the caller's code. This is the essence of cooperative multitasking baked into Python's iteration protocol.

The generator lifecycle follows four distinct states managed by the CPython runtime. The state machine below shows every transition: from the moment you call the generator function, through each next() call, to the final StopIteration.

graph TD
    A[Call generator function] --> B[GEN_CREATED - frame allocated on heap]
    B --> C[Call next on generator]
    C --> D[GEN_RUNNING - frame pushed onto call stack]
    D --> E{yield reached?}
    E -- yield value --> F[GEN_SUSPENDED - frame saved back to heap]
    F --> G[Return value to caller]
    G --> C
    E -- function body ends --> H[GEN_CLOSED - frame released from heap]
    H --> I[StopIteration raised on any further next call]

The key insight here is that GEN_SUSPENDED stores the entire frame — local variables, the current instruction pointer, and any open loop counters — on the heap. The call stack is freed up for the caller to use. This is what makes it possible to have thousands of concurrent generator pipelines without blowing the call stack.

StopIteration and the for Loop Contract

Python's for loop is built on this protocol. Under the hood, for item in iterable calls iter(iterable) to get an iterator, then repeatedly calls next() on it. When StopIteration is raised, the loop ends cleanly. You never have to handle StopIteration manually when using a for loop โ€” it is caught automatically.

# for loop implicitly calls next() and handles StopIteration
for val in count_up(1, 4):
    print(val)     # prints 1, 2, 3
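That contract can be spelled out by hand. The for loop over count_up (reproduced here without the tracing prints) is equivalent to this explicit while loop — a sketch of what the interpreter does, not code you would normally write:

```python
def count_up(start, stop):
    current = start
    while current < stop:
        yield current
        current += 1

# Manual desugaring of: for val in count_up(1, 4): print(val)
iterator = iter(count_up(1, 4))   # generators are their own iterators
while True:
    try:
        val = next(iterator)      # ask for the next value
    except StopIteration:
        break                     # a for loop catches this for you
    print(val)                    # prints 1, 2, 3
```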

yield from — Delegating to a Sub-Generator

When one generator needs to delegate to another, Python 3.3 introduced yield from. It is equivalent to a for loop that yields each item from the sub-generator, but it is faster and also properly proxies send() and throw() calls through the chain.

def first_three():
    yield 1
    yield 2
    yield 3

def first_six():
    yield from first_three()    # delegates to first_three
    yield from [4, 5, 6]        # also works with any iterable

print(list(first_six()))
# [1, 2, 3, 4, 5, 6]

yield from is especially powerful when flattening nested iterables or composing pipeline stages.
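For example, a recursive flattener for arbitrarily nested lists takes only a few lines; flatten here is an illustrative helper, not a stdlib function:

```python
def flatten(nested):
    """Recursively yield leaf values from arbitrarily nested lists."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)   # delegate to the recursive call
        else:
            yield item

print(list(flatten([1, [2, [3, [4]], 5], 6])))
# [1, 2, 3, 4, 5, 6]
```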


🧠 Under the Hood: Python Frame Objects and the Generator Protocol

The Internals of Generator Execution

When Python calls a regular function, it creates a frame object (PyFrameObject in CPython) on the call stack. The frame holds local variables, the code object, the current instruction pointer, and the evaluation stack. When the function returns, the frame is discarded.

Generator functions work differently. When you call count_up(1, 4), Python does not execute the body. Instead, it creates a generator object that holds a reference to the frame. The frame is kept alive on the heap — not on the call stack — in a suspended state. The instruction pointer is parked at the start of the function body, ready to run to the first yield on the first next() call.

Each call to next(gen) does three things in the CPython bytecode evaluator (ceval.c):

  1. Sets the generator's state from GEN_SUSPENDED to GEN_RUNNING.
  2. Pushes the saved frame back onto the thread's call stack.
  3. Resumes execution at the saved instruction pointer, continuing until the next YIELD_VALUE opcode.

When yield value executes, the bytecode instruction YIELD_VALUE pops the top of the evaluation stack, saves the current instruction pointer into the frame, sets the generator state back to GEN_SUSPENDED, and returns the value to the caller. When the function body runs to completion instead of yielding again, the generator raises StopIteration automatically.

You can inspect this mechanism directly:

def simple_gen():
    yield 1
    yield 2

gen = simple_gen()

# The generator object holds a live frame
print(gen.gi_frame)          # <frame object at 0x...>
print(gen.gi_frame.f_lineno) # first line of the function — body has not started yet

next(gen)   # run to first yield

print(gen.gi_frame.f_lineno) # line of the first yield, where the frame is suspended
print(gen.gi_suspended)      # True — frame is frozen between yields

next(gen)        # run to second yield
next(gen, None)  # exhausts the generator; the default avoids an uncaught StopIteration

print(gen.gi_frame)          # None — frame has been released

This is also why generators are single-pass: once exhausted, the frame is released and there is nothing left to resume. Calling next() on an exhausted generator always raises StopIteration immediately.
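The standard library can report these lifecycle states directly: inspect.getgeneratorstate returns the GEN_* name for any generator object:

```python
from inspect import getgeneratorstate

def two_values():
    yield "a"
    yield "b"

gen = two_values()
print(getgeneratorstate(gen))   # GEN_CREATED -- frame allocated, body not started
next(gen)
print(getgeneratorstate(gen))   # GEN_SUSPENDED -- paused at the first yield
next(gen)
next(gen, None)                 # third call exhausts the generator
print(getgeneratorstate(gen))   # GEN_CLOSED -- frame released
```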

Performance Analysis — List vs. Generator at 1 Million Items

The memory difference between a list and a generator is not just theoretical. Here is a concrete benchmark using sys.getsizeof and the tracemalloc module, which traces actual heap allocations:

import sys
import tracemalloc

def measure_list_vs_generator(n: int = 1_000_000) -> None:
    print(f"Comparison for n={n:,} items\n{'─'*45}")

    # --- List comprehension ---
    tracemalloc.start()
    result_list = [x * 2 for x in range(n)]
    _, list_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # --- Generator expression ---
    tracemalloc.start()
    result_gen = (x * 2 for x in range(n))
    _, gen_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"List comprehension")
    print(f"  sys.getsizeof : {sys.getsizeof(result_list):>12,} bytes")
    print(f"  peak heap     : {list_peak:>12,} bytes")
    print()
    print(f"Generator expression")
    print(f"  sys.getsizeof : {sys.getsizeof(result_gen):>12,} bytes")
    print(f"  peak heap     : {gen_peak:>12,} bytes")
    print()

    ratio = list_peak / gen_peak if gen_peak > 0 else float("inf")
    print(f"Memory ratio (list / gen): {ratio:,.0f}x")

measure_list_vs_generator()

Typical output on CPython 3.11:

Measurement              List comprehension         Generator expression
sys.getsizeof            8,448,728 bytes (~8 MB)    104 bytes
Peak heap allocation     ~8,500,000 bytes           ~500 bytes
Memory ratio             (baseline)                 ~17,000× smaller

The 104-byte figure for sys.getsizeof on a generator reports the size of the generator object shell only — it does not count the referenced frame. Peak heap from tracemalloc is a more honest number: roughly 500 bytes for the frame and its local variables, compared to 8.5 MB for the full list.

Throughput matters too. For a single linear pass, a generator pipeline is competitive with a list comprehension, and sometimes faster because it avoids allocating and garbage-collecting the intermediate list. The overhead per next() call is small — one frame resume operation in the bytecode evaluator. However, if you need random access or multiple passes, that per-call overhead accumulates, and a pre-built list wins.
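You can check the single-pass claim with timeit; the exact numbers vary by CPython version and hardware, so treat this as a measurement harness rather than a fixed result:

```python
import timeit

n = 100_000

# One full pass each: sum over a materialised list vs. a lazy generator
list_time = timeit.timeit(f"sum([x * 2 for x in range({n})])", number=20)
gen_time = timeit.timeit(f"sum(x * 2 for x in range({n}))", number=20)

print(f"list comprehension: {list_time:.3f}s")
print(f"generator expr    : {gen_time:.3f}s")
```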


📊 Lazy Pull vs. Eager Push — Visualizing the Generator Pipeline

The key architectural difference between an eager list pipeline and a lazy generator pipeline is who drives the data flow. In an eager pipeline, each stage pushes its entire output to the next stage in one shot. In a lazy pipeline, the consumer at the end pulls one value at a time, and that pull request propagates backwards through every stage in the chain.

The following sequence diagram shows a three-stage generator pipeline: FileReader → Transform → Filter → Consumer. Each stage is a generator. The consumer drives the whole pipeline with a single next() call.

sequenceDiagram
    participant Consumer
    participant Filter as Filter Generator
    participant Transform as Transform Generator
    participant Reader as File Reader Generator

    Consumer->>Filter: next()
    Filter->>Transform: next()
    Transform->>Reader: next()
    Reader-->>Transform: raw row 1
    Transform-->>Filter: transformed row 1
    Filter-->>Consumer: row 1 passes filter

    Consumer->>Filter: next()
    Filter->>Transform: next()
    Transform->>Reader: next()
    Reader-->>Transform: raw row 2
    Transform-->>Filter: transformed row 2
    Note over Filter,Consumer: row 2 fails filter - loop back
    Filter->>Transform: next()
    Transform->>Reader: next()
    Reader-->>Transform: raw row 3
    Transform-->>Filter: transformed row 3
    Filter-->>Consumer: row 3 passes filter

Notice that at no point does any stage hold more than one row at a time. The file reader has one row open. The transform stage has one transformed row. The filter stage either forwards it or discards it and immediately asks for the next. The consumer sees only the rows that passed the filter, one at a time, without ever knowing how many rows were skipped.

This pull model means you can short-circuit the pipeline for free. If the consumer stops iterating after 10 matching rows, the file reader stops after delivering the last needed row — no matter how many millions of rows remain in the file. This is categorically different from an eager pipeline, where every stage would process all rows before the consumer gets a single result.
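The short-circuit is easy to verify with a counter on the producer side. In this sketch the consumer takes ten even numbers, and the producer is asked for exactly nineteen values (0 through 18), not one more:

```python
import itertools

pulls = {"count": 0}

def instrumented_source():
    """An unbounded producer that records how many values it has yielded."""
    for i in itertools.count():
        pulls["count"] += 1
        yield i

evens = (x for x in instrumented_source() if x % 2 == 0)
first_ten = list(itertools.islice(evens, 10))

print(first_ten)        # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(pulls["count"])   # 19 -- the source stopped as soon as the consumer did
```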


🌍 Where Generators Shine in the Real World

Infinite Sequence Generation

Because a generator only computes the next value on demand, it can represent sequences with no defined end. A while True loop inside a generator produces values forever without consuming infinite memory. The caller controls when to stop.

import itertools

def natural_numbers(start: int = 0):
    """Generates 0, 1, 2, 3, ... indefinitely."""
    n = start
    while True:
        yield n
        n += 1

# Safe: itertools.islice stops after 5 pulls
first_five = list(itertools.islice(natural_numbers(), 5))
print(first_five)   # [0, 1, 2, 3, 4]

Streaming File Processing

Any file that is too large to fit in RAM is a natural candidate for a generator reader. The generator holds only the file handle and the current line — the OS page cache does the rest:

def read_large_csv(filepath: str):
    """Yields one parsed row dict at a time from a large CSV."""
    import csv
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# Downstream code never loads the whole file
for row in read_large_csv("events_2024.csv"):
    process(row)

Data Transformation Pipelines

Because generators are composable, you can chain them into pipelines that mirror Unix pipe syntax. Each stage is a plain generator function:

def parse_rows(rows):
    for row in rows:
        yield {k: v.strip() for k, v in row.items()}

def filter_active(rows):
    for row in rows:
        if row.get("active") == "1":
            yield row

def enrich_with_region(rows, region_map: dict):
    for row in rows:
        row["region"] = region_map.get(row["country_code"], "unknown")
        yield row

# Compose the pipeline — nothing runs until the for loop iterates
raw = read_large_csv("users.csv")
parsed = parse_rows(raw)
active = filter_active(parsed)
enriched = enrich_with_region(active, {"US": "NA", "DE": "EU", "JP": "APAC"})

for user in enriched:
    save_to_database(user)

Each function call returns immediately — no data moves until the for loop issues the first next(). Adding or removing a stage requires touching only one line.

Log Tailing and Event Streaming

Generators model reactive data sources elegantly:

import time

def tail_logfile(filepath: str, poll_interval: float = 0.5):
    """Yields new lines appended to a log file in real time."""
    with open(filepath, "r") as f:
        f.seek(0, 2)    # seek to end
        while True:
            line = f.readline()
            if line:
                yield line.rstrip()
            else:
                time.sleep(poll_interval)

for log_line in tail_logfile("/var/log/app.log"):
    if "ERROR" in log_line:
        send_alert(log_line)

โš–๏ธ When Lazy Evaluation Becomes a Liability

Generators are not always the right tool. Knowing when not to use them is just as important.

Generators Are Single-Pass — You Cannot Rewind

Once a generator is exhausted, it is gone. There is no reset(), no seek(0). If you need to iterate a dataset twice — for example, to compute a mean in the first pass and then a standard deviation in the second — you must either rebuild the generator or convert it to a list first.

data_gen = (float(x) for x in range(100))

# First pass: compute mean
total = sum(data_gen)          # works fine
# Second pass: generator is now exhausted!
count = sum(1 for _ in data_gen)   # returns 0 — the generator gave nothing
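A third option, when the passes can run close together, is itertools.tee, which splits one iterator into several. Note the trade-off: tee buffers every item that one branch has consumed ahead of the other, so fully draining one branch first (as below) buffers the whole dataset and saves nothing over list(); it pays off only when the branches advance roughly in step:

```python
import itertools

data_gen = (float(x) for x in range(100))
pass_one, pass_two = itertools.tee(data_gen)

mean = sum(pass_one) / 100   # first pass drains one branch (tee buffers for the other)
variance = sum((x - mean) ** 2 for x in pass_two) / 100   # second pass still works

print(mean)       # 49.5
print(variance)   # 833.25
```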

Debugging Is Harder

A generator's value does not materialise until you ask for it. Intermediate pipeline stages are invisible in the debugger unless you explicitly consume them. A bug in a transform stage buried five generators deep can surface as a confusing StopIteration in the consumer. Converting suspect stages to lists temporarily is the standard debugging move.

# Debugging trick: force a stage to a list to inspect it
active_users = list(filter_active(parse_rows(raw)))  # now inspectable
print(f"Active users count: {len(active_users)}")

len() Does Not Work on Generator Expressions

Generators do not have a length until exhausted. Code that expects len() will raise a TypeError:

gen = (x for x in range(10))
print(len(gen))    # TypeError: object of type 'generator' has no len()

Memory vs. CPU Trade-off at Small Scale

For small datasets (under ~10,000 items), the per-call overhead of next() can make a generator pipeline marginally slower than a list comprehension. CPython's list comprehension is heavily optimized and uses a tight bytecode loop. Generators pay a small cost per yield to save and restore frame state. This cost is negligible at scale but measurable at small N in microbenchmarks.


🧭 Choosing Between a List Comprehension, Generator Expression, and yield Generator

Use this table to make the decision quickly:

Scenario → best choice (reason):

  - Build a result list you need to index, slice, or sort → [x for x in ...] list comprehension (direct memory access; len() and indexing work)
  - Pass a single-use sequence to sum, max, any, all → (x for x in ...) generator expression (zero intermediate list allocation)
  - Filter or transform in a for loop, result used once → generator expression (avoids holding all values in RAM)
  - Complex multi-step logic with branching or state → yield generator function (generator expressions cannot branch mid-stream)
  - Infinite or unbounded sequence → yield generator function (the only option; a list would be infinite)
  - Streaming file or network data row-by-row → yield generator function (holds only the current record, not the whole file)
  - Need to iterate the same dataset multiple times → list comprehension, or list() a generator once (generators can only be consumed once)
  - Building a composable, stageable pipeline → chained yield generator functions (each stage is independently testable and replaceable)
  - Returning data from a recursive structure → yield from generator function (naturally flattens nested recursion into a flat stream)

🧪 Three Real-World Generator Patterns in Action

The three examples below cover the most common generator use cases: streaming I/O, infinite mathematical sequences, and a chained itertools pipeline. Each example is self-contained and runnable.

Example 1 — Lazy CSV Reader With Row Validation

This pattern is used by data engineers to process files that do not fit in RAM. The generator holds one row in memory at a time, validates it inline, and yields only clean rows to downstream consumers. Notice how the generator handles the file context manager — the file stays open for the entire lifetime of the generator and closes automatically when iteration ends or the generator is garbage-collected.

import csv
from typing import Generator, Dict, Any

def lazy_csv_reader(
    filepath: str,
    required_fields: list[str]
) -> Generator[Dict[str, Any], None, None]:
    """
    Yields validated row dicts from a CSV file one at a time.
    Skips rows that are missing any required field.
    """
    skipped = 0
    yielded = 0

    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Strip whitespace from all values
            row = {k: v.strip() for k, v in row.items()}
            # Skip rows with empty required fields
            if any(not row.get(field) for field in required_fields):
                skipped += 1
                continue
            yielded += 1
            yield row

    print(f"[lazy_csv_reader] yielded={yielded}, skipped={skipped}")

# Usage — never loads the whole file
for record in lazy_csv_reader("shipments.csv", required_fields=["id", "status"]):
    if record["status"] == "delayed":
        print(f"Delayed: {record['id']}")

Example 2 — Infinite Fibonacci Generator

An infinite generator is impossible with a list but trivial with yield. This Fibonacci generator never terminates; callers use itertools.islice or a break condition to take only what they need.

from typing import Generator
import itertools

def fibonacci() -> Generator[int, None, None]:
    """Yields the infinite Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, ..."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# First 15 Fibonacci numbers
first_15 = list(itertools.islice(fibonacci(), 15))
print(first_15)
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

# First Fibonacci number greater than 10,000
large_fib = next(n for n in fibonacci() if n > 10_000)
print(large_fib)   # 10946

# Sum of even Fibonacci numbers below 4 million (Project Euler problem 2)
even_fib_sum = sum(
    n for n in itertools.takewhile(lambda x: x < 4_000_000, fibonacci())
    if n % 2 == 0
)
print(even_fib_sum)   # 4613732

Example 3 — Chained Data Pipeline With itertools

This example demonstrates a realistic ETL pipeline entirely composed of generators and itertools primitives. The pipeline reads from multiple data sources, concatenates them, batches rows for efficient bulk inserts, and counts totals — all without ever holding more than one batch in memory at a time.

import itertools
import csv
from typing import Generator, Iterable, Iterator, TypeVar

T = TypeVar("T")

def read_csv_rows(filepath: str) -> Generator[dict, None, None]:
    with open(filepath, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def normalize_status(rows: Iterable[dict]) -> Generator[dict, None, None]:
    for row in rows:
        row["status"] = row.get("status", "").lower().strip()
        yield row

def filter_by_status(
    rows: Iterable[dict], status: str
) -> Generator[dict, None, None]:
    for row in rows:
        if row["status"] == status:
            yield row

def batch(iterable: Iterable[T], size: int) -> Generator[list[T], None, None]:
    """Groups an iterable into fixed-size lists."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

# Build the pipeline
sources = itertools.chain(
    read_csv_rows("shipments_jan.csv"),
    read_csv_rows("shipments_feb.csv"),
    read_csv_rows("shipments_mar.csv"),
)

pipeline = batch(
    filter_by_status(
        normalize_status(sources),
        status="delayed",
    ),
    size=500,
)

# Execute — only here does any data actually move
total_inserted = 0
for batch_rows in pipeline:
    bulk_insert(batch_rows)      # your database insert function
    total_inserted += len(batch_rows)

print(f"Inserted {total_inserted} delayed shipments")

The batch generator is a reusable utility that works with any iterable. itertools.chain merges three file generators into a single stream without any intermediate list. The entire pipeline runs with memory bounded by one batch of 500 rows at a time.


🛠️ Python's itertools: The Standard Library's Generator Toolkit

itertools is Python's built-in module of fast, memory-efficient building blocks for iterator pipelines. All its functions return iterators (lazy by default), not lists. It is the first place to look before writing a custom generator.

The module is implemented in C, making it faster than equivalent Python loops for most use cases.

chain — Concatenate Multiple Iterables

import itertools

# Merge multiple sequences without creating a combined list
combined = itertools.chain([1, 2, 3], [4, 5], [6])
print(list(combined))   # [1, 2, 3, 4, 5, 6]

# chain.from_iterable unpacks one level of nesting
nested = [[1, 2], [3, 4], [5]]
flat = list(itertools.chain.from_iterable(nested))
print(flat)   # [1, 2, 3, 4, 5]

islice — Lazy Slicing Without Indexing

# Take the first 5 items from any iterator (including infinite ones)
first_five_evens = list(itertools.islice((x for x in itertools.count() if x % 2 == 0), 5))
print(first_five_evens)   # [0, 2, 4, 6, 8]

groupby — Group Consecutive Items by Key

# Group sorted rows by department — input MUST be sorted by key first
data = [
    {"dept": "eng", "name": "Alice"},
    {"dept": "eng", "name": "Bob"},
    {"dept": "mkt", "name": "Carol"},
]

for dept, group in itertools.groupby(data, key=lambda r: r["dept"]):
    members = [r["name"] for r in group]
    print(f"{dept}: {members}")
# eng: ['Alice', 'Bob']
# mkt: ['Carol']

product — Cartesian Product Without Nested Loops

# Generate all size/color combinations without nested list comprehensions
sizes = ["S", "M", "L"]
colors = ["red", "blue"]
sku_pairs = list(itertools.product(sizes, colors))
print(sku_pairs)
# [('S', 'red'), ('S', 'blue'), ('M', 'red'), ('M', 'blue'), ('L', 'red'), ('L', 'blue')]

combinations and combinations_with_replacement

# All unique 2-card hands from a 4-card subset
cards = ["A", "K", "Q", "J"]
hands = list(itertools.combinations(cards, 2))
print(hands)
# [('A', 'K'), ('A', 'Q'), ('A', 'J'), ('K', 'Q'), ('K', 'J'), ('Q', 'J')]

takewhile and dropwhile — Conditional Slicing

# takewhile: yield items while predicate is True, stop at first False
rising = list(itertools.takewhile(lambda x: x < 5, [1, 2, 3, 4, 5, 6, 1, 2]))
print(rising)   # [1, 2, 3, 4]

# dropwhile: skip items while predicate is True, then yield everything
after_drop = list(itertools.dropwhile(lambda x: x < 5, [1, 2, 3, 4, 5, 6, 1, 2]))
print(after_drop)   # [5, 6, 1, 2]

For a deeper dive into the full itertools API — including permutations, cycle, accumulate, and starmap — see the official Python docs for itertools.
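A quick taste of three of those:

```python
import itertools
import operator

# accumulate: running totals (or any binary operation)
print(list(itertools.accumulate([1, 2, 3, 4])))                # [1, 3, 6, 10]
print(list(itertools.accumulate([1, 2, 3, 4], operator.mul)))  # [1, 2, 6, 24]

# starmap: apply a function to pre-bundled argument tuples
print(list(itertools.starmap(pow, [(2, 3), (3, 2), (10, 2)]))) # [8, 9, 100]

# cycle: repeat a sequence forever -- pair with islice to stay finite
print(list(itertools.islice(itertools.cycle("AB"), 5)))        # ['A', 'B', 'A', 'B', 'A']
```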


📚 Lessons Learned from Chasing Memory Savings With Generators

Lesson 1 — The 104-byte illusion can mislead you. sys.getsizeof(gen) reports the generator object shell, not the frame. Use tracemalloc for honest peak-memory measurements. Production profiling is the only way to know whether a generator actually reduces your process's RSS.

Lesson 2 — Generators are viral. Once you make one stage of a pipeline lazy, all downstream stages must also be iteration-friendly. If your final consumer calls len() or slices with [0:10], it will break. Audit the full pipeline before switching to generators in production code.

Lesson 3 — Debugging generators requires materialising them temporarily. When a multi-stage generator pipeline produces wrong results, the fastest fix is to add list() around suspect stages during debugging. This converts the lazy stream into a visible, inspectable snapshot.

Lesson 4 — yield from is not just syntactic sugar. yield from sub_gen properly proxies .send(value) and .throw(exception) into the sub-generator. A manual for item in sub_gen: yield item loop does not — it silently swallows .send() values. Use yield from whenever you delegate.
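A minimal sketch of that difference (sink and good_delegator are illustrative names):

```python
def sink(log):
    """Sub-generator that records every value sent into it."""
    while True:
        log.append((yield))

def good_delegator(log):
    yield from sink(log)   # .send() passes straight through to sink

log = []
gen = good_delegator(log)
next(gen)                  # prime the generator to the first yield
gen.send("hello")
gen.send("world")
print(log)                 # ['hello', 'world']
```

With a manual `for item in sink(log): yield item` loop in the delegator, the sent values would be consumed by the loop machinery instead of reaching sink.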

Lesson 5 — itertools first, custom generator second. Before writing a custom grouping, slicing, or combining generator, check itertools. The C-implemented versions are significantly faster and battle-tested. A custom batch() function is fine; a custom chain() is unnecessary.

Lesson 6 — Generator expressions inside function calls need no extra parentheses. sum(x**2 for x in range(n)) is valid — the function call's parentheses serve double duty. Writing sum((x**2 for x in range(n))) is redundant (though harmless).


📌 TLDR — Lazy Evaluation in Three Sentences

TLDR: A list comprehension builds the full collection immediately; a generator expression or yield function produces values one at a time, keeping memory usage constant regardless of input size. Use generator expressions when passing a sequence to sum, max, or any once; use yield generator functions when logic is complex, the sequence is infinite, or data streams from a file or network. Chain generator stages into a pipeline with itertools.chain, and reach for itertools.islice, groupby, and takewhile before writing custom iteration logic.


📝 Practice Quiz — Comprehensions, Generators, and Lazy Evaluation

Test your understanding of the concepts in this post.

  1. What is the memory size reported by sys.getsizeof for a generator expression over 1 million integers, compared to a list comprehension over the same range?

    Correct Answer: The generator expression reports approximately 104 bytes (the object shell), while the list comprehension reports roughly 8–9 MB. The generator stores no values — it stores only the frame and execution state.

  2. What happens when you call next() on an exhausted Python generator?

    Correct Answer: It raises StopIteration. Once a generator function's body has finished executing (either by reaching the end of the function or by falling through the last yield), every subsequent next() call raises StopIteration immediately.

  3. What is the difference between [x**2 for x in range(10)] and (x**2 for x in range(10))?

    Correct Answer: Square brackets produce a list comprehension — all 10 values are computed immediately and stored in a list in memory. Parentheses produce a generator expression — no values are computed yet; each is computed on demand when the generator is iterated.

  4. Why must the input to itertools.groupby be sorted by the grouping key before calling it?

    Correct Answer: itertools.groupby groups consecutive items with the same key, not all items with the same key. If the input is unsorted, rows with the same key that appear non-consecutively will be placed into separate groups. Pre-sorting by the key ensures all matching rows are adjacent and thus grouped correctly.

  5. What does yield from sub_generator do that a manual for item in sub_generator: yield item loop does not?

    Correct Answer: yield from properly proxies .send(value) and .throw(exception) calls into the sub-generator, enabling two-way communication through the delegation chain. The manual loop silently swallows .send() values, making it incompatible with coroutine-style generators that use send().

  6. Given gen = (x for x in range(5)), what does list(gen) + list(gen) evaluate to, and why?

    Correct Answer: It evaluates to [0, 1, 2, 3, 4], not [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]. The first list(gen) exhausts the generator. The second list(gen) receives an already-exhausted generator and returns an empty list. Generators are single-pass and cannot be rewound.

  7. Open-ended challenge: Design a generator pipeline that reads from a directory of log files (each potentially gigabytes in size), extracts lines matching a regex pattern, parses each match into a structured dict, and batches the results into groups of 1000 for bulk database inserts — all without loading more than one batch into memory at a time. How would you handle the case where a directory has thousands of files, and how would you make each stage independently unit-testable?


Written by Abstract Algorithms (@abstractalgorithms)