List Comprehensions, Generators, and Lazy Evaluation in Python
How Python's lazy evaluation turns infinite sequences and billion-row pipelines into single-line code
By Abstract Algorithms
The MemoryError That Launched a Thousand Generators
Meet Priya. She is a data engineer at a logistics company, tasked with crunching a 10 GB CSV of shipping events. She opens her laptop, writes what feels like perfectly reasonable Python, and hits run:
```python
# Priya's first attempt - loads the entire file into RAM
import csv

def get_delayed_shipments(filepath):
    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        rows = list(reader)  # <-- reads every row into memory at once
    return [r for r in rows if r["status"] == "delayed"]

delayed = get_delayed_shipments("shipments_2024.csv")
print(f"Found {len(delayed)} delayed shipments")
```
Three seconds later:
MemoryError: unable to allocate 9.8 GiB for an array
Her laptop has 16 GB of RAM. Python tried to load 10 GB of CSV rows into a single list, and then the list comprehension allocated a second copy. The machine ran out of headroom before it processed a single row.
Here is the same task rewritten with a generator, tested on the same machine, same file:
```python
import csv

def get_delayed_shipments_lazy(filepath):
    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row["status"] == "delayed":
                yield row  # <-- produces one row at a time, then pauses

for shipment in get_delayed_shipments_lazy("shipments_2024.csv"):
    print(shipment["tracking_id"])
```
Memory usage: under 4 KB. The file never lives in RAM all at once. Python reads one row, tests it, yields it to the caller, and moves on, resuming from exactly where it left off on the next iteration.
This is lazy evaluation: Python defers computation until the caller actually asks for the next value. The generator function does not run to completion; it runs to the next yield, pauses, hands the value to the caller, and waits. The function's frame is frozen in place, local variables, loop position, file handle and all, until next() is called again.
Understanding why this works, and when to reach for it, is what this post is about.
Building Sequences the Pythonic Way: Comprehensions and Generator Expressions
Before diving into yield, let's cover the simpler forms Python developers use every day: comprehensions, which are eager, and generator expressions, which are lazy.
List Comprehensions
A list comprehension is a concise, readable way to build a new list from an iterable. The pattern is:
result = [expression for item in iterable if condition]
Compare the classic loop approach with the comprehension:
```python
# Classic loop
squares = []
for x in range(10):
    squares.append(x ** 2)

# List comprehension - same result, one line
squares = [x ** 2 for x in range(10)]
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```
You can also nest comprehensions to flatten 2D structures:
```python
# Flatten a 3x3 matrix into a single list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [num for row in matrix for num in row]
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```
Dictionary and Set Comprehensions
The same syntax works for dict and set literals:
```python
# Dict comprehension: word -> length
words = ["apple", "banana", "cherry"]
word_lengths = {word: len(word) for word in words}
# {'apple': 5, 'banana': 6, 'cherry': 6}

# Set comprehension: unique first letters
first_letters = {word[0] for word in words}
# {'a', 'b', 'c'}
```
Generator Expressions: Parentheses Instead of Brackets
Here is the subtle change that makes a huge difference. Swap [...] for (...) and you get a generator expression instead of a list:
```python
import sys

# List comprehension - ALL values computed and stored in memory immediately
squares_list = [x ** 2 for x in range(1_000_000)]
print(sys.getsizeof(squares_list))  # ~8,697,464 bytes (~8.3 MB)

# Generator expression - no values stored; each is computed on demand
squares_gen = (x ** 2 for x in range(1_000_000))
print(sys.getsizeof(squares_gen))   # 104 bytes
```
The generator expression object is just a recipe, a suspended computation. The 104 bytes represent the generator state object itself, not the million values. Those values are computed only when you iterate over squares_gen.
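One way to see that nothing is computed up front is to put a visible side effect inside the expression; the call fires only when a value is actually pulled. A small sketch, using a hypothetical noisy_square helper:

```python
def noisy_square(x):
    # The print fires only when this value is actually requested
    print(f"computing {x} ** 2")
    return x ** 2

gen = (noisy_square(x) for x in range(3))
print("generator created - nothing computed yet")

first = next(gen)   # only now does noisy_square(0) run
print(first)        # 0
print(list(gen))    # computes the rest: [1, 4]
```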
The practical takeaway: if you are immediately passing a sequence to a function that iterates it once (like sum, max, any, all), use a generator expression. If you need to iterate it multiple times, index into it, or check its length, use a list comprehension.
```python
# Idiomatic: generator expression directly inside sum() - no extra list created
total = sum(x ** 2 for x in range(1_000_000))

# Also valid: max of filtered values without allocating an intermediate list
largest_even = max(x for x in range(1000) if x % 2 == 0)
```
How yield Turns Functions Into Lazy Factories
List comprehensions and generator expressions are convenient, but they have limits. For complex multi-step logic, conditional branching, or sequences that depend on external state (like a file handle or a database cursor), you need a generator function: any function that contains the yield keyword.
The yield Keyword: Pause, Return, Resume
When Python encounters yield inside a function, it does something remarkable: it suspends the entire function frame (local variables, the current line number, open loops, everything) inside the generator object and hands the yielded value back to the caller. The function is not done; it is paused. The next time the caller asks for a value (by calling next() on the generator), execution resumes from the exact line after yield.
```python
def count_up(start, stop):
    current = start
    while current < stop:
        print(f"  [generator] about to yield {current}")
        yield current
        print("  [generator] resumed after yield, incrementing")
        current += 1
    print("  [generator] loop finished - StopIteration coming")

gen = count_up(1, 4)

print("Calling next() the first time:")
val = next(gen)  # Runs until the first yield, then pauses
print(f"Got: {val}")

print("Calling next() the second time:")
val = next(gen)  # Resumes from after yield, runs until next yield
print(f"Got: {val}")

print("Calling next() the third time:")
val = next(gen)
print(f"Got: {val}")

print("Calling next() a fourth time - will raise StopIteration:")
try:
    next(gen)
except StopIteration:
    print("StopIteration raised - generator is exhausted")
```
Output:
```
Calling next() the first time:
  [generator] about to yield 1
Got: 1
Calling next() the second time:
  [generator] resumed after yield, incrementing
  [generator] about to yield 2
Got: 2
Calling next() the third time:
  [generator] resumed after yield, incrementing
  [generator] about to yield 3
Got: 3
Calling next() a fourth time - will raise StopIteration:
  [generator] loop finished - StopIteration coming
StopIteration raised - generator is exhausted
```
The function body runs in fragments, interleaved with the caller's code. This is the essence of cooperative multitasking baked into Python's iteration protocol.
The generator lifecycle follows four distinct states managed by the CPython runtime. The state machine below shows every transition: from the moment you call the generator function, through each next() call, to the final StopIteration.
```mermaid
graph TD
    A[Call generator function] --> B[GEN_CREATED - frame allocated on heap]
    B --> C[Call next on generator]
    C --> D[GEN_RUNNING - frame pushed onto call stack]
    D --> E{yield reached?}
    E -- yield value --> F[GEN_SUSPENDED - frame saved back to heap]
    F --> G[Return value to caller]
    G --> C
    E -- function body ends --> H[GEN_CLOSED - frame released from heap]
    H --> I[StopIteration raised on any further next call]
```
The key insight here is that GEN_SUSPENDED stores the entire frame (local variables, the current instruction pointer, and any open loop counters) on the heap. The call stack is freed up for the caller to use. This is what makes it possible to have thousands of concurrent generator pipelines without blowing the call stack.
StopIteration and the for Loop Contract
Python's for loop is built on this protocol. Under the hood, for item in iterable calls iter(iterable) to get an iterator, then repeatedly calls next() on it. When StopIteration is raised, the loop ends cleanly. You never have to handle StopIteration manually when using a for loop; it is caught automatically.
```python
# for loop implicitly calls next() and handles StopIteration
for val in count_up(1, 4):
    print(val)  # prints 1, 2, 3
```
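A minimal desugaring of that loop, re-declaring count_up here without the trace prints, makes the contract explicit:

```python
def count_up(start, stop):
    current = start
    while current < stop:
        yield current
        current += 1

# Roughly what `for val in count_up(1, 4)` expands to:
iterator = iter(count_up(1, 4))  # a generator is its own iterator
while True:
    try:
        val = next(iterator)
    except StopIteration:
        break       # the for loop catches this for you
    print(val)      # prints 1, 2, 3
```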
yield from: Delegating to a Sub-Generator
When one generator needs to delegate to another, Python 3.3 introduced yield from. It is equivalent to a for loop that yields each item from the sub-generator, but it is faster and also properly proxies send() and throw() calls through the chain.
```python
def first_three():
    yield 1
    yield 2
    yield 3

def first_six():
    yield from first_three()  # delegates to first_three
    yield from [4, 5, 6]      # also works with any iterable

print(list(first_six()))
# [1, 2, 3, 4, 5, 6]
```
yield from is especially powerful when flattening nested iterables or composing pipeline stages.
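The send() proxying is easy to demonstrate. A short sketch with hypothetical echo and delegator generators: values sent to the outer generator reach the inner one unchanged, which a manual for-loop delegation would not do.

```python
def echo():
    # Coroutine-style generator: collects whatever the caller send()s in
    received = []
    while True:
        value = yield received
        received.append(value)

def delegator():
    yield from echo()  # send() and throw() pass straight through

gen = delegator()
print(next(gen))       # [] - primed to the first yield inside echo()
print(gen.send("a"))   # ['a'] - the inner generator received it
print(gen.send("b"))   # ['a', 'b']
```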
Under the Hood: Python Frame Objects and the Generator Protocol
The Internals of Generator Execution
When Python calls a regular function, it creates a frame object (PyFrameObject in CPython) on the call stack. The frame holds local variables, the code object, the current instruction pointer, and the evaluation stack. When the function returns, the frame is discarded.
Generator functions work differently. When you call count_up(1, 4), Python does not execute the body. Instead, it creates a generator object that holds a reference to the frame. The frame is kept alive on the heap, not on the call stack, in a suspended state, with the instruction pointer parked at the start of the function body, before the first yield has run.
Each call to next(gen) does three things in the CPython bytecode evaluator (ceval.c):
1. Sets the generator's state from `GEN_SUSPENDED` to `GEN_RUNNING`.
2. Pushes the saved frame back onto the thread's call stack.
3. Resumes execution at the saved instruction pointer, continuing until the next `YIELD_VALUE` opcode.
When yield value executes, the bytecode instruction YIELD_VALUE pops the top of the evaluation stack, saves the current instruction pointer into the frame, sets the generator state back to GEN_SUSPENDED, and returns the value to the caller. When the function body finishes without hitting a yield, the generator raises StopIteration automatically.
You can inspect this mechanism directly:
```python
import inspect

def simple_gen():
    yield 1
    yield 2

gen = simple_gen()

# The generator object holds a live frame before it even starts
print(gen.gi_frame)                    # <frame object at 0x...>
print(inspect.getgeneratorstate(gen))  # 'GEN_CREATED'

next(gen)                              # run to first yield
print(gen.gi_frame.f_lineno)           # line number of the first yield
print(inspect.getgeneratorstate(gen))  # 'GEN_SUSPENDED' - frozen between yields

next(gen)                              # run to second yield
next(gen, None)                        # exhaust; the default suppresses StopIteration
print(gen.gi_frame)                    # None - frame has been released
```
This is also why generators are single-pass: once exhausted, the frame is released and there is nothing left to resume. Calling next() on an exhausted generator always raises StopIteration immediately.
Performance Analysis: List vs. Generator at 1 Million Items
The memory difference between a list and a generator is not just theoretical. Here is a concrete benchmark using sys.getsizeof and the tracemalloc module, which traces actual heap allocations:
```python
import sys
import tracemalloc

def measure_list_vs_generator(n: int = 1_000_000) -> None:
    print(f"Comparison for n={n:,} items\n{'-' * 45}")

    # --- List comprehension ---
    tracemalloc.start()
    result_list = [x * 2 for x in range(n)]
    _, list_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # --- Generator expression ---
    tracemalloc.start()
    result_gen = (x * 2 for x in range(n))
    _, gen_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print("List comprehension")
    print(f"  sys.getsizeof : {sys.getsizeof(result_list):>12,} bytes")
    print(f"  peak heap     : {list_peak:>12,} bytes")
    print()
    print("Generator expression")
    print(f"  sys.getsizeof : {sys.getsizeof(result_gen):>12,} bytes")
    print(f"  peak heap     : {gen_peak:>12,} bytes")
    print()

    ratio = list_peak / gen_peak if gen_peak > 0 else float("inf")
    print(f"Memory ratio (list / gen): {ratio:,.0f}x")

measure_list_vs_generator()
```
Typical output on CPython 3.11:
| Measurement | List comprehension | Generator expression |
|---|---|---|
| sys.getsizeof | 8,448,728 bytes (~8 MB) | 104 bytes |
| Peak heap allocation | ~8,500,000 bytes | ~500 bytes |
| Memory ratio | baseline | ~17,000× smaller |
The 104-byte figure for sys.getsizeof on a generator reports the size of the generator object shell only; it does not count the referenced frame. Peak heap from tracemalloc is a more honest number: roughly 500 bytes for the frame and its local variables, compared to 8.5 MB for the full list.
Throughput matters too. For a single linear pass, a generator pipeline is competitive with a list comprehension, and sometimes faster because it avoids allocating and garbage-collecting the intermediate list. The overhead per next() call is small: one frame resume operation in the bytecode evaluator. However, if you need random access or multiple passes, that per-call overhead accumulates, and a pre-built list wins.
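A quick timeit sketch of the single-pass case; absolute numbers vary by machine and Python version, so treat the comparison as illustrative rather than authoritative:

```python
import timeit

n = 100_000
reps = 20

# Single linear pass: sum a derived sequence both ways
list_time = timeit.timeit("sum([x * 2 for x in range(n)])",
                          globals={"n": n}, number=reps)
gen_time = timeit.timeit("sum(x * 2 for x in range(n))",
                         globals={"n": n}, number=reps)

print(f"list comprehension + sum : {list_time:.3f}s")
print(f"generator expression sum : {gen_time:.3f}s")
```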
Lazy Pull vs. Eager Push: Visualizing the Generator Pipeline
The key architectural difference between an eager list pipeline and a lazy generator pipeline is who drives the data flow. In an eager pipeline, each stage pushes its entire output to the next stage in one shot. In a lazy pipeline, the consumer at the end pulls one value at a time, and that pull request propagates backwards through every stage in the chain.
The following sequence diagram shows a three-stage generator pipeline: FileReader -> Transform -> Filter -> Consumer. Each stage is a generator. The consumer drives the whole pipeline with a single next() call.
```mermaid
sequenceDiagram
    participant Consumer
    participant Filter as Filter Generator
    participant Transform as Transform Generator
    participant Reader as File Reader Generator

    Consumer->>Filter: next()
    Filter->>Transform: next()
    Transform->>Reader: next()
    Reader-->>Transform: raw row 1
    Transform-->>Filter: transformed row 1
    Filter-->>Consumer: row 1 passes filter

    Consumer->>Filter: next()
    Filter->>Transform: next()
    Transform->>Reader: next()
    Reader-->>Transform: raw row 2
    Transform-->>Filter: transformed row 2
    Note over Filter,Consumer: row 2 fails filter - loop back
    Filter->>Transform: next()
    Transform->>Reader: next()
    Reader-->>Transform: raw row 3
    Transform-->>Filter: transformed row 3
    Filter-->>Consumer: row 3 passes filter
```
Notice that at no point does any stage hold more than one row at a time. The file reader has one row open. The transform stage has one transformed row. The filter stage either forwards it or discards it and immediately asks for the next. The consumer sees only the rows that passed the filter, one at a time, without ever knowing how many rows were skipped.
This pull model means you can short-circuit the pipeline for free. If the consumer stops iterating after 10 matching rows, the file reader stops after delivering the last needed row, no matter how many millions of rows remain in the file. This is categorically different from an eager pipeline, where every stage would process all rows before the consumer gets a single result.
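The short-circuit is easy to verify with an instrumented stand-in for the file reader (stats and counting_reader are illustrative names, not part of the pipeline above):

```python
import itertools

stats = {"rows_read": 0}

def counting_reader(n):
    # Stand-in for a file reader that counts how many rows it produced
    for i in range(n):
        stats["rows_read"] += 1
        yield {"id": i, "status": "delayed" if i % 3 == 0 else "on_time"}

delayed = (row for row in counting_reader(1_000_000) if row["status"] == "delayed")
first_ten = list(itertools.islice(delayed, 10))

print(len(first_ten))      # 10
print(stats["rows_read"])  # 28 - the reader never touched the other 999,972 rows
```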
Where Generators Shine in the Real World
Infinite Sequence Generation
Because a generator only computes the next value on demand, it can represent sequences with no defined end. A while True loop inside a generator produces values forever without consuming infinite memory. The caller controls when to stop.
```python
import itertools

def natural_numbers(start: int = 0):
    """Generates 0, 1, 2, 3, ... indefinitely."""
    n = start
    while True:
        yield n
        n += 1

# Safe: itertools.islice stops after 5 pulls
first_five = list(itertools.islice(natural_numbers(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```
Streaming File Processing
Any file that is too large to fit in RAM is a natural candidate for a generator reader. The generator holds only the file handle and the current line; the OS page cache does the rest:
```python
import csv

def read_large_csv(filepath: str):
    """Yields one parsed row dict at a time from a large CSV."""
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# Downstream code never loads the whole file
for row in read_large_csv("events_2024.csv"):
    process(row)
```
Data Transformation Pipelines
Because generators are composable, you can chain them into pipelines that mirror Unix pipe syntax. Each stage is a plain generator function:
```python
def parse_rows(rows):
    for row in rows:
        yield {k: v.strip() for k, v in row.items()}

def filter_active(rows):
    for row in rows:
        if row.get("active") == "1":
            yield row

def enrich_with_region(rows, region_map: dict):
    for row in rows:
        row["region"] = region_map.get(row["country_code"], "unknown")
        yield row

# Compose the pipeline - nothing runs until the for loop iterates
raw = read_large_csv("users.csv")
parsed = parse_rows(raw)
active = filter_active(parsed)
enriched = enrich_with_region(active, {"US": "NA", "DE": "EU", "JP": "APAC"})

for user in enriched:
    save_to_database(user)
```
Each function call returns immediately; no data moves until the for loop issues the first next(). Adding or removing a stage requires touching only one line.
Log Tailing and Event Streaming
Generators model reactive data sources elegantly:
```python
import time

def tail_logfile(filepath: str, poll_interval: float = 0.5):
    """Yields new lines appended to a log file in real time."""
    with open(filepath, "r") as f:
        f.seek(0, 2)  # seek to end of file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip()
            else:
                time.sleep(poll_interval)

for log_line in tail_logfile("/var/log/app.log"):
    if "ERROR" in log_line:
        send_alert(log_line)
```
When Lazy Evaluation Becomes a Liability
Generators are not always the right tool. Knowing when not to use them is just as important.
Generators Are Single-Pass: You Cannot Rewind
Once a generator is exhausted, it is gone. There is no reset(), no seek(0). If you need to iterate a dataset twice, for example to compute a mean in the first pass and a standard deviation in the second, you must either rebuild the generator or convert it to a list first.
```python
data_gen = (float(x) for x in range(100))

# First pass: consume every value
total = sum(data_gen)  # works fine

# Second pass: generator is now exhausted!
count = sum(1 for _ in data_gen)  # returns 0 - the generator gave nothing
```
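Two standard workarounds, sketched below: wrap creation in a factory function and call it once per pass, or use itertools.tee to split one iterator into independent branches. Note the tee caveat: values one branch has seen but the other has not are buffered internally, so fully draining one branch before the other buffers everything.

```python
import itertools

def data_gen():
    # Factory: each call returns a fresh, unexhausted generator
    return (float(x) for x in range(100))

mean = sum(data_gen()) / 100        # pass 1
count = sum(1 for _ in data_gen())  # pass 2 - a brand-new generator
print(mean, count)                  # 49.5 100

# Alternative: tee one iterator into two independent branches.
# Draining `a` completely before `b` forces tee to buffer all 100 values.
a, b = itertools.tee(data_gen())
print(sum(a), sum(1 for _ in b))    # 4950.0 100
```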
Debugging Is Harder
A generator's value does not materialise until you ask for it. Intermediate pipeline stages are invisible in the debugger unless you explicitly consume them. A bug in a transform stage buried five generators deep can surface as a confusing StopIteration in the consumer. Converting suspect stages to lists temporarily is the standard debugging move.
```python
# Debugging trick: force a stage to a list to inspect it
active_users = list(filter_active(parse_rows(raw)))  # now inspectable
print(f"Active users count: {len(active_users)}")
```
len() Does Not Work on Generator Expressions
Generators do not have a length until exhausted. Code that expects len() will raise a TypeError:
```python
gen = (x for x in range(10))
print(len(gen))  # TypeError: object of type 'generator' has no len()
```
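When a count is genuinely needed, a minimal sketch of the two usual options: count by consuming (constant memory, but the generator is spent), or materialise once into a list:

```python
gen = (x for x in range(10))

# Count by consuming - O(1) memory, but this exhausts the generator
count = sum(1 for _ in gen)
print(count)        # 10
print(list(gen))    # [] - nothing left after counting

# If you need both the length and the values, materialise once
values = [x for x in range(10)]
print(len(values))  # 10
```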
Memory vs. CPU Trade-off at Small Scale
For small datasets (under ~10,000 items), the per-call overhead of next() can make a generator pipeline marginally slower than a list comprehension. CPython's list comprehension is heavily optimized and uses a tight bytecode loop. Generators pay a small cost per yield to save and restore frame state. This cost is negligible at scale but measurable at small N in microbenchmarks.
Choosing Between a List Comprehension, Generator Expression, and yield Generator
Use this table to make the decision quickly:
| Scenario | Best choice | Reason |
|---|---|---|
| Build a result list you need to index, slice, or sort | `[x for x in ...]` list comprehension | Direct memory access; `len()` and indexing work |
| Pass a single-use sequence to `sum`, `max`, `any`, `all` | `(x for x in ...)` generator expression | Zero intermediate list allocation |
| Filter or transform in a `for` loop, result used once | Generator expression | Avoids holding all values in RAM |
| Complex multi-step logic with branching or state | `yield` generator function | Generator expressions cannot branch mid-stream |
| Infinite or unbounded sequence | `yield` generator function | Only option - a list would be infinite |
| Streaming file or network data row-by-row | `yield` generator function | Holds only the current record, not the whole file |
| Need to iterate the same dataset multiple times | List comprehension (or `list()` a generator once) | Generators can only be consumed once |
| Building a composable, stageable pipeline | Chained `yield` generator functions | Each stage is independently testable and replaceable |
| Returning data from a recursive structure | `yield from` generator function | Naturally flattens nested recursion into a flat stream |
Three Real-World Generator Patterns in Action
The three examples below cover the most common generator use cases: streaming I/O, infinite mathematical sequences, and a chained itertools pipeline. Each example is self-contained and runnable.
Example 1: Lazy CSV Reader With Row Validation
This pattern is used by data engineers to process files that do not fit in RAM. The generator holds one row in memory at a time, validates it inline, and yields only clean rows to downstream consumers. Notice how the generator handles the file context manager: the file stays open for the entire lifetime of the generator and closes automatically when iteration ends or the generator is garbage-collected.
```python
import csv
from typing import Any, Dict, Generator

def lazy_csv_reader(
    filepath: str,
    required_fields: list[str],
) -> Generator[Dict[str, Any], None, None]:
    """
    Yields validated row dicts from a CSV file one at a time.
    Skips rows that are missing any required field.
    """
    skipped = 0
    yielded = 0
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for line_num, row in enumerate(reader, start=2):  # line 1 is the header
            # Strip whitespace from all values
            row = {k: v.strip() for k, v in row.items()}
            # Skip rows with empty required fields
            if any(not row.get(field) for field in required_fields):
                skipped += 1
                continue
            yielded += 1
            yield row
    print(f"[lazy_csv_reader] yielded={yielded}, skipped={skipped}")

# Usage - never loads the whole file
for record in lazy_csv_reader("shipments.csv", required_fields=["id", "status"]):
    if record["status"] == "delayed":
        print(f"Delayed: {record['id']}")
```
Example 2: Infinite Fibonacci Generator
An infinite generator is impossible with a list but trivial with yield. This Fibonacci generator never terminates; callers use itertools.islice or a break condition to take only what they need.
```python
import itertools
from typing import Generator

def fibonacci() -> Generator[int, None, None]:
    """Yields the infinite Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, ..."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# First 15 Fibonacci numbers
first_15 = list(itertools.islice(fibonacci(), 15))
print(first_15)
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

# First Fibonacci number greater than 10,000
large_fib = next(n for n in fibonacci() if n > 10_000)
print(large_fib)  # 10946

# Sum of even Fibonacci numbers below 4 million (Project Euler problem 2)
even_fib_sum = sum(
    n for n in itertools.takewhile(lambda x: x < 4_000_000, fibonacci())
    if n % 2 == 0
)
print(even_fib_sum)  # 4613732
```
Example 3: Chained Data Pipeline With itertools
This example demonstrates a realistic ETL pipeline entirely composed of generators and itertools primitives. The pipeline reads from multiple data sources, concatenates them, batches rows for efficient bulk inserts, and counts totals, all without ever holding more than one batch in memory at a time.
```python
import csv
import itertools
from typing import Generator, Iterable, TypeVar

T = TypeVar("T")

def read_csv_rows(filepath: str) -> Generator[dict, None, None]:
    with open(filepath, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def normalize_status(rows: Iterable[dict]) -> Generator[dict, None, None]:
    for row in rows:
        row["status"] = row.get("status", "").lower().strip()
        yield row

def filter_by_status(
    rows: Iterable[dict], status: str
) -> Generator[dict, None, None]:
    for row in rows:
        if row["status"] == status:
            yield row

def batch(iterable: Iterable[T], size: int) -> Generator[list[T], None, None]:
    """Groups an iterable into fixed-size lists."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

# Build the pipeline
sources = itertools.chain(
    read_csv_rows("shipments_jan.csv"),
    read_csv_rows("shipments_feb.csv"),
    read_csv_rows("shipments_mar.csv"),
)
pipeline = batch(
    filter_by_status(
        normalize_status(sources),
        status="delayed",
    ),
    size=500,
)

# Execute - only here does any data actually move
total_inserted = 0
for batch_rows in pipeline:
    bulk_insert(batch_rows)  # your database insert function
    total_inserted += len(batch_rows)
print(f"Inserted {total_inserted} delayed shipments")
The batch generator is a reusable utility that works with any iterable. itertools.chain merges three file generators into a single stream without any intermediate list. The entire pipeline runs with memory bounded by one batch of 500 rows at a time.
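Worth knowing: since Python 3.12 the standard library ships this utility as itertools.batched, which yields tuples rather than lists:

```python
import itertools
import sys

if sys.version_info >= (3, 12):
    # batched yields fixed-size tuples, with a shorter final chunk
    print(list(itertools.batched(range(8), 3)))  # [(0, 1, 2), (3, 4, 5), (6, 7)]
else:
    print("itertools.batched requires Python 3.12+")
```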
Python's itertools: The Standard Library's Generator Toolkit
itertools is Python's built-in module of fast, memory-efficient building blocks for iterator pipelines. All its functions return iterators (lazy by default), not lists. It is the first place to look before writing a custom generator.
The module is implemented in C, making it faster than equivalent Python loops for most use cases.
chain: Concatenate Multiple Iterables
```python
import itertools

# Merge multiple sequences without creating a combined list
combined = itertools.chain([1, 2, 3], [4, 5], [6])
print(list(combined))  # [1, 2, 3, 4, 5, 6]

# chain.from_iterable unpacks one level of nesting
nested = [[1, 2], [3, 4], [5]]
flat = list(itertools.chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5]
```
islice: Lazy Slicing Without Indexing
```python
# Take the first 5 items from any iterator (including infinite ones)
first_five_evens = list(
    itertools.islice((x for x in itertools.count() if x % 2 == 0), 5)
)
print(first_five_evens)  # [0, 2, 4, 6, 8]
```
groupby: Group Consecutive Items by Key
```python
# Group sorted rows by department - input MUST be sorted by key first
data = [
    {"dept": "eng", "name": "Alice"},
    {"dept": "eng", "name": "Bob"},
    {"dept": "mkt", "name": "Carol"},
]
for dept, group in itertools.groupby(data, key=lambda r: r["dept"]):
    members = [r["name"] for r in group]
    print(f"{dept}: {members}")
# eng: ['Alice', 'Bob']
# mkt: ['Carol']
```
product: Cartesian Product Without Nested Loops
```python
# Generate all size/color combinations without nested list comprehensions
sizes = ["S", "M", "L"]
colors = ["red", "blue"]
sku_pairs = list(itertools.product(sizes, colors))
print(sku_pairs)
# [('S', 'red'), ('S', 'blue'), ('M', 'red'), ('M', 'blue'), ('L', 'red'), ('L', 'blue')]
```
combinations and combinations_with_replacement
```python
# All unique 2-card hands from a 4-card subset
cards = ["A", "K", "Q", "J"]
hands = list(itertools.combinations(cards, 2))
print(hands)
# [('A', 'K'), ('A', 'Q'), ('A', 'J'), ('K', 'Q'), ('K', 'J'), ('Q', 'J')]
```
takewhile and dropwhile: Conditional Slicing
```python
# takewhile: yield items while predicate is True, stop at first False
rising = list(itertools.takewhile(lambda x: x < 5, [1, 2, 3, 4, 5, 6, 1, 2]))
print(rising)  # [1, 2, 3, 4]

# dropwhile: skip items while predicate is True, then yield everything
after_drop = list(itertools.dropwhile(lambda x: x < 5, [1, 2, 3, 4, 5, 6, 1, 2]))
print(after_drop)  # [5, 6, 1, 2]
```
For a deeper dive into the full itertools API, including permutations, cycle, accumulate, and starmap, see the official Python docs for itertools.
Lessons Learned from Chasing Memory Savings With Generators
Lesson 1: The 104-byte illusion can mislead you. `sys.getsizeof(gen)` reports the generator object shell, not the frame. Use `tracemalloc` for honest peak-memory measurements. Production profiling is the only way to know whether a generator actually reduces your process's RSS.
Lesson 2: Generators are viral. Once you make one stage of a pipeline lazy, all downstream stages must also be iteration-friendly. If your final consumer calls `len()` or slices with `[0:10]`, it will break. Audit the full pipeline before switching to generators in production code.
Lesson 3: Debugging generators requires materialising them temporarily. When a multi-stage generator pipeline produces wrong results, the fastest fix is to add `list()` around suspect stages during debugging. This converts the lazy stream into a visible, inspectable snapshot.
Lesson 4: `yield from` is not just syntactic sugar. `yield from sub_gen` properly proxies `.send(value)` and `.throw(exception)` into the sub-generator. A manual `for item in sub_gen: yield item` loop does not; it silently swallows `.send()` values. Use `yield from` whenever you delegate.
Lesson 5: itertools first, custom generator second. Before writing a custom grouping, slicing, or combining generator, check `itertools`. The C-implemented versions are significantly faster and battle-tested. A custom `batch()` function is fine; a custom `chain()` is unnecessary.
Lesson 6: Generator expressions inside function calls need no extra parentheses. `sum(x**2 for x in range(n))` is valid; the function call's parentheses serve double duty. Writing `sum((x**2 for x in range(n)))` is redundant (though harmless).
TLDR: Lazy Evaluation in Three Sentences
TLDR: A list comprehension builds the full collection immediately; a generator expression or `yield` function produces values one at a time, keeping memory usage constant regardless of input size. Use generator expressions when passing a sequence to `sum`, `max`, or `any` once; use `yield` generator functions when logic is complex, the sequence is infinite, or data streams from a file or network. Chain generator stages into a pipeline with `itertools.chain`, and reach for `itertools.islice`, `groupby`, and `takewhile` before writing custom iteration logic.
Practice Quiz: Comprehensions, Generators, and Lazy Evaluation
Test your understanding of the concepts in this post.
1. What is the memory size reported by `sys.getsizeof` for a generator expression over 1 million integers, compared to a list comprehension over the same range?

Correct Answer: The generator expression reports approximately 104 bytes (the object shell), while the list comprehension reports roughly 8-9 MB. The generator stores no values; it stores only the frame and execution state.

2. What happens when you call `next()` on an exhausted Python generator?

Correct Answer: It raises `StopIteration`. Once a generator function's body has finished executing (either by reaching the end of the function or by falling through the last `yield`), every subsequent `next()` call raises `StopIteration` immediately.

3. What is the difference between `[x**2 for x in range(10)]` and `(x**2 for x in range(10))`?

Correct Answer: Square brackets produce a list comprehension: all 10 values are computed immediately and stored in a list in memory. Parentheses produce a generator expression: no values are computed yet; each is computed on demand when the generator is iterated.

4. Why must the input to `itertools.groupby` be sorted by the grouping key before calling it?

Correct Answer: `itertools.groupby` groups consecutive items with the same key, not all items with the same key. If the input is unsorted, rows with the same key that appear non-consecutively will be placed into separate groups. Pre-sorting by the key ensures all matching rows are adjacent and thus grouped correctly.

5. What does `yield from sub_generator` do that a manual `for item in sub_generator: yield item` loop does not?

Correct Answer: `yield from` properly proxies `.send(value)` and `.throw(exception)` calls into the sub-generator, enabling two-way communication through the delegation chain. The manual loop silently swallows `.send()` values, making it incompatible with coroutine-style generators that use `send()`.

6. Given `gen = (x for x in range(5))`, what does `list(gen) + list(gen)` evaluate to, and why?

Correct Answer: It evaluates to `[0, 1, 2, 3, 4]`, not `[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]`. The first `list(gen)` exhausts the generator. The second `list(gen)` receives an already-exhausted generator and returns an empty list. Generators are single-pass and cannot be rewound.

7. Open-ended challenge: Design a generator pipeline that reads from a directory of log files (each potentially gigabytes in size), extracts lines matching a regex pattern, parses each match into a structured dict, and batches the results into groups of 1000 for bulk database inserts, all without loading more than one batch into memory at a time. How would you handle the case where a directory has thousands of files, and how would you make each stage independently unit-testable?