
Chain of Thought Prompting: Teaching LLMs to Think Step by Step

The prompting technique that unlocked multi-step reasoning in LLMs — and how to use it effectively in production

Abstract Algorithms · 28 min read

TLDR: Chain of Thought (CoT) prompting tells a language model to reason out loud before answering. By generating intermediate steps, the model steers itself toward correct conclusions — turning guesswork into structured reasoning. It's the difference between asking a student to "just write the answer" and asking them to "show their work."

🔥 The Answer Was "95" — And It Was Wrong

Here is the kind of failure large language models produce every day.

Prompt (no reasoning guidance):

Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 balls. How many tennis balls does he have now?

Answer:

Model output: 11

That answer is actually correct: this problem is too simple to expose the failure mode. Let's use a harder example to make the problem crystal clear.

Prompt (a harder version):

A store has 48 apples. They sell 1/3 of them in the morning,
then receive a new shipment of 15 apples. They then sell half
of what they have. How many apples remain?

Answer:

Model output (without CoT): 24

Let's check: start with 48 → sell 1/3 (16) → 32 remain → add 15 → 47 → sell half → 23.5 remain. The exact answer isn't even a whole number, so the model's 24 is a defensible rounding rather than a clear failure. To see models break consistently, we need a multi-hop variation:

Prompt (multi-hop logic):

There are 3 boxes. Box A has twice as many balls as Box B.
Box C has 5 fewer balls than Box A. Box B has 10 balls.
How many balls are in all three boxes combined?

Answer:

Model output (without CoT): 45

Correct answer: Box B = 10, Box A = 20, Box C = 15 → total = 45. Again correct — but now watch what happens when we add one more hop and the model starts hallucinating:

Prompt (four-hop version that breaks models):

Box B has 10 balls. Box A has twice as many as Box B.
Box C has 5 fewer than Box A. Box D has three times as many
as Box C. What is the total?

Answer:

Model output (without CoT): 95

Correct answer: B=10, A=20, C=15, D=45 → total = 90. Off by 5, because the model collapsed the intermediate steps.

Now, with Chain of Thought:

Box B has 10 balls. Box A has twice as many as Box B.
Box C has 5 fewer than Box A. Box D has three times as many
as Box C. What is the total?

Let's think step by step.

Model output (with CoT):

Step 1: Box B = 10 balls.
Step 2: Box A = 2 × Box B = 2 × 10 = 20 balls.
Step 3: Box C = Box A - 5 = 20 - 5 = 15 balls.
Step 4: Box D = 3 × Box C = 3 × 15 = 45 balls.
Step 5: Total = 10 + 20 + 15 + 45 = 90 balls.

Answer: 90

Result: correct. Five words — "Let's think step by step" — turned a wrong answer into a right one. That is Chain of Thought prompting in its simplest form. The rest of this post explains why it works, how to use it effectively, and when to skip it entirely.


📖 What Chain of Thought Prompting Actually Does Under the Hood

Chain of Thought prompting is a technique where you instruct a language model to generate intermediate reasoning steps before producing a final answer. It was formally introduced by Google researchers Wei et al. in their 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."

But CoT is not magic, and it's important to understand why it works. Language models are next-token predictors: given a sequence of tokens, they predict the most likely next token according to patterns learned during training. When you ask a model to skip straight to an answer, the output distribution for the final answer token is shaped only by the question tokens — which gives the model very little "working memory."

When you add intermediate reasoning steps, each step becomes a new set of tokens that constrains the distribution for subsequent tokens. The model effectively uses its own prior output as working memory, steering the generation toward paths that are internally consistent with the earlier steps. The reasoning chain acts as scaffolding: each sentence narrows the probability space for what comes next.

Think of it like a student working through a math problem. If you ask them to just write "42" at the bottom of the page, their brain has to hold all the intermediate calculations in short-term memory. Ask them to write each step on the page, and suddenly those intermediate answers become permanent checkpoints — they can build each new step on a verified foundation.


🔍 The CoT Family: Zero-Shot, Few-Shot, Auto-CoT, and Beyond

Chain of Thought is not a single technique — it's a family of related methods. Understanding the landscape helps you pick the right variant for each task.

Zero-Shot CoT: "Let's think step by step"

Introduced by Kojima et al. (2022), Zero-Shot CoT adds no examples — just an instruction. The phrase "Let's think step by step" appended to a question is enough to trigger structured reasoning in sufficiently large models (generally 7B+ parameters).

Prompt:

Q: If a train travels 60 miles per hour and needs to cover 210 miles,
   how long will the trip take?

Let's think step by step.

Few-Shot CoT: Teaching by Example

The original Wei et al. formulation provides 2–8 worked examples in the prompt. Each example includes a question, an explicit chain of reasoning steps, and the final answer. The model learns the reasoning format from the examples and applies it to the new question.

Prompt structure:

Q: [Example 1 question]
A: [Step 1...] [Step 2...] ... The answer is X.

Q: [Example 2 question]
A: [Step 1...] [Step 2...] ... The answer is Y.

Q: [New question]
A:

Auto-CoT: Generating Examples Automatically

Few-shot CoT requires hand-crafted examples — expensive and brittle. Auto-CoT (Zhang et al., 2022) uses Zero-Shot CoT to automatically generate reasoning chains for a pool of questions, then selects diverse examples via clustering to build the few-shot prompt. No human annotation required.
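The selection loop is easy to sketch. Below is a toy, offline version of the Auto-CoT idea; the fake_zero_shot_chain stub and the word-count "clustering" are hypothetical stand-ins for a real LLM call and the embedding-based k-means the paper uses.

```python
from collections import defaultdict

def fake_zero_shot_chain(question: str) -> str:
    # Hypothetical stand-in for a real Zero-Shot CoT call that would
    # return the model's step-by-step reasoning for this question.
    return f"{question}\nLet's think step by step.\n[reasoning steps here]"

def auto_cot_build_prompt(questions: list[str], k: int = 2) -> str:
    """Auto-CoT sketch: bucket questions into k crude clusters
    (word-count parity here, standing in for embedding k-means),
    take one representative per cluster, and generate its reasoning
    chain with Zero-Shot CoT to form the few-shot prompt."""
    clusters = defaultdict(list)
    for q in questions:
        clusters[len(q.split()) % k].append(q)  # toy "clustering"
    demos = [
        f"Q: {clusters[bucket][0]}\nA: {fake_zero_shot_chain(clusters[bucket][0])}"
        for bucket in sorted(clusters)
    ]
    return "\n\n".join(demos)
```

The point of the clustering step is diversity: one worked example per cluster keeps the few-shot prompt from being dominated by near-duplicate questions.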

Self-Consistency CoT: Majority-Vote Over Multiple Reasoning Paths

Self-Consistency (Wang et al., 2022) samples the model multiple times with temperature > 0, generating several different reasoning paths. The final answer is chosen by majority vote across those paths. This significantly improves accuracy at the cost of 3–10x more tokens (and API cost).

Tree of Thought: Branching Exploration

Tree of Thought (Yao et al., 2023) extends CoT from a linear chain into a tree structure. The model generates multiple candidate reasoning branches at each step, evaluates each branch, and explores the most promising paths — similar to search algorithms. Best for problems requiring exploration and backtracking.


Comparison at a glance:

| Variant | Needs Examples | Cost | Best For | Min Model Size |
|---|---|---|---|---|
| Zero-Shot CoT | No | Low | Quick reasoning, arithmetic | ~7B params |
| Few-Shot CoT | Yes (2–8) | Low-Medium | Consistent format tasks | ~7B params |
| Auto-CoT | No (auto-generated) | Medium | Scalable pipelines | ~13B params |
| Self-Consistency | No | High (3–10x) | High-stakes accuracy | ~13B params |
| Tree of Thought | No | Very High | Exploration, planning, puzzles | ~70B params |

📊 Visualizing How Reasoning Flows Differently Across CoT Variants

The three main CoT approaches — standard prompting, linear chain-of-thought, and Tree of Thought — differ dramatically in the path they take from question to answer. The diagram below makes that structural difference explicit at a glance. Standard prompting takes a single direct hop, which fails when multiple reasoning steps are needed. Linear CoT breaks the problem into a verified sequence of steps, each anchored by the previous one. Tree of Thought explores several candidate paths in parallel, scores them, and commits only to the strongest branch — enabling backtracking when a reasoning path hits a dead end.

flowchart TD
    Q[Question]

    Q --> S[Standard Prompt]
    S --> A1[Final Answer - direct]

    Q --> C[Zero-Shot or Few-Shot CoT]
    C --> S1[Step 1]
    S1 --> S2[Step 2]
    S2 --> S3[Step 3]
    S3 --> A2[Final Answer]

    Q --> T[Tree of Thought]
    T --> B1[Branch A: reasoning path 1]
    T --> B2[Branch B: reasoning path 2]
    T --> B3[Branch C: reasoning path 3]
    B1 --> E1[Evaluate branch A]
    B2 --> E2[Evaluate branch B - best]
    B3 --> E3[Prune branch C]
    E2 --> A3[Final Answer]

Notice how standard prompting collapses the entire solution into a single token prediction, while linear CoT creates a series of intermediate "checkpoints" the model can verify as it goes. For most day-to-day production tasks, linear CoT delivers around 80% of Tree of Thought's accuracy at roughly 5% of the token cost — making it the go-to default.


⚙️ Implementing CoT in Python: Zero-Shot, Few-Shot, and Self-Consistency

This section shows three runnable Python patterns you can drop into any project that uses the OpenAI API. Each pattern builds on the previous one.

Zero-Shot CoT Template

The simplest possible implementation — append the magic phrase and let the model do the rest.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

def zero_shot_cot(question: str) -> str:
    """Appends 'Let's think step by step' and returns the full reasoning + answer."""
    prompt = f"{question}\n\nLet's think step by step."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a careful reasoning assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # deterministic for reproducibility
    )
    return response.choices[0].message.content

# Example usage
question = (
    "A factory produces 240 widgets per day. "
    "It operates 6 days a week. "
    "How many widgets does it produce in 4 weeks?"
)
print(zero_shot_cot(question))

Expected output (abbreviated):

Step 1: Widgets per day = 240.
Step 2: Days per week = 6, so widgets per week = 240 × 6 = 1440.
Step 3: Weeks = 4, so total = 1440 × 4 = 5760.
Answer: 5760 widgets.

Few-Shot CoT with Three Worked Examples

When the task has a consistent structure (e.g., always arithmetic word problems), few-shot examples teach the model the exact reasoning format you want.

FEW_SHOT_SYSTEM_PROMPT = """You are a math tutor. Solve each problem step by step,
labelling each step clearly, then state the final answer on a new line as:
'Answer: <number>'"""

FEW_SHOT_EXAMPLES = [
    {
        "question": "Sam has 12 apples. He gives away 4. Then he buys 3 more bags of 6 apples each. How many does he have?",
        "answer": (
            "Step 1: Sam starts with 12 apples.\n"
            "Step 2: He gives away 4, so 12 - 4 = 8 apples remain.\n"
            "Step 3: He buys 3 bags × 6 apples = 18 new apples.\n"
            "Step 4: Total = 8 + 18 = 26 apples.\n"
            "Answer: 26"
        ),
    },
    {
        "question": "A car travels at 80 km/h for 2.5 hours. How far does it travel?",
        "answer": (
            "Step 1: Speed = 80 km/h, Time = 2.5 hours.\n"
            "Step 2: Distance = Speed × Time = 80 × 2.5 = 200 km.\n"
            "Answer: 200"
        ),
    },
    {
        "question": "A cinema has 300 seats. 60% are occupied on Monday and 80% on Tuesday. How many more seats were occupied on Tuesday?",
        "answer": (
            "Step 1: Monday occupancy = 300 × 0.60 = 180 seats.\n"
            "Step 2: Tuesday occupancy = 300 × 0.80 = 240 seats.\n"
            "Step 3: Difference = 240 - 180 = 60 seats.\n"
            "Answer: 60"
        ),
    },
]

def build_few_shot_messages(question: str) -> list[dict]:
    messages = [{"role": "system", "content": FEW_SHOT_SYSTEM_PROMPT}]
    for example in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example["question"]})
        messages.append({"role": "assistant", "content": example["answer"]})
    messages.append({"role": "user", "content": question})
    return messages

def few_shot_cot(question: str) -> str:
    messages = build_few_shot_messages(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

# Example usage
new_question = (
    "A swimming pool holds 50,000 litres. "
    "It drains at 200 litres/minute. "
    "How many hours does it take to fully drain?"
)
print(few_shot_cot(new_question))

Self-Consistency CoT: Majority Vote for Higher Accuracy

Self-consistency generates multiple independent reasoning paths and picks the most common answer. It is more expensive but measurably more accurate on hard problems.

from collections import Counter
import re

def extract_final_answer(text: str) -> str:
    """Pull the numeric answer from a CoT response."""
    match = re.search(r"Answer:\s*([0-9,.]+)", text, re.IGNORECASE)
    if match:
        return match.group(1).replace(",", "").strip()
    # Fallback: last number in the text
    numbers = re.findall(r"\b\d+(?:\.\d+)?\b", text)
    return numbers[-1] if numbers else "unknown"

def self_consistency_cot(question: str, n_samples: int = 5) -> str:
    """
    Sample n_samples independent reasoning chains with temperature > 0,
    then return the most common final answer (majority vote).
    """
    answers = []
    full_responses = []

    for _ in range(n_samples):
        messages = build_few_shot_messages(question)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0.7,  # non-zero to get diverse reasoning paths
        )
        text = response.choices[0].message.content
        full_responses.append(text)
        answers.append(extract_final_answer(text))

    # Majority vote
    vote_counts = Counter(answers)
    winner, count = vote_counts.most_common(1)[0]

    print(f"Votes: {dict(vote_counts)}")
    print(f"Winning answer: {winner} ({count}/{n_samples} paths agree)")
    return winner

# Example usage
hard_question = (
    "A recipe requires 2.5 cups of flour per batch. "
    "You want to make 7 batches but only have 15 cups. "
    "How many more cups of flour do you need?"
)
self_consistency_cot(hard_question, n_samples=5)
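The voting logic can be exercised offline, with no API calls. The snippet below repeats extract_final_answer so it runs standalone, then majority-votes three simulated reasoning chains (the chains themselves are made up for illustration):

```python
from collections import Counter
import re

def extract_final_answer(text: str) -> str:
    # Same extraction logic as above, repeated so this snippet is standalone.
    match = re.search(r"Answer:\s*([0-9,.]+)", text, re.IGNORECASE)
    if match:
        return match.group(1).replace(",", "").strip()
    numbers = re.findall(r"\b\d+(?:\.\d+)?\b", text)
    return numbers[-1] if numbers else "unknown"

# Three simulated chains for the flour question: two agree,
# one contains an arithmetic slip.
chains = [
    "Step 1: 2.5 × 7 = 17.5 cups needed.\nStep 2: 17.5 - 15 = 2.5.\nAnswer: 2.5",
    "Step 1: Need 2.5 × 7 = 17.5.\nStep 2: Have 15, short by 2.5.\nAnswer: 2.5",
    "Step 1: 2.5 × 7 = 18.5 (slip).\nStep 2: 18.5 - 15 = 3.5.\nAnswer: 3.5",
]
votes = Counter(extract_final_answer(c) for c in chains)
winner, count = votes.most_common(1)[0]
print(winner, count)  # 2.5 2 — the correct answer wins despite one faulty chain
```

This is the whole trick: individual chains are noisy, but independent errors rarely agree on the same wrong number, so the mode of the answers is much more reliable than any single sample.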

🧪 Step-by-Step Walkthroughs: Math and Logic in Action

To make the mechanics concrete, let's trace through two worked examples — one arithmetic and one logical deduction — showing the exact prompt and the reasoning chain the model produces.

Worked Example 1 — Multi-step arithmetic: This example chains four sequential calculations. Without CoT, models frequently drop intermediate results. With CoT, each calculated value is pinned in the output, available for the next step.

Prompt:

A train departs at 08:00 and travels 3 legs:
- Leg 1: 120 km at 60 km/h
- Leg 2: 90 km at 45 km/h
- Leg 3: 150 km at 100 km/h
What time does it arrive? Let's think step by step.

Reasoning chain:

Step 1: Leg 1 time = 120 ÷ 60 = 2 hours.
Step 2: Leg 2 time = 90 ÷ 45 = 2 hours.
Step 3: Leg 3 time = 150 ÷ 100 = 1.5 hours.
Step 4: Total travel = 2 + 2 + 1.5 = 5.5 hours.
Step 5: Departure 08:00 + 5.5 hours = 13:30.
Answer: 13:30
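Every step in this chain is mechanically checkable, which is part of CoT's appeal. A quick stdlib verification of the arithmetic (the date below is an arbitrary placeholder; only the time of day matters):

```python
from datetime import datetime, timedelta

# Re-derive each leg of the chain above and check the arrival time.
legs = [(120, 60), (90, 45), (150, 100)]           # (distance km, speed km/h)
hours = sum(dist / speed for dist, speed in legs)  # 2 + 2 + 1.5 = 5.5
arrival = datetime(2024, 1, 1, 8, 0) + timedelta(hours=hours)
print(arrival.strftime("%H:%M"))  # 13:30
```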

Worked Example 2 — Logical deduction: This example tests the model on a syllogism with four entities. Without CoT, models often hallucinate a relationship. With CoT, the chain of inferences is made explicit.

Prompt:

All mammals are warm-blooded.
All dogs are mammals.
Rex is a dog.
Is Rex warm-blooded? Let's think step by step.

Reasoning chain:

Step 1: The first premise states all mammals are warm-blooded.
Step 2: The second premise states all dogs are mammals.
Step 3: Therefore, all dogs are warm-blooded (by transitivity of set membership).
Step 4: Rex is a dog (third premise).
Step 5: Therefore, Rex is warm-blooded.
Answer: Yes, Rex is warm-blooded.

Both examples follow the same pattern: decompose, compute or deduce one hop at a time, chain outputs forward. Notice how each step explicitly references its inputs — this is what prevents the model from "forgetting" an earlier calculated value.
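The transitivity in Worked Example 2 can be made explicit in code. A toy sketch that encodes each premise as a subset relation and walks the chain:

```python
# Premises as subset relations: dogs ⊆ mammals ⊆ warm-blooded things.
is_subset_of = {"dog": "mammal", "mammal": "warm-blooded"}
membership = {"Rex": "dog"}

def has_property(individual: str, target: str) -> bool:
    """Follow the chain of subset relations from the individual's
    category until we reach the target property or run out of links."""
    category = membership[individual]
    while category is not None:
        if category == target:
            return True
        category = is_subset_of.get(category)
    return False

print(has_property("Rex", "warm-blooded"))  # True
```

Each iteration of the loop corresponds to one step in the model's reasoning chain, which is exactly why making the steps explicit prevents the model from skipping a link.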


⚖️ When Chain of Thought Helps — and When It's a Waste of Tokens

CoT is powerful but not universally beneficial. Using it blindly adds cost and latency with no accuracy gain — or sometimes actively hurts.

CoT significantly helps when:

  • Multi-step arithmetic — more than two calculation steps. The model uses the intermediate outputs as working memory.
  • Logical deduction / syllogisms — chains of "if A then B" reasoning.
  • Commonsense reasoning with implicit steps — e.g., "Is a whale heavier than a car?" requires recalling facts and comparing them.
  • Code debugging — reasoning through what each line does before identifying the bug.
  • Planning / scheduling — ordering steps with dependencies.

CoT has little effect or actively hurts when:

  • Simple factual recall — "What is the capital of France?" Adding a reasoning chain just adds noise and can confuse the model into second-guessing a correct direct answer.
  • Very small models (under ~7B parameters) — smaller models were not trained on enough reasoning examples to use CoT effectively. Their reasoning chains are often incoherent or circular, and the final answer is no better — sometimes worse — than a direct prompt.
  • Single-hop questions — "2 + 2 = ?" doesn't benefit from five reasoning steps.
  • Latency-critical applications — if you need sub-100ms responses, CoT adds tokens that cost time. For simple classification or routing tasks, skip it.
  • Tasks with a fixed, well-defined answer format — if your model is reliably 99% accurate on a task without CoT, adding it gains nothing and costs tokens.

A useful rule of thumb: If a human solving this problem on paper would write intermediate steps, CoT will help. If a human could answer by reflex, it won't.


🏗️ Advanced CoT Variants: Tree of Thought, ReAct, and Program of Thought

Once you've mastered basic CoT, three advanced variants are worth knowing. Each one extends the core idea in a different direction.

Tree of Thought (ToT)

Tree of Thought (Yao et al., 2023) replaces the single linear reasoning chain with a search tree. At each reasoning step, the model proposes multiple candidate next-thoughts, evaluates them (by scoring plausibility or by a second LLM call), and explores the most promising branches. It can backtrack when a branch leads to a dead end.

ToT shines on tasks that require planning, multi-path exploration, or creative problem-solving — like the game of 24 (making 24 from four numbers using arithmetic operators) or writing a coherent multi-chapter story outline. It's overkill for most production use cases due to its very high token cost.

ReAct: Reasoning + Acting

ReAct (Yao et al., 2022) interleaves CoT reasoning steps with tool calls — web search, calculator, database queries, code execution. The pattern is:

Thought: I need to find the current population of Japan.
Action: search("Japan population 2024")
Observation: Japan population is approximately 123 million.
Thought: Now I can answer.
Answer: 123 million.

ReAct is the foundation of most modern AI agent frameworks (LangChain, LangGraph). It transforms CoT from a pure text-generation technique into a way for models to interact with the real world to gather information before reasoning.
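To make the loop concrete, here is a minimal ReAct-style skeleton with a scripted "model" and a stubbed search tool. Both are hypothetical stand-ins: a real agent would call an LLM for the next step and a live search API for the observation.

```python
def stub_search(query: str) -> str:
    # Hypothetical tool; a real agent would hit a search API here.
    return "Japan population is approximately 123 million."

TOOLS = {"search": stub_search}

def scripted_model(transcript: str) -> str:
    # Stand-in for an LLM call: search first, then answer once
    # an Observation is present in the transcript.
    if "Observation:" not in transcript:
        return 'Action: search("Japan population 2024")'
    return "Answer: 123 million."

def react_loop(question: str, max_turns: int = 3) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        step = scripted_model(transcript)
        transcript += "\n" + step
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        if step.startswith("Action:"):
            tool, arg = step.split("(", 1)
            tool = tool.removeprefix("Action:").strip()
            arg = arg.rstrip(")").strip('"')
            transcript += f"\nObservation: {TOOLS[tool](arg)}"
    return "no answer"

print(react_loop("What is the population of Japan?"))  # 123 million.
```

The transcript grows with each Thought/Action/Observation turn, so the model always reasons over everything it has seen so far — the same "prior output as working memory" mechanism as plain CoT, extended with tool results.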

Program of Thought (PoT)

Program of Thought (Chen et al., 2022) addresses a specific failure mode of natural language CoT: arithmetic errors. Instead of reasoning in natural language, PoT instructs the model to generate a Python program that solves the problem, then executes it.

# PoT output for: "What is 17.5% of 3248?"
tax_base = 3248
tax_rate = 0.175
result = tax_base * tax_rate
print(result)  # 568.4

PoT is remarkably effective for financial calculations, scientific formulas, and any task where arithmetic precision matters. The model's job is just to write correct code — the Python interpreter handles the arithmetic, eliminating a whole class of numeric errors.
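A minimal execution harness for PoT output might look like the sketch below. Note that exec on untrusted model output is not safe by itself; a real deployment would run the generated code in a subprocess, container, or other sandbox.

```python
import io
import contextlib

def run_pot_program(code: str) -> str:
    """Execute model-generated PoT code and capture its stdout.
    WARNING: bare exec is a sketch, not a sandbox."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # empty globals; Python injects __builtins__
    return buffer.getvalue().strip()

# The PoT snippet from above, as the model would emit it.
pot_code = """
tax_base = 3248
tax_rate = 0.175
result = tax_base * tax_rate
print(result)
"""
print(run_pot_program(pot_code))
```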


🌍 Where CoT Is Already Running in Production

Chain of Thought is not a research curiosity — it's deployed at scale across real systems today.

Google PaLM and Gemini math benchmarks: Google's PaLM paper demonstrated that CoT prompting unlocked dramatic accuracy improvements on the GSM8K grade-school math benchmark — jumping from roughly 17% (standard prompting) to 58% accuracy on 540B parameter PaLM. Self-consistency CoT pushed this further to 74%. These benchmarks drove Google's decision to embed CoT natively into Gemini's reasoning pipeline.

ChatGPT's Code Interpreter / Advanced Data Analysis: When you upload a CSV and ask ChatGPT to analyse it, the model uses a form of Program of Thought — it generates Python code, executes it in a sandbox, observes the output, and iterates. The "reasoning" steps are the code cells, not natural language.

Customer service diagnostic reasoning: Enterprise support bots at companies like Salesforce and Zendesk use structured CoT to diagnose customer issues: first understand the symptom, then reason through possible causes in priority order, then recommend a fix. This structured approach dramatically reduces hallucinated diagnoses compared to direct-answer prompts.

Medical triage systems: Researchers at Stanford and Mass General have prototyped clinical decision-support tools where an LLM reasons through patient symptoms step by step before suggesting a differential diagnosis. CoT makes the reasoning auditable — clinicians can check each step, not just the final recommendation. This auditability is a safety requirement in healthcare contexts.


⚖️ The Real Costs of Chain of Thought: Tokens, Time, and Convincing-But-Wrong Answers

CoT is not free. Understanding its cost profile helps you deploy it responsibly.

Token cost and latency: A zero-shot CoT response for a multi-step problem might generate 150–400 tokens of reasoning before the final answer, versus 5–20 tokens for a direct answer. At GPT-4o pricing, that's a 10–20x cost increase per query for the reasoning portion. For high-volume pipelines, this adds up fast. Self-consistency CoT (5 samples) multiplies this by another 5x.
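Those ranges translate into rough multipliers you can sanity-check in a couple of lines (token counts only, since dollar prices change too often to hard-code):

```python
# Back-of-envelope multipliers from the ranges quoted above.
direct_tokens = 15   # midpoint of the 5-20 range for a direct answer
cot_tokens = 275     # midpoint of the 150-400 range for a CoT answer
samples = 5          # self-consistency runs

cot_multiplier = cot_tokens / direct_tokens
sc_multiplier = cot_multiplier * samples
print(f"CoT: ~{cot_multiplier:.0f}x tokens, self-consistency: ~{sc_multiplier:.0f}x")
# CoT: ~18x tokens, self-consistency: ~92x
```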

Reasoning chains that look right but are wrong: This is the most dangerous failure mode. CoT produces confident-sounding, step-by-step reasoning that leads to a wrong answer. Because each step looks plausible, it's harder to spot the error than a simple wrong answer would be. Always implement answer verification for high-stakes outputs — either via self-consistency voting or an independent check.

Self-consistency cost: Running 5–10 independent reasoning chains per query (self-consistency) improves accuracy noticeably but can cost 5–15x more than a single chain. Reserve it for tasks where accuracy matters more than cost — final answers in a legal or financial context, not routine classification.

Model size dependency: CoT works well only on sufficiently large models. On models smaller than ~7B parameters, the generated reasoning chains are often circular, repetitive, or internally contradictory. For small-model deployments, fine-tuning with reasoning traces is a better path than hoping the model can CoT in zero-shot.


🧭 Choosing the Right CoT Variant for Your Task and Budget

Use this table as a quick-reference decision guide before writing your prompt.

| Situation | Recommended Approach | Why |
|---|---|---|
| Arithmetic word problems, budget is limited | Zero-Shot CoT ("Let's think step by step") | Simple, cheap, very effective for math |
| Consistent task type with specific output format | Few-Shot CoT (2–4 examples) | Examples constrain the reasoning format |
| High-stakes single-answer query, cost not a concern | Self-Consistency CoT (5 samples) | Majority vote significantly reduces errors |
| Complex planning or creative problem solving | Tree of Thought | Needs multi-path exploration and backtracking |
| Need perfect arithmetic precision | Program of Thought (generate + run Python) | Offloads arithmetic to an interpreter |
| Simple factual lookup or single-hop question | No CoT (direct prompt) | CoT adds noise, not signal |
| Small model deployment (under 7B params) | Fine-tune with reasoning traces, not zero-shot CoT | Zero-shot CoT is ineffective at small scale |
| Agent using external tools (search, DB, API) | ReAct (Thought / Action / Observation loop) | CoT + tool calls = grounded reasoning |

🛠️ LangChain: Implementing CoT Chains in Practice

LangChain is a Python framework for building LLM-powered applications. It provides first-class support for CoT via prompt templates, chain composition, and the LangChain Expression Language (LCEL). Here is how to implement a reusable CoT chain using LangChain's current LCEL interface.

Basic CoT chain with LangChain prompt templates:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Zero-Shot CoT prompt template
cot_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a careful reasoning assistant. Always think step by step "
        "before giving a final answer. Label each step clearly.",
    ),
    (
        "human",
        "{question}\n\nLet's think step by step.",
    ),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
parser = StrOutputParser()

# LCEL chain: prompt | llm | parser
cot_chain = cot_prompt | llm | parser

# Run it
result = cot_chain.invoke({
    "question": (
        "A bookshelf has 5 shelves. Each shelf holds 30 books. "
        "The owner removes 12 books from each shelf to donate. "
        "How many books remain?"
    )
})
print(result)

Two-step reasoning with composed chains (generate the reasoning, then extract the answer):

from langchain_core.prompts import ChatPromptTemplate

# Step 1: Generate the reasoning chain
reasoning_prompt = ChatPromptTemplate.from_messages([
    ("system", "Solve the problem step by step. Show all your work."),
    ("human", "{question}\n\nStep-by-step reasoning:"),
])

# Step 2: Extract the final answer from the reasoning
extraction_prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract only the final numeric or one-word answer from the reasoning below. "
               "Return just the answer with no explanation."),
    ("human", "Reasoning:\n{reasoning}\n\nFinal answer only:"),
])

# Compose the chain using LCEL
reasoning_chain = reasoning_prompt | llm | parser
extraction_chain = extraction_prompt | llm | parser

def two_step_cot(question: str) -> dict:
    reasoning = reasoning_chain.invoke({"question": question})
    answer = extraction_chain.invoke({"reasoning": reasoning})
    return {"reasoning": reasoning, "answer": answer.strip()}

# Example
output = two_step_cot(
    "A project takes 8 weeks with 4 engineers. "
    "How many weeks would it take with 6 engineers? (Assume linear scaling)"
)
print("Reasoning:\n", output["reasoning"])
print("\nFinal Answer:", output["answer"])

The two-step pattern is particularly useful in production: the first chain generates a full reasoning trace for logging and auditability, and the second chain extracts a clean, parseable final answer. This gives you the benefits of CoT (better accuracy) plus a structured output you can actually use in downstream code.

For a full deep-dive on LangChain's LCEL and agent patterns, see the LangChain LCEL guide in this series.


📚 Lessons Learned from Using CoT in Real Projects

These are the mistakes that show up most often when developers first deploy Chain of Thought prompting — and how to avoid them.

1. Using CoT for simple tasks wastes tokens and confuses the model. A common beginner mistake is adding "Let's think step by step" to every single prompt. For a simple classification task ("Is this email spam?"), CoT makes the model overthink, sometimes generating a long chain of reasoning that contradicts the obvious answer. Apply CoT only when the task genuinely requires multiple reasoning steps.

2. Too-short reasoning chains fail on hard problems. If you constrain the model's output length or truncate the CoT chain before it's finished, the final answer degrades sharply. Longer chains = better results on harder problems. Set your max_tokens generously when using CoT for complex tasks.

3. Mixing reasoning formats in few-shot examples confuses the model. If one example uses "Step 1: ... Step 2: ..." and another uses numbered lists and a third uses bullet points, the model receives mixed signals about what format to follow. Keep the reasoning format completely consistent across all few-shot examples — same labelling style, same "Answer: X" final line.

4. Never trust a reasoning chain that looks convincing without verification. A well-formatted CoT response that says "Step 4: 45 × 3 = 125" is wrong (it should be 135), but it reads perfectly fluently. For production systems where accuracy matters, always implement a verification pass — either self-consistency voting or a simple Python execution step for arithmetic tasks.
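For arithmetic chains, that verification pass can be as simple as re-executing every "a op b = c" claim the model makes. A sketch that catches exactly the "45 × 3 = 125" class of error:

```python
import re

def verify_arithmetic_steps(chain: str) -> list[str]:
    """Re-check every 'a op b = c' claim in a reasoning chain and
    return the claims that don't hold. Only handles the simple
    binary operations typical of CoT math chains."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "×": lambda a, b: a * b, "÷": lambda a, b: a / b}
    pattern = r"(\d+(?:\.\d+)?)\s*([+\-×÷])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
    errors = []
    for a, op, b, claimed in re.findall(pattern, chain):
        actual = ops[op](float(a), float(b))
        if abs(actual - float(claimed)) > 1e-9:
            errors.append(f"{a} {op} {b} = {claimed} (actually {actual})")
    return errors

chain = "Step 3: 45 × 3 = 125. Step 4: 125 + 10 = 135."
print(verify_arithmetic_steps(chain))  # flags the first claim only
```

A checker like this costs no extra API calls and turns the most dangerous CoT failure mode — fluent, confident, wrong arithmetic — into a machine-detectable one.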

5. CoT on models smaller than 7B is usually counterproductive. Small models don't have enough in-weights reasoning capability to generate reliable intermediate steps. Their CoT outputs often hallucinate intermediate values, producing wrong answers that look more confident than a simple direct wrong answer would. Test without CoT first on small models.

6. For API cost management, cache your few-shot examples. The few-shot examples in your CoT prompt are static — they never change between requests. Use OpenAI's prompt caching (or equivalent) to avoid paying for those tokens on every call. With 4 examples of ~200 tokens each, caching saves 800 tokens per request at scale.


📌 TLDR & Key Takeaways

  • Chain of Thought (CoT) is a prompting technique that instructs an LLM to generate intermediate reasoning steps before answering. It works because each step constrains the probability distribution for subsequent tokens, acting as working memory.
  • Zero-Shot CoT ("Let's think step by step") requires no examples and works on models ≥ 7B parameters. It's the cheapest, fastest way to add CoT to any prompt.
  • Few-Shot CoT provides 2–8 worked examples to teach the model the exact reasoning format. Best when your task has a consistent structure.
  • Self-Consistency CoT samples multiple reasoning paths and majority-votes the final answer. Significantly more accurate, but 5–10x more expensive.
  • CoT helps on multi-step math, logic, planning, and debugging. It does not help on simple factual recall, single-hop questions, or tiny models.
  • Advanced variants — Tree of Thought, ReAct, and Program of Thought — extend CoT for exploration, tool use, and precise arithmetic respectively.
  • Always verify CoT outputs for high-stakes tasks: a well-formatted reasoning chain can still reach a wrong conclusion.
  • The one-sentence takeaway: Add "Let's think step by step" to any multi-step problem prompt — it's the cheapest accuracy improvement available in the LLM toolkit.

📝 Practice Quiz

  1. What is the core mechanism that makes Chain of Thought prompting more accurate than standard prompting on multi-step problems?

    • A) It increases the model's parameter count at inference time
    • B) Each intermediate step constrains the token distribution for subsequent steps, acting as working memory
    • C) It retrieves relevant examples from a vector database
    • D) It trains the model on new data during the prompt

    Correct Answer: B
  2. You are building a customer support bot that needs to classify incoming tickets into one of five categories. Response latency must be under 200ms. Should you use Chain of Thought prompting?

    • A) Yes — CoT always improves accuracy, so it's always worth the cost
    • B) No — classification is a single-hop task; CoT adds tokens and latency with no accuracy benefit
    • C) Yes — but only with self-consistency sampling
    • D) No — CoT only works on math problems

    Correct Answer: B
  3. What distinguishes Self-Consistency CoT from standard (single-path) CoT?

    • A) It uses a tree structure to explore multiple reasoning branches simultaneously
    • B) It generates multiple independent reasoning chains and selects the answer by majority vote
    • C) It automatically generates few-shot examples from a database
    • D) It generates Python code instead of natural language reasoning

    Correct Answer: B
  4. A developer applies Zero-Shot CoT to a 3B-parameter model and finds the accuracy is worse than without CoT. What is the most likely explanation?

    • A) Zero-Shot CoT only works with the exact phrase "think step by step"
    • B) The model is too small to generate reliable intermediate reasoning steps; its CoT chains are incoherent
    • C) The model's context window is too short for CoT
    • D) Zero-Shot CoT requires internet access

    Correct Answer: B
  5. You need perfect arithmetic precision for a financial calculation pipeline. Which CoT variant is best suited for this requirement, and why?

    • A) Few-Shot CoT — because examples improve accuracy
    • B) Tree of Thought — because it explores multiple calculation paths
    • C) Program of Thought — because it offloads arithmetic to a Python interpreter, eliminating numeric errors
    • D) Self-Consistency CoT — because majority voting catches arithmetic mistakes

    Correct Answer: C

Written by Abstract Algorithms (@abstractalgorithms)