
LangChain Memory: Conversation History and Summarization

Keep context across turns: ConversationBufferMemory, ConversationSummaryMemory, and the LCEL memory pattern — before LangGraph checkpointing.

Abstract Algorithms · 20 min read

TLDR: LLMs are stateless — every API call starts fresh. LangChain memory classes (Buffer, Window, Summary, SummaryBuffer) explicitly inject history into each call, and RunnableWithMessageHistory is the modern LCEL replacement for the legacy ConversationChain.


📖 The Amnesia Tax: Every Conversation Starts from Zero

You have built a customer support chatbot. The first message goes perfectly:

User: Hi, my account number is ACC-8837 and I cannot log in.
Bot: Got it! Account ACC-8837 is locked after three failed attempts. I have sent a password reset link to your registered email.

The user follows the link, it does not work, and they come back:

User: The reset link is broken. Can you send another one?
Bot: I would be happy to help! Could you please provide your account number?

Infuriating. The bot has not forgotten — it never knew. Every call to the LLM API is completely stateless. The model receives a list of messages and returns a response, and then it retains absolutely nothing. There is no persistent session on the server side. No hidden context store. No magic. Each request starts with a blank slate.

This is the problem LangChain memory classes solve. They act as an explicit history manager: reading past exchanges before each LLM call and writing the new exchange back after the response. The user experiences a bot that remembers; under the hood you are just sending a progressively longer prompt.

Four strategies exist for managing that growing prompt — each making a different tradeoff between completeness, cost, and token usage. This guide walks through all four, introduces the modern LCEL memory pattern with RunnableWithMessageHistory, and shows a full multi-turn Code Review Assistant that retains context across five conversation turns.


🔍 What the LLM API Actually Receives on Every Call

Before choosing a memory strategy, it helps to see exactly what the API receives. Every call to llm.invoke(messages) sends a structured list of messages. Once the response is returned, the API discards everything. There is no session, no thread ID, no continuation token.

A completely stateless call looks like this:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke([HumanMessage(content="My account is ACC-8837.")])
# The model processes this, returns a response, and forgets everything.

To make the model appear stateful, the full conversation history must be included in the next call's messages list:

from langchain_core.messages import HumanMessage, AIMessage

messages = [
    HumanMessage(content="My account number is ACC-8837."),
    AIMessage(content="Got it. Account ACC-8837 is locked."),
    HumanMessage(content="Can you resend the reset link?"),
]
response = llm.invoke(messages)
# The model can now reference ACC-8837 because the history is in THIS call.

The context window is your memory. Everything the model needs to know must fit in the token budget for a single request. GPT-4o supports a 128,000-token context window, but every token has a cost — both in dollars and in latency. A 60-turn support session with verbose responses can consume 30,000 tokens or more. The four strategies below each manage that budget differently.
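To make the budgeting concrete, here is a minimal sketch of history token estimation. The `estimate_tokens` helper and the chars-divided-by-4 heuristic are assumptions for illustration; production code should use tiktoken for exact counts.

```python
# Hypothetical helper: rough token estimate for a message history.
# ~4 characters per token is a common rule of thumb for English text.
def estimate_tokens(messages: list[dict]) -> int:
    # +4 per message approximates the per-message formatting overhead
    return sum(len(m["content"]) // 4 + 4 for m in messages)

history = [
    {"role": "user", "content": "My account number is ACC-8837."},
    {"role": "assistant", "content": "Got it. Account ACC-8837 is locked."},
]
print(estimate_tokens(history))  # → 23
```

Running an estimate like this per turn is cheap, and it is enough to decide when a session is drifting toward the window limit.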


⚙️ Four Memory Strategies: From Full Buffer to Compressed Summary

ConversationBufferMemory — Append Everything

The simplest approach: store every message in a list and send the complete list on every call.

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)
chain = ConversationChain(llm=llm, memory=memory, verbose=False)

print(chain.predict(input="My account number is ACC-8837."))
# → "Got it! Account ACC-8837 is on record."

print(chain.predict(input="I cannot log in. What should I try?"))
# The chain includes the previous turn — the model knows about ACC-8837.

# Inspect what is stored
print(memory.chat_memory.messages)
# [HumanMessage(...), AIMessage(...), HumanMessage(...), AIMessage(...)]

The memory object implements two key methods on every turn: load_memory_variables({}) reads the history before the LLM call, and save_context({"input": ...}, {"output": ...}) writes the new exchange back after. The memory_key parameter (default "history") controls the prompt template variable that receives the history string.
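The read-before-write contract is easiest to see in a toy stand-in. The class below is a minimal reimplementation of the buffer pattern for illustration, not LangChain's actual code, but it mirrors the same two-method interface.

```python
# Toy stand-in for the two-method memory contract (not LangChain's class).
class ToyBufferMemory:
    def __init__(self):
        self.messages: list[tuple[str, str]] = []  # (role, content) pairs

    def load_memory_variables(self, inputs: dict) -> dict:
        # Read step: render stored history for injection into the prompt
        history = "\n".join(f"{role}: {text}" for role, text in self.messages)
        return {"history": history}

    def save_context(self, inputs: dict, outputs: dict) -> None:
        # Write step: append the new exchange after the LLM responds
        self.messages.append(("Human", inputs["input"]))
        self.messages.append(("AI", outputs["output"]))

memory = ToyBufferMemory()
memory.save_context({"input": "My account is ACC-8837."}, {"output": "Got it."})
print(memory.load_memory_variables({})["history"])
# → Human: My account is ACC-8837.
#   AI: Got it.
```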

When to use: Prototyping, short conversations under 20 turns, cases where exact replay of every word matters (legal bots, compliance assistants).

When to avoid: Any session of unpredictable length — token cost grows linearly with no ceiling.


ConversationBufferWindowMemory — Keep Only the Last K Turns

A sliding window that retains only the most recent k human/AI exchange pairs. Older turns are silently dropped.

from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=4, return_messages=True)
chain = ConversationChain(llm=llm, memory=memory)

With k=4, your token cost is bounded regardless of how long the session runs. The tradeoff: the model loses access to anything said more than four turns ago. If the user says "As I mentioned at the start, my account is ACC-8837," and that was five turns back, the model cannot retrieve it.
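The windowing rule itself is a simple slice. This sketch shows the logic in isolation (toy message strings, not LangChain objects): with ten messages and k=4, the first exchange pair is dropped.

```python
# Sketch of the sliding-window rule: keep only the last k exchange pairs.
def window(messages: list[str], k: int) -> list[str]:
    # Each exchange pair is (human, ai), i.e. 2 messages; keep the last 2*k
    return messages[-2 * k:]

msgs = ["H1", "A1", "H2", "A2", "H3", "A3", "H4", "A4", "H5", "A5"]
print(window(msgs, 4))
# → ['H2', 'A2', 'H3', 'A3', 'H4', 'A4', 'H5', 'A5']
```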


ConversationSummaryMemory — Compress Old Turns into a Running Narrative

Instead of discarding old turns, this class summarizes them using a second LLM call. The summary replaces the raw message list in the prompt.

from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm, return_messages=True)
chain = ConversationChain(llm=llm, memory=memory)

After turn three, the prompt might include: "The user identified their account as ACC-8837. They reported being unable to log in. The assistant sent a password reset link." This gives the model access to the semantics of the conversation at a fraction of the raw token cost. The tradeoff: summarization is lossy. Specific values — account numbers, quoted error messages, code snippets — can be paraphrased away during compression.
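The per-turn flow can be illustrated with a stub. Here `summarize` is a placeholder standing in for the extra LLM call that ConversationSummaryMemory makes after every exchange; a real implementation would send the previous summary plus the new lines to the model.

```python
# Toy illustration of the progressive-summary loop. summarize() is a stub
# for the compression LLM call the real class makes on every turn.
def summarize(previous_summary: str, new_lines: str) -> str:
    # Placeholder: a real implementation prompts an LLM to merge the two
    return (previous_summary + " " + new_lines).strip()

summary = ""
turns = ["User gave account ACC-8837.", "User cannot log in.", "Reset link sent."]
for turn in turns:
    summary = summarize(summary, turn)  # one extra LLM call per turn in reality
print(summary)
# → User gave account ACC-8837. User cannot log in. Reset link sent.
```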

Cost note: Every turn triggers an extra LLM call for summarization. This can effectively double API costs at high volume. Reserve SummaryMemory for low-to-medium volume bots where precision of old context matters more than throughput cost.


ConversationSummaryBufferMemory — The Production Hybrid

This is the recommended starting point for most production chatbots. It keeps recent turns verbatim (within a configurable token budget) and summarizes anything older into a running digest.

from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,  # Keep verbatim until this limit is crossed
    return_messages=True
)
chain = ConversationChain(llm=llm, memory=memory)

The max_token_limit parameter is the threshold that triggers summarization. Recent exchanges — which are statistically most likely to be referenced — remain fully intact. Older context is compressed. You get high-fidelity recent memory with bounded token growth, making it the right default for any chatbot that needs to handle sessions of unknown length.


📊 Memory Flow: From User Input to LLM Response

Every memory-backed chain follows a consistent read-before-write cycle. The memory object wraps each LLM call like a decorator: it injects history before the model sees the prompt, and it persists the exchange after the response arrives.

graph TD
    A[User sends a message] --> B[Chain receives input]
    B --> C[Memory.load_memory_variables]
    C --> D[Retrieve stored history or running summary]
    D --> E[Assemble full prompt: history + new message]
    E --> F[LLM API call with full context]
    F --> G[LLM generates response]
    G --> H[Memory.save_context]
    H --> I[Write new input + output to memory store]
    I --> J[Return response to user]
    J --> A

Figure: The memory read-before-write cycle on every conversation turn. The LLM itself remains stateless; the history management lives entirely in the LangChain chain layer above it.

The key architectural point: the LLM never "remembers" anything between calls. Memory is a client-side concern. You can swap memory backends — in-memory list, Redis, DynamoDB — without touching the model, the prompt template, or any other part of the chain.


🧠 Deep Dive: Inside LangChain's Memory Architecture

Under the Hood: How Memory Classes Store and Retrieve State

All four LangChain memory classes share the same two-method interface: load_memory_variables(inputs) and save_context(inputs, outputs). The ConversationChain (and RunnableWithMessageHistory) calls them in sequence around every LLM invocation.

Internally, every class wraps a ChatMessageHistory object — a simple ordered list of BaseMessage objects (HumanMessage, AIMessage, SystemMessage). The classes differ only in what they do before handing that list to the prompt:

  • BufferMemory returns the full list unchanged.
  • BufferWindowMemory slices the list to the last k pairs before returning.
  • SummaryMemory maintains a moving_summary_buffer string and prepends it as a SystemMessage. Each save_context call triggers a compression LLM call via the predict_new_summary method.
  • SummaryBufferMemory stores both a raw ChatMessageHistory and a moving_summary_buffer. When the raw buffer exceeds max_token_limit tokens, the oldest messages are pruned from the raw list and folded into the summary string via a dedicated compression prompt.
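The prune-and-fold step of SummaryBufferMemory can be sketched in a few lines. `count_tokens` and `fold_into_summary` here are simplified stand-ins: the real class counts tokens with the model's tokenizer and compresses via the LLM prompt shown below.

```python
# Toy sketch of SummaryBufferMemory's prune step (not the real implementation).
def count_tokens(messages: list[str]) -> int:
    return sum(len(m) // 4 for m in messages)  # crude chars/4 heuristic

def fold_into_summary(summary: str, pruned: list[str]) -> str:
    folded = "; ".join(pruned)  # stand-in for the LLM compression call
    return f"{summary} | {folded}" if summary else folded

def prune(messages: list[str], summary: str, max_token_limit: int):
    pruned = []
    while messages and count_tokens(messages) > max_token_limit:
        pruned.append(messages.pop(0))  # oldest messages are folded first
    if pruned:
        summary = fold_into_summary(summary, pruned)
    return messages, summary

msgs = ["old message one", "old message two", "recent message"]
msgs, summary = prune(msgs, "", max_token_limit=5)
print(msgs)     # → ['recent message']
print(summary)  # → old message one; old message two
```

Recent messages survive verbatim; only the overflow is compressed, which is exactly the fidelity tradeoff described above.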

The compression prompt used by the summary classes is:

Progressively summarize the lines of conversation provided, adding onto
the previous summary and returning a new summary.

SUMMARY:
{summary}

NEW LINES OF CONVERSATION:
{new_lines}

NEW SUMMARY:

This compression call is a separate LLM invocation — it uses the same model you configure on the chain unless you explicitly pass a different llm instance to the memory constructor. Passing a cheaper, faster model (e.g., gpt-4o-mini for compression while using gpt-4o for main responses) is a practical cost optimization.
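A configuration sketch of that cost optimization might look like this (model names are illustrative; the key point is that the `llm` passed to the memory constructor is used only for compression):

```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

main_llm = ChatOpenAI(model="gpt-4o", temperature=0)          # main responses
compression_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # summaries only

memory = ConversationSummaryBufferMemory(
    llm=compression_llm,   # the cheaper model handles compression calls
    max_token_limit=1000,
    return_messages=True,
)
chain = ConversationChain(llm=main_llm, memory=memory)
```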

Performance Analysis: Token Cost vs. Recall Quality

The four strategies form a clear spectrum across the dimensions that matter in production:

Strategy | Token growth | Recall fidelity | Extra LLM calls per turn | Best for
---|---|---|---|---
BufferMemory | Linear — unbounded | Perfect — every word | 0 | Prototypes, short sessions
BufferWindowMemory | Constant — capped at k pairs | Last k turns only | 0 | Long sessions, cost-sensitive bots
SummaryMemory | Near-constant | Semantic — facts may be paraphrased | 1 always | Low-volume, accuracy-critical bots
SummaryBufferMemory | Bounded by max_token_limit | Recent: perfect; old: semantic | 1 when threshold is crossed | Production chatbots, default choice

The latency bottleneck for SummaryMemory variants is token counting via tiktoken before deciding whether to trigger compression. For high-throughput applications (thousands of concurrent sessions), this CPU overhead is measurable and should be profiled. In most cases it is negligible compared to the LLM round-trip itself.


🌍 Real-World Application: A Multi-Turn Code Review Assistant

A Code Review Assistant is an ideal showcase for memory: the user submits code in the first turn and then asks follow-up questions that only make sense in the context of what was previously reviewed. Without memory, every follow-up requires the user to paste the code again.

The following five-turn session uses ConversationSummaryBufferMemory. The first three turns fit comfortably within the 800-token verbatim buffer; by turn four, older context is summarized, yet the model retains the critical facts.

from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=800)

template = """You are a senior code reviewer. Be concise and specific.

{history}
Human: {input}
Assistant:"""

prompt = PromptTemplate(input_variables=["history", "input"], template=template)
chain = ConversationChain(llm=llm, memory=memory, prompt=prompt)

# Turn 1 — user submits code for review
response = chain.predict(input="""
Review this Python function:
def fetch_user(user_id):
    conn = db.connect()
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")
    return result.fetchone()
""")
# → Critical issue: SQL injection vulnerability on the f-string query.
#   Use a parameterized query: conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))
#   Also: the connection is never closed — wrap in a context manager.

# Turn 2 — follow-up: ask for corrected code
response = chain.predict(input="Show me the corrected version with both fixes.")
# → def fetch_user(user_id):
#       with db.connect() as conn:
#           result = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))
#           return result.fetchone()

# Turn 3 — ask about a related pattern
response = chain.predict(input="Is SQLAlchemy connection pooling better here?")
# → Yes — use create_engine() with pool_size and pool_recycle for production.
#   The context manager approach you have now is fine for scripts.

# Turn 4 — reference the corrected code from turn 2
response = chain.predict(input="Can the function you showed handle None as user_id?")
# → No. Add a guard: if user_id is None: raise ValueError("user_id must not be None")
#   The parameterized query would pass NULL to the database, returning unexpected results.

# Turn 5 — synthesize all findings
response = chain.predict(input="Summarize every issue found and the final recommended code.")
# → Three issues addressed:
#   1. SQL injection — fixed with parameterized query
#   2. Unclosed connection — fixed with context manager
#   3. None input — add a ValueError guard before the query
print(response)

Turn 5 works because the verbatim buffer retained the key exchange from turns 1 and 2 (still under 800 tokens). The model can enumerate all three issues by reading the buffer directly — no semantic reconstruction required. If the session had been longer, turns 1 and 2 would have been summarized into: "Reviewed fetch_user. Issues: SQL injection, unclosed connection. Fixed with parameterized query and context manager." That summary still contains the three-issue count, so turn 5 would remain correct.


🛠️ Community Memory Backends: Redis, MongoDB, and DynamoDB

The in-memory ChatMessageHistory is appropriate for single-process development and testing. In production, where multiple workers serve the same user concurrently or sessions must survive process restarts, you need a persistent backend.

The langchain-community package ships ready-made history implementations for the most common stores. They all implement BaseChatMessageHistory and can be dropped in as the chat_memory argument of any LangChain memory class.

# pip install langchain-community redis
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

history = RedisChatMessageHistory(
    session_id="user-ACC-8837",
    url="redis://localhost:6379",
    ttl=3600,  # Expire inactive sessions after 1 hour
)
memory = ConversationSummaryBufferMemory(
    llm=llm,
    chat_memory=history,
    max_token_limit=1000,
    return_messages=True,
)
# MongoDB backend
from langchain_community.chat_message_histories import MongoDBChatMessageHistory

history = MongoDBChatMessageHistory(
    session_id="user-ACC-8837",
    connection_string="mongodb://localhost:27017",
    database_name="chat_sessions",
    collection_name="message_history",
)
# DynamoDB backend
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory

history = DynamoDBChatMessageHistory(
    table_name="LangChainSessions",
    session_id="user-ACC-8837",
)

The session_id is your conversation key — tie it to your authentication system. A typical pattern is f"user-{auth_user_id}-{session_uuid}" so that a returning user can optionally resume a previous session, or start fresh with a new UUID.


⚖️ Trade-offs & Failure Modes in Conversational Memory

The most expensive lesson teams learn is deploying ConversationBufferMemory to production and discovering that power users with 150-turn sessions start hitting context window limits at 2 AM. The following table maps the failure modes to their causes and mitigations:

Failure mode | Root cause | Mitigation
---|---|---
Context window overflow | BufferMemory grows unbounded in long sessions | Switch to BufferWindowMemory or SummaryBufferMemory with max_token_limit
Critical context forgotten | WindowMemory drops turns older than k | Increase k, or move to SummaryBufferMemory to preserve semantic context
Summarization fact drift | SummaryMemory paraphrases away specific values (account numbers, code snippets) | Use SummaryBufferMemory with a higher verbatim token limit
Summarization latency spike | Every turn triggers an extra LLM round-trip | Only use summary classes below ~1000 req/min; use window memory for high throughput
Stale context after correction | User corrects a fact; the old summary is already written | Design prompts to explicitly accept overrides: "My account number is now X, not Y"

The most impactful mitigation: instrument your memory usage from day one. Log len(memory.chat_memory.messages) and the token count per session. Set an alert when a session exceeds 70% of your model's context window. A single SummaryBufferMemory with max_token_limit at 60–70% of the available window gives you a safe operating envelope with no engineering surprises.
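The alerting rule is simple enough to sketch. The threshold, window size, and function name below are illustrative assumptions, not a LangChain API:

```python
# Illustrative budget check: warn when a session's history approaches
# the model's context window (128k tokens assumed here for GPT-4o).
CONTEXT_WINDOW = 128_000
ALERT_FRACTION = 0.7

def check_session_budget(message_token_count: int) -> str:
    used = message_token_count / CONTEXT_WINDOW
    if used >= ALERT_FRACTION:
        return f"ALERT: session at {used:.0%} of context window"
    return f"ok: {used:.0%}"

print(check_session_budget(30_000))   # → ok: 23%
print(check_session_budget(100_000))  # → ALERT: session at 78% of context window
```

Wire the ALERT branch to your monitoring system instead of returning a string; the point is to trip the alarm well before the hard limit.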


🧭 Decision Guide: Choosing the Right Memory Strategy

Situation | Recommendation
---|---
Use when | Building a production chatbot with sessions of unknown length → start with ConversationSummaryBufferMemory(max_token_limit=1000) and tune the limit to your model's context window capacity
Avoid when | Conversations are guaranteed short (under 10 turns, under 2000 tokens total) — ConversationBufferMemory is simpler and has no downsides at that scale
Better alternative | When you need multi-agent coordination, tool call history, or the ability to resume a complex multi-step workflow mid-execution → LangGraph checkpointing handles these cases more cleanly than any LangChain memory class
Edge cases | If the user completely changes topic mid-session (switching from billing support to technical support), a running summary can carry stale context into the new topic — consider resetting memory explicitly on topic-change signals

🧪 The Modern LCEL Memory Pattern: RunnableWithMessageHistory

ConversationChain is the legacy API. For any new project, use RunnableWithMessageHistory — LangChain's LCEL-native memory wrapper. It is composable with the | pipe operator, supports async streaming natively, and cleanly separates session management (your responsibility) from chain logic (LangChain's responsibility).

from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful coding assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm  # Clean LCEL pipe — no ConversationChain wrapper

# In-memory store: swap for RedisChatMessageHistory in production
store: dict[str, BaseChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# Pass session identity via config on every call
config = {"configurable": {"session_id": "user-ACC-8837"}}

r1 = chain_with_history.invoke({"input": "My account is ACC-8837."}, config=config)
r2 = chain_with_history.invoke({"input": "What account did I just mention?"}, config=config)
print(r2.content)
# → "You mentioned account ACC-8837."

To persist sessions across restarts, replace ChatMessageHistory() with RedisChatMessageHistory(session_id=session_id, url=REDIS_URL) inside get_session_history. No other changes are needed — the chain, the prompt, and the invocation logic remain identical.
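The swapped-in function would look roughly like this (REDIS_URL is assumed to come from your configuration; a Redis instance must be reachable at runtime):

```python
from langchain_community.chat_message_histories import RedisChatMessageHistory

REDIS_URL = "redis://localhost:6379"  # assumed config value

def get_session_history(session_id: str) -> RedisChatMessageHistory:
    # Redis keys the history by session_id; no in-process dict is needed,
    # so any worker can serve any session
    return RedisChatMessageHistory(session_id=session_id, url=REDIS_URL)
```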

Comparison with LangGraph checkpointing: LangGraph takes a fundamentally different approach. Instead of a memory object injected into a chain, the full agent state — including every message, every tool call result, and every intermediate graph node output — is persisted as a checkpoint tied to a thread_id. This is more powerful, but it requires adopting the graph abstraction. For a simple conversational chatbot without tools or branching logic, RunnableWithMessageHistory is the right tool. For an agent that calls external APIs, branches based on results, and needs to resume mid-workflow after a timeout, graduate to LangGraph checkpointing (see Related Posts below).


📚 Lessons Learned: What Production Chatbots Teach About Memory

Set a token budget before your first deploy, not after your first incident. ConversationBufferMemory is tempting because it requires zero configuration. Every team that skips the token limit ends up setting one after an outage. Pick SummaryBufferMemory from the start and you never have this conversation.

Session IDs are your responsibility. LangChain does not manage session lifecycle. You must generate, store, and expire session IDs. Tie them to your authentication layer. A user who logs out should not automatically resume a previous session unless you explicitly build that resumption flow.

The summary is lossy — design around that. SummaryMemory does not preserve quoted strings, exact numbers, or structured data verbatim. If precision matters (account numbers, order IDs, code snippets), those values need to stay in the verbatim buffer or be stored separately in a structured store alongside the memory.

Test with long conversations, not just smoke tests. Add automated tests that simulate 50-turn and 100-turn sessions. Assert that critical facts mentioned in turn 1 are still retrievable in turn 50. This is the single most commonly skipped quality gate in chatbot engineering, and the most reliably painful omission to discover in production.
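To show why such tests matter, here is a toy harness (plain strings, not LangChain objects) simulating a 50-turn session under window pruning. With k=4, a fact from turn 1 is provably gone by turn 50; the same assertion against your real memory configuration is the regression test to write.

```python
# Toy 50-turn session under sliding-window pruning (names illustrative).
def run_session(memory_limit_pairs: int, n_turns: int) -> list[str]:
    messages: list[str] = []
    messages.append("Human: my account is ACC-8837")  # the turn-1 fact
    messages.append("AI: noted")
    for i in range(2, n_turns + 1):
        messages.append(f"Human: turn {i}")
        messages.append(f"AI: reply {i}")
        messages = messages[-2 * memory_limit_pairs:]  # window pruning
    return messages

final = run_session(memory_limit_pairs=4, n_turns=50)
print(any("ACC-8837" in m for m in final))  # → False: the fact was pruned away
```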

Prefer RunnableWithMessageHistory for all new code. ConversationChain is a legacy API that will eventually be deprecated. The LCEL pattern composes with any runnable, supports streaming with .stream(), and is async-ready with .ainvoke(). There is no reason to use ConversationChain in a codebase started after LangChain 0.2.


📌 TLDR: Summary & Key Takeaways

  • LLMs are stateless by design. Every API call receives only the messages you send. Conversation history must be assembled and passed explicitly on every turn.
  • ConversationBufferMemory sends the full history — perfect recall, unbounded token cost.
  • ConversationBufferWindowMemory caps cost by keeping only the last k turns — simplest budget control, but early context is permanently lost.
  • ConversationSummaryMemory compresses old turns into a running narrative — semantic recall at near-constant token cost, but lossy on specific values.
  • ConversationSummaryBufferMemory is the production default — recent turns verbatim, older turns summarized, bounded by max_token_limit.
  • RunnableWithMessageHistory is the modern LCEL replacement for ConversationChain — composable, async-native, and backend-agnostic.
  • Community backends (Redis, MongoDB, DynamoDB) all implement the same interface — swap them with a single constructor change, zero chain logic changes.
  • For agents that need tool call state, graph branching, or cross-session persistence, LangGraph checkpointing is the next step up.

One-liner to remember: LangChain memory is prompt management — every strategy is a different answer to the same question: which messages go into this call?


📝 Practice Quiz

  1. A stateless LLM chatbot forgets the user's account number after the first message because:

    • A) The LLM deletes messages from its context after processing them
    • B) Each API call includes only the messages you pass — history is not stored server-side
    • C) LLM APIs enforce a one-message-per-session rule by default
    • D) The account number exceeds the context window token limit

    Correct Answer: B
  2. A production support bot runs sessions averaging 80 turns and 15,000 tokens of history. The team needs to prevent context window errors while preserving as much semantic fidelity as possible. Which memory strategy is most appropriate?

    • A) ConversationBufferMemory with no token limit
    • B) ConversationBufferWindowMemory with k=3
    • C) ConversationSummaryBufferMemory with max_token_limit=3000
    • D) No memory — send only the current message each turn

    Correct Answer: C
  3. In RunnableWithMessageHistory, what is the role of the get_session_history function?

    • A) It summarizes past messages into a compressed string for the prompt
    • B) It returns a BaseChatMessageHistory instance keyed to a session ID, enabling per-user storage
    • C) It counts tokens in the current conversation and triggers summarization when needed
    • D) It configures the LLM temperature and model parameters for a given session

    Correct Answer: B
  4. (Open-ended challenge) A user says: "My name is Alice, account ACC-8837, and I have called three times about this billing error." You are using ConversationSummaryMemory. Twenty turns later, the agent needs Alice's account number to process a refund. Describe two concrete failure modes that could prevent the agent from retrieving ACC-8837, and propose one architectural pattern that eliminates both.


Written by Abstract Algorithms (@abstractalgorithms)