
LangChain Memory: Conversation History and Summarization

Keep context across turns: ConversationBufferMemory, ConversationSummaryMemory, and the LCEL memory pattern — before LangGraph checkpointing.

Abstract Algorithms · 20 min read

TLDR: LLMs are stateless — every API call starts fresh. LangChain memory classes (Buffer, Window, Summary, SummaryBuffer) explicitly inject history into each call, and RunnableWithMessageHistory is the modern LCEL replacement for the legacy ConversationChain.


📖 The Amnesia Tax: Every Conversation Starts from Zero

You have built a customer support chatbot. The first message goes perfectly:

User: Hi, my account number is ACC-8837 and I cannot log in.
Bot: Got it! Account ACC-8837 is locked after three failed attempts. I have sent a password reset link to your registered email.

The user follows the link, it does not work, and they come back:

User: The reset link is broken. Can you send another one?
Bot: I would be happy to help! Could you please provide your account number?

Infuriating. The bot has not forgotten — it never knew. Every call to the LLM API is completely stateless. The model receives a list of messages and returns a response, and then it retains absolutely nothing. There is no persistent session on the server side. No hidden context store. No magic. Each request starts with a blank slate.

This is the problem LangChain memory classes solve. They act as an explicit history manager: reading past exchanges before each LLM call and writing the new exchange back after the response. The user experiences a bot that remembers; under the hood you are just sending a progressively longer prompt.

Four strategies exist for managing that growing prompt — each making a different tradeoff between completeness, cost, and token usage. This guide walks through all four, introduces the modern LCEL memory pattern with RunnableWithMessageHistory, and shows a full multi-turn Code Review Assistant that retains context across five conversation turns.


🔍 What the LLM API Actually Receives on Every Call

Before choosing a memory strategy, it helps to see exactly what the API receives. Every call to llm.invoke(messages) sends a structured list of messages. Once the response is returned, the API discards everything. There is no session, no thread ID, no continuation token.

A completely stateless call looks like this:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke([HumanMessage(content="My account is ACC-8837.")])
# The model processes this, returns a response, and forgets everything.

To make the model appear stateful, the full conversation history must be included in the next call's messages list:

from langchain_core.messages import HumanMessage, AIMessage

messages = [
    HumanMessage(content="My account number is ACC-8837."),
    AIMessage(content="Got it. Account ACC-8837 is locked."),
    HumanMessage(content="Can you resend the reset link?"),
]
response = llm.invoke(messages)
# The model can now reference ACC-8837 because the history is in THIS call.

The context window is your memory. Everything the model needs to know must fit in the token budget for a single request. GPT-4o supports a 128,000-token context window, but every token has a cost — both in dollars and in latency. A 60-turn support session with verbose responses can consume 30,000 tokens or more. The four strategies below each manage that budget differently.
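To make the budgeting concrete, here is a minimal sketch of history token estimation. The `estimate_tokens` helper and the chars-divided-by-4 heuristic are assumptions for illustration; production code should use tiktoken for exact counts.

```python
# Hypothetical helper: rough token estimate for a message history.
# ~4 characters per token is a common rule of thumb for English text.
def estimate_tokens(messages: list[dict]) -> int:
    # +4 per message approximates the per-message formatting overhead
    return sum(len(m["content"]) // 4 + 4 for m in messages)

history = [
    {"role": "user", "content": "My account number is ACC-8837."},
    {"role": "assistant", "content": "Got it. Account ACC-8837 is locked."},
]
print(estimate_tokens(history))  # → 23
```

Running an estimate like this per turn is cheap, and it is enough to decide when a session is drifting toward the window limit.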


⚙️ Four Memory Strategies: From Full Buffer to Compressed Summary

ConversationBufferMemory — Append Everything

The simplest approach: store every message in a list and send the complete list on every call.

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationBufferMemory(return_messages=True)
chain = ConversationChain(llm=llm, memory=memory, verbose=False)

print(chain.predict(input="My account number is ACC-8837."))
# → "Got it! Account ACC-8837 is on record."

print(chain.predict(input="I cannot log in. What should I try?"))
# The chain includes the previous turn — the model knows about ACC-8837.

# Inspect what is stored
print(memory.chat_memory.messages)
# [HumanMessage(...), AIMessage(...), HumanMessage(...), AIMessage(...)]

The memory object implements two key methods on every turn: load_memory_variables({}) reads the history before the LLM call, and save_context({"input": ...}, {"output": ...}) writes the new exchange back after. The memory_key parameter (default "history") controls the prompt template variable that receives the history string.
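The read-before-write contract is easiest to see in a toy stand-in. The class below is a minimal reimplementation of the buffer pattern for illustration, not LangChain's actual code, but it mirrors the same two-method interface.

```python
# Toy stand-in for the two-method memory contract (not LangChain's class).
class ToyBufferMemory:
    def __init__(self):
        self.messages: list[tuple[str, str]] = []  # (role, content) pairs

    def load_memory_variables(self, inputs: dict) -> dict:
        # Read step: render stored history for injection into the prompt
        history = "\n".join(f"{role}: {text}" for role, text in self.messages)
        return {"history": history}

    def save_context(self, inputs: dict, outputs: dict) -> None:
        # Write step: append the new exchange after the LLM responds
        self.messages.append(("Human", inputs["input"]))
        self.messages.append(("AI", outputs["output"]))

memory = ToyBufferMemory()
memory.save_context({"input": "My account is ACC-8837."}, {"output": "Got it."})
print(memory.load_memory_variables({})["history"])
# → Human: My account is ACC-8837.
#   AI: Got it.
```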

When to use: Prototyping, short conversations under 20 turns, cases where exact replay of every word matters (legal bots, compliance assistants).

When to avoid: Any session of unpredictable length — token cost grows linearly with no ceiling.


ConversationBufferWindowMemory — Keep Only the Last K Turns

A sliding window that retains only the most recent k human/AI exchange pairs. Older turns are silently dropped.

from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=4, return_messages=True)
chain = ConversationChain(llm=llm, memory=memory)

With k=4, your token cost is bounded regardless of how long the session runs. The tradeoff: the model loses access to anything said more than four turns ago. If the user says "As I mentioned at the start, my account is ACC-8837," and that was five turns back, the model cannot retrieve it.
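The windowing rule itself is a simple slice. This sketch shows the logic in isolation (toy message strings, not LangChain objects): with ten messages and k=4, the first exchange pair is dropped.

```python
# Sketch of the sliding-window rule: keep only the last k exchange pairs.
def window(messages: list[str], k: int) -> list[str]:
    # Each exchange pair is (human, ai), i.e. 2 messages; keep the last 2*k
    return messages[-2 * k:]

msgs = ["H1", "A1", "H2", "A2", "H3", "A3", "H4", "A4", "H5", "A5"]
print(window(msgs, 4))
# → ['H2', 'A2', 'H3', 'A3', 'H4', 'A4', 'H5', 'A5']
```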


ConversationSummaryMemory — Compress Old Turns into a Running Narrative

Instead of discarding old turns, this class summarizes them using a second LLM call. The summary replaces the raw message list in the prompt.

from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=llm, return_messages=True)
chain = ConversationChain(llm=llm, memory=memory)

After turn three, the prompt might include: "The user identified their account as ACC-8837. They reported being unable to log in. The assistant sent a password reset link." This gives the model access to the semantics of the conversation at a fraction of the raw token cost. The tradeoff: summarization is lossy. Specific values — account numbers, quoted error messages, code snippets — can be paraphrased away during compression.
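The per-turn flow can be illustrated with a stub. Here `summarize` is a placeholder standing in for the extra LLM call that ConversationSummaryMemory makes after every exchange; a real implementation would send the previous summary plus the new lines to the model.

```python
# Toy illustration of the progressive-summary loop. summarize() is a stub
# for the compression LLM call the real class makes on every turn.
def summarize(previous_summary: str, new_lines: str) -> str:
    # Placeholder: a real implementation prompts an LLM to merge the two
    return (previous_summary + " " + new_lines).strip()

summary = ""
turns = ["User gave account ACC-8837.", "User cannot log in.", "Reset link sent."]
for turn in turns:
    summary = summarize(summary, turn)  # one extra LLM call per turn in reality
print(summary)
# → User gave account ACC-8837. User cannot log in. Reset link sent.
```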

Cost note: Every turn triggers an extra LLM call for summarization. This can effectively double API costs at high volume. Reserve SummaryMemory for low-to-medium volume bots where precision of old context matters more than throughput cost.


ConversationSummaryBufferMemory — The Production Hybrid

This is the recommended starting point for most production chatbots. It keeps recent turns verbatim (within a configurable token budget) and summarizes anything older into a running digest.

from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,  # Keep verbatim until this limit is crossed
    return_messages=True
)
chain = ConversationChain(llm=llm, memory=memory)

The max_token_limit parameter is the threshold that triggers summarization. Recent exchanges — which are statistically most likely to be referenced — remain fully intact. Older context is compressed. You get high-fidelity recent memory with bounded token growth, making it the right default for any chatbot that needs to handle sessions of unknown length.


📊 Memory Flow: From User Input to LLM Response

Every memory-backed chain follows a consistent read-before-write cycle. The memory object wraps each LLM call like a decorator: it injects history before the model sees the prompt, and it persists the exchange after the response arrives.

graph TD
    A[User sends a message] --> B[Chain receives input]
    B --> C[Memory.load_memory_variables]
    C --> D[Retrieve stored history or running summary]
    D --> E[Assemble full prompt: history + new message]
    E --> F[LLM API call with full context]
    F --> G[LLM generates response]
    G --> H[Memory.save_context]
    H --> I[Write new input + output to memory store]
    I --> J[Return response to user]
    J --> A

Figure: The memory read-before-write cycle on every conversation turn. The LLM itself remains stateless; the history management lives entirely in the LangChain chain layer above it.

The key architectural point: the LLM never "remembers" anything between calls. Memory is a client-side concern. You can swap memory backends — in-memory list, Redis, DynamoDB — without touching the model, the prompt template, or any other part of the chain.


🧠 Deep Dive: Inside LangChain's Memory Architecture

Under the Hood: How Memory Classes Store and Retrieve State

All four LangChain memory classes share the same two-method interface: load_memory_variables(inputs) and save_context(inputs, outputs). The ConversationChain (and RunnableWithMessageHistory) calls them in sequence around every LLM invocation.

Internally, every class wraps a ChatMessageHistory object — a simple ordered list of BaseMessage objects (HumanMessage, AIMessage, SystemMessage). The classes differ only in what they do before handing that list to the prompt:

  • BufferMemory returns the full list unchanged.
  • BufferWindowMemory slices the list to the last k pairs before returning.
  • SummaryMemory maintains a moving_summary_buffer string and prepends it as a SystemMessage. Each save_context call triggers a compression LLM call via the predict_new_summary method.
  • SummaryBufferMemory stores both a raw ChatMessageHistory and a moving_summary_buffer. When the raw buffer exceeds max_token_limit tokens, the oldest messages are pruned from the raw list and folded into the summary string via a dedicated compression prompt.
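The prune-and-fold step of SummaryBufferMemory can be sketched in a few lines. `count_tokens` and `fold_into_summary` here are simplified stand-ins: the real class counts tokens with the model's tokenizer and compresses via the LLM prompt shown below.

```python
# Toy sketch of SummaryBufferMemory's prune step (not the real implementation).
def count_tokens(messages: list[str]) -> int:
    return sum(len(m) // 4 for m in messages)  # crude chars/4 heuristic

def fold_into_summary(summary: str, pruned: list[str]) -> str:
    folded = "; ".join(pruned)  # stand-in for the LLM compression call
    return f"{summary} | {folded}" if summary else folded

def prune(messages: list[str], summary: str, max_token_limit: int):
    pruned = []
    while messages and count_tokens(messages) > max_token_limit:
        pruned.append(messages.pop(0))  # oldest messages are folded first
    if pruned:
        summary = fold_into_summary(summary, pruned)
    return messages, summary

msgs = ["old message one", "old message two", "recent message"]
msgs, summary = prune(msgs, "", max_token_limit=5)
print(msgs)     # → ['recent message']
print(summary)  # → old message one; old message two
```

Recent messages survive verbatim; only the overflow is compressed, which is exactly the fidelity tradeoff described above.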

The compression prompt used by the summary classes is:

Progressively summarize the lines of conversation provided, adding onto
the previous summary and returning a new summary.

SUMMARY:
{summary}

NEW LINES OF CONVERSATION:
{new_lines}

NEW SUMMARY:

This compression call is a separate LLM invocation — it uses the same model you configure on the chain unless you explicitly pass a different llm instance to the memory constructor. Passing a cheaper, faster model (e.g., gpt-4o-mini for compression while using gpt-4o for main responses) is a practical cost optimization.
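A configuration sketch of that cost optimization might look like this (model names are illustrative; the key point is that the `llm` passed to the memory constructor is used only for compression):

```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

main_llm = ChatOpenAI(model="gpt-4o", temperature=0)          # main responses
compression_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # summaries only

memory = ConversationSummaryBufferMemory(
    llm=compression_llm,   # the cheaper model handles compression calls
    max_token_limit=1000,
    return_messages=True,
)
chain = ConversationChain(llm=main_llm, memory=memory)
```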

Performance Analysis: Token Cost vs. Recall Quality

The four strategies form a clear spectrum across the dimensions that matter in production:

Strategy | Token growth | Recall fidelity | Extra LLM calls per turn | Best for
---|---|---|---|---
BufferMemory | Linear — unbounded | Perfect — every word | 0 | Prototypes, short sessions
BufferWindowMemory | Constant — capped at k pairs | Last k turns only | 0 | Long sessions, cost-sensitive bots
SummaryMemory | Near-constant | Semantic — facts may be paraphrased | 1 always | Low-volume, accuracy-critical bots
SummaryBufferMemory | Bounded by max_token_limit | Recent: perfect; old: semantic | 1 when threshold is crossed | Production chatbots, default choice

The latency bottleneck for SummaryMemory variants is token counting via tiktoken before deciding whether to trigger compression. For high-throughput applications (thousands of concurrent sessions), this CPU overhead is measurable and should be profiled. In most cases it is negligible compared to the LLM round-trip itself.


🌍 Real-World Application: A Multi-Turn Code Review Assistant

A Code Review Assistant is an ideal showcase for memory: the user submits code in the first turn and then asks follow-up questions that only make sense in the context of what was previously reviewed. Without memory, every follow-up requires the user to paste the code again.

The following five-turn session uses ConversationSummaryBufferMemory. The first three turns fit comfortably within the 800-token verbatim buffer; by turn four, older context is summarized, yet the model retains the critical facts.

from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=800)

template = """You are a senior code reviewer. Be concise and specific.

{history}
Human: {input}
Assistant:"""

prompt = PromptTemplate(input_variables=["history", "input"], template=template)
chain = ConversationChain(llm=llm, memory=memory, prompt=prompt)

# Turn 1 — user submits code for review
response = chain.predict(input="""
Review this Python function:
def fetch_user(user_id):
    conn = db.connect()
    result = conn.execute(f"SELECT * FROM users WHERE id = {user_id}")
    return result.fetchone()
""")
# → Critical issue: SQL injection vulnerability on the f-string query.
#   Use a parameterized query: conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))
#   Also: the connection is never closed — wrap in a context manager.

# Turn 2 — follow-up: ask for corrected code
response = chain.predict(input="Show me the corrected version with both fixes.")
# → def fetch_user(user_id):
#       with db.connect() as conn:
#           result = conn.execute("SELECT * FROM users WHERE id = ?", (user_id,))
#           return result.fetchone()

# Turn 3 — ask about a related pattern
response = chain.predict(input="Is SQLAlchemy connection pooling better here?")
# → Yes — use create_engine() with pool_size and pool_recycle for production.
#   The context manager approach you have now is fine for scripts.

# Turn 4 — reference the corrected code from turn 2
response = chain.predict(input="Can the function you showed handle None as user_id?")
# → No. Add a guard: if user_id is None: raise ValueError("user_id must not be None")
#   The parameterized query would pass NULL to the database, returning unexpected results.

# Turn 5 — synthesize all findings
response = chain.predict(input="Summarize every issue found and the final recommended code.")
# → Three issues addressed:
#   1. SQL injection — fixed with parameterized query
#   2. Unclosed connection — fixed with context manager
#   3. None input — add a ValueError guard before the query
print(response)

Turn 5 works because the verbatim buffer retained the key exchange from turns 1 and 2 (still under 800 tokens). The model can enumerate all three issues by reading the buffer directly — no semantic reconstruction required. If the session had been longer, turns 1 and 2 would have been summarized into: "Reviewed fetch_user. Issues: SQL injection, unclosed connection. Fixed with parameterized query and context manager." That summary still contains the three-issue count, so turn 5 would remain correct.


🛠️ Community Memory Backends: Redis, MongoDB, and DynamoDB

The in-memory ChatMessageHistory is appropriate for single-process development and testing. In production, where multiple workers serve the same user concurrently or sessions must survive process restarts, you need a persistent backend.

The langchain-community package ships ready-made history implementations for the most common stores. They all implement BaseChatMessageHistory and can be dropped in as the chat_memory argument of any LangChain memory class.

# pip install langchain-community redis
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

history = RedisChatMessageHistory(
    session_id="user-ACC-8837",
    url="redis://localhost:6379",
    ttl=3600,  # Expire inactive sessions after 1 hour
)
memory = ConversationSummaryBufferMemory(
    llm=llm,
    chat_memory=history,
    max_token_limit=1000,
    return_messages=True,
)
# MongoDB backend
from langchain_community.chat_message_histories import MongoDBChatMessageHistory

history = MongoDBChatMessageHistory(
    session_id="user-ACC-8837",
    connection_string="mongodb://localhost:27017",
    database_name="chat_sessions",
    collection_name="message_history",
)
# DynamoDB backend
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory

history = DynamoDBChatMessageHistory(
    table_name="LangChainSessions",
    session_id="user-ACC-8837",
)

The session_id is your conversation key — tie it to your authentication system. A typical pattern is f"user-{auth_user_id}-{session_uuid}" so that a returning user can optionally resume a previous session, or start fresh with a new UUID.


⚖️ Trade-offs & Failure Modes in Conversational Memory

The most expensive lesson teams learn is deploying ConversationBufferMemory to production and discovering that power users with 150-turn sessions start hitting context window limits at 2 AM. The following table maps the failure modes to their causes and mitigations:

Failure mode | Root cause | Mitigation
---|---|---
Context window overflow | BufferMemory grows unbounded in long sessions | Switch to BufferWindowMemory or SummaryBufferMemory with max_token_limit
Critical context forgotten | WindowMemory drops turns older than k | Increase k, or move to SummaryBufferMemory to preserve semantic context
Summarization fact drift | SummaryMemory paraphrases away specific values (account numbers, code snippets) | Use SummaryBufferMemory with a higher verbatim token limit
Summarization latency spike | Every turn triggers an extra LLM round-trip | Only use summary classes below ~1000 req/min; use window memory for high throughput
Stale context after correction | User corrects a fact; the old summary is already written | Design prompts to explicitly accept overrides: "My account number is now X, not Y"

The most impactful mitigation: instrument your memory usage from day one. Log len(memory.chat_memory.messages) and the token count per session. Set an alert when a session exceeds 70% of your model's context window. A single SummaryBufferMemory with max_token_limit at 60–70% of the available window gives you a safe operating envelope with no engineering surprises.
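The alerting rule is simple enough to sketch. The threshold, window size, and function name below are illustrative assumptions, not a LangChain API:

```python
# Illustrative budget check: warn when a session's history approaches
# the model's context window (128k tokens assumed here for GPT-4o).
CONTEXT_WINDOW = 128_000
ALERT_FRACTION = 0.7

def check_session_budget(message_token_count: int) -> str:
    used = message_token_count / CONTEXT_WINDOW
    if used >= ALERT_FRACTION:
        return f"ALERT: session at {used:.0%} of context window"
    return f"ok: {used:.0%}"

print(check_session_budget(30_000))   # → ok: 23%
print(check_session_budget(100_000))  # → ALERT: session at 78% of context window
```

Wire the ALERT branch to your monitoring system instead of returning a string; the point is to trip the alarm well before the hard limit.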


🧭 Decision Guide: Choosing the Right Memory Strategy

Situation | Recommendation
---|---
Use when | Building a production chatbot with sessions of unknown length → start with ConversationSummaryBufferMemory(max_token_limit=1000) and tune the limit to your model's context window capacity
Avoid when | Conversations are guaranteed short (under 10 turns, under 2000 tokens total) — ConversationBufferMemory is simpler and has no downsides at that scale
Better alternative | When you need multi-agent coordination, tool call history, or the ability to resume a complex multi-step workflow mid-execution → LangGraph checkpointing handles these cases more cleanly than any LangChain memory class
Edge cases | If the user completely changes topic mid-session (switching from billing support to technical support), a running summary can carry stale context into the new topic — consider resetting memory explicitly on topic-change signals

🧪 The Modern LCEL Memory Pattern: RunnableWithMessageHistory

ConversationChain is the legacy API. For any new project, use RunnableWithMessageHistory — LangChain's LCEL-native memory wrapper. It is composable with the | pipe operator, supports async streaming natively, and cleanly separates session management (your responsibility) from chain logic (LangChain's responsibility).

from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful coding assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm  # Clean LCEL pipe — no ConversationChain wrapper

# In-memory store: swap for RedisChatMessageHistory in production
store: dict[str, BaseChatMessageHistory] = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# Pass session identity via config on every call
config = {"configurable": {"session_id": "user-ACC-8837"}}

r1 = chain_with_history.invoke({"input": "My account is ACC-8837."}, config=config)
r2 = chain_with_history.invoke({"input": "What account did I just mention?"}, config=config)
print(r2.content)
# → "You mentioned account ACC-8837."

To persist sessions across restarts, replace ChatMessageHistory() with RedisChatMessageHistory(session_id=session_id, url=REDIS_URL) inside get_session_history. No other changes are needed — the chain, the prompt, and the invocation logic remain identical.
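The swapped-in function would look roughly like this (REDIS_URL is assumed to come from your configuration; a Redis instance must be reachable at runtime):

```python
from langchain_community.chat_message_histories import RedisChatMessageHistory

REDIS_URL = "redis://localhost:6379"  # assumed config value

def get_session_history(session_id: str) -> RedisChatMessageHistory:
    # Redis keys the history by session_id; no in-process dict is needed,
    # so any worker can serve any session
    return RedisChatMessageHistory(session_id=session_id, url=REDIS_URL)
```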

Comparison with LangGraph checkpointing: LangGraph takes a fundamentally different approach. Instead of a memory object injected into a chain, the full agent state — including every message, every tool call result, and every intermediate graph node output — is persisted as a checkpoint tied to a thread_id. This is more powerful, but it requires adopting the graph abstraction. For a simple conversational chatbot without tools or branching logic, RunnableWithMessageHistory is the right tool. For an agent that calls external APIs, branches based on results, and needs to resume mid-workflow after a timeout, graduate to LangGraph checkpointing (see Related Posts below).


📚 Lessons Learned: What Production Chatbots Teach About Memory

Set a token budget before your first deploy, not after your first incident. ConversationBufferMemory is tempting because it requires zero configuration. Every team that skips the token limit ends up setting one after an outage. Pick SummaryBufferMemory from the start and you never have this conversation.

Session IDs are your responsibility. LangChain does not manage session lifecycle. You must generate, store, and expire session IDs. Tie them to your authentication layer. A user who logs out should not automatically resume a previous session unless you explicitly build that resumption flow.

The summary is lossy — design around that. SummaryMemory does not preserve quoted strings, exact numbers, or structured data verbatim. If precision matters (account numbers, order IDs, code snippets), those values need to stay in the verbatim buffer or be stored separately in a structured store alongside the memory.

Test with long conversations, not just smoke tests. Add automated tests that simulate 50-turn and 100-turn sessions. Assert that critical facts mentioned in turn 1 are still retrievable in turn 50. This is the single most commonly skipped quality gate in chatbot engineering, and the most reliably painful omission to discover in production.
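To show why such tests matter, here is a toy harness (plain strings, not LangChain objects) simulating a 50-turn session under window pruning. With k=4, a fact from turn 1 is provably gone by turn 50; the same assertion against your real memory configuration is the regression test to write.

```python
# Toy 50-turn session under sliding-window pruning (names illustrative).
def run_session(memory_limit_pairs: int, n_turns: int) -> list[str]:
    messages: list[str] = []
    messages.append("Human: my account is ACC-8837")  # the turn-1 fact
    messages.append("AI: noted")
    for i in range(2, n_turns + 1):
        messages.append(f"Human: turn {i}")
        messages.append(f"AI: reply {i}")
        messages = messages[-2 * memory_limit_pairs:]  # window pruning
    return messages

final = run_session(memory_limit_pairs=4, n_turns=50)
print(any("ACC-8837" in m for m in final))  # → False: the fact was pruned away
```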

Prefer RunnableWithMessageHistory for all new code. ConversationChain is a legacy API that will eventually be deprecated. The LCEL pattern composes with any runnable, supports streaming with .stream(), and is async-ready with .ainvoke(). There is no reason to use ConversationChain in a codebase started after LangChain 0.2.


📌 TLDR: Summary & Key Takeaways

  • LLMs are stateless by design. Every API call receives only the messages you send. Conversation history must be assembled and passed explicitly on every turn.
  • ConversationBufferMemory sends the full history — perfect recall, unbounded token cost.
  • ConversationBufferWindowMemory caps cost by keeping only the last k turns — simplest budget control, but early context is permanently lost.
  • ConversationSummaryMemory compresses old turns into a running narrative — semantic recall at near-constant token cost, but lossy on specific values.
  • ConversationSummaryBufferMemory is the production default — recent turns verbatim, older turns summarized, bounded by max_token_limit.
  • RunnableWithMessageHistory is the modern LCEL replacement for ConversationChain — composable, async-native, and backend-agnostic.
  • Community backends (Redis, MongoDB, DynamoDB) all implement the same interface — swap them with a single constructor change, zero chain logic changes.
  • For agents that need tool call state, graph branching, or cross-session persistence, LangGraph checkpointing is the next step up.

One-liner to remember: LangChain memory is prompt management — every strategy is a different answer to the same question: which messages go into this call?


📝 Practice Quiz

  1. A stateless LLM chatbot forgets the user's account number after the first message because:

    • A) The LLM deletes messages from its context after processing them
    • B) Each API call includes only the messages you pass — history is not stored server-side
    • C) LLM APIs enforce a one-message-per-session rule by default
    • D) The account number exceeds the context window token limit

    Correct Answer: B
  2. A production support bot runs sessions averaging 80 turns and 15,000 tokens of history. The team needs to prevent context window errors while preserving as much semantic fidelity as possible. Which memory strategy is most appropriate?

    • A) ConversationBufferMemory with no token limit
    • B) ConversationBufferWindowMemory with k=3
    • C) ConversationSummaryBufferMemory with max_token_limit=3000
    • D) No memory — send only the current message each turn

    Correct Answer: C
  3. In RunnableWithMessageHistory, what is the role of the get_session_history function?

    • A) It summarizes past messages into a compressed string for the prompt
    • B) It returns a BaseChatMessageHistory instance keyed to a session ID, enabling per-user storage
    • C) It counts tokens in the current conversation and triggers summarization when needed
    • D) It configures the LLM temperature and model parameters for a given session

    Correct Answer: B
  4. (Open-ended challenge) A user says: "My name is Alice, account ACC-8837, and I have called three times about this billing error." You are using ConversationSummaryMemory. Twenty turns later, the agent needs Alice's account number to process a refund. Describe two concrete failure modes that could prevent the agent from retrieving ACC-8837, and propose one architectural pattern that eliminates both.


Written by Abstract Algorithms (@abstractalgorithms)