How Chatbot Memory Actually Works in Python (And Why LLMs Are Forgetful by Design)
You build your first chatbot. It responds brilliantly to your opening message. You follow up with a reference to what it just said — and it stares back at you blankly, with zero recollection. What went wrong?
Nothing, actually. The model behaved exactly as designed. Understanding why is the single most important conceptual leap you can make as a chatbot developer.
---
1. The Big Misconception: LLMs Are Stateless by Design
Most beginners assume that a language model “remembers” the conversation as it progresses, the way a human would. This is the number-one misconception in applied LLM development, and it leads to hours of confusing debugging.
The reality: every API call is completely independent. When you send a request to GPT-4, Claude, or Gemini, the model has no memory of any previous request you’ve made — not five seconds ago, not five milliseconds ago. It processes exactly what you send in the current HTTP request and returns a response. That’s it. The slate is wiped clean every single time.
This is what “stateless” means in practice. The model itself stores nothing between calls. If you want it to “remember” something, you are responsible for sending that information back with every request.
---
2. The Messages List Pattern: Engineering the Illusion of Memory
So how do real chatbots appear to maintain context? Through a deceptively simple technique: manually managing a list of messages and appending it to every API call.
Here’s the core pattern in Python:
```python
from openai import OpenAI

client = OpenAI()

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_input):
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
```
Every turn, you append the user’s message, fire the API call with the full history, then append the model’s reply. On the next turn, you repeat — sending the entire conversation from the beginning.
This is the foundational pattern behind every production chatbot. The model isn’t remembering anything; you’re reconstructing its memory from scratch on every call. The model simply reads the full transcript and responds accordingly. It’s less like a brain and more like a very fast reader who can only work from documents you hand them.
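To see the mechanics without touching a real API, here is the same append/send/append loop with a stub standing in for the model call. `fake_model` is a placeholder of my own, not part of any SDK; its only job is to make visible how much history gets re-sent each turn.

```python
# Sketch: the chat loop with a stub in place of the real API call.
# `fake_model` is a hypothetical stand-in for client.chat.completions.create.

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def fake_model(history):
    # Pretend the model reports how much context it received.
    return f"I can see {len(history)} messages."

def chat(user_input):
    messages.append({"role": "user", "content": user_input})
    reply = fake_model(messages)  # the full history is sent every turn
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Hello"))         # the stub sees 2 messages (system + user)
print(chat("Still there?"))  # now it sees 4: the history grows every turn
```

Each call hands the stub a strictly larger transcript, which is exactly what happens with a real model: the "memory" is nothing but the payload you rebuild and resend.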
---
3. The Token Problem: When Memory Gets Expensive
The messages list pattern works beautifully — until it doesn’t. Every model has a context window, measured in tokens (roughly 0.75 words per token). GPT-4o supports up to 128,000 tokens, which sounds enormous, until you’re running a customer support bot that handles hour-long conversations or processes large documents alongside dialogue.
Three compounding problems emerge as history grows:
- Cost: You pay for every input token on every call. A 50-turn conversation means you’re re-sending the entire history 50 times. Costs compound rapidly at scale.
- Latency: Larger context windows mean slower inference times. A bloated message history degrades user experience.
- Context overflow: Exceed the limit and the API throws an error — or silently truncates the oldest messages, destroying context you assumed was intact.
Ignoring token growth is how chatbots fail quietly in production.
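The cost problem is worth quantifying. A rough sketch, assuming each turn adds about 200 tokens of user message plus reply (an illustrative figure, not a measurement): because turn *k* re-sends all *k − 1* previous turns, cumulative input tokens grow quadratically with conversation length, not linearly.

```python
# Back-of-envelope estimate of how re-sending history compounds input cost.
# TOKENS_PER_TURN is an assumed average, purely for illustration.

TOKENS_PER_TURN = 200

def cumulative_input_tokens(turns):
    # On turn k you send k turns' worth of tokens (the k-1 old ones plus
    # the new message), so the total is 200 * (1 + 2 + ... + turns).
    return sum(k * TOKENS_PER_TURN for k in range(1, turns + 1))

print(cumulative_input_tokens(10))  # 11000 tokens billed as input
print(cumulative_input_tokens(50))  # 255000 tokens -- ~23x more for 5x the turns
```

Five times the conversation length costs roughly twenty-three times the input tokens under these assumptions, which is why unmanaged history gets expensive fast.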
---
4. Pruning Strategies: Trimming History Without Losing the Thread
The solution is active context management. Two practical strategies dominate:
Sliding Window
Keep only the last N messages (e.g., the most recent 10 turns). Implementation is trivial:
```python
messages = [system_message] + conversation_history[-10:]
```
This is fast and predictable, but it loses early context — a problem if the user referenced something from much earlier in the conversation.
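One refinement worth making: slice in whole user/assistant pairs rather than raw messages, so you never keep an assistant reply without the user message that prompted it. A minimal sketch, with `last_n_turns` as a hypothetical helper name of my own:

```python
# Sketch: sliding window that keeps the system prompt plus the last
# n_turns user/assistant pairs, never splitting a pair.

def last_n_turns(messages, n_turns):
    system, history = messages[:1], messages[1:]
    return system + history[-2 * n_turns:]  # 2 messages per turn

# Build a 6-turn dummy conversation.
history = [{"role": "system", "content": "sys"}]
for i in range(6):
    history.append({"role": "user", "content": f"q{i}"})
    history.append({"role": "assistant", "content": f"a{i}"})

trimmed = last_n_turns(history, 2)
# Keeps: system prompt, then q4, a4, q5, a5
```

The slice `[-2 * n_turns:]` assumes strictly alternating user/assistant messages; if your history can contain tool calls or consecutive same-role messages, you would walk backwards by role instead.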
Token Budgeting with `tiktoken`
OpenAI’s `tiktoken` library lets you count tokens precisely before sending a request. You can enforce a hard budget:
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def trim_to_budget(messages, max_tokens=4000):
    # Count tokens across all non-system messages and drop the oldest
    # until the total fits within the budget.
    while sum(len(enc.encode(m["content"])) for m in messages[1:]) > max_tokens:
        messages.pop(1)  # remove the oldest non-system message
    return messages
```
This approach is more precise and ensures you never accidentally exceed your budget, regardless of message length variation.
A hybrid approach — sliding window for speed, token budgeting for safety — is what most production systems use.
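The hybrid can be sketched in a few lines: apply the cheap sliding window first, then enforce the hard token budget on what remains. To keep the example self-contained I use a crude whitespace word count as the token estimate; in production you would substitute `tiktoken` as shown above. The function name `prepare_context` is my own.

```python
# Sketch of the hybrid strategy: coarse sliding window for speed,
# then a token budget for safety. estimate_tokens is a deliberately
# crude stand-in for a real tokenizer.

def estimate_tokens(text):
    return len(text.split())  # swap in tiktoken for real counts

def prepare_context(messages, window=10, max_tokens=4000):
    system, history = messages[:1], messages[1:]
    history = history[-window:]  # fast, fixed-size trim first
    # Then drop the oldest survivors until the budget holds.
    while history and sum(estimate_tokens(m["content"]) for m in history) > max_tokens:
        history.pop(0)
    return system + history

# 20 messages of ~100 words each; only what fits 500 "tokens" survives.
msgs = [{"role": "system", "content": "sys"}] + [
    {"role": "user", "content": "word " * 100} for _ in range(20)
]
ctx = prepare_context(msgs, window=10, max_tokens=500)
```

The window bounds the worst-case cost of the token-counting loop, and the budget guarantees you never overflow the context window regardless of how long individual messages are.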
---
5. LangChain Memory Types: Choosing the Right Tool
For developers who prefer an abstraction layer, LangChain offers built-in memory components that handle this bookkeeping automatically.
`ConversationBufferMemory` stores the raw message history verbatim — functionally equivalent to the manual messages list pattern. It’s transparent, predictable, and ideal for short conversations where you need full fidelity of every exchange.
`ConversationSummaryMemory` takes a different approach: instead of accumulating raw turns, it uses the LLM itself to progressively summarize older portions of the conversation, replacing verbose history with compact summaries. The result is a context window that stays lean while preserving the meaning of past interactions.
| | `ConversationBufferMemory` | `ConversationSummaryMemory` |
| --- | --- | --- |
| Best for | Short sessions, debugging | Long-running conversations |
| Token usage | Grows linearly | Stays bounded |
| Fidelity | Exact | Lossy (by design) |
| LLM calls | One per user turn | Extra call to summarize |
Use `ConversationBufferMemory` when accuracy of wording matters (e.g., legal or technical contexts). Use `ConversationSummaryMemory` when you’re optimizing for long sessions and can tolerate slight paraphrasing of earlier content.
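The idea behind summary memory is easy to sketch without LangChain: once the history exceeds a threshold, fold the overflow into a running summary and keep only recent turns verbatim. In this sketch, `summarize` is a stub that just concatenates content; a real implementation (including LangChain's) makes an extra LLM call here. The class and method names are my own, not LangChain's API.

```python
# Minimal sketch of the summary-memory pattern. `summarize` is a stub;
# a real system would prompt the LLM to merge old turns into the summary.

def summarize(old_summary, turns):
    topics = ", ".join(t["content"] for t in turns)
    merged = f"discussed: {topics}"
    return f"{old_summary} | {merged}" if old_summary else merged

class SummaryMemory:
    def __init__(self, keep_recent=4):
        self.summary = ""
        self.turns = []
        self.keep_recent = keep_recent

    def add(self, role, content):
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.keep_recent:
            # Fold everything older than the recent window into the summary.
            overflow = self.turns[:-self.keep_recent]
            self.turns = self.turns[-self.keep_recent:]
            self.summary = summarize(self.summary, overflow)

    def context(self):
        # The summary rides along as a system note; recent turns stay verbatim.
        prefix = ([{"role": "system", "content": f"Summary so far: {self.summary}"}]
                  if self.summary else [])
        return prefix + self.turns

mem = SummaryMemory(keep_recent=2)
for i in range(4):
    mem.add("user", f"message {i}")
# mem.turns holds the 2 newest messages; older ones live in mem.summary
```

Token usage stays bounded because the verbatim window is fixed and the summary grows far more slowly than raw history, at the cost of the lossiness noted in the table above.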
---
6. Takeaway: Memory Management Is the Core Skill
Building a chatbot that responds is trivial. Building one that remembers coherently, scales economically, and degrades gracefully is where real engineering begins.
Mastering the messages list pattern — and knowing when and how to prune it — is not an advanced optimization. It is the foundational skill of production chatbot development. Every framework, every abstraction, every memory component you encounter is ultimately a variation on this one idea: the model only knows what you tell it, so be deliberate about what you send.
Get this right, and every other aspect of chatbot architecture becomes considerably easier to reason about.