# Memory Architectures for AI Agents
Compare memory implementations across systems. Flat files, structured databases, vector stores, and hybrid approaches. Map MemGPT, Claude, ChatGPT, and coding agents to episodic, semantic, and procedural memory concepts.
## The Problem This Solves
The Write Outside the Window pattern establishes that you need persistence, but the question is what kind: a flat markdown file, a vector database, a relational store, or some combination. The answer depends on what you’re storing, how you query it, and how much infrastructure complexity you’re willing to take on.
## Memory Taxonomy
The cognitive science framing (episodic, semantic, procedural) maps cleanly to different implementation choices, each needing different storage, retrieval, and update patterns.
### Episodic Memory
What happened: conversation history, tool executions, task outcomes.
Claude Code uses CLAUDE.md to accumulate lessons learned: every time Claude discovers something about the project (a circular import, a test that needs a specific mock), it writes it down, and future sessions read this file to avoid repeating the same mistakes. For example:
```markdown
# Project Memory
- Auth module has circular import; use interface not direct import
- Rate limiter tests fail without Redis mock
- User.email unique constraint not enforced at ORM level
```
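In code, the accumulation step can be as small as an append with a duplicate check. A minimal sketch — the function name, header, and dedup rule are illustrative choices, not Claude Code's actual mechanism:

```python
from pathlib import Path

def record_lesson(lesson: str, memory_file: str = "CLAUDE.md") -> None:
    """Append a newly discovered project fact to the flat memory file.

    Skips exact duplicates so the file stays small enough to read in full.
    """
    path = Path(memory_file)
    existing = path.read_text().splitlines() if path.exists() else []
    if any(line.endswith(lesson) for line in existing):
        return  # already recorded
    with path.open("a") as f:
        if not existing:
            f.write("# Project Memory\n")  # start the file with a heading
        f.write(f"- {lesson}\n")
```

The duplicate check matters more than it looks: without it, the file fills with repeated entries and loses the "small by design" property that makes flat files work.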
### Semantic Memory
What is known: facts, knowledge, documentation, domain information.
This is where RAG lives. Documents are chunked, embedded, and indexed in a vector store; at query time, relevant chunks come back ranked by semantic similarity.
```python
# Assumes an `embed_model` and `vector_store` are already configured.
def store_fact(fact, metadata):
    embedding = embed_model.encode(fact)
    vector_store.add(embedding, {"text": fact, "meta": metadata})

def retrieve_fact(query):
    query_embedding = embed_model.encode(query)
    results = vector_store.search(query_embedding, k=5)
    return [r["text"] for r in results]
```
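The chunking step is often glossed over. The simplest version is a fixed-size character window with overlap, so facts that straddle a boundary appear intact in at least one chunk; real pipelines usually split on sentence or token boundaries, and the sizes here are illustrative:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far each window advances
    return [text[start:start + size]
            for start in range(0, max(len(text), 1), step)]
```

The overlap is the design choice worth thinking about: too small and boundary-straddling facts get split, too large and you pay for redundant embeddings.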
### Procedural Memory
How to act: system prompts, agent instructions, behavioral patterns.
This is the most overlooked memory type because people don’t think of system prompts as “memory,” but that’s exactly what they are: persistent instructions that shape every interaction.
```
You are a code reviewer.
Your process:
1. Read the changed files
2. Check for security issues
3. Check for performance problems
4. Verify test coverage
Output format: JSON with issues array.
```
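Treating that prompt as procedural memory has a practical consequence: it lives in a versioned file and gets injected at the front of every conversation. A minimal sketch, assuming the role/content message shape most chat APIs use (the file-per-prompt convention is an illustrative choice):

```python
from pathlib import Path

def build_messages(prompt_file: str, user_input: str) -> list[dict]:
    """Load procedural memory (the system prompt) from a versioned file
    and inject it ahead of the user's message."""
    system_prompt = Path(prompt_file).read_text().strip()
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
```

Keeping the prompt in a file rather than a string constant is what makes it *memory*: it can be diffed, reviewed, and improved over time like any other persisted knowledge.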
## Architecture Comparison
### Flat File Memory
Used by: Claude Code, Cursor, Windsurf.
Plain text or markdown files in the project root. The simplest possible implementation, and for most projects the right one.
Pros: Human readable, human editable, version controllable, zero infrastructure.
Cons: Linear search only; scales poorly past a few thousand lines with no ranking or filtering.
Don’t underestimate flat files. A well-maintained CLAUDE.md with 50 lines of hard-won project knowledge outperforms a vector store full of auto-generated summaries, because every line was written by a human who knew which constraints actually mattered.
### Vector Store Memory
Used by: RAG systems, MemGPT, Letta.
Embeddings stored in a vector database (Pinecone, Weaviate, Chroma, pgvector).
Pros: Semantic search at scale, millions of documents, built-in relevance ranking.
Cons: Requires an embedding model; retrieval is approximate rather than exact; metadata management adds complexity that's easy to underestimate.
Best for: Large document corpora. Overkill for project-level memory where a flat file suffices.
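To make the moving parts concrete without any infrastructure, here is a toy store using a bag-of-words "embedding" and cosine similarity. A real system swaps in a learned embedding model and an approximate-nearest-neighbor index, but the add/search shape is the same (all names here are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. Real systems use a
    # learned model; this only illustrates the store's interface.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self):
        self.items = []  # (vector, payload) pairs

    def add(self, text: str, meta: dict) -> None:
        self.items.append((embed(text), {"text": text, "meta": meta}))

    def search(self, query: str, k: int = 5) -> list[dict]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]),
                        reverse=True)
        return [payload for _, payload in ranked[:k]]
```

Even in this toy, the cons above are visible: results are ranked guesses, not exact matches, and the metadata dict grows into its own schema problem as soon as you want to filter by it.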
### Structured Database Memory
Used by: Enterprise systems, LangGraph state.
Relational, document, or graph databases with explicit schemas.
Pros: Exact queries, rich capabilities (joins, aggregations, filters), typed fields.
Cons: Schema design upfront, less flexible for unstructured queries, semantic search needs a separate component.
Best for: When you know the shape of your data and need precise lookups rather than fuzzy similarity.
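A sketch of the exact-query strength, using SQLite and a hypothetical `task_outcomes` schema for episodic memory:

```python
import sqlite3

# Hypothetical schema: one row per task outcome.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_outcomes (
        task    TEXT NOT NULL,
        status  TEXT NOT NULL,
        detail  TEXT
    )
""")
conn.executemany(
    "INSERT INTO task_outcomes VALUES (?, ?, ?)",
    [
        ("run-tests", "failed", "Redis mock missing"),
        ("run-tests", "passed", None),
        ("deploy",    "passed", None),
    ],
)
# Exact, typed queries -- the thing flat files and vector stores can't do.
failures = conn.execute(
    "SELECT task, detail FROM task_outcomes WHERE status = 'failed'"
).fetchall()
```

"Show me every failed task and why" is a one-line query here; against a vector store it would be an approximate search with no guarantee of completeness.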
### Hybrid Approaches
Most production systems combine multiple approaches:
```python
class AgentMemory:
    def __init__(self):
        self.episodic = MessageStore()      # conversation history
        self.semantic = VectorStore()       # embedded documents
        self.procedural = SystemPrompt()    # behavioral instructions
        self.flat = FlatFile("CLAUDE.md")   # curated project notes

    def read(self, query):
        recent = self.episodic.last_n(10)
        relevant = self.semantic.search(query)
        quick = self.flat.read_all()  # loaded in full; small by design
        return combine(recent, relevant, quick)
```
Build a hybrid when you have genuinely different query patterns. Don’t build one because it seems more sophisticated.
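The combine step is where the design choices actually live. One reasonable policy — illustrative, not a fixed recipe — puts curated flat-file notes first, then retrieved facts, then recent messages, deduplicated and capped:

```python
def combine(recent, relevant, quick, limit=20):
    """Merge memory sources into one context list, highest priority
    first: curated notes, then retrieved facts, then recent messages.
    Deduplicates and caps length to protect the context budget."""
    merged, seen = [], set()
    for item in list(quick) + list(relevant) + list(recent):
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged[:limit]
```

The ordering encodes a judgment: human-curated knowledge outranks retrieved knowledge, which outranks raw recency. A different agent might reasonably invert that.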
## System Comparisons
### MemGPT / Letta
The most ambitious approach in this space: a full memory hierarchy where the system decides what stays in working memory versus what gets archived. Working memory holds the current conversation and active task, archival memory holds everything else (searchable on demand), and core memory holds facts that must persist across all interactions.
The design is appealing because it mirrors how humans actually handle memory, offloading things we don't need right now and retrieving them when relevant. Letta's own Context-Bench evaluates how well agents maintain facts and context across long interactions, and the results show that even purpose-built memory systems struggle with multi-hop retrieval once archival memory grows large.

The problem is that "what to archive" is itself an LLM call. If the model decides to archive something it should have kept, or keep something it should have evicted, you get degraded behavior that's genuinely hard to debug. You can't easily inspect why it forgot something; the memory management layer adds a second source of failure on top of the model itself.
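Stripped of the LLM-driven decisions, the paging mechanics are simple; the FIFO eviction below is exactly the simplification that hides the hard part (class and method names are illustrative, not Letta's API):

```python
class PagedMemory:
    """Sketch of a MemGPT-style hierarchy: a bounded working set that
    evicts its oldest entries to a searchable archive. The real system
    lets the model decide what to evict; here it's just FIFO."""

    def __init__(self, working_limit: int = 4):
        self.working_limit = working_limit
        self.working: list[str] = []   # in-context
        self.archival: list[str] = []  # out-of-context, searchable

    def remember(self, entry: str) -> None:
        self.working.append(entry)
        while len(self.working) > self.working_limit:
            self.archival.append(self.working.pop(0))  # evict oldest

    def recall(self, keyword: str) -> list[str]:
        # Keyword match stands in for the semantic search a real
        # system would run against archival memory.
        return [e for e in self.archival if keyword.lower() in e.lower()]
```

Everything interesting in MemGPT lives in replacing that `pop(0)` with a model-mediated judgment call, and that replacement is where the debugging pain comes from.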
### Claude (via AGENTS.md / CLAUDE.md)
Flat file memory with human-in-the-loop curation: the model writes to the file, humans edit it directly. This sounds too simple to be worth comparing to MemGPT, but in practice the human curation step does something automatic systems can’t: it filters out noise. A well-maintained file with 50 carefully selected entries beats a vector store with 5000 auto-generated summaries of mixed quality. The constraint is a feature.
### ChatGPT (via the `bio` Tool)
ChatGPT’s memory architecture is visible in its leaked system prompts. Memory is implemented as a tool called `bio` that persists facts as timestamped lines injected after the system prompt under a `# Model Set Context` heading. When you say “remember that I prefer Python over JavaScript,” the model calls the `bio` tool, which stores the fact. On future sessions, all stored memories are injected fresh into the context before the conversation begins.
Several design choices stand out: memories are summarized and merged automatically (“I love dogs” plus “I love cats” becomes “User loves dogs and cats”), the user can say “forget everything” and the model calls the tool to clear the store, and the entire memory is visible to the user through Settings. It’s tool-mediated persistence with explicit user control, sitting between Claude’s fully manual flat-file approach and MemGPT’s fully automatic archival system.
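The described behavior is easy to sketch. Everything beyond what the leaked prompts show — method names, storage format, how merging is triggered — is an assumption here:

```python
from datetime import date

class BioTool:
    """Sketch of ChatGPT-style tool-mediated memory: timestamped
    entries rendered under a '# Model Set Context' heading and
    injected after the system prompt each session. The interface is
    illustrative; only the rendered shape comes from leaked prompts."""

    def __init__(self):
        self.entries: list[str] = []

    def store(self, fact: str) -> None:
        self.entries.append(f"[{date.today().isoformat()}] {fact}")

    def forget_everything(self) -> None:
        # "Forget everything" maps to clearing the store.
        self.entries.clear()

    def render(self) -> str:
        # Injected into context at the start of each session.
        if not self.entries:
            return ""
        return "# Model Set Context\n" + "\n".join(self.entries)
```

Note what is *not* sketched: the automatic summarize-and-merge step is another LLM call, with the same failure modes as MemGPT's archival decisions, just on a smaller surface.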
The evolution tells you something. Earlier ChatGPT versions had bio enabled by default; more recent prompts show it disabled with a redirect to Settings. OpenAI appears to have learned that unbounded memory accumulation creates its own problems, echoing the system prompt accumulation pattern described in System Prompt Engineering. When memory grows without curation, the model’s context fills with stale preferences and outdated facts that were relevant three months ago but no longer are.
### Coding Agents
Cursor and Windsurf split the problem: config files handle procedural memory, and optional repository-level semantic indexes handle semantic memory over the codebase. Simpler than MemGPT, more structured than a flat file, and enough for most coding workflows.
## Choosing an Architecture
| Factor | Flat File | Vector Store | Database | Hybrid |
|---|---|---|---|---|
| Scale | <10k lines | Any | Any | Any |
| Retrieval | Linear | Semantic | Exact | Flexible |
| Complexity | Low | Medium | High | High |
| Infrastructure | None | Embedding model + DB | DB server | Multiple |
| Best for | Projects, teams | Large corpora | Structured data | Production systems |
Start with flat file memory. It’s sufficient for most projects and teams, and the human curation it requires is a feature, not a limitation. Add a vector store when you have a document corpus too large to curate manually, and go hybrid only when you have genuinely distinct query patterns that a single approach can’t serve. Most teams that start with a hybrid architecture would have been better off with a flat file and a month of accumulated knowledge.