Context Engineering: A Practitioner's Guide

Context engineering is the discipline of deciding what information goes into an LLM's context window, how it's structured, and when to change it. This guide covers the core techniques, the patterns that keep recurring, and the mistakes that keep breaking production systems.

What Context Engineering Actually Is

Every time you call an LLM, you send it a context window: a structured bundle of text that includes system instructions, conversation history, tool outputs, retrieved documents, and whatever else your application assembled for that call. The model has no access to your codebase, your database, or your documentation unless you put it there, which means what you include determines what the model can do, and what you leave out determines where it fails.

Context engineering is the discipline of assembling that window well: deciding what goes in, what stays out, how it’s ordered, and when pieces get swapped, compressed, or evicted as a conversation or agent workflow progresses.

Andrej Karpathy’s definition captures it well: “Context engineering is the delicate art and science of filling the context window with just the right information for the next step.” The phrase “for the next step” does real work there; it implies a multi-step process where context isn’t static but evolves, accumulates, and eventually needs active management or it degrades.

The parallels to OS memory management are closer than they look. You have a finite pool of working memory; you decide what stays resident and what gets evicted; you compress when things fill up; you optimize for locality so the most relevant material is closest to where it’s needed. Context engineering is memory management for LLMs, and the engineers who already think in those terms tend to pick it up fast.
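The analogy can be made concrete with a toy sketch: a fixed token budget, a pinned "resident set," and eviction of the oldest unpinned entries when the pool overflows. Everything here (class name, the word-count tokenizer, the pinning scheme) is hypothetical, not a real library API.

```python
class ContextPool:
    """Toy context window with a token budget and oldest-first eviction."""

    def __init__(self, budget_tokens, pinned=()):
        self.budget = budget_tokens
        self.pinned = set(pinned)   # keys that must stay resident
        self.entries = []           # (key, text), oldest first

    def _tokens(self, text):
        # Crude stand-in for a real tokenizer: one token per word.
        return len(text.split())

    def used(self):
        return sum(self._tokens(text) for _, text in self.entries)

    def add(self, key, text):
        self.entries.append((key, text))
        # Evict the oldest unpinned entries once over budget, the way
        # an OS evicts cold pages when the working set overflows.
        i = 0
        while self.used() > self.budget and i < len(self.entries):
            if self.entries[i][0] in self.pinned:
                i += 1
            else:
                self.entries.pop(i)


pool = ContextPool(budget_tokens=12, pinned={"sys"})
pool.add("sys", "a b c d e")     # 5 tokens, pinned
pool.add("old", "w x y z")       # 9 tokens total
pool.add("new", "p q r s t u")   # 15 total -> "old" is evicted
```

The pinned system prompt survives eviction while the stale entry does not, which is exactly the resident-versus-evictable distinction the analogy points at.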

Why It Emerged

Early LLM applications were single-turn: you wrote a prompt, the model responded, and that was the end of it. The prompt was the context, and phrasing was the whole game because there was nothing else to optimize.

Production systems look fundamentally different. A coding agent accumulates dozens of tool calls, file reads, and error messages over a session. A RAG pipeline retrieves documents, re-ranks them, and assembles them alongside instructions and conversation history. A multi-agent system passes context between parent and child agents, each with different information needs and different tolerance for noise.

After a few turns of a multi-turn agent conversation, the system prompt might account for 5% of the tokens in the window, while the other 95% is accumulated conversation history, tool outputs, and retrieved context that the application assembled programmatically. Nobody is hand-crafting that 95%; it’s built by code, which means it’s an engineering problem with engineering solutions.
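What "built by code" means in practice is roughly a function like the following, which is a hedged sketch of the shape most agent frameworks converge on; the section labels and field names are illustrative, not any particular framework's API.

```python
def assemble_context(system_prompt, history, tool_outputs, retrieved_docs):
    """Compose one context window from its programmatically gathered parts."""
    parts = [f"[system]\n{system_prompt}"]
    for doc in retrieved_docs:
        parts.append(f"[retrieved: {doc['source']}]\n{doc['text']}")
    for call in tool_outputs:
        parts.append(f"[tool: {call['name']}]\n{call['output']}")
    for turn in history:
        parts.append(f"[{turn['role']}]\n{turn['content']}")
    return "\n\n".join(parts)


window = assemble_context(
    system_prompt="Be terse.",
    history=[{"role": "user", "content": "Why does the build fail?"}],
    tool_outputs=[{"name": "grep", "output": "3 matches in ci.yml"}],
    retrieved_docs=[{"source": "docs/ci.md", "text": "CI runs on push."}],
)
```

The hand-written system prompt is one small string; everything after it is assembled by loops over data the application collected, which is where the other 95% of tokens comes from.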

The Hard Part

Context windows have finite capacity, and quality degrades well before you hit the limit. The NoLiMa benchmark found that 11 of 13 leading models dropped to half their baseline performance at just 32k tokens, not at the edge of their advertised window but at a fraction of it.

Filling the window with everything you have backfires, because irrelevant information actively degrades the model’s attention on the information that matters. Including a file “just in case” isn’t free; it competes for attention with the tokens the model actually needs to see, and in long enough contexts that competition has measurable costs.

Both over-inclusion and under-inclusion cause failures, and the boundary between them shifts with every task. That tension is why context engineering is an engineering discipline and not a checklist you can apply once.

The Core Techniques

Context engineering breaks down into a handful of recurring problems, each with patterns that practitioners keep rediscovering independently across different domains and frameworks.

The most impactful decision is what not to put in the window. Select, Don’t Dump covers the principle: for a bug fix, including the broken function, the failing test, and the error log (90 lines total) consistently outperforms dumping the entire file, test suite, and README (2,200 lines). The model can actually focus on what matters instead of wading through noise.

Order matters too, since models attend to the beginning and end of context more than the middle (the “lost in the middle” effect). The Pyramid exploits this by putting the most important information first, followed by progressively more specific detail, giving the model the right framing before asking it to act.
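Pyramid ordering reduces to a sort before rendering. In this sketch the importance scores are hand-assigned for illustration; a real system might derive them from retrieval scores or task metadata.

```python
def pyramid(sections):
    """sections: list of (importance, text); render most important first."""
    ordered = sorted(sections, key=lambda s: s[0], reverse=True)
    return "\n\n".join(text for _, text in ordered)


window = pyramid([
    (0.4, "Background: the service uses JWT sessions."),
    (0.9, "Task: fix the token-refresh race in session.py."),
    (0.7, "Failing test output: test_refresh_concurrent FAILED."),
])
# The task statement lands first, background detail last.
```

The model reads the task in the high-attention opening position and the supporting detail after, rather than burying the task in the middle.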

In multi-turn sessions, context accumulates in ways that silently degrade quality. Old tool outputs, superseded instructions, and resolved errors all stay in the window competing for attention unless you actively manage them, and Compress & Restart addresses this by summarizing what matters, dropping what doesn’t, and starting fresh before quality degrades past a useful threshold. Related to this: the context window is working memory, not long-term storage, and Write Outside the Window moves important state to external storage like files, databases, or scratchpads, loading it back selectively when the model actually needs it.
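The two patterns compose naturally: when history nears the budget, persist the full transcript outside the window, then restart with a compact summary plus the latest turn. In this sketch `summarize` is a placeholder for a real LLM summarization call, and the word-count token estimate is deliberately crude.

```python
import json

def summarize(turns):
    # Placeholder: a real system would call the model here.
    return f"Summary of {len(turns)} earlier turns."

def maybe_compress(history, budget_tokens, archive_path):
    """Compress & Restart once estimated usage exceeds the budget."""
    used = sum(len(turn["content"].split()) for turn in history)
    if used <= budget_tokens:
        return history
    # Write the full history outside the window (file, DB, scratchpad)...
    with open(archive_path, "w") as f:
        json.dump(history, f)
    # ...and restart with a summary plus the most recent turn.
    return [{"role": "system", "content": summarize(history[:-1])},
            history[-1]]
```

The archive stays loadable if a later step genuinely needs the detail, but the live window carries only what the next step requires.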

Multi-agent systems introduce a different problem entirely. Sharing one massive context window across agents performs worse than giving each agent a focused context scoped to its specific task, and Isolate covers why. Anthropic’s multi-agent research system uses 15x more total tokens but gets better results, because each agent sees only what it needs instead of drowning in context meant for other agents.
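Isolation can be sketched as a parent agent slicing shared context per subtask, so each child sees only the entries tagged as relevant to it. The tagging scheme and entry structure here are illustrative, not taken from any specific framework.

```python
def scope_context(shared, subtask):
    """Select only the context entries tagged as relevant to this subtask."""
    return [item["text"] for item in shared if subtask in item["tags"]]


shared_context = [
    {"text": "API schema for /users", "tags": {"backend"}},
    {"text": "CSS token palette",     "tags": {"frontend"}},
    {"text": "Team style guide",      "tags": {"backend", "frontend"}},
]

backend_ctx = scope_context(shared_context, "backend")
frontend_ctx = scope_context(shared_context, "frontend")
# Each child agent sees two entries, not all three.
```

Total tokens across agents go up because shared items are duplicated into each scope, but each individual window stays focused, which is the trade Isolate makes deliberately.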

Finally, putting information in the window doesn’t guarantee the model uses it. Without explicit anchoring instructions, the model often ignores retrieved context and falls back to whatever it absorbed during training, which is why Grounding exists and why it’s the most commonly skipped step in RAG pipelines.
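A grounding instruction is just a few lines of prompt template; the exact wording below is illustrative, but some version of "answer only from the provided context, say so if it's not there" is the step this paragraph says gets skipped.

```python
GROUNDED_TEMPLATE = """Answer using ONLY the context below.
If the context does not contain the answer, say "I don't know."
Do not use prior knowledge.

Context:
{context}

Question: {question}"""

def build_prompt(context_chunks, question):
    """Join retrieved chunks and wrap them in an explicit grounding frame."""
    return GROUNDED_TEMPLATE.format(
        context="\n---\n".join(context_chunks),
        question=question,
    )
```

Without the anchoring lines, the same retrieved chunks are present in the window but the model is free to answer from training data instead.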

How It Differs from Prompt Engineering

Prompt engineering focuses on phrasing: how you word the instruction, what few-shot examples you include, and how you format the expected output. These things still matter, but they operate on a small slice of the window that shrinks with every turn of conversation.

Context engineering operates on the rest: what documents get retrieved, how conversation history gets managed, what information flows between agents, and when to compress and restart. Prompt engineering is writing a good question; context engineering is building the information environment the question lives in, and for multi-turn systems that environment is 95% of what determines output quality.

The skill sets don’t overlap as much as people assume. The person who writes great single-turn prompts and the person who designs context management for a 50-step agent workflow are solving fundamentally different problems. When something breaks in a context-engineered system, the failure mode is usually “wrong information in the window” or “right information in the wrong position,” not “the phrasing was off.”

What This Looks Like in Practice

The difference becomes concrete when you see the same task handled both ways.

Prompt engineering approach:

Write a function to validate user input that checks email format
and password strength.

Context engineering approach:

Current codebase patterns:
- Validation uses pydantic schemas in models.py
- Password validation requires 8+ chars, one number, one special
- Email validation uses the email-validator library
- Error handling returns HTTP 422 with detail dict

Task: Write a validate_user_input function for auth.py
that follows these patterns.

The first produces generic code that might work in isolation, while the second produces code that fits the codebase because the model has enough context to match existing conventions. The difference isn’t the phrasing of the task; it’s the information assembled around it, and this is a simple single-turn example. In a 50-step agent workflow where context accumulates, degrades, and needs active management, the information assembly problem becomes the dominant factor in output quality.

Common Mistakes

Treating the context window as unlimited. Most models advertise 128k or 200k tokens, but quality degrades long before you approach those limits. Set your budget at 60-70% of the model’s effective window and trigger compression before you hit it, because by the time you notice degradation it’s already been affecting outputs for several turns.
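The budget check itself is trivial, which is part of the point: the discipline is in wiring it into every turn. The window size and the 0.65 factor below are illustrative numbers following the 60-70% rule above.

```python
EFFECTIVE_WINDOW = 128_000              # advertised size; quality degrades sooner
BUDGET = int(EFFECTIVE_WINDOW * 0.65)   # trigger compression well before the limit

def should_compress(current_tokens, budget=BUDGET):
    """Check before each turn, not after degradation becomes visible."""
    return current_tokens >= budget
```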

Optimizing the prompt when context is the problem. If your agent degrades after 10 turns, the system prompt almost certainly isn’t the issue; the accumulated context is, and rewriting the prompt while ignoring the 95% of tokens you’re not managing is solving the wrong problem entirely.

No context lifecycle management. Context that was relevant three turns ago might be noise now, and without active eviction of stale information every turn makes the next turn worse because the model has more irrelevant tokens competing for attention with the relevant ones.
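Lifecycle management can be as simple as tagging entries with a status and dropping everything non-active before each call. The status vocabulary here is hypothetical; what matters is that eviction runs every turn rather than never.

```python
def evict_stale(entries):
    """Keep only entries still relevant to the next turn."""
    return [e for e in entries if e["status"] == "active"]


window = [
    {"text": "error: missing import", "status": "resolved"},
    {"text": "old plan draft",        "status": "superseded"},
    {"text": "current task spec",     "status": "active"},
]
window = evict_stale(window)
# Only the active entry survives into the next turn.
```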

Dumping retrieved documents verbatim. RAG pipelines that stuff raw retrieved chunks into the context without re-ranking, truncating, or ordering them are wasting most of the tokens they spent retrieving, and the model ends up attending to mediocre chunks at the expense of the good ones. See the RAG pipelines guide for the full treatment.
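The post-retrieval step that gets skipped looks roughly like this: re-rank by score, keep the top k, truncate each chunk, and emit best-first. The scores here stand in for a real re-ranker's output, and the chunks are invented for illustration.

```python
def prepare_chunks(chunks, k=2, max_words=12):
    """Re-rank, keep top k, truncate, and order best-first."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]
    return [" ".join(c["text"].split()[:max_words]) for c in ranked]


chunks = [
    {"text": "Refund policy: 30 days with receipt.",              "score": 0.91},
    {"text": "Company history going back to 1987 ...",            "score": 0.35},
    {"text": "Refund exceptions: digital goods are final sale.",  "score": 0.82},
]
context = prepare_chunks(chunks)
# Two refund chunks survive, best first; the history chunk is dropped.
```

Every token that reaches the window has earned its place, instead of the model splitting attention across whatever the retriever happened to return.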

Where to Go from Here

Select and Pyramid solve the majority of context quality problems, so start there and work through the pattern catalog as your system gets more complex.

The domain guides go deeper: RAG pipelines for retrieval systems, coding agents for Claude Code and Cursor configuration, and code generation for getting models to match your codebase style. Context Rot has the benchmark data on why all of this matters.