Benchmarks & Research
Wang et al., March 2026
An indexed memory mechanism that compresses context without discarding evidence. Rather than lossy summarization, Memex maintains a compact working context of structured summaries with stable indices, while full-fidelity artifacts live in an external store. The agent learns when to dereference an index to recover exact past evidence. Trained via reinforcement learning with reward shaping for memory usage under a context budget. Directly implements several patterns at once: Write Outside the Window for the external store, Compress for the working context, and Progressive Disclosure for the index-then-retrieve loop.
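The index-then-retrieve loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's API: the external store is a plain dict, and all class and method names are invented for the example.

```python
import itertools

class IndexedMemory:
    """Sketch of an indexed memory: a compact working context of
    summaries with stable indices, while full-fidelity artifacts
    live in an external store (here, a dict)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._store = {}      # external store: index -> full artifact
        self._summaries = {}  # working context: index -> short summary

    def write(self, artifact: str, summary: str) -> int:
        idx = next(self._ids)
        self._store[idx] = artifact
        self._summaries[idx] = summary
        return idx

    def working_context(self) -> str:
        # What stays in the model's window: summaries plus stable indices.
        return "\n".join(f"[{i}] {s}" for i, s in sorted(self._summaries.items()))

    def dereference(self, idx: int) -> str:
        # Recover exact past evidence on demand.
        return self._store[idx]

mem = IndexedMemory()
ref = mem.write("full stack trace: ValueError at line 212 ...", "test_parse failed")
```

What the paper trains with RL — when the agent should call `dereference` under a token budget — is the hard part; the data structure itself is simple.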
arXiv, March 2026
Constructs instance-specific few-shot prompts by synthesizing on-the-fly examples matched to the current input. Addresses the core limitation of fixed example sets: examples selected for average cases perform poorly on edge cases. Dataset-free approach makes this practical for low-resource scenarios. Directly relevant to the Few-Shot Selection pattern: when no labeled pool exists, synthesize.
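The synthesize-then-prompt idea can be sketched as follows. The `generate` callable stands in for any text-in/text-out model call; its signature and the prompt wording are assumptions for illustration, not the paper's method.

```python
def build_instance_prompt(task: str, user_input: str, generate, k: int = 3) -> str:
    """Sketch: synthesize k examples matched to the current input,
    then assemble an instance-specific few-shot prompt.
    `generate` is any text->text model call (hypothetical)."""
    examples = []
    for _ in range(k):
        examples.append(generate(
            f"Task: {task}\n"
            f"Write one worked input->output example similar in structure to:\n"
            f"{user_input}\n"
            f"Format: INPUT: ... OUTPUT: ..."
        ))
    shots = "\n\n".join(examples)
    return f"Task: {task}\n\n{shots}\n\nINPUT: {user_input}\nOUTPUT:"

# Usage with a stubbed generator in place of a real model:
fake = lambda p: "INPUT: 12 + 30 OUTPUT: 42"
prompt = build_instance_prompt("Add two numbers", "7 + 5", fake, k=2)
```

The point of the pattern is that the examples are conditioned on `user_input`, so edge-case inputs get edge-case-shaped demonstrations rather than average-case ones.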
Gu, Feb 2026 · 377k evaluation questions
Large-scale benchmark (PAPerBench, ~29,000 instances spanning context lengths from 1k to 256k tokens) with theoretical analysis of attention dilution under context scaling. Finds consistent performance degradation in both personalization and privacy as context length increases. The theoretical contribution matters: it proves this is an inherent limitation of soft attention in fixed-capacity Transformers, independent of training data. Reinforces Context Rot with a mechanistic explanation for why it happens.
Gloaguen et al. (ETH Zurich), Feb 2026
Counterintuitive finding: across multiple coding agents and LLMs, repository context files (AGENTS.md, .cursorrules) tend to reduce task success rates compared to providing no context, while increasing inference cost by over 20%. Overly detailed context files encourage broader exploration and impose unnecessary requirements that make tasks harder. The conclusion aligns with Select, Don't Dump: a few targeted requirements outperform a detailed documentation dump. Human-written context files should describe only minimal requirements.
Zeng et al. (HKUST-NLP), Feb 2026
First benchmark to test context degradation in long-running agentic scenarios specifically. NoLiMa tests single-step retrieval; LOCA-bench tests agents that explore, act, and accumulate context over time. Confirmed that advanced context management strategies substantially improve success rates in these scenarios.
McMillan, Feb 2026 · 9,649 experiments
The largest empirical study of context format and structure to date. Tested 11 models across 4 formats, with schemas scaling up to 10,000 tables. Found that format matters less than model capability (a 21-percentage-point gap between frontier and open-source models) and that novel compact formats can incur a "grep tax," where the model spends extra tokens trying to parse unfamiliar structures.
Ruan et al., Feb 2026
Formalizes agents as a tuple of (Instruction, Context, Tools, Model) and automates their creation. Achieved a 16.28% improvement over the strongest baseline. Directly relevant to the Isolate and Recursive Delegation patterns: the orchestrator curates task-relevant context and delegates via on-the-fly agent creation.
Wan et al., Feb 2026
Attention analysis reveals "conversational inertia": models develop strong diagonal attention to previous responses as conversation histories grow, causing them to over-weight the most recent turn at the expense of integrating the full context. Directly explains why long-running conversations degrade and why rolling summaries outperform verbatim history at scale. Proposed mitigation introduces contrastive demonstrations that reduce inertia without retraining.
Dec 2025
Quantified the operational cost of long context. Llama-3.1-70B showed a 719% latency increase at 15k-word context, while accuracy only dropped from 98.5% to 98%. Models don't lose their way; they become operationally expensive. Memory bandwidth is the bottleneck, not compute.
JetBrains Research, Dec 2025
Practical hybrid approach combining observation masking and LLM summarization on SWE-bench. Achieved 7-11% cost reduction. Useful as a real-world engineering case study rather than a benchmark paper.
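The hybrid the study describes — mask old tool observations, optionally summarizing them first — reduces to a short filter over the message history. A sketch under assumed message shapes (role/content dicts); `summarize` stands in for any LLM call and is not JetBrains' implementation:

```python
def compact_history(messages, keep_last=3, summarize=None):
    """Hybrid compaction sketch: tool observations older than the last
    `keep_last` messages are masked outright, or replaced with a summary
    when a `summarize` text->text callable is supplied."""
    cutoff = len(messages) - keep_last
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < cutoff:
            body = summarize(msg["content"]) if summarize else "[observation elided]"
            out.append({"role": "tool", "content": body})
        else:
            out.append(msg)  # recent messages and non-observations pass through
    return out
```

Masking alone is free; summarization spends tokens to preserve signal. The study's 7-11% cost reduction comes from mixing the two.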
Stanford/SambaNova, Oct 2025
Demonstrated a +10.6% improvement on agent benchmarks and +8.6% on finance tasks through better context engineering alone. No model changes. The key insight: contexts should function as "comprehensive, evolving playbooks," not concise summaries. Also introduced the concept of "context collapse," where iterative rewriting erodes detail over time.
arXiv, May 2025
Systematic study of how LLMs fail in multi-turn settings. Models make assumptions in early turns and prematurely commit to solutions, then over-rely on those early conclusions for the remainder of the conversation. The failure compounds: a wrong assumption in turn 3 typically persists through turn 20. Reinforces the case for explicit state management and periodic context resets; assumption drift compounds when left unchecked.
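The explicit state management the entry recommends can be as simple as a ledger of assumptions tagged with the turn they were made in, so resets have something concrete to re-check. A minimal sketch; the structure and names are illustrative, not from the paper:

```python
class AssumptionLedger:
    """Sketch of explicit state management for multi-turn sessions:
    record each assumption with the turn it was made in, and surface
    unverified ones before they compound into later turns."""

    def __init__(self):
        self.entries = []

    def assume(self, turn: int, text: str):
        self.entries.append({"turn": turn, "text": text, "verified": False})

    def verify(self, text: str):
        for e in self.entries:
            if e["text"] == text:
                e["verified"] = True

    def unverified(self):
        # Candidates to re-check at each periodic context reset.
        return [e for e in self.entries if not e["verified"]]
```

A periodic reset then re-prompts the model with `unverified()` instead of trusting whatever turn-3 conclusion is still riding along in the history.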
Modarressi et al., Feb 2025
The benchmark that put hard numbers on context rot. 11 of 13 models dropped to half their baseline performance at just 32k tokens. Not at the edge of their window. At a fraction of it. Undermines the "bigger window, better results" assumption that most context strategies rely on.
Chroma Research
Documents how increasing input tokens impacts LLM performance through systematic "Needle in a Haystack" testing. The data behind the term "context rot" that the patterns on this site address.
Letta
Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.
Li et al., Feb 2026 · 1,136 tasks · 66 repos · 8 languages
Existing coding benchmarks measure whether agents solve the task. ContextBench measures whether they retrieved the right code context along the way. With human-annotated gold contexts and per-step metrics for recall, precision, and efficiency, it exposes a consistent failure mode: agents explore far more context than they actually put to work, and sophisticated scaffolding produces only marginal improvement in retrieval quality. That gap between explored and used context is where selective context strategies pay off.
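The explored/used/gold distinction gives three set-based metrics. A sketch of how they might be computed per task — the formulas are standard recall/precision over file sets, with an efficiency ratio for the explored-vs-used gap; the exact definitions in ContextBench may differ:

```python
def context_metrics(explored, used, gold):
    """Per-task retrieval metrics in the spirit of ContextBench:
    recall of the gold context, precision of what the agent actually
    used, and the explored-vs-used efficiency gap."""
    explored, used, gold = set(explored), set(used), set(gold)
    recall = len(used & gold) / len(gold) if gold else 1.0
    precision = len(used & gold) / len(used) if used else 0.0
    efficiency = len(used) / len(explored) if explored else 0.0
    return {"recall": recall, "precision": precision, "efficiency": efficiency}

m = context_metrics(
    explored={"a.py", "b.py", "c.py", "d.py"},
    used={"a.py", "b.py"},
    gold={"a.py", "b.py", "e.py"},
)
```

The failure mode the benchmark exposes shows up as low `efficiency`: the agent opened four files but put only two to work.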
Mei et al., Jul 2025 · 1,400+ papers
Wide-coverage survey treating context engineering as a formal discipline. Breaks the field into retrieval, generation, processing, and management, then maps how those combine into RAG, memory systems, tool-integrated reasoning, and multi-agent architectures. Best used as a map of the space rather than a source for specific empirical claims. Also surfaces a structural asymmetry the individual benchmarks miss: models handle complex input contexts well but struggle to produce equivalently complex outputs.
Cao et al., Mar 2026 · 5 benchmarks · 188K–3T tokens
Takes a different approach to the long-context problem: instead of larger windows or retrieval pipelines, let a coding agent work through the corpus as a file system. Using grep, terminal commands, Python scripts, and intermediate files, off-the-shelf coding agents beat published state-of-the-art by 17.3% on average across five benchmarks spanning 188K to 3T tokens. The counterintuitive secondary finding: adding explicit retrieval tools to the agent did not help and sometimes degraded performance. Native tool proficiency and file system familiarity covered the retrieval function without a dedicated layer.
Yuan et al., Mar 2026 · ICLR MemAgents Workshop · 3×3 study
Controlled study isolating where memory-augmented agent failures actually happen. Crossing three write strategies against three retrieval methods, retrieval dominates: accuracy swings 20 points across retrieval methods but only 3–8 points across write strategies. Raw chunking with zero LLM calls matches or beats expensive fact extraction and summarization. Retrieval failure accounts for 11–46% of errors depending on config; utilization failure sits stable at 4–8% regardless. The implication for memory system design: fix retrieval before adding write-time complexity.
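The "raw chunking with zero LLM calls" write strategy is exactly as simple as it sounds. A sketch, paired with a toy lexical retriever to show where the study says the real leverage is (a production system would swap in BM25 or embeddings for scoring):

```python
def chunk(text: str, size: int = 200, overlap: int = 50):
    """Raw fixed-size chunking with overlap -- the zero-LLM-call write
    strategy the study found competitive with fact extraction."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(chunks, query, k=2):
    """Toy retrieval by word overlap. The study's point: swapping this
    function matters far more than making chunk() smarter."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:k]
```

Given the 20-point swing across retrieval methods versus 3-8 points across write strategies, `retrieve` is where the engineering budget should go.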
Mar 2026 · 2022–2026 coverage
Memory-focused survey organized around a write-manage-read loop and a three-dimensional taxonomy: temporal scope, representational substrate, and control policy. Covers five mechanism families from context-resident compression to policy-learned management. The evaluation section is the most useful part: it traces the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, and identifies gaps current systems haven't closed.
Best Practices
Birgitta Böckeler (Thoughtworks), Feb 2026
Practitioner walkthrough of context configuration surfaces across Claude Code, Cursor, and Windsurf. Covers AGENTS.md, .cursorrules, .windsurfrules, memory hooks, and MCP servers as context engineering levers. Particularly useful for the distinction between static context (config files) and dynamic context (memory, tool output). Published on martinfowler.com, which gives it reach beyond the AI-specialist audience.
Anthropic, Nov 2025
Detailed account of why context compaction alone is not enough when agents work across multiple sessions. Anthropic's solution centers on two engineered handoff mechanisms: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress and leaves structured artifacts for the next session. The specific mechanisms are instructive: a progress log for session handoff, a feature list in JSON to prevent premature "done" declarations, git commits as recovery checkpoints, and explicit verification loops before starting new work. The pattern is borrowed directly from how humans hand off engineering work across shifts.
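The two artifacts at the core of the handoff — an append-only progress log and a JSON feature list gating "done" — are easy to sketch. File names and schema here are illustrative, not Anthropic's:

```python
import json
from pathlib import Path

def write_handoff(workdir, progress_note, features):
    """Sketch of structured session-handoff artifacts: append a progress
    note for the next session, and persist a feature list whose `done`
    flags guard against premature completion claims."""
    root = Path(workdir)
    with open(root / "PROGRESS.md", "a") as f:
        f.write(progress_note.rstrip() + "\n")
    (root / "features.json").write_text(json.dumps(features, indent=2))

def remaining_work(workdir):
    """What the next session must verify before declaring the task done."""
    features = json.loads((Path(workdir) / "features.json").read_text())
    return [f["name"] for f in features if not f.get("done")]
```

The git-commit checkpoints and verification loops from the account layer on top of this: a session starts by reading `remaining_work`, ends by calling `write_handoff`.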
Harshal Patil (FlexAI/n8n), Feb 2026
Product management perspective on why fine-tuning lost to context engineering: high effort, foundation models improved too fast, context engineering extracts enough value without retraining. Also frames the shift from "context window size" to "context relevance" as the real bottleneck.
Drew Breunig
Taxonomy of context failure modes: poisoning (hallucination enters context), distraction (context overwhelms training), confusion (superfluous context influences response), clash (parts disagree). Useful diagnostic framework when things go wrong.
Anthropic
Introduces the pyramid approach, the principle of finding the smallest set of high-signal tokens, and the concept of "right altitude" for instructions. Most of the patterns on this site trace back to principles articulated here.
Anthropic
The just-in-time approach to context: maintain lightweight identifiers, load data dynamically at runtime. Progressive disclosure rather than pre-loading. The basis for the Progressive Disclosure pattern.
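The identifier-first, load-on-demand shape can be sketched directly. Names are illustrative; the `loader` stands in for whatever resolves an identifier to full content (a file read, an API call):

```python
class JustInTimeContext:
    """Progressive disclosure sketch: keep lightweight identifiers in
    the window, load full content only when the agent asks for it."""

    def __init__(self, loader):
        self.loader = loader  # identifier -> full content
        self.catalog = {}     # identifier -> one-line description

    def register(self, ident: str, description: str):
        self.catalog[ident] = description

    def listing(self) -> str:
        # What the model sees up front: names and descriptions only.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.catalog.items()))

    def open(self, ident: str) -> str:
        # Deferred, on-demand load of the full artifact.
        return self.loader(ident)

docs = {"README.md": "Full readme text...", "api.md": "Endpoint reference..."}
ctx = JustInTimeContext(loader=docs.__getitem__)
ctx.register("README.md", "project overview")
ctx.register("api.md", "HTTP endpoints")
```

The context cost up front is the catalog, not the corpus; the agent pays for full artifacts only when it opens them.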
LangChain
Defines the four-strategy framework: Write (persist outside the window), Select (pull relevant context in), Compress (summarize and trim), Isolate (separate contexts per agent). Clean taxonomy that maps directly to the patterns here.
Redis
Production-oriented guidance: treat context as infrastructure, prune aggressively, and store context for deliberate reuse. Emphasizes memory layers (short-term session plus long-term cross-session) as essential infrastructure rather than optional features.
Practitioner Perspectives
OpenAI, March 2026
OpenAI's latest model supports up to 1M tokens of context, enabling agents to plan, execute, and verify tasks across long horizons. The window size is a milestone, but the context engineering question remains: a million tokens of poorly structured context will still degrade. The ETH Zurich AGENTS.md study and the PAPerBench attention dilution findings apply at any window size. Bigger windows make context engineering more important, not less.
arXiv, Jan 2026
Benchmark combining multi-turn conversation with reasoning-intensive retrieval, closer to real-world usage than benchmarks that treat the two separately. Highlights a gap in existing evaluation: systems tuned on single-turn retrieval benchmarks perform significantly worse when the conversation history must inform what to retrieve. Useful reference when designing context pipelines for dialogue-heavy applications.
Birgitta Böckeler (Thoughtworks), Feb 2026
Synthesis piece on the OpenAI team's five-month experiment building a codebase maintained entirely by AI agents. Böckeler dissects their approach into three interlocking parts: context engineering (a continuously updated knowledge base plus dynamic runtime context), architectural constraints enforced both by agents and deterministic linters, and periodic cleanup agents that fight entropy. The useful claim is that context quality cannot be separated from code structure and maintenance loops. You cannot engineer context in isolation and expect it to hold.
Mart van der Jagt, Mar 2026
Maps the three core LLM limitations (context window, reasoning, memory) against the five mechanisms the brain evolved to work with limited working memory: selective attention, chunking, associative retrieval, cognitive offloading, and learning consolidation. The central argument: the brain never evolved bigger working memory; it evolved sharper attention and better retrieval. Worth reading for the framework it offers on which limitations will yield to further scaling and which will need architectural innovation to crack.
Industry consensus, 2026
A recurring theme across practitioners: most production AI failures trace back to poor context rather than model limitations. The model is not the product; the orchestration is.
Andrej Karpathy
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." Framed LLMs as "the kernel process of a new Operating System" where context is the managed memory.
Tobi Lutke, Shopify CEO
"Context engineering describes the core skill better" than prompt engineering. "The art of providing all the context for the task to be plausibly solvable by the LLM."
Cognition AI team
"Context engineering is effectively the #1 job of engineers building AI agents." Warns that context hand-off between agents requires deliberate design; spawning sub-tasks and hoping for the best is how things break.