Research

Papers, benchmarks, and articles that shaped the patterns on this site. Each entry is annotated with why it matters.

Benchmarks & Research

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

An indexed memory mechanism that compresses context without discarding evidence. Rather than lossy summarization, Memex maintains a compact working context of structured summaries with stable indices, while full-fidelity artifacts live in an external store. The agent learns when to dereference an index to recover exact past evidence. Trained via reinforcement learning with reward shaping for memory usage under a context budget. Directly implements several patterns at once: Write Outside the Window for the external store, Compress for the working context, and Progressive Disclosure for the index-then-retrieve loop.
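The index-then-retrieve loop reduces to a small data-flow sketch. This is an illustration only: Memex learns its dereference policy via RL, and the class and method names here are invented.

```python
class IndexedMemory:
    """Compact summary lines stay in the working context; full-fidelity
    artifacts live in an external store, recoverable by stable index."""

    def __init__(self):
        self._store = {}   # full artifacts, outside the context window
        self._next_id = 0

    def write(self, artifact: str, summary: str) -> str:
        idx = f"mem:{self._next_id}"
        self._next_id += 1
        self._store[idx] = artifact
        # only this indexed one-liner goes back into the working context
        return f"[{idx}] {summary}"

    def dereference(self, idx: str) -> str:
        # recover exact past evidence when the summary is not enough
        return self._store[idx]

mem = IndexedMemory()
line = mem.write("full 400-line test log ...",
                 "3 assertion failures in the parser")
# the agent's context holds `line`; the log itself stays external
exact = mem.dereference("mem:0")
```

The design choice worth noting: nothing is discarded. Compression happens in the window, not in the store, so the agent can always trade tokens for fidelity later.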

TATRA: Training-Free Instance-Adaptive Few-Shot Prompting

Constructs instance-specific few-shot prompts by synthesizing on-the-fly examples matched to the current input. Addresses the core limitation of fixed example sets: examples selected for average cases perform poorly on edge cases. Dataset-free approach makes this practical for low-resource scenarios. Directly relevant to the Few-Shot Selection pattern: when no labeled pool exists, synthesize.

Long Context, Less Focus: A Scaling Gap in LLMs

Large-scale benchmark (PAPerBench, ~29,000 instances across 1k to 256k tokens) with theoretical analysis of attention dilution under context scaling. Finds consistent performance degradation in both personalization and privacy as context length increases. The theoretical contribution matters: proves this is an inherent limitation of soft attention in fixed-capacity Transformers, independent of training data. Reinforces Context Rot with a mechanistic explanation for why it happens.

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful?

Counterintuitive finding: across multiple coding agents and LLMs, repository context files (AGENTS.md, .cursorrules) tend to reduce task success rates compared to providing no context, while increasing inference cost by over 20%. Overly detailed context files encourage broader exploration but make tasks harder through unnecessary requirements. The conclusion aligns with Select, Don't Dump: a few targeted requirements outperform a detailed documentation dump. Human-written context files should describe only minimal requirements.

LOCA-bench: Long-Running Agent Context Rot

First benchmark to test context degradation in long-running agentic scenarios specifically. NoLiMa tests single-step retrieval; LOCA-bench tests agents that explore, act, and accumulate context over time. Confirmed that advanced context management strategies substantially improve success rates in these scenarios.

Structured Context Engineering for File-Native Agentic Systems

The largest empirical study on context format and structure to date. Tested 11 models across four formats, with schemas scaling up to 10,000 tables. Found that format matters less than model capability (a 21-percentage-point gap between frontier and open-source models) and that novel compact formats can incur a "grep tax," where the model spends extra tokens trying to parse unfamiliar structures.

AOrchestra: Automated Sub-Agent Creation

Formalizes agents as a tuple of (Instruction, Context, Tools, Model) and automates their creation. Achieved a 16.28% improvement over the strongest baseline. Directly relevant to the Isolate and Recursive Delegation patterns: the orchestrator curates task-relevant context and delegates via on-the-fly agent creation.

Mitigating Conversational Inertia in Multi-Turn Agents

Attention analysis reveals "conversational inertia": models develop strong diagonal attention to previous responses as conversation histories grow, causing them to over-weight the most recent turn at the expense of integrating the full context. Directly explains why long-running conversations degrade and why rolling summaries outperform verbatim history at scale. Proposed mitigation introduces contrastive demonstrations that reduce inertia without retraining.

Context Tax: Latency vs. Accuracy at Scale

Quantified the operational cost of long context. Llama-3.1-70B showed a 719% latency increase at 15k-word context, while accuracy only dropped from 98.5% to 98%. Models don't lose their way; they become operationally expensive. Memory bandwidth is the bottleneck, not compute.

Efficient Context Management

Practical hybrid approach combining observation masking and LLM summarization on SWE-bench. Achieved 7-11% cost reduction. Useful as a real-world engineering case study rather than a benchmark paper.
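Observation masking on its own is simple to illustrate. A minimal sketch, assuming a message history of `{"role", "content"}` dicts and using plain truncation as a stand-in for the paper's LLM summarizer:

```python
def mask_observations(history, keep_last=2,
                      summarize=lambda s: s[:60] + " [masked]"):
    """Keep the newest tool observations verbatim; replace older ones
    with short summaries. The schema is an assumption for illustration."""
    obs = [i for i, m in enumerate(history) if m["role"] == "tool"]
    masked = set(obs[:-keep_last]) if keep_last else set(obs)
    return [
        {**m, "content": summarize(m["content"])} if i in masked else m
        for i, m in enumerate(history)
    ]
```

Recent observations stay exact because the agent is most likely to act on them; older ones keep only enough signal to preserve the narrative of what happened.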

ACE Framework

Demonstrated a +10.6% improvement on agent benchmarks and +8.6% on finance tasks through better context engineering alone. No model changes. The key insight: contexts should function as "comprehensive, evolving playbooks," not concise summaries. Also introduced the concept of "context collapse," where iterative rewriting erodes detail over time.

LLMs Get Lost In Multi-Turn Conversation

Systematic study of how LLMs fail in multi-turn settings. Models make assumptions in early turns and prematurely commit to solutions, then over-rely on those early conclusions for the remainder of the conversation. The failure compounds: a wrong assumption in turn 3 typically persists through turn 20. Reinforces the case for explicit state management and periodic context resets; assumption drift compounds when left unchecked.

NoLiMa: Long-Context Benchmark

The benchmark that put hard numbers on context rot. 11 of 13 models dropped to half their baseline performance at just 32k tokens. Not at the edge of their window. At a fraction of it. Undermines the "bigger window, better results" assumption that most context strategies rely on.

Context Rot

Documents how increasing input tokens impacts LLM performance through systematic "Needle in a Haystack" testing. The data behind the term "context rot" that the patterns on this site address.

Context-Bench

Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.

ContextBench: A Benchmark for Context Retrieval in Coding Agents

Existing coding benchmarks measure whether agents solve the task. ContextBench measures whether they retrieved the right code context along the way. With human-annotated gold contexts and per-step metrics for recall, precision, and efficiency, it exposes a consistent failure mode: agents explore far more context than they actually put to work, and sophisticated scaffolding produces only marginal improvement in retrieval quality. That gap between explored and used context is where selective context strategies pay off.

A Survey of Context Engineering for Large Language Models

Wide-coverage survey treating context engineering as a formal discipline. Breaks the field into retrieval, generation, processing, and management, then maps how those combine into RAG, memory systems, tool-integrated reasoning, and multi-agent architectures. Best used as a map of the space rather than a source for specific empirical claims. Also surfaces a structural asymmetry the individual benchmarks miss: models handle complex input contexts well but struggle to produce equivalently complex outputs.

Coding Agents are Effective Long-Context Processors

Takes a different approach to the long-context problem: instead of larger windows or retrieval pipelines, let a coding agent work through the corpus as a file system. Using grep, terminal commands, Python scripts, and intermediate files, off-the-shelf coding agents beat published state-of-the-art by 17.3% on average across five benchmarks spanning 188K to three trillion tokens. The counterintuitive secondary finding: adding explicit retrieval tools to the agent did not help and sometimes degraded performance. Native tool proficiency and file system familiarity covered the retrieval function without a dedicated layer.

Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Controlled study isolating where memory-augmented agent failures actually happen. Crossing three write strategies against three retrieval methods, retrieval dominates: accuracy swings 20 points across retrieval methods but only 3–8 points across write strategies. Raw chunking with zero LLM calls matches or beats expensive fact extraction and summarization. Retrieval failure accounts for 11–46% of errors depending on config; utilization failure sits stable at 4–8% regardless. The implication for memory system design: fix retrieval before adding write-time complexity.
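The winning write strategy is almost trivially simple to sketch: fixed-size chunks stored verbatim, zero LLM calls at write time. The toy lexical-overlap retriever below is an illustrative stand-in for the paper's actual retrieval methods, not their implementation.

```python
def chunk(text: str, size: int = 50) -> list[str]:
    """Write path: fixed-size word chunks stored as-is, no LLM calls."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy read path: rank chunks by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]
```

The study's point is that nearly all the accuracy variance lives in the `retrieve` half; making `chunk` smarter with extraction or summarization barely moves the needle.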

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Memory-focused survey organized around a write-manage-read loop and a three-dimensional taxonomy: temporal scope, representational substrate, and control policy. Covers five mechanism families from context-resident compression to policy-learned management. The evaluation section is the most useful part: it traces the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, and identifies gaps current systems haven't closed.

Best Practices

Context Engineering for Coding Agents

Practitioner walkthrough of context configuration surfaces across Claude Code, Cursor, and Windsurf. Covers AGENTS.md, .cursorrules, .windsurfrules, memory hooks, and MCP servers as context engineering levers. Particularly useful for the distinction between static context (config files) and dynamic context (memory, tool output). Published on martinfowler.com, which gives it reach beyond the AI-specialist audience.

Effective Harnesses for Long-Running Agents

Detailed account of why context compaction alone is not enough when agents work across multiple sessions. Anthropic's solution centers on two engineered handoff mechanisms: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress and leaves structured artifacts for the next session. The specific mechanisms are instructive: a progress log for session handoff, a feature list in JSON to prevent premature "done" declarations, git commits as recovery checkpoints, and explicit verification loops before starting new work. The pattern is borrowed directly from how humans hand off engineering work across shifts.
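The handoff artifacts are easy to sketch. A minimal version, with file names and the JSON schema invented for illustration rather than taken from Anthropic's harness:

```python
import json
from pathlib import Path

def end_session(note: str,
                features_path: str = "features.json",
                log_path: str = "PROGRESS.md") -> list[str]:
    """Append a handoff note for the next session, then report which
    features still fail. An empty list is the only valid 'done' signal,
    which is what prevents premature completion claims."""
    with open(log_path, "a") as f:
        f.write(f"- {note}\n")
    features = json.loads(Path(features_path).read_text())
    return [item["name"] for item in features if not item["passing"]]
```

The next session starts by reading `PROGRESS.md` instead of re-deriving state from the codebase, exactly like a shift-change briefing.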

AI Copilot Product Trends, Q1 2026

Product management perspective on why fine-tuning lost to context engineering: high effort, foundation models improved too fast, context engineering extracts enough value without retraining. Also frames the shift from "context window size" to "context relevance" as the real bottleneck.

How Contexts Fail and How to Fix Them

Taxonomy of context failure modes: poisoning (hallucination enters context), distraction (context overwhelms training), confusion (superfluous context influences response), clash (parts disagree). Useful diagnostic framework when things go wrong.

Effective Context Engineering for AI Agents

Introduces the pyramid approach, the principle of finding the smallest set of high-signal tokens, and the concept of "right altitude" for instructions. Most of the patterns on this site trace back to principles articulated here.

Contextual Retrieval

The just-in-time approach to context: maintain lightweight identifiers, load data dynamically at runtime. Progressive disclosure rather than pre-loading. The basis for the Progressive Disclosure pattern.
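The identifier-then-load pattern in miniature (catalog entries and file paths are invented for illustration):

```python
from pathlib import Path

# What the model sees up front: identifiers, not contents.
CATALOG = {
    "billing-api": "docs/billing_api.md",   # illustrative paths
    "auth-flow": "docs/auth_flow.md",
}

def index_prompt() -> str:
    """Lightweight listing that goes into the window at startup."""
    return "Available references: " + ", ".join(sorted(CATALOG))

def load(ref: str) -> str:
    """Dereferenced at runtime, only when the task actually needs it."""
    return Path(CATALOG[ref]).read_text()
```

The window pays for two identifiers until the moment one is needed, instead of paying for two full documents on every turn.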

Context Engineering for Agents

Defines the four-strategy framework: Write (persist outside the window), Select (pull relevant context in), Compress (summarize and trim), Isolate (separate contexts per agent). Clean taxonomy that maps directly to the patterns here.

Context Engineering Best Practices

Production-oriented guidance: treat context as infrastructure, prune aggressively, and store context for reuse rather than rebuilding it each session. Emphasizes memory layers (short-term session plus long-term cross-session) as essential infrastructure rather than optional features.

Case Studies

Building a Real-Time Context Engine for AI Agents

Technical breakdown triggered by IBM's $11B Confluent acquisition. Argues that RAG is document retrieval, but context is operational state. Proposes an event-driven architecture (Kafka + Snowplow) for real-time context that changes minute-by-minute. Useful for understanding where static context patterns break down.

Real-World Applications of Context Engineering

Collection of industry case studies. Five Sigma (insurance): 80% fewer claim processing errors, 25% higher adjuster productivity via RAG and dynamic context assembly. Financial services: 40% reduction in user frustration. Telecom: 67% fewer escalations. Concrete numbers on what good context engineering delivers in production.

How We Built a Multi-Agent Research System

Sub-agents with isolated contexts outperformed a single agent, using 15x more tokens total but producing higher quality output. Structured note-taking to NOTES.md for persistence across agent boundaries. The primary case study behind the Isolate pattern.

Context Engineering Case Studies: Etsy-Specific Q&A

How Etsy reduced hallucinations in company-specific question answering through explicit instructions and relevant contextual information. Practical example of the Pyramid pattern applied to enterprise knowledge retrieval.

Codified Context: Infrastructure for AI Agents in a Complex Codebase

Detailed account of context infrastructure built alongside a 108,000-line C# system over 283 development sessions. The architecture splits into three layers: a hot-memory constitution encoding conventions and retrieval hooks, 19 specialized domain agents, and a cold-memory store of 34 on-demand spec documents. Session-level metrics trace how the infrastructure grew and where it prevented failures. The hot/cold memory split is a direct implementation of Write Outside the Window in a long-running production codebase.

Tools & Frameworks

Model Context Protocol (MCP)

Standardized protocol for context retrieval. "USB-C for AI." Adopted by Block, OpenAI, Microsoft. Enables dynamic, information-rich environments rather than static prompts. The protocol layer that makes Progressive Disclosure practical at scale.

MemGPT / Letta

Memory-first agent framework. Treats the LLM like an operating system kernel with managed memory blocks. The architecture that inspired the Write Outside the Window pattern: persistent context with size limits, labels, and access patterns, managed through system calls.
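The block structure can be sketched minimally. In Letta the model edits blocks through tool calls it issues itself; a plain method stands in here, and the labels and limits are illustrative, not the framework's API.

```python
class MemoryBlock:
    """A labeled, size-limited slice of persistent context."""

    def __init__(self, label: str, limit: int, value: str = ""):
        self.label, self.limit, self.value = label, limit, value

    def replace(self, old: str, new: str) -> None:
        # reject edits that would blow the block's token/char budget
        candidate = self.value.replace(old, new)
        if len(candidate) > self.limit:
            raise ValueError(f"'{self.label}' would exceed {self.limit} chars")
        self.value = candidate

    def render(self) -> str:
        # how the block appears inside the system prompt
        return f"<{self.label}>\n{self.value}\n</{self.label}>"

human = MemoryBlock("human", limit=80, value="Name: unknown.")
human.replace("unknown.", "Ada. Prefers concise answers.")
```

The size limit is the interesting part: it forces the model to curate what it remembers rather than append indefinitely.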

Code Execution with MCP: Building More Efficient Agents

Explains why direct tool calling scales poorly as the number of connected MCP servers grows. Two failure modes: loading all tool definitions upfront bloats the context window, and passing intermediate results back through the model doubles token usage for operations like "fetch this document and attach it elsewhere." The proposed solution is to expose MCP servers as code APIs in an execution environment rather than as direct tools. The agent writes code to interact with them, processes intermediate data outside the model, and only surfaces the final result. Concrete illustration of why Progressive Disclosure and selective context matter even at the tool-use layer.
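The "fetch and attach" case reduces to a few lines once servers are code APIs. `drive` and `salesforce` are hypothetical stand-ins for MCP servers exposed as importable modules, not real bindings:

```python
def copy_transcript(drive, salesforce, doc_id: str, lead_id: str) -> str:
    """Move a document between services without the payload ever
    re-entering the model's context window."""
    doc = drive.get_document(doc_id)     # large payload stays out here
    salesforce.attach(lead_id, doc)      # direct transfer, no model round trip
    return f"attached {len(doc)} chars to lead {lead_id}"  # only this returns
```

With direct tool calling, `doc` would transit the context twice (once out of the fetch tool, once into the attach tool); here the model pays for one short confirmation string.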

Context Lens

Framework-agnostic proxy that intercepts LLM API calls and visualizes context window composition in real time. See what your AI actually sees.

Practitioner Perspectives

GPT-5.4: 1M Token Context Window

OpenAI's latest model supports up to 1M tokens of context, enabling agents to plan, execute, and verify tasks across long horizons. The window size is a milestone, but the context engineering question remains: a million tokens of poorly structured context will still degrade. The ETH Zurich AGENTS.md study and the PAPerBench attention dilution findings apply at any window size. Bigger windows make context engineering more important, not less.

RECOR: Reasoning-Focused Multi-Turn Conversational Retrieval

Benchmark combining multi-turn conversation with reasoning-intensive retrieval, closer to real-world usage than benchmarks that treat the two separately. Highlights a gap in existing evaluation: systems tuned on single-turn retrieval benchmarks perform significantly worse when the conversation history must inform what to retrieve. Useful reference when designing context pipelines for dialogue-heavy applications.

Harness Engineering

Synthesis piece on the OpenAI team's five-month experiment building a codebase maintained entirely by AI agents. Böckeler dissects their approach into three interlocking parts: context engineering (a continuously updated knowledge base plus dynamic runtime context), architectural constraints enforced both by agents and deterministic linters, and periodic cleanup agents that fight entropy. The useful claim is that context quality cannot be separated from code structure and maintenance loops. You cannot engineer context in isolation and expect it to hold.

The Future of Context Engineering

Maps the three core LLM limitations (context window, reasoning, memory) against the five mechanisms the brain evolved to work with limited working memory: selective attention, chunking, associative retrieval, cognitive offloading, and learning consolidation. The central argument: the brain never evolved bigger working memory; it evolved sharper attention and better retrieval. Worth reading for the framework it offers on which limitations will yield to further scaling and which will need architectural innovation to crack.

Industry consensus, 2026

A recurring theme across practitioners: most production AI failures trace back to poor context rather than model limitations. The model is not the product; the orchestration is.

Andrej Karpathy

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." Framed LLMs as "the kernel process of a new Operating System" where context is the managed memory.

Tobi Lutke, Shopify CEO

"Context engineering describes the core skill better" than prompt engineering. "The art of providing all the context for the task to be plausibly solvable by the LLM."

Cognition AI team

"Context engineering is effectively the #1 job of engineers building AI agents." Warns that context hand-off between agents requires deliberate design; spawning sub-tasks and hoping for the best is how things break.