Research

Papers, benchmarks, and articles that shaped the patterns on this site. Each entry is annotated with why it matters.

Benchmarks & Research

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

An indexed memory mechanism that compresses context without discarding evidence. Rather than lossy summarization, Memex maintains a compact working context of structured summaries with stable indices, while full-fidelity artifacts live in an external store. The agent learns when to dereference an index to recover exact past evidence. Trained via reinforcement learning with reward shaping for memory usage under a context budget. Directly implements several patterns at once: Write Outside the Window for the external store, Compress for the working context, and Progressive Disclosure for the index-then-retrieve loop.
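The index-then-retrieve loop reduces to a small data-flow sketch. This is an illustration only: Memex learns its dereference policy via RL, and the class and method names here are invented.

```python
class IndexedMemory:
    """Compact summary lines stay in the working context; full-fidelity
    artifacts live in an external store, recoverable by stable index."""

    def __init__(self):
        self._store = {}   # full artifacts, outside the context window
        self._next_id = 0

    def write(self, artifact: str, summary: str) -> str:
        idx = f"mem:{self._next_id}"
        self._next_id += 1
        self._store[idx] = artifact
        # only this indexed one-liner goes back into the working context
        return f"[{idx}] {summary}"

    def dereference(self, idx: str) -> str:
        # recover exact past evidence when the summary is not enough
        return self._store[idx]

mem = IndexedMemory()
line = mem.write("full 400-line test log ...",
                 "3 assertion failures in the parser")
# the agent's context holds `line`; the log itself stays external
exact = mem.dereference("mem:0")
```

The design choice worth noting: nothing is discarded. Compression happens in the window, not in the store, so the agent can always trade tokens for fidelity later.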

TATRA: Training-Free Instance-Adaptive Few-Shot Prompting

Constructs instance-specific few-shot prompts by synthesizing on-the-fly examples matched to the current input. Addresses the core limitation of fixed example sets: examples selected for average cases perform poorly on edge cases. Dataset-free approach makes this practical for low-resource scenarios. Directly relevant to the Few-Shot Selection pattern: when no labeled pool exists, synthesize.

Long Context, Less Focus: A Scaling Gap in LLMs

Large-scale benchmark (PAPerBench, ~29,000 instances across 1k to 256k tokens) with theoretical analysis of attention dilution under context scaling. Finds consistent performance degradation in both personalization and privacy as context length increases. The theoretical contribution matters: proves this is an inherent limitation of soft attention in fixed-capacity Transformers, independent of training data. Reinforces Context Rot with a mechanistic explanation for why it happens.

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful?

Counterintuitive finding: across multiple coding agents and LLMs, repository context files (AGENTS.md, .cursorrules) tend to reduce task success rates compared to providing no context, while increasing inference cost by over 20%. Overly detailed context files encourage broader exploration but make tasks harder through unnecessary requirements. The conclusion aligns with Select, Don't Dump: a few targeted requirements outperform a detailed documentation dump. Human-written context files should describe only minimal requirements.

LOCA-bench: Long-Running Agent Context Rot

First benchmark to test context degradation in long-running agentic scenarios specifically. NoLiMa tests single-step retrieval; LOCA-bench tests agents that explore, act, and accumulate context over time. Confirmed that advanced context management strategies substantially improve success rates in these scenarios.

Structured Context Engineering for File-Native Agentic Systems

The largest empirical study on context format and structure to date. Tested 11 models across four formats, with schemas scaling up to 10,000 tables. Found that format matters less than model capability (a 21-percentage-point gap between frontier and open-source models) and that novel compact formats can incur a "grep tax," where the model spends extra tokens trying to parse unfamiliar structures.

AOrchestra: Automated Sub-Agent Creation

Formalizes agents as a tuple of (Instruction, Context, Tools, Model) and automates their creation. Achieved a 16.28% improvement over the strongest baseline. Directly relevant to the Isolate and Recursive Delegation patterns: the orchestrator curates task-relevant context and delegates via on-the-fly agent creation.

Mitigating Conversational Inertia in Multi-Turn Agents

Attention analysis reveals "conversational inertia": models develop strong diagonal attention to previous responses as conversation histories grow, causing them to over-weight the most recent turn at the expense of integrating the full context. Directly explains why long-running conversations degrade and why rolling summaries outperform verbatim history at scale. Proposed mitigation introduces contrastive demonstrations that reduce inertia without retraining.

Context Tax: Latency vs. Accuracy at Scale

Quantified the operational cost of long context. Llama-3.1-70B showed a 719% latency increase at 15k-word context, while accuracy only dropped from 98.5% to 98%. Models don't lose their way; they become operationally expensive. Memory bandwidth is the bottleneck, not compute.

Efficient Context Management

Practical hybrid approach combining observation masking and LLM summarization on SWE-bench. Achieved 7-11% cost reduction. Useful as a real-world engineering case study rather than a benchmark paper.
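Observation masking on its own is simple to illustrate. A minimal sketch, assuming a message history of `{"role", "content"}` dicts and using plain truncation as a stand-in for the paper's LLM summarizer:

```python
def mask_observations(history, keep_last=2,
                      summarize=lambda s: s[:60] + " [masked]"):
    """Keep the newest tool observations verbatim; replace older ones
    with short summaries. The schema is an assumption for illustration."""
    obs = [i for i, m in enumerate(history) if m["role"] == "tool"]
    masked = set(obs[:-keep_last]) if keep_last else set(obs)
    return [
        {**m, "content": summarize(m["content"])} if i in masked else m
        for i, m in enumerate(history)
    ]
```

Recent observations stay exact because the agent is most likely to act on them; older ones keep only enough signal to preserve the narrative of what happened.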

ACE Framework

Demonstrated a +10.6% improvement on agent benchmarks and +8.6% on finance tasks through better context engineering alone. No model changes. The key insight: contexts should function as "comprehensive, evolving playbooks," not concise summaries. Also introduced the concept of "context collapse," where iterative rewriting erodes detail over time.

LLMs Get Lost In Multi-Turn Conversation

Systematic study of how LLMs fail in multi-turn settings. Models make assumptions in early turns and prematurely commit to solutions, then over-rely on those early conclusions for the remainder of the conversation. The failure compounds: a wrong assumption in turn 3 typically persists through turn 20. Reinforces the case for explicit state management and periodic context resets; assumption drift compounds when left unchecked.

NoLiMa: Long-Context Benchmark

The benchmark that put hard numbers on context rot. 11 of 13 models dropped to half their baseline performance at just 32k tokens. Not at the edge of their window. At a fraction of it. Undermines the "bigger window, better results" assumption that most context strategies rely on.

Context Rot

Documents how increasing input tokens impacts LLM performance through systematic "Needle in a Haystack" testing. The data behind the term "context rot" that the patterns on this site address.

Context-Bench

Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.

ContextBench: A Benchmark for Context Retrieval in Coding Agents

Existing coding benchmarks measure whether agents solve the task. ContextBench measures whether they retrieved the right code context along the way. With human-annotated gold contexts and per-step metrics for recall, precision, and efficiency, it exposes a consistent failure mode: agents explore far more context than they actually put to work, and sophisticated scaffolding produces only marginal improvement in retrieval quality. That gap between explored and used context is where selective context strategies pay off.

A Survey of Context Engineering for Large Language Models

Wide-coverage survey treating context engineering as a formal discipline. Breaks the field into retrieval, generation, processing, and management, then maps how those combine into RAG, memory systems, tool-integrated reasoning, and multi-agent architectures. Best used as a map of the space rather than a source for specific empirical claims. Also surfaces a structural asymmetry the individual benchmarks miss: models handle complex input contexts well but struggle to produce equivalently complex outputs.

Coding Agents are Effective Long-Context Processors

Takes a different approach to the long-context problem: instead of larger windows or retrieval pipelines, let a coding agent work through the corpus as a file system. Using grep, terminal commands, Python scripts, and intermediate files, off-the-shelf coding agents beat published state-of-the-art by 17.3% on average across five benchmarks spanning 188K to three trillion tokens. The counterintuitive secondary finding: adding explicit retrieval tools to the agent did not help and sometimes degraded performance. Native tool proficiency and file system familiarity covered the retrieval function without a dedicated layer.

Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Controlled study isolating where memory-augmented agent failures actually happen. Crossing three write strategies against three retrieval methods, retrieval dominates: accuracy swings 20 points across retrieval methods but only 3–8 points across write strategies. Raw chunking with zero LLM calls matches or beats expensive fact extraction and summarization. Retrieval failure accounts for 11–46% of errors depending on config; utilization failure sits stable at 4–8% regardless. The implication for memory system design: fix retrieval before adding write-time complexity.
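The winning write strategy is almost trivially simple to sketch: fixed-size chunks stored verbatim, zero LLM calls at write time. The toy lexical-overlap retriever below is an illustrative stand-in for the paper's actual retrieval methods, not their implementation.

```python
def chunk(text: str, size: int = 50) -> list[str]:
    """Write path: fixed-size word chunks stored as-is, no LLM calls."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy read path: rank chunks by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]
```

The study's point is that nearly all the accuracy variance lives in the `retrieve` half; making `chunk` smarter with extraction or summarization barely moves the needle.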

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Memory-focused survey organized around a write-manage-read loop and a three-dimensional taxonomy: temporal scope, representational substrate, and control policy. Covers five mechanism families from context-resident compression to policy-learned management. The evaluation section is the most useful part: it traces the shift from static recall benchmarks to multi-session agentic tests that interleave memory with decision-making, and identifies gaps current systems haven't closed.

Best Practices

Context Engineering for Coding Agents

Practitioner walkthrough of context configuration surfaces across Claude Code, Cursor, and Windsurf. Covers AGENTS.md, .cursorrules, .windsurfrules, memory hooks, and MCP servers as context engineering levers. Particularly useful for the distinction between static context (config files) and dynamic context (memory, tool output). Published on martinfowler.com, which gives it reach beyond the AI-specialist audience.

Effective Harnesses for Long-Running Agents

Detailed account of why context compaction alone is not enough when agents work across multiple sessions. Anthropic's solution centers on two engineered handoff mechanisms: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress and leaves structured artifacts for the next session. The specific mechanisms are instructive: a progress log for session handoff, a feature list in JSON to prevent premature "done" declarations, git commits as recovery checkpoints, and explicit verification loops before starting new work. The pattern is borrowed directly from how humans hand off engineering work across shifts.
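The handoff artifacts are easy to sketch. A minimal version, with file names and the JSON schema invented for illustration rather than taken from Anthropic's harness:

```python
import json
from pathlib import Path

def end_session(note: str,
                features_path: str = "features.json",
                log_path: str = "PROGRESS.md") -> list[str]:
    """Append a handoff note for the next session, then report which
    features still fail. An empty list is the only valid 'done' signal,
    which is what prevents premature completion claims."""
    with open(log_path, "a") as f:
        f.write(f"- {note}\n")
    features = json.loads(Path(features_path).read_text())
    return [item["name"] for item in features if not item["passing"]]
```

The next session starts by reading `PROGRESS.md` instead of re-deriving state from the codebase, exactly like a shift-change briefing.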

AI Copilot Product Trends, Q1 2026

Product management perspective on why fine-tuning lost to context engineering: high effort, foundation models improved too fast, context engineering extracts enough value without retraining. Also frames the shift from "context window size" to "context relevance" as the real bottleneck.

How Contexts Fail and How to Fix Them

Taxonomy of context failure modes: poisoning (hallucination enters context), distraction (context overwhelms training), confusion (superfluous context influences response), clash (parts disagree). Useful diagnostic framework when things go wrong.

Effective Context Engineering for AI Agents

Introduces the pyramid approach, the principle of finding the smallest set of high-signal tokens, and the concept of "right altitude" for instructions. Most of the patterns on this site trace back to principles articulated here.

Contextual Retrieval

The just-in-time approach to context: maintain lightweight identifiers, load data dynamically at runtime. Progressive disclosure rather than pre-loading. The basis for the Progressive Disclosure pattern.
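The identifier-then-load pattern in miniature (catalog entries and file paths are invented for illustration):

```python
from pathlib import Path

# What the model sees up front: identifiers, not contents.
CATALOG = {
    "billing-api": "docs/billing_api.md",   # illustrative paths
    "auth-flow": "docs/auth_flow.md",
}

def index_prompt() -> str:
    """Lightweight listing that goes into the window at startup."""
    return "Available references: " + ", ".join(sorted(CATALOG))

def load(ref: str) -> str:
    """Dereferenced at runtime, only when the task actually needs it."""
    return Path(CATALOG[ref]).read_text()
```

The window pays for two identifiers until the moment one is needed, instead of paying for two full documents on every turn.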

Context Engineering for Agents

Defines the four-strategy framework: Write (persist outside the window), Select (pull relevant context in), Compress (summarize and trim), Isolate (separate contexts per agent). Clean taxonomy that maps directly to the patterns here.

Context Engineering Best Practices

Production-oriented guidance: treat context as infrastructure, prune aggressively, and store context for reuse rather than rebuilding it each session. Emphasizes memory layers (short-term session plus long-term cross-session) as essential infrastructure rather than optional features.

Case Studies

Building a Real-Time Context Engine for AI Agents

Technical breakdown triggered by IBM's $11B Confluent acquisition. Argues that RAG is document retrieval, but context is operational state. Proposes an event-driven architecture (Kafka + Snowplow) for real-time context that changes minute-by-minute. Useful for understanding where static context patterns break down.

Real-World Applications of Context Engineering

Collection of industry case studies. Five Sigma (insurance): 80% fewer claim processing errors, 25% higher adjuster productivity via RAG and dynamic context assembly. Financial services: 40% reduction in user frustration. Telecom: 67% fewer escalations. Concrete numbers on what good context engineering delivers in production.

How We Built a Multi-Agent Research System

Sub-agents with isolated contexts outperformed a single agent, using 15x more tokens total but producing higher quality output. Structured note-taking to NOTES.md for persistence across agent boundaries. The primary case study behind the Isolate pattern.

Context Engineering Case Studies: Etsy-Specific Q&A

How Etsy reduced hallucinations in company-specific question answering through explicit instructions and relevant contextual information. Practical example of the Pyramid pattern applied to enterprise knowledge retrieval.

Codified Context: Infrastructure for AI Agents in a Complex Codebase

Detailed account of context infrastructure built alongside a 108,000-line C# system over 283 development sessions. The architecture splits into three layers: a hot-memory constitution encoding conventions and retrieval hooks, 19 specialized domain agents, and a cold-memory store of 34 on-demand spec documents. Session-level metrics trace how the infrastructure grew and where it prevented failures. The hot/cold memory split is a direct implementation of Write Outside the Window in a long-running production codebase.

Tools & Frameworks

Model Context Protocol (MCP)

Standardized protocol for context retrieval. "USB-C for AI." Adopted by Block, OpenAI, Microsoft. Enables dynamic, information-rich environments rather than static prompts. The protocol layer that makes Progressive Disclosure practical at scale.

MemGPT / Letta

Memory-first agent framework. Treats the LLM like an operating system kernel with managed memory blocks. The architecture that inspired the Write Outside the Window pattern: persistent context with size limits, labels, and access patterns, managed through system calls.
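The block structure can be sketched minimally. In Letta the model edits blocks through tool calls it issues itself; a plain method stands in here, and the labels and limits are illustrative, not the framework's API.

```python
class MemoryBlock:
    """A labeled, size-limited slice of persistent context."""

    def __init__(self, label: str, limit: int, value: str = ""):
        self.label, self.limit, self.value = label, limit, value

    def replace(self, old: str, new: str) -> None:
        # reject edits that would blow the block's token/char budget
        candidate = self.value.replace(old, new)
        if len(candidate) > self.limit:
            raise ValueError(f"'{self.label}' would exceed {self.limit} chars")
        self.value = candidate

    def render(self) -> str:
        # how the block appears inside the system prompt
        return f"<{self.label}>\n{self.value}\n</{self.label}>"

human = MemoryBlock("human", limit=80, value="Name: unknown.")
human.replace("unknown.", "Ada. Prefers concise answers.")
```

The size limit is the interesting part: it forces the model to curate what it remembers rather than append indefinitely.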

Code Execution with MCP: Building More Efficient Agents

Explains why direct tool calling scales poorly as the number of connected MCP servers grows. Two failure modes: loading all tool definitions upfront bloats the context window, and passing intermediate results back through the model doubles token usage for operations like "fetch this document and attach it elsewhere." The proposed solution is to expose MCP servers as code APIs in an execution environment rather than as direct tools. The agent writes code to interact with them, processes intermediate data outside the model, and only surfaces the final result. Concrete illustration of why Progressive Disclosure and selective context matter even at the tool-use layer.
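The "fetch and attach" case reduces to a few lines once servers are code APIs. `drive` and `salesforce` are hypothetical stand-ins for MCP servers exposed as importable modules, not real bindings:

```python
def copy_transcript(drive, salesforce, doc_id: str, lead_id: str) -> str:
    """Move a document between services without the payload ever
    re-entering the model's context window."""
    doc = drive.get_document(doc_id)     # large payload stays out here
    salesforce.attach(lead_id, doc)      # direct transfer, no model round trip
    return f"attached {len(doc)} chars to lead {lead_id}"  # only this returns
```

With direct tool calling, `doc` would transit the context twice (once out of the fetch tool, once into the attach tool); here the model pays for one short confirmation string.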

Context Lens

Framework-agnostic proxy that intercepts LLM API calls and visualizes context window composition in real time. See what your AI actually sees.

Practitioner Perspectives

GPT-5.4: 1M Token Context Window

OpenAI's latest model supports up to 1M tokens of context, enabling agents to plan, execute, and verify tasks across long horizons. The window size is a milestone, but the context engineering question remains: a million tokens of poorly structured context will still degrade. The ETH Zurich AGENTS.md study and the PAPerBench attention dilution findings apply at any window size. Bigger windows make context engineering more important, not less.

RECOR: Reasoning-Focused Multi-Turn Conversational Retrieval

Benchmark combining multi-turn conversation with reasoning-intensive retrieval, closer to real-world usage than benchmarks that treat the two separately. Highlights a gap in existing evaluation: systems tuned on single-turn retrieval benchmarks perform significantly worse when the conversation history must inform what to retrieve. Useful reference when designing context pipelines for dialogue-heavy applications.

Harness Engineering

Synthesis piece on the OpenAI team's five-month experiment building a codebase maintained entirely by AI agents. Böckeler dissects their approach into three interlocking parts: context engineering (a continuously updated knowledge base plus dynamic runtime context), architectural constraints enforced both by agents and deterministic linters, and periodic cleanup agents that fight entropy. The useful claim is that context quality cannot be separated from code structure and maintenance loops. You cannot engineer context in isolation and expect it to hold.

The Future of Context Engineering

Maps the three core LLM limitations (context window, reasoning, memory) against the five mechanisms the brain evolved to work with limited working memory: selective attention, chunking, associative retrieval, cognitive offloading, and learning consolidation. The central argument: the brain never evolved bigger working memory; it evolved sharper attention and better retrieval. Worth reading for the framework it offers on which limitations will yield to further scaling and which will need architectural innovation to crack.

Industry consensus, 2026

A recurring theme across practitioners: most production AI failures trace back to poor context rather than model limitations. The model is not the product; the orchestration is.

Andrej Karpathy

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." Framed LLMs as "the kernel process of a new Operating System" where context is the managed memory.

Tobi Lutke, Shopify CEO

"Context engineering describes the core skill better" than prompt engineering. "The art of providing all the context for the task to be plausibly solvable by the LLM."

Cognition AI team

"Context engineering is effectively the #1 job of engineers building AI agents." Warns that context hand-off between agents requires deliberate design; spawning sub-tasks and hoping for the best is how things break.