Modarressi et al., Feb 2025
NoLiMa, the benchmark that put hard numbers on context rot. 11 of 13 models dropped to half their baseline performance at just 32k tokens. Not at the edge of their window. At a fraction of it. Undermines the "bigger window, better results" assumption that most context strategies rely on.
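The headline metric is simple to reproduce: score at each context length, divided by the model's short-context baseline. A minimal sketch, with made-up scores standing in for real benchmark results:

```python
# Illustrative only: compute each model's score as a fraction of its
# short-context baseline, the metric behind claims like "dropped to
# half baseline at 32k tokens". Scores are made up, not NoLiMa results.

BASELINE_LENGTH = 1_000  # tokens used for the short-context baseline

# model -> {context length in tokens: accuracy}
scores = {
    "model-a": {1_000: 0.92, 8_000: 0.81, 32_000: 0.41},
    "model-b": {1_000: 0.88, 8_000: 0.85, 32_000: 0.70},
}

for model, by_length in scores.items():
    baseline = by_length[BASELINE_LENGTH]
    for length, acc in sorted(by_length.items()):
        frac = acc / baseline
        flag = "  <- below half of baseline" if frac < 0.5 else ""
        print(f"{model} @ {length:>6} tokens: {acc:.2f} ({frac:.0%} of baseline){flag}")
```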
Stanford/SambaNova, Oct 2025
Demonstrated a +10.6% improvement on agent benchmarks and +8.6% on finance tasks through better context engineering alone. No model changes. The key insight: contexts should function as "comprehensive, evolving playbooks," not concise summaries. Also introduced the concept of "context collapse," where iterative rewriting erodes detail over time.
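The difference between a summary and a playbook is the update rule. A minimal sketch of the two strategies, assuming a generic `llm(prompt) -> str` completion call (a placeholder, not the paper's actual pipeline):

```python
# Two context-update strategies, assuming a generic `llm(prompt) -> str`
# completion call (a hypothetical stand-in, not the paper's pipeline).

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat/completion API here")

def rewrite_context(playbook: str, episode: str) -> str:
    # Monolithic rewrite: the model re-summarizes everything each time.
    # Risk: context collapse, where details present in earlier versions
    # silently vanish from later ones.
    return llm(
        "Rewrite this playbook concisely given the new episode.\n"
        f"Playbook:\n{playbook}\n\nEpisode:\n{episode}"
    )

def append_delta(playbook: str, episode: str) -> str:
    # Incremental update: extract only new, itemized lessons and append
    # them, so existing entries are never re-generated and never lost.
    delta = llm(
        "List only NEW tactics or pitfalls from this episode as bullets, "
        "skipping anything already in the playbook.\n"
        f"Playbook:\n{playbook}\n\nEpisode:\n{episode}"
    )
    return playbook + "\n" + delta
```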
Zeng et al. (HKUST-NLP), Feb 2026
First benchmark to test context degradation in long-running agentic scenarios specifically. Unlike NoLiMa (which tests single-step retrieval), LOCA-bench tests agents that explore, act, and accumulate context over time. Confirmed that advanced context management strategies substantially improve success rates in these scenarios.
McMillan, Feb 2026 · 9,649 experiments
The most comprehensive empirical study of context format and structure to date. Tested 11 models across 4 formats, on schemas of up to 10,000 tables. Found that format matters less than model capability (a 21-percentage-point gap between frontier and open-source models) and that novel compact formats can incur a "grep tax": the model spends extra tokens trying to parse unfamiliar structures.
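The grep tax is easy to probe yourself: serialize the same records in several formats and compare sizes. A rough sketch; the ~4-characters-per-token heuristic is a crude stand-in for a real tokenizer, and these formats are illustrative, not the study's exact conditions:

```python
import csv, io, json

# Toy comparison of serialization overhead across formats.

records = [
    {"id": 1, "name": "alpha", "status": "open"},
    {"id": 2, "name": "beta", "status": "closed"},
]

def as_json(rows):
    return json.dumps(rows, indent=2)

def as_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def as_compact(rows):
    # A "novel compact format": pipe-delimited, header declared once.
    # Fewer tokens on the wire, but the model may pay a "grep tax"
    # re-deriving a structure it has rarely seen in training.
    header = "|".join(rows[0].keys())
    return "\n".join([header] + ["|".join(str(v) for v in r.values()) for r in rows])

for name, render in [("json", as_json), ("csv", as_csv), ("compact", as_compact)]:
    text = render(records)
    print(f"{name:>8}: ~{len(text) // 4} tokens")
```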
Dec 2025
Quantified the operational cost of long context. Llama-3.1-70B showed a 719% latency increase at a 15k-word context, while accuracy only dropped from 98.5% to 98%. Models don't lose their way; they become operationally expensive. The bottleneck is memory bandwidth, not raw compute (FLOPs).
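Reproducing the latency curve takes little more than timing the same question behind growing amounts of padding. A sketch, again assuming a placeholder `llm` call:

```python
import time

# Timing the same question behind growing padding, assuming a generic
# `llm(prompt) -> str` inference call (placeholder: wire up any model).

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an inference call here")

QUESTION = "What is the capital of France?"
FILLER = "lorem ipsum " * 100  # ~200 words of irrelevant padding per block

timings = {}
for n_blocks in (1, 10, 50, 150):  # roughly 0.2k to 30k words of context
    prompt = FILLER * n_blocks + "\n" + QUESTION
    start = time.perf_counter()
    llm(prompt)
    timings[n_blocks] = time.perf_counter() - start

base = timings[1]
for n, t in timings.items():
    print(f"{n:>4} blocks: {t:.2f}s ({t / base - 1:+.0%} vs shortest)")
```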
Chroma Research
Documents how increasing input tokens impacts LLM performance through systematic "Needle in a Haystack" testing. The data behind the term "context rot" that the patterns on this site address.
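The core of a Needle-in-a-Haystack run fits in a few lines: plant one known fact at a controlled depth in filler text, grow the haystack, and check retrieval. Needle, filler, and grading below are illustrative placeholders, not Chroma's setup:

```python
# Minimal Needle-in-a-Haystack harness, assuming a generic
# `llm(prompt) -> str` call. Needle, filler, and the substring check
# are illustrative placeholders.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an inference call here")

NEEDLE = "The magic number for project Falcon is 7421"
QUESTION = "What is the magic number for project Falcon?"
FILLER = ["The weather report repeats itself without new information"] * 50

def build_haystack(n_sentences: int, depth: float) -> str:
    # depth is the needle's relative position: 0.0 = start, 1.0 = end.
    sentences = (FILLER * (n_sentences // len(FILLER) + 1))[:n_sentences]
    sentences.insert(int(depth * len(sentences)), NEEDLE)
    return ". ".join(sentences) + "."

for n in (100, 1_000, 5_000):       # haystack size sweep
    for depth in (0.0, 0.5, 1.0):   # needle position sweep
        answer = llm(build_haystack(n, depth) + f"\n\n{QUESTION}")
        print(n, depth, "7421" in answer)
```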
Letta
Benchmark specifically for agentic context engineering. Evaluates how well models chain file operations, trace entity relationships, and manage multi-step retrieval. See the live leaderboard for current rankings.
Ruan et al., Feb 2026
Formalizes agents as a tuple of (Instruction, Context, Tools, Model) and automates their creation. Achieved a 16.28% improvement over the strongest baseline. Directly relevant to the Isolate and Recursive Delegation patterns: the orchestrator curates task-relevant context and delegates via on-the-fly agent creation.
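The four-field formalization maps naturally onto a small data structure. A sketch of the delegation idea under that framing; field names, the curation heuristic, and the example history are all invented for illustration:

```python
from dataclasses import dataclass

# The (Instruction, Context, Tools, Model) tuple as a data structure, plus
# on-the-fly delegation. Not the paper's code.

@dataclass
class Agent:
    instruction: str   # what this agent should accomplish
    context: str       # only the task-relevant slice, never the full history
    tools: list[str]   # tool names the agent may call
    model: str         # which model backs the agent

def curate(full_context: str, task: str) -> str:
    # Placeholder heuristic: keep only the lines that mention the task.
    return "\n".join(line for line in full_context.splitlines() if task in line)

def delegate(full_context: str, task: str) -> Agent:
    # On-the-fly agent creation: each subtask gets a fresh, isolated
    # context instead of inheriting the orchestrator's entire window.
    return Agent(
        instruction=f"Complete the subtask: {task}",
        context=curate(full_context, task),
        tools=["read_file", "search"],
        model="any-capable-model",
    )

history = "billing: invoice 42 unpaid\nsearch: no results\nbilling: retry scheduled"
print(delegate(history, "billing"))
```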
JetBrains Research, Dec 2025
Practical hybrid approach combining observation masking with LLM summarization on SWE-bench, achieving a 7-11% cost reduction. Useful as a real-world engineering case study rather than a benchmark paper.
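The hybrid is straightforward to sketch: keep recent tool observations verbatim, mask older ones with a cheap placeholder, and pay for an LLM summary only when the history still exceeds a budget. Thresholds, message format, and the `llm` call below are assumptions, not JetBrains' implementation:

```python
# Observation-masking + summarization hybrid for an agent's message history.
# Thresholds, message format, and `llm` are assumptions, not JetBrains' code.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a summarization call here")

KEEP_RECENT = 5        # tool observations kept verbatim
CHAR_BUDGET = 20_000   # rough character proxy for a token budget

def compact(history: list[dict]) -> list[dict]:
    # Step 1: mask all but the most recent tool observations. Cheap and
    # deterministic; no model call needed.
    obs = [i for i, m in enumerate(history) if m["role"] == "tool"]
    for i in obs[:-KEEP_RECENT]:
        history[i] = {"role": "tool", "content": "[observation elided]"}

    # Step 2: only if still over budget, pay for an LLM summary of the
    # older prefix, keeping the most recent messages verbatim.
    if sum(len(m["content"]) for m in history) > CHAR_BUDGET:
        head, tail = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
        summary = llm("Summarize this agent trajectory:\n" + repr(head))
        history = [{"role": "system", "content": summary}] + tail
    return history
```

Masking first keeps the common case deterministic and free; the summarization call only runs when the budget forces it.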