# Agentic Context Efficiency: A Benchmark
Four models ran the same 160-turn agentic task. The one that front-loaded all source reads hit 100% cache utilisation; the one that read on demand consumed 10,000x more fresh input tokens.
## The Setup
Four models were given an identical agentic task: build a comprehensive 8-module developer course on context engineering from a detailed specification, using only pre-loaded research files on disk. Same agent framework, same tools (read, write, bash, and a todo-based task loop), same research material. No web search, no live retrieval.
The experiment was designed to measure output quality, not cache efficiency. But the token data revealed something more interesting: a roughly 10,000x difference in fresh input token consumption between the most and least efficient models, driven entirely by how each model chose to use its tools.
## What the Data Shows
| Model | Fresh input tokens | Cache utilisation | Total cost | Words produced |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 191 | 100% | $7.12 | 51,112 |
| Kimi K2.5 | 475,752 | 93% | $1.47 | 41,251 |
| GLM-5 | 1,935,379 | 73% | $3.25 | 38,886 |
| MiniMax M2.5 | 523,489 | 90% | $0.89 | 29,269 |
Sonnet’s 191 fresh input tokens is not a rounding error. Across a 160-turn session, it consumed essentially no fresh input after the first turn.
GLM spent 1.9 million fresh input tokens for comparable output. The fresh input alone cost $1.94 at standard GLM pricing, more than Kimi’s entire session.
## What Sonnet Did Differently
Sonnet’s first todo was: “Read all research files and source material.”
It read 15 files in sequence: three research summaries, ten pattern files, two guide files. Then it wrote a 900-word structured reference document, `_research-notes.md`, before writing a single line of course content.
The notes file:
```markdown
# Research Notes for Context Engineering Course

## Key Stats to Cite

### Context Rot / Performance Degradation
- NoLiMa: 11/12 models drop below 50% performance at 32k tokens
- LOCA-bench: first benchmark for context rot in agentic scenarios
- Context tax paper: 719.64% latency increase for Llama-3.1-70B at 15k words
- DSBC: multi-task accuracy drops from 55.86% (1 task) to 25.46% (3 tasks)

### The 10 Patterns
1. The Pyramid - domain → architecture → specific context → task
2. Select, Don't Dump - smallest high-signal token set
3. Compress & Restart - detect threshold, summarize state, start fresh
...

## Failure Modes
1. Context stuffing - adding more hoping quality improves
2. Tool result accumulation - unprocessed tool output filling the window
...
```
After writing this file, Sonnet never read a source file again. For the next 159 turns, it drew on the conversation history, which contained the notes file and all its previous outputs.
The todo body after closing the research turn:
```
Completed. Read all 3 research files, all 10 pattern files, 2 guide files. Summary notes written to _research-notes.md.
```
Every subsequent todo followed the same discipline: a completion log documenting what was produced, what decisions were made, what word counts were hit. The todos became a structured record of the session, not just a task queue.
## Why This Produces 100% Cache Utilisation
In a multi-turn agent session, each turn’s context is the system prompt, tool definitions, and the full conversation history up to that point. Everything before the current turn is the cached prefix. As long as nothing changes in those earlier turns, the provider serves them from cache.
Reading a large file mid-session breaks this. If turn 30 reads a 50,000-token research document, those tokens weren’t in turn 29’s context. The provider sees a prefix that matches up to the point of the file read, then diverges. Only the matching prefix hits cache. The file content is computed fresh.
Sonnet moved all file reads to turn 1. From turn 2 onwards, every API call started with an identical prefix: system prompt, tool definitions, conversation history containing the notes file and all prior outputs. The provider matched the full prefix every time. 100% cache hit.
GLM read files on turns 8, 12, 19, 23, and 31. Each read broke the prefix for that turn, injecting fresh tokens. With 15 source files scattered across the session, GLM never built a stable enough prefix to achieve high cache utilisation.
## The QA Bracket
Sonnet also added a “Final QA pass” as its last todo. After completing all 8 modules and 8 exercises, it ran verification:
- Checked every file existed
- Verified word counts against the spec
- Spot-checked key statistics against its research notes
- Applied style guidelines from the project’s AGENTS.md file
The QA todo body:
```
Data accuracy: Spot-checked all key stats (NoLiMa 11/12, DSBC 55.86%→25.46%, 719% latency, ACE +10.6%, 14,556 token tool defs). All match research notes.

Style fixes applied: Removed all em dashes and en dashes from all 8 modules, 8 exercises, resources.md, and course-outline.md.
```
No other model read the project’s style guidelines, let alone applied them across all output files.
This is the Anchor Turn applied in reverse: rather than front-loading knowledge, it back-loads verification. Together they bracket the work. Knowledge consolidation at the start, consistency check at the end.
## The GLM Anti-Pattern
GLM’s planning todos were the most detailed of any model. Each included a ## Data to Incorporate section with specific statistics and source file references:
```
Include: 15x token cost from Anthropic paper, ACE +10.6% improvement, NoLiMa 11/12 below 50% at 32k
```
Rather than reading those files once upfront, GLM returned to them throughout the session as needed. The citations were accurate (GLM clearly read the material) but the timing meant each read cost full input price.
GLM spent more on fresh input tokens than Kimi spent on its entire session, while producing 6% fewer words and scoring lower on content quality.
The lesson is not that thorough research hurts; Sonnet read more source files than GLM did. It's that timing determines cost. Batch reads at the start and the provider caches them for the rest of the session. Spread them out and you pay full price every time.
## What This Means for Agent Design
The Anchor Turn is a design decision you can build into any agent workflow, not an emergent behavior you hope for. Before a long-running task, have the agent read all relevant source material, write a structured summary, then proceed. Sonnet did this unprompted; there’s no reason to leave it to chance. Provider usage APIs already return cache read and cache write token counts separately, so you can verify it’s working. Sonnet’s 100% cache read rate came directly from the API response metadata.
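A minimal sketch of building the anchor turn into a workflow might look like the following. Everything here is hypothetical: `ask` stands in for whatever single-call LLM interface your framework exposes, and the flat directory of `.md` source files is an assumed layout:

```python
from pathlib import Path

def anchor_turn(ask, source_dir, notes_path="_research-notes.md"):
    """Front-load every source read into one batch, distil a single
    reference document, and persist it before any real work begins.

    `ask` is an assumed callable wrapping one LLM call; the file layout
    (a flat directory of markdown sources) is a hypothetical example.
    """
    # Read all sources in one pass, concatenated under per-file headers.
    sources = "\n\n".join(
        f"## {path.name}\n{path.read_text()}"
        for path in sorted(Path(source_dir).glob("*.md"))
    )
    # One consolidation call produces the structured reference document.
    notes = ask(
        "Summarise the key facts, statistics, and structure of these "
        "sources into one reference document:\n\n" + sources
    )
    Path(notes_path).write_text(notes)
    return notes  # lives in the conversation history, cached from turn 2 on
```

After this step, the main task loop runs against the notes document in history rather than re-reading raw sources, which is exactly what keeps the prefix stable.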
The efficiency benefit scales with session length. In a 10-turn session, the difference is marginal. In a 100-turn session with 80,000 tokens of context per turn, the gap between 73% and 100% cache utilisation runs to roughly $2 per session at Anthropic pricing. For agents running hundreds of sessions per day, that compounds fast.
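The per-session arithmetic can be sketched as a small calculator. The prices below are assumptions in the style of published Claude list rates ($3 per million input tokens, $0.30 per million cache-read tokens); the exact dollar gap depends entirely on which rates and context sizes you plug in, so substitute your provider's actual pricing:

```python
def session_cost(total_tokens, utilisation,
                 input_price=3.00, cache_read_price=0.30):
    """Input-side cost in USD for one session.

    Prices are per million tokens and are assumptions, not quoted
    benchmark figures; cache writes and output tokens are ignored.
    """
    cached = total_tokens * utilisation       # served at cache-read rate
    fresh = total_tokens - cached             # computed at full input rate
    return (cached * cache_read_price + fresh * input_price) / 1e6

# 100 turns, 80,000 tokens of context resent per turn
total = 100 * 80_000
gap = session_cost(total, 0.73) - session_cost(total, 1.00)
```

Running the numbers this way makes the scaling visible: the gap grows linearly with both session length and per-turn context size, which is why it is negligible at 10 turns and material at 100.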
Quality didn’t suffer for the efficiency. The model with the best cache utilisation also produced the highest quality output by independent evaluation. The anchor turn is not a cost-cutting compromise. A single authoritative reference document produces more consistent output than relying on the model to piece together scattered raw content across a long session.