Context Window Economics
Token costs are not a billing footnote; they are the constraint that forces every other context engineering decision. Understanding the actual cost structure, broken down into fresh input, cached input, cache writes, and output, changes how you design systems.
The Cost Structure Most Teams Ignore
Every API call has four token categories priced differently: fresh input tokens, cached input tokens (cache hits), cache write tokens, and output tokens. Most teams think about total token count, but the cost breakdown within that count determines whether a system is economically viable at scale.
Representative pricing as of early 2026. These numbers are approximate and shift frequently; verify against current provider pricing pages (Anthropic, OpenAI, Google) before making production decisions:
| Provider / Model | Fresh input ($/M) | Cache hit ($/M) | Cache write ($/M) | Output ($/M) |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $0.30 | $3.75 | $15.00 |
| Claude Haiku 3.5 | $0.80 | $0.08 | $1.00 | $4.00 |
| GPT-4o | $2.50 | $1.25 | n/a | $10.00 |
| GPT-4o mini | $0.15 | $0.075 | n/a | $0.60 |
| Gemini 2.0 Flash | $0.10 | $0.025 | n/a | $0.40 |
Cache hit rate is the single most controllable cost lever in this table. Claude cache hits cost 10% of the fresh input price, so a system with a 2,000-token system prompt accessed 10,000 times per day spends either $60/day in fresh tokens or $6/day in cached ones; that’s a difference that compounds fast at production volume.
Output tokens cost three to five times more than input tokens across every major provider, which is the ratio that makes verbosity expensive: a model that produces 500-word responses when 200 words would suffice burns 2.5x the output budget per call.
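The arithmetic behind both claims fits in a few lines. A sketch using the representative Sonnet rates from the table; the ~1.3 tokens-per-word ratio is an assumed average for English prose:

```python
# Representative Claude Sonnet rates from the table, in $/M tokens.
FRESH, CACHED, OUTPUT = 3.00, 0.30, 15.00

def daily_cost(tokens_per_call: int, calls_per_day: int, rate_per_m: float) -> float:
    """Dollars per day for one token category."""
    return tokens_per_call * calls_per_day * rate_per_m / 1e6

# 2,000-token system prompt hit 10,000 times/day: fresh vs cached.
print(daily_cost(2_000, 10_000, FRESH))   # 60.0
print(daily_cost(2_000, 10_000, CACHED))  # 6.0

# Verbosity: 500-word vs 200-word answers at ~1.3 tokens/word (assumed).
ratio = daily_cost(650, 10_000, OUTPUT) / daily_cost(260, 10_000, OUTPUT)
print(ratio)  # 2.5
```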
The Three Cost Archetypes
Context engineering decisions map to three fundamentally different cost structures. Most systems blend all three, but understanding each in isolation makes the tradeoffs clearer.
Archetype 1: Long Context (Everything in One Call)
Send all available information in a single large context with no retrieval, no compression, no splitting.
Cost structure: dominated by input tokens, linear with content size. A 50,000-token context at $3/M costs $0.15 per call; at 10,000 calls/day that’s $1,500/day in input costs alone, before outputs.
When it’s cheaper than the alternative: when the same large context is reused across many requests and cache hit rates run high. At 90%+ cache utilization, a 50,000-token context costs $0.015/call in cached input plus a one-time $0.1875 write per TTL period, and the economics flip decisively in favor of long context at scale.
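How decisively the economics flip is easy to sketch with the Sonnet rates above; `writes_per_day` is an assumption that depends on your cache TTL and traffic gaps:

```python
# Claude Sonnet rates from the table, $/M tokens.
FRESH, CACHED, WRITE = 3.00, 0.30, 3.75

def long_context_daily(ctx_tokens, calls_per_day, hit_rate, writes_per_day):
    """Daily input-side cost of sending one large context on every call."""
    fresh = ctx_tokens * calls_per_day * (1 - hit_rate) * FRESH / 1e6
    cached = ctx_tokens * calls_per_day * hit_rate * CACHED / 1e6
    writes = ctx_tokens * writes_per_day * WRITE / 1e6
    return fresh + cached + writes

# 50,000-token context, 10,000 calls/day.
print(long_context_daily(50_000, 10_000, hit_rate=0.0, writes_per_day=0))   # 1500.0
print(long_context_daily(50_000, 10_000, hit_rate=0.9, writes_per_day=24))  # ~289.5
```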
The Anchor Turn benchmark from this site’s own measurements shows how dramatic the effect gets: across 90 turns of the same agentic task, a model that front-loaded all reads and achieved 100% cache utilization spent 191 fresh input tokens total, versus 1.9M for a model that read files on-demand throughout the session. At Claude Sonnet pricing, that’s $0.0006 versus $5.70 in fresh input costs for the same task, a 9,500x difference driven entirely by cache strategy.
Archetype 2: RAG (Retrieve and Include)
Retrieve relevant documents at query time and include them in context, so the context size scales with query complexity rather than document corpus size.
Cost structure: retrieval infrastructure cost plus smaller, variable input token costs. A RAG call that retrieves 3 relevant chunks (1,500 tokens) costs $0.0045/call in input at Claude Sonnet pricing, which is 33x cheaper per call than the 50,000-token long-context approach.
When it’s cheaper: when the same large document corpus is relevant to some queries but not others. RAG wins on per-call cost when retrieval precision is high and the corpus is used sparsely; when most calls need most of the corpus, the retrieval infrastructure overhead and per-call reconstruction cost exceed the caching benefit.
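The per-call comparison in numbers; the retrieval infrastructure figure is an assumed per-query cost (embedding plus vector search, amortized), since that number is deployment-specific:

```python
FRESH, CACHED = 3.00, 0.30           # Sonnet input rates, $/M tokens
retrieval_infra_per_call = 0.001     # assumed; varies widely by deployment

rag_call = 1_500 * FRESH / 1e6 + retrieval_infra_per_call  # 3 retrieved chunks
long_ctx_fresh = 50_000 * FRESH / 1e6                      # whole corpus, uncached
long_ctx_cached = 50_000 * CACHED / 1e6                    # whole corpus, cache hit
print(f"{rag_call:.4f} {long_ctx_fresh:.4f} {long_ctx_cached:.4f}")
```

Note that a cache hit on the full corpus ($0.0150/call) already lands within a few multiples of the RAG call (~$0.0055/call): once caching works, the gap is far smaller than the headline 33x.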
Quality tradeoff: RAG introduces retrieval failures, and missed relevant documents are context quality problems with no equivalent in the long-context approach. The research (arXiv 2407.16833) finds RAG outperforms long-context at selective retrieval tasks but falls behind when full-document understanding is needed.
Archetype 3: Multi-Agent (Isolated Sub-Contexts)
Split a large task across multiple agents, each with a smaller context; the Isolate pattern in practice.
Cost structure: multiple calls with smaller contexts, plus coordination overhead. Total token count is higher than a single long-context call for the same task, but the cost can be lower due to routing cheaper models for sub-tasks and achieving better cache hit rates on stable sub-agent contexts.
The Anchor Turn benchmark shows that sub-agents with isolated contexts produced higher quality output than a single long-context agent while using 15x more tokens total. Whether that’s economically favorable depends on which models handle the sub-tasks and what the quality differential is worth.
The Output Token Problem
Input optimization gets all the attention, but output cost is often the actual bottleneck. Output tokens cost 3-5x more than fresh input tokens, so a system running 50,000 calls per day at 500 output tokens each spends $375/day on output alone at Claude Sonnet pricing, which is more than the input cost at similar volume. For content-generation, summarization, and agentic tasks where output length is variable, output budget control is the primary cost lever most teams never touch.
What controls output cost:
- Max tokens setting. The hard ceiling; set it at the practical maximum needed for your use case. Leaving it at the model’s limit costs nothing when responses stay short, but it removes the guardrail against runaway generations that do burn budget.
- Output format. Structured outputs (JSON, tables) typically pack the same semantic content into fewer tokens than prose, so prose narration wrapped around the data adds tokens that often aren’t needed.
- Verbosity instructions. “Respond concisely” in a system prompt has measurable effect, but “respond in three sentences or fewer” has more; specific length constraints outperform vague quality instructions.
- Model routing. Smaller models have lower absolute output costs. Routing simple queries to a cheaper model preserves quality where it matters and cuts cost where it doesn’t, which is the optimization that tends to have the highest ROI in practice.
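A minimal sketch of that last bullet; the length-and-shape heuristic and the model labels are illustrative assumptions standing in for a real classifier and real model IDs:

```python
def route(query: str) -> str:
    """Toy router: short, single-line questions go to the cheap model."""
    simple = len(query) < 200 and "?" in query and "\n" not in query
    return "cheap-model" if simple else "expensive-model"

print(route("What are your support hours?"))             # cheap-model
print(route("Compare these error logs:\nlog A\nlog B"))  # expensive-model
```

In production this decision is usually made by a trained classifier or a cheap LLM call, as in RouteLLM, rather than by string heuristics.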
The Cost Modeling Framework
For any production system, the monthly cost calculation has five components (token counts multiplied by $/M prices):
Monthly cost =
(fresh_input_tokens × fresh_input_price)
+ (cached_input_tokens × cache_hit_price)
+ (cache_write_tokens × cache_write_price)
+ (output_tokens × output_price)
+ (retrieval_infra_cost if RAG)
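The same five components as a function, for plugging in your own counts; token arguments are raw monthly totals, prices in $/M:

```python
def monthly_cost(fresh_in, cached_in, cache_writes, out_tokens,
                 fresh_price, cache_hit_price, cache_write_price, output_price,
                 retrieval_infra=0.0):
    """Monthly spend in dollars; token args are monthly totals, prices $/M."""
    return (fresh_in * fresh_price
            + cached_in * cache_hit_price
            + cache_writes * cache_write_price
            + out_tokens * output_price) / 1e6 + retrieval_infra

# e.g. 450M fresh input and 30M output tokens at Sonnet rates, no caching:
print(monthly_cost(450e6, 0, 0, 30e6, 3.00, 0.30, 3.75, 15.00))  # 1800.0
```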
The variables you actually control:
- Cache hit rate: a function of how stable your input prefix is and how many requests share it, most improvable via Context Caching and system prompt discipline.
- Average context size: a function of how aggressively you apply Select, Don’t Dump and Compress & Restart.
- Output length: a function of format constraints and model instructions.
- Model selection: the largest single variable. GPT-4o mini input costs $0.15/M versus Claude Sonnet’s $3.00/M, so routing a majority of calls to a cheaper model while reserving the expensive model for complex tasks yields significant savings. RouteLLM (LMSYS, 2024) demonstrated over 2x cost reduction with minimal quality loss, and production reports of 70-85% savings from model routing are common.
A worked example:
A support bot handling 100,000 queries/day. System prompt: 1,500 tokens (stable). Context per query: 3,000 tokens (customer profile + recent history). Average output: 300 tokens.
Without caching (naive):
- Input: 4,500 tokens × 100,000 queries = 450M tokens/day at $3/M = $1,350/day
- Output: 300 tokens × 100,000 = 30M tokens/day at $15/M = $450/day
- Total: ~$1,800/day
With caching (system prompt cached):
- Fresh input: 3,000 tokens × 100,000 = 300M tokens/day at $3/M = $900/day
- Cache hits: 1,500 tokens × 100,000 = 150M tokens/day at $0.30/M = $45/day
- Cache writes: ~20 writes/day × 1,500 tokens at $3.75/M = $0.11/day
- Output: 300 tokens × 100,000 = 30M tokens/day at $15/M = $450/day
- Total: ~$1,395/day, versus ~$1,800 uncached (the $1,350 naive input bill plus the same $450 output). Caching saves $405/day, but the $450 output line is untouched.
With Haiku for simple queries (50% routing):
- 50,000 Sonnet calls: $450 fresh input + $22.50 cache hits + $225 output ≈ $698
- 50,000 Haiku calls: $120 fresh input + $6 cache hits + $60 output = $186
- Total: ~$884/day, a further ~37% reduction from model routing alone
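The whole example can be checked end to end from the table’s rates (the ~$0.11/day of cache writes is omitted as negligible):

```python
S_FRESH, S_HIT, S_OUT = 3.00, 0.30, 15.00   # Claude Sonnet, $/M tokens
H_FRESH, H_HIT, H_OUT = 0.80, 0.08, 4.00    # Claude Haiku, $/M tokens
Q = 100_000                                  # queries/day

def daily(calls, fresh, hit, out):
    """3,000 fresh + 1,500 cached input and 300 output tokens per call."""
    return (3_000 * calls * fresh + 1_500 * calls * hit + 300 * calls * out) / 1e6

naive = (4_500 * Q * S_FRESH + 300 * Q * S_OUT) / 1e6   # nothing cached
cached = daily(Q, S_FRESH, S_HIT, S_OUT)                # system prompt cached
routed = daily(Q // 2, S_FRESH, S_HIT, S_OUT) + daily(Q // 2, H_FRESH, H_HIT, H_OUT)
print(naive, cached, routed)  # 1800.0 1395.0 883.5
```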
That breakdown surprises people. Cache optimization cuts the input bill by 30%, yet the single largest line item, output at $450/day, is untouched by it; only model routing, which lowers both input and output rates, reaches it. The framework is the point; the specific values require your actual token counts.
What This Means for Architecture Decisions
Cost pressure produces the same decision in most systems: compress more aggressively, cache more aggressively, route cheaper models for lower-stakes tasks. The context engineering patterns exist independently of cost, but cost pressure is often what motivates actually implementing them.
The dangerous failure mode is optimizing for cost without understanding the quality tradeoffs. Compression loses information; RAG misses relevant documents; routing to cheaper models produces worse outputs on complex tasks. Each optimization has a quality ceiling that only evaluation data can reveal, and teams that optimize cost without measuring quality are just making things worse in a way that takes longer to notice.
In practice: model costs before building, establish quality baselines, measure the quality impact of each optimization, and stop optimizing when the marginal cost reduction no longer justifies the marginal quality loss. This requires both a cost model and an eval suite, and teams that have neither are flying blind on both dimensions.