Context Rot Across Models

Data-driven comparison of how different models handle long context. NoLiMa and RULER benchmarks reveal which models maintain quality and which degrade fastest across GPT-4o, Claude, Gemini, Llama, and Mistral.

The Variability Problem

Not all models rot equally. The Context Rot pattern establishes that degradation is universal, but the rate and shape vary dramatically across model families. Some models maintain usable quality to 64k tokens. Others collapse by 16k. Picking the wrong model for a long-context application can mean your system degrades at half the context length you designed for.

A note on freshness: Model rankings shift with every major release. The benchmarks below reflect testing through early 2025. Use them as a framework for comparison, but verify against current benchmarks before making production decisions.

The Benchmarks

NoLiMa tests models on needle-in-haystack retrieval at various context lengths. The task is straightforward: find specific information placed at different positions within long documents. Simple, but it reveals how attention degrades with distance.
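To make the methodology concrete, here is a minimal needle-in-a-haystack probe in the same spirit. It assumes an OpenAI-compatible Python client; the filler text, needle, question, and pass/fail scoring are illustrative placeholders, not the NoLiMa corpus or its scoring.

```python
# Minimal needle-in-a-haystack probe (NoLiMa-style in spirit only).
# Assumes an OpenAI-compatible client; filler, needle, and scoring are placeholders.
from openai import OpenAI

client = OpenAI()

FILLER = "The committee reviewed the quarterly report and adjourned without comment. "
NEEDLE = "The access code for the archive room is 7241."
QUESTION = "What is the access code for the archive room?"


def build_haystack(total_chars: int, needle_depth: float) -> str:
    """Pad with filler text and place the needle at a relative depth in [0, 1]."""
    body = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(len(body) * needle_depth)
    return body[:cut] + "\n" + NEEDLE + "\n" + body[cut:]


def probe(model: str, total_chars: int, needle_depth: float) -> bool:
    """Return True if the model retrieves the needle from the padded context."""
    context = build_haystack(total_chars, needle_depth)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": context + "\n\n" + QUESTION}],
    )
    return "7241" in (resp.choices[0].message.content or "")


if __name__ == "__main__":
    # Sweep context size (rough proxy: ~4 characters per token) and needle position.
    for chars in (32_000, 128_000, 256_000):
        hits = sum(probe("gpt-4o", chars, depth) for depth in (0.1, 0.5, 0.9))
        print(f"~{chars // 4} tokens: {hits}/3 positions retrieved")
```

Sweeping the needle position as well as the context length is what surfaces the positional effects discussed below, not just the overall length ceiling.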

RULER goes further, evaluating retrieval and reasoning across long contexts. It measures effective context size rather than advertised size, which is the number that actually matters for your application.
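One way to turn a set of measurements like the probe above into a single number is to take the largest tested length at which accuracy stays above a cutoff. The sketch below uses an 80% cutoff to match the comparison table later in this piece; it is a simplification for illustration, not RULER's exact scoring.

```python
def effective_context_size(accuracy_by_length: dict[int, float],
                           threshold: float = 0.8) -> int:
    """Largest tested context length before accuracy first drops below the threshold.

    accuracy_by_length maps tested lengths (in tokens) to measured accuracy,
    e.g. the per-length hit rates collected by a probe like the one above.
    """
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Illustrative numbers only -- substitute your own measurements.
print(effective_context_size({8_000: 0.98, 32_000: 0.93, 64_000: 0.85, 128_000: 0.61}))
# -> 64000
```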

Lost in the Middle (Liu et al.) documents the U-shaped attention pattern: models attend well to information near the beginning and end of context, but poorly to information in the middle. Almost every model shows this pattern to some degree.

Why Models Differ

The degradation profiles below have architectural explanations. Sliding window attention (used in some Mistral models) explicitly limits attention range, which produces consistent performance within the window but a hard ceiling beyond it. Sparse attention patterns reduce computation but create blind spots; where those blind spots fall depends on the specific implementation. RoPE scaling and other positional encoding extensions let models extrapolate beyond their training context lengths, but quality degrades as the extrapolation stretches further.

These choices explain why two models with the same advertised window can have completely different effective ranges.
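As a concrete illustration of the sliding-window case, the sketch below builds the attention mask such a layer uses: each token can attend only to the previous `window` positions, which is where the hard ceiling comes from. (Information can still propagate across layers, so this is a single-layer simplification.)

```python
import numpy as np


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal attention mask where each token sees only the previous `window` tokens.

    mask[i, j] is True when position i may attend to position j. Within one layer,
    anything older than `window` tokens is invisible: consistent quality inside the
    window, a hard ceiling for direct retrieval beyond it.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)


# A token at position 10 with a window of 4 can attend only to positions 7-10.
print(np.where(sliding_window_mask(12, 4)[10])[0])  # -> [ 7  8  9 10]
```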

Model Comparisons

GPT-4o

128k advertised window. Maintains strong performance to approximately 64k tokens for retrieval tasks, then degrades gradually. At full window, performance drops but remains usable for simple retrieval. For multi-hop reasoning, treat 64k as the effective ceiling.

Claude 3.5 Sonnet

200k advertised window. Stays above 80% accuracy on needle retrieval to approximately 128k tokens, the best range of any commercial model at the time of testing. Shows the U-shaped attention pattern, but the middle drop is less severe than GPT-4o's.

Gemini 1.5 Pro

Up to 2 million tokens on paper, but independent benchmark data at that scale barely exists. Most testing stops at 128k or 256k. Within the tested range, competitive with Claude 3.5 Sonnet. Don’t assume quality at the extreme end of the advertised window until someone actually tests it independently.

Llama 3.1 (70B and 405B)

128k advertised window, but effective range varies dramatically by model size. The 405B holds up through roughly 64k tokens. The 70B shows a notable quality drop by 32k. The 8B degrades even faster. If you need long-context performance from open-weight models, the largest variant isn’t optional; it’s required.

Comparative Summary

| Model | Advertised Window | Effective Range (80% accuracy) | Degradation Profile |
|---|---|---|---|
| Claude 3.5 Sonnet | 200k | ~128k | Gradual, shallow middle drop |
| GPT-4o | 128k | ~64k | Moderate, standard U-shape |
| Gemini 1.5 Pro | 2M | ~128k (tested) | Flat to tested limits |
| Llama 3.1 405B | 128k | ~64k | Size-dependent degradation |
| Llama 3.1 70B | 128k | ~32k | Steep degradation |
| Mistral Large | 128k | ~64k† | Moderate degradation |

† Independent benchmark data for Mistral is thinner than for OpenAI or Anthropic. Treat this figure as directional.

Practical Implications

Effective range is the only number that matters. Advertised window sizes are marketing. A 128k window where quality degrades at 64k is a 64k model for any task that requires reliability.
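One way to put the table to work is to encode the effective ranges as planning budgets and check prompts against them with headroom. The numbers below are the directional figures from the table, not guarantees, and the model identifiers are illustrative.

```python
# Effective ranges (in tokens) from the comparison table; directional, not guarantees.
EFFECTIVE_RANGE = {
    "claude-3-5-sonnet": 128_000,
    "gpt-4o": 64_000,
    "gemini-1.5-pro": 128_000,   # tested limit, not the 2M advertised window
    "llama-3.1-405b": 64_000,
    "llama-3.1-70b": 32_000,
    "mistral-large": 64_000,
}


def fits_effective_range(model: str, prompt_tokens: int, margin: float = 0.8) -> bool:
    """Check a prompt against the model's effective range, with headroom.

    The margin keeps routine traffic below the point where benchmarks show
    degradation starting, rather than designing right up to the edge.
    """
    return prompt_tokens <= EFFECTIVE_RANGE[model] * margin


assert fits_effective_range("gpt-4o", 40_000)
assert not fits_effective_range("llama-3.1-70b", 40_000)
```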

Open-weight models need scale for long context. The jump from 70B to 405B Llama isn’t just “somewhat better.” It’s the difference between a 32k effective range and a 64k one. If you’re choosing open-weight models for long-context work, the compute cost of the larger model pays for itself in quality.

Benchmarks are a starting point, not a guarantee. Your specific retrieval patterns, reasoning chains, and context structure will produce a different degradation curve than NoLiMa’s needle-in-haystack tasks. Always validate with your own data at your target context length.

Recommendations

For critical long-context tasks, Claude 3.5 Sonnet is the current benchmark leader. For GPT-4o, treat 64k as the practical ceiling for high-stakes reasoning and verify beyond that on your specific task.

For open-weight models, test at your actual target context length before committing. Llama’s degradation curves vary enough by model size that a benchmark at 32k tells you little about behavior at 64k.

Regardless of model choice, Select, Don’t Dump still applies. Even well-performing models benefit from targeted context, and the gains compound with longer windows. Monitor quality at your operating length and set a threshold for triggering Compress & Restart before you hit the effective ceiling, not after.
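A minimal sketch of that monitoring rule follows, assuming a hypothetical `summarize` callback that stands in for whatever Compress & Restart step your system uses.

```python
# Sketch of the rule above: compress before crossing the effective ceiling,
# not after quality has already dropped. `summarize` is a placeholder, not a real API.
from typing import Callable


def maybe_compress(history_tokens: int,
                   effective_ceiling: int,
                   summarize: Callable[[], int],
                   trigger: float = 0.7) -> int:
    """Trigger compression once the context crosses a fraction of the effective ceiling.

    Returns the (possibly reduced) token count so the caller can keep tracking it.
    """
    if history_tokens >= effective_ceiling * trigger:
        return summarize()  # e.g. replace old turns with a summary of the session so far
    return history_tokens


# Example with GPT-4o's ~64k effective range: compress once history passes ~45k tokens.
tokens_after = maybe_compress(48_000, 64_000, summarize=lambda: 6_000)
print(tokens_after)  # -> 6000
```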