Research Archive

Secondary sources, overlapping papers, and useful references that are worth keeping but not prominent enough for the main research shelf.

Context Degradation

Context Tax: Latency vs. Accuracy at Scale

Quantified the operational cost of long context. Llama-3.1-70B showed a 719% latency increase at 15k-word context, while accuracy only dropped from 98.5% to 98%. Models don’t lose their way; they become operationally expensive. Memory bandwidth is the bottleneck, not compute.
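
To put the two figures above on the same footing, the arithmetic can be sketched as follows. The percentages come from the summary; the one-second baseline is a hypothetical value chosen only for illustration.

```python
def latency_multiplier(percent_increase: float) -> float:
    """Convert a percent increase into a multiplier on the baseline latency."""
    return 1.0 + percent_increase / 100.0


# Figures quoted in the summary above.
multiplier = latency_multiplier(719.0)   # a 719% increase is 8.19x the baseline
accuracy_drop_points = 98.5 - 98.0       # 0.5 percentage points

# Hypothetical baseline, used only to make the multiplier concrete.
baseline_seconds = 1.0
long_context_seconds = baseline_seconds * multiplier
```

Under that assumed baseline, a request that took one second would take roughly eight seconds at the long context, while accuracy moves by half a percentage point.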

LLMs Get Lost In Multi-Turn Conversation

Systematic study of how LLMs fail in multi-turn settings. Models make assumptions in early turns and prematurely commit to solutions, then over-rely on those early conclusions for the remainder of the conversation. The failure compounds: a wrong assumption in turn 3 typically persists through turn 20. Reinforces the case for explicit state management and periodic context resets.
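
One way to picture explicit state management with periodic resets is the minimal sketch below. It is not taken from the paper: the `ConversationState` class, its field names, and the `keep_last` parameter are all hypothetical. The idea is to track assumptions outside the transcript so they can be confirmed or retracted, then rebuild the prompt from that state instead of carrying every early turn forward.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    """Hypothetical explicit state kept alongside the raw transcript."""
    goal: str
    confirmed_facts: list[str] = field(default_factory=list)
    open_assumptions: list[str] = field(default_factory=list)
    turns: list[str] = field(default_factory=list)

    def assume(self, assumption: str) -> None:
        """Record a guess explicitly instead of leaving it buried in a turn."""
        self.open_assumptions.append(assumption)

    def confirm(self, assumption: str) -> None:
        """Promote an assumption to a fact once it has been verified."""
        self.open_assumptions.remove(assumption)
        self.confirmed_facts.append(assumption)

    def retract(self, assumption: str) -> None:
        """Drop an assumption that turned out to be wrong."""
        self.open_assumptions.remove(assumption)

    def reset_context(self, keep_last: int = 2) -> str:
        """Rebuild the prompt from state, keeping only the most recent turns."""
        self.turns = self.turns[-keep_last:] if keep_last > 0 else []
        lines = [f"Goal: {self.goal}"]
        lines += [f"Confirmed: {fact}" for fact in self.confirmed_facts]
        lines += [f"Unverified assumption: {a}" for a in self.open_assumptions]
        lines += self.turns
        return "\n".join(lines)
```

Because unverified assumptions are labeled as such in the rebuilt prompt, a wrong early guess stays visible and can be retracted rather than silently persisting.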

GPT-5.4: 1M Token Context Window

OpenAI’s latest model supports up to 1M tokens of context, enabling agents to plan, execute, and verify tasks across long horizons. The window size is a milestone, but the context engineering question remains: a million tokens of poorly structured context will still degrade performance. The ETH Zurich AGENTS.md study and the PAPerBench attention dilution findings apply at any window size. Bigger windows make context engineering more important, not less.

Agent Memory & Retrieval

TATRA: Training-Free Instance-Adaptive Few-Shot Prompting

Constructs instance-specific few-shot prompts by synthesizing on-the-fly examples matched to the current input. Addresses the core limitation of fixed example sets: examples selected for average cases perform poorly on edge cases. Dataset-free approach makes this practical for low-resource scenarios. Directly relevant to the Few-Shot Selection pattern: when no labeled pool exists, synthesize.

Coding Agents & Harnesses

AOrchestra: Automated Sub-Agent Creation

Formalizes agents as a tuple of (Instruction, Context, Tools, Model) and automates their creation. Achieved a 16.28% improvement over the strongest baseline. Directly relevant to the Isolate and Recursive Delegation patterns: the orchestrator curates task-relevant context and delegates via on-the-fly agent creation.
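
The four-part tuple can be sketched as a small data structure. Only the (Instruction, Context, Tools, Model) shape comes from the summary above. The `AgentSpec` and `create_sub_agent` names, the keyword-overlap curation rule, and the model string are hypothetical stand-ins, not the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """An agent described as (Instruction, Context, Tools, Model)."""
    instruction: str
    context: tuple[str, ...]
    tools: tuple[str, ...]
    model: str


def _keywords(text: str) -> set[str]:
    """Lowercased words longer than three characters (a crude relevance signal)."""
    return {w.lower().strip(".,") for w in text.split() if len(w) > 3}


def create_sub_agent(
    task: str,
    notes: list[str],
    tool_descriptions: dict[str, str],
    model: str = "hypothetical-small-model",
) -> AgentSpec:
    """Build a sub-agent on the fly with only task-relevant context and tools.

    Relevance here is plain keyword overlap, a placeholder for whatever
    curation the orchestrator actually performs.
    """
    task_words = _keywords(task)
    context = tuple(n for n in notes if task_words & _keywords(n))
    tools = tuple(
        name for name, desc in tool_descriptions.items()
        if task_words & _keywords(desc)
    )
    return AgentSpec(instruction=task, context=context, tools=tools, model=model)
```

The sub-agent receives only the notes and tools that overlap with its task, which is the isolation property the pattern relies on.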

Efficient Context Management

Practical hybrid approach combining observation masking and LLM summarization on SWE-bench. Achieved 7-11% cost reduction. Useful as a real-world engineering case study rather than a benchmark paper.
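
A hybrid of the two techniques might look like the sketch below. This is an assumption about how they could be combined, not the approach's actual implementation: the `Step` type, the `keep_recent` and `max_steps` thresholds, and the `summarize` callable (which would wrap an LLM call in practice) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

MASK = "[observation omitted]"


@dataclass
class Step:
    action: str
    observation: str


def build_context(
    steps: list[Step],
    summarize: Callable[[list[Step]], str],
    keep_recent: int = 2,
    max_steps: int = 6,
) -> list[str]:
    """Summarize the oldest steps, mask older observations, keep recent ones."""
    # Summarization: steps beyond the max_steps budget collapse into one line.
    overflow_count = max(0, len(steps) - max_steps)
    overflow, visible = steps[:overflow_count], steps[overflow_count:]

    lines: list[str] = []
    if overflow:
        lines.append("Summary of earlier work: " + summarize(overflow))

    # Masking: keep every action, but show only the newest observations.
    cutoff = len(visible) - keep_recent
    for i, step in enumerate(visible):
        observation = step.observation if i >= cutoff else MASK
        lines.append(f"{step.action} -> {observation}")
    return lines
```

Masking costs nothing extra, while summarization costs a model call, so in this sketch the summarizer runs only once the step budget is exceeded.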

ACE Framework

Demonstrated a +10.6% improvement on agent benchmarks and +8.6% on finance tasks through better context engineering alone. No model changes. The key insight: contexts should function as “comprehensive, evolving playbooks,” not concise summaries. Also introduced the concept of “context collapse,” where iterative rewriting erodes detail over time.

Infrastructure & Tools

Building a Real-Time Context Engine for AI Agents

Technical breakdown triggered by IBM’s $11B Confluent acquisition. Argues that RAG is document retrieval, but context is operational state. Proposes an event-driven architecture (Kafka + Snowplow) for real-time context that changes minute-by-minute. Useful for understanding where static context patterns break down.
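
The distinction between retrieved documents and operational state can be illustrated with a small in-memory sketch. It does not use Kafka or Snowplow: the `Event` and `LiveContext` types and the five-minute freshness window are hypothetical, standing in for whatever a real event stream would deliver.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    timestamp: float  # seconds, from any monotonic source
    key: str
    value: str


@dataclass
class LiveContext:
    """Operational state folded from a stream of events."""
    state: dict[str, Event] = field(default_factory=dict)

    def apply(self, event: Event) -> None:
        """Keep the newest value per key; ignore late, out-of-order events."""
        current = self.state.get(event.key)
        if current is None or event.timestamp >= current.timestamp:
            self.state[event.key] = event

    def render(self, now: float, max_age_seconds: float = 300.0) -> list[str]:
        """Return only facts fresh enough to belong in the prompt."""
        fresh = [
            e for e in self.state.values()
            if now - e.timestamp <= max_age_seconds
        ]
        return [f"{e.key}: {e.value}" for e in sorted(fresh, key=lambda e: e.key)]
```

Each `render` call reflects the state at that moment, so two prompts built a few minutes apart can legitimately differ, which is the case static context patterns do not cover.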

Context Engineering Best Practices

Production-oriented guidance: treat context as infrastructure, prune aggressively, and store and reuse context once it has been developed. Emphasizes memory layers (short-term session + long-term cross-session) as essential infrastructure rather than optional features.
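
The two memory layers can be sketched as a single small structure. The `LayeredMemory` class, its method names, and the `max_session_notes` pruning limit are hypothetical; a production system would back the long-term layer with persistent storage rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredMemory:
    """Short-term session notes plus long-term cross-session facts."""
    long_term: dict[str, str] = field(default_factory=dict)
    session: list[str] = field(default_factory=list)

    def note(self, text: str) -> None:
        """Record something relevant only to the current session."""
        self.session.append(text)

    def promote(self, key: str, value: str) -> None:
        """Store a fact that should survive across sessions."""
        self.long_term[key] = value

    def end_session(self) -> None:
        """Discard short-term notes; long-term facts are kept for reuse."""
        self.session.clear()

    def build_context(self, max_session_notes: int = 3) -> list[str]:
        """Durable facts first, then only the newest session notes (pruning)."""
        durable = [f"{k}: {v}" for k, v in sorted(self.long_term.items())]
        recent = self.session[-max_session_notes:] if max_session_notes > 0 else []
        return durable + recent
```

Pruning applies only to the session layer in this sketch, so long-term facts are reused in every prompt while stale session notes drop out.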

Field Maps

AI Copilot Product Trends, Q1 2026

Product management perspective on why fine-tuning lost to context engineering: fine-tuning demands high effort, foundation models improved too fast, and context engineering extracts enough value without retraining. Also argues that the real bottleneck has shifted from “context window size” to “context relevance.”