Case Study: Context Engineering in a Production Coding Agent
How a CLI coding agent structures its context across system prompts, project memory, skill files, and tool descriptions. What failed, what was changed, and the token budgets that emerged from iterating on real tasks.
The Context
A CLI-based coding agent built for professional daily use across multiple client projects, handling everything from file editing and git operations to infrastructure debugging, content writing, and research tasks. Sessions routinely run 30-80 turns, spanning multiple files and tools, with Claude as the underlying model.
The challenge: the agent needs to be useful across wildly different projects (Python backends, Astro static sites, infrastructure debugging, Kubernetes operations) without being configured individually for each one. The context engineering has to work generically while still being project-specific enough to be useful.
The Problem
The first iteration stuffed everything into the system prompt: global coding guidelines, project-specific context, tool descriptions, memory, and style preferences, all in one block at the start of every conversation. The system prompt grew past 8k tokens, and the agent started exhibiting classic context-rot symptoms: it ignored project conventions it had been told about, applied guidelines from one project to a different one, and occasionally contradicted its own instructions.
The second problem was subtler: different tasks need different context. A git commit task needs the commit style guide and diff analysis skills, while a content writing task needs voice guides, anti-pattern lists, and writing samples. Including all of it all the time meant the agent carried 15-20k tokens of irrelevant context for any given task, and the quality cost was worse than the token cost: the model's attention was spread across instructions for tasks it wasn't doing.
The Architecture
Four patterns from this catalog combine to solve it: a layered context, with a different loading policy at each layer.
Layer 1: Global Guidelines (~500 tokens)
A small, stable set of rules that apply to every task across every project: date and time handling (always check, never assume), writing style rules (no em dashes, no inline imports), git behavior (never push without approval), and credential handling. These form the base of The Pyramid: general rules that constrain everything above them.
Small by design: every token here competes with task-specific context in every single conversation. A guideline earns its place in the global layer only if it applies universally and if violating it has caused real problems more than once.
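The "every token competes" constraint can be enforced mechanically. A minimal sketch of such a budget guard, assuming a crude 4-characters-per-token estimate; the function names, budget constant, and guideline wording below are illustrative, not the agent's actual code:

```python
# Illustrative budget guard for the global layer. The 4-chars-per-token
# heuristic is rough, but sufficient to catch a guidelines file that
# has drifted well past its budget.

GLOBAL_BUDGET_TOKENS = 500

GLOBAL_GUIDELINES = """\
- Always check the current date and time; never assume them.
- No em dashes; no inline imports.
- Never push to git without explicit approval.
- Never echo credentials into logs or command output.
"""

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def check_global_budget(guidelines: str = GLOBAL_GUIDELINES) -> int:
    """Return the estimated token count, failing if over budget."""
    used = estimate_tokens(guidelines)
    if used > GLOBAL_BUDGET_TOKENS:
        raise ValueError(
            f"global guidelines use ~{used} tokens, "
            f"over the {GLOBAL_BUDGET_TOKENS}-token budget"
        )
    return used
```

Running a check like this in CI makes budget drift a build failure rather than a gradual quality regression.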
Layer 2: Project Context (~300-800 tokens)
Each project has its own context file (AGENTS.md) that describes the project’s tech stack, build commands, deployment process, and conventions. The agent loads this automatically when it detects which project it’s working in.
Deliberately lean: an early version included full architecture descriptions, design decision rationale, and coding standards; it regularly exceeded 3k tokens. The ETH Zurich study on AGENTS.md files confirmed what we observed: overly detailed project context files reduce task success rates and increase cost. The current versions describe only what the agent can’t infer from reading the code itself: deployment targets, build commands, and hard constraints that aren’t obvious from the codebase.
Layer 3: Skills (~800-2000 tokens per skill, loaded on demand)
Progressive Disclosure does the most work here. Skills are specialized instruction sets that the agent loads only when the task matches. A commit skill teaches the agent how to write commit messages in a specific style and analyze diffs. A writing skill loads voice guides and anti-patterns. A diagnostics skill provides Loki query patterns and known-issue templates for a specific production system.
The key design decision: skills are described in a lightweight index (name + one-sentence description) in the system prompt, and the full skill content is loaded only when the agent recognizes it needs one. The index costs about 400 tokens. Loading a skill adds 800-2000 tokens of task-specific instructions. Without this pattern, carrying all skills in every conversation would add 10-15k tokens of mostly irrelevant instructions.
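The index-plus-lazy-load split described above can be sketched as follows. The article specifies the index format (name plus one-sentence description); the data structure, file paths, and function names here are assumptions for illustration:

```python
# Illustrative Progressive Disclosure for skills: only the index goes
# into every system prompt; full skill bodies load on demand.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Skill:
    name: str
    description: str  # one sentence; this is all the index carries
    path: str         # full 800-2000 token instructions, loaded on demand

SKILLS = [
    Skill("commit", "Write commit messages in house style and analyze diffs.",
          "skills/commit.md"),
    Skill("writing", "Load voice guides and anti-pattern lists for content work.",
          "skills/writing.md"),
    Skill("diagnostics", "Query Loki and apply known-issue templates.",
          "skills/diagnostics.md"),
]

def skill_index() -> str:
    """The lightweight (~400 token) index included in every prompt."""
    return "\n".join(f"- {s.name}: {s.description}" for s in SKILLS)

def load_skill(name: str) -> str:
    """Pull the full skill body only when the task matches."""
    for s in SKILLS:
        if s.name == name:
            return Path(s.path).read_text(encoding="utf-8")
    raise KeyError(f"unknown skill: {name}")
```

The agent sees only `skill_index()` by default and calls `load_skill()` when it recognizes a matching task, which is what keeps the other 10-15k tokens of skill content out of unrelated conversations.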
Layer 4: Memory (~200-600 tokens)
Persistent facts stored across sessions using Write Outside the Window. Two scopes: global memory (preferences and facts that apply across all projects) and project memory (architecture decisions, recurring patterns, key contacts for a specific codebase).
Memory entries are one sentence each, written in imperative or declarative form: “Deploy target is statichost, not Vercel,” or “No inline imports in Python.” This density is intentional; memory entries that require a paragraph to explain usually belong in a skill file, not in memory.
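The two-scope layout and one-sentence rule can be sketched as a small store. The global/project split and the density rule are from the article; the class shape, JSON persistence, entry cap, and word limit are illustrative assumptions:

```python
# Illustrative two-scope memory store ("Write Outside the Window"):
# one JSON file per scope, one sentence per entry, aggressive caps.
import json
from pathlib import Path

MAX_ENTRIES_PER_SCOPE = 40  # keeps each scope well under the token budget
MAX_WORDS_PER_ENTRY = 25    # anything longer belongs in a skill file

class Memory:
    def __init__(self, store: Path):
        self.store = store

    def _file(self, scope: str) -> Path:
        return self.store / f"{scope}.json"  # "global" or "project"

    def remember(self, scope: str, fact: str) -> None:
        """Append a one-sentence fact, evicting the oldest past the cap."""
        if len(fact.split()) > MAX_WORDS_PER_ENTRY:
            raise ValueError("entry too long; put it in a skill file instead")
        entries = self.recall(scope) + [fact]
        self._file(scope).write_text(json.dumps(entries[-MAX_ENTRIES_PER_SCOPE:]))

    def recall(self, scope: str) -> list[str]:
        f = self._file(scope)
        return json.loads(f.read_text()) if f.is_file() else []
```

Rejecting paragraph-length entries at write time enforces the density rule automatically instead of relying on discipline during later pruning.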
What Failed
Verbose project context. The first AGENTS.md files were 2-3k tokens of architecture documentation. The agent would reference design rationale from the project context when making code changes, but the overhead degraded performance on simple tasks where that context was noise. Cutting project context to 300-800 tokens improved both speed and accuracy.
Static skill loading. An early version loaded all skills upfront. Sessions that needed a commit skill also carried the full frontend design skill, the production diagnostics skill, and the TDD skill. Switching to on-demand loading was the single biggest quality improvement.
Flat memory. The first memory implementation was a single flat list that grew without bounds. At 80+ entries, the memory section consumed 2k+ tokens and the agent stopped reliably applying individual entries. Splitting into global/project scope and pruning aggressively (one sentence per fact, delete anything stale) brought memory under control.
Key Takeaway
The system prompt is only 5% of the context budget. The other 95% (conversation history, tool outputs, file contents) is assembled by code, which means context engineering for a coding agent is primarily a software engineering problem: designing the right data structures, loading policies, and eviction strategies for a resource-constrained system.
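The assembly step implied by the four layers can be sketched as a single function with per-layer budgets. The budget numbers come from the layer descriptions above; the function name, dict layout, and 4-chars-per-token heuristic are illustrative assumptions:

```python
# Illustrative prompt assembly: each layer has a hard token budget,
# and exceeding it fails loudly rather than silently degrading quality.
BUDGETS = {
    "global_guidelines": 500,   # Layer 1
    "project_context": 800,     # Layer 2, upper bound
    "skill": 2000,              # Layer 3, per loaded skill
    "memory": 600,              # Layer 4, upper bound
}

def assemble_system_prompt(layers: dict[str, str]) -> str:
    """Join non-empty layers in pyramid order, enforcing each budget."""
    parts = []
    for name, budget in BUDGETS.items():
        text = layers.get(name, "")
        if not text:
            continue  # layers load only when relevant (e.g. no skill)
        if len(text) // 4 > budget:  # ~4 chars/token heuristic
            raise ValueError(f"{name} exceeds its {budget}-token budget")
        parts.append(text)
    return "\n\n".join(parts)
```

Treating layer budgets as hard limits in code, rather than guidelines in documentation, is what makes the eviction and loading policies enforceable.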