# System Prompt Growth Over Time
Claude's system prompt grew 23x in 15 months. ChatGPT's roughly doubled. Dated snapshots from real products show the accumulation problem playing out in public.
## The Accumulation Problem, Measured
The System Prompt Engineering deep dive describes the accumulation problem in theory: system prompts grow incrementally as each failure prompts a new instruction, until the whole thing is longer than it needs to be and harder to maintain than the code it governs. Reverse-engineered system prompts from major products, collected with dates in TheBigPromptLibrary, let us see this pattern play out with actual numbers.
## Claude: 209 to 4,806 Words
| Date | Version | Words | Change |
|---|---|---|---|
| Mar 2024 | Claude 3 | 209 | Baseline |
| Jul 2024 | Claude 3.5 Sonnet | 961 | +360% |
| Jun 2025 | Claude Sonnet 4 (on claude.ai) | 4,806 | +400% |
| Sep 2025 | Claude Sonnet 4.5 | 1,662 | -65% |
| Jan 2026 | Claude Cowork | 3,722 | +124% |
The Claude 3 system prompt was 209 words. The entire behavioral specification was a single paragraph: identity, knowledge cutoff, a few guidelines about controversial topics, and a note to use markdown for code. No tool definitions, no artifact rules, no content policy sections.
Fifteen months later, the Claude Sonnet 4 prompt on claude.ai had grown to 4,806 words. The growth wasn’t gradual; it came in bursts as new product features landed. Artifacts required hundreds of words of formatting rules, library availability lists, and storage restrictions. The REPL tool needed usage guidance with worked examples. Behavioral guidelines expanded from one paragraph to multiple sections covering wellbeing, refusal handling, formatting, and content policy. Each addition was motivated by a real product need, but the cumulative effect is a system prompt that consumes over 5,000 tokens on every API call.
The Claude Sonnet 4.5 prompt broke the trend. It dropped back to 1,662 words, a 65% reduction. Anthropic appears to have refactored: the artifact definitions and tool specifications were separated from the core behavioral prompt, leaving a tighter system prompt focused on identity, behavior, and formatting. This is the structural fix the system-prompt-engineering guide recommends: separate stable instructions from variable tool definitions so they can be managed independently.
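The separation the refactor implies can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's actual mechanism: the module names (`CORE_BEHAVIOR`, `TOOL_DEFS`, `build_system_prompt`) and their contents are invented for the example.

```python
# Hypothetical sketch: keep stable behavioral instructions separate from
# variable tool definitions so each can be versioned and audited on its own.

CORE_BEHAVIOR = (
    "You are an assistant. Answer concisely, use markdown for code, "
    "and decline harmful requests politely."
)  # stable: changes rarely, reviewed like policy

TOOL_DEFS = {
    "artifacts": "Artifact rules: formatting, allowed libraries, storage limits.",
    "repl": "REPL usage guidance: when to run code, how to report results.",
}  # variable: each entry ships (and is removed) with a product feature


def build_system_prompt(enabled_tools: list[str]) -> str:
    """Assemble the prompt from the stable core plus only the tools in use."""
    parts = [CORE_BEHAVIOR]
    parts += [TOOL_DEFS[name] for name in enabled_tools if name in TOOL_DEFS]
    return "\n\n".join(parts)


# A chat-only surface pays for none of the tool text:
chat_prompt = build_system_prompt([])
# An artifacts-enabled surface adds exactly one module:
artifact_prompt = build_system_prompt(["artifacts"])
```

The point of the structure is that removing a feature removes its prompt text automatically, instead of leaving orphaned instructions behind.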
But the Cowork prompt, built for Claude’s desktop agent mode, climbed right back to 3,722 words by layering agentic features on top: a TodoList tool, a Task tool for spawning subagents, an AskUserQuestion tool, citation requirements, and a skills system. Each one justified, each one adding length. The cycle restarted.
## ChatGPT: A Different Growth Pattern
| Date | Version | Words | Change |
|---|---|---|---|
| Nov 2023 | GPT-4 (Gizmo) | 1,295 | Baseline |
| May 2024 | GPT-4o | 912 | -30% |
| Mar 2025 | GPT-4.5 | 1,049 | +15% |
| Jun 2025 | GPT-4.1 | 1,267 | +21% |
| Jul 2025 | ChatGPT 4o (full, with Study Mode) | 2,886 | +128% |
ChatGPT’s growth pattern is more restrained. The base prompt stayed relatively flat between 900 and 1,300 words across model generations, because OpenAI’s philosophy delegates behavioral specification to training rather than to the system prompt. The role definition has barely changed in two years: “You are ChatGPT, a large language model trained by OpenAI” plus a personality tag and a sentence or two of behavioral guidance.
The growth, when it comes, is driven entirely by tool additions. The July 2025 full prompt more than doubles in size because it includes Study Mode, file search with a sophisticated query language (featuring freshness-aware QDF operators and mandatory citation formatting), Canvas, DALL-E with content policy rules, and Python execution. Strip the tools and the core behavioral prompt is still under 300 words.
ChatGPT’s approach is a fundamentally different design choice from Anthropic’s. Claude’s system prompt tries to govern behavior explicitly through detailed prose instructions; ChatGPT’s tries to stay minimal and lets the model’s training carry the behavioral weight. The tradeoff: Claude’s approach is more predictable and auditable (you can read the rules), while ChatGPT’s is more concise but harder to debug when the model behaves unexpectedly (the rules are implicit in the weights).
## What the Growth Tells You
Tool definitions are the primary growth driver. In both products, the base behavioral prompt grows slowly while tool definitions drive the big jumps. Claude’s artifact system alone accounts for over 1,500 words. ChatGPT’s file search tool with its QDF query language is similarly large. If you’re watching your own system prompt grow, audit the tool definitions first; they’re almost certainly the largest section.
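A per-section audit is easy to automate if your prompt uses headers. The sketch below assumes markdown-style `#` headers and plain word counts; the `audit_sections` helper and the example prompt are illustrative, not taken from any real product.

```python
import re


def audit_sections(prompt: str) -> list[tuple[str, int]]:
    """Split a system prompt on markdown-style headers and report word
    counts per section, largest first, so the growth driver is obvious."""
    # re.split with one capture group yields:
    # [preamble, title, body, title, body, ...]
    pieces = re.split(r"^#+\s*(.+)$", prompt, flags=re.MULTILINE)
    counts = []
    if pieces[0].strip():
        counts.append(("(preamble)", len(pieces[0].split())))
    for title, body in zip(pieces[1::2], pieces[2::2]):
        counts.append((title.strip(), len(body.split())))
    return sorted(counts, key=lambda kv: kv[1], reverse=True)


example = """# Identity
You are a helpful assistant.

# Artifacts
Long formatting rules, library availability lists, storage restrictions...

# Formatting
Use markdown for code."""

print(audit_sections(example))  # largest section first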
Refactoring works but doesn’t stick. Anthropic’s Sonnet 4.5 refactor cut the prompt by 65%, but the very next product variant (Cowork) grew it back by 124%. New features bring new instructions, and without a discipline of removing content when adding content, the cycle repeats. This mirrors what happens in application codebases: a refactoring sprint produces a clean architecture, then six months of feature work erodes it back to the previous state.
The two philosophies produce different failure modes. Claude’s long explicit prompts risk Context Rot by consuming tokens that could go to user context, but they fail predictably (you can read the instruction that caused the behavior). ChatGPT’s short implicit prompts preserve token budget but fail opaquely (the model does something unexpected and there’s no instruction to point to). If you’re designing your own system, the choice between explicit and implicit behavioral specification should be driven by how much you need to debug and audit the model’s behavior.
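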
Caching is what makes the long approach viable. Claude’s 5,000+ token system prompt would be prohibitively expensive without prompt caching, which means the hit is paid once and subsequent turns in a conversation are cheap. If your deployment doesn’t support prompt caching, you’re forced toward the shorter, implicit approach regardless of your preference for explicitness.
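The economics can be made concrete with a little arithmetic. The multipliers below reflect Anthropic's published prompt-caching pricing at the time of writing (cache writes at 1.25x the base input rate, cache reads at 0.1x); treat the exact rates, and the $3/Mtok base rate, as assumptions to substitute with your provider's numbers.

```python
def conversation_cost(system_tokens: int, turns: int,
                      base_rate: float = 3.00,   # $/Mtok input (assumed)
                      write_mult: float = 1.25,  # cache-write premium (assumed)
                      read_mult: float = 0.10    # cache-read discount (assumed)
                      ) -> tuple[float, float]:
    """Cost of resending a system prompt every turn, with and without
    caching. Ignores per-turn user/assistant tokens to isolate the
    system-prompt overhead."""
    per_tok = base_rate / 1_000_000
    uncached = system_tokens * turns * per_tok
    # Caching: pay the write premium once, then a discounted read per turn.
    cached = system_tokens * per_tok * (write_mult + read_mult * (turns - 1))
    return uncached, cached


# A 5,000-token system prompt over a 20-turn conversation:
uncached, cached = conversation_cost(5_000, 20)
```

Under these assumed rates, caching cuts the system-prompt cost of a long conversation by the better part of an order of magnitude, which is what makes the verbose-but-auditable approach tenable.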
For your own system prompts: track the word count over time. If it’s growing faster than your feature surface justifies, you’re accumulating rather than engineering. Set a review trigger (quarterly, or when the prompt crosses a threshold like 2,000 tokens) and audit for instructions that are defensive patches rather than structural requirements.
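The review trigger can be a few lines in CI. This sketch uses a crude words-to-tokens heuristic (roughly 0.75 words per token for English prose); the helper names and the 2,000-token threshold echo the rule of thumb above, and a real tokenizer should replace the estimate in production.

```python
import datetime

REVIEW_THRESHOLD_TOKENS = 2_000  # rule-of-thumb trigger from above


def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: ~0.75 words per token for English prose.
    # Swap in your model's actual tokenizer for real numbers.
    return round(len(prompt.split()) / 0.75)


def check_prompt(prompt: str, log: list[tuple[str, int]]) -> bool:
    """Append a dated size snapshot to the log and return True when the
    prompt has crossed the review threshold."""
    tokens = estimate_tokens(prompt)
    log.append((datetime.date.today().isoformat(), tokens))
    return tokens > REVIEW_THRESHOLD_TOKENS
```

Run it on every prompt change and the log becomes exactly the kind of dated growth table this article is built on, for your own product.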