# System Prompt Growth Over Time
Claude's system prompt grew 23x in 15 months. ChatGPT's roughly doubled. Dated snapshots from real products show the accumulation problem playing out in public.
## The Accumulation Problem, Measured
The System Prompt Engineering deep dive describes the accumulation problem in theory: system prompts grow incrementally as each failure prompts a new instruction, until the whole thing is longer than it needs to be and harder to maintain than the code it governs. Reverse-engineered system prompts from major products, collected with dates in TheBigPromptLibrary, let us see this pattern play out with actual numbers.
## Claude: 209 to 4,806 Words
| Date | Version | Words | Change |
|---|---|---|---|
| Mar 2024 | Claude 3 | 209 | Baseline |
| Jul 2024 | Claude 3.5 Sonnet | 961 | +360% |
| Jun 2025 | Claude Sonnet 4 (on claude.ai) | 4,806 | +400% |
| Sep 2025 | Claude Sonnet 4.5 | 1,662 | -65% |
| Jan 2026 | Claude Cowork | 3,722 | +124% |
The Claude 3 system prompt was 209 words. The entire behavioral specification was a single paragraph: identity, knowledge cutoff, a few guidelines about controversial topics, and a note to use markdown for code. No tool definitions, no artifact rules, no content policy sections.
Fifteen months later, the Claude Sonnet 4 prompt on claude.ai had grown to 4,806 words. The growth wasn’t gradual; it came in bursts as new product features landed. Artifacts required hundreds of words of formatting rules, library availability lists, and storage restrictions. The REPL tool needed usage guidance with worked examples. Behavioral guidelines expanded from one paragraph to multiple sections covering wellbeing, refusal handling, formatting, and content policy. Each addition was motivated by a real product need, but the cumulative effect is a system prompt that consumes over 5,000 tokens on every API call.
The Claude Sonnet 4.5 prompt broke the trend. It dropped back to 1,662 words, a 65% reduction. Anthropic appears to have refactored: the artifact definitions and tool specifications were separated from the core behavioral prompt, leaving a tighter system prompt focused on identity, behavior, and formatting. This is the structural fix the system-prompt-engineering guide recommends: separate stable instructions from variable tool definitions so they can be managed independently.
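The separation the refactor implies can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's actual mechanism: the module names (`CORE_BEHAVIOR`, `TOOL_DEFS`, `build_system_prompt`) and their contents are invented for the example.

```python
# Hypothetical sketch: keep stable behavioral instructions separate from
# variable tool definitions so each can be versioned and audited on its own.

CORE_BEHAVIOR = (
    "You are an assistant. Answer concisely, use markdown for code, "
    "and decline harmful requests politely."
)  # stable: changes rarely, reviewed like policy

TOOL_DEFS = {
    "artifacts": "Artifact rules: formatting, allowed libraries, storage limits.",
    "repl": "REPL usage guidance: when to run code, how to report results.",
}  # variable: each entry ships (and is removed) with a product feature


def build_system_prompt(enabled_tools: list[str]) -> str:
    """Assemble the prompt from the stable core plus only the tools in use."""
    parts = [CORE_BEHAVIOR]
    parts += [TOOL_DEFS[name] for name in enabled_tools if name in TOOL_DEFS]
    return "\n\n".join(parts)


# A chat-only surface pays for none of the tool text:
chat_prompt = build_system_prompt([])
# An artifacts-enabled surface adds exactly one module:
artifact_prompt = build_system_prompt(["artifacts"])
```

The point of the structure is that removing a feature removes its prompt text automatically, instead of leaving orphaned instructions behind.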
But the Cowork prompt, built for Claude’s desktop agent mode, climbed right back to 3,722 words by layering agentic features on top: a TodoList tool, a Task tool for spawning subagents, an AskUserQuestion tool, citation requirements, and a skills system. Each one justified, each one adding length. The cycle restarted.
## ChatGPT: A Different Growth Pattern
| Date | Version | Words | Change |
|---|---|---|---|
| Nov 2023 | GPT-4 (Gizmo) | 1,295 | Baseline |
| May 2024 | GPT-4o | 912 | -30% |
| Mar 2025 | GPT-4.5 | 1,049 | +15% |
| Jun 2025 | GPT-4.1 | 1,267 | +21% |
| Jul 2025 | ChatGPT 4o (full, with Study Mode) | 2,886 | +128% |
ChatGPT’s growth pattern is more restrained. The base prompt stayed relatively flat between 900 and 1,300 words across model generations, because OpenAI’s philosophy delegates behavioral specification to training rather than to the system prompt. The role definition has barely changed in two years: “You are ChatGPT, a large language model trained by OpenAI” plus a personality tag and a sentence or two of behavioral guidance.
The growth, when it comes, is driven entirely by tool additions. The July 2025 full prompt more than doubles in size because it includes Study Mode, file search with a sophisticated query language (featuring freshness-aware QDF operators and mandatory citation formatting), Canvas, DALL-E with content policy rules, and Python execution. Strip the tools and the core behavioral prompt is still under 300 words.
ChatGPT’s approach is a fundamentally different design choice from Anthropic’s. Claude’s system prompt tries to govern behavior explicitly through detailed prose instructions; ChatGPT’s tries to stay minimal and lets the model’s training carry the behavioral weight. The tradeoff: Claude’s approach is more predictable and auditable (you can read the rules), while ChatGPT’s is more concise but harder to debug when the model behaves unexpectedly (the rules are implicit in the weights).
## What the Growth Tells You
Tool definitions are the primary growth driver. In both products, the base behavioral prompt grows slowly while tool definitions drive the big jumps. Claude’s artifact system alone accounts for over 1,500 words. ChatGPT’s file search tool with its QDF query language is similarly large. If you’re watching your own system prompt grow, audit the tool definitions first; they’re almost certainly the largest section.
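A per-section audit is easy to automate if your prompt uses headers. The sketch below assumes markdown-style `#` headers and plain word counts; the `audit_sections` helper and the example prompt are illustrative, not taken from any real product.

```python
import re


def audit_sections(prompt: str) -> list[tuple[str, int]]:
    """Split a system prompt on markdown-style headers and report word
    counts per section, largest first, so the growth driver is obvious."""
    # re.split with one capture group yields:
    # [preamble, title, body, title, body, ...]
    pieces = re.split(r"^#+\s*(.+)$", prompt, flags=re.MULTILINE)
    counts = []
    if pieces[0].strip():
        counts.append(("(preamble)", len(pieces[0].split())))
    for title, body in zip(pieces[1::2], pieces[2::2]):
        counts.append((title.strip(), len(body.split())))
    return sorted(counts, key=lambda kv: kv[1], reverse=True)


example = """# Identity
You are a helpful assistant.

# Artifacts
Long formatting rules, library availability lists, storage restrictions...

# Formatting
Use markdown for code."""

print(audit_sections(example))  # largest section first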
Refactoring works but doesn’t stick. Anthropic’s Sonnet 4.5 refactor cut the prompt by 65%, but the very next product variant (Cowork) grew it back by 124%. New features bring new instructions, and without a discipline of removing content when adding content, the cycle repeats. This mirrors what happens in application codebases: a refactoring sprint produces a clean architecture, then six months of feature work erodes it back to the previous state.
The two philosophies produce different failure modes. Claude’s long explicit prompts risk Context Rot by consuming tokens that could go to user context, but they fail predictably (you can read the instruction that caused the behavior). ChatGPT’s short implicit prompts preserve token budget but fail opaquely (the model does something unexpected and there’s no instruction to point to). If you’re designing your own system, the choice between explicit and implicit behavioral specification should be driven by how much you need to debug and audit the model’s behavior.
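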
Caching is what makes the long approach viable. Claude’s 5,000+ token system prompt would be prohibitively expensive without prompt caching, which means the hit is paid once and subsequent turns in a conversation are cheap. If your deployment doesn’t support prompt caching, you’re forced toward the shorter, implicit approach regardless of your preference for explicitness.
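The economics can be made concrete with a little arithmetic. The multipliers below reflect Anthropic's published prompt-caching pricing at the time of writing (cache writes at 1.25x the base input rate, cache reads at 0.1x); treat the exact rates, and the $3/Mtok base rate, as assumptions to substitute with your provider's numbers.

```python
def conversation_cost(system_tokens: int, turns: int,
                      base_rate: float = 3.00,   # $/Mtok input (assumed)
                      write_mult: float = 1.25,  # cache-write premium (assumed)
                      read_mult: float = 0.10    # cache-read discount (assumed)
                      ) -> tuple[float, float]:
    """Cost of resending a system prompt every turn, with and without
    caching. Ignores per-turn user/assistant tokens to isolate the
    system-prompt overhead."""
    per_tok = base_rate / 1_000_000
    uncached = system_tokens * turns * per_tok
    # Caching: pay the write premium once, then a discounted read per turn.
    cached = system_tokens * per_tok * (write_mult + read_mult * (turns - 1))
    return uncached, cached


# A 5,000-token system prompt over a 20-turn conversation:
uncached, cached = conversation_cost(5_000, 20)
```

Under these assumed rates, caching cuts the system-prompt cost of a long conversation by the better part of an order of magnitude, which is what makes the verbose-but-auditable approach tenable.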
For your own system prompts: track the word count over time. If it’s growing faster than your feature surface justifies, you’re accumulating rather than engineering. Set a review trigger (quarterly, or when the prompt crosses a threshold like 2,000 tokens) and audit for instructions that are defensive patches rather than structural requirements.
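The review trigger can be a few lines in CI. This sketch uses a crude words-to-tokens heuristic (roughly 0.75 words per token for English prose); the helper names and the 2,000-token threshold echo the rule of thumb above, and a real tokenizer should replace the estimate in production.

```python
import datetime

REVIEW_THRESHOLD_TOKENS = 2_000  # rule-of-thumb trigger from above


def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: ~0.75 words per token for English prose.
    # Swap in your model's actual tokenizer for real numbers.
    return round(len(prompt.split()) / 0.75)


def check_prompt(prompt: str, log: list[tuple[str, int]]) -> bool:
    """Append a dated size snapshot to the log and return True when the
    prompt has crossed the review threshold."""
    tokens = estimate_tokens(prompt)
    log.append((datetime.date.today().isoformat(), tokens))
    return tokens > REVIEW_THRESHOLD_TOKENS
```

Run it on every prompt change and the log becomes exactly the kind of dated growth table this article is built on, for your own product.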