Anatomy of a Production System Prompt
Real system prompts from Claude, ChatGPT, and GitHub Copilot, annotated against context engineering patterns. What they get right, where they break their own rules, and what the structure tells you about each product's priorities.
Why Look at Real Prompts
The System Prompt Engineering deep dive covers the principles: put hard constraints first, keep it under 2,000 tokens, audit quarterly. Those principles are easier to internalize when you can see them applied, and violated, in production systems that millions of people use daily.
Reverse-engineered system prompts from major AI products have been collected in public repositories (notably TheBigPromptLibrary, MIT-licensed). The prompts themselves are proprietary, but short excerpts with commentary fall squarely in fair use territory, and the structural patterns they reveal are worth examining in detail. What follows is a walkthrough of three system prompts representing very different design philosophies, annotated against the patterns in this catalog.
Claude Sonnet 4 (claude.ai, June 2025)
The Claude Sonnet 4 system prompt is large: conservatively over 5,000 tokens before the user says anything, and that figure grows further once tools, artifacts, and the REPL environment definition are layered on. The bulk of the combined context is structural: artifact creation rules, tool definitions, behavioral guidelines, and content policies.
What it gets right structurally: hard constraints appear early and with visual emphasis. The CRITICAL BROWSER STORAGE RESTRICTION section uses all-caps headers and bold text to ensure the model doesn’t miss it:
NEVER use localStorage, sessionStorage, or ANY browser storage APIs in artifacts. These APIs are NOT supported and will cause artifacts to fail in the Claude.ai environment.
This follows the Attention Anchoring principle directly. The constraint is stated positively after the prohibition (“Instead, you MUST: Use React state…”), which is exactly what the Negative Constraints pattern recommends: hard stop followed by a redirect.
Where it drifts: the role definition and identity section appear after the tool definitions. That’s several thousand tokens of artifact rules, REPL instructions, and CSV handling guidance before the model reads “The assistant is Claude, created by Anthropic.” By the time the behavioral guidelines land, the model has already processed the bulk of its operational context. The Pyramid would put identity and behavioral framing before tool mechanics, since the role shapes how the model interprets everything else.
The budget problem: at 5,000+ tokens of system prompt on every API call, this is a concrete example of why Context Budget allocation matters. On a short task where the user asks a three-sentence question, the system prompt accounts for over 80% of the initial context. The saving grace is Context Caching: because the system prompt is identical across turns, it hits cache after the first call, and subsequent turns pay only for the delta. Without caching this prompt would be expensive at scale; with caching, the stable prefix pays for itself.
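The caching arithmetic is easy to sketch. A back-of-envelope in Python, assuming Anthropic-style ephemeral-cache pricing where a cache write costs roughly 1.25x and a cache read roughly 0.1x the base input rate; treat the multipliers and the per-token price as illustrative assumptions, not current prices:

```python
# Back-of-envelope cost of a cached vs uncached system-prompt prefix.
# Multipliers are assumptions modeled on Anthropic-style ephemeral cache
# pricing (write ~1.25x, read ~0.1x base input rate), not live prices.

def session_input_cost(prefix_tokens, turns, per_token=3e-6,
                       cached=True, write_mult=1.25, read_mult=0.10):
    """Input-token cost of the system-prompt prefix across a session."""
    if not cached:
        return prefix_tokens * turns * per_token
    first = prefix_tokens * write_mult * per_token              # turn 1: cache write
    rest = prefix_tokens * read_mult * per_token * (turns - 1)  # later turns: cache reads
    return first + rest

uncached = session_input_cost(5000, turns=10, cached=False)
cached = session_input_cost(5000, turns=10, cached=True)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
# Across ten turns, the cached prefix costs a fraction of the uncached one.
```

The crossover comes fast: the write premium on turn one is repaid as soon as a couple of cheap cache reads follow it.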
The tool descriptions are a case study in themselves. The repl tool definition runs hundreds of tokens and includes detailed “when to use” and “when NOT to use” sections, complete with examples:
Use the analysis tool ONLY for: Complex math problems that require a high level of accuracy… Do NOT use analysis for problems like “4,847 times 3,291?” […] Use analysis only for MUCH harder calculations
This matches the Tool Descriptions pattern precisely: scope, trigger conditions, anti-trigger conditions, and worked examples of what falls on each side of the line. It’s one of the better tool definitions in any production system.
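That four-part structure transfers to other tools. A minimal sketch of the same scope / trigger / anti-trigger / examples shape; the field names and rendering helper are hypothetical, not any provider's actual schema:

```python
# A schematic tool description following the scope / trigger / anti-trigger /
# examples structure the repl definition uses. Field names are illustrative.

analysis_tool = {
    "name": "analysis",
    "scope": "JavaScript REPL for calculations too hard to do reliably in-context.",
    "use_when": [
        "Complex math requiring a high level of accuracy",
        "Analyzing large uploaded files",
    ],
    "do_not_use_when": [
        "Simple arithmetic the model can do directly (e.g. 4,847 * 3,291)",
        "Tasks where a rough estimate is acceptable",
    ],
    "examples": {
        "in_scope": "Compute summary statistics over every row of data.csv",
        "out_of_scope": "What is 12% of 340?",
    },
}

def render_description(tool):
    """Flatten the structured description into prompt text."""
    lines = [f"Tool: {tool['name']}", tool["scope"], "Use ONLY for:"]
    lines += [f"- {c}" for c in tool["use_when"]]
    lines.append("Do NOT use for:")
    lines += [f"- {c}" for c in tool["do_not_use_when"]]
    return "\n".join(lines)

print(render_description(analysis_tool))
```

Keeping the description structured rather than freeform makes the anti-trigger side, the part most tool definitions omit, a required field instead of an afterthought.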
GPT-4.5 (ChatGPT, March 2025)
The GPT-4.5 system prompt is strikingly minimal compared to Claude’s. The core identity fits in a handful of lines:
You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2023-10
Current date: 2025-03-05
Personality: v2
You are a highly capable, thoughtful, and precise assistant.
That’s the entire behavioral specification: one sentence of role framing, plus metadata. Everything else in the prompt is tool definitions for Canvas, DALL-E, and Python execution.
What this tells you about product philosophy: OpenAI is betting that the model’s training carries the behavioral weight, so the system prompt focuses almost entirely on tool routing. The role definition does no real Role Framing work; “highly capable, thoughtful, and precise” describes the default behavior without constraining it. Compare this to Claude’s extensive behavioral guidelines with explicit sections on wellbeing, refusal handling, and tone, and you see two completely different theories of how much the system prompt should govern.
Where the minimalism costs them: tool definitions carry operational rules that would be more effective as behavioral guidelines. The DALL-E section includes nine numbered rules about content policy, covering everything from post-1912 artists to named individuals to copyrighted characters. These are Negative Constraints buried inside a tool definition, which means the model only encounters them when it’s already decided to generate an image; promoting them to the behavioral section would give them better attention placement.
The Canvas tool definition includes this:
NEVER use this function. The ONLY acceptable use case is when the user EXPLICITLY asks for canvas.
That’s a hard constraint formatted as a tool instruction. It should be a top-level behavioral rule, not buried in a tool parameter description where it has to compete with the model’s general inclination to use available tools.
The Select, Don’t Dump problem: the full prompt includes complete tool schemas for features many users never touch in a given session (Canvas, DALL-E, Python). Every session pays the token cost for tools the user may not invoke. A Progressive Disclosure approach would expose a minimal tool set initially and surface additional tools only when the conversation moves toward those capabilities.
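A Progressive Disclosure routing layer can be sketched in a few lines. The keyword triggers below stand in for whatever intent detection a real product would use; the tool names mirror the ones above but the mechanism is hypothetical:

```python
# Sketch of Progressive Disclosure for tool schemas: expose a minimal tool
# set by default and add heavier schemas only when the conversation signals
# they are needed. Keyword matching is a stand-in for real intent detection.

BASE_TOOLS = ["python"]                     # always available
ON_DEMAND = {
    "canvas": ("draft", "document", "edit this essay"),
    "dalle": ("draw", "image", "picture", "illustration"),
}

def tools_for_turn(user_message, active):
    """Return the tool set for this turn, growing it only on demand."""
    text = user_message.lower()
    for tool, triggers in ON_DEMAND.items():
        if tool not in active and any(t in text for t in triggers):
            active.append(tool)             # pay the schema cost from now on
    return BASE_TOOLS + active

active = []
print(tools_for_turn("Summarize this paper", active))       # minimal set
print(tools_for_turn("Now draw an illustration", active))   # dalle surfaced
```

Sessions that never mention images never pay for the image-generation schema; sessions that do pay for it only from the turn where it becomes plausible.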
GitHub Copilot CLI (January 2026)
The Copilot CLI prompt is the tightest of the three, and it reads like it was written by someone who actually thinks about context budgets. The opening is pure operational instruction:
You are the GitHub Copilot CLI, a terminal assistant built by GitHub. You are an interactive CLI tool that helps users with software engineering tasks.
Be concise and direct. Make tool calls without explanation. Minimize response length. When providing output or explanation, limit your response to 3 sentences or less.
Six sentences that accomplish what Claude's prompt takes paragraphs to do: identity, purpose, behavioral constraints, output format. The Pyramid is fully respected; the most important instructions, including how to behave and what to optimize for, land in the first 50 tokens.
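The ordering principle can be made concrete. A sketch of Pyramid-ordered prompt assembly for a hypothetical "Acme CLI" product; the section contents are placeholders, and the point is the sequence, not the wording:

```python
# Assembling a system prompt in Pyramid order: identity and hard constraints
# first, tool mechanics last, so the highest-stakes instructions get the
# best attention placement. All section text is placeholder content.

SECTIONS = [
    ("identity",    "You are the Acme CLI, a terminal assistant."),    # who
    ("constraints", "NEVER run destructive commands without asking."), # hard stops
    ("behavior",    "Be concise. Limit explanations to 3 sentences."), # how
    ("output",      "Prefer plain text over markdown tables."),        # format
    ("tools",       "bash: run shell commands.\nedit: modify files."), # mechanics
]

def build_prompt(sections):
    """Concatenate sections in priority order; the order is the point."""
    return "\n\n".join(body for _, body in sections)

prompt = build_prompt(SECTIONS)
assert prompt.index("NEVER") < prompt.index("bash:")  # constraints precede tools
```

Encoding the order in a list rather than a hand-edited string also makes reordering a reviewable one-line diff when the prompt is audited.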
Context efficiency as a first-class concern: the prompt explicitly instructs the model to minimize LLM turns through parallel tool calling and command chaining, which is unusual. Most system prompts tell the model what to do but not how efficiently to do it; Copilot CLI treats context efficiency as a behavioral requirement, not an afterthought:
CRITICAL: Minimize the number of LLM turns by using tools efficiently: USE PARALLEL TOOL CALLING; when you need to perform multiple independent operations, make ALL tool calls in a SINGLE response. Chain related bash commands with && instead of separate calls
This is the Context Budget pattern applied at the instruction level: the system prompt doesn’t just allocate tokens, it tells the model to minimize how many tokens it needs across the entire session.
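The command-chaining half of that rule is mechanical enough to sketch: merge related shell steps into one `&&` chain so the agent spends one tool call, and one LLM turn, instead of several. The helper below is illustrative; a real agent would also decide which steps are safe to chain:

```python
# Sketch of the command-chaining rule from the Copilot CLI prompt: related
# shell steps become a single `&&` chain, so one tool call replaces several
# and a failure anywhere short-circuits the remaining steps.

def chain_commands(steps):
    """Join shell steps with && so a failure stops the rest of the chain."""
    return " && ".join(steps)

call = chain_commands([
    "mkdir -p build",
    "cd build",
    "cmake ..",
])
print(call)   # one tool call instead of three
```

The `&&` semantics matter as much as the turn savings: a failed `mkdir` never reaches the `cmake`, which is the behavior you want from a sequenced chain.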
Negative constraints done right: the <prohibited_actions> section is a compact list of hard stops, but each one is specific enough to be actionable. “Don’t share sensitive data with 3rd party systems” and “Don’t commit secrets into source code” are both clear, concrete boundaries, not the vague “be careful with sensitive information” that leaves the model guessing. The section closes with a positive redirect (“If this prevents you from accomplishing your task, please stop and let the user know”), which means the model knows what to do when it hits a constraint, not just what to avoid.
What’s missing: there’s no explicit behavioral guidance for tone beyond “be concise,” and the prompt assumes conciseness implies the right register for a CLI tool. That’s probably correct for this specific product but wouldn’t generalize to a conversational assistant where the model needs to know whether it’s friendly, formal, or terse.
Patterns Across All Three
Three production prompts, three different sizes, three different philosophies, but some patterns hold across all of them.
Tool definitions dominate. In all three prompts, tool definitions consume more tokens than behavioral instructions. Claude’s artifact and REPL definitions, GPT-4.5’s Canvas and DALL-E policies, Copilot’s bash and edit guidelines; these are the longest sections in every case. If you’re optimizing system prompt length, tool descriptions are where the budget goes and where the pruning opportunities live.
Negative constraints cluster near specific tools. Content policy rules appear next to the tools they govern rather than in a central behavioral section. This makes sense from a maintenance perspective (the person adding DALL-E restrictions edits the DALL-E section) but weakens their attention placement. Promoting the most critical constraints to the top of the prompt, regardless of which tool they relate to, would give them better positioning.
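The promotion described above can be sketched as a build-time pass over tool descriptions. The NEVER / Do NOT marker heuristic is illustrative; a real prompt pipeline would tag constraints explicitly rather than pattern-match on prose:

```python
# Sketch of constraint promotion: scan tool descriptions for hard-stop
# lines and hoist copies to the top of the prompt, where attention
# placement is strongest. The marker heuristic is illustrative only.

MARKERS = ("NEVER", "DO NOT", "DON'T")

def promote_constraints(tool_descriptions):
    """Collect hard-stop lines from every tool description."""
    promoted = []
    for desc in tool_descriptions:
        for line in desc.splitlines():
            if line.strip().upper().startswith(MARKERS):
                promoted.append(line.strip())
    return promoted

tools = [
    "canvas: collaborative editor.\nNEVER use this unless explicitly asked.",
    "image_gen: create images.\nDo not depict named individuals.",
]
header = "\n".join(promote_constraints(tools))
print(header)   # hard stops lead the prompt; tools keep their full details
```

This keeps the maintenance benefit (the constraint still lives next to its tool) while fixing the attention placement, since the hoisted copies land at the top of the assembled prompt.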
Nobody budgets explicitly. None of these prompts show signs of deliberate token budgeting. Claude’s prompt is long because it has a lot of features to describe, GPT-4.5’s is short because it delegates to training, and Copilot’s is tight because the CLI context demands it, but in all three cases the length is a side effect of the product’s feature surface rather than a deliberate allocation against a target token count.
Caching makes the long prompts viable. Claude’s 5,000+ token system prompt would be prohibitively expensive without prompt caching, and the fact that it exists at this length tells you the team is counting on cache hits to amortize the cost. If your API provider doesn’t support prompt caching, you need to be much more aggressive about length.
What to Take From This
If you’re writing a system prompt for a production application, these three examples bracket the realistic design space. You probably don’t need Claude’s level of explicit behavioral instruction unless you’re building a general-purpose assistant with content policy requirements. You probably can’t get away with GPT-4.5’s minimalism unless you trust the model’s training to carry most of the behavioral weight. Copilot CLI’s approach, tight identity followed by operational efficiency requirements followed by tool definitions with clear boundaries, is the closest to a transferable template for most tool-using applications.
The structural lesson is consistent across all three: the ordering of your system prompt is a context engineering decision, not an editorial one. Where you place constraints relative to tools, identity relative to behavior, and hard stops relative to soft guidance directly affects how reliably the model follows each instruction.