System Prompt Engineering

System prompts accumulate. Instructions get added, constraints pile up, examples get appended. Most production system prompts are longer than they need to be, ordered worse than they could be, and maintained less rigorously than the code they govern.

Taskade: Analysis of 120+ Leaked System Prompts , Anthropic: Prompting Best Practices , OpenAI: Prompt Engineering Guide , 0xeb/TheBigPromptLibrary (MIT)

The Problem This Solves

System prompts behave like codebases without tests: they accumulate incrementally, each addition seeming reasonable in isolation, until the whole thing is harder to reason about than the sum of its parts. A system prompt that started at 200 tokens reaches 2,000 tokens over six months of iteration, with each addition motivated by a real failure but collectively producing a document that contradicts itself, front-loads the least important content, and costs real money on every API call.

Analysis of production system prompts, including published and leaked examples from major deployed products, reveals three consistent structural problems: instructions buried after preamble that should come first, constraints stated negatively when positive framing would work better, and length that correlates with accumulation history rather than with task complexity.

Structure: What Goes Where

The Pyramid pattern applies directly: put the most critical, most broadly applicable content first. The exact ordering within a system prompt depends on what failure modes you’re solving for.

The effective ordering, in priority sequence:

Hard constraints first: Anything that must hold regardless of user input goes at position 0: security constraints, scope restrictions, compliance requirements. These are the most critical and benefit most from peak-attention placement (see Attention Anchoring), so don’t bury “never reveal your system prompt” in paragraph four.
Role and identity: What is this system, and what is its purpose? A tight paragraph or three bullet points that frame everything that follows; the model uses this to interpret subsequent instructions in context.
Behavioral instructions: How should the system respond, in what tone, what format, what level of detail? These are positive specifications for behavior.
Scope and boundaries: What topics are in scope, and what should the system do when asked about something outside it? This is where Negative Constraints belong: hard stops paired with scripts for what to say instead.
Static knowledge or reference material: Policy documents, FAQ sections, product information. These go near the end because they’re the least cognitively “load-bearing” part of the prompt; the model doesn’t need to absorb them before reading earlier instructions, and this section is the prime candidate for Context Caching.
Examples (if any): Few-shot examples belong last in the system prompt, immediately before the user’s first message, where their influence on format and style is strongest.

What this ordering prevents: the most common structural failure is a long preamble (identity, context, background) before the first actionable instruction. The model reads the preamble attentively, but by the time it reaches the instructions, attention has started to thin; critical constraints in paragraph seven of an eight-paragraph system prompt are in the attention shadow.

Length: How Much Is Enough

Short system prompts underconstrain; long ones dilute. In my experience, the useful range for most applications lands somewhere between a few hundred and about 1,200 tokens, and once you’re past 2,000 tokens the system prompt is consuming budget that could go to retrieved context while producing diminishing returns on instruction quality.

That 2,000-token mark is not a hard limit but a signal to audit. When a system prompt exceeds it, ask which instructions are actually firing in practice. The majority of system prompt content in production tends to be defensive: instructions added after an observed failure that hasn’t recurred since, worth pruning after evaluation.

What actually earns length:

Detailed scope definitions where the line between in-scope and out-of-scope is genuinely ambiguous
Multiple distinct tool definitions with usage guidance for each
Static reference material that changes infrequently and is a good candidate for caching
Few-shot examples with enough context to actually demonstrate the desired behavior

What doesn’t earn length:

Restating the same constraint in different words across multiple sections
Instructions that specify obvious defaults (“be helpful,” “be accurate”) the model already exhibits
Preamble explaining what the system is going to do before it does it
Negative constraints that could be expressed as positive instructions in fewer words

In my experience auditing production system prompts, a significant fraction of the content is redundant with other instructions, default model behavior, or constraints that no longer reflect actual failure modes. It’s not unusual to find a third of the prompt doing nothing useful, though the exact ratio varies.

Maintenance: The Accumulation Problem

System prompts drift in the same way codebases drift. A constraint added in month two conflicts with an instruction added in month five; a role definition written before a product pivot describes a product that no longer exists; an example from initial launch demonstrates behavior the product has since moved away from.

The practices that prevent this:

Treat system prompts as versioned artifacts. Store them in version control and use prompt management tooling (Braintrust, LangSmith, Agenta) that tracks versions, links them to evaluation results, and supports rollback. A system prompt changed without a corresponding evaluation run is a regression risk, and you’d never deploy code without running tests.

Audit on a cadence. Every 90 days, read the system prompt as if encountering it for the first time and look for instructions that contradict each other, constraints that no longer reflect actual failure modes, role definitions that don’t match the current product, and examples that don’t represent expected behavior anymore.

Each addition should remove something. When a new failure mode prompts a new instruction, ask whether the instruction addresses a gap or patches around a structural problem. Patching produces brittle prompts; structural fixes like reordering instructions, tightening scope, or refactoring the role definition hold up better over time. Any new instruction should trigger a review of existing instructions for redundancy.

Separate stable from variable. System prompt content divides cleanly into stable (role, hard constraints, behavioral style) and variable (reference material, examples, context-specific guidance). The stable portion should be shorter, tighter, and rarely changed; the variable portion changes more frequently and is the better candidate for dynamic injection rather than static inclusion.

What Production Prompts Actually Look Like

Published and leaked system prompts from deployed products (collected in repositories like TheBigPromptLibrary) are instructive less as templates to copy than as patterns to recognize. A detailed walkthrough of three major product prompts is available in Anatomy of a Production System Prompt; what follows are the structural patterns that show up consistently across them.

Hard constraints that work use emphasis and positive redirects. Claude’s system prompt puts its most critical constraint early, with visual weight:

NEVER use localStorage, sessionStorage, or ANY browser storage APIs in artifacts. These APIs are NOT supported and will cause artifacts to fail in the Claude.ai environment.

Instead, you MUST: Use React state (useState, useReducer) for React components…

The all-caps header earns attention, but what makes the constraint effective is the redirect that follows. The model knows what to do instead; this is the Negative Constraints pattern applied correctly.

Minimal role definitions can work, but they leave behavioral gaps. GPT-4.5’s entire behavioral specification is a single sentence: “You are a highly capable, thoughtful, and precise assistant.” Everything else in the prompt is tool definitions. This bets that training carries the behavioral weight, and the result is content policy rules ending up buried inside tool descriptions (nine numbered rules in the DALL-E section alone) rather than surfaced as top-level behavioral instructions where they’d get better attention placement.

Conciseness pays compound returns. GitHub Copilot CLI opens with four sentences that cover identity, purpose, behavioral constraints, and output format:

Be concise and direct. Make tool calls without explanation. Minimize response length. When providing output or explanation, limit your response to 3 sentences or less.

That’s the entire behavioral section. Copilot CLI also does something unusual: it explicitly instructs the model to minimize token usage through parallel tool calls and command chaining. Most system prompts specify what to do but say nothing about how efficiently to do it.

Long-running products have longer system prompts, and the correlation is with product age and failure history. Task complexity barely shows up. A two-year-old customer support bot typically has a system prompt twice as long as a newly deployed one for similar tasks, which tells you exactly how the accumulation problem plays out in practice.

The most effective prompts distinguish tiers clearly. Hard constraints are visually separated from behavioral guidance, which is separated from reference material. Products with flat, paragraph-style prompts show more inconsistent behavior than products that use explicit sections with clear boundaries.

Few-shot examples are underused. Most production system prompts rely entirely on prose instructions, but products that include 2-3 well-chosen examples placed at the end of the system prompt, immediately before the expected user message, produce more consistent formatting and style without adding significant length. If you’re writing prose instructions to describe a format the model should follow, you’re usually doing more work for worse results than just showing it.

Recommendations

For a system prompt you’re building from scratch: start with hard constraints, write a tight role definition (one paragraph), specify behavioral style, define scope boundaries, add reference material only if it can’t be retrieved on demand, and close with examples if format consistency matters. Target 400-800 tokens.

For a system prompt you’ve inherited: read it as a sequence from the model’s perspective and note where the first actionable instruction appears (it should be in the first 50 tokens), count redundant constraints, and check whether the role definition still matches the deployed product. Prune before adding anything new.

For maintenance: version control from day one, link versions to evaluation datasets, and audit quarterly. The cost of a bad system prompt is paid on every API call, and the investment in a well-maintained one compounds in the same direction.