Context Engineering for Customer Support Bots

Customer support is the most common production LLM use case and the one most likely to go wrong in ways that visibly damage trust. Wrong return windows, hallucinated policies, contradictions across turns: these are context problems, not model problems.

The Customer Support Context Problem

A support bot that confidently tells a customer the return window is 30 days, when it’s 14, has not failed because the model is bad. It has failed because the model fell back to training data when it should have used your policy document. A bot that contradicts what it said three messages ago has not failed because of reasoning errors; it failed because nothing in its context tied the earlier message to the current one.

Customer support bots have a specific context structure that most teams design wrong. There are four distinct information sources, and each one needs to be handled differently: the policy knowledge base, the customer’s profile and order history, the current conversation, and prior interaction history. Teams that treat all of these as one undifferentiated pile of context produce bots that hallucinate, contradict themselves, and frustrate customers with generic responses that don’t acknowledge what they’ve already said.

Policy Documents: Ground, Don’t Assume

The single most important thing to get right in a support bot is policy grounding: the model must answer from your actual policy documents, not from its training data’s approximation of what policies typically look like. These differ in the specifics that matter most: exact return windows, specific eligibility conditions, exceptions, and recent changes.

This requires the Grounding pattern. Retrieval gets the policy document into context; grounding makes the model actually use it.

Answer using ONLY the information in the policy documents below.
If the policy does not cover this question, say:
"I don't have specific information on that. Let me connect
you with our support team."

Do not use general knowledge about typical policies.

Policy Documents:
[retrieved policy sections, most relevant first]

The instruction to not use general knowledge is necessary. Without it, the model treats policy documents as suggestions and fills gaps with plausible-sounding but incorrect information drawn from training data.
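Wiring this instruction into the prompt assembly is straightforward. A minimal sketch, in which the `GROUNDING_INSTRUCTION` text mirrors the example above and the `build_grounded_prompt` helper is illustrative rather than from any specific framework:

```python
# Illustrative sketch: the instruction text is the example above;
# the helper name and section numbering are assumptions.
GROUNDING_INSTRUCTION = (
    "Answer using ONLY the information in the policy documents below.\n"
    "If the policy does not cover this question, say:\n"
    "\"I don't have specific information on that. Let me connect\n"
    "you with our support team.\"\n\n"
    "Do not use general knowledge about typical policies."
)

def build_grounded_prompt(policy_sections):
    """Instruction first, then retrieved sections, most relevant first."""
    numbered = "\n\n".join(
        f"[{i + 1}] {section}" for i, section in enumerate(policy_sections)
    )
    return f"{GROUNDING_INSTRUCTION}\n\nPolicy Documents:\n{numbered}"
```

Keeping the instruction adjacent to the retrieved sections, rather than buried elsewhere in the system prompt, makes it harder for the model to treat the two as unrelated.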

Retrieval quality matters here. A policy knowledge base with dense, overlapping sections needs semantic chunking and re-ranking; a simple keyword search will surface the wrong paragraph when a customer asks about a variant the indexing doesn’t match. Anthropic’s contextual retrieval approach, where each chunk carries context about its source and section, works well for policy documents because it preserves the relationship between a policy rule and any exceptions that appear nearby.
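The chunk-level context can be as simple as a breadcrumb prepended before indexing. A sketch under that assumption (the function name and breadcrumb format are illustrative, not Anthropic’s implementation):

```python
def contextualize_chunk(chunk_text, doc_title, section_path):
    """Prefix a chunk with its document and section path so the indexed
    text carries enough context to match variant phrasings and keeps a
    policy rule tied to exceptions that live in the same section."""
    breadcrumb = " > ".join([doc_title] + section_path)
    return f"[{breadcrumb}]\n{chunk_text}"
```

The annotated text is what gets embedded; at answer time the breadcrumb also tells the model which policy and section a retrieved passage came from.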

Customer Profile: Inject Once, Reference Often

A customer who has to re-explain their order number, subscription tier, or recent purchase history at the start of every conversation is getting a worse experience than they’d get from a keyword search tool. The support bot should arrive knowing who the customer is.

The profile injection belongs at the top of the system prompt, in the Pyramid pattern’s domain layer:

Customer: Jane Smith (customer since 2023)
Subscription: Pro tier, annual billing, renews 2026-08-15
Recent orders: #78234 (shipped 2026-02-28), #77891 (delivered 2026-02-10)
Open tickets: None
Previous contacts: 2 (last contact: 2026-01-14, resolved: billing question)

Keep this section compact. Five to eight key facts are enough; a 2,000-token customer profile is a budget problem waiting to happen. The model needs actionable facts for the current session, so inject those and fetch the rest on demand. Use Select, Don’t Dump here as rigorously as anywhere.

The profile section should never contain sensitive data beyond what’s needed for the current session. Account numbers, payment details, and full order history can be fetched on demand via tool calls when the customer actually asks about them; they don’t need to pre-load in every context.
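In practice this means whitelisting fields rather than dumping the CRM record. A minimal sketch, where the field names and `build_profile_block` helper are assumptions for illustration:

```python
# Only these fields are injected per session; everything else
# (payment details, full order history) stays behind tool calls.
SESSION_FIELDS = (
    "name", "customer_since", "subscription",
    "recent_orders", "open_tickets", "previous_contacts",
)

def build_profile_block(crm_record, fields=SESSION_FIELDS):
    """Render only the whitelisted, present fields as compact lines."""
    lines = []
    for field in fields:
        value = crm_record.get(field)
        if value is not None:
            label = field.replace("_", " ").capitalize()
            lines.append(f"{label}: {value}")
    return "\n".join(lines)
```

A whitelist fails safe: a new sensitive column added to the CRM never leaks into the prompt by default.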

Conversation History: Budget It Before It Overflows

Support conversations are multi-turn by nature, and the context window fills up faster than most teams expect. A typical policy document consumes 3,000-5,000 tokens; the customer profile another 500; and a conversation that spans 15 message pairs is already at 2,000 tokens of history, often more if the earlier turns contained product details or policy excerpts.

The failure mode is gradual: the bot works well for the first ten turns, then quality degrades as history pushes other context out of the effective window. Teams notice resolution rates drop in long sessions and attribute it to model limitations. It’s usually a budget problem.
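The arithmetic above can be turned into a cheap runtime check. A back-of-envelope sketch, assuming a chars/4 token estimate and an illustrative cap; both numbers are assumptions, not recommendations:

```python
def estimate_tokens(text):
    # Rough chars-per-token approximation; swap in a real tokenizer
    # (e.g. your provider's token counter) in production.
    return len(text) // 4

def over_budget(policy, profile, history, cap=8000):
    """Return how many tokens the assembled context exceeds the cap by
    (0 if it fits). History is the component to compress first."""
    used = sum(estimate_tokens(t) for t in (policy, profile, history))
    return max(0, used - cap)
```

Running this check before each turn makes the overflow visible at turn three instead of as a quality mystery at turn fifteen.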

Set a hard cap on conversation history. The exact share depends on your use case; the important thing is having a cap enforced before it becomes a problem. When the conversation exceeds that limit, compress the oldest turns before appending new ones:

def compress_history(messages, token_limit, keep_recent=6):
    # count_tokens and llm are assumed helpers from the surrounding system
    total = count_tokens(messages)
    if total <= token_limit:
        return messages
    # Keep the most recent turns intact
    recent = messages[-keep_recent:]
    # Summarize everything before them
    older = messages[:-keep_recent]
    if not older:
        return messages  # nothing left to compress
    summary = llm.summarize(
        older,
        instruction=(
            "Summarize the key facts established: "
            "what the customer needs, what was tried, "
            "what was promised."
        ),
    )
    return [{"role": "system", "content": f"Earlier context: {summary}"}] + recent

This applies Compress & Restart within the session. The model retains what was established (the customer’s issue, any commitments made, what was already tried) without carrying the full verbatim history.

Prior Interactions: Use Temporal Decay

A customer’s last contact from eight months ago is less relevant than one from last week. Including a full interaction history at equal weight is actively harmful; the model may cite a resolution from a previous ticket that’s no longer applicable.

Apply Temporal Decay: weight recent interactions heavily, compress or exclude older ones. A practical approach:

  • Last 30 days: include key facts from prior contacts directly
  • 30-90 days: include as a brief summary line (“Prior contact Feb 2026: billing dispute, resolved”)
  • Older than 90 days: exclude unless the current query explicitly references a past interaction

If the current session involves an escalation or a return visit on an unsolved issue, the prior interaction context becomes more relevant and should be elevated; the decay is a default that can be overridden.
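The tiers above are a few lines of date arithmetic. A sketch assuming each prior contact carries both a `facts` string and a pre-written one-line `summary` (those field names, and the `elevated` flag, are illustrative):

```python
from datetime import date

def decayed_prior_contacts(contacts, today, elevated=False):
    """Tiered temporal decay: full facts for the last 30 days, one
    summary line for 30-90 days, and drop anything older unless the
    session is elevated (escalation or explicit reference back)."""
    included = []
    for contact in contacts:
        age = (today - contact["date"]).days
        if age <= 30:
            included.append(contact["facts"])
        elif age <= 90 or elevated:
            included.append(contact["summary"])
    return included
```

The `elevated` flag is the override: on an escalation or a return visit, old contacts re-enter as summary lines rather than being silently dropped.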

Handoff Context: What the Human Agent Needs

When a conversation escalates to a human agent, the context assembled for the bot doesn’t automatically transfer in a useful form. The human agent gets a wall of conversation transcript and has to read it to understand the situation.

The Write Outside the Window pattern applies here: generate a structured handoff summary as a durable artifact at escalation time, separate from the conversation history.

Handoff Summary (generated at escalation)
Customer: Jane Smith | Order: #78234
Issue: Claims item not received despite "delivered" status
Attempted: Checked tracking (shows delivered 2/28), offered redelivery (declined)
Customer expectation: Full refund
Commitment made: None yet
Priority signals: Annual Pro subscriber, 3rd contact in 90 days

This is the context the human agent needs. It’s tighter than the full transcript, surfaces the commitment state (nothing was promised), and includes the signals that affect how the agent should approach the conversation.
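Making the handoff a fixed schema, rather than free text the model may or may not produce, keeps every escalation consistent. A sketch whose field names mirror the example above (the class itself is an assumption, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass
class HandoffSummary:
    """Structured escalation artifact, written outside the window at
    handoff time; the bot fills these fields, the agent reads them."""
    customer: str
    order: str
    issue: str
    attempted: str
    expectation: str
    commitment: str
    priority_signals: str

    def render(self) -> str:
        return (
            f"Customer: {self.customer} | Order: {self.order}\n"
            f"Issue: {self.issue}\n"
            f"Attempted: {self.attempted}\n"
            f"Customer expectation: {self.expectation}\n"
            f"Commitment made: {self.commitment}\n"
            f"Priority signals: {self.priority_signals}"
        )
```

Because the schema is fixed, a missing field (say, no commitment recorded) is a visible gap rather than a detail lost in the transcript.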

Common Mistakes

No grounding instruction: Teams add a policy knowledge base to the retrieval pipeline, see the bot referencing policy content, and assume grounding is working. It isn’t, unless you explicitly instruct the model to answer only from the retrieved context and to say so when that context doesn’t cover the question; without that instruction, the model will confidently fill gaps from training data.

Profiling overload: Injecting the full CRM record (50+ fields) to cover every possible question means 40 of those fields consume tokens on every turn where they’re irrelevant. Include what’s actionable for the current session and fetch the rest on demand.

No history budget: The most common cause of quality degradation in long support sessions is uncapped conversation history crowding out the policy documents and customer profile that the model actually needs; set limits before you hit them, not after.

Ignoring the handoff: The escalation moment is when the context engineering work pays off for human agents too, and a bot that escalates without a structured summary is handing off noise rather than information.