Context Engineering for Multi-Turn Conversations
Conversation history is the context problem most applications have and the one fewest teams think about. It grows unbounded, degrades quality silently, and fails in predictable ways that a small amount of engineering prevents.
The Conversation Context Problem
A fresh conversation and a twenty-turn conversation use the same model with very different results. The twenty-turn version has accumulated two problems at once: the history is consuming tokens that could go to retrieving fresh information, and the model’s attention is distributed across all those turns, spread thin where it should be focused on the current question.
Most applications handle this by appending every message to a list and sending the whole list on every call, which works until the conversation gets long, then works less well, and nobody notices because the failure is gradual. Users start getting vaguer responses and attribute it to the model being bad at their task. It’s almost always the history.
There are three approaches to managing conversation history, each with different characteristics, and none is universal; the right choice depends on what your application actually needs to preserve across turns.
Three History Strategies
Full Window (Naive)
Keep every message verbatim until you hit the context limit, then truncate from the oldest end.
def assemble_messages(history, new_message, limit=50_000):
    messages = history + [new_message]
    while count_tokens(messages) > limit:
        messages.pop(0)  # drop oldest message
    return messages
When it’s appropriate: short-lived sessions under 15 turns, tasks that genuinely require the verbatim wording of earlier messages (legal review, debugging exact error messages), and applications where every word of history is equally important.
The failure mode: silent quality degradation as history grows, because the model reads earlier turns but attention thins out across the window. Research on multi-turn conversations shows models tend to over-rely on early-turn assumptions, making answers more rigid as history grows. Abrupt truncation creates an even worse problem: the model’s context starts mid-conversation with no framing for what was established, so it fills in the gap with assumptions.
Rolling Summary
Maintain a structured summary of the conversation state, updated after each turn or every few turns. The model receives the current summary plus the last N verbatim turns.
def update_summary(current_summary, new_turns, model):
    prompt = f"""Update this conversation summary with the new turns below.
Preserve: decisions made, facts established, commitments given, open questions.
Drop: pleasantries, repeated questions, abandoned approaches.

Current summary:
{current_summary}

New turns:
{format_turns(new_turns)}

Updated summary:"""
    return model.complete(prompt)
def assemble_messages(summary, recent_history, new_message):
    context = f"Conversation so far:\n{summary}\n\n"
    return context + format_turns(recent_history) + new_message
When it’s appropriate: long-running sessions of 15+ turns, assistants that need to remember user preferences or established constraints across many messages, and applications where the semantic content of earlier turns matters but the exact wording does not.
The failure mode: compression loss, because a summary cannot perfectly preserve every nuance from the original turns. The model you use to summarize also makes its own mistakes, and if a key fact gets dropped or garbled in a summary update, it stays wrong for the rest of the session. For high-stakes applications, run the summarization step with a verification pass.
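For illustration, here is a minimal sketch of such a verification pass. The substring check and the deliberately lossy summarizer are stand-ins, not a prescribed API; a production verifier would use a second model call to confirm the facts survived.

```python
def verify_summary(updated_summary, must_preserve):
    """Return the tracked facts that the updated summary no longer
    mentions. A case-insensitive substring check is the cheapest
    stand-in for a real model-based verification call."""
    lowered = updated_summary.lower()
    return [fact for fact in must_preserve if fact.lower() not in lowered]

def update_summary_verified(current_summary, new_turns, must_preserve, summarize):
    """Run the summarizer, then re-append any tracked fact it dropped
    instead of letting it stay wrong for the rest of the session."""
    updated = summarize(current_summary, new_turns)
    missing = verify_summary(updated, must_preserve)
    if missing:
        updated += "\nAlso established: " + "; ".join(missing)
    return updated

# A lossy summarizer drops the deadline; the verification pass restores it.
lossy = lambda summary, turns: "User wants a refund for order 1182."
result = update_summary_verified(
    "", [], ["order 1182", "deadline is Friday"], lossy
)
```

The `must_preserve` list is itself state you have to maintain, which is the real cost of this pattern: you only catch drops for facts you already decided to track.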
Hybrid: Verbatim Recent, Compressed Older
Keep the last 5-8 turns verbatim, summarize everything older. The verbatim turns provide exact phrasing for the current exchange while the summary provides session continuity.
VERBATIM_TURNS = 6
SUMMARY_TOKEN_LIMIT = 800

def assemble_messages(all_turns, new_message, summary):
    recent = all_turns[-VERBATIM_TURNS:]
    context_parts = []
    if summary:
        context_parts.append(f"Earlier in this conversation:\n{summary}")
    context_parts.extend(format_turns(recent))
    context_parts.append(new_message)
    return context_parts
This is the approach that works for most general-purpose chat applications; the recent turns handle the immediate exchange naturally while the summary prevents the model from losing track of what was established earlier. The summary gets updated every 5-10 turns as the verbatim window slides forward.
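One way to drive that update cadence is to track how far the summary has caught up and fold turns into it only once enough have aged out of the verbatim window. A sketch under assumed names (`summarize` is any callable that merges turns into a summary; the toy concatenating version exists only for demonstration):

```python
VERBATIM_TURNS = 6
SUMMARY_EVERY = 5  # fold aged-out turns into the summary in batches of 5+

def maybe_update_summary(state, summarize):
    """state holds 'turns' (all turns so far), 'summary', and
    'summarized_upto' (index of the first turn not yet summarized).
    Only turns that have left the verbatim window get summarized,
    so nothing is compressed twice."""
    boundary = len(state["turns"]) - VERBATIM_TURNS
    newly_aged = state["turns"][state["summarized_upto"]:max(boundary, 0)]
    if len(newly_aged) >= SUMMARY_EVERY:
        state["summary"] = summarize(state["summary"], newly_aged)
        state["summarized_upto"] = boundary
    return state

# Toy summarizer that just concatenates, for demonstration.
concat = lambda summary, turns: (summary + " " + " ".join(turns)).strip()
state = {"turns": [f"turn-{i}" for i in range(12)], "summary": "", "summarized_upto": 0}
state = maybe_update_summary(state, concat)
```

Tracking `summarized_upto` explicitly also makes the common off-by-one bug visible: without it, each cycle re-summarizes turns the summary already contains.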
Topic Shifts
When the conversation topic changes significantly, carrying the previous topic’s context forward is often a liability rather than a benefit. The model has to read through irrelevant history to answer the current question, attention dilutes across two unrelated topics, and earlier context can bleed into answers where it doesn’t belong.
Topic shifts have two patterns that need different responses:
Natural pivot within a session: The user finishes one task and starts another, so the prior topic’s history should be summarized and archived. You can detect this programmatically when embedding similarity between the new message and the recent turns drops significantly, or let users signal it explicitly via a “new topic” UI gesture. The specific threshold depends on your embedding model and domain; calibrate against a few examples of real topic shifts in your data.
Returning to an earlier topic: The user circles back to something discussed three topics ago; the verbatim turns for that topic are gone, but the summary should contain the key facts. If it doesn’t, that’s a summary quality problem worth fixing. Applications that expect frequent topic switching benefit from per-topic summaries: a dictionary keyed by topic with the conversation state for each, rather than a single rolling summary that blends everything together.
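The embedding-similarity detection described above can be sketched as follows. The bag-of-words `embed` is a toy stand-in for a real embedding model, and the 0.2 threshold is an arbitrary placeholder for the calibrated value:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; swap in your real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

SHIFT_THRESHOLD = 0.2  # placeholder; calibrate on real topic shifts

def is_topic_shift(new_message, recent_turns):
    """Flag a shift when the new message barely overlaps recent turns."""
    return cosine(embed(new_message), embed(" ".join(recent_turns))) < SHIFT_THRESHOLD
```

On a detected shift, archive the outgoing topic's summary under its own key in the per-topic dictionary rather than blending it into a single rolling summary.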
The Middle Problem in Conversations
The “lost in the middle” effect (see Attention Anchoring) applies directly to conversation history. In a 30-turn conversation, turns 5-25 receive systematically less attention than the opening turns and the most recent ones, which means the model tends to forget commitments made mid-conversation, lose track of constraints established in early-middle turns, and over-index on whatever was discussed most recently.
Two mitigations that work in practice:
Restate critical context at anchor positions: If a constraint was established in turn 8 of a 25-turn conversation, restate it in the system prompt update for the next summary cycle. The middle of a long history is where attention is weakest; don’t rely on the model to surface it from there.
Surface commitments in the summary: The rolling summary should explicitly carry any commitments, established constraints, or open items forward. These facts belong at the start of the context where they get attention, not buried in turn 12 of a verbatim transcript.
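Both mitigations amount to controlling where facts land in the assembled context. A minimal sketch, where the section labels and structure are assumptions rather than a required format:

```python
def assemble_with_anchors(commitments, summary, recent_turns, new_message):
    """Put commitments and constraints first, where attention is
    strongest, rather than leaving them buried mid-history."""
    parts = []
    if commitments:
        parts.append("Active commitments and constraints:\n- " + "\n- ".join(commitments))
    if summary:
        parts.append("Conversation so far:\n" + summary)
    parts.extend(recent_turns)
    parts.append(new_message)
    return "\n\n".join(parts)
```

The commitments list is maintained alongside the rolling summary: whenever the summarizer runs, anything it classifies as a commitment or constraint is promoted into this list instead of staying inline in the summary text.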
Returning Users
A returning user from a previous session is a different problem from a long conversation. You can’t include the prior session verbatim, since that might be 20k tokens from a conversation a week ago, but you also can’t start cold. The Write Outside the Window pattern applies here: persist a compact user memory artifact at session close, inject it at session open.
# At session close
def generate_session_memory(conversation, user_id):
    memory = model.complete(f"""Summarize what to remember about this user:
- Preferences stated
- Decisions made
- Ongoing tasks or open questions
- Context about their situation

Conversation:
{conversation}

Memory (max 200 words):""")
    store.save(user_id, memory)

# At session open
def open_session(user_id, system_prompt):
    memory = store.load(user_id)
    if memory:
        return system_prompt + f"\n\nReturning user context:\n{memory}"
    return system_prompt
Keep the injected memory compact at 150-300 tokens. The temptation is to inject everything from the prior session, but the result is a bloated system prompt that consumes budget on every turn without contributing much to the current conversation. Inject what’s actionable for the new session; leave the rest in storage where it can be retrieved on demand if the conversation goes that direction.
Common Mistakes
Truncating from the oldest end without a summary: Hard truncation leaves the model starting mid-conversation with no framing, and the model fills in the gap with assumptions that are often wrong. Always summarize before dropping turns.
One-size-fits-all rolling summaries: A summary instruction that says "summarize the conversation" produces vague summaries that preserve everything equally; decisions and constraints get the same weight as pleasantries. A good summary preserves decisions and constraints aggressively, drops pleasantries and repeated questions completely, and notes abandoned approaches briefly so the model doesn't re-suggest them.
Ignoring the returning-user problem: The first message of a new session with a returning user often assumes continuity that doesn’t exist in the context window (“I was working on X yesterday”). Either inject prior session context or handle the undefined reference gracefully, because the alternative is the model confidently bluffing about something it has no context for.
Not testing at long session lengths: Most teams test at 5-10 turns, but production conversations can run 50+ turns. The bugs in history management don’t surface until 30 turns in, and they usually look like model quality problems rather than context problems, which means they get misdiagnosed and the wrong thing gets fixed.