Evaluating and Observing Context Quality

Most teams have no idea whether their context engineering is actually working. They ship a RAG pipeline, check that it returns answers, and call it done. Here is how to measure what is actually happening inside the context window.

Why Measuring Context Quality Is Harder Than It Sounds

Output quality and context quality are not the same thing, and this distinction is where most teams go wrong. A model can produce a confident, well-formed answer from bad context and often does; the answer sounds right, the customer doesn’t flag it, and the bad context pattern gets baked into the pipeline. The problem surfaces weeks later when something goes visibly wrong and there’s no diagnostic data to explain why.

Context quality has to be measured independently from output quality, at multiple points in the pipeline. The questions you’re trying to answer are: is the right information in the window, is it in the right proportion, is the window getting so long that quality is degrading, and are changes to the context strategy actually improving outcomes? None of these are answered by monitoring model outputs alone.

Four Metrics That Matter

Context Relevance

The fraction of context tokens that are directly relevant to the current query. A context window with 10,000 tokens where 3,000 are relevant has a relevance ratio of 0.30, a common number for teams that haven’t applied Select, Don’t Dump, and one that roughly predicts their quality ceiling.

Computing this requires a relevance signal. For RAG pipelines, RAGAS provides a context relevance metric that uses the model itself to score each retrieved chunk against the query; it costs inference tokens, but running it on a sample of production traffic is worth the expense. The absolute number matters less than the trend: track it weekly and watch for drops, which indicate something changed in your retrieval or context assembly.

from ragas import evaluate
from ragas.metrics import context_relevancy

# Score a sample of production requests; the dataset needs the
# question and retrieved-contexts columns that RAGAS expects.
results = evaluate(
    dataset=production_sample,
    metrics=[context_relevancy],
)
print(f"Context relevance: {results['context_relevancy']:.2f}")

Context Utilization

The percentage of the context the model actually referenced when producing its answer. This is harder to measure directly but can be approximated through attribution: ask the model to cite the source for each claim in its response, then check how many chunks were cited versus how many were included.

High utilization, roughly 70% or above, means the context selection is working well; low utilization means you’re including content the model is ignoring, either because it’s irrelevant or because it’s buried at positions where attention is weak. Attention Anchoring failures show up here as selective utilization: the model uses context from the start and end of the window and ignores the middle.
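The attribution approach can be sketched in a few lines. This assumes you prompted the model to tag each claim with the id of the chunk it came from, e.g. "[chunk-3]"; the tag format and function names here are illustrative, not a standard.

```python
# Sketch: approximate context utilization via citation attribution.
# Assumes the model was asked to tag each claim with its source chunk id.
import re

def context_utilization(response_text: str, included_chunk_ids: list[str]) -> float:
    """Fraction of included chunks the model actually cited."""
    cited = set(re.findall(r"\[(chunk-\d+)\]", response_text))
    if not included_chunk_ids:
        return 0.0
    return len(cited & set(included_chunk_ids)) / len(included_chunk_ids)

answer = "The refund window is 30 days [chunk-1]. Shipping is free over $50 [chunk-4]."
print(context_utilization(answer, ["chunk-1", "chunk-2", "chunk-3", "chunk-4"]))  # 0.5
```

Logging which chunk ids were cited, not just the ratio, also lets you check for the positional pattern described above: if the uncited chunks cluster in the middle of the window, that points at attention, not relevance.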

Context Redundancy

How much of your context is duplicate or near-duplicate information, which consumes tokens without adding signal since the model doesn’t reason better because it saw the same fact twice. In practice, redundancy comes from three sources: overlapping retrieved chunks that contain the same paragraph, conversation history that repeats established facts, and system prompt boilerplate that reappears in multiple sections.

Measure this by embedding your context chunks and computing pairwise cosine similarity; chunks with high similarity are candidates for deduplication. Most production pipelines that haven’t addressed this have more redundancy than you’d expect, and it’s budget burned for zero benefit.

Context-to-Output Ratio

The ratio of input tokens to output tokens, useful as a rough trend indicator. Support bots and extraction tasks naturally run high ratios because you’re processing a lot of context to produce a short answer, while creative or synthesis tasks run lower ones because a short brief yields a long response. What you’re watching for is the ratio increasing over time, which happens in agent loops and long conversations as context accumulates. A context-to-output ratio that doubles over 10 turns is a Compress & Restart trigger, not a coincidence.
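The doubling check is a few lines once you log per-turn token counts, which in practice come from your provider’s usage metadata; the session numbers below are illustrative.

```python
# Sketch: track the input/output token ratio per turn and flag when it
# doubles relative to the session's first turn.
def ratio_trend(turns: list[tuple[int, int]]) -> list[float]:
    """turns: (input_tokens, output_tokens) per turn."""
    return [inp / out for inp, out in turns]

def compress_restart_trigger(turns: list[tuple[int, int]], factor: float = 2.0) -> bool:
    ratios = ratio_trend(turns)
    return ratios[-1] >= factor * ratios[0]

session = [(4000, 400), (6000, 350), (9000, 300), (14000, 320), (21000, 250)]
print(compress_restart_trigger(session))  # True: ratio grew from 10.0 to 84.0
```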

Detecting Context Rot in Production

Context rot doesn’t announce itself; output quality degrades gradually, and the first visible signals are often indirect: customer satisfaction scores dropping, human escalation rates ticking up, users starting to rephrase their questions. By the time those signals appear, the context problem has been happening for a while.

Better signals to instrument directly:

Context length vs. answer quality correlation. Log context token counts alongside any quality signal you have (human ratings, thumbs up/down, resolution rates) and plot them together. If quality degrades as context length increases past a threshold, you’ve found your effective window boundary, which in most deployments lands between 30k and 60k tokens, well below the advertised limits.
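The correlation doesn’t need anything fancier than binning. This sketch assumes `logs` holds (context_tokens, quality_flag) records from your existing tracing; the bin size and field names are illustrative.

```python
# Sketch: bin logged requests by context length and compute the quality
# rate per bin to locate the effective window boundary.
from collections import defaultdict

def quality_by_length_bin(logs: list[tuple[int, int]], bin_size: int = 10_000) -> dict:
    """logs: (context_tokens, quality_flag) pairs; flag is 1 for good, 0 for bad."""
    bins = defaultdict(list)
    for tokens, good in logs:
        bins[tokens // bin_size * bin_size].append(good)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

logs = [(8_000, 1), (12_000, 1), (35_000, 1), (42_000, 0), (55_000, 0), (61_000, 0)]
print(quality_by_length_bin(logs))
# {0: 1.0, 10000: 1.0, 30000: 1.0, 40000: 0.0, 50000: 0.0, 60000: 0.0}
```

A sharp drop between adjacent bins, like the one between 30k and 40k above, is the boundary you’re looking for.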

Session turn count vs. quality. In multi-turn applications, log which turn number each interaction is at and compare quality metrics across early turns (1-5) and late turns (15+). A meaningful drop from early to late turns is a history management problem, and it’s worth knowing about before your users tell you.

Timestamp spread of context sources. If your context includes documents with timestamps, log the age distribution of what’s being retrieved. A support bot that consistently retrieves policy documents from 18 months ago despite having current versions is a Temporal Decay problem hiding in plain sight.

Observability Tooling

The major LLM observability platforms (LangSmith, Braintrust, Helicone, Arize Phoenix) all provide trace-level visibility into what’s in the context window at each step, but the gap most teams hit is that they set up tracing for outputs and costs without tracing context composition.

The useful things to log at trace time, beyond token counts:

  • Which chunks were retrieved and their relevance scores
  • The full assembled context, not just the user message
  • Context section sizes broken down by system prompt, history, retrieved docs, and current turn
  • Any truncation that happened and what got cut
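The fields above fit in one trace record per request. The schema and the whitespace-split token counting here are illustrative stand-ins, not any platform’s format; in production you’d use your tokenizer and your tracing SDK.

```python
# Sketch: one trace record describing context composition for a request.
import json
import time

def log_context_trace(system_prompt, history, retrieved, current_turn, truncated=None):
    """Build a context-composition record alongside the usual output/cost traces."""
    return {
        "ts": time.time(),
        "section_tokens": {  # budget breakdown by context section
            "system_prompt": len(system_prompt.split()),  # whitespace stand-in for a tokenizer
            "history": sum(len(m.split()) for m in history),
            "retrieved_docs": sum(len(c["text"].split()) for c in retrieved),
            "current_turn": len(current_turn.split()),
        },
        "retrieved_chunks": [  # which chunks made it in, and their scores
            {"id": c["id"], "score": c["score"]} for c in retrieved
        ],
        "truncated": truncated or [],  # descriptions of anything cut
    }

record = log_context_trace(
    system_prompt="You are a support assistant",
    history=["hi", "hello there"],
    retrieved=[{"id": "c1", "score": 0.91, "text": "Refunds are accepted within 30 days."}],
    current_turn="what is the refund window",
)
print(json.dumps(record["section_tokens"]))
```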

LangSmith’s dataset comparison workflow is the most practical way to A/B test context strategies. Capture a baseline set of production requests with their full context, run them through an alternative context assembly strategy, and compare output quality on matched pairs. The alternative doesn’t have to be better everywhere to be worth shipping; you need to know where it improves and where it regresses.
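Once both strategies have been scored on the same requests, summarizing the matched pairs is simple. This is not the LangSmith API, just the comparison logic over exported per-request quality scores; the scores below are illustrative.

```python
# Sketch: summarize matched-pair quality scores for two context strategies.
def matched_pair_summary(pairs: list[tuple[float, float]]) -> dict:
    """pairs: (baseline_score, alternative_score) for the same request."""
    wins = sum(base < alt for base, alt in pairs)
    losses = sum(base > alt for base, alt in pairs)
    return {"alt_wins": wins, "alt_losses": losses, "ties": len(pairs) - wins - losses}

pairs = [(0.6, 0.8), (0.7, 0.7), (0.9, 0.5), (0.4, 0.9)]
print(matched_pair_summary(pairs))  # {'alt_wins': 2, 'alt_losses': 1, 'ties': 1}
```

Keeping wins and losses separate, rather than averaging, is the point: it tells you where the alternative regresses, not just whether it wins on net.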

A Practical Starting Point

If your system has no context observability today, these three things will give you more signal than any sophisticated framework:

  1. Log context token counts by section (system prompt, retrieved docs, history, current turn) for every request. The distribution will immediately show which section is eating the budget.

  2. Sample a small percentage of production requests and run context relevance scoring on them weekly. Watch the number over time; if it’s dropping, something in your retrieval or context assembly changed.

  3. Correlate context length with your best available quality signal. Even a simple thumbs up/down or resolution flag is enough to find your effective window boundary. NoLiMa found that 11 of 13 models dropped below half their baseline performance at just 32k tokens; your effective boundary is likely in that range or lower, well below the advertised limits.

The expensive metrics, like LLM-as-judge scoring, attribution analysis, and full trace comparison, are worth adding once you have baseline numbers. Starting without any measurement is the common failure, because most teams only discover context problems when they cause visible production incidents, at which point there’s no historical data to debug with.