Context Engineering for Data Extraction
Extracting structured data from documents is one of the highest-value LLM use cases in production, and also where poor context engineering shows up most visibly: missing fields, wrong values, and silent failures that corrupt downstream systems.
The Document Context Problem
Most teams treat extraction as a schema problem: define the right JSON shape, pass it to the model, parse the output. When this breaks, they add retry logic. When retries don’t fix it, they add validation. When validation fails too often, they assume the model is the problem.
The model is usually not the problem; the context is. Extraction failures cluster into three categories, and all three are context failures: the document is presented in a format the model can’t reliably parse; the schema doesn’t tell the model what to do when fields are absent or ambiguous; or the document is too long and the relevant information is diluted by irrelevant content. Fix the context and most extraction problems resolve without touching the schema or the model.
Document Context Preparation
The difference between extraction from raw PDF text and extraction from cleaned, structured plain text is substantial. Models trained primarily on structured text perform better when documents are converted to a format that mirrors that training distribution than when they are given raw PDF parser output full of garbled whitespace, split words, and floating headers.
Before a document enters the extraction context, three things matter:
Normalization: Tables in PDF text often serialize as space-separated columns that look like noise. Convert them to Markdown tables or pipe-delimited format. Headers that repeat across pages add noise; strip them after the first occurrence. Footnotes mixed into body text confuse field attribution; move them to a clearly delimited section at the end.
Anchoring: Tell the model explicitly what type of document it is reading and what it is extracting. “The following is a commercial lease agreement. Extract the fields below.” gives the model prior context that shapes interpretation of ambiguous terms. Without this, “term” in a contract and “term” in a software license mean different things, and the model has to infer which one applies.
Scope reduction: For long documents, extract from sections. A 40-page contract has a specific section where payment terms appear; send that section and leave the rest out. This is the Select pattern applied to document extraction: send the minimum context that contains the answer. For most structured document types (invoices, contracts, intake forms), the relevant fields map to identifiable sections, and a lightweight pre-processing step that locates those sections before calling the model pays off at scale.
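A lightweight section locator does not need a model call at all. The sketch below assumes numbered section headings (common in contracts) and a hand-maintained map from fields to heading keywords; both the map and the heading regex are assumptions you would adapt to your own corpus.

```python
import re

# Hypothetical field-to-section map; adjust to your document corpus.
SECTION_FOR_FIELD = {
    "payment_due_date": r"payment terms",
    "termination_notice_days": r"termination",
}

def scope_to_section(document: str, field: str) -> str:
    """Return only the section likely to contain the field, falling back to the full text."""
    pattern = SECTION_FOR_FIELD.get(field)
    if not pattern:
        return document
    # Split on lines that look like numbered section headings (e.g. "3. Payment Terms").
    sections = re.split(r"\n(?=\d+\.\s)", document)
    for section in sections:
        if re.search(pattern, section, re.IGNORECASE):
            return section
    return document
```

Falling back to the full document when no section matches keeps the pre-processing step safe: the worst case is the status quo, never a missed field.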
Schema Design for Extraction
The extraction schema is part of the context. A schema with a date field named "date" teaches the model nothing. A schema whose field "payment_due_date" carries the description "The date by which payment must be received, in ISO 8601 format. Look in the Payment Terms section." gives the model a retrieval signal and a format constraint in one instruction.
Every field should have a description that includes where to find it and what to do if it's not present. The most common extraction failure is a model hallucinating a value because the field exists in the schema but not in the document. The fix is explicit nullability: adding "If this field is not present in the document, return null." to the field description eliminates most hallucinations on optional fields without any other changes. The Schema Steering pattern covers this in depth.
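Put together, a field description carries three things: a retrieval signal, a format constraint, and a null instruction. A minimal JSON Schema sketch, with illustrative field names:

```python
# Sketch of an extraction schema; field names and sections are illustrative.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "payment_due_date": {
            "type": ["string", "null"],
            "description": (
                "The date by which payment must be received, in ISO 8601 "
                "format. Look in the Payment Terms section. If this field "
                "is not present in the document, return null."
            ),
        },
        "late_fee_percent": {
            "type": ["number", "null"],
            "description": (
                "Late fee as a percentage of the invoice total. Look in the "
                "Payment Terms section. If this field is not present in the "
                "document, return null."
            ),
        },
    },
    "required": ["payment_due_date", "late_fee_percent"],
}
```

Note that every field is required but nullable: the model must always answer, and "absent" is a valid answer, which is what prevents plausible-sounding inventions.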
Avoid nested schemas for initial extraction: Flat schemas with explicit field descriptions outperform deeply nested ones on extraction accuracy, particularly on edge cases. Nest when the data is genuinely hierarchical (a line-item table inside an invoice), not when nesting is just an organizational preference.
Enumerate constrained fields: If a field can only take specific values, list them. "payment_method": "One of: wire_transfer, check, ach, credit_card. If unspecified, return null." dramatically reduces variance on fields where the document might use synonyms or abbreviations.
Multi-Pass Extraction
For complex documents, a single extraction pass produces worse results than a staged approach. The first pass extracts with high recall and low precision: pull everything that might be relevant for each field, including the surrounding sentence. The second pass normalizes and validates: given the extracted text, apply the format constraint and decide whether the extraction is confident.
This pattern costs more tokens but produces higher accuracy on documents with ambiguous or inconsistently formatted fields. The key design decision is what to do on the second pass when confidence is low. Three options: return null (safest for downstream systems), return the raw extracted text with a confidence flag (lets humans review), or trigger a third pass with a more targeted question. For production extraction pipelines, returning null and flagging the field for review is the right default; it surfaces the failure clearly rather than silently propagating a wrong value.
The Grounding pattern applies here directly: require the model to return the source text it used for each field alongside the normalized value. This makes validation simple (compare source to extracted value), makes debugging easy (see exactly what the model read), and reduces hallucination because the model must anchor its answer to actual document text.
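The validation half of this is a plain string check, no model required. A minimal sketch, assuming the only differences between quoted source and document are whitespace and casing introduced by normalization:

```python
import re

def source_is_grounded(document: str, source_text: str) -> bool:
    """Check that the model's quoted source actually appears in the document,
    tolerating whitespace and case differences from normalization."""
    def canon(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()
    return canon(source_text) in canon(document)
```

A field whose source text fails this check is a hallucination by construction, regardless of how plausible the extracted value looks.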
Handling Tables and Semi-Structured Content
Tables are the hardest case in document extraction. A well-formatted Markdown table in context extracts reliably. A serialized PDF table that appears as "Date Amount Description\n12/01 1500.00 Consulting fees..." does not.
For tabular data, convert to Markdown before extraction. If the source document has a table with headers, represent it as a proper Markdown table in the context. For tables without headers (common in financial documents), add synthetic headers based on what each column contains: | Item | Quantity | Unit Price | Total | is better than unmarked columns.
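Once rows and columns have been recovered from the PDF text, rendering them as Markdown is mechanical. A minimal sketch, assuming rows arrive as lists of cell strings and headers are either real or synthetic:

```python
def rows_to_markdown(rows: list[list[str]], headers: list[str]) -> str:
    """Render parsed table rows as a Markdown table with (possibly synthetic) headers."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

The hard part is upstream (recovering cell boundaries from serialized text); this step just guarantees the model sees the table in a form it was trained on.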
For semi-structured content like addresses, invoice line items, and party signatures, extract as a structured block and then normalize separately. One model call to pull the raw block, one more call to parse it into canonical fields. Combining both operations in a single call degrades accuracy when the raw content is irregular.
Common Mistakes
Using the document title as the only anchoring: “Extract fields from this document” is weaker than “This is a commercial invoice from [vendor]. Extract the billing fields below,” because naming the document type primes the model for the vocabulary and conventions of that type in a way that a generic instruction cannot.
Large schemas with many optional fields: Every optional field that could be null is an opportunity for the model to hallucinate something that sounds plausible. If a field is rarely populated in your document corpus, remove it from the schema and add it back only when it actually appears.
Treating validation as error recovery: Validation that catches wrong values after extraction is useful, but it’s not a substitute for extraction context that makes wrong values unlikely in the first place. If your validation is rejecting a meaningful percentage of outputs, the extraction context needs work rather than more validation rules.
Sending the full document when only a section matters: The Context Rot problem applies directly to extraction: a 40-page document with 38 irrelevant pages degrades extraction accuracy on the 2 relevant ones, and the evidence from NoLiMa and similar benchmarks is clear that relevant information buried in a large context is harder to retrieve than the same information in a focused context. Section-scoped extraction is worth the additional pre-processing cost.
Putting It Together
The extraction pipeline that works in production looks like this: normalize the document format, anchor it with document type and extraction intent, scope to the relevant section, use a schema with descriptions and explicit nullability, extract with source attribution, validate against source text, and flag low-confidence fields for human review rather than silently accepting them. Each step is a context decision, and each one affects accuracy more than model choice does.
A good baseline: start with section-scoped extraction on a flat schema with field descriptions and null instructions, require source attribution on every field, and treat persistent validation failures as a context problem to diagnose before reaching for a bigger model.
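The steps above compose into a short orchestration loop. This is a sketch, not a framework: `locate_section`, `extract_field`, and `validate` stand in for the scoping, two-pass extraction, and grounding checks described earlier, and all names are illustrative.

```python
def run_pipeline(document: str, fields: dict[str, str],
                 locate_section, extract_field, validate) -> dict:
    """Extract each field from its section, validating against source text.

    `fields` maps field names to their descriptions; the three callables are
    placeholders for the steps described in this section.
    """
    results, review_queue = {}, []
    for name, description in fields.items():
        section = locate_section(document, name)              # scope reduction
        value, source = extract_field(section, description)   # extraction + attribution
        if value is not None and validate(document, value, source):
            results[name] = value
        else:
            results[name] = None                              # null, never a guess
            review_queue.append(name)                         # flag for human review
    return {"fields": results, "needs_review": review_queue}
```

The loop makes the failure policy explicit in one place: anything that fails extraction or validation becomes a null plus a review-queue entry, so wrong values never propagate silently downstream.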