Context Engineering for Legal and Compliance

Legal document analysis makes demands on context engineering that most domains don't: every claim must be traceable to a specific clause, hallucinated content creates liability, and the documents themselves are longer than most models can reliably process.

Most context engineering guides assume that a wrong answer is a bad user experience, something you can fix in the next release. In legal work, a wrong answer is a liability event. When a legal analysis tool hallucinates a contract clause, the downstream consequence isn’t a confused user; it’s a missed obligation, a failed compliance check, or advice that exposes someone to legal risk. The error tolerance is fundamentally different, and that changes which context engineering patterns apply and how aggressively you need to apply them.

Legal documents also have properties that make them uniquely challenging for LLMs. They’re long (a commercial lease runs 40-80 pages, a regulatory filing can exceed 200), deeply cross-referential (Section 4.2(b) might modify the definition in Section 1.15, which itself references an external regulation), and they use precise vocabulary where similar words carry materially different legal force (“shall” vs. “may” vs. “will”). These properties interact badly with context rot because the model needs to hold relationships between distant sections while processing dense, specialized language.

Document Structure as Context

Legal documents have explicit structure (numbered sections, defined terms, cross-references) that most context assembly pipelines ignore, and when you chunk a contract for retrieval, you lose the structure that gives each clause its meaning.

Preserve section hierarchy: When extracting clauses for context, include the section path: “Article 4 > Section 4.2 > Subsection (b)” rather than just the raw text. The model needs to know where a clause sits in the document’s hierarchy to understand its scope and force.
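As a minimal sketch of this, the snippet below formats an extracted clause with its full section path as a breadcrumb header. The `Clause` record and field names are illustrative, not from any real library:

```python
from dataclasses import dataclass

# Hypothetical clause record; in practice this would come from a
# document parser that tracks the section hierarchy.
@dataclass
class Clause:
    path: list[str]  # e.g. ["Article 4", "Section 4.2", "Subsection (b)"]
    text: str

def format_clause(clause: Clause) -> str:
    """Prefix the clause text with its section path so the model sees
    where the clause sits in the document hierarchy."""
    breadcrumb = " > ".join(clause.path)
    return f"[{breadcrumb}]\n{clause.text}"

clause = Clause(
    path=["Article 4", "Section 4.2", "Subsection (b)"],
    text="Either party may terminate this Agreement upon...",
)
print(format_clause(clause))
```

The breadcrumb costs a handful of tokens per clause and lets the model distinguish, say, a termination right in the main body from one buried in an exhibit.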

Resolve cross-references before insertion: If Section 7.3 references “the Termination Events defined in Section 5.1,” resolve that reference and include the definition inline or immediately adjacent. A model that sees Section 7.3 without the Section 5.1 definitions will either hallucinate them or reason incorrectly about what triggers termination; this is the single most common failure mode in legal context engineering, where the model has one end of a cross-reference but not the other.
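One-level reference resolution can be sketched with a simple pattern match over a section store; the toy `sections` dict stands in for a parsed document, and nested references would need a recursive walk:

```python
import re

# Toy section store; in practice this comes from parsing the contract.
sections = {
    "5.1": '"Termination Events" means any of the following: ...',
    "7.3": "Upon any of the Termination Events defined in Section 5.1, "
           "the Supplier may suspend performance.",
}

SECTION_REF = re.compile(r"Section (\d+(?:\.\d+)*)")

def resolve_references(section_id: str, store: dict[str, str]) -> str:
    """Return a section's text with every referenced section appended
    inline, so the model sees both ends of each cross-reference."""
    text = store[section_id]
    parts = [f"Section {section_id}: {text}"]
    for ref in SECTION_REF.findall(text):
        if ref in store and ref != section_id:
            parts.append(f"Referenced Section {ref}: {store[ref]}")
    return "\n\n".join(parts)

print(resolve_references("7.3", sections))
```

Real contracts also use references like "Article IV" or "clause 3(a)(ii)", so a production resolver needs a richer pattern than this single regex.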

Include defined terms: Legal documents define terms precisely (‘“Affiliate” means any entity that directly or indirectly controls…’), and those definitions change the meaning of every subsequent clause that uses the term. Include the definitions section alongside any clauses you extract, or at minimum include the definitions for terms that appear in the extracted clauses.

Citation Grounding

In legal analysis, an unsourced claim is worthless; the model needs to cite specific sections, clauses, and provisions for its analysis to be useful, and that requires explicit grounding instructions.

Analyze the following contract clauses for compliance with GDPR Article 28.
For each finding:
- Quote the specific contract language that is relevant
- Cite the section number (e.g., "Section 4.2(b)")
- State whether the clause meets, partially meets, or fails the requirement
- If it fails, quote the specific GDPR text it conflicts with

Do not make claims about the contract without citing a specific section.
If you cannot find relevant language in the provided sections, state
that the contract does not address the requirement in the sections provided.

That last instruction is critical. Without it, the model will fill gaps with plausible legal language that sounds authoritative and is completely fabricated. In legal work, “I don’t see this addressed in the provided sections” is a useful finding. It surfaces a gap the human reviewer needs to close.

Multi-Document Comparison

Comparing contracts (redline analysis, compliance checking against regulations) requires multiple documents in context simultaneously, and the combined token count almost always exceeds what the model can reliably process without careful curation.

Split by theme: Extract the relevant sections from each document organized by topic, rather than including full Document A and full Document B. For a compliance check, pull the data protection clauses from the contract alongside the corresponding regulatory requirements, so the model can compare parallel sections directly.
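A thematic assembly step might look like the sketch below, pairing a contract clause with its corresponding regulatory requirement under one heading. The dicts and function name are assumptions for illustration:

```python
# Hypothetical theme-indexed extracts from each document.
contract = {
    "data_protection": "Section 8.1: Data shall be retained for seven years...",
    "termination": "Section 5.1: Either party may terminate...",
}
regulation = {
    "data_protection": "GDPR Art. 5(1)(e): kept in a form which permits "
                       "identification of data subjects for no longer than...",
}

def themed_context(theme: str) -> str:
    """Place the contract clause and the regulatory requirement for one
    theme side by side so the model compares parallel sections directly."""
    return (f"Theme: {theme}\n\n"
            f"Contract:\n{contract[theme]}\n\n"
            f"Regulation:\n{regulation[theme]}")

print(themed_context("data_protection"))
```

Building one such block per theme, instead of concatenating both full documents, keeps the comparison local and the token count bounded.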

Use structured output for comparison: Apply Schema Steering to force the model to produce structured comparisons:

{
  "clause_topic": "Data retention",
  "contract_language": "Section 8.1: Data shall be retained...",
  "regulatory_requirement": "GDPR Art. 5(1)(e): kept in a form...",
  "assessment": "partial_compliance",
  "gap": "Contract specifies 7-year retention without..."
}

The schema prevents the model from producing vague narrative comparisons and forces it to ground each assessment in specific language from both documents.
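Enforcing that schema is straightforward with a small validator that rejects output missing required fields or using an assessment value outside a fixed vocabulary. The field names match the example above; the allowed assessment values are an assumption:

```python
import json

REQUIRED_KEYS = {"clause_topic", "contract_language",
                 "regulatory_requirement", "assessment", "gap"}
# Assumed vocabulary; pin it to whatever your schema actually allows.
ALLOWED_ASSESSMENTS = {"compliant", "partial_compliance", "non_compliant"}

def validate_comparison(raw: str) -> dict:
    """Parse a model-produced comparison and reject it if it omits
    required fields or invents an assessment value."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["assessment"] not in ALLOWED_ASSESSMENTS:
        raise ValueError(f"unknown assessment: {record['assessment']}")
    return record
```

Failed validation is a retry signal: re-prompt with the error message rather than passing a malformed comparison downstream.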

Budget Constraints

A 60-page contract can easily exceed 40k tokens. Including the full document means operating deep in Context Rot territory, where the model's ability to attend to any specific clause degrades significantly.

Targeted extraction over full inclusion: Unless the task requires reviewing the entire document (which should be decomposed into sub-tasks per section), extract only the clauses relevant to the specific question; a question about termination rights doesn’t need the representations and warranties section.

Two-pass analysis for broad reviews: First pass: include the table of contents and section headers with the task description. Let the model identify which sections are relevant. Second pass: include only those sections with full context and the specific analysis instructions. This is Progressive Disclosure adapted for legal documents.
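The two passes can be wired together as below; `ask_model` is a placeholder for whatever LLM client you use, not a real API:

```python
def two_pass_review(toc: str, sections: dict[str, str], task: str,
                    ask_model) -> str:
    """Pass 1: model sees only the table of contents and picks sections.
    Pass 2: model sees only the picked sections, in full."""
    picks = ask_model(
        f"Task: {task}\n\nTable of contents:\n{toc}\n\n"
        "List the section numbers relevant to this task, comma-separated."
    )
    relevant = [s.strip() for s in picks.split(",") if s.strip() in sections]
    body = "\n\n".join(f"Section {s}: {sections[s]}" for s in relevant)
    return ask_model(f"Task: {task}\n\n{body}")
```

The filter against known section IDs matters: models occasionally nominate sections that don't exist, and silently dropping those is safer than a KeyError mid-pipeline.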

Common Mistakes

Full document inclusion: Including a 60-page contract as a single context block. The model will appear to analyze it, but sections in the middle receive significantly less attention than those at the start and end, which means critical clauses in sections 10-30 get systematically under-analyzed.

Missing cross-reference resolution: Including a clause that references definitions or conditions in other sections without including those referenced sections. The model fills the gap with training-data-derived assumptions, which may not match the contract’s actual terms.

No explicit “I don’t know” instruction: Without it, the model will fabricate plausible legal analysis for questions the provided context doesn’t address. In legal work, confident fabrication is worse than admitting the limitation.

Treating legal analysis as single-shot: Complex legal review should be decomposed into focused sub-tasks, each with its own curated context, rather than asking the model to review an entire agreement in one pass. Section-by-section analysis with aggregated findings produces more reliable results than a single full-document pass.
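The decomposed flow is a map-then-aggregate loop, sketched below; as before, `ask_model` stands in for your LLM call:

```python
def review_sections(sections: dict[str, str], question: str,
                    ask_model) -> str:
    """Run one focused analysis per section with its own curated context,
    then aggregate the per-section findings into a single report."""
    findings = []
    for sid, text in sections.items():
        findings.append(ask_model(
            f"{question}\n\nSection {sid}:\n{text}\n\n"
            "Cite section numbers; if this section does not address the "
            "question, say so."
        ))
    return ask_model(
        "Aggregate these per-section findings into a single report, "
        "preserving all citations:\n\n" + "\n\n".join(findings)
    )
```

Each per-section call gets the cross-reference and definition treatment described earlier, so no sub-task sees a clause stripped of its meaning.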