Executable Ground Truth
When correctness depends on precise logic, put the truth in executable code rather than prose or hand-written expected values. The model can write around the script, but the script owns the answer.
The Problem This Solves
Some tasks look like ordinary implementation work but hide a precision problem. Tax calculators, pricing rules, pension limits, eligibility checks, compliance scoring, date math, and currency rounding all depend on exact logic. A model can produce code for them, but it is a poor witness for whether the output numbers are correct.
The common failure is to ask the agent to write both the feature and the expected values for the tests. The result looks reassuring: end-to-end tests exist, fixtures are filled, assertions pass. But the same probabilistic system produced the implementation and the numbers that bless it, so the test can preserve a mistake instead of catching it.
How It Works
Move the domain truth into a small executable artifact that is separate from the product implementation. Most teams use a reference script or fixture generator; a rules-engine export, SQL query, or spreadsheet calculation can work when that is where the domain truth already lives. The artifact should take named inputs and produce the expected outputs deterministically.
This changes the model’s job: instead of mentally verifying arithmetic or inventing expected values, it wires the oracle into the workflow, generates fixtures, runs the script, feeds the outputs into Playwright or integration tests, and debugs mismatches. The script becomes runnable context with a stable interface.
The reference script still needs review. This pattern does not remove the need to understand the domain logic; it gives that logic one explicit home, where humans can inspect it and tests can reuse it.
Example
ZZP Pensioen Planner has a Dutch jaarruimte calculator. The UI asks for income, pension accrual, age, and previous unused allowance, then displays the maximum deductible contribution.
Without executable ground truth, an agent might create Playwright tests like this:
await expect(page.getByTestId('result')).toHaveText('€ 4.382');
That number may be right, but nothing in the test explains where it came from. If the agent calculated it wrong, the test now protects the wrong answer.
With executable ground truth, the expected value comes from a reference script:
node scripts/jaarruimte-oracle.mjs tests/scenarios/standard-employee.json \
> tests/fixtures/standard-employee.expected.json
The E2E test reads the generated fixture:
const expected = await readExpected('standard-employee');
await expect(page.getByTestId('result')).toHaveText(expected.jaarruimteFormatted);
Now the important review target is the oracle script and its source rules. The UI test verifies that the product behaves like the reviewed logic, rather than asking the agent to keep Dutch pension arithmetic straight in its head. That is a much nicer place for the weird edge cases to live.
When to Use
- Calculators, pricing flows, tax rules, eligibility logic, and financial projections
- E2E tests where expected values are derived from domain rules
- Data-driven articles or reports where every number should come from a query or script
- Agent loops where verification can run after each item and return concrete error output
When Not to Use
- When the reference script would be a line-for-line copy of the implementation under test
- When the expected behavior is subjective, exploratory, or intentionally creative
- When the domain truth already lives in a trusted external system that tests can query directly
- When a schema is enough to constrain the output shape; use Schema Steering instead
Related Patterns
- Grounding tells the model to rely on provided context; executable ground truth makes that context runnable
- Schema Steering constrains the shape of outputs, while this pattern constrains their correctness
- Write Outside the Window persists the oracle outside the conversation so future runs inherit the same truth source
- Trace the Work records which oracle, fixtures, and checks were used for a change