ADR-001: Evidence-First Extraction

Status: Accepted

Date: 2025-01-15

Decision Makers: Engineering Team

Related Issues: ALI-20


Context

CoquiTitle generates title study reports from property documents. Users need to:

  1. Trust the extracted information
  2. Verify claims against source documents
  3. Understand where each piece of data came from

Initially, we considered extracting data first, then finding supporting evidence in a second pass. This approach led to:

  • Evidence that didn't match extracted claims
  • Missing citations for some fields
  • Difficulty tracing data provenance

Decision

Extract evidence citations during the extraction pass, not as a separate step after report generation.

The pipeline now:

  1. OCR documents to extract text and bounding boxes
  2. Extract structured data AND evidence simultaneously (see the sketch after this list)
  3. Store both in linked database records
  4. Generate reports with inline citations
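
To make step 2 concrete, the sketch below shows one way the combined output could be shaped and parsed. The class and helper names (Evidence, ExtractedField, parse_extraction) and the JSON shape are illustrative assumptions, not CoquiTitle's actual schema.

  # Illustrative data model only; field names are assumptions, not the production schema.
  import json
  from dataclasses import dataclass

  @dataclass
  class Evidence:
      document_id: str           # source document identifier
      page: int                  # page number within that document
      bbox: list[float]          # [x0, y0, x1, y1] from the OCR bounding boxes
      quote: str                 # verbatim text the claim is based on

  @dataclass
  class ExtractedField:
      name: str
      value: str | None          # None when the documents contain no answer
      evidence: Evidence | None  # every non-null value must carry evidence

  def parse_extraction(raw: str) -> list[ExtractedField]:
      """Parse the model's JSON output into linked field/evidence pairs."""
      fields = []
      for item in json.loads(raw):
          ev = item.get("evidence")
          fields.append(ExtractedField(
              name=item["name"],
              value=item.get("value"),
              evidence=Evidence(**ev) if ev else None,
          ))
      return fields

Because the value and its evidence arrive in the same object, a missing citation is detectable at parse time rather than after report generation.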

Rationale

Evidence integrity: By extracting evidence during data extraction, we ensure every claim has a direct source reference. The LLM sees the document text and marks citations in a single pass.

Reduced hallucination: When the model must cite its sources at the same time it extracts each value, it is less likely to fabricate data that does not appear in the documents.
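
As an illustration of that constraint, an extraction instruction could be phrased along the following lines; this is a hypothetical prompt fragment, not the production CoquiTitle prompt.

  # Hypothetical prompt fragment; the production prompt wording is not shown in this ADR.
  EXTRACTION_INSTRUCTIONS = """\
  For each requested field, return one JSON object:
    {"name": ..., "value": ..., "evidence": {"document_id": ..., "page": ..., "bbox": ..., "quote": ...}}
  - "quote" must be copied verbatim from the provided document text.
  - If no supporting passage exists, set "value" and "evidence" to null.
  Never state a value without evidence.
  """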

Audit trail: Every extracted field has a linked evidence_source record (see the sketch after this list) with:

  • Document ID
  • Page number
  • Bounding box coordinates
  • Extraction confidence
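
A minimal sketch of how these linked records could be stored is shown below. SQLite is used only for brevity; the production tables live in the ridpr schema, and the exact table and column names here are assumptions.

  # Illustrative only: the production schema (ridpr) and column names may differ.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
  CREATE TABLE extracted_fields (
      id    INTEGER PRIMARY KEY,
      name  TEXT NOT NULL,
      value TEXT
  );
  CREATE TABLE evidence_sources (
      id          INTEGER PRIMARY KEY,
      field_id    INTEGER NOT NULL REFERENCES extracted_fields(id),
      document_id TEXT NOT NULL,    -- document ID
      page        INTEGER NOT NULL, -- page number
      bbox        TEXT NOT NULL,    -- bounding box coordinates as JSON [x0, y0, x1, y1]
      confidence  REAL              -- extraction confidence
  );
  """)

  # Insert a field and its evidence in one transaction so the link cannot be lost.
  with conn:
      cur = conn.execute(
          "INSERT INTO extracted_fields (name, value) VALUES (?, ?)",
          ("parcel_id", "123-456-789"),
      )
      conn.execute(
          "INSERT INTO evidence_sources (field_id, document_id, page, bbox, confidence) "
          "VALUES (?, ?, ?, ?, ?)",
          (cur.lastrowid, "deed-0042", 3, "[72.0, 140.5, 318.2, 162.0]", 0.94),
      )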

Alternatives Considered

Alternative A: Post-hoc Evidence Search

Description: Extract data first, then search documents to find supporting evidence.

Pros:

  • Simpler extraction prompt
  • Potentially faster initial extraction

Cons:

  • Evidence might not match extracted claims
  • Some fields may lack citations
  • Two separate LLM passes = higher cost

Why not chosen: Evidence quality was too low; mismatches between claims and citations undermined trust.

Alternative B: Human-in-the-loop Evidence

Description: Have humans verify and add citations after extraction.

Pros:

  • Highest accuracy
  • Human judgment for ambiguous cases

Cons:

  • Doesn't scale
  • Slow turnaround
  • Expensive

Why not chosen: We need automated processing; human review should be exception-based, not required for every document.

Consequences

Positive

  • Every extracted field has traceable evidence
  • Reports include inline citations with page/bbox references
  • Users can click citations to view source documents
  • Reduced hallucination in extraction

Negative

  • More complex extraction prompts
  • Longer extraction time (more output tokens)
  • Evidence schema adds database complexity

Neutral

  • Changed mental model from "extract then cite" to "extract with citations"

Implementation

  1. Updated extraction prompts to request {value, evidence} pairs
  2. Created the evidence_sources table in the ridpr schema
  3. Modified the report generator to render citations (see the sketch after this list)
  4. Added UI component for evidence visualization
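
For item 3, the sketch below shows one way an inline citation could be rendered from an evidence record; the marker format and function names are assumptions, not the actual report generator code.

  # Hypothetical rendering helpers; the real citation markup is not shown in this ADR.
  def render_citation(document_id: str, page: int, bbox: list[float]) -> str:
      """Build an inline citation marker that a viewer can resolve to the source region."""
      x0, y0, x1, y1 = bbox
      return f"[src: {document_id}, p.{page}, bbox=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})]"

  def cite(sentence: str, document_id: str, page: int, bbox: list[float]) -> str:
      """Append the citation marker to a generated report sentence."""
      return f"{sentence} {render_citation(document_id, page, bbox)}"

  print(cite("The parcel is identified as 123-456-789.", "deed-0042", 3,
             [72.0, 140.5, 318.2, 162.0]))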
