# ADR-001: Evidence-First Extraction
Status: Accepted
Date: 2025-01-15
Decision Makers: Engineering Team
Related Issues: ALI-20
## Context
CoquiTitle generates title study reports from property documents. Users need to:
- Trust the extracted information
- Verify claims against source documents
- Understand where each piece of data came from
We initially tried extracting data first and then searching for supporting evidence. That approach produced:
- Evidence that didn't match extracted claims
- Missing citations for some fields
- Difficulty tracing data provenance
## Decision
Extract evidence citations during the extraction pass, not as a separate step after report generation.
The pipeline now:

1. OCR documents to extract text and bounding boxes
2. Extract structured data AND evidence simultaneously (see the example after this list)
3. Store both in linked database records
4. Generate reports with inline citations
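For illustration, one extracted field might look like the pair below. The literal values and key layout are hypothetical; the ADR fixes only that each value travels with its evidence (document, page, bounding box, confidence):

```python
# One field from the single extraction pass: the claim and its
# citation are produced together, never in separate passes.
extracted_field = {
    "value": "Lot 42, Barrio Pueblo",
    "evidence": {
        "document_id": "doc-001",          # which source document
        "page": 3,                         # page the value was read from
        "bbox": [0.10, 0.42, 0.61, 0.45],  # normalized [x0, y0, x1, y1]
        "confidence": 0.94,                # extraction confidence
    },
}
```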
## Rationale
**Evidence integrity:** By extracting evidence during data extraction, we ensure every claim has a direct source reference. The LLM sees the document text and marks citations in a single pass.

**Reduced hallucination:** When the model must cite its sources at the same time as it extracts values, it is less likely to fabricate data that does not appear in the documents.
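As a hypothetical illustration (not the production prompt), the extraction instruction can make the citation a required sibling of every value:

```python
# Illustrative wording only; the actual prompt is not part of this ADR.
EXTRACTION_INSTRUCTION = """\
For every field you extract, return an object with:
  - "value": the text exactly as it appears in the document
  - "evidence": {"document_id", "page", "bbox"} locating that text
If you cannot point to a specific passage, return null for the field
rather than guessing.
"""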
**Audit trail:** Every extracted field has a linked `evidence_source` record (sketched below) with:
- Document ID
- Page number
- Bounding box coordinates
- Extraction confidence
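A minimal sketch of that record as a Python type; the class and attribute names are assumptions mirroring the fields above, not the actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceSource:
    """Provenance for one extracted field (hypothetical shape)."""
    document_id: str                         # source document
    page_number: int                         # 1-indexed page
    bbox: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)
    confidence: float                        # extraction confidence, 0..1
    field_id: str                            # link back to the extracted field
```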
## Alternatives Considered
### Alternative A: Post-hoc Evidence Search

**Description:** Extract data first, then search the documents for supporting evidence.

**Pros:**
- Simpler extraction prompt
- Potentially faster initial extraction

**Cons:**
- Evidence might not match extracted claims
- Some fields may lack citations
- Two separate LLM passes mean higher cost

**Why not chosen:** Evidence quality was too low; mismatches between claims and citations undermined trust.
### Alternative B: Human-in-the-loop Evidence

**Description:** Have humans verify and add citations after extraction.

**Pros:**
- Highest accuracy
- Human judgment for ambiguous cases

**Cons:**
- Doesn't scale
- Slow turnaround
- Expensive

**Why not chosen:** We need automated processing; human review should be exception-based, not required for every document.
## Consequences

### Positive
- Every extracted field has traceable evidence
- Reports include inline citations with page/bbox references
- Users can click citations to view source documents
- Reduced hallucination in extraction
### Negative
- More complex extraction prompts
- Longer extraction time (more output tokens)
- Evidence schema adds database complexity
### Neutral
- Changed mental model from "extract then cite" to "extract with citations"
## Implementation
- Updated extraction prompts to request `{value, evidence}` pairs
- Created `evidence_sources` table in the `ridpr` schema (sketched below)
- Modified report generator to render citations
- Added UI component for evidence visualization
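For reference, a sketch of what the `evidence_sources` DDL could look like, assuming PostgreSQL; the column names mirror the audit-trail fields above, and `ridpr.extracted_fields` is a hypothetical parent table, not a name taken from this ADR:

```python
# Hypothetical DDL for the evidence_sources table in the ridpr schema.
CREATE_EVIDENCE_SOURCES = """
CREATE TABLE ridpr.evidence_sources (
    id           BIGSERIAL PRIMARY KEY,
    field_id     BIGINT  NOT NULL REFERENCES ridpr.extracted_fields(id),
    document_id  TEXT    NOT NULL,
    page_number  INTEGER NOT NULL,
    bbox         REAL[]  NOT NULL,  -- normalized [x0, y0, x1, y1]
    confidence   REAL    NOT NULL   -- extraction confidence, 0..1
);
"""
```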