ADR-001: Evidence-First Extraction

Status: Accepted

Date: 2025-01-15

Decision Makers: Engineering Team

Related Issues: ALI-20


Context

CoquiTitle generates title study reports from property documents. Users need to:

  1. Trust the extracted information
  2. Verify claims against source documents
  3. Understand where each piece of data came from

Initially, we considered extracting data first, then finding supporting evidence in a second pass. This approach led to:

  • Evidence that didn't match extracted claims
  • Missing citations for some fields
  • Difficulty tracing data provenance

Decision

Extract evidence citations during the extraction pass, not as a separate step after report generation.

The pipeline now:

  1. OCR documents to extract text and bounding boxes
  2. Extract structured data AND evidence simultaneously (see the sketch after this list)
  3. Store both in linked database records
  4. Generate reports with inline citations
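
To make step 2 concrete, the sketch below shows one way the combined output could be shaped and parsed. The class and helper names (Evidence, ExtractedField, parse_extraction) and the JSON shape are illustrative assumptions, not CoquiTitle's actual schema.

  # Illustrative data model only; field names are assumptions, not the production schema.
  import json
  from dataclasses import dataclass

  @dataclass
  class Evidence:
      document_id: str           # source document identifier
      page: int                  # page number within that document
      bbox: list[float]          # [x0, y0, x1, y1] from the OCR bounding boxes
      quote: str                 # verbatim text the claim is based on

  @dataclass
  class ExtractedField:
      name: str
      value: str | None          # None when the documents contain no answer
      evidence: Evidence | None  # every non-null value must carry evidence

  def parse_extraction(raw: str) -> list[ExtractedField]:
      """Parse the model's JSON output into linked field/evidence pairs."""
      fields = []
      for item in json.loads(raw):
          ev = item.get("evidence")
          fields.append(ExtractedField(
              name=item["name"],
              value=item.get("value"),
              evidence=Evidence(**ev) if ev else None,
          ))
      return fields

Because the value and its evidence arrive in the same object, a missing citation is detectable at parse time rather than after report generation.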

Rationale

Evidence integrity: By extracting evidence during data extraction, we ensure every claim has a direct source reference. The LLM sees the document text and marks citations in a single pass.

Reduced hallucination: When the model must cite its sources at the same time it extracts each value, it is less likely to fabricate data that does not appear in the documents.
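
As an illustration of that constraint, an extraction instruction could be phrased along the following lines; this is a hypothetical prompt fragment, not the production CoquiTitle prompt.

  # Hypothetical prompt fragment; the production prompt wording is not shown in this ADR.
  EXTRACTION_INSTRUCTIONS = """\
  For each requested field, return one JSON object:
    {"name": ..., "value": ..., "evidence": {"document_id": ..., "page": ..., "bbox": ..., "quote": ...}}
  - "quote" must be copied verbatim from the provided document text.
  - If no supporting passage exists, set "value" and "evidence" to null.
  Never state a value without evidence.
  """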

Audit trail: Every extracted field has a linked evidence_source record (see the sketch after this list) with:

  • Document ID
  • Page number
  • Bounding box coordinates
  • Extraction confidence
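
A minimal sketch of how these linked records could be stored is shown below. SQLite is used only for brevity; the production tables live in the ridpr schema, and the exact table and column names here are assumptions.

  # Illustrative only: the production schema (ridpr) and column names may differ.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
  CREATE TABLE extracted_fields (
      id    INTEGER PRIMARY KEY,
      name  TEXT NOT NULL,
      value TEXT
  );
  CREATE TABLE evidence_sources (
      id          INTEGER PRIMARY KEY,
      field_id    INTEGER NOT NULL REFERENCES extracted_fields(id),
      document_id TEXT NOT NULL,    -- document ID
      page        INTEGER NOT NULL, -- page number
      bbox        TEXT NOT NULL,    -- bounding box coordinates as JSON [x0, y0, x1, y1]
      confidence  REAL              -- extraction confidence
  );
  """)

  # Insert a field and its evidence in one transaction so the link cannot be lost.
  with conn:
      cur = conn.execute(
          "INSERT INTO extracted_fields (name, value) VALUES (?, ?)",
          ("parcel_id", "123-456-789"),
      )
      conn.execute(
          "INSERT INTO evidence_sources (field_id, document_id, page, bbox, confidence) "
          "VALUES (?, ?, ?, ?, ?)",
          (cur.lastrowid, "deed-0042", 3, "[72.0, 140.5, 318.2, 162.0]", 0.94),
      )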

Alternatives Considered

Alternative A: Post-hoc Evidence Search

Description: Extract data first, then search documents to find supporting evidence.

Pros:

  • Simpler extraction prompt
  • Potentially faster initial extraction

Cons:

  • Evidence might not match extracted claims
  • Some fields may lack citations
  • Two separate LLM passes = higher cost

Why not chosen: Evidence quality was too low; mismatches between claims and citations undermined trust.

Alternative B: Human-in-the-loop Evidence

Description: Have humans verify and add citations after extraction.

Pros:

  • Highest accuracy
  • Human judgment for ambiguous cases

Cons:

  • Doesn't scale
  • Slow turnaround
  • Expensive

Why not chosen: We need automated processing; human review should be exception-based, not required for every document.

Consequences

Positive

  • Every extracted field has traceable evidence
  • Reports include inline citations with page/bbox references
  • Users can click citations to view source documents
  • Reduced hallucination in extraction

Negative

  • More complex extraction prompts
  • Longer extraction time (more output tokens)
  • Evidence schema adds database complexity

Neutral

  • Changed mental model from "extract then cite" to "extract with citations"

Implementation

  1. Updated extraction prompts to request {value, evidence} pairs
  2. Created the evidence_sources table in the ridpr schema
  3. Modified the report generator to render citations (see the sketch after this list)
  4. Added UI component for evidence visualization
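
For item 3, the sketch below shows one way an inline citation could be rendered from an evidence record; the marker format and function names are assumptions, not the actual report generator code.

  # Hypothetical rendering helpers; the real citation markup is not shown in this ADR.
  def render_citation(document_id: str, page: int, bbox: list[float]) -> str:
      """Build an inline citation marker that a viewer can resolve to the source region."""
      x0, y0, x1, y1 = bbox
      return f"[src: {document_id}, p.{page}, bbox=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})]"

  def cite(sentence: str, document_id: str, page: int, bbox: list[float]) -> str:
      """Append the citation marker to a generated report sentence."""
      return f"{sentence} {render_citation(document_id, page, bbox)}"

  print(cite("The parcel is identified as 123-456-789.", "deed-0042", 3,
             [72.0, 140.5, 318.2, 162.0]))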
