
Extraction Pipeline

This page provides detailed documentation for the OCR and extraction stages of the CoquiTitle processing pipeline.

Pipeline Overview

Step 1: OCR Processing

Lambda: coquititle-ocr-processor
Technology: Google Document AI
Trigger: Direct invocation by the API's triggerProcessing() endpoint

Process

  1. API's triggerProcessing() endpoint invokes OCR processor with doc_id parameter
  2. Document AI processes PDF with enterprise OCR
  3. Extracts:
    • Full document text with paragraph structure
    • Individual tokens with bounding boxes (normalized 0-1 coordinates)
    • Line structure with text anchors
    • Confidence scores per token
    • Page dimensions
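Because token bounding boxes are stored in normalized 0-1 coordinates, downstream consumers (such as UI highlighting) must scale them by the page dimensions. A minimal sketch of that conversion; the flat dict layout here is an assumption for illustration, not the exact Document AI schema:

```python
# Sketch: convert a normalized (0-1) bounding box to pixel coordinates.
# The bbox dict layout is an assumption, not the exact Document AI schema.

def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a normalized bounding box to pixel coordinates."""
    return {
        "x0": bbox["x0"] * page_width,
        "y0": bbox["y0"] * page_height,
        "x1": bbox["x1"] * page_width,
        "y1": bbox["y1"] * page_height,
    }

token_bbox = {"x0": 0.1, "y0": 0.2, "x1": 0.3, "y1": 0.25}
print(bbox_to_pixels(token_bbox, page_width=1000, page_height=1400))
# {'x0': 100.0, 'y0': 280.0, 'x1': 300.0, 'y1': 350.0}
```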

Output Tables

Table | Content
------|--------
pages | Page-level text and dimensions
ocr_tokens | Word-level tokens with bounding boxes
ocr_lines | Line-level text for evidence resolution

Atomic Extraction Trigger

After each document's OCR completes, the processor checks if ALL documents for the case have been processed. Only when the last document finishes does it trigger the extractor.

# Pseudocode for atomic trigger
completed = count_completed_docs(case_id)
total = get_total_docs(case_id)
if completed == total:
    invoke_extractor(case_id, run_id)
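Note that the completion check must be atomic: if two documents finish OCR at nearly the same time, a naive count-then-compare could fire the extractor twice or not at all. A minimal in-memory sketch of a check that returns True exactly once; a real deployment would use a conditional database update rather than a process-local lock:

```python
import threading

class CompletionTracker:
    """Track per-case OCR completions; mark_done() returns True exactly
    once, for the document that completes the set."""

    def __init__(self, total_docs):
        self.total = total_docs
        self.completed = 0
        self.lock = threading.Lock()

    def mark_done(self):
        # Increment and compare inside the lock so only one caller
        # can ever observe completed == total.
        with self.lock:
            self.completed += 1
            return self.completed == self.total

tracker = CompletionTracker(total_docs=3)
results = [tracker.mark_done() for _ in range(3)]
print(results)  # [False, False, True] — only the last doc triggers
```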

Step 2: Multi-Pass Data Extraction

Lambda: coquititle-extractor
Model: Configurable per case via cases.extractor_model
Architecture: Evidence-first 2-pass extraction with explicit multimodal context caching

Why Two Passes?

  1. Pass 1 extracts high-level summaries to identify entities (owners, acquisitions, encumbrances)
  2. Pass 2 uses entity context to extract detailed information with focused prompts

This approach:

  • Reduces hallucination by grounding Pass 2 in Pass 1 entities
  • Enables parallel processing of independent details
  • Allows context pruning to only include relevant documents per entity

Pass 1: Summary Extraction (3 parallel calls)

Three parallel LLM calls extract different aspects of the property:

Pass | Prompt Key | Fields Extracted
-----|------------|------------------
1A Property | coquititle/extractor/pass1_property | description, property_id, colindancias, cabida
1B Ownership | coquititle/extractor/pass1_titulares | owners, acquisitions, events, derived_current_rights
1C Gravamenes | coquititle/extractor/pass1_gravamenes | encumbrances, servitudes, cancellations
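The three Pass 1 calls are independent, so they can be dispatched concurrently. A hypothetical sketch using a thread pool; call_llm is a placeholder for the real model invocation, which would send the cached case context plus the prompt:

```python
from concurrent.futures import ThreadPoolExecutor

# Prompt keys from the table above.
PASS1_PROMPTS = {
    "property": "coquititle/extractor/pass1_property",
    "titulares": "coquititle/extractor/pass1_titulares",
    "gravamenes": "coquititle/extractor/pass1_gravamenes",
}

def call_llm(prompt_key):
    # Placeholder: a real implementation would call the model with the
    # cached multimodal context plus this prompt.
    return {"prompt": prompt_key, "fields": {}}

def run_pass1():
    # Fan out the three independent summary extractions in parallel.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(call_llm, key)
                   for name, key in PASS1_PROMPTS.items()}
        return {name: f.result() for name, f in futures.items()}

summaries = run_pass1()
print(sorted(summaries))  # ['gravamenes', 'property', 'titulares']
```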

Pass 2: Detail Extraction (N parallel calls)

For each acquisition and encumbrance discovered in Pass 1, Pass 2 extracts detailed information with context pruning to only include relevant documents.

Entity Type | Prompt Key | Fields Extracted
------------|------------|------------------
Acquisition | coquititle/extractor/pass2_acquisition | deed details, sellers, price, conditions
Encumbrance | coquititle/extractor/pass2_encumbrance | holder, amount, terms, inscription

Context Pruning

Pass 2 uses intelligent context pruning:

  • Identifies which documents are relevant to each entity
  • Creates focused context windows
  • Reduces input tokens by ~60% compared to full-document context

# Example: Only include docs 2-3 for acquisition from 2020
relevant_docs = identify_relevant_docs(acquisition, all_docs)
pruned_context = build_context(relevant_docs)
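A hypothetical sketch of what identify_relevant_docs and build_context might look like; the source_doc_ids field is an assumption about how entities reference their documents, and real relevance logic may also consider dates, parties, and inscription numbers:

```python
# Hypothetical sketch of context pruning: keep only the documents an
# entity actually references. The source_doc_ids field is an assumption.

def identify_relevant_docs(entity, all_docs):
    wanted = set(entity.get("source_doc_ids", []))
    return [d for d in all_docs if d["doc_id"] in wanted]

def build_context(docs):
    # Concatenate only the pruned documents into the prompt context.
    return "\n\n".join(d["text"] for d in docs)

all_docs = [
    {"doc_id": 1, "text": "Deed from 1995..."},
    {"doc_id": 2, "text": "Deed from 2020..."},
    {"doc_id": 3, "text": "Mortgage inscription..."},
]
acquisition = {"year": 2020, "source_doc_ids": [2, 3]}
pruned = identify_relevant_docs(acquisition, all_docs)
print([d["doc_id"] for d in pruned])  # [2, 3]
```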

Context Caching

The extractor uses Gemini's context caching for multimodal content:

# Cache creation (30 min TTL)
cache = create_cached_content(
    model=model_name,
    parts=[page_images + ocr_text],
    display_name=f"case_{case_id}",
    ttl="1800s"
)

# Reuse cache across passes
response = generate_with_cache(cache, pass1_prompt)

Benefits:

  • Reduces input token costs by ~90% for Pass 2 calls
  • Faster response times (cached content doesn't need re-processing)
  • Automatic TTL cleanup

Step 3: Pending Documents Processing

Lambda: coquititle-pending-docs-processor
Purpose: Ingest "documentos presentados" (presented documents) that affect the title but are not yet inscribed

Process

  1. Fetch pending presentations from pending_presentations table (scraped from Karibe)
  2. For each pending document:
    • Check pending_docs_cache for existing extraction
    • If not cached: OCR + LLM extraction
    • Store in cache for future cases with same document
  3. Merge pending docs into extraction schema
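The cache-first flow in step 2 can be sketched as a get-or-compute keyed by asiento_karibe; the dict cache and extract_pending_doc helper below are stand-ins for the real pending_docs_cache table and the OCR + LLM pipeline:

```python
# Sketch of the cache-first flow, keyed by asiento_karibe. The dict and
# extract_pending_doc are stand-ins for the real table and pipeline.

pending_docs_cache = {}

def extract_pending_doc(doc):
    # Placeholder for the expensive OCR + LLM extraction.
    return {"asiento": doc["asiento_karibe"], "summary": "..."}

def get_or_extract(doc):
    key = doc["asiento_karibe"]
    if key not in pending_docs_cache:
        # Cache miss: run the expensive extraction once and store it.
        pending_docs_cache[key] = extract_pending_doc(doc)
    return pending_docs_cache[key]

doc = {"asiento_karibe": "2024-001234"}
first = get_or_extract(doc)
second = get_or_extract(doc)   # cache hit: no re-extraction
print(first is second)  # True
```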

Caching Strategy

Pending documents are cached by asiento_karibe (unique document ID) because:

  • Same document may affect multiple properties
  • OCR + extraction is expensive
  • Cache hit rate is high for active properties

Evidence-First Architecture

A key design principle is evidence-first extraction: every extracted value must cite its source.

Evidence Reference Format

{
  "name": "Juan Pérez García",
  "evidence": {
    "quote": "Juan Pérez García, casado con María López",
    "page": 2,
    "line": "D1-P2-L045"
  }
}
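A minimal sketch of what evidence resolution might look like: verify that the quoted text actually appears in the cited OCR line. The ocr_lines mapping here stands in for the ocr_lines table:

```python
# Sketch of evidence resolution: check that the quoted text appears in
# the cited OCR line. The dict stands in for the ocr_lines table.

ocr_lines = {
    "D1-P2-L045": "Juan Pérez García, casado con María López, adquiere...",
}

def verify_evidence(evidence):
    line_text = ocr_lines.get(evidence["line"], "")
    return evidence["quote"] in line_text

evidence = {
    "quote": "Juan Pérez García, casado con María López",
    "page": 2,
    "line": "D1-P2-L045",
}
print(verify_evidence(evidence))  # True
```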

Benefits

  1. Traceability: Every fact can be traced to its source document
  2. Validation: Evidence resolver can verify claims
  3. Visualization: UI can highlight source text with bounding boxes
  4. Confidence: Users can assess extraction quality