Extraction Pipeline
This page provides detailed documentation for the OCR and extraction stages of the CoquiTitle processing pipeline.
Pipeline Overview
Step 1: OCR Processing
Lambda: coquititle-ocr-processor
Technology: Google Document AI
Trigger: Direct invocation by API's triggerProcessing() endpoint
Process
- API's `triggerProcessing()` endpoint invokes the OCR processor with the `doc_id` parameter
- Document AI processes the PDF with enterprise OCR
- Extracts:
- Full document text with paragraph structure
- Individual tokens with bounding boxes (normalized 0-1 coordinates)
- Line structure with text anchors
- Confidence scores per token
- Page dimensions
Output Tables
| Table | Content |
|---|---|
| pages | Page-level text and dimensions |
| ocr_tokens | Word-level tokens with bounding boxes |
| ocr_lines | Line-level text for evidence resolution |
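As a rough illustration (the column names here are assumptions, not the actual schema), a single ocr_tokens row might look like the sketch below, with the normalized 0-1 bounding box converted to pixels using the page dimensions stored in pages:

```python
# Hypothetical shape of one ocr_tokens row; field names are illustrative only.
token = {
    "doc_id": "doc-001",
    "page": 2,
    "text": "García",
    "confidence": 0.97,
    # Document AI bounding boxes are normalized to 0-1 coordinates.
    "bbox": {"x_min": 0.12, "y_min": 0.34, "x_max": 0.21, "y_max": 0.36},
}

# Converting to pixels only needs the page dimensions kept in the pages table.
page_width, page_height = 2550, 3300
x_px = token["bbox"]["x_min"] * page_width
y_px = token["bbox"]["y_min"] * page_height
```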
Atomic Extraction Trigger
After each document's OCR completes, the processor checks if ALL documents for the case have been processed. Only when the last document finishes does it trigger the extractor.
```python
# Pseudocode for atomic trigger
completed = count_completed_docs(case_id)
total = get_total_docs(case_id)
if completed == total:
    invoke_extractor(case_id, run_id)
```
Step 2: Multi-Pass Data Extraction
Lambda: coquititle-extractor
Model: Configurable per case via cases.extractor_model
Architecture: Evidence-first 2-pass extraction with explicit multimodal context caching
Why Two Passes?
- Pass 1 extracts high-level summaries to identify entities (owners, acquisitions, encumbrances)
- Pass 2 uses entity context to extract detailed information with focused prompts
This approach:
- Reduces hallucination by grounding Pass 2 in Pass 1 entities
- Enables parallel processing of independent details
- Allows context pruning to only include relevant documents per entity
Pass 1: Summary Extraction (3 parallel calls)
Three parallel LLM calls extract different aspects of the property:
| Pass | Prompt Key | Fields Extracted |
|---|---|---|
| 1A Property | coquititle/extractor/pass1_property | description, property_id, colindancias, cabida |
| 1B Ownership | coquititle/extractor/pass1_titulares | owners, acquisitions, events, derived_current_rights |
| 1C Gravamenes | coquititle/extractor/pass1_gravamenes | encumbrances, servitudes, cancellations |
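A minimal sketch of how the three calls could be fanned out concurrently; the prompt keys come from the table above, while run_prompt and its signature are assumptions standing in for the actual LLM call:

```python
import asyncio

# Prompt keys from the table above; the grouping names are just labels.
PASS1_PROMPTS = {
    "property": "coquititle/extractor/pass1_property",
    "titulares": "coquititle/extractor/pass1_titulares",
    "gravamenes": "coquititle/extractor/pass1_gravamenes",
}

async def run_pass1(cache, run_prompt):
    # run_prompt(cache, prompt_key) is a hypothetical async helper that calls the
    # model against the cached multimodal context and returns parsed JSON.
    results = await asyncio.gather(
        *(run_prompt(cache, key) for key in PASS1_PROMPTS.values())
    )
    return dict(zip(PASS1_PROMPTS, results))
```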
Pass 2: Detail Extraction (N parallel calls)
For each acquisition and encumbrance discovered in Pass 1, Pass 2 extracts detailed information with context pruning to only include relevant documents.
| Entity Type | Prompt Key | Fields Extracted |
|---|---|---|
| Acquisition | coquititle/extractor/pass2_acquisition | deed details, sellers, price, conditions |
| Encumbrance | coquititle/extractor/pass2_encumbrance | holder, amount, terms, inscription |
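Continuing the same sketch, Pass 2 issues one call per entity found in Pass 1; the entity field names and the run_prompt helper remain assumptions:

```python
import asyncio

PASS2_PROMPTS = {
    "acquisition": "coquititle/extractor/pass2_acquisition",
    "encumbrance": "coquititle/extractor/pass2_encumbrance",
}

async def run_pass2(pass1_result, cache, run_prompt):
    # Fan out one detail extraction per acquisition and encumbrance from Pass 1.
    entities = (
        [("acquisition", a) for a in pass1_result["titulares"].get("acquisitions", [])]
        + [("encumbrance", e) for e in pass1_result["gravamenes"].get("encumbrances", [])]
    )
    tasks = [
        run_prompt(cache, PASS2_PROMPTS[kind], entity=entity)
        for kind, entity in entities
    ]
    return await asyncio.gather(*tasks)
```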
Context Pruning
Pass 2 uses intelligent context pruning:
- Identifies which documents are relevant to each entity
- Creates focused context windows
- Reduces input tokens by ~60% compared to full-document context
```python
# Example: Only include docs 2-3 for acquisition from 2020
relevant_docs = identify_relevant_docs(acquisition, all_docs)
pruned_context = build_context(relevant_docs)
```
Context Caching
The extractor uses Gemini's context caching for multimodal content:
```python
# Cache creation (30 min TTL)
cache = create_cached_content(
    model=model_name,
    parts=page_images + [ocr_text],  # page images plus the OCR text as cached parts
    display_name=f"case_{case_id}",
    ttl="1800s",
)

# Reuse cache across passes
response = generate_with_cache(cache, pass1_prompt)
```
Benefits:
- Reduces input token costs by ~90% for Pass 2 calls
- Faster response times (cached content doesn't need re-processing)
- Automatic TTL cleanup
Step 3: Pending Documents Processing
Lambda: coquititle-pending-docs-processor
Purpose: Ingest "documentos presentados" (presented documents) that affect the title but aren't yet inscribed
Process
- Fetch pending presentations from the `pending_presentations` table (scraped from Karibe)
- For each pending document:
  - Check `pending_docs_cache` for an existing extraction
  - If not cached: OCR + LLM extraction
  - Store in the cache for future cases with the same document
- Merge pending docs into the extraction schema
Caching Strategy
Pending documents are cached by asiento_karibe (unique document ID) because:
- Same document may affect multiple properties
- OCR + extraction is expensive
- Cache hit rate is high for active properties
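A minimal sketch of the cache check, assuming a dict-like cache keyed by asiento_karibe and a hypothetical callable wrapping the OCR + LLM step:

```python
def process_pending_doc(doc: dict, cache: dict, ocr_and_extract) -> dict:
    """Return the extraction for one pending document, reusing the cache when possible.

    `cache` stands in for the pending_docs_cache table; `ocr_and_extract` is a
    hypothetical callable wrapping the expensive OCR + LLM extraction step.
    """
    key = doc["asiento_karibe"]           # unique Karibe document ID used as the cache key
    if key in cache:
        return cache[key]                 # cache hit: same document already processed
    extraction = ocr_and_extract(doc)     # cache miss: run OCR + LLM extraction
    cache[key] = extraction               # store for future cases touching this document
    return extraction
```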
Evidence-First Architecture
A key design principle is evidence-first extraction: every extracted value must cite its source.
Evidence Reference Format
```json
{
  "name": "Juan Pérez García",
  "evidence": {
    "quote": "Juan Pérez García, casado con María López",
    "page": 2,
    "line": "D1-P2-L045"
  }
}
```
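The line identifier appears to encode document, page, and line (e.g. D1-P2-L045); assuming that format, which is an inference from the example above rather than documented behavior, a citation could be resolved back to its ocr_lines row along these lines:

```python
import re

# Assumed pattern: "D<doc>-P<page>-L<line>", e.g. "D1-P2-L045".
LINE_ID = re.compile(r"^D(?P<doc>\d+)-P(?P<page>\d+)-L(?P<line>\d+)$")

def parse_line_id(line_id: str) -> dict:
    """Split an evidence line reference into document, page, and line numbers."""
    match = LINE_ID.match(line_id)
    if match is None:
        raise ValueError(f"Unrecognized evidence line id: {line_id}")
    return {name: int(value) for name, value in match.groupdict().items()}

parse_line_id("D1-P2-L045")  # -> {"doc": 1, "page": 2, "line": 45}
```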
Benefits
- Traceability: Every fact can be traced to source document
- Validation: Evidence resolver can verify claims
- Visualization: UI can highlight source text with bounding boxes
- Confidence: Users can assess extraction quality
Related Pages
- System Overview - High-level architecture
- Evidence Resolution - How evidence citations are validated
- Data Model - Database schema for extractions
- Observability - Langfuse tracing for extraction