Extraction Pipeline
This page provides detailed documentation for the OCR and extraction stages of the CoquiTitle processing pipeline.
Pipeline Overview
Step 1: OCR Processing
Lambda: coquititle-ocr-processor
Technology: Google Document AI
Trigger: Direct invocation by API's triggerProcessing() endpoint
Process
- API's `triggerProcessing()` endpoint invokes the OCR processor with the `doc_id` parameter
- Document AI processes the PDF with enterprise OCR
- Extracts:
- Full document text with paragraph structure
- Individual tokens with bounding boxes (normalized 0-1 coordinates)
- Line structure with text anchors
- Confidence scores per token
- Page dimensions
Output Tables
| Table | Content |
|---|---|
| pages | Page-level text and dimensions |
| ocr_tokens | Word-level tokens with bounding boxes |
| ocr_lines | Line-level text for evidence resolution |
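As a rough illustration (the column names here are assumptions, not the actual schema), a single ocr_tokens row might look like the sketch below, with the normalized 0-1 bounding box converted to pixels using the page dimensions stored in pages:

```python
# Hypothetical shape of one ocr_tokens row; field names are illustrative only.
token = {
    "doc_id": "doc-001",
    "page": 2,
    "text": "García",
    "confidence": 0.97,
    # Document AI bounding boxes are normalized to 0-1 coordinates.
    "bbox": {"x_min": 0.12, "y_min": 0.34, "x_max": 0.21, "y_max": 0.36},
}

# Converting to pixels only needs the page dimensions kept in the pages table.
page_width, page_height = 2550, 3300
x_px = token["bbox"]["x_min"] * page_width
y_px = token["bbox"]["y_min"] * page_height
```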
Atomic Extraction Trigger
After each document's OCR completes, the processor checks if ALL documents for the case have been processed. Only when the last document finishes does it trigger the extractor.
```python
# Pseudocode for atomic trigger
completed = count_completed_docs(case_id)
total = get_total_docs(case_id)
if completed == total:
    invoke_extractor(case_id, run_id)
```
Step 2: Multi-Pass Data Extraction
Lambda: coquititle-extractor
Model: Configurable per case via cases.extractor_model
Architecture: Evidence-first 2-pass extraction with explicit multimodal context caching
Why Two Passes?
- Pass 1 extracts high-level summaries to identify entities (owners, acquisitions, encumbrances)
- Pass 2 uses entity context to extract detailed information with focused prompts
This approach:
- Reduces hallucination by grounding Pass 2 in Pass 1 entities
- Enables parallel processing of independent details
- Allows context pruning to only include relevant documents per entity
Pass 1: Summary Extraction (3 parallel calls)
Three parallel LLM calls extract different aspects of the property:
| Pass | Prompt Key | Fields Extracted |
|---|---|---|
| 1A Property | coquititle/extractor/pass1_property | description, property_id, colindancias, cabida |
| 1B Ownership | coquititle/extractor/pass1_titulares | owners, acquisitions, events, derived_current_rights |
| 1C Gravamenes | coquititle/extractor/pass1_gravamenes | encumbrances, servitudes, cancellations |
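A minimal sketch of how the three calls could be fanned out concurrently; the prompt keys come from the table above, while run_prompt and its signature are assumptions standing in for the actual LLM call:

```python
import asyncio

# Prompt keys from the table above; the grouping names are just labels.
PASS1_PROMPTS = {
    "property": "coquititle/extractor/pass1_property",
    "titulares": "coquititle/extractor/pass1_titulares",
    "gravamenes": "coquititle/extractor/pass1_gravamenes",
}

async def run_pass1(cache, run_prompt):
    # run_prompt(cache, prompt_key) is a hypothetical async helper that calls the
    # model against the cached multimodal context and returns parsed JSON.
    results = await asyncio.gather(
        *(run_prompt(cache, key) for key in PASS1_PROMPTS.values())
    )
    return dict(zip(PASS1_PROMPTS, results))
```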
Pass 2: Detail Extraction (N parallel calls)
For each acquisition and encumbrance discovered in Pass 1, Pass 2 extracts detailed information with context pruning to only include relevant documents.
| Entity Type | Prompt Key | Fields Extracted |
|---|---|---|
| Acquisition | coquititle/extractor/pass2_acquisition | deed details, sellers, price, conditions |
| Encumbrance | coquititle/extractor/pass2_encumbrance | holder, amount, terms, inscription |
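Continuing the same sketch, Pass 2 issues one call per entity found in Pass 1; the entity field names and the run_prompt helper remain assumptions:

```python
import asyncio

PASS2_PROMPTS = {
    "acquisition": "coquititle/extractor/pass2_acquisition",
    "encumbrance": "coquititle/extractor/pass2_encumbrance",
}

async def run_pass2(pass1_result, cache, run_prompt):
    # Fan out one detail extraction per acquisition and encumbrance from Pass 1.
    entities = (
        [("acquisition", a) for a in pass1_result["titulares"].get("acquisitions", [])]
        + [("encumbrance", e) for e in pass1_result["gravamenes"].get("encumbrances", [])]
    )
    tasks = [
        run_prompt(cache, PASS2_PROMPTS[kind], entity=entity)
        for kind, entity in entities
    ]
    return await asyncio.gather(*tasks)
```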
Context Pruning
Pass 2 uses intelligent context pruning:
- Identifies which documents are relevant to each entity
- Creates focused context windows
- Reduces input tokens by ~60% compared to full-document context
```python
# Example: Only include docs 2-3 for acquisition from 2020
relevant_docs = identify_relevant_docs(acquisition, all_docs)
pruned_context = build_context(relevant_docs)
```
Context Caching
The extractor uses Gemini's context caching for multimodal content:
```python
# Cache creation (30 min TTL)
cache = create_cached_content(
    model=model_name,
    parts=page_images + [ocr_text],  # page images plus the OCR text as cached parts
    display_name=f"case_{case_id}",
    ttl="1800s",
)

# Reuse cache across passes
response = generate_with_cache(cache, pass1_prompt)
```
Benefits:
- Reduces input token costs by ~90% for Pass 2 calls
- Faster response times (cached content doesn't need re-processing)
- Automatic TTL cleanup
Step 3: Pending Documents Processing
Lambda: coquititle-pending-docs-processor
Purpose: Ingest "documentos presentados" (presented documents) that affect the title but aren't yet inscribed
Process
- Fetch pending presentations from the `pending_presentations` table (scraped from Karibe)
- For each pending document:
  - Check `pending_docs_cache` for an existing extraction
  - If not cached: OCR + LLM extraction
  - Store in the cache for future cases with the same document
- Merge pending docs into the extraction schema
Caching Strategy
Pending documents are cached by asiento_karibe (unique document ID) because:
- Same document may affect multiple properties
- OCR + extraction is expensive
- Cache hit rate is high for active properties
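A minimal sketch of the cache check, assuming a dict-like cache keyed by asiento_karibe and a hypothetical callable wrapping the OCR + LLM step:

```python
def process_pending_doc(doc: dict, cache: dict, ocr_and_extract) -> dict:
    """Return the extraction for one pending document, reusing the cache when possible.

    `cache` stands in for the pending_docs_cache table; `ocr_and_extract` is a
    hypothetical callable wrapping the expensive OCR + LLM extraction step.
    """
    key = doc["asiento_karibe"]           # unique Karibe document ID used as the cache key
    if key in cache:
        return cache[key]                 # cache hit: same document already processed
    extraction = ocr_and_extract(doc)     # cache miss: run OCR + LLM extraction
    cache[key] = extraction               # store for future cases touching this document
    return extraction
```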
Evidence-First Architecture
A key design principle is evidence-first extraction: every extracted value must cite its source.
Evidence Reference Format
```json
{
  "name": "Juan Pérez García",
  "evidence": {
    "quote": "Juan Pérez García, casado con María López",
    "page": 2,
    "line": "D1-P2-L045"
  }
}
```
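The line identifier appears to encode document, page, and line (e.g. D1-P2-L045); assuming that format, which is an inference from the example above rather than documented behavior, a citation could be resolved back to its ocr_lines row along these lines:

```python
import re

# Assumed pattern: "D<doc>-P<page>-L<line>", e.g. "D1-P2-L045".
LINE_ID = re.compile(r"^D(?P<doc>\d+)-P(?P<page>\d+)-L(?P<line>\d+)$")

def parse_line_id(line_id: str) -> dict:
    """Split an evidence line reference into document, page, and line numbers."""
    match = LINE_ID.match(line_id)
    if match is None:
        raise ValueError(f"Unrecognized evidence line id: {line_id}")
    return {name: int(value) for name, value in match.groupdict().items()}

parse_line_id("D1-P2-L045")  # -> {"doc": 1, "page": 2, "line": 45}
```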
Benefits
- Traceability: Every fact can be traced to source document
- Validation: Evidence resolver can verify claims
- Visualization: UI can highlight source text with bounding boxes
- Confidence: Users can assess extraction quality
Related Pages
- System Overview - High-level architecture
- Evidence Resolution - How evidence citations are validated
- Data Model - Database schema for extractions
- Observability - Langfuse tracing for extraction