Adding New Extraction Fields

This guide explains how to add new fields to the CoquiTitle document extraction pipeline.

Note: Related to Linear issue ALI-21.

Overview

The extraction pipeline uses a multi-pass approach:

  1. OCR Pass - Documents are OCR'd using Google Vertex AI
  2. Extraction Pass - LLM extracts structured data from OCR text
  3. Evidence Pass - Citations are generated linking fields to source documents
  4. Validation Pass - Extracted data is validated for consistency
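The four passes above can be pictured as a chain of functions that each enrich a shared state. The sketch below is only an illustration of that shape: the function names, the `state` dictionary, and the placeholder return values are all hypothetical and do not reflect the real pipeline code. It does show why a new field has to be registered in more than one place.

```python
from typing import Callable

# Hypothetical stand-ins for the four passes. The real pipeline's function
# names, signatures, and data structures are not shown in this guide.
def ocr_pass(state: dict) -> dict:
    # Placeholder: a real OCR pass would call the OCR service here.
    return {**state, "ocr_text": f"text of {state['document_id']}"}

def extraction_pass(state: dict) -> dict:
    # Placeholder: a real extraction pass would prompt the LLM with the OCR text.
    return {**state, "fields": {"owner_name": None, "finca_number": None}}

def evidence_pass(state: dict) -> dict:
    # One (empty) citation list per extracted field.
    return {**state, "evidence": {name: [] for name in state["fields"]}}

def validation_pass(state: dict) -> dict:
    # Consistency check: every extracted field has an evidence entry.
    return {**state, "valid": set(state["fields"]) == set(state["evidence"])}

PASSES: list[Callable[[dict], dict]] = [
    ocr_pass,
    extraction_pass,
    evidence_pass,
    validation_pass,
]

def run_pipeline(document_id: str) -> dict:
    """Run each pass in order, feeding the output of one into the next."""
    state = {"document_id": document_id}
    for run_pass in PASSES:
        state = run_pass(state)
    return state
```

A field that is added to the extraction pass but not to the evidence pass would fail the consistency check in this toy version, which mirrors the reason the steps below touch the schema, the prompt, and the evidence configuration.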

Step 1: Define the Schema

Add the new field to the extraction schema in alianza-hq/backend/coquititle/schemas/.

# schemas/property_schema.py

PROPERTY_SCHEMA = {
    "type": "object",
    "properties": {
        # Existing fields...
        "owner_name": {"type": "string"},
        "finca_number": {"type": "string"},

        # Add your new field
        "new_field_name": {
            "type": "string",  # or "number", "boolean", "array", "object"
            "description": "Clear description of what this field contains",
        },
    },
}

Field Naming Conventions

  • Use snake_case for field names
  • Be descriptive: mortgage_amount not amt
  • Group related fields with common prefixes: owner_name, owner_address
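The snake_case and prefix conventions can be checked mechanically. The helpers below are a minimal sketch and are not part of the CoquiTitle codebase; `is_valid_field_name` and `group_by_prefix` are hypothetical names. Note that a regex can enforce snake_case but cannot judge whether a name is descriptive, so `amt` would still pass.

```python
import re

# Lowercase words made of letters/digits, separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

def is_valid_field_name(name: str) -> bool:
    """Return True if the name is snake_case."""
    return SNAKE_CASE.fullmatch(name) is not None

def group_by_prefix(names):
    """Group field names by the text before the first underscore."""
    groups = {}
    for name in names:
        prefix = name.split("_", 1)[0]
        groups.setdefault(prefix, []).append(name)
    return groups
```

Running `group_by_prefix` over the schema's property names gives a quick view of whether related fields actually share a prefix.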

Step 2: Update the Extraction Prompt

Modify the extraction prompt in lambdas/extractor/prompts/:

EXTRACTION_PROMPT = """
Extract the following information from the document:

- owner_name: The registered property owner
- finca_number: The property's finca number
- new_field_name: [Description of how to find and extract this field]

...
"""

Prompt Best Practices

  1. Be specific about where the field typically appears
  2. Provide examples of valid values
  3. Explain edge cases (e.g., "If not found, return null")
  4. Reference document types where applicable
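One way to apply these practices consistently is to build each prompt line from a small field description. The function below is a sketch under that assumption; `render_field_instruction` and its parameters are hypothetical, and the location, example, and document-type values in the usage note are placeholders rather than facts about real documents.

```python
def render_field_instruction(name, description, location=None,
                             examples=(), doc_types=()):
    """Build one prompt line covering location, examples, and the null case."""
    parts = [f"- {name}: {description}."]
    if location:
        parts.append(f"Usually found {location}.")
    if examples:
        parts.append("Examples: " + ", ".join(examples) + ".")
    if doc_types:
        parts.append("Applies to: " + ", ".join(doc_types) + ".")
    # Always spell out the edge case so the model does not guess a value.
    parts.append("If not found, return null.")
    return " ".join(parts)
```

For example, `render_field_instruction("new_field_name", "Description of the field", location="<where it typically appears>", examples=["<valid value>"])` yields a single line that can be pasted into `EXTRACTION_PROMPT`.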

Step 3: Add Evidence Extraction

Update the evidence schema to capture citations for the new field:

# In evidence extractor configuration
EVIDENCE_FIELDS = [
    "owner_name",
    "finca_number",
    "new_field_name",  # Add here
]

This enables:

  • Bounding box extraction for the field's source text
  • Document ID linking for citations
  • Confidence scoring
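To make the three capabilities above concrete, here is a sketch of what a single citation record might carry and how to spot fields that have no evidence. The `Citation` shape, its attribute names, and the bounding-box layout are assumptions for illustration; the real evidence schema is not shown in this guide.

```python
from dataclasses import dataclass

EVIDENCE_FIELDS = ["owner_name", "finca_number", "new_field_name"]

@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record; the real schema may differ."""
    field_name: str
    document_id: str   # links the citation back to its source document
    page: int
    bbox: tuple        # assumed layout: (x0, y0, x1, y1)
    confidence: float  # assumed range: 0.0 to 1.0

def fields_missing_evidence(citations, fields=EVIDENCE_FIELDS):
    """Return the configured fields that have no citation at all."""
    cited = {c.field_name for c in citations}
    return [name for name in fields if name not in cited]
```

A check like `fields_missing_evidence` is a cheap way to confirm that a newly added field is actually producing citations after the configuration change.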

Step 4: Update Database Schema

Create a migration if the field needs to be stored:

-- supabase/migrations/YYYYMMDD_add_new_field.sql

ALTER TABLE ridpr.extractions
ADD COLUMN new_field_name TEXT;

COMMENT ON COLUMN ridpr.extractions.new_field_name IS
'Description of the field';

Apply the migration:

cd alianza-infra
supabase db push
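Column type mismatches are easier to avoid if the column type is derived from the field's schema definition. The mapping below is a suggested convention, not the project's actual one: storing arrays and objects as `JSONB` and numbers as `NUMERIC` are assumptions, and `add_column_sql` is a hypothetical helper that only prints a statement for review.

```python
import re

# Assumed mapping from JSON Schema types to PostgreSQL column types.
PG_TYPES = {
    "string": "TEXT",
    "number": "NUMERIC",
    "boolean": "BOOLEAN",
    "array": "JSONB",
    "object": "JSONB",
}

def column_type(field_schema: dict) -> str:
    """Pick a column type for a field; date-formatted strings become DATE."""
    if field_schema.get("type") == "string" and field_schema.get("format") == "date":
        return "DATE"
    return PG_TYPES[field_schema["type"]]

def add_column_sql(field_name: str, field_schema: dict,
                   table: str = "ridpr.extractions") -> str:
    """Render an ALTER TABLE statement to paste into a migration file."""
    # Only allow plain snake_case identifiers, since the name is interpolated.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", field_name):
        raise ValueError(f"unsafe column name: {field_name!r}")
    return (
        f"ALTER TABLE {table}\n"
        f"    ADD COLUMN {field_name} {column_type(field_schema)};"
    )
```

The output is meant to be reviewed and committed as a migration file, not executed directly from Python.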

Step 5: Update the Report Template

If the field should appear in generated reports:

# In report generator
def generate_report(extraction_data):
    return {
        "property_info": {
            "owner": extraction_data.get("owner_name"),
            "finca": extraction_data.get("finca_number"),
            "new_field": extraction_data.get("new_field_name"),  # Add here
        }
    }

Step 6: Testing

Unit Tests

Add tests for the new field extraction:

def test_extract_new_field():
    sample_text = "Document containing new_field_name: expected_value"
    result = extract_fields(sample_text)
    assert result["new_field_name"] == "expected_value"
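It is also worth testing the "not found" case, since the prompt guidance asks the model to return null for missing fields. The example below is self-contained only because it uses a toy regex-based `extract_fields` as a stand-in; the real extractor calls an LLM and its import path is not shown in this guide.

```python
import re

def extract_fields(text):
    """Toy stand-in for the real extractor, used only to make this runnable."""
    match = re.search(r"new_field_name:\s*(\S+)", text)
    return {"new_field_name": match.group(1) if match else None}

def test_extract_new_field_missing():
    result = extract_fields("Document with no matching field")
    # The key should still be present, with None standing in for JSON null.
    assert "new_field_name" in result
    assert result["new_field_name"] is None
```

In the real test suite, the stand-in would be replaced by an import of the project's own `extract_fields`.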

Integration Tests

Test with real documents:

# Run extraction on test documents
python -m pytest tests/integration/test_extraction.py -k "new_field"

Manual Validation

  1. Upload test documents through the UI
  2. Verify extraction results in Supabase
  3. Check evidence citations are correct
  4. Validate report output

Step 7: Deploy

Deploy the changes in order:

# 1. Database migration (if needed)
cd alianza-infra
terraform apply

# 2. Lambda functions
cd alianza-hq/backend/coquititle/lambdas/extractor
./deploy-container.sh

# 3. Verify in production
curl -X POST https://api.alianzacap.com/coquititle/extract \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"document_id": "test-doc-id"}'

Field Types Reference

| Type    | Schema Definition                        | Example             |
|---------|------------------------------------------|---------------------|
| String  | {"type": "string"}                       | "John Doe"          |
| Number  | {"type": "number"}                       | 125000.00           |
| Date    | {"type": "string", "format": "date"}     | "2025-01-15"        |
| Boolean | {"type": "boolean"}                      | true                |
| Array   | {"type": "array", "items": {...}}        | ["item1", "item2"]  |
| Object  | {"type": "object", "properties": {...}}  | {"nested": "value"} |
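The table can be exercised with a small type checker. This is a simplified sketch using only the standard library, not a full JSON Schema validator and not the pipeline's own validation code; `matches_schema` is a hypothetical name. It also treats `"format": "date"` as a hard requirement, which a standards-compliant validator does not necessarily do by default.

```python
from datetime import date

def matches_schema(value, schema) -> bool:
    """Check a Python value against the simple field types in the table."""
    kind = schema["type"]
    if kind == "string":
        if not isinstance(value, str):
            return False
        if schema.get("format") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                return False
        return True
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "array":
        if not isinstance(value, list):
            return False
        item_schema = schema.get("items")
        return item_schema is None or all(
            matches_schema(item, item_schema) for item in value
        )
    if kind == "object":
        if not isinstance(value, dict):
            return False
        properties = schema.get("properties", {})
        # Only keys that are present are checked; required-ness is out of scope.
        return all(
            matches_schema(value[key], sub)
            for key, sub in properties.items()
            if key in value
        )
    raise ValueError(f"unsupported type: {kind!r}")
```

The explicit `bool` exclusion matters in practice: without it, a stray `true` from the model would pass as a valid number.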

Common Patterns

Monetary Amounts

"mortgage_amount": {
    "type": "number",
    "description": "Mortgage amount in USD"
}
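Documents usually print amounts with currency symbols and thousands separators, so the raw text may need cleaning before it fits a `number` field. The helper below is a sketch assuming US-style formatting such as `$125,000.00`; `parse_usd` is a hypothetical name, and whether the pipeline normalizes amounts in code or in the prompt is not stated in this guide.

```python
from decimal import Decimal, InvalidOperation

def parse_usd(raw):
    """Convert text like '$125,000.00' to a Decimal, or None if unparseable."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Decimal accepts 'NaN' and 'Infinity'; neither is a valid amount.
    return value if value.is_finite() else None
```

`Decimal` avoids binary floating-point rounding while parsing; convert with `float(value)` only at the point where the JSON payload is built.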

Dates

"inscription_date": {
    "type": "string",
    "format": "date",
    "description": "Date of inscription (YYYY-MM-DD)"
}
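If the model returns a date in a different layout than YYYY-MM-DD, it can be normalized before validation. The sketch below assumes the accepted input layouts are known in advance; the default list (`%Y-%m-%d` and `%m/%d/%Y`) is an assumption and should be adjusted to the formats that actually occur in the source documents.

```python
from datetime import datetime

def normalize_date(raw, input_formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in input_formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Formats are tried in order, so ambiguous layouts (day-first versus month-first) should appear only once in the list.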

Multiple Values

"prior_owners": {
    "type": "array",
    "items": {"type": "string"},
    "description": "List of prior property owners"
}
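Array fields often come back with duplicates or stray whitespace when the same value appears several times in a document. The cleanup below is an optional sketch, not something the pipeline is known to do; `clean_string_list` is a hypothetical helper, and case-insensitive deduplication is an assumption that may not suit every field.

```python
def clean_string_list(values):
    """Trim, collapse whitespace, and drop empty or duplicate entries."""
    seen = set()
    cleaned = []
    for value in values or []:
        name = " ".join(value.split())  # collapse runs of whitespace
        key = name.casefold()           # compare case-insensitively
        if name and key not in seen:
            seen.add(key)
            cleaned.append(name)        # keep the first spelling seen
    return cleaned
```

Order is preserved, which matters if the list is meant to reflect a sequence such as a chain of prior owners.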

Troubleshooting

Field Not Extracting

  1. Check the prompt clearly describes the field
  2. Verify the field exists in test documents
  3. Review LLM logs for extraction errors
  4. Try more specific prompt language

Evidence Not Linking

  1. Ensure the field is in EVIDENCE_FIELDS
  2. Check OCR quality for the source text
  3. Verify bounding box coordinates are valid
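A quick sanity check on bounding boxes can rule out the most common coordinate problems. The sketch below assumes boxes are stored as `(x0, y0, x1, y1)` normalized to the range 0 to 1; if the pipeline stores pixel coordinates or a different layout, the bounds need to change accordingly.

```python
def is_valid_bbox(bbox) -> bool:
    """Check an assumed (x0, y0, x1, y1) box normalized to the 0..1 range."""
    if len(bbox) != 4:
        return False
    x0, y0, x1, y1 = bbox
    # The box must have positive width and height and stay on the page.
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0
```

Boxes that fail this check typically point to swapped corners or to coordinates in a different unit than expected.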

Schema Validation Errors

  1. Check that the field's schema type matches the extracted data
  2. Verify that required fields are present
  3. Review the migration for column type mismatches