Adding New Extraction Fields
This guide explains how to add new fields to the CoquiTitle document extraction pipeline.
Note: Related to Linear issue ALI-21.
Overview
The extraction pipeline uses a multi-pass approach:
- OCR Pass - Documents are OCR'd using Google Vertex AI
- Extraction Pass - LLM extracts structured data from OCR text
- Evidence Pass - Citations are generated linking fields to source documents
- Validation Pass - Extracted data is validated for consistency
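A new field usually touches every one of these passes, so it helps to picture them as a chain of functions, each consuming the previous pass's output. The sketch below is illustrative only: the function names, the stubbed return values, and the `PipelineResult` container are hypothetical and do not reflect the actual CoquiTitle code.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineResult:
    """Hypothetical container for the output of each pass."""
    ocr_text: str = ""
    fields: dict = field(default_factory=dict)
    evidence: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)


def ocr_pass(document_bytes: bytes) -> str:
    # Stub: the real pass calls Google Vertex AI.
    return document_bytes.decode("utf-8")


def extraction_pass(ocr_text: str) -> dict:
    # Stub: the real pass asks an LLM for structured data.
    return {"owner_name": "Jane Doe" if "Jane Doe" in ocr_text else None}


def evidence_pass(ocr_text: str, fields: dict) -> dict:
    # Link each non-null field to the character offset of its source text.
    return {
        name: {"offset": ocr_text.find(value)}
        for name, value in fields.items()
        if value is not None
    }


def validation_pass(fields: dict, evidence: dict) -> list:
    # Flag extracted values that have no supporting citation.
    return [
        name for name, value in fields.items()
        if value is not None and name not in evidence
    ]


def run_pipeline(document_bytes: bytes) -> PipelineResult:
    text = ocr_pass(document_bytes)
    fields = extraction_pass(text)
    evidence = evidence_pass(text, fields)
    errors = validation_pass(fields, evidence)
    return PipelineResult(text, fields, evidence, errors)
```

Reading the chain this way makes the later steps easier to follow: the schema and prompt feed the extraction pass, `EVIDENCE_FIELDS` feeds the evidence pass, and the database column receives the validated result.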
Step 1: Define the Schema
Add the new field to the extraction schema in alianza-hq/backend/coquititle/schemas/.
# schemas/property_schema.py
PROPERTY_SCHEMA = {
    "type": "object",
    "properties": {
        # Existing fields...
        "owner_name": {"type": "string"},
        "finca_number": {"type": "string"},
        # Add your new field
        "new_field_name": {
            "type": "string",  # or "number", "boolean", "array", "object"
            "description": "Clear description of what this field contains",
        },
    },
}
Field Naming Conventions
- Use snake_case for field names
- Be descriptive: mortgage_amount, not amt
- Group related fields with common prefixes: owner_name, owner_address
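These conventions can be checked mechanically before a schema change is merged. The following is a minimal sketch, not part of the pipeline: `check_field_names` is a hypothetical helper, and the four-character minimum is an arbitrary stand-in for "descriptive".

```python
import re

# Lowercase words separated by single underscores, e.g. "mortgage_amount".
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_field_names(properties: dict, min_length: int = 4) -> list:
    """Return a list of human-readable problems with the field names."""
    problems = []
    for name in properties:
        if not SNAKE_CASE.match(name):
            problems.append(f"{name}: not snake_case")
        elif len(name) < min_length:
            # Assumed heuristic: very short names are rarely descriptive.
            problems.append(f"{name}: too short to be descriptive")
    return problems


problems = check_field_names({"owner_name": {}, "mortgageAmount": {}, "amt": {}})
print(problems)
```

Passing `PROPERTY_SCHEMA["properties"]` to the helper would report any new field that breaks the conventions.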
Step 2: Update the Extraction Prompt
Modify the extraction prompt in lambdas/extractor/prompts/:
EXTRACTION_PROMPT = """
Extract the following information from the document:
- owner_name: The registered property owner
- finca_number: The property's finca number
- new_field_name: [Description of how to find and extract this field]
...
"""
Prompt Best Practices
- Be specific about where the field typically appears
- Provide examples of valid values
- Explain edge cases (e.g., "If not found, return null")
- Reference document types where applicable
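One way to keep the prompt and the schema from drifting apart is to generate the field list from the schema descriptions. This is a sketch under the assumption that each property carries a `description`; `build_extraction_prompt` is a hypothetical helper, and the real prompt may be written by hand.

```python
def build_extraction_prompt(schema: dict) -> str:
    """Render one bullet per schema property, using its description."""
    lines = ["Extract the following information from the document:"]
    for name, spec in schema["properties"].items():
        description = spec.get("description", "(no description)")
        lines.append(f"- {name}: {description}")
    # Spell out the edge case explicitly, as the best practices suggest.
    lines.append("If a field is not found, return null for that field.")
    return "\n".join(lines)


example_schema = {
    "type": "object",
    "properties": {
        "owner_name": {
            "type": "string",
            "description": "The registered property owner",
        },
        "new_field_name": {"type": "string"},
    },
}
print(build_extraction_prompt(example_schema))
```

A property without a description shows up as "(no description)" in the output, which makes a missing description easy to spot in review.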
Step 3: Add Evidence Extraction
Update the evidence schema to capture citations for the new field:
# In evidence extractor configuration
EVIDENCE_FIELDS = [
    "owner_name",
    "finca_number",
    "new_field_name",  # Add here
]
This enables:
- Bounding box extraction for the field's source text
- Document ID linking for citations
- Confidence scoring
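To make the three capabilities concrete, here is one possible shape for a citation record. The `Citation` class, its attribute names, and the normalized bounding-box convention are assumptions for illustration; the actual evidence extractor may store these differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical evidence record for one extracted field."""
    field_name: str
    document_id: str
    page: int
    # Assumed layout: (x_min, y_min, x_max, y_max), normalized to 0..1.
    bbox: tuple
    confidence: float


def citations_by_field(citations: list) -> dict:
    """Group citations by field name, highest confidence first."""
    grouped = {}
    for citation in citations:
        grouped.setdefault(citation.field_name, []).append(citation)
    for items in grouped.values():
        items.sort(key=lambda c: c.confidence, reverse=True)
    return grouped
```

Grouping by field name means a report can show the strongest citation for `new_field_name` first while keeping weaker ones available for review.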
Step 4: Update Database Schema
Create a migration if the field needs to be stored:
-- supabase/migrations/YYYYMMDD_add_new_field.sql
ALTER TABLE ridpr.extractions
ADD COLUMN new_field_name TEXT;
COMMENT ON COLUMN ridpr.extractions.new_field_name IS
'Description of the field';
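The column type should agree with the schema type chosen in Step 1. A minimal sketch of one possible mapping follows; the `PG_TYPES` table and the `add_column_sql` helper are hypothetical, and storing arrays and objects as JSONB is an assumption rather than the project's stated convention.

```python
# Assumed mapping from JSON Schema types to PostgreSQL column types.
PG_TYPES = {
    "string": "TEXT",
    "number": "NUMERIC",
    "integer": "BIGINT",
    "boolean": "BOOLEAN",
    "array": "JSONB",
    "object": "JSONB",
}


def add_column_sql(table: str, name: str, spec: dict) -> str:
    """Draft an ALTER TABLE statement for one schema property.

    Identifiers are not escaped, so use this only for drafting
    migrations from trusted schema files, never with user input.
    """
    if spec.get("type") == "string" and spec.get("format") == "date":
        column_type = "DATE"
    else:
        column_type = PG_TYPES[spec["type"]]
    return f"ALTER TABLE {table} ADD COLUMN {name} {column_type};"


print(add_column_sql("ridpr.extractions", "mortgage_amount", {"type": "number"}))
```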
Apply the migration:
cd alianza-infra
supabase db push
Step 5: Update the Report Template
If the field should appear in generated reports:
# In report generator
def generate_report(extraction_data):
    return {
        "property_info": {
            "owner": extraction_data.get("owner_name"),
            "finca": extraction_data.get("finca_number"),
            "new_field": extraction_data.get("new_field_name"),  # Add here
        }
    }
Step 6: Testing
Unit Tests
Add tests for the new field extraction:
def test_extract_new_field():
    sample_text = "Document containing new_field_name: expected_value"
    result = extract_fields(sample_text)
    assert result["new_field_name"] == "expected_value"
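Because the prompt guidance asks for null when a field is absent, a negative test is worth adding alongside the positive one. The sketch below uses a hypothetical regex-based `extract_fields` stand-in so that it runs on its own; in the real suite, import the pipeline's own function instead.

```python
import re


def extract_fields(text: str) -> dict:
    """Stand-in extractor: reads 'new_field_name: <value>' from the text."""
    match = re.search(r"new_field_name:\s*(\S+)", text)
    return {"new_field_name": match.group(1) if match else None}


def test_extract_new_field_present():
    result = extract_fields("Document containing new_field_name: expected_value")
    assert result["new_field_name"] == "expected_value"


def test_extract_new_field_missing():
    # The key should still be present, with a null value.
    result = extract_fields("Document without the field")
    assert "new_field_name" in result
    assert result["new_field_name"] is None
```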
Integration Tests
Test with real documents:
# Run extraction on test documents
python -m pytest tests/integration/test_extraction.py -k "new_field"
Manual Validation
- Upload test documents through the UI
- Verify extraction results in Supabase
- Check evidence citations are correct
- Validate report output
Step 7: Deploy
Deploy the changes in order:
# 1. Database migration (if needed)
cd alianza-infra
terraform apply
# 2. Lambda functions
cd alianza-hq/backend/coquititle/lambdas/extractor
./deploy-container.sh
# 3. Verify in production
curl -X POST https://api.alianzacap.com/coquititle/extract \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"document_id": "test-doc-id"}'
Field Types Reference
| Type | Schema Definition | Example |
|---|---|---|
| String | {"type": "string"} | "John Doe" |
| Number | {"type": "number"} | 125000.00 |
| Date | {"type": "string", "format": "date"} | "2025-01-15" |
| Boolean | {"type": "boolean"} | true |
| Array | {"type": "array", "items": {...}} | ["item1", "item2"] |
| Object | {"type": "object", "properties": {...}} | {"nested": "value"} |
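The table can be turned into a small type check for extracted values. This is a simplified sketch and not a full JSON Schema validator: `matches_type` is a hypothetical helper that covers only the types listed above, and it enforces the `date` format itself because many validators treat `format` as an annotation unless configured otherwise.

```python
from datetime import date


def matches_type(value, spec: dict) -> bool:
    """Check a value against one of the schema definitions in the table."""
    expected = spec["type"]
    if expected == "string":
        if not isinstance(value, str):
            return False
        if spec.get("format") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                return False
        return True
    if expected == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected == "boolean":
        return isinstance(value, bool)
    if expected == "array":
        if not isinstance(value, list):
            return False
        item_spec = spec.get("items")
        return item_spec is None or all(matches_type(v, item_spec) for v in value)
    if expected == "object":
        return isinstance(value, dict)
    return False
```

For example, `matches_type("15/01/2025", {"type": "string", "format": "date"})` is rejected because the value is not an ISO date, which is the kind of mismatch that later surfaces as a schema validation error.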
Common Patterns
Monetary Amounts
"mortgage_amount": {
    "type": "number",
    "description": "Mortgage amount in USD"
}
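Source documents often print amounts with currency symbols and thousands separators, so the raw text may need cleaning before it satisfies a `number` schema. The sketch below assumes US-style formatting such as `$125,000.00`; `parse_usd` is a hypothetical helper, and `Decimal` is used here to avoid floating-point rounding on money.

```python
from decimal import Decimal, InvalidOperation


def parse_usd(raw: str):
    """Convert text such as '$125,000.00' to a Decimal, or None if unparseable."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


amount = parse_usd("$125,000.00")
print(amount)
```

Returning `None` for unparseable text keeps the behaviour aligned with the "if not found, return null" guidance in the prompt.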
Dates
"inscription_date": {
    "type": "string",
    "format": "date",
    "description": "Date of inscription (YYYY-MM-DD)"
}
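If the model returns dates as they appear in the document, they need normalizing to YYYY-MM-DD. The sketch below assumes Spanish long-form dates such as "15 de enero de 2025", which is a guess based on the finca terminology used in this guide; `normalize_date` and `SPANISH_MONTHS` are hypothetical names.

```python
import re
from datetime import date

SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

DATE_PATTERN = re.compile(r"\s*(\d{1,2}) de ([a-z]+) de (\d{4})\s*")


def normalize_date(raw: str):
    """Return YYYY-MM-DD for '15 de enero de 2025' style text, else None."""
    match = DATE_PATTERN.fullmatch(raw.lower())
    if not match:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name)
    if month is None:
        return None
    try:
        # date() rejects impossible days such as 31 February.
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None


print(normalize_date("15 de enero de 2025"))
```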
Multiple Values
"prior_owners": {
    "type": "array",
    "items": {"type": "string"},
    "description": "List of prior property owners"
}
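Array fields extracted from several pages can contain the same value more than once. A possible clean-up step, sketched here with a hypothetical `clean_owner_list` helper, trims whitespace and drops case-insensitive duplicates while keeping the original order.

```python
def clean_owner_list(names: list) -> list:
    """Normalize whitespace, drop empties, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for name in names:
        normalized = " ".join(name.split())
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return cleaned


owners = clean_owner_list(["Juan  Pérez", "juan pérez", "", "María Rivera"])
print(owners)
```

Whether two spellings really refer to the same owner is a judgment call for a title document, so a step like this is best limited to exact matches after normalization.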
Troubleshooting
Field Not Extracting
- Check the prompt clearly describes the field
- Verify the field exists in test documents
- Review LLM logs for extraction errors
- Try more specific prompt language
Evidence Not Linking
- Ensure the field is in EVIDENCE_FIELDS
- Check OCR quality for the source text
- Verify bounding box coordinates are valid
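A quick sanity check for bounding boxes can rule out the most common coordinate problems. The sketch assumes boxes are `(x_min, y_min, x_max, y_max)` tuples, normalized to the page by default; that layout and the `is_valid_bbox` helper are assumptions, so adapt them to whatever the evidence pass actually emits.

```python
def is_valid_bbox(bbox, page_width: float = 1.0, page_height: float = 1.0) -> bool:
    """Check that a box has four coordinates, positive area, and fits the page."""
    if len(bbox) != 4:
        return False
    x_min, y_min, x_max, y_max = bbox
    inside_x = 0 <= x_min < x_max <= page_width
    inside_y = 0 <= y_min < y_max <= page_height
    return inside_x and inside_y
```

For boxes expressed in absolute units, pass the page dimensions explicitly, for example `is_valid_bbox((10, 20, 200, 40), page_width=612, page_height=792)`.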
Schema Validation Errors
- Check field type matches extracted data
- Verify required fields are present
- Review migration for column type mismatches
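When a validation error mentions required fields, it helps to list exactly which ones are missing from an extraction result. The sketch below assumes the schema uses the standard JSON Schema `required` keyword; `missing_required` is a hypothetical helper, and it deliberately treats a null value the same as an absent key.

```python
def missing_required(data: dict, schema: dict) -> list:
    """Return required field names that are absent or null in the data."""
    return [
        name for name in schema.get("required", [])
        if data.get(name) is None
    ]


schema = {"type": "object", "required": ["owner_name", "finca_number"]}
print(missing_required({"owner_name": "Jane Doe", "finca_number": None}, schema))
```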