Adding New Extraction Fields

This guide explains how to add new fields to the CoquiTitle document extraction pipeline.

Note: Related to Linear issue ALI-21.

Overview

The extraction pipeline uses a multi-pass approach:

OCR Pass - Documents are OCR'd using Google Vertex AI
Extraction Pass - LLM extracts structured data from OCR text
Evidence Pass - Citations are generated linking fields to source documents
Validation Pass - Extracted data is validated for consistency

Step 1: Define the Schema

Add the new field to the extraction schema in alianza-hq/backend/coquititle/schemas/.

# schemas/property_schema.py

PROPERTY_SCHEMA = {
    "type": "object",
    "properties": {
        # Existing fields...
        "owner_name": {"type": "string"},
        "finca_number": {"type": "string"},

        # Add your new field
        "new_field_name": {
            "type": "string",  # or "number", "boolean", "array", "object"
            "description": "Clear description of what this field contains"
        }
    }
}

Field Naming Conventions

Use snake_case for field names
Be descriptive: mortgage_amount not amt
Group related fields with common prefixes: owner_name, owner_address
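
The casing rule above can be checked mechanically. The following is a minimal sketch, not part of the pipeline: `is_snake_case` and `check_field_names` are hypothetical helpers, and the check only covers casing, not whether a name is descriptive.

```python
import re

# Lowercase words separated by single underscores, e.g. "mortgage_amount".
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Return True if `name` follows the snake_case convention."""
    return SNAKE_CASE.fullmatch(name) is not None


def check_field_names(schema: dict) -> list:
    """Return the property names in `schema` that are not snake_case."""
    return [name for name in schema.get("properties", {}) if not is_snake_case(name)]
```

Running `check_field_names(PROPERTY_SCHEMA)` before committing a schema change would list any names that break the convention.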

Step 2: Update the Extraction Prompt

Modify the extraction prompt in lambdas/extractor/prompts/:

EXTRACTION_PROMPT = """
Extract the following information from the document:

- owner_name: The registered property owner
- finca_number: The property's finca number
- new_field_name: [Description of how to find and extract this field]

...
"""

Prompt Best Practices

Be specific about where the field typically appears
Provide examples of valid values
Explain edge cases (e.g., "If not found, return null")
Reference document types where applicable
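
One way to keep the prompt and the schema in sync is to render the field list from the schema descriptions. This is only a sketch under that assumption: `build_extraction_prompt` is a hypothetical helper, and the actual prompt in lambdas/extractor/prompts/ may be written by hand.

```python
def build_extraction_prompt(schema: dict) -> str:
    """Render one prompt line per schema property, using its description."""
    lines = ["Extract the following information from the document:", ""]
    for name, spec in schema["properties"].items():
        description = spec.get("description", "(no description provided)")
        lines.append(f"- {name}: {description}")
    # Spell out the edge case explicitly, as recommended above.
    lines += ["", "If a field is not found, return null for that field."]
    return "\n".join(lines)


# Illustrative schema; the descriptions are examples, not the production values.
example_schema = {
    "type": "object",
    "properties": {
        "owner_name": {"type": "string", "description": "The registered property owner"},
        "mortgage_amount": {"type": "number", "description": "Mortgage amount in USD"},
    },
}

prompt = build_extraction_prompt(example_schema)
```

With this approach a field's description is written once, in the schema, and reused in the prompt.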

Step 3: Add Evidence Extraction

Update the evidence schema to capture citations for the new field:

# In evidence extractor configuration
EVIDENCE_FIELDS = [
    "owner_name",
    "finca_number",
    "new_field_name",  # Add here
]

This enables:

Bounding box extraction for the field's source text
Document ID linking for citations
Confidence scoring
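
To make those three outputs concrete, here is a sketch of what a single evidence record might hold. The class names, the `page` attribute, the normalized coordinate convention, and the 0.0 to 1.0 confidence range are all assumptions for illustration, not the pipeline's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    # Assumed convention: coordinates normalized to the page (0.0 to 1.0).
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass(frozen=True)
class FieldEvidence:
    field_name: str    # must match an entry in EVIDENCE_FIELDS
    document_id: str   # links the citation to its source document
    page: int
    box: BoundingBox   # location of the source text on the page
    confidence: float  # assumed range: 0.0 to 1.0


def fields_missing_evidence(evidence_fields, evidence):
    """Return configured fields that have no evidence record."""
    found = {item.field_name for item in evidence}
    return [name for name in evidence_fields if name not in found]
```

A check like `fields_missing_evidence` is a quick way to confirm that a newly added field actually produces citations.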

Step 4: Update Database Schema

Create a migration if the field needs to be stored:

-- supabase/migrations/YYYYMMDD_add_new_field.sql

ALTER TABLE ridpr.extractions
ADD COLUMN new_field_name TEXT;

COMMENT ON COLUMN ridpr.extractions.new_field_name IS
'Description of the field';

Apply the migration:

cd alianza-infra
supabase db push
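
When writing the migration, the column type should agree with the field's schema type. The sketch below shows one possible mapping; `JSON_TO_POSTGRES`, `column_type`, and `add_column_sql` are hypothetical, and the choice of JSONB for arrays and objects is an assumption rather than a project rule.

```python
# Assumed mapping from JSON Schema types to PostgreSQL column types.
JSON_TO_POSTGRES = {
    "string": "TEXT",
    "number": "NUMERIC",
    "integer": "BIGINT",
    "boolean": "BOOLEAN",
    "array": "JSONB",
    "object": "JSONB",
}


def column_type(spec: dict) -> str:
    """Pick a column type for a field spec, treating date strings as DATE."""
    if spec.get("type") == "string" and spec.get("format") == "date":
        return "DATE"
    return JSON_TO_POSTGRES[spec["type"]]


def add_column_sql(table: str, name: str, spec: dict) -> str:
    """Build an ALTER TABLE statement for a developer-chosen field name."""
    # Names come from the schema file, not from user input, so they are
    # interpolated directly here.
    return f"ALTER TABLE {table}\nADD COLUMN {name} {column_type(spec)};"
```

For example, `add_column_sql("ridpr.extractions", "mortgage_amount", {"type": "number"})` yields a statement that adds a NUMERIC column.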

Step 5: Update the Report Template

If the field should appear in generated reports:

# In report generator
def generate_report(extraction_data):
    return {
        "property_info": {
            "owner": extraction_data.get("owner_name"),
            "finca": extraction_data.get("finca_number"),
            "new_field": extraction_data.get("new_field_name"),  # Add here
        }
    }

Step 6: Testing

Unit Tests

Add tests for the new field extraction:

# extract_fields must be imported from the extractor module under test.
def test_extract_new_field():
    sample_text = "Document containing new_field_name: expected_value"
    result = extract_fields(sample_text)
    assert result["new_field_name"] == "expected_value"

Integration Tests

Test with real documents:

# Run extraction on test documents
python -m pytest tests/integration/test_extraction.py -k "new_field"

Manual Validation

Upload test documents through the UI
Verify extraction results in Supabase
Check evidence citations are correct
Validate report output

Step 7: Deploy

Deploy the changes in order:

# 1. Database migration (if needed)
cd alianza-infra
terraform apply

# 2. Lambda functions
cd alianza-hq/backend/coquititle/lambdas/extractor
./deploy-container.sh

# 3. Verify in production
curl -X POST https://api.alianzacap.com/coquititle/extract \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"document_id": "test-doc-id"}'

Field Types Reference

Type	Schema Definition	Example
String	`{"type": "string"}`	`"John Doe"`
Number	`{"type": "number"}`	`125000.00`
Date	`{"type": "string", "format": "date"}`	`"2025-01-15"`
Boolean	`{"type": "boolean"}`	`true`
Array	`{"type": "array", "items": {...}}`	`["item1", "item2"]`
Object	`{"type": "object", "properties": {...}}`	`{"nested": "value"}`
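
The table above can be turned into a small type check for extracted values. This is a minimal standard-library sketch, not the pipeline's validation pass: `matches_type` is a hypothetical helper, and treating null as always valid is an assumption.

```python
from datetime import date


def matches_type(value, spec: dict) -> bool:
    """Check `value` against the `type` (and `format: date`) of a field spec."""
    if value is None:
        return True  # assumption: null means "not found" and is allowed
    kind = spec["type"]
    if kind == "string":
        if not isinstance(value, str):
            return False
        if spec.get("format") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                return False
        return True
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "array":
        items = spec.get("items")
        if not isinstance(value, list):
            return False
        return items is None or all(matches_type(v, items) for v in value)
    if kind == "object":
        return isinstance(value, dict)
    raise ValueError(f"Unsupported type: {kind}")
```

Note that a numeric string such as `"125000"` fails the number check, which is a common cause of schema validation errors.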

Common Patterns

Monetary Amounts

"mortgage_amount": {
    "type": "number",
    "description": "Mortgage amount in USD"
}
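
If the model returns amounts as formatted text rather than numbers, they need to be normalized before they satisfy a "number" field. The sketch below assumes US-style formatting with a dollar sign and comma thousands separators; `parse_usd` is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_usd(text: str) -> Optional[Decimal]:
    """Convert a string such as "$125,000.00" to a Decimal, or None if unparseable."""
    cleaned = text.strip().replace("$", "").replace(",", "").strip()
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject special values such as NaN or Infinity.
    return amount if amount.is_finite() else None
```

Decimal avoids floating-point rounding while parsing; convert with `float(amount)` only at the point where a JSON number is required.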

Dates

"inscription_date": {
    "type": "string",
    "format": "date",
    "description": "Date of inscription (YYYY-MM-DD)"
}
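
Source documents rarely print dates as YYYY-MM-DD, so a normalization step is usually needed. The sketch below assumes dates appear in Spanish long form (for example "15 de enero de 2025"); that format, and the helper `to_iso_date`, are assumptions for illustration.

```python
import re
from datetime import date
from typing import Optional

SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

# Matches "<day> de <month name> de <year>".
LONG_DATE = re.compile(r"(\d{1,2}) de ([a-z]+) de (\d{4})", re.IGNORECASE)


def to_iso_date(text: str) -> Optional[str]:
    """Return the first long-form date in `text` as YYYY-MM-DD, or None."""
    match = LONG_DATE.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name.lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None  # impossible calendar date, e.g. 31 de febrero
```

Returning None for unparseable or impossible dates keeps the behavior consistent with the "If not found, return null" guidance for prompts.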

Multiple Values

"prior_owners": {
    "type": "array",
    "items": {"type": "string"},
    "description": "List of prior property owners"
}
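
Array fields often pick up duplicates when the same name appears several times in a document. A minimal cleanup sketch follows; `clean_owner_list` is a hypothetical helper, and case-insensitive matching is an assumption about what counts as a duplicate.

```python
def clean_owner_list(names):
    """Collapse whitespace, drop empty entries, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for name in names:
        normalized = " ".join(name.split())
        key = normalized.casefold()  # compare names case-insensitively
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return cleaned
```

The first spelling encountered is kept, so the output preserves document order.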

Troubleshooting

Field Not Extracting

Check that the prompt clearly describes the field
Verify the field exists in test documents
Review LLM logs for extraction errors
Try more specific prompt language

Evidence Not Linking

Ensure the field is in EVIDENCE_FIELDS
Check OCR quality for the source text
Verify bounding box coordinates are valid
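
A bounding box check might look like the sketch below. It assumes coordinates are normalized to the page (0.0 to 1.0) and given as min/max corners; if the OCR output uses pixel coordinates, the range test would need the page dimensions instead. `is_valid_box` is a hypothetical helper.

```python
def is_valid_box(x_min: float, y_min: float, x_max: float, y_max: float) -> bool:
    """Return True if the box lies within the page and has positive area."""
    coords = (x_min, y_min, x_max, y_max)
    in_range = all(0.0 <= c <= 1.0 for c in coords)
    return in_range and x_min < x_max and y_min < y_max
```

Boxes with swapped corners or coordinates outside the page are a sign that the citation was linked to the wrong page or to a differently scaled image.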

Schema Validation Errors

Check that the field type matches the extracted data
Verify required fields are present
Review migration for column type mismatches
