Spec 054: Strict Evidence Schema Validation
Status: Implemented (PR #92, 2026-01-03) Priority: High Complexity: Low Related: PIPELINE-BRITTLENESS.md, ANALYSIS-026
SSOT (Implemented)
- Code:
src/ai_psychiatrist/services/evidence_validation.py(validate_evidence_schema(),EvidenceSchemaError) - Wire-up:
src/ai_psychiatrist/agents/quantitative.py(QuantitativeAssessmentAgent._extract_evidence()) - Tests:
tests/unit/services/test_evidence_validation.py,tests/unit/agents/test_quantitative.py
Problem Statement
When the LLM returns malformed evidence JSON, non-list values are silently coerced to empty arrays:
# Current behavior in QuantitativeAssessmentAgent._extract_evidence()
arr = obj.get(key, []) if isinstance(obj, dict) else []
if not isinstance(arr, list):
arr = [] # silently coerced
Example of silent corruption:
{
"PHQ8_Sleep": "Patient mentioned trouble sleeping", // String, not array
"PHQ8_Tired": ["I feel exhausted"] // Correct
}
Result: PHQ8_Sleep becomes [] silently, losing the evidence.
Previous Behavior (Fixed)
# src/ai_psychiatrist/agents/quantitative.py - _extract_evidence()
obj = parse_llm_json(clean)
evidence_dict: dict[str, list[str]] = {}
for key in PHQ8_DOMAIN_KEYS:
arr = obj.get(key, []) if isinstance(obj, dict) else []
if not isinstance(arr, list):
arr = [] # silent coercion (bug)
evidence_dict[key] = list({str(q).strip() for q in arr if str(q).strip()})
Problems:
1. Non-list values (e.g., string/object/number) are silently treated as [].
2. If the model returns a valid JSON object but wrong types, the run “succeeds” with corrupted evidence.
3. In few-shot mode, this can silently reduce retrieval quality and distort confidence signals.
Implemented Solution
Add explicit type validation immediately after JSON parsing, before any processing.
Implementation
Schema Validation Function
# New (shared with Spec 053): src/ai_psychiatrist/services/evidence_validation.py
from typing import Any
from ai_psychiatrist.agents.prompts.quantitative import PHQ8_DOMAIN_KEYS
from ai_psychiatrist.infrastructure.logging import get_logger
logger = get_logger(__name__)
class EvidenceSchemaError(ValueError):
"""Raised when evidence JSON does not match expected schema."""
def __init__(self, message: str, violations: dict[str, str]):
super().__init__(message)
self.violations = violations
def validate_evidence_schema(obj: object) -> dict[str, list[str]]:
"""Validate and normalize evidence extraction JSON schema.
Expected schema:
{
"PHQ8_NoInterest": ["quote1", "quote2", ...],
"PHQ8_Depressed": [...],
...
}
Args:
obj: Parsed JSON object from LLM
Returns:
Validated dict with all keys present and values as list[str]
Raises:
EvidenceSchemaError: If the top-level is not an object, or any value is not a list[str].
"""
if not isinstance(obj, dict):
raise EvidenceSchemaError(
f"Expected JSON object at top level, got {type(obj).__name__}",
violations={"__root__": f"Expected object, got {type(obj).__name__}"},
)
violations: dict[str, str] = {}
validated: dict[str, list[str]] = {}
for key in PHQ8_DOMAIN_KEYS:
value = obj.get(key)
# Case 1: Key missing - acceptable, use empty list
if value is None:
validated[key] = []
continue
# Case 2: Not a list - VIOLATION
if not isinstance(value, list):
violations[key] = f"Expected list, got {type(value).__name__}: {str(value)[:100]}"
continue
# Case 3: List must contain only strings
normalized: list[str] = []
for i, item in enumerate(value):
if not isinstance(item, str):
violations[key] = (
f"Expected list[str] but element {i} was {type(item).__name__}: "
f"{str(item)[:100]}"
)
break
stripped = item.strip()
if stripped:
normalized.append(stripped)
if key in violations:
continue
# Preserve order while de-duping.
seen: set[str] = set()
deduped: list[str] = []
for quote in normalized:
if quote in seen:
continue
seen.add(quote)
deduped.append(quote)
validated[key] = deduped
# If any violations, raise with details
if violations:
raise EvidenceSchemaError(
f"Evidence schema violations in {len(violations)} fields",
violations=violations,
)
return validated
Integration
# src/ai_psychiatrist/agents/quantitative.py - modify _extract_evidence()
from ai_psychiatrist.services.evidence_validation import validate_evidence_schema, EvidenceSchemaError
async def _extract_evidence(self, transcript_text: str) -> dict[str, list[str]]:
"""Extract evidence quotes for each PHQ-8 domain."""
# ... existing LLM call ...
obj = parse_llm_json(clean)
# NEW: Strict schema validation
try:
evidence = validate_evidence_schema(obj)
except EvidenceSchemaError as e:
import hashlib # stdlib; used for privacy-safe hashing (no transcript text)
logger.error(
"evidence_schema_validation_failed",
violations=e.violations,
response_hash=hashlib.sha256(clean.encode("utf-8")).hexdigest()[:12],
response_len=len(clean),
)
raise # Propagate - fail loudly, don't silently degrade
return evidence
Error Handling Strategy
When schema validation fails:
| Option | Behavior | Recommendation |
|---|---|---|
| Fail loudly | Raise exception, participant marked as failed | ✅ Default |
| Retry | Trigger LLM retry with corrective prompt | Consider for Phase 2 |
| Repair | Attempt to fix (e.g., wrap string in list) | ❌ Too risky |
Recommendation: Fail loudly. This matches ANALYSIS-026 principle of no silent degradation.
Testing
# tests/unit/services/test_evidence_validation.py
import pytest
from ai_psychiatrist.services.evidence_validation import validate_evidence_schema, EvidenceSchemaError
def test_valid_schema_passes():
obj = {
"PHQ8_NoInterest": ["quote 1", "quote 2"],
"PHQ8_Depressed": [],
"PHQ8_Sleep": ["quote 3"],
"PHQ8_Tired": [],
"PHQ8_Appetite": [],
"PHQ8_Failure": [],
"PHQ8_Concentrating": [],
"PHQ8_Moving": [],
}
result = validate_evidence_schema(obj)
assert result["PHQ8_NoInterest"] == ["quote 1", "quote 2"]
assert result["PHQ8_Depressed"] == []
def test_missing_keys_filled_with_empty():
obj = {"PHQ8_NoInterest": ["quote"]} # Only one key
result = validate_evidence_schema(obj)
assert result["PHQ8_NoInterest"] == ["quote"]
assert result["PHQ8_Depressed"] == [] # Missing key → []
def test_string_instead_of_list_raises():
obj = {
"PHQ8_NoInterest": "This is a string, not a list",
"PHQ8_Depressed": [],
}
with pytest.raises(EvidenceSchemaError) as exc_info:
validate_evidence_schema(obj)
assert "PHQ8_NoInterest" in exc_info.value.violations
assert "Expected list" in exc_info.value.violations["PHQ8_NoInterest"]
def test_dict_instead_of_list_raises():
obj = {
"PHQ8_Sleep": {"quote": "nested object"},
}
with pytest.raises(EvidenceSchemaError) as exc_info:
validate_evidence_schema(obj)
assert "PHQ8_Sleep" in exc_info.value.violations
def test_number_instead_of_list_raises():
obj = {"PHQ8_Tired": 42}
with pytest.raises(EvidenceSchemaError) as exc_info:
validate_evidence_schema(obj)
assert "int" in exc_info.value.violations["PHQ8_Tired"]
def test_null_value_treated_as_missing():
obj = {"PHQ8_Appetite": None}
result = validate_evidence_schema(obj)
assert result["PHQ8_Appetite"] == []
def test_whitespace_only_strings_filtered():
obj = {"PHQ8_Failure": ["valid", " ", "", "also valid"]}
result = validate_evidence_schema(obj)
assert result["PHQ8_Failure"] == ["valid", "also valid"]
def test_non_string_list_items_raises():
obj = {"PHQ8_Concentrating": ["valid", 123, True]}
with pytest.raises(EvidenceSchemaError) as exc_info:
validate_evidence_schema(obj)
assert "PHQ8_Concentrating" in exc_info.value.violations
def test_multiple_violations_collected():
obj = {
"PHQ8_NoInterest": "string 1",
"PHQ8_Depressed": "string 2",
"PHQ8_Sleep": [], # Valid
}
with pytest.raises(EvidenceSchemaError) as exc_info:
validate_evidence_schema(obj)
assert len(exc_info.value.violations) == 2
assert "PHQ8_NoInterest" in exc_info.value.violations
assert "PHQ8_Depressed" in exc_info.value.violations
Impact Analysis
Before This Spec
LLM returns: {"PHQ8_Sleep": "trouble sleeping"}
Result: evidence["PHQ8_Sleep"] = [] # SILENT LOSS
After This Spec
LLM returns: {"PHQ8_Sleep": "trouble sleeping"}
Result: EvidenceSchemaError raised
Participant marked as failed
Error logged with violations + hashes (no transcript text)
We KNOW something went wrong
Migration
No migration needed. This is a strictness improvement that will cause some existing implicit failures to become explicit.
Expected during rollout: Some participants that previously "succeeded" with corrupted data will now fail explicitly. This is the intended behavior.
Rollout Plan
- Phase 1: Implement and deploy
- Phase 2: Run evaluation, collect failure rate
- Phase 3: If failure rate is high (>5%), investigate LLM prompt improvements
- Phase 4: Consider retry logic if failures are transient
Success Criteria
- Zero silent type coercion in evidence processing
- All schema violations logged with privacy-safe context (violations + hashes only)
- Test coverage for all edge cases
- No performance regression (<1ms overhead)
Relationship to Other Specs
- Spec 053 (Hallucination Detection): Runs AFTER this spec validates schema
- ANALYSIS-026: This extends the "no silent fallbacks" principle to schema validation