
Evidence Extraction Mechanism: How It Actually Works

Audience: Anyone wanting to understand the core engineering behind PHQ-8 scoring
Last Updated: 2026-01-03


Overview

This document explains how evidence extraction works, why it succeeds or fails for a given item, and how those outcomes determine coverage.

Task validity note: PHQ-8 is a 2-week frequency instrument, while DAIC-WOZ interviews are not structured as PHQ administration. Transcript-only item-level scoring is often underdetermined, so N/A outputs and ~50% coverage are expected in rigorous runs. See docs/clinical/task-validity.md.


The Pipeline (High Level)

┌─────────────────────────────────────────────────────────────┐
│                     INTERVIEW TRANSCRIPT                    │
│  "I've been feeling really down lately. Can't sleep at all. │
│   Work is stressful but I still enjoy my hobbies..."        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              STEP 1: EVIDENCE EXTRACTION (LLM)              │
│                                                             │
│  LLM reads entire transcript and extracts quotes for each   │
│  of the 8 PHQ-8 items.                                      │
│                                                             │
│  Output: JSON with arrays of evidence per item              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              STEP 2: SCORING (LLM)                          │
│                                                             │
│  For each item WITH evidence:                               │
│    → Score 0-3 based on frequency/severity                  │
│                                                             │
│  For each item WITHOUT evidence:                            │
│    → Return "N/A" (cannot assess)                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    FINAL PHQ-8 ASSESSMENT                   │
│                                                             │
│  NoInterest: 2    Depressed: 1    Sleep: 2    Tired: N/A    │
│  Appetite: N/A    Failure: 1      Concentrating: 0          │
│  Moving: N/A                                                │
│                                                             │
│  Coverage: 5/8 = 62.5%                                      │
└─────────────────────────────────────────────────────────────┘

Step 1: Evidence Extraction (The LLM Part)

What Happens

The LLM receives a prompt containing:

  1. The full interview transcript
  2. Instructions to find quotes for each PHQ-8 domain
  3. The expected JSON output format

The Actual Prompt

From src/ai_psychiatrist/agents/prompts/quantitative.py:

Analyze the following therapy transcript and extract specific text chunks
that provide evidence for each PHQ-8 domain.

PHQ-8 domains:
- nointerest: little interest or pleasure in activities
- depressed: feeling down, depressed, or hopeless
- sleep: sleep problems (trouble falling/staying asleep or sleeping too much)
- tired: feeling tired or having little energy
- appetite: appetite changes (poor appetite or overeating)
- failure: negative self-perception or feeling like a failure
- concentrating: trouble concentrating on tasks
- moving: psychomotor changes (moving/speaking slowly or restlessness)

Return a JSON object with arrays of relevant transcript quotes for each domain.
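
For illustration, a well-formed response for the transcript snippet in the diagram above would parse into a dict with one string array per domain (the exact quotes are illustrative; the structure is the point):

evidence = {
    "nointerest": [],  # positive mentions ("I still enjoy my hobbies") get no evidence
    "depressed": ["I've been feeling really down lately"],
    "sleep": ["Can't sleep at all"],
    "tired": [],
    "appetite": [],
    "failure": [],
    "concentrating": [],
    "moving": [],
}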

What the LLM Does Internally

The LLM semantically analyzes the transcript:

  1. Reads the entire text
  2. For each sentence, determines which PHQ-8 domain it relates to (if any)
  3. Groups quotes by domain
  4. Returns structured JSON

Example Analysis:

| Transcript Quote | LLM's Semantic Understanding | Assigned Domain |
| --- | --- | --- |
| "I can't sleep at night" | Mentions sleep difficulty | PHQ8_Sleep |
| "I feel worthless" | Negative self-perception | PHQ8_Failure |
| "I love playing guitar" | Positive interest mention | (none - positive) |
| "My job is stressful" | Work stress, not a PHQ symptom | (none) |

Why Extraction Can Fail

| Failure Type | What Happens | Example |
| --- | --- | --- |
| Not discussed | Patient never mentioned that symptom | No mention of appetite → no Appetite evidence |
| LLM misses it | LLM doesn't recognize the relevance | "I'm so drained" not mapped to Tired |
| Ambiguous language | Could be interpreted multiple ways | "I'm fine" - denial or truth? |
| JSON parsing error | LLM returns malformed output | Missing quote, bad escaping |

JSON Parsing Robustness (CRITICAL)

Problem: LLMs sometimes output malformed JSON (Python-style True instead of true, missing commas, etc.). This was causing silent data corruption where few-shot mode would degrade to zero-shot without indication.

Solution (as of 2026-01-03):

  1. Ollama format: "json": Evidence extraction now uses Ollama's grammar-level JSON constraint, which guarantees well-formed JSON at token generation time. See Ollama Structured Outputs.

  2. Canonical parser: All JSON parsing uses parse_llm_json() in responses.py, which:
     - applies tolerant fixups (smart quotes, trailing commas),
     - falls back to Python literal parsing for True/False/None,
     - and has NO SILENT FALLBACKS: it raises on failure (a sketch of this contract follows).

  3. No silent degradation: If JSON parsing fails, the system raises an exception instead of silently returning empty evidence. This prevents corrupted research results.
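
A minimal sketch of that contract (the real parse_llm_json() in responses.py is the SSOT; the fixups shown here are only the ones named above):

import ast
import json
import re

def parse_llm_json(raw: str) -> dict:
    # Tolerant fixups: normalize smart quotes, drop trailing commas.
    cleaned = raw.strip().replace("\u201c", '"').replace("\u201d", '"')
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    try:
        # Fallback: Python literal syntax accepts True/False/None.
        obj = ast.literal_eval(cleaned)
        if isinstance(obj, dict):
            return obj
    except (ValueError, SyntaxError):
        pass
    # No silent fallback: surface the failure instead of returning {}.
    raise ValueError("LLM output is not parseable as JSON")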

Code Location: src/ai_psychiatrist/infrastructure/llm/responses.py:parse_llm_json()

Related: ANALYSIS-026 - Full audit of JSON parsing architecture

⚠️ CRITICAL: Mode Isolation (Zero-Shot vs Few-Shot)

Zero-shot and few-shot are INDEPENDENT RESEARCH METHODOLOGIES. They must be completely isolated.

A previous bug allowed silent fallback to empty evidence:

# OLD BUG (FIXED):
except (json.JSONDecodeError, ValueError):
    obj = {}  # <-- SILENT: Few-shot becomes zero-shot!

This violated mode isolation:

- Few-shot mode with empty evidence → no references → same as zero-shot
- Published results claiming "few-shot" could be partially zero-shot
- Comparative analysis between modes would be invalid

The fix ensures:

- _extract_evidence() raises on failure instead of returning {}
- Few-shot mode fails loudly if it can't build proper references
- Mode isolation is maintained throughout the pipeline
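
An illustrative sketch of the fixed behavior (the exception name here is an assumption; the real code lives in src/ai_psychiatrist/agents/quantitative.py):

from ai_psychiatrist.infrastructure.llm.responses import parse_llm_json

class EvidenceExtractionError(RuntimeError):
    """Raised when evidence JSON cannot be parsed (hypothetical name)."""

def _extract_evidence(raw_response: str) -> dict[str, list[str]]:
    try:
        return parse_llm_json(raw_response)
    except ValueError as exc:
        # Fail loudly: few-shot must never silently degrade to zero-shot.
        raise EvidenceExtractionError("evidence extraction failed") from exc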


Evidence Schema Validation (Spec 054)

After JSON parsing, the evidence structure is validated:

from ai_psychiatrist.services.evidence_validation import validate_evidence_schema

evidence = validate_evidence_schema(parsed_json)
# Raises EvidenceSchemaError if:
# - Top-level is not an object
# - Any value is not a list
# - List contains non-strings

Why this matters: Without schema validation, wrong types (e.g., string instead of list) would silently become empty arrays, corrupting evidence counts and retrieval.
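
A minimal sketch of the checks listed above (the SSOT is src/ai_psychiatrist/services/evidence_validation.py; this is not the actual implementation):

class EvidenceSchemaError(ValueError):
    pass

def validate_evidence_schema(parsed: object) -> dict[str, list[str]]:
    # Top-level must be an object mapping domains to quote lists.
    if not isinstance(parsed, dict):
        raise EvidenceSchemaError("top-level must be a JSON object")
    for domain, quotes in parsed.items():
        if not isinstance(quotes, list):
            raise EvidenceSchemaError(f"{domain}: expected a list of quotes")
        if not all(isinstance(q, str) for q in quotes):
            raise EvidenceSchemaError(f"{domain}: quotes must all be strings")
    return parsed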


Evidence Hallucination Detection (Spec 053)

The LLM can return "evidence" quotes that don't exist in the transcript. This is silent corruption that pollutes retrieval and confidence signals.

Solution: Validate that each extracted quote is grounded in the source transcript:

from ai_psychiatrist.services.evidence_validation import validate_evidence_grounding

validated_evidence, stats = validate_evidence_grounding(
    evidence=evidence,
    transcript_text=transcript.text,
    mode="substring",  # or "fuzzy" with rapidfuzz
)

Grounding modes:

- substring (default): Conservative. normalize(quote) in normalize(transcript).
- fuzzy: Uses rapidfuzz.fuzz.partial_ratio for whitespace/punctuation drift. Requires the rapidfuzz dependency.
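A sketch of the two checks, assuming a simple lowercase/whitespace normalization (the real normalize() may differ):

import re
from rapidfuzz import fuzz  # only needed for mode="fuzzy"

def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_grounded(quote: str, transcript: str, mode: str = "substring",
                threshold: float = 0.85) -> bool:
    if mode == "substring":
        # Conservative containment check.
        return _normalize(quote) in _normalize(transcript)
    # Fuzzy mode tolerates whitespace/punctuation drift.
    score = fuzz.partial_ratio(_normalize(quote), _normalize(transcript))
    return score >= threshold * 100  # partial_ratio returns 0-100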

Configuration:

QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_ENABLED=true    # default
QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_MODE=substring  # or fuzzy
QUANTITATIVE_EVIDENCE_QUOTE_FUZZY_THRESHOLD=0.85       # if mode=fuzzy
QUANTITATIVE_EVIDENCE_QUOTE_FAIL_ON_ALL_REJECTED=false # default (strict mode = true)

Privacy: Only hashes and counts are logged, never raw transcript text.

SSOT: src/ai_psychiatrist/services/evidence_validation.py


Step 2: Scoring (Back to the LLM)

What Happens

For items WITH evidence, the LLM is asked:

"Based on this evidence, what score (0-3) should this symptom receive?"

For items WITHOUT evidence:

LLM returns "N/A" (cannot assess without evidence)

Scoring Criteria

| Score | Meaning | Frequency |
| --- | --- | --- |
| 0 | Not at all | 0-1 days in past 2 weeks |
| 1 | Several days | 2-6 days |
| 2 | More than half the days | 7-11 days |
| 3 | Nearly every day | 12-14 days |
| N/A | Cannot assess | No evidence found |

Note: The "N/A" response is specific to this evidence-based extraction method. In standard clinical PHQ-8 administration, all items receive a score from 0 to 3 based on patient self-report. Our system returns N/A when insufficient evidence exists in the transcript to make an informed prediction.

What Determines Score vs N/A

The decision tree:

Has evidence for this item?
├── YES → Attempt scoring (0-3)
│         └── Does evidence indicate frequency?
│             ├── YES → Assign 0, 1, 2, or 3
│             └── NO  → Conservative: likely 0 or 1
└── NO  → Return N/A
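
As a sketch, this dispatch reduces to a few lines, with the actual frequency judgment delegated to the scoring LLM (passed here as a callable, since that step is prompt-driven rather than deterministic code):

from typing import Callable

def score_item(
    item: str,
    evidence: list[str],
    llm_score: Callable[[str, list[str]], int],
) -> int | str:
    if not evidence:
        # No evidence -> abstain rather than guess.
        return "N/A"
    # Evidence present -> the LLM maps frequency/severity to 0-3.
    return llm_score(item, evidence)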

How Coverage is Calculated

Per-Item Coverage

For each PHQ-8 item, across all participants:

Item Coverage = (Number of participants with a score) / (Total participants)

Example: Sleep item

- 40 participants got a score (0, 1, 2, or 3)
- 1 participant got N/A
- Sleep coverage = 40/41 = 97.6%

Per-Participant Coverage

For each participant, across all 8 items:

Participant Coverage = (Items with scores) / 8

Example: Participant 303

- 4 items scored: Depressed, Sleep, Tired, Failure
- 4 items N/A: NoInterest, Appetite, Concentrating, Moving
- Participant coverage = 4/8 = 50%

Overall Coverage

Total scored items across all participants:

Overall Coverage = (Total items with scores) / (Total participants × 8)
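
These formulas reduce to simple counting. A minimal sketch, assuming scores are stored per participant as an item -> score mapping where an unscored item is the literal string "N/A":

def participant_coverage(items: dict[str, int | str]) -> float:
    # Fraction of this participant's 8 items that received a 0-3 score.
    return sum(1 for v in items.values() if v != "N/A") / 8

def overall_coverage(scores: dict[int, dict[str, int | str]]) -> float:
    # Fraction of all (participant, item) cells that received a score.
    scored = sum(1 for items in scores.values()
                 for v in items.values() if v != "N/A")
    return scored / (len(scores) * 8)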

For concrete example runs (including per-item counts and coverage), see:

- docs/results/run-history.md
- docs/results/reproduction-results.md

Output artifacts are stored locally under data/outputs/ (gitignored due to DAIC-WOZ licensing; not committed to repo).


What Parameters Affect Extraction?

Temperature

Note: The paper text does not specify exact sampling settings; the effects below are heuristics and can vary by model/backend. See Agent Sampling Registry.

| Value | Effect on Extraction |
| --- | --- |
| 0.0 (default) | Conservative and reproducible (greedy decoding); may miss subtle evidence |
| 0.2 | Slightly more permissive; may catch more evidence but increases variability |
| 0.7+ | Too creative; may hallucinate evidence |
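
For reference, a hedged sketch of how these settings would be passed if the backend is the ollama Python client directly (the project may wrap this differently; extraction_prompt is a placeholder for the prompt shown earlier):

import ollama

extraction_prompt = "..."  # the evidence-extraction prompt shown above

response = ollama.generate(
    model="gemma3:27b",
    prompt=extraction_prompt,
    format="json",                 # grammar-level JSON constraint
    options={"temperature": 0.0},  # greedy decoding for reproducibility
)
raw_json = response["response"]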

Model Choice

| Model | Extraction Quality |
| --- | --- |
| gemma3:27b | Paper’s main baseline model family; exact behavior depends on build/quantization/backend |
| MedGemma 27B | Appendix F: lower MAE on the subset with available evidence, but fewer predictions overall (more abstention) |
| Smaller models | Often less robust on nuance (heuristic) |

Summary: The Complete Picture

  1. The LLM reads the transcript and extracts quotes per symptom (semantic analysis)
  2. Evidence exists?
     - Yes → the LLM scores it (0-3)
     - No → N/A
  3. Coverage = percentage of items that got scores instead of N/A

The key insight: extraction depends on:

- Whether the symptom was discussed in the interview
- How well the LLM recognizes relevant language
- Model parameters (temperature, model size)


Code References

| File | What It Does |
| --- | --- |
| src/ai_psychiatrist/agents/quantitative.py | Evidence extraction and scoring |
| src/ai_psychiatrist/agents/prompts/quantitative.py | Prompt templates |
| src/ai_psychiatrist/services/embedding.py | Few-shot retrieval + similarity computation |