Clinical Understanding: How This System Works

Audience: Clinicians, researchers, non-CS folks
Last Updated: 2026-01-03

The Big Picture

This system reads interview transcripts (like DAIC-WOZ clinical interviews) and selectively infers PHQ-8 depression item scores when the transcript contains sufficient evidence. When it cannot justify an item score from transcript evidence, it returns N/A (abstention).

PHQ-8 item scores are defined by 2-week frequency, but DAIC-WOZ transcripts are not structured as PHQ administration. This creates a real validity constraint for transcript-only item scoring; see: docs/clinical/task-validity.md.

Key Concepts Explained

1. The PHQ-8 Structure

The PHQ-8 has 8 items (questions), each scored 0-3:
- 0 = Not at all
- 1 = Several days
- 2 = More than half the days
- 3 = Nearly every day

Total score ranges 0-24. The items are:
1. Little interest or pleasure (Anhedonia)
2. Feeling down, depressed, hopeless
3. Sleep problems
4. Low energy, fatigue
5. Appetite changes
6. Feeling bad about yourself
7. Trouble concentrating
8. Psychomotor changes (moving/speaking slower or restless)
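As a minimal sketch of the scoring arithmetic above (the function name `phq8_total` and the example scores are illustrative, not part of the system), a fully scored questionnaire sums to a total between 0 and 24:

```python
# Frequency labels for each PHQ-8 item score, as listed above.
FREQUENCY_LABELS = {
    0: "Not at all",
    1: "Several days",
    2: "More than half the days",
    3: "Nearly every day",
}


def phq8_total(item_scores):
    """Sum eight item scores (each 0-3) into a total between 0 and 24."""
    if len(item_scores) != 8:
        raise ValueError("PHQ-8 requires exactly 8 item scores")
    for score in item_scores:
        if score not in FREQUENCY_LABELS:
            raise ValueError(f"invalid item score: {score!r}")
    return sum(item_scores)


example_scores = [2, 1, 3, 2, 0, 1, 1, 0]  # made-up scores for illustration
print(phq8_total(example_scores))  # 10
```

This sketch assumes all eight items were scored; it deliberately rejects missing (N/A) items rather than guessing how to total them.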

2. What "Evidence Extraction" Means

Analogy: Imagine you're reading a patient's interview transcript. Before you score each PHQ-8 item, you first highlight passages that are relevant to each symptom.

That's what evidence extraction does:
1. The LLM reads the entire transcript
2. For each PHQ-8 item, it finds and extracts quotes (evidence) from the interview that relate to that symptom
3. Examples:
   - For "sleep problems": might extract "I've been waking up at 3am every night"
   - For "low interest": might extract "I used to love painting but haven't touched it in months"

Why it matters: The more evidence found, the more confident the system can be about scoring. If no evidence is found for an item, the system often returns "N/A" (can't assess).

3. What "Coverage" Means

Coverage = What percentage of the 8 items got actual scores (vs N/A)

Examples:
- If 4 out of 8 items were scored and 4 were N/A → 50% coverage
- If 6 out of 8 items were scored → 75% coverage
- If all 8 items were scored → 100% coverage

Clinical parallel: Sometimes a clinical interview doesn't touch on every symptom domain. If the patient never discussed sleep, you can't really score the sleep item. Same logic here.
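The coverage definition above can be sketched in a few lines. This is an illustration only: representing N/A as Python's `None` and the name `coverage` are assumptions, not the system's actual data model.

```python
def coverage(item_predictions):
    """Fraction of items that received a score rather than N/A (None)."""
    if not item_predictions:
        raise ValueError("no items to evaluate")
    scored = sum(1 for p in item_predictions if p is not None)
    return scored / len(item_predictions)


# Four of eight items scored, four abstained.
half_scored = [2, None, 1, None, 0, None, 3, None]
print(f"{coverage(half_scored):.0%}")  # 50%
```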

4. What the LLM Actually Does

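The system makes two LLM calls per patient (evidence extraction and scoring). In few-shot mode, an embedding-based retrieval step runs between them: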

Step 1: Evidence Extraction

LLM reads transcript
Outputs JSON with quotes for each PHQ-8 item
Output is schema-validated and evidence-grounded (rejected quotes are logged without transcript text)
If parsing/validation fails, the participant evaluation fails loudly (no silent fallbacks)
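A rough sketch of what validation and grounding in Step 1 could look like follows. Everything here is an assumption for illustration: the item key names, the function `parse_evidence`, and the choice to treat a quote as "grounded" when it appears as a whitespace-normalized substring of the transcript. The actual schema and matching rule may differ.

```python
import json
import logging

logger = logging.getLogger("evidence")

# Hypothetical item keys; the real schema may use different names.
ITEM_KEYS = ["anhedonia", "depressed_mood", "sleep", "fatigue",
             "appetite", "self_worth", "concentration", "psychomotor"]


def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def parse_evidence(raw_json, transcript):
    """Validate the LLM output and keep only quotes found in the transcript."""
    data = json.loads(raw_json)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict) or set(data) != set(ITEM_KEYS):
        raise ValueError("evidence JSON does not match the expected item keys")

    haystack = _normalize(transcript)
    grounded = {}
    for key in ITEM_KEYS:
        quotes = data[key]
        if not isinstance(quotes, list) or not all(isinstance(q, str) for q in quotes):
            raise ValueError(f"{key}: expected a list of strings")
        kept = [q for q in quotes if _normalize(q) and _normalize(q) in haystack]
        rejected = len(quotes) - len(kept)
        if rejected:
            # Log only the item and the count, never the quote text itself.
            logger.warning("%s: rejected %d ungrounded quote(s)", key, rejected)
        grounded[key] = kept
    return grounded


payload = {key: [] for key in ITEM_KEYS}
payload["sleep"] = ["waking up at 3am every night", "I never sleep"]
transcript = "I've been waking up at 3am every night. Work is fine."
print(parse_evidence(json.dumps(payload), transcript)["sleep"])
# ['waking up at 3am every night']
```

Errors are raised rather than swallowed, mirroring the "fail loudly" behavior described above; the second quote is dropped because it does not occur in the transcript.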

Step 2: Few-Shot Retrieval

Uses the extracted evidence to find similar patients from the training data
"This patient talks about sleep like Patient X did, who had score 2 on sleep"

Step 3: Scoring

LLM sees: the transcript, the evidence, and examples from similar patients
Outputs: a score (0-3) or "N/A" for each item, plus reasoning

5. What MAE (Mean Absolute Error) Means

MAE is how far off the predictions are, on average.

Simple example:
- Patient's true score on Item 1: 2
- System predicted: 1
- Error = |2 - 1| = 1

Do this for every scored item across all patients, then average the errors to get the MAE.

Paper's reported MAE: 0.619 (few-shot mode)

What this means clinically: On average, the system is off by about 0.6 points per item, roughly one fifth of the 0-3 range. That is close enough to be informative, but well short of exact agreement.
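The averaging described above can be written out directly. The function name and the four example scores below are made up for illustration:

```python
def mean_absolute_error(true_scores, predicted_scores):
    """Average of |true - predicted| over paired item scores."""
    if len(true_scores) != len(predicted_scores):
        raise ValueError("score lists must have the same length")
    if not true_scores:
        raise ValueError("need at least one scored item")
    errors = [abs(t - p) for t, p in zip(true_scores, predicted_scores)]
    return sum(errors) / len(errors)


true_scores = [2, 0, 3, 1]
predicted_scores = [1, 0, 2, 1]
# Per-item errors are 1, 0, 1, 0, so the mean is 2 / 4.
print(mean_absolute_error(true_scores, predicted_scores))  # 0.5
```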

6. How It All Connects

Interview Transcript
        ↓
   Evidence Extraction (find relevant quotes)
        ↓
   Similar Patient Retrieval (few-shot examples)
        ↓
   LLM Scoring (predict 0-3 or N/A per item)
        ↓
   MAE Calculation (compare to ground truth)

Key relationships:

Factor	Affects	How
Evidence quality	Coverage	Better evidence → fewer N/A items
Coverage	MAE calculation	N/A items are excluded from MAE
Few-shot examples	Score accuracy	Similar patients help calibrate predictions
Interview richness	Everything	Sparse interviews → sparse evidence → low coverage

Why We're Seeing What We're Seeing

The Core Driver: Evidence Availability (Not “Model Knowledge”)

Many DAIC-WOZ interviews do not contain explicit PHQ-8 frequency language for each item. The system is designed to abstain (N/A) when evidence is insufficient rather than hallucinate frequency.

Variable Coverage (Often ~50% on DAIC-WOZ)

Coverage varies across participants and items. This depends on:
- What symptoms the patient discussed
- Whether extracted quotes can be grounded in the transcript
- How explicit the symptom mentions were

The Paper's Approach

The paper excludes N/A items from MAE calculation. This is valid because:
1. It matches clinical reality (can't score what wasn't discussed)
2. It focuses accuracy metrics on what the system actually predicted
3. Coverage is reported separately so you know how much was skipped
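To illustrate why coverage has to be reported next to an MAE that excludes N/A items, here is a small sketch with made-up scores. The name `mae_on_predicted` and the use of `None` for N/A are assumptions for the example, not the evaluation code itself.

```python
def mae_on_predicted(true_scores, predictions):
    """Return (MAE over non-N/A items, coverage); MAE is None if nothing was scored."""
    if len(true_scores) != len(predictions) or not predictions:
        raise ValueError("need two non-empty lists of equal length")
    pairs = [(t, p) for t, p in zip(true_scores, predictions) if p is not None]
    coverage = len(pairs) / len(predictions)
    if not pairs:
        return None, coverage
    mae = sum(abs(t - p) for t, p in pairs) / len(pairs)
    return mae, coverage


true_scores = [3, 0, 2, 1, 3, 0, 1, 2]
eager = [1, 0, 2, 1, 1, 0, 1, 2]           # scores every item, misses two badly
cautious = [None, 0, 2, 1, None, 0, 1, 2]  # abstains on exactly those two items

print(mae_on_predicted(true_scores, eager))     # (0.5, 1.0)
print(mae_on_predicted(true_scores, cautious))  # (0.0, 0.75)
```

The cautious system looks better on MAE alone, but only because it skipped the hard items. Reading MAE together with coverage makes that trade-off visible.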

What This Means for Going Forward

Potential Improvements

1. Better Evidence Extraction
   - Reduce malformed JSON rates via prompt tightening and/or an explicit repair step
   - Could improve coverage by reducing empty-evidence cases
2. Prompt Engineering
   - Adjust how we ask the LLM to extract evidence
   - Be more explicit about valid output formats

What the Results Will Tell Us

When you run a reproduction/evaluation, you'll see:
- MAE_item: Average error per item (compare to paper's 0.619)
- Coverage: Percentage of items with predictions
- By-participant breakdown: Which patients were harder to assess

If our MAE is close to 0.619 with reasonable coverage, that is good evidence we have reproduced the paper's few-shot result.

Summary

In one sentence: The system extracts symptom-related quotes from interviews, optionally retrieves similar examples, predicts 0-3 scores per PHQ-8 item (or N/A if insufficient evidence), and we evaluate accuracy and abstention jointly via coverage-aware metrics (AURC/AUGRC) plus item-level MAE on predicted items.

Known limitation: Item-level PHQ-8 scoring from transcript-only evidence is often underdetermined because PHQ-8 is a 2-week frequency instrument. This is a dataset/task constraint, not just an engineering issue; see docs/clinical/task-validity.md.

Technical Appendix: Paper-Specified Parameters

From the paper (Section 2.4.2 and Appendix D):

LLM Calls Per Participant

Step	Model	Purpose
1. Evidence Extraction	Gemma 3 27B	Find relevant quotes for each PHQ-8 item
2. Scoring	Gemma 3 27B	Predict 0-3 scores using evidence + examples

Total: 2 LLM calls per participant (plus embedding calls)

Few-Shot Hyperparameters (Paper Appendix D)

Parameter	Optimal Value	What It Means
N_example	2	Number of similar examples per PHQ-8 item
N_chunk	8	Lines per transcript chunk
Step size	2	Lines the sliding window advances between consecutive chunks
Dimension	4096	Embedding vector size

Maximum reference chunks per participant: 2 examples × 8 items = 16 chunks
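The chunking parameters in the table (N_chunk = 8, step size = 2) can be illustrated with a short sliding-window sketch. The function name is hypothetical, and how trailing lines that do not fill a whole window are handled is an assumption: this version keeps only full windows, except that a transcript shorter than one window becomes a single chunk.

```python
def chunk_transcript(lines, chunk_size=8, step=2):
    """Return overlapping windows of chunk_size lines, advancing by step lines."""
    if chunk_size <= 0 or step <= 0:
        raise ValueError("chunk_size and step must be positive")
    if len(lines) <= chunk_size:
        return [list(lines)] if lines else []
    return [
        list(lines[start:start + chunk_size])
        for start in range(0, len(lines) - chunk_size + 1, step)
    ]


transcript_lines = [f"line {i}" for i in range(12)]
chunks = chunk_transcript(transcript_lines)
print(len(chunks))   # 3 (windows start at lines 0, 2 and 4)
print(chunks[1][0])  # line 2
```

With a window of 8 and a step of 2, consecutive chunks share 6 lines, so the same utterance appears in several chunks with different surrounding context.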

How Similar Examples Are Found

1. Training transcripts are pre-chunked (8 lines each, sliding by 2)
2. Each chunk is pre-embedded using Qwen 3 8B Embedding (4096 dimensions)
3. For a new patient:
   - Evidence extracted by LLM is embedded
   - Cosine similarity finds the 2 most similar training chunks per item
   - Those chunks + their ground truth scores become the "few-shot examples"
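The retrieval step above amounts to a nearest-neighbor lookup by cosine similarity. The sketch below uses toy 3-dimensional vectors instead of real 4096-dimensional embeddings, and the names `cosine_similarity` and `top_k_examples` plus the tuple layout are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_examples(query_vector, reference_chunks, k=2):
    """Return the k (chunk_text, score) pairs most similar to the query.

    reference_chunks is a list of (chunk_text, ground_truth_score, vector).
    """
    ranked = sorted(
        reference_chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk[2]),
        reverse=True,
    )
    return [(text, score) for text, score, _ in ranked[:k]]


# Toy reference set for one PHQ-8 item (vectors are invented).
reference_chunks = [
    ("chunk A", 2, [1.0, 0.0, 0.0]),
    ("chunk B", 0, [0.0, 1.0, 0.0]),
    ("chunk C", 3, [0.9, 0.1, 0.0]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded evidence
print(top_k_examples(query_vector, reference_chunks))
# [('chunk A', 2), ('chunk C', 3)]
```

A production setup would typically use a vectorized library or an index for this lookup; the pure-Python version is only meant to make the ranking logic explicit.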

Paper Results (Section 3.2)

Mode	MAE	Notes
Zero-shot	0.796	No examples, just prompt
Few-shot	0.619	With 2 similar examples per item
Few-shot + MedGemma	0.505	Better MAE but fewer predictions

The paper reports that few-shot reduced MAE by 22% compared to zero-shot; reproduction results may differ depending on model/backend and retrieval configuration.
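The 22% figure follows from the two MAE values in the table above; the helper name below is illustrative:

```python
def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved, e.g. 0.25 means 25% lower."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


zero_shot_mae = 0.796
few_shot_mae = 0.619
print(f"{relative_reduction(zero_shot_mae, few_shot_mae):.1%}")  # 22.2%
```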