Hypotheses for Improvement: First-Principles Analysis
- Status: Research findings document
- Created: 2026-01-05
- Analysis Scope: Full pipeline from DAIC-WOZ → Evidence Extraction → Embedding → Scoring
Executive Summary
A first-principles audit of the PHQ-8 scoring pipeline reveals several fundamental mismatches between the dataset, the task, and our implementation. These are not bugs in the traditional sense—the code executes correctly—but rather methodological constraints that limit what is achievable with this approach on this dataset.
Key Finding: DAIC-WOZ was designed to capture behavioral indicators of depression, not to elicit explicit PHQ-8 frequency information. Our quantitative prompts are (correctly) conservative about scoring without frequency evidence, but the dataset often does not provide it.
Task Validity SSOT:
- docs/clinical/task-validity.md: comprehensive analysis of construct mismatch and valid scientific claims.
Run 13 SSOT snapshot (clean post-BUG-035 comparative baseline; 41 participants processed in both modes):
- Zero-shot: item MAE = 0.6079, coverage = 50.0% (40/41 evaluated; 1 excluded: no evidence)
- Few-shot: item MAE = 0.6571, coverage = 48.5% (41/41 evaluated)
- Key result: zero-shot beats few-shot after the BUG-035 fix, so the gap is not a prompt-confound artifact.
Run 12 pipeline stats snapshot (pre-BUG-035; useful for evidence/grounding/retrieval distributions):
- Evidence grounding rejects ~49.5% of extracted quotes (deduped across modes).
- Only 32.0% of item assessments had any grounded LLM evidence (105/328).
- Few-shot references are sparse: 15.2% of item assessments had any references (50/328), receiving 52 total references.
Run 13 is documented in docs/results/run-history.md. The Run 12 pipeline stats above are derived from Run 12 artifacts in data/outputs/ and summarized in docs/results/few-shot-analysis.md.
Peer-Review “Reject” Threats (Adversarial List)
These are the issues most likely to trigger rejection on construct validity / method validity grounds unless explicitly addressed via ablations or wording.
A) Construct validity: PHQ-8 is self-report frequency; transcripts often lack frequency (Major)
- PHQ-8 is explicitly a “past two weeks / frequency” instrument; DAIC-WOZ is not a PHQ interview. Most interview statements are qualitative (no explicit day counts).
- Our prompts correctly push the model to abstain when frequency is unclear (src/ai_psychiatrist/agents/prompts/quantitative.py:37-45 and src/ai_psychiatrist/agents/prompts/quantitative.py:111-117), but that means the system is fundamentally measuring "inferable PHQ evidence from the transcript" rather than PHQ itself.
Implication for claims: You must frame the task as selective, evidence-grounded inference rather than “PHQ-8 from transcripts” in an absolute sense.
B) Few-shot prompt confound (Fixed; historical runs only) (Major)
Historical runs had a prompt confound: few-shot prompting could differ from zero-shot even when retrieval returned zero usable references, because the prompt still included an empty reference wrapper containing the string "No valid evidence found".
This is now fixed (BUG-035): empty reference bundles format to "" and the <Reference Examples> block is omitted, so few-shot-with-no-refs is byte-identical to zero-shot.
Implication: pre-fix “few-shot vs zero-shot” comparative claims are confounded and require post-fix reruns to measure the true retrieval effect.
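A minimal sketch of the post-fix contract (the function name is illustrative; the real logic lives in the quantitative prompt builder):

```python
def format_reference_block(references: list[str]) -> str:
    """Post-BUG-035 contract: an empty bundle renders as "" so the
    <Reference Examples> block is omitted, making few-shot-with-no-refs
    byte-identical to zero-shot."""
    if not references:
        # Pre-fix code emitted a wrapper containing "No valid evidence found",
        # so the prompts differed even with zero usable references.
        return ""
    body = "\n".join(references)
    return f"<Reference Examples>\n{body}\n</Reference Examples>"

assert format_reference_block([]) == ""
```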
C) Participant-only transcripts remove disambiguating question context (Major)
Participant-only transcripts are effective at reducing protocol leakage into embeddings, but they also remove the questions that disambiguate short answers (semantic void problem). This can reduce evidence yield and coverage.
Mitigation: Ablate against transcripts_participant_qa (minimal question context) and quantify the impact on evidence grounding rate, coverage, and MAE/AUGRC.
D) Privacy/ethics risk: log artifacts leaking restricted text (Major)
Any workflow that logs raw transcript text, retrieved reference text, or LLM outputs can leak restricted corpus content into run artifacts.
Current status:
- Retrieval audit logs in EmbeddingService are privacy-safe (Spec 064): they emit chunk_hash and chunk_chars (no raw chunk previews).
- Ensure auxiliary scripts follow the same policy (e.g., chunk scoring should avoid logging chunk_preview / response_preview).
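The policy in miniature (a sketch; the actual EmbeddingService hashing scheme may differ):

```python
import hashlib
import logging

logger = logging.getLogger("retrieval_audit")

def privacy_safe_fields(chunk_text: str) -> dict:
    """Emit a stable identifier plus a length, never the raw restricted text."""
    return {
        "chunk_hash": hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:16],
        "chunk_chars": len(chunk_text),
    }

# The log record carries only hash + char count; no chunk_preview/response_preview.
logger.info("retrieved chunk: %s", privacy_safe_fields("restricted transcript text"))
```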
1. Dataset-Task Mismatch (Critical)
What DAIC-WOZ Was Designed For
Per the DAIC-WOZ documentation:
"These interviews were collected as part of a larger effort to create a computer agent that interviews people and identifies verbal and nonverbal indicators of mental illness."
The virtual interviewer "Ellie" conducts semi-structured interviews designed to:
- Create interactional situations favorable to assessing distress indicators
- Capture behavioral markers correlated with depression
- Collect multimodal data (audio, video, text)
What PHQ-8 Scoring Requires
PHQ-8 is a frequency-based instrument asking "Over the last 2 weeks, how often have you been bothered by [symptom]?":
- 0 = Not at all (0-1 days)
- 1 = Several days (2-6 days)
- 2 = More than half the days (7-11 days)
- 3 = Nearly every day (12-14 days)
The Mismatch
The interview doesn't ask about frequency, and participants don't state it. They say things like:
- "I've been feeling tired" (no frequency)
- "I have trouble sleeping sometimes" (vague)
- "I've been stressed lately" (qualitative)
This explains why (see Run 12 pipeline stats snapshot above):
- Only 32.0% of item assessments have any grounded evidence (105/328)
- ~49.5% of extracted quotes fail evidence grounding
- Coverage stabilizes around 46–49% in both modes
2. Evidence Extraction Paradox
Current Prompt Logic
Our prompts (see src/ai_psychiatrist/agents/prompts/quantitative.py:111-117) say:
5. If no relevant evidence exists, mark as "N/A" rather than assuming absence
6. Only assign numeric scores (0-3) when evidence clearly indicates frequency
The Paradox
This is methodologically correct but practically limiting:
- Most transcripts don't contain explicit frequency statements
- Correct behavior: output N/A for most items
- Result: ~50% abstention rate
Hypothesis 2A: Frequency Can Be Inferred
A skilled psychiatrist doesn't require patients to say "I felt tired 8 out of 14 days." They infer frequency from:
- Temporal markers ("lately", "recently", "since [event]")
- Intensity qualifiers ("always", "sometimes", "occasionally")
- Impact statements ("I can't function", "it's been hard")
- Context patterns (multiple mentions across the interview)
Current Status: Our prompts discourage inference. They demand explicit frequency.
Improvement Hypothesis: Update prompts to allow clinical inference while maintaining transparency:
When explicit frequency is not stated, you may infer approximate frequency from:
- Temporal language ("lately" → several days, "always" → nearly every day)
- Intensity markers ("sometimes" → several days)
- Functional impact ("can't work" → more than half the days)
Document your inference in the reason field.
Trade-off: Higher coverage, potentially lower precision. Needs ablation.
3. Chunk Scoring Validity Issues
Observation from Scored Chunks
Examining chunks scored for PHQ8_Sleep (see data/embeddings/*.chunk_scores.json):
Chunk 303:37 scored Sleep=3:
"i need my rest because i'm out there driving that bus..."
"what am i like irritated tired um lazy"
"feel like i wanna lay down probably go to sleep"
Problem: This participant is expressing:
- Value for rest ("I need my rest")
- Desire to sleep ("feel like i wanna lay down")
- General tiredness
NOT: Trouble falling/staying asleep or sleeping too much (the actual PHQ-8 Sleep item)
Hypothesis 3A: Semantic Confusion in Chunk Scoring
The LLM scorer is confusing:

| What participant said | What LLM inferred | Actual PHQ-8 construct |
|-----------------------|-------------------|------------------------|
| "I need rest" | Sleep problems | Not a symptom |
| "I feel tired" | Sleep issues | Different item (Tired) |
| "I want to nap" | Sleeping too much | Maybe, context-dependent |
Improvement Hypothesis: Add explicit symptom definitions to chunk scoring prompt:
PHQ8_Sleep asks about: "Trouble falling or staying asleep, OR sleeping too much"
- Wanting rest is NOT a sleep problem
- Feeling tired belongs to PHQ8_Tired, not PHQ8_Sleep
- "Sleeping too much" means actually sleeping excessive hours, not wanting to
4. Embedding Space Limitations
Current Approach
- Extract evidence text from test transcript
- Embed evidence text
- Find similar chunks from reference corpus
- Use reference chunk scores as anchors
Hypothesis 4A: Semantic Similarity ≠ Severity Similarity
Embedding captures topic similarity, not severity similarity:
- "I can't sleep at night" (severe) ≈ "I value good rest" (not a symptom)
- Both are "about sleep" in embedding space
- One is PHQ8_Sleep=3, one is PHQ8_Sleep=0
Evidence: Item-tag filtering helps (Spec 34), but doesn't solve the severity confusion within a topic.
Hypothesis 4B: Score Reranking
Improvement Hypothesis: After semantic retrieval, rerank by:
1. Presence of severity markers in the reference chunk
2. Score distribution (prefer balanced exemplars)
3. Excluding chunks that are topic-adjacent but not symptom-indicative
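A sketch of the reranker (the candidate fields "text", "similarity", and "score" are assumptions about the retrieval records, not the actual schema):

```python
# Severity markers are illustrative; a real list would be curated per PHQ-8 item.
SEVERITY_MARKERS = ("can't", "every night", "all the time", "nearly every day")

def rerank(candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Blend semantic similarity with a severity-marker bonus, and drop
    topic-adjacent chunks that carry no symptom score at all."""
    def blended(c: dict) -> float:
        text = c["text"].lower()
        bonus = 0.1 * sum(marker in text for marker in SEVERITY_MARKERS)
        return c["similarity"] + bonus

    symptomatic = [c for c in candidates if c.get("score") is not None]
    return sorted(symptomatic, key=blended, reverse=True)[:top_k]

# "I value good rest" is topic-adjacent but unscored, so it is excluded.
print(rerank([
    {"text": "i can't sleep at night", "similarity": 0.71, "score": 3},
    {"text": "i value good rest", "similarity": 0.74, "score": None},
]))
```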
Hypothesis 4C: Domain mismatch — general embeddings may be suboptimal (Major)
We currently use a general-purpose embedding model (MODEL_EMBEDDING_MODEL=qwen3-embedding:8b). Clinical NLP has multiple domain-adapted models (e.g., ClinicalBERT / PubMedBERT) that may better represent symptom language and reduce topical-but-not-clinical matches.
Improvement Hypothesis: Add an embeddings ablation suite:
- baseline: current qwen3-embedding:8b
- clinical-domain embedding baseline(s): ClinicalBERT / PubMedBERT style encoders (or a modern clinical embedding model)
- evaluate: retrieval sparsity, reference score usefulness, downstream MAE/AUGRC
This must be done as an ablation; do not assume improvements without measurement.
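A skeleton for the suite (arm names and the stub encoder are placeholders; real arms would wrap the qwen3-embedding:8b client and a clinical encoder):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class AblationArm:
    name: str
    embed: Callable[[Sequence[str]], list[list[float]]]  # texts -> vectors

def retrieval_sparsity(reference_counts: Sequence[int]) -> float:
    """Fraction of item assessments that received zero references."""
    return sum(1 for n in reference_counts if n == 0) / len(reference_counts)

def stub_embed(texts: Sequence[str]) -> list[list[float]]:
    return [[float(len(t))] for t in texts]  # placeholder encoder

arms = [AblationArm("baseline-qwen3", stub_embed),
        AblationArm("clinical-encoder", stub_embed)]

for arm in arms:
    # Per arm: re-embed the reference corpus, rerun retrieval, then compare
    # sparsity, reference usefulness, and downstream MAE/AUGRC across arms.
    counts = [0, 2, 1, 0]  # placeholder per-item reference counts
    print(f"{arm.name}: sparsity={retrieval_sparsity(counts):.2f}")
```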
5. N/A Criteria Analysis
Current Behavior
Two paths to N/A:
1. NO_MENTION: LLM evidence count = 0 (no relevant quotes found)
2. SCORE_NA_WITH_EVIDENCE: LLM found evidence but explicitly said N/A
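Encoded as a decision rule (a sketch; field names are illustrative, not the pipeline's actual schema):

```python
from enum import Enum

class NaReason(Enum):
    NO_MENTION = "no_mention"            # zero grounded quotes for the item
    SCORE_NA_WITH_EVIDENCE = "score_na"  # quotes found, model still said N/A

def na_reason(grounded_quotes: int, model_said_na: bool) -> NaReason | None:
    """Return which abstention path applies, or None if a score was assigned."""
    if not model_said_na:
        return None
    return NaReason.NO_MENTION if grounded_quotes == 0 else NaReason.SCORE_NA_WITH_EVIDENCE
```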
Hypothesis 5A: Over-Abstention on Implicit Evidence
Run 12 data: 51.5% abstention (zero-shot), 54% abstention (few-shot).
Many participants may have depression symptoms visible in their language patterns (word choice, response length, topic avoidance) without explicit symptom mentions.
Question: Should we abstain on items where behavioral indicators suggest pathology but explicit frequency is missing?
Trade-off:
- Abstaining is methodologically conservative (no hallucinated scores)
- But it may miss clinically meaningful signals
- Psychiatrists use holistic assessment, not just verbal frequency statements
6. Frequency Inference Hierarchy
Proposed Inference Rules (Hypothesis)
| Language Pattern | Inferred Frequency | PHQ-8 Score |
|---|---|---|
| "every day", "constantly", "all the time" | 12-14 days | 3 |
| "most days", "usually" | 7-11 days | 2 |
| "sometimes", "a few times", "lately" | 2-6 days | 1 |
| "once", "rarely", "not really" | 0-1 days | 0 |
| No temporal marker, only symptom mention | Ambiguous | N/A or 1? |
- Current behavior: Ambiguous → N/A
- Alternative: Ambiguous → 1 (conservative non-zero) with low confidence
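If prototyped, the hierarchy could run as a deterministic fallback before declaring N/A. A sketch whose patterns mirror the table above (deliberately incomplete; a real lexicon would need clinical review):

```python
import re

# Ordered from strongest to weakest cue; first match wins.
FREQUENCY_RULES = [
    (re.compile(r"\b(every day|constantly|all the time)\b"), 3),  # 12-14 days
    (re.compile(r"\b(most days|usually)\b"), 2),                  # 7-11 days
    (re.compile(r"\b(sometimes|a few times|lately)\b"), 1),       # 2-6 days
    (re.compile(r"\b(once|rarely|not really)\b"), 0),             # 0-1 days
]

def infer_frequency_score(quote: str) -> int | None:
    """Map temporal language to a PHQ-8 frequency score; None = ambiguous."""
    text = quote.lower()
    for pattern, score in FREQUENCY_RULES:
        if pattern.search(text):
            return score
    return None  # current behavior: ambiguous -> N/A

assert infer_frequency_score("I feel tired all the time") == 3
assert infer_frequency_score("I've been feeling tired") is None
```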
7. Pipeline Architecture Questions
Question 7A: Evidence Extraction as Bottleneck
The current pipeline:
Transcript → Evidence Extraction → Embedding → Reference Retrieval → Scoring
Evidence extraction is a filter:
- Grounded quotes only (substring match)
- Rejects ~50% of extracted quotes as "hallucinated"
Hypothesis: Evidence grounding is too strict. "Hallucinated" quotes may be:
- Paraphrases (valid signal, wrong words)
- Composite statements (synthesized from multiple utterances)
- Reasonable inferences (not literal but implied)
Improvement Hypothesis: Fuzzy grounding with semantic similarity instead of substring match.
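A cheap first cut uses character-level similarity from the standard library; the hypothesis as stated points at semantic similarity, which would replace the ratio below with embedding cosine similarity. The threshold is an unvalidated guess:

```python
from difflib import SequenceMatcher

def fuzzy_grounded(quote: str, transcript: str, threshold: float = 0.85) -> bool:
    """Accept a quote when some transcript window is near-identical to it,
    instead of requiring an exact substring match (current behavior)."""
    if quote in transcript:
        return True  # exact grounding still passes
    window = len(quote)
    step = max(1, window // 4)  # stride keeps the scan roughly linear
    best = 0.0
    for start in range(0, max(1, len(transcript) - window + 1), step):
        chunk = transcript[start:start + window]
        best = max(best, SequenceMatcher(None, quote.lower(), chunk.lower()).ratio())
    return best >= threshold

# A near-verbatim variant passes even though exact substring matching fails:
print(fuzzy_grounded("i cant sleep at night", "um i can't sleep at night you know"))  # True
```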
Hypothesis 7C: Evidence extractor prompt may be inducing quote “hallucinations” (Major)
The evidence extraction prompt currently asks the model to both (a) extract quotes and (b) “determine the appropriate PHQ-8 score”, but the response schema is quote arrays only (src/ai_psychiatrist/agents/prompts/quantitative.py:47-89). This mixed objective can incentivize the model to synthesize/normalize quotes rather than copy verbatim.
Improvement Hypothesis: Rewrite evidence extraction as a pure "verbatim quote finder":
- Remove any instruction about scoring in the evidence step.
- Add stronger constraints: "copy exact substrings; do not paraphrase; do not merge lines."
- Evaluate impact on grounding rejection rate and few-shot reference coverage.
Question 7B: Direct Scoring vs. Evidence-Mediated
Alternative architecture:
Transcript → Direct Scoring (no evidence extraction)
Let the LLM see the full transcript and score directly. Trade-offs:
- Pro: No evidence extraction bottleneck
- Con: Less interpretable, harder to ground
- Con: May increase hallucination
8. Ground Truth Reliability
The Meta-Question
How reliable is the PHQ-8 ground truth?
- Patients self-report their symptoms
- Self-report has known biases (social desirability, recall error)
- The same patient might score differently on different days
Implication: Even perfect prediction can't exceed ground truth reliability. MAE floor may be ~0.5 due to label noise, not model error.
Evidence (examples of PHQ-8 reliability in the literature):
- Swedish PHQ-8 psychometrics report test-retest ICC ≈ 0.83 for the total score and Cronbach's α ≈ 0.85 (Rheumatol Int, 2020; PubMed: 32661929).
- Another PHQ-8 psychometric study reports Cronbach's α ≈ 0.922 (Hum Reprod Open, 2022; PubMed: 35591921).
These are not DAIC-WOZ-specific, but they provide an empirical anchor: the label is not noise-free, and extremely low MAE targets may be unrealistic without additional modalities or repeated measures.
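A back-of-envelope simulation makes the anchor concrete (assumption-laden: uniform latent severity, Gaussian test-retest noise, and the ICC applied to the 0-24 total rather than per item):

```python
import numpy as np

rng = np.random.default_rng(0)
true = rng.uniform(0, 24, size=100_000)  # latent severity (illustrative prior)
icc = 0.83                               # test-retest ICC from PubMed 32661929
# ICC = var(true) / (var(true) + var(noise))  =>  solve for the noise variance
noise_var = np.var(true) * (1 - icc) / icc
labels = true + rng.normal(0.0, np.sqrt(noise_var), size=true.shape)

# Even a model that recovers `true` exactly pays this MAE against the labels:
print(f"label-noise MAE floor (total score) ~ {np.mean(np.abs(labels - true)):.2f}")
```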
9. Summary of Hypotheses
| ID | Hypothesis | Type | Effort | Status |
|---|---|---|---|---|
| 2A | Allow frequency inference from temporal/intensity markers | Prompt change | Low | → Spec 063 |
| 3A | Add explicit symptom definitions to chunk scorer | Prompt change | Low | Proposed |
| 4A | Embedding captures topic, not severity | Architecture | High | Research |
| 4B | Rerank by severity markers, not just similarity | Code change | Medium | Proposed |
| 4C | Ablate clinical-domain embeddings vs general baseline | Ablation | Medium | Proposed |
| 5A | Consider behavioral indicators beyond verbal frequency | Research | High | Research |
| 7A | Fuzzy evidence grounding (semantic similarity) | Config + code | Medium | Proposed |
| 7B | Direct scoring without evidence extraction | Architecture | High | Research |
| 7C | Pure verbatim quote finder (no scoring in evidence step) | Prompt change | Low | Proposed |
Related Specs (address task validity):
- Spec 061: Total PHQ-8 Score Prediction (0-24) — docs/_specs/spec-061-total-phq8-score-prediction.md
- Spec 062: Binary Depression Classification — docs/_specs/spec-062-binary-depression-classification.md
- Spec 063: Severity Inference Prompt Policy (implements Hypothesis 2A) — docs/_specs/spec-063-severity-inference-prompt-policy.md
10. Recommended Next Steps
Immediate (Low-effort, testable)
- Hypothesis 3A: Update chunk scoring prompt with explicit symptom definitions
- Hypothesis 2A: Create a "frequency inference" prompt variant and ablate
Research (Higher effort)
- Hypothesis 7A: Implement fuzzy grounding and compare to substring match
- Hypothesis 4B: Implement severity-aware reranking
Fundamental Re-evaluation
- Consider whether PHQ-8 frequency scoring is the right task for this dataset
- Explore alternative tasks: binary depression detection, severity classification (none/mild/moderate/severe)
11. Related Documentation
- Task Validity — SSOT: construct mismatch and valid claims
- Few-Shot Analysis — Why few-shot may not beat zero-shot
- RAG Design Rationale — Original design decisions
- Metrics and Evaluation — AURC/AUGRC definitions
- Specs Index — Implementation specs (061-063 address task validity)
Sources
- DAIC-WOZ Database
- DAIC-WOZ Documentation
- DAIC-WOZ: On the Validity of Using the Therapist's Prompts
- The Distress Analysis Interview Corpus
- PHQ-8 validation: The PHQ-8 as a measure of current depression in the general population
- PHQ-8 reliability example (test-retest ICC): https://pubmed.ncbi.nlm.nih.gov/32661929/
- PHQ-8 internal consistency example: https://pubmed.ncbi.nlm.nih.gov/35591921/
- DAIC-WOZ + PHQ-8 prediction prior art (LLMs): https://pubmed.ncbi.nlm.nih.gov/40720397/
- DAIC-WOZ + PHQ-8 prediction prior art (text regression): https://pubmed.ncbi.nlm.nih.gov/37398577/
- Selective classification evaluation pitfalls (AUGRC): http://arxiv.org/abs/2407.01032
- Clinical-domain language models (embedding ablations): http://arxiv.org/abs/1904.05342 (ClinicalBERT), http://arxiv.org/abs/2007.15779 (PubMedBERT)