
RAG Design Rationale

Audience: Researchers evaluating few-shot vs zero-shot approaches
Last Updated: 2026-01-03


Overview

This document covers critical design considerations for few-shot PHQ-8 scoring, including known limitations and their fixes.

Task validity note: PHQ-8 is a 2-week frequency instrument, while DAIC-WOZ transcripts are not structured as PHQ administration. Few-shot retrieval can only help when there is grounded, item-relevant evidence to embed; otherwise references will be sparse. See: docs/clinical/task-validity.md and docs/results/few-shot-analysis.md.


The Participant-Level Score Problem

The Issue

The paper's few-shot implementation has a fundamental limitation: participant-level PHQ-8 scores are assigned to individual chunks regardless of chunk content.

How PHQ-8 works:
- 8 items (Sleep, Tired, Appetite, etc.), each scored 0-3
- Total score = sum of all 8 items, range 0-24

How chunks are created:
- Transcripts split into 8-line sliding windows (step=2)
- Result: ~100 chunks per participant
- Only a FEW chunks actually discuss any specific symptom
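
For concreteness, a minimal sketch of the sliding-window chunking described above (the function name and transcript representation are illustrative, not the repository's actual implementation):

from typing import Iterator

def sliding_window_chunks(lines: list[str], window: int = 8, step: int = 2) -> Iterator[list[str]]:
    """Yield overlapping 8-line windows with step=2, as described above."""
    # Short transcripts still yield one chunk covering whatever lines exist.
    last_start = max(len(lines) - window, 0)
    for start in range(0, last_start + 1, step):
        yield lines[start:start + window]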

The flaw:

Participant 300 has PHQ8_Sleep = 2

Chunk 5 (about career goals):
  "Ellie: what's your dream job
   Participant: open a business"
  → Gets labeled: Sleep Score = 2  ← WRONG (nothing about sleep!)

Chunk 95 (about sleep):
  "Ellie: have you had trouble sleeping
   Participant: yes every night"
  → Gets labeled: Sleep Score = 2  ← CORRECT

Every chunk from a participant gets the SAME score, regardless of content.
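
In code form, the flaw amounts to broadcasting one interview-level label onto every chunk without looking at the text. A sketch under assumed names (Chunk and participant_scores are hypothetical, not the repository's data model):

from dataclasses import dataclass

@dataclass
class Chunk:
    participant_id: int
    text: str
    sleep_score: int | None = None

# Interview-level PHQ-8 item scores, keyed by participant ID.
participant_scores = {300: {"PHQ8_Sleep": 2}}

def label_chunks_participant_level(chunks: list[Chunk]) -> None:
    for chunk in chunks:
        # The chunk text is never inspected: the career-goals chunk and the
        # sleep chunk above both end up labeled Sleep Score = 2.
        chunk.sleep_score = participant_scores[chunk.participant_id]["PHQ8_Sleep"]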

The Fix: Spec 35 (Chunk-Level Scoring)

Instead of assigning participant-level scores to chunks, we score each chunk individually (0–3 or null) based on the chunk content:

# Generate per-chunk scores
uv run python scripts/score_reference_chunks.py \
  --embeddings-file huggingface_qwen3_8b_paper_train_participant_only \
  --scorer-backend ollama \
  --scorer-model gemma3:27b-it-qat

# Enable at runtime
EMBEDDING_REFERENCE_SCORE_SOURCE=chunk
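
For orientation, the flag roughly selects which label accompanies a retrieved example. A sketch under assumed field names (chunk_record and its keys are illustrative, not the actual output schema):

import os

# Illustrative shape of one scored reference chunk (not the actual output schema).
chunk_record = {
    "participant_id": 300,
    "chunk_id": 95,
    "phq8_item": "Sleep",
    "participant_score": 2,  # interview-level label that every chunk inherits
    "chunk_score": 2,        # content-based LLM score for this chunk, or None if no evidence
}

# The environment flag selects which label accompanies the retrieved example.
source = os.environ.get("EMBEDDING_REFERENCE_SCORE_SOURCE", "participant")
reference_score = chunk_record["chunk_score"] if source == "chunk" else chunk_record["participant_score"]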

See Chunk-level scoring for full details.


Zero-Shot Inflation Hypothesis

The Issue

The DAIC-WOZ transcripts include Ellie's questions, which directly probe PHQ-8 symptoms:

Ellie: have you been diagnosed with depression
Participant: yes i was diagnosed last year
Ellie: can you tell me more about that
Participant: i was feeling really down and couldn't sleep

Key insight: Ellie asks DIRECT questions about PHQ-8 symptoms:
- "What are you like when you don't get enough sleep?" → PHQ8_Sleep
- "Do you have trouble concentrating?" → PHQ8_Concentrating
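
A minimal sketch of how a participant-only view of a transcript can be produced by dropping Ellie's turns (the speaker-label format and function name are assumptions about the transcript layout, not the repository's actual preprocessing):

def participant_only(transcript_lines: list[str]) -> list[str]:
    """Drop the interviewer's turns so the model cannot key on Ellie's symptom-probing questions."""
    return [
        line for line in transcript_lines
        if not line.strip().lower().startswith("ellie:")
    ]

dialogue = [
    "Ellie: have you been diagnosed with depression",
    "Participant: yes i was diagnosed last year",
]
print(participant_only(dialogue))  # ['Participant: yes i was diagnosed last year']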

External Validation: The Burdisso Paper

Paper: "DAIC-WOZ: On the Validity of Using the Therapist's prompts in Automatic Depression Detection" (Burdisso et al., 2024)

Critical Finding: Models using Ellie's prompts achieve 0.88 F1 vs 0.85 F1 for participant-only models.

"Models using interviewer's prompts learn to focus on a specific region of the interviews, where questions about past experiences with mental health issues are asked, and use them as discriminative shortcuts to detect depressed participants."

Implications

Mode                         | What It Tests                    | Validity
-----------------------------|----------------------------------|------------------------------
Zero-shot (participant-only) | Can LLM assess patient's words?  | HIGH - the real test
Zero-shot (full transcript)  | Can LLM read Ellie's shortcuts?  | Lower - potentially inflated
Few-shot (paper method)      | Noisy chunks + wrong scores      | Lower - label noise
Few-shot + Spec 35           | Filtered chunks + correct scores | Higher

The TRUE Baseline

The question isn't "why is few-shot worse?" but:

  1. Is zero-shot artificially inflated by Ellie's shortcuts?
  2. What is the TRUE baseline (participant-only)?


Few-Shot Still Has Value

Despite the design flaws described above, few-shot retrieval remains valuable in two respects:

Model Size Dependency

Model Size               | Few-Shot Value | Reason
-------------------------|----------------|------------------------------
Small (Gemma 27B, local) | HIGH           | Needs calibration examples
Large (GPT-4, frontier)  | Lower          | Has already learned patterns

Explainability

Property        | Chain-of-Thought          | RAG/CRAG
----------------|---------------------------|-------------------------------
Reproducibility | Varies between runs       | Fixed with same index
Grounding       | Generated rationalization | Anchored to real examples
Verifiability   | Cannot verify reasoning   | Can examine retrieved examples
Auditability    | May change on re-run      | Citable, stable

"RAG-based explainability provides something that chain-of-thought prompting fundamentally cannot — grounded, verifiable clinical reasoning."


The CRAG Pipeline (Specs 34 + 35 + 36)

Spec    | What It Does                                       | Result
--------|----------------------------------------------------|-----------------------------------------------------
Spec 34 | Tag chunks with relevant PHQ-8 items at index time | Only retrieve Sleep-tagged chunks for Sleep queries
Spec 35 | Score each chunk individually via LLM              | Chunks get accurate, content-based scores
Spec 36 | Validate references at query time (CRAG-style)     | Reject irrelevant/contradictory chunks

Together:

Naive Few-Shot (paper)           = Naive RAG
   ↓ add Spec 34 (tag filter)    = Better RAG
   ↓ add Spec 35 (chunk scoring) = Even Better RAG
   ↓ add Spec 36 (validation)    = CRAG (2025 gold standard)
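
A minimal sketch of how the three specs might compose at query time (function names, the index object, and field names are illustrative assumptions, not the repository's actual API):

def validate_reference(query_item: str, query_text: str, chunk: dict) -> bool:
    """Placeholder for the Spec 36 check (in practice an LLM or classifier judgment)."""
    return chunk.get("chunk_score") is not None

def retrieve_references(query_item: str, query_text: str, index, k: int = 4) -> list[dict]:
    """Compose Specs 34-36 at retrieval time."""
    # Spec 34: only consider chunks tagged with the queried PHQ-8 item at index time.
    candidates = [c for c in index.search(query_text, top_k=50) if query_item in c["phq8_tags"]]

    # Spec 35: keep chunks with a content-based score; drop chunks scored null (no evidence).
    scored = [c for c in candidates if c.get("chunk_score") is not None]

    # Spec 36: CRAG-style validation; reject irrelevant or contradictory references.
    validated = [c for c in scored if validate_reference(query_item, query_text, c)]

    return validated[:k]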


Scorer Model Selection (Spec 35)

Circularity Risk

If the scorer and assessor are the same model:
- Correlated bias: assessor is more likely to agree with scorer's labeling style
- Metric inflation: few-shot can look "better" because examples match model's priors

Recommendation

Priority      | Scorer Choice                              | Notes
--------------|--------------------------------------------|------------------------------------------
1 (ideal)     | MedGemma via HuggingFace                   | Medical tuning, most defensible
2 (practical) | Different model family (qwen2.5, llama3.1) | Truly disjoint
3 (baseline)  | Same model with --allow-same-model         | Explicit opt-in, ablate against disjoint
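
As a guard against accidental circularity, the scorer/assessor pairing can be checked before an experiment runs. A minimal sketch (the helper name and behavior are illustrative, not an existing repository API):

import warnings

def check_scorer_assessor(scorer_model: str, assessor_model: str, allow_same_model: bool = False) -> None:
    """Refuse (or warn about) a scorer/assessor pairing that risks circularity."""
    if scorer_model == assessor_model:
        if not allow_same_model:
            raise ValueError(
                f"Scorer and assessor are both '{scorer_model}'. "
                "Use a disjoint scorer, or opt in explicitly and ablate against a disjoint one."
            )
        warnings.warn(
            f"Scorer and assessor are both '{scorer_model}': expect correlated bias and possible metric inflation."
        )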

See Chunk-level scoring for generation commands.


Summary

  1. Participant-level scoring is flawed - chunks inherit labels unrelated to their content (Spec 35 fixes this)
  2. Zero-shot may be inflated - Ellie's questions provide discriminative shortcuts
  3. Few-shot still has value - for small models and for explainability
  4. The CRAG pipeline (Specs 34+35+36) is the current best practice