RAG Overview: Embeddings and Few-Shot Retrieval
Audience: Clinicians and non-CS folks who want to understand the "magic"
Last Updated: 2026-01-03
The Question This Answers
"How does the system find similar patients to help score new ones?"
This document explains embeddings and few-shot retrieval without requiring any computer science background.
Task validity note: PHQ-8 is a 2-week frequency self-report instrument, while DAIC-WOZ interviews are not structured as PHQ administration. Few-shot retrieval can only help when there is grounded, item-relevant evidence to embed; otherwise the system often abstains (N/A). See: docs/clinical/task-validity.md.
The Core Idea
When you read a patient's interview, you might think:
"This reminds me of Patient X from last year who also couldn't sleep and felt hopeless. That patient had moderate depression."
The system does the same thing, but mathematically.
Part 1: What is an Embedding?
The Analogy: GPS Coordinates
Imagine every sentence in the world has a "location" in a giant map of meaning.
- "I can't sleep at night" → Location A
- "I have insomnia" → Location B (very close to A - similar meaning)
- "I love pizza" → Location C (far from A and B - different meaning)
An embedding is like GPS coordinates for a sentence's meaning.
The Technical Reality
Instead of 2D coordinates (latitude, longitude), embeddings use 4096 dimensions. But the principle is the same: similar meanings have similar coordinates.
| Sentence | "Meaning Location" (simplified) |
|---|---|
| "I can't sleep" | [0.8, 0.2, 0.9, ...4096 numbers...] |
| "I have insomnia" | [0.79, 0.21, 0.88, ...very similar...] |
| "I love pizza" | [0.1, 0.7, 0.3, ...very different...] |
Why 4096 Dimensions?
More dimensions = more nuance captured. The paper reports 4096 performed best among the tested values (64, 256, 1024, 4096), and this repo defaults to 4096.
Think of it like describing a patient:
- 2 dimensions: "depressed" and "anxious"
- 10 dimensions: add "sleep quality", "energy", "appetite", etc.
- 4096 dimensions: captures extremely subtle differences in meaning
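If you are curious what this looks like in practice, here is a minimal sketch using a small, publicly available sentence-embedding model. This is not the 4096-dimension model this repo uses (the dimension differs), but the idea is identical: a sentence goes in, a list of numbers comes out.

```python
# Illustrative only: a small public model, not the 4096-dimension model used by this repo.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # this model produces 384 numbers per sentence

sentences = ["I can't sleep at night", "I have insomnia", "I love pizza"]
embeddings = model.encode(sentences)

print(embeddings.shape)    # (3, 384): three sentences, 384 "coordinates" each
print(embeddings[0][:5])   # the first five coordinates of the first sentence
```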
Part 2: How Similarity is Measured
The Analogy: Distance on a Map
If two places have similar GPS coordinates, they're close together.
Same with embeddings: if two sentences have similar "meaning coordinates," they're semantically similar.
Cosine Similarity
Raw cosine similarity ranges from -1 to 1:
- 1.0 = identical direction (very similar meaning)
- 0.0 = orthogonal (no directional similarity)
- -1.0 = opposite direction (very dissimilar)
In this codebase, we store similarity in a 0 to 1 range by applying a simple, monotonic transform:
similarity = (1 + raw_cosine) / 2
So in the stored similarity scale:
- 1.0 = identical (raw_cosine = 1.0)
- 0.5 = neutral / orthogonal (raw_cosine = 0.0)
- 0.0 = opposite (raw_cosine = -1.0)
| Comparison | Similarity |
|---|---|
| "I can't sleep" vs "I have insomnia" | 0.92 |
| "I can't sleep" vs "I feel tired" | 0.75 |
| "I can't sleep" vs "I love hiking" | 0.15 |
These numbers are illustrative; exact values depend on the embedding model.
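To make the calculation concrete, here is a minimal sketch of raw cosine similarity and the 0-1 transform described above, using tiny made-up 3-number "coordinates" in place of real 4096-dimension embeddings:

```python
import numpy as np

def raw_cosine(a, b):
    # -1 (opposite direction) to 1 (identical direction)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def stored_similarity(a, b):
    # the 0-1 scale stored by this codebase: (1 + raw_cosine) / 2
    return (1 + raw_cosine(a, b)) / 2

# toy 3-number "meaning coordinates" (real embeddings have 4096)
cant_sleep = np.array([0.80, 0.20, 0.90])
insomnia   = np.array([0.79, 0.21, 0.88])
pizza      = np.array([0.10, 0.70, 0.30])

print(stored_similarity(cant_sleep, insomnia))  # close to 1.0
print(stored_similarity(cant_sleep, pizza))     # noticeably lower
```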
Part 3: The Reference Store (Knowledge Base)
What It Contains
Before running on new patients, we processed all training patients:
- Split each transcript into chunks (8 lines each)
- Computed embeddings for each chunk
- Stored them with PHQ-8 reference scores (participant-level ground truth by default; optionally chunk-level estimates when enabled)
Result: A database of thousands of chunks from the training split. In the paper-style split, the training set is 58 participants; in the AVEC2017 split, it is 107 participants. The exact chunk count depends on the chosen split and chunking parameters, but the contents are always:
- The text itself
- Its embedding (4096 numbers)
- A PHQ-8 item score used as the reference label (participant ground truth or chunk-level estimate, depending on configuration)
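As a rough sketch of how the store is assembled (the helper names here, such as `chunk_transcript`, `build_reference_store`, and `embed_text`, are hypothetical stand-ins, not the repo's actual function names):

```python
import numpy as np

def chunk_transcript(lines, chunk_size=8):
    # split a transcript into chunks of roughly 8 lines each
    return ["\n".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]

def build_reference_store(transcripts, phq8_scores, embed_text):
    """transcripts:  {participant_id: list of transcript lines}
    phq8_scores:  {participant_id: {item_name: score}}  (participant-level labels)
    embed_text:   stand-in for whichever embedding model is configured"""
    store = []
    for pid, lines in transcripts.items():
        for chunk in chunk_transcript(lines):
            store.append({
                "participant": pid,
                "text": chunk,
                "embedding": np.asarray(embed_text(chunk)),  # e.g. 4096 numbers
                "scores": phq8_scores[pid],                   # reference labels
            })
    return store
```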
Visualized
REFERENCE STORE
┌──────────────────────────────────────────────────────────────┐
│ Patient 101, Chunk 3 │
│ Text: "I haven't been able to sleep... I'm so exhausted" │
│ Embedding: [0.45, 0.82, 0.31, ... 4096 numbers ...] │
│ PHQ8_Sleep score: 2 │
├──────────────────────────────────────────────────────────────┤
│ Patient 142, Chunk 7 │
│ Text: "Nothing brings me joy anymore, I don't care" │
│ Embedding: [0.71, 0.23, 0.88, ... 4096 numbers ...] │
│ PHQ8_NoInterest score: 3 │
├──────────────────────────────────────────────────────────────┤
│ ... ~7,000 more chunks ... │
└──────────────────────────────────────────────────────────────┘
Part 4: Few-Shot Retrieval
The Analogy: "Show, Don't Tell"
Imagine training a new resident to score PHQ-8. You could:
Option A (Zero-Shot): Give them the PHQ-8 manual and say "score this patient."
Option B (Few-Shot): Show them 2-3 examples first:
"Here's Patient A who said 'I can't sleep' and had a score of 2. Here's Patient B who said 'I sleep too much' and had a score of 2. Now, this new patient says 'I wake up every night.' What's your score?"
Option B is better because examples calibrate their judgment.
How the System Does This
For each PHQ-8 item in a new patient:
- Extract evidence: "The patient said: 'I wake up at 3am every night'"
- Embed the evidence: Convert to 4096-dimension coordinates
- Find similar chunks: Search reference store for closest matches
- Retrieve examples: Get the 2 most similar chunks with their scores
- Score with examples: LLM sees the new evidence PLUS similar examples
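Here is a minimal sketch of steps 2-5, reusing the hypothetical `store` and `embed_text` from the sketch above; the real pipeline's function names and prompt wording differ:

```python
import numpy as np

def top_k_references(evidence, store, embed_text, k=2):
    # steps 2-4: embed the evidence, rank reference chunks, keep the k closest
    query = np.asarray(embed_text(evidence))
    scored = []
    for ref in store:
        raw_cos = float(np.dot(query, ref["embedding"]) /
                        (np.linalg.norm(query) * np.linalg.norm(ref["embedding"])))
        scored.append(((1 + raw_cos) / 2, ref))   # stored 0-1 similarity
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def build_prompt(evidence, references, item):
    # step 5: show the retrieved examples, then ask for a score
    lines = ["Here are similar examples:"]
    for similarity, ref in references:
        lines.append(f"- '{ref['text']}' -> Score {ref['scores'][item]}")
    lines.append(f"Now score: '{evidence}'")
    return "\n".join(lines)
```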
Visual Example
NEW PATIENT'S EVIDENCE (Sleep):
"I wake up at 3am every night and can't get back to sleep"
│
▼ Compute embedding + search reference store
│
┌───────────────┴───────────────┐
│ │
▼ ▼
REFERENCE 1 (similarity: 0.89) REFERENCE 2 (similarity: 0.85)
"I keep waking up at night" "Can't stay asleep, up at 4am"
Score: 2 Score: 2
│ │
└───────────────┬───────────────┘
│
▼
LLM PROMPT:
"Here are similar examples:
- 'I keep waking up at night' → Score 2
- 'Can't stay asleep, up at 4am' → Score 2
Now score: 'I wake up at 3am every night and can't get back to sleep'"
LLM OUTPUT: Score 2
Part 5: Why This Works
The Calibration Effect
Without examples, the LLM must infer what "2" means on the PHQ-8 scale from the rubric and the transcript evidence.
With examples (when there is item-relevant evidence to retrieve), the LLM can calibrate:
"Oh, 'waking up at night' is a 2, not a 3. Got it."
The Paper's Results
| Mode | MAE | Explanation |
|---|---|---|
| Zero-shot | 0.796 | No examples, rubric-only calibration |
| Few-shot | 0.619 | 2 examples per item, calibrated |
That is a 22% lower item-level MAE than zero-shot ((0.796 - 0.619) / 0.796 ≈ 0.22), as reported in the paper. In this repository, few-shot performance is sensitive to retrieval quality and can underperform zero-shot; see docs/results/reproduction-results.md and docs/results/run-history.md.
Part 6: Per-Item Retrieval
Each Symptom Gets Its Own Examples
The system doesn't find "similar patients overall." It finds similar evidence per PHQ-8 item:
| Item | Evidence Extracted | Similar References Found |
|---|---|---|
| Sleep | "I wake up at 3am" | 2 sleep-related chunks |
| Tired | "I have no energy" | 2 fatigue-related chunks |
| Appetite | (none found) | (none) → N/A |
Why Per-Item?
A patient might have severe sleep problems but mild appetite issues. Using overall similarity would miss this nuance.
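A rough per-item sketch of this flow, again with hypothetical helper names for the extraction, retrieval, and scoring steps (item names follow the dataset's PHQ8_* column naming):

```python
PHQ8_ITEMS = ["PHQ8_NoInterest", "PHQ8_Depressed", "PHQ8_Sleep", "PHQ8_Tired",
              "PHQ8_Appetite", "PHQ8_Failure", "PHQ8_Concentrating", "PHQ8_Moving"]

def score_participant(transcript, store, extract_evidence, top_k_references,
                      score_with_llm, embed_text):
    results = {}
    for item in PHQ8_ITEMS:
        evidence = extract_evidence(transcript, item)
        if not evidence:
            results[item] = "N/A"   # nothing to embed, so no retrieval and no score
            continue
        references = top_k_references(evidence, store, embed_text, k=2)
        results[item] = score_with_llm(item, evidence, references)
    return results
```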
Part 7: The Item Tagging Problem (Spec 34)
The Problem: Topic vs. Item Mismatch
Embedding similarity finds chunks that are semantically similar overall, but similarity doesn't guarantee the chunk is about the same PHQ-8 item.
Example of the problem:
You're scoring Sleep for a new patient. Your extracted evidence is:
"I can't sleep, I'm up all night worrying"
The embedding search might return:
Reference 1: "I worry constantly about money" (high similarity - both mention worry)
Reference 2: "I toss and turn at night" (moderate similarity - about sleep)
The first reference is semantically similar (both express anxiety/worry), but it's tagged with PHQ8_Failure or PHQ8_Concentrating—not PHQ8_Sleep. Using it as a few-shot example for Sleep could confuse the model.
The Solution: Item Tagging
We now tag each reference chunk with which PHQ-8 items it actually discusses:
BEFORE (untagged):
┌─────────────────────────────────────────────┐
│ Chunk: "I worry constantly about money" │
│ Embedding: [0.45, 0.82, ...] │
│ PHQ8 scores: (participant-level only) │
└─────────────────────────────────────────────┘
AFTER (tagged):
┌─────────────────────────────────────────────┐
│ Chunk: "I worry constantly about money" │
│ Embedding: [0.45, 0.82, ...] │
│ PHQ8 scores: (participant-level only) │
│ Tags: ["PHQ8_Failure", "PHQ8_Concentrating"]│ ← NEW
└─────────────────────────────────────────────┘
How Tagging Works
At index time (when embeddings are generated):
1. Each chunk is analyzed for PHQ-8-related keywords
2. Keywords are matched against a curated keyword list (phq8_keywords.yaml)
3. Matching items are stored in a .tags.json sidecar file
At retrieval time (when scoring a new patient):
1. If item tag filtering is enabled (EMBEDDING_ENABLE_ITEM_TAG_FILTER=true)
2. When retrieving references for PHQ8_Sleep, only chunks tagged with PHQ8_Sleep are considered
3. This eliminates semantically-similar-but-wrong-item references
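Here is a minimal sketch of the keyword-matching and filtering idea. The keyword lists below are illustrative placeholders, not the contents of phq8_keywords.yaml, and the function names are hypothetical:

```python
PHQ8_KEYWORDS = {
    "PHQ8_Sleep": ["sleep", "insomnia", "awake", "toss and turn"],
    "PHQ8_Tired": ["tired", "exhausted", "no energy"],
    # ... one entry per PHQ-8 item; the real pipeline loads these from phq8_keywords.yaml ...
}

def tag_chunk(text):
    # index time: record which items a chunk appears to discuss
    text_lower = text.lower()
    return [item for item, keywords in PHQ8_KEYWORDS.items()
            if any(kw in text_lower for kw in keywords)]

def filter_by_item(candidates, item):
    # retrieval time: keep only reference chunks tagged with the item being scored
    return [ref for ref in candidates if item in ref.get("tags", [])]
```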
Visual Example
RETRIEVING REFERENCES FOR PHQ8_Sleep (with filtering)
New Evidence: "I can't sleep, I'm up all night"
│
▼ Search with item filter
│
┌───────────────┼───────────────────────────────┐
│ │ │
▼ ▼ ▼
Chunk A Chunk B Chunk C
"I worry about "I toss and turn "Up every night
money" at night" can't sleep"
Tags: [Failure] Tags: [Sleep] Tags: [Sleep]
│ │
✗ FILTERED ▼ ▼
(no Sleep tag) ✓ INCLUDED ✓ INCLUDED
The Artifacts
Item tagging creates a new sidecar file alongside embeddings:
| File | Contents |
|---|---|
| {name}.npz | Embedding vectors (unchanged) |
| {name}.json | Chunk text (unchanged) |
| {name}.meta.json | Generation metadata (unchanged) |
| {name}.tags.json | NEW: Per-chunk PHQ-8 item tags |
The .tags.json format:
{
"303": [
["PHQ8_Sleep", "PHQ8_Tired"],
[],
["PHQ8_Depressed"]
],
"304": [...]
}
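To make the format concrete, here is a minimal sketch of reading the sidecar; the file name is hypothetical, and the layout (participant id mapped to one list of tags per chunk) follows the example above:

```python
import json

# {name}.tags.json maps each participant id to one list of item tags per chunk
with open("train.tags.json") as f:   # hypothetical file name
    tags = json.load(f)

print(tags["303"][0])   # tags for participant 303, chunk 0, e.g. ["PHQ8_Sleep", "PHQ8_Tired"]
print(tags["303"][1])   # an empty list means the chunk matched no PHQ-8 keywords
```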
Why This Matters
Without item tagging, few-shot retrieval can inject noise:
- High-similarity chunks about the wrong symptom
- Calibration examples that confuse rather than help

With item tagging, references are both:
1. Semantically similar (embedding-based)
2. Topically relevant (item-tagged)
Goal: reduce semantically-similar-but-wrong-item references. Whether this improves metrics depends on the model/run.
Part 8: When It Doesn't Help
The Appetite Problem
The paper (Appendix E) found:
"PHQ-8-Appetite had no successfully retrieved reference chunks"
Important: this statement is about few-shot reference retrieval (“no retrieved reference chunks”), not prediction coverage directly.
The paper continues (Appendix E) that Gemma 3 27B “did not identify any evidence related to appetite issues in the available transcripts, resulting in no reference for that symptom.” In our pipeline, reference retrieval is driven by embedding the extracted evidence per item. If the evidence extraction step returns no appetite evidence, there’s nothing to embed/query, so reference retrieval returns no appetite examples.
This often correlates with low appetite coverage (more N/A), but the two are not identical metrics.
Appetite coverage varies by run/model; see docs/results/run-history.md for concrete runs.
Summary: The Complete Picture
- Embeddings = Mathematical representation of meaning (like GPS for sentences)
- Reference Store = Database of training chunks with known scores
- Similarity Search = Find chunks with similar meaning to new evidence
- Few-Shot = Show the LLM similar examples before asking it to score
The key insight: Instead of telling the LLM "here's what a 2 means," we SHOW it examples of labeled chunks. The paper reports a large few-shot improvement, but in this repo few-shot performance depends heavily on retrieval quality (and can underperform zero-shot in some runs). See docs/results/reproduction-results.md and docs/results/run-history.md.
Glossary
| Term | Plain Definition |
|---|---|
| Embedding | A list of numbers (4096 here) representing a sentence's meaning |
| Similarity (transformed cosine) | A 0–1 score derived from cosine similarity: 1=identical, 0.5=neutral, 0=opposite |
| Reference Store | Database of training examples with known scores |
| Few-Shot | Showing examples before asking for a prediction |
| Zero-Shot | Predicting without any examples |
| Chunk | A small section of a transcript (~8 lines) |
Related Documentation
- Evidence extraction - How evidence is found
- Coverage explained - Why some items get N/A
- Clinical understanding - Clinical context