Glossary
Terminology used throughout the AI Psychiatrist codebase and documentation.
Clinical Terms
PHQ-8 (Patient Health Questionnaire-8)
An 8-item self-report depression screening tool derived from the PHQ-9 (which includes a suicide ideation question). Each item assesses the frequency of a depressive symptom over the past two weeks on a 0-3 scale.
Items:
1. NoInterest (Anhedonia): Little interest or pleasure in doing things
2. Depressed: Feeling down, depressed, or hopeless
3. Sleep: Trouble falling/staying asleep, or sleeping too much
4. Tired: Feeling tired or having little energy
5. Appetite: Poor appetite or overeating
6. Failure: Feeling bad about yourself, or that you are a failure
7. Concentrating: Trouble concentrating on things
8. Moving: Moving or speaking slowly, or being fidgety/restless
Scoring:
- 0 = Not at all
- 1 = Several days
- 2 = More than half the days
- 3 = Nearly every day
MDD (Major Depressive Disorder)
A clinical diagnosis based on DSM-5 criteria. In the PHQ-8 context, a total score ≥ 10 indicates likely MDD.
Severity Levels
Depression severity categories derived from PHQ-8 total scores:
| Level | Score Range | Description |
|---|---|---|
| MINIMAL | 0-4 | No significant symptoms |
| MILD | 5-9 | Mild depressive symptoms |
| MODERATE | 10-14 | Moderate symptoms (MDD threshold) |
| MOD_SEVERE | 15-19 | Moderately severe symptoms |
| SEVERE | 20-24 | Severe depressive symptoms |
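The table above can be read as a simple lookup from total score to level. The following is a minimal sketch of that mapping; the names `Severity`, `severity_from_total`, and `meets_mdd_threshold` are hypothetical and are not taken from the codebase.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels from the table above (hypothetical enum for illustration)."""

    MINIMAL = "MINIMAL"
    MILD = "MILD"
    MODERATE = "MODERATE"
    MOD_SEVERE = "MOD_SEVERE"
    SEVERE = "SEVERE"


def severity_from_total(total: int) -> Severity:
    """Map a PHQ-8 total score (0-24) to its severity level."""
    if not 0 <= total <= 24:
        raise ValueError(f"PHQ-8 total must be in 0-24, got {total}")
    if total <= 4:
        return Severity.MINIMAL
    if total <= 9:
        return Severity.MILD
    if total <= 14:
        return Severity.MODERATE
    if total <= 19:
        return Severity.MOD_SEVERE
    return Severity.SEVERE


def meets_mdd_threshold(total: int) -> bool:
    """A total score of 10 or more indicates likely MDD in the PHQ-8 context."""
    return total >= 10
```

For example, a total of 12 maps to `Severity.MODERATE` and meets the MDD threshold, while a total of 9 maps to `Severity.MILD` and does not.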
DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition)
The standard classification system for mental disorders used by mental health professionals. PHQ-8 items align with DSM-5 criteria for Major Depressive Episode.
N/A Score
Assigned when the model cannot determine a PHQ-8 item score because the transcript contains insufficient evidence. N/A scores contribute 0 to the total score.
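As an illustration of how N/A items affect the total, here is a minimal sketch. It assumes N/A is represented as `None`, which is a choice made for this example only; `phq8_total` and `coverage` are hypothetical helper names.

```python
from typing import Dict, Optional


def phq8_total(item_scores: Dict[str, Optional[int]]) -> int:
    """Sum item scores, treating N/A (None in this sketch) as contributing 0."""
    total = 0
    for name, score in item_scores.items():
        if score is None:
            continue  # N/A adds nothing to the total
        if score not in (0, 1, 2, 3):
            raise ValueError(f"{name}: score must be 0-3 or None, got {score}")
        total += score
    return total


def coverage(item_scores: Dict[str, Optional[int]]) -> float:
    """Fraction of items that received a numeric score rather than N/A."""
    scored = sum(1 for s in item_scores.values() if s is not None)
    return scored / len(item_scores)


example = {
    "NoInterest": 2,
    "Depressed": 1,
    "Sleep": None,  # insufficient evidence in the transcript
    "Tired": 3,
    "Appetite": None,
    "Failure": 0,
    "Concentrating": 1,
    "Moving": 0,
}
```

Here `phq8_total(example)` is 7 and `coverage(example)` is 0.75. Because N/A items add 0, a total computed from a low-coverage assessment is a lower bound on what a fully scored assessment would produce.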
Dataset Terms
DAIC-WOZ (Distress Analysis Interview Corpus - Wizard of Oz)
A multimodal dataset of clinical interviews for depression detection research. Contains 189 participants with semi-structured interviews conducted by an animated virtual interviewer named Ellie.
Key facts:
- Requires EULA from USC ICT for access
- 142 labeled participants (train + dev), 47 unlabeled (test)
- Participant IDs range 300-492 (with gaps)
- Interview duration: 5-25 minutes
AVEC (Audio/Visual Emotion Challenge)
Annual challenge series for affective computing research. DAIC-WOZ was used in AVEC 2016-2019 challenges.
Ellie
The animated virtual interviewer character in DAIC-WOZ. Controlled via Wizard-of-Oz protocol (human operator behind the scenes).
Participant
An individual who completed a DAIC-WOZ interview. Identified by a numeric ID (e.g., 300, 301, 402).
System Terms
Agent
A specialized LLM-powered component that performs a specific task in the pipeline. AI Psychiatrist uses four agents:
- Qualitative Assessment Agent: Analyzes social, biological, and risk factors
- Judge Agent: Evaluates qualitative assessment quality
- Quantitative Assessment Agent: Selectively predicts PHQ-8 item scores (0-3) or abstains (N/A) when transcript evidence is insufficient (see docs/clinical/task-validity.md)
- Meta-Review Agent: Integrates all assessments into final severity
Feedback Loop
Iterative refinement process where the Judge Agent evaluates the Qualitative Agent's output and triggers re-generation if any metric scores ≤ 3 (out of 5). Runs up to 10 iterations per the paper.
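The control flow can be sketched as follows. This is not the codebase's implementation: `generate`, `judge`, and `refine` are hypothetical callables standing in for the agents, and counting each re-generation as one iteration is an assumption about how the limit of 10 is applied.

```python
from typing import Callable, Dict, Tuple

METRICS = ("coherence", "completeness", "specificity", "accuracy")
LOW_SCORE_THRESHOLD = 3  # any metric at or below this triggers re-generation
MAX_ITERATIONS = 10


def needs_refinement(scores: Dict[str, int]) -> bool:
    """True if any of the four metrics scores <= 3 on the 1-5 scale."""
    return any(scores[m] <= LOW_SCORE_THRESHOLD for m in METRICS)


def run_feedback_loop(
    generate: Callable[[], str],
    judge: Callable[[str], Dict[str, int]],
    refine: Callable[[str, Dict[str, int]], str],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[str, Dict[str, int], int]:
    """Refine an assessment until every metric passes or the limit is hit."""
    assessment = generate()
    scores = judge(assessment)
    iterations = 0
    while needs_refinement(scores) and iterations < max_iterations:
        assessment = refine(assessment, scores)
        scores = judge(assessment)
        iterations += 1
    return assessment, scores, iterations
```

The loop returns the last assessment even if it still fails a metric after the final iteration, so a caller would need to check the returned scores if it wants to treat that case differently.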
Evaluation Metrics
Four metrics used by the Judge Agent to evaluate qualitative assessments (1-5 Likert scale):
| Metric | Description |
|---|---|
| Coherence | Logical consistency of the assessment |
| Completeness | Coverage of all relevant symptoms and frequencies |
| Specificity | Avoidance of vague or generic statements |
| Accuracy | Alignment with PHQ-8/DSM-5 criteria |
Zero-Shot vs Few-Shot
Zero-shot: The LLM receives only the transcript and prompt, with no reference examples.
Few-shot: The LLM receives similar transcript chunks from a reference database along with their PHQ-8 scores. The paper reports a large MAE improvement vs zero-shot, but reproduction results can vary by model/backend and retrieval configuration; see docs/results/reproduction-results.md.
Embeddings
Vector representations of transcript chunks used for similarity search in few-shot retrieval. Generated by the qwen3-embedding:8b model (4096 dimensions).
Reference Store
Pre-computed database of embeddings from training set transcripts with known PHQ-8 scores. Used to find similar examples for few-shot prompting.
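A lookup against such a store might look like the sketch below. It assumes cosine similarity as the similarity measure and uses toy 3-dimensional vectors in place of real 4096-dimensional embeddings; `cosine_similarity`, `top_k_similar`, and the tuple layout of the store are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One reference entry: (embedding, chunk text, PHQ-8 item score). Layout is assumed.
ReferenceEntry = Tuple[Sequence[float], str, int]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float], store: Sequence[ReferenceEntry], k: int = 2
) -> List[Tuple[float, str, int]]:
    """Return the k reference chunks most similar to the query embedding."""
    scored = [
        (cosine_similarity(query, emb), text, score) for emb, text, score in store
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


toy_store: List[ReferenceEntry] = [
    ([1.0, 0.0, 0.0], "i can't sleep at night", 2),
    ([0.0, 1.0, 0.0], "i sleep fine", 0),
    ([0.9, 0.1, 0.0], "i wake up a lot", 1),
]
matches = top_k_similar([1.0, 0.0, 0.0], toy_store, k=2)
```

With the toy data, `matches` contains the first and third entries, in that order; their texts and scores are what a few-shot prompt would include as reference examples.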
Chunk
A segment of transcript text used for embedding generation. Appendix D hyperparameters: 8 lines with a 2-line sliding window step.
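A sliding window with these hyperparameters can be sketched as below. `chunk_transcript` is a hypothetical name, and keeping a shorter final chunk so that trailing lines are not dropped is an assumption; the source does not say how the tail is handled.

```python
from typing import List, Sequence


def chunk_transcript(
    lines: Sequence[str], chunk_size: int = 8, step: int = 2
) -> List[str]:
    """Split transcript lines into overlapping chunks of `chunk_size` lines.

    Consecutive chunks start `step` lines apart, so with the defaults each
    chunk shares 6 lines with the next one.
    """
    if chunk_size <= 0 or step <= 0:
        raise ValueError("chunk_size and step must be positive")
    chunks: List[str] = []
    for start in range(0, len(lines), step):
        window = lines[start:start + chunk_size]
        chunks.append("\n".join(window))
        if start + chunk_size >= len(lines):
            break  # this window already reaches the end of the transcript
    return chunks
```

For a 12-line transcript this yields three chunks covering lines 0-7, 2-9, and 4-11. A transcript shorter than 8 lines yields a single chunk containing all of its lines.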
Architecture Terms
Clean Architecture
Software design pattern with concentric layers:
- Domain: Business entities and logic (innermost)
- Use Cases/Services: Application-specific business rules
- Adapters: Interface implementations (API, CLI)
- Infrastructure: External concerns (LLM, database, logging)
Protocol
Python typing construct (similar to interface) defining expected methods. Used for dependency injection and testability.
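A small sketch of how a Protocol supports dependency injection and testing follows. `ChatClient`, `FakeChatClient`, and `summarize` are hypothetical names invented for this example, not interfaces from the codebase.

```python
from typing import Protocol


class ChatClient(Protocol):
    """Anything with a matching `chat` method satisfies this protocol."""

    def chat(self, prompt: str) -> str:
        ...


class FakeChatClient:
    """Test double: no inheritance from ChatClient is needed."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts = []  # records what the caller sent

    def chat(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


def summarize(client: ChatClient, transcript: str) -> str:
    """Depends only on the protocol, so any conforming client can be injected."""
    return client.chat(f"Summarize the following interview:\n{transcript}")


fake = FakeChatClient("canned summary")
result = summarize(fake, "Ellie: how are you doing today")
```

Because matching is structural, a real LLM client and the fake are interchangeable wherever a `ChatClient` is expected, which lets tests run without a model backend.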
Entity
Mutable domain object with identity (UUID). Examples: Transcript, PHQ8Assessment, MetaReview.
Value Object
Immutable domain object without identity. Two value objects are equal if all of their attributes are equal. Examples: ItemAssessment, EvaluationScore, SimilarityMatch.
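The contrast between the two kinds of domain object can be sketched with dataclasses. The class names are borrowed from the examples above, but the fields shown are hypothetical, and comparing entities by ID is a common convention assumed here rather than confirmed from the codebase.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(eq=False)
class Transcript:
    """Entity sketch: mutable, compared by identity (its UUID)."""

    text: str
    id: UUID = field(default_factory=uuid4)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Transcript) and self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)


@dataclass(frozen=True)
class EvaluationScore:
    """Value object sketch: immutable, compared attribute by attribute."""

    metric: str
    score: int


first = Transcript(text="hello")
second = Transcript(text="hello")  # same content, different identity
score_a = EvaluationScore(metric="coherence", score=4)
score_b = EvaluationScore(metric="coherence", score=4)
```

Here `first != second` even though their text matches, while `score_a == score_b` because every attribute matches. Assigning to `score_a.score` raises `dataclasses.FrozenInstanceError`.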
Configuration Terms
Ollama
Open-source platform for running LLMs locally. Default chat backend (LLM_BACKEND=ollama) and optional embedding backend (EMBEDDING_BACKEND=ollama).
Model Tags
Model identifiers depend on the backend.
The Ollama backend uses identifiers in the format name:variant:
- gemma3:27b - Gemma 3 27B (paper baseline)
- qwen3-embedding:8b - Qwen 3 8B embedding model
HuggingFace backend uses official model IDs (e.g. google/medgemma-27b-text-it). The codebase
also supports a canonical alias medgemma:27b, but there is no official MedGemma model in the
Ollama library; any Ollama “medgemma” is a community conversion and may behave differently.
Pydantic Settings
Configuration management using Pydantic. Settings are loaded from:
1. Default values in code
2. .env file
3. Environment variables (highest priority)
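The precedence order can be illustrated with a standard-library sketch. This deliberately does not use Pydantic's API; `parse_dotenv` and `resolve_settings` are hypothetical helpers, the default for `EMBEDDING_BACKEND` is assumed, and the value `huggingface` is used only as an example string.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults "in code". The EMBEDDING_BACKEND default is an assumption.
DEFAULTS: Dict[str, str] = {"LLM_BACKEND": "ollama", "EMBEDDING_BACKEND": "ollama"}


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(
    defaults: Mapping[str, str],
    dotenv_text: str = "",
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Resolve each declared setting: environment > .env file > default."""
    env = os.environ if environ is None else environ
    dotenv_values = parse_dotenv(dotenv_text)
    resolved: Dict[str, str] = {}
    for key, default in defaults.items():
        if key in env:
            resolved[key] = env[key]
        elif key in dotenv_values:
            resolved[key] = dotenv_values[key]
        else:
            resolved[key] = default
    return resolved
```

For instance, if the `.env` text sets `LLM_BACKEND=huggingface` and the environment sets `LLM_BACKEND=ollama`, the resolved value is `ollama`, because the environment variable has the highest priority.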
Metric Terms
MAE (Mean Absolute Error)
Average absolute difference between predicted and actual PHQ-8 item scores. Lower is better.
Paper results:
- Zero-shot: 0.796 MAE
- Few-shot: 0.619 MAE (22% lower item-level MAE vs zero-shot)
- MedGemma few-shot: 0.505 MAE (18% lower item-level MAE vs Gemma; Appendix F, with lower coverage)
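The metric and the relative reductions quoted above can be reproduced with a short sketch. `item_mae` and `relative_reduction` are hypothetical helpers; representing N/A as `None` and excluding N/A predictions from the average are assumptions made here, not a statement of how the paper or the codebase computes MAE.

```python
from typing import Optional, Sequence


def item_mae(predicted: Sequence[Optional[int]], actual: Sequence[int]) -> float:
    """Mean absolute error over items that received a numeric prediction.

    Items predicted as N/A (None in this sketch) are skipped, which is an
    assumption; skipping them is why coverage matters when comparing MAEs.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    errors = [abs(p - a) for p, a in zip(predicted, actual) if p is not None]
    if not errors:
        raise ValueError("no scored items to evaluate")
    return sum(errors) / len(errors)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Three scored items with absolute errors 1, 0, and 2; one N/A item skipped.
example_mae = item_mae([2, None, 0, 3], [1, 2, 0, 1])

few_vs_zero = relative_reduction(0.796, 0.619)   # about 0.22
medgemma_vs_gemma = relative_reduction(0.619, 0.505)  # about 0.18
```

The two reductions round to the 22% and 18% figures listed above, and `example_mae` is 1.0.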
Accuracy
Percentage of correct severity level predictions.
Paper results:
- Meta-Review: 78% severity accuracy
- Comparable to human expert performance
Likert Scale
Rating scale used for Judge Agent metrics. 1-5 scale where:
- 1-2 = Poor
- 3 = Marginal
- 4-5 = Acceptable
File Format Terms
TSV (Tab-Separated Values)
Text file format using tabs as delimiters. DAIC-WOZ transcripts use TSV format.
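Reading such a file needs only the standard `csv` module. In the sketch below, the column names `start_time`, `stop_time`, `speaker`, and `value` are an assumption about the transcript header and should be checked against the actual files; `read_transcript_tsv` and `to_dialogue_lines` are hypothetical helpers.

```python
import csv
import io
from typing import Dict, List


def read_transcript_tsv(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated transcript text into one dict per utterance."""
    # QUOTE_NONE keeps quote characters in utterances as literal text.
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def to_dialogue_lines(rows: List[Dict[str, str]]) -> List[str]:
    """Render rows as 'Speaker: utterance' lines (assumed column names)."""
    return [f"{row['speaker']}: {row['value']}" for row in rows]


sample = (
    "start_time\tstop_time\tspeaker\tvalue\n"
    "0.0\t1.5\tEllie\thi i'm ellie\n"
    "2.0\t3.0\tParticipant\thello\n"
)
rows = read_transcript_tsv(sample)
dialogue = to_dialogue_lines(rows)
```

For a file on disk, the same reader can be applied to a handle opened with `open(path, newline="", encoding="utf-8")` instead of `io.StringIO`.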
NPZ (NumPy Compressed Archive)
Binary format for storing multiple NumPy arrays. Used for pre-computed reference embeddings.
JSON Sidecar
Companion JSON file stored alongside an .npz artifact.
In this repo:
- {name}.json contains chunk texts aligned with the NPZ rows per participant.
- {name}.meta.json (optional) contains provenance/validation metadata (backend, model, dimension, chunking).