Glossary

Terminology used throughout the AI Psychiatrist codebase and documentation.

Clinical Terms

PHQ-8 (Patient Health Questionnaire-8)

An 8-item self-report depression screening tool derived from the PHQ-9 by omitting its ninth item, the question on suicidal ideation. Each item assesses the frequency of a depressive symptom over the past two weeks on a 0-3 scale, so the total score ranges from 0 to 24.

Items:
1. NoInterest (Anhedonia): Little interest or pleasure in doing things
2. Depressed: Feeling down, depressed, or hopeless
3. Sleep: Trouble falling/staying asleep, or sleeping too much
4. Tired: Feeling tired or having little energy
5. Appetite: Poor appetite or overeating
6. Failure: Feeling bad about yourself, or that you are a failure
7. Concentrating: Trouble concentrating on things
8. Moving: Moving or speaking slowly, or being fidgety/restless

Scoring:
- 0 = Not at all
- 1 = Several days
- 2 = More than half the days
- 3 = Nearly every day

MDD (Major Depressive Disorder)

A clinical diagnosis based on DSM-5 criteria. In the PHQ-8 context, a total score ≥ 10 is the screening cutoff that indicates likely MDD; the score alone is not a diagnosis.

Severity Levels

Depression severity categories derived from PHQ-8 total scores:

Level	Score Range	Description
MINIMAL	0-4	No significant symptoms
MILD	5-9	Mild depressive symptoms
MODERATE	10-14	Moderate symptoms (MDD threshold)
MOD_SEVERE	15-19	Moderately severe symptoms
SEVERE	20-24	Severe depressive symptoms
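
The score ranges in the table above could be mapped to level names roughly as follows. This is a minimal sketch: the function name `severity_from_total` is hypothetical and is not taken from the codebase.

```python
def severity_from_total(total: int) -> str:
    """Map a PHQ-8 total score (0-24) to a severity level name."""
    if not 0 <= total <= 24:
        raise ValueError(f"PHQ-8 total must be in 0-24, got {total}")
    # Inclusive upper bound of each band, in ascending order (from the table).
    bands = [
        (4, "MINIMAL"),
        (9, "MILD"),
        (14, "MODERATE"),
        (19, "MOD_SEVERE"),
        (24, "SEVERE"),
    ]
    for upper, level in bands:
        if total <= upper:
            return level
    raise AssertionError("unreachable: total already validated")
```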

DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition)

The standard classification system for mental disorders used by mental health professionals. PHQ-8 items align with DSM-5 criteria for Major Depressive Episode.

N/A Score

The value assigned when the model cannot determine a PHQ-8 item score because the transcript contains insufficient evidence. N/A scores contribute 0 to the total score.
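
A small sketch of that rule, assuming N/A is represented as `None` (an illustrative choice; the codebase may use a different representation). The function name `total_score` is hypothetical.

```python
from typing import Optional


def total_score(item_scores: dict[str, Optional[int]]) -> int:
    """Sum PHQ-8 item scores, counting N/A (None) items as 0."""
    return sum(score if score is not None else 0 for score in item_scores.values())


# Example: two items scored, one item left as N/A.
example = {"NoInterest": 2, "Depressed": 1, "Sleep": None}
print(total_score(example))  # prints 3
```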

Dataset Terms

DAIC-WOZ (Distress Analysis Interview Corpus - Wizard of Oz)

A multimodal dataset of clinical interviews for depression detection research. Contains semi-structured interviews with 189 participants, conducted by an animated virtual interviewer named Ellie.

Key facts:
- Requires EULA from USC ICT for access
- 142 labeled participants (train + dev), 47 unlabeled (test)
- Participant IDs range 300-492 (with gaps)
- Interview duration: 5-25 minutes

AVEC (Audio/Visual Emotion Challenge)

Annual challenge series for affective computing research. DAIC-WOZ was used in AVEC 2016-2019 challenges.

Ellie

The animated virtual interviewer character in DAIC-WOZ. Controlled via Wizard-of-Oz protocol (human operator behind the scenes).

Participant

An individual who completed a DAIC-WOZ interview. Identified by a numeric ID (e.g., 300, 301, 402).

System Terms

Agent

A specialized LLM-powered component that performs a specific task in the pipeline. AI Psychiatrist uses four agents:

Qualitative Assessment Agent: Analyzes social, biological, and risk factors
Judge Agent: Evaluates qualitative assessment quality
Quantitative Assessment Agent: Selectively predicts PHQ-8 item scores (0-3) or abstains (N/A) when transcript evidence is insufficient (see docs/clinical/task-validity.md)
Meta-Review Agent: Integrates all assessments into final severity

Feedback Loop

Iterative refinement process where the Judge Agent evaluates the Qualitative Agent's output and triggers re-generation if any metric scores ≤ 3 (out of 5). Runs up to 10 iterations per the paper.
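
The loop described above might be sketched as follows. All names here (`run_feedback_loop`, `assess`, `judge`, `refine`) are hypothetical stand-ins for the agents, and the sketch assumes one iteration means one Judge evaluation followed by at most one re-generation.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_ITERATIONS = 10       # upper bound stated in the glossary entry
LOW_SCORE_THRESHOLD = 3   # any metric at or below this triggers re-generation


def needs_refinement(scores: dict[str, int]) -> bool:
    """True if any metric scored at or below the threshold."""
    return any(score <= LOW_SCORE_THRESHOLD for score in scores.values())


def run_feedback_loop(
    assess: Callable[[], T],
    judge: Callable[[T], dict[str, int]],
    refine: Callable[[T, dict[str, int]], T],
    max_iterations: int = MAX_ITERATIONS,
) -> T:
    """Generate an assessment, then refine it until the judge is satisfied."""
    assessment = assess()
    for _ in range(max_iterations):
        scores = judge(assessment)
        if not needs_refinement(scores):
            break  # every metric scored 4 or 5
        assessment = refine(assessment, scores)
    return assessment
```

Passing the agents in as callables keeps the loop testable with simple stubs instead of real LLM calls.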

Evaluation Metrics

Four metrics used by the Judge Agent to evaluate qualitative assessments (1-5 Likert scale):

Metric	Description
Coherence	Logical consistency of the assessment
Completeness	Coverage of all relevant symptoms and frequencies
Specificity	Avoidance of vague or generic statements
Accuracy	Alignment with PHQ-8/DSM-5 criteria

Zero-Shot vs Few-Shot

Zero-shot: The LLM receives only the transcript and prompt, with no reference examples.

Few-shot: The LLM receives similar transcript chunks from a reference database along with their PHQ-8 scores. The paper reports a large MAE improvement vs zero-shot, but reproduction results can vary by model/backend and retrieval configuration; see docs/results/reproduction-results.md.

Embeddings

Vector representations of transcript chunks used for similarity search in few-shot retrieval. Generated by the qwen3-embedding:8b model (4096 dimensions).

Reference Store

Pre-computed database of embeddings from training set transcripts with known PHQ-8 scores. Used to find similar examples for few-shot prompting.
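
A toy sketch of the lookup, assuming cosine similarity as the similarity measure and a plain list of `(text, embedding)` pairs as the store. Both are assumptions made for illustration; the function names are hypothetical, and real embeddings would have 4096 dimensions rather than 2.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float],
    store: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[tuple[float, str]]:
    """Return the k most similar reference chunks as (similarity, text) pairs."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```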

Chunk

A segment of transcript text used for embedding generation. Appendix D hyperparameters: 8 lines with a 2-line sliding window step.
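
The sliding window can be sketched as below. The function name `chunk_lines` is hypothetical, and the handling of the tail (adding one final window so the last lines are not dropped) is an assumption rather than something the glossary specifies.

```python
def chunk_lines(lines: list[str], chunk_size: int = 8, step: int = 2) -> list[list[str]]:
    """Split transcript lines into overlapping windows of chunk_size lines."""
    if chunk_size < 1 or step < 1:
        raise ValueError("chunk_size and step must be positive")
    if not lines:
        return []
    # Last start index at which a full window still fits (0 for short inputs).
    last_start = max(len(lines) - chunk_size, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # assumed tail handling: cover the final lines
    return [lines[start:start + chunk_size] for start in starts]
```

With the defaults, a 12-line transcript yields three chunks starting at lines 0, 2, and 4.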

Architecture Terms

Clean Architecture

Software design pattern with concentric layers:
- Domain: Business entities and logic (innermost)
- Use Cases/Services: Application-specific business rules
- Adapters: Interface implementations (API, CLI)
- Infrastructure: External concerns (LLM, database, logging)

Protocol

Python typing construct (typing.Protocol, similar to an interface) that defines the methods an implementation is expected to provide. Used for dependency injection and testability.
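
A short illustration of how a Protocol supports dependency injection and testing. The names `ChatClient`, `FakeChatClient`, and `summarize` are hypothetical and are not taken from the codebase.

```python
from typing import Protocol


class ChatClient(Protocol):
    """Anything with a matching chat() method satisfies this protocol."""

    def chat(self, prompt: str) -> str: ...


class FakeChatClient:
    """Test double; note that it does not inherit from ChatClient."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def chat(self, prompt: str) -> str:
        return self.reply


def summarize(client: ChatClient, transcript: str) -> str:
    """Depends only on the protocol, so any conforming client can be injected."""
    return client.chat(f"Summarize: {transcript}")


print(summarize(FakeChatClient("ok"), "hello"))  # prints ok
```

Because matching is structural, a real LLM client and a fake can be swapped without either one subclassing a shared base.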

Entity

Mutable domain object with identity (UUID). Examples: Transcript, PHQ8Assessment, MetaReview.

Value Object

Immutable domain object without identity. Equal if all attributes equal. Examples: ItemAssessment, EvaluationScore, SimilarityMatch.
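
The difference between the two kinds of object can be sketched with dataclasses. `Score` and `Note` are hypothetical examples, not the actual classes in the codebase.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Score:
    """Value object: immutable, compared by attributes."""

    metric: str
    value: int


@dataclass(eq=False)
class Note:
    """Entity: mutable, compared by identity (its UUID)."""

    text: str
    id: UUID = field(default_factory=uuid4)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Note) and self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)


# Two value objects with the same attributes are equal.
assert Score("coherence", 4) == Score("coherence", 4)
# Two entities with the same attributes but different IDs are not.
assert Note("draft") != Note("draft")
```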

Configuration Terms

Ollama

Open-source platform for running LLMs locally. Default chat backend (LLM_BACKEND=ollama) and optional embedding backend (EMBEDDING_BACKEND=ollama).

Model Tags

Model identifiers depend on the backend.

The Ollama backend uses identifiers in the format name:variant:
- gemma3:27b - Gemma 3 27B (paper baseline)
- qwen3-embedding:8b - Qwen 3 8B embedding model

HuggingFace backend uses official model IDs (e.g. google/medgemma-27b-text-it). The codebase also supports a canonical alias medgemma:27b, but there is no official MedGemma model in the Ollama library; any Ollama “medgemma” is a community conversion and may behave differently.

Pydantic Settings

Configuration management using Pydantic. Settings are loaded from:
1. Default values in code
2. .env file
3. Environment variables (highest priority)
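
The precedence order can be illustrated with plain dictionaries. This sketch only models the ordering; it is not how Pydantic implements settings loading, and the function name `resolve_settings` and the value `huggingface` are assumptions made for the example.

```python
from typing import Mapping


def resolve_settings(
    defaults: Mapping[str, str],
    dotenv: Mapping[str, str],
    environ: Mapping[str, str],
) -> dict[str, str]:
    """Merge sources so that later ones override earlier ones."""
    resolved = dict(defaults)
    for source in (dotenv, environ):  # environment variables are applied last
        for key in defaults:          # ignore keys that are not known settings
            if key in source:
                resolved[key] = source[key]
    return resolved


settings = resolve_settings(
    defaults={"LLM_BACKEND": "ollama"},
    dotenv={"LLM_BACKEND": "huggingface"},  # assumed value, for illustration
    environ={},
)
print(settings["LLM_BACKEND"])  # prints huggingface
```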

Metric Terms

MAE (Mean Absolute Error)

Average absolute difference between predicted and actual PHQ-8 item scores. Lower is better.

Paper results:
- Zero-shot: 0.796 MAE
- Few-shot: 0.619 MAE (22% lower item-level MAE vs zero-shot)
- MedGemma few-shot: 0.505 MAE (18% lower item-level MAE vs Gemma; Appendix F, with lower coverage)
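
A minimal sketch of an item-level MAE calculation. It assumes that N/A predictions (represented here as `None`) are excluded from the error and reported separately as coverage; that treatment and the function name `item_mae` are assumptions, not a description of the paper's exact procedure.

```python
from typing import Optional, Sequence


def item_mae(
    predicted: Sequence[Optional[int]],
    actual: Sequence[int],
) -> tuple[float, float]:
    """Return (MAE over scored items, fraction of items that were scored)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    errors = [abs(p - a) for p, a in zip(predicted, actual) if p is not None]
    if not errors:
        raise ValueError("no scored items to evaluate")
    coverage = len(errors) / len(predicted)
    return sum(errors) / len(errors), coverage


mae, coverage = item_mae([1, None, 3, 0], [0, 2, 3, 2])
print(mae, coverage)  # prints 1.0 0.75
```

Reporting coverage next to MAE matters because a model that abstains more often is evaluated on fewer items.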

Accuracy

Percentage of correct severity level predictions.

Paper results:
- Meta-Review: 78% severity accuracy
- Comparable to human expert performance

Likert Scale

Rating scale used for Judge Agent metrics. 1-5 scale where:
- 1-2 = Poor
- 3 = Marginal
- 4-5 = Acceptable

File Format Terms

TSV (Tab-Separated Values)

Text file format using tabs as delimiters. DAIC-WOZ transcripts use TSV format.
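
Reading such a file with the standard library might look like this. The column names (`start_time`, `stop_time`, `speaker`, `value`) and the sample rows are assumptions for illustration, and `read_transcript` is a hypothetical helper.

```python
import csv
import io


def read_transcript(tsv_text: str) -> list[dict[str, str]]:
    """Parse tab-separated transcript text into one dict per utterance."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in speech as plain text
    )
    return list(reader)


sample = (
    "start_time\tstop_time\tspeaker\tvalue\n"
    "0.0\t1.5\tEllie\thi i'm ellie\n"
    "1.6\t2.0\tParticipant\thello\n"
)
rows = read_transcript(sample)
print(rows[0]["speaker"])  # prints Ellie
```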

NPZ (NumPy Compressed Archive)

Binary format for storing multiple NumPy arrays. Used for pre-computed reference embeddings.

JSON Sidecar

Companion JSON file stored alongside an .npz artifact.

In this repo:
- {name}.json contains chunk texts aligned with the NPZ rows per participant.
- {name}.meta.json (optional) contains provenance/validation metadata (backend, model, dimension, chunking).