AI Psychiatrist Documentation

LLM-based Multi-Agent System for Depression Assessment from Clinical Interviews


What is AI Psychiatrist?

AI Psychiatrist is an engineering-focused, reproducible implementation of a research paper that uses large language models (LLMs) in a multi-agent architecture to assess depression severity from clinical interview transcripts. Using a four-agent pipeline, the system analyzes each transcript and predicts PHQ-8 item scores (0–3) only when the transcript provides supporting evidence, abstaining (N/A) otherwise.

Clinical disclaimer: This repository is intended for paper reproduction and experimentation. It is not a medical device and should not be used for clinical diagnosis or treatment decisions.

Task validity note: PHQ-8 is a two-week frequency self-report instrument, while DAIC-WOZ transcripts are not structured as a PHQ administration. Transcript-only item-level scoring is often underdetermined; the system may return N/A and must be evaluated with coverage-aware metrics (AURC/AUGRC). See: Task Validity.
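
The coverage-aware idea can be sketched with a toy AURC computation. This is an illustrative implementation of the general risk-coverage recipe, not the repository's evaluation code: predictions are sorted by confidence, and at each coverage level the error rate ("risk") among the retained predictions is measured; AURC is the average risk across coverage levels. The `aurc` function and its inputs are hypothetical names for illustration.

```python
def aurc(confidences: list[float], correct: list[bool]) -> float:
    """Area under the risk-coverage curve (discrete approximation).

    Sort predictions from most to least confident, then at each
    coverage level n/N compute the error rate among the n predictions
    kept so far. Abstentions (N/A) would simply be absent from the lists.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors, risks = 0, []
    for n, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / n)  # risk at coverage n/N
    return sum(risks) / len(risks)  # mean risk ~ area under the curve

# Toy example: four predictions, only the least confident one is wrong,
# so risk stays at 0 until full coverage.
print(aurc([0.9, 0.8, 0.7, 0.2], [True, True, True, False]))  # -> 0.0625
```

A lower AURC means the model's confidence ordering concentrates errors at low coverage, which is exactly what selective prediction rewards.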

Key Features

  • Four-Agent Pipeline: Qualitative, Judge, Quantitative, and Meta-Review agents collaborate for comprehensive assessment
  • Embedding-Based Few-Shot Retrieval: Optional few-shot references; retrieval quality is controlled by guardrails, item-tag filtering, chunk-level score attachment, and CRAG validation (see results docs)
  • Iterative Self-Refinement: Judge agent feedback loop improves assessment quality
  • Selective Prediction Evaluation: AURC/AUGRC + bootstrap confidence intervals (coverage-aware evaluation)
  • Engineering-Focused Architecture: Clean architecture, type safety, structured logging, and comprehensive testing
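
The self-refinement loop from the feature list can be sketched as follows. This is a minimal illustration of the control flow (judge score gates acceptance, capped at 10 iterations per the paper); the function names and the toy judge/refine callables are hypothetical, not the repository's API.

```python
JUDGE_THRESHOLD = 4   # accept assessments scored >= 4 on the 1-5 scale
MAX_ITERATIONS = 10   # iteration cap taken from the paper

def refine_until_accepted(draft, judge, refine):
    """Loop the draft through the judge, refining while the score is low."""
    for _ in range(MAX_ITERATIONS):
        score = judge(draft)
        if score >= JUDGE_THRESHOLD:
            break
        draft = refine(draft, score)  # incorporate judge feedback
    return draft

# Toy stand-ins: the "judge" scores by length, "refinement" appends detail.
result = refine_until_accepted(
    "short",
    judge=lambda text: 5 if len(text) > 10 else 2,
    refine=lambda text, score: text + " + detail",
)
print(result)  # -> short + detail
```

The key design point is that the loop always terminates: either the judge accepts, or the iteration cap is hit and the last draft is used as-is.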

Paper Reference

Greene et al., "AI Psychiatrist Assistant: An LLM-based Multi-Agent System for Depression Assessment from Clinical Interviews." OpenReview.


Quick Navigation

Getting Started

  • Quickstart: Get running in 5 minutes
  • Zero-Shot Preflight: Pre-run verification for zero-shot reproduction
  • Few-Shot Preflight: Pre-run verification for few-shot reproduction

Architecture

  • Architecture: System layers and design patterns
  • Pipeline: How the four-agent pipeline works
  • Future Architecture: LangGraph integration roadmap

Clinical Domain

  • PHQ-8: Understanding PHQ-8 depression assessment
  • Task Validity: What can/cannot be inferred from transcripts
  • Clinical Understanding: How the system works clinically
  • Glossary: Terms and definitions

Configuration

  • Configuration Reference: All configuration options
  • Configuration Philosophy: Why defaults are what they are
  • Agent Sampling Registry: Sampling parameters per agent

Models

  • Model Registry: Supported models and backends
  • Model Wiring: How agents connect to models

RAG (Few-Shot Retrieval)

  • RAG Overview: Core embedding + retrieval concepts (plain language)
  • Design Rationale: Why few-shot is built this way, known limitations
  • Artifact Generation: Embeddings + item tags (Specs 34, 40)
  • Chunk Scoring: Chunk-level PHQ-8 scoring (Spec 35)
  • Runtime Features: Prompt format, CRAG validation, batch embedding (Specs 36, 37)
  • Debugging: Interpret retrieval logs, troubleshoot issues
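
The core retrieval concept behind few-shot referencing can be sketched in a few lines: embed the query, rank stored chunks by cosine similarity, and keep the top k. This is a conceptual illustration only; the corpus layout, field names, and `top_k` helper are hypothetical, and the real system adds guardrails, item-tag filtering, and CRAG validation on top.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True)
    return [item["chunk"] for item in ranked[:k]]

# Toy 2-d "embeddings" standing in for real model outputs.
corpus = [
    {"chunk": "sleep", "embedding": [1.0, 0.0]},
    {"chunk": "appetite", "embedding": [0.0, 1.0]},
    {"chunk": "fatigue", "embedding": [0.9, 0.1]},
]
print(top_k([1.0, 0.0], corpus, k=2))  # -> ['sleep', 'fatigue']
```

Retrieved chunks (with their attached chunk-level scores) are then formatted into the quantitative agent's prompt as few-shot references.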

Data

  • DAIC-WOZ Schema: Dataset schema for development without data access
  • DAIC-WOZ Preprocessing: Transcript cleaning, participant-only variants, ground truth integrity
  • Data Splits Overview: AVEC2017 vs paper splits + exact participant IDs
  • Artifact Namespace Registry: Embedding artifact naming conventions

Pipeline Internals

  • Feature Reference: Implemented features + defaults
  • Evidence Extraction: How quotes are extracted from transcripts

Statistics & Evaluation

  • Metrics and Evaluation: Exact metric definitions
  • Coverage Explained: What coverage means and why it matters
  • AURC/AUGRC Methodology: Selective prediction metrics

Results & Reproduction

  • Run History: Canonical history of reproduction runs
  • Reproduction Results: Current reproduction status
  • Run Output Schema: Output JSON format

Developer Reference

  • API Endpoints: REST API reference
  • Testing: Markers, fixtures, and test-doubles policy
  • Error Handling: Exception handling patterns
  • Exceptions: Exception class hierarchy
  • Dependency Registry: Third-party dependencies

Archive

  • Spec 20: Keyword Fallback — deferred; intentionally not implementing

System Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                         AI PSYCHIATRIST PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌──────────────┐    ┌─────────────────────────────────────────────┐   │
│   │  TRANSCRIPT  │───►│              QUALITATIVE AGENT              │   │
│   │   (Input)    │    │  Analyzes social, biological, risk factors  │   │
│   └──────────────┘    └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                       ┌─────────────────────────────────────────────┐   │
│                       │                JUDGE AGENT                  │   │
│                       │  Evaluates coherence, completeness,         │   │
│           ┌──────────►│  specificity, accuracy (1-5 scale)          │   │
│           │           └──────────────────────┬──────────────────────┘   │
│           │                                  │                          │
│           │           ┌──────────────────────▼──────────────────────┐   │
│           │           │            FEEDBACK LOOP SERVICE            │   │
│           └───────────┤  If score < 4: refine and re-evaluate       │   │
│                       │  Max 10 iterations per paper                │   │
│                       └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│   ┌──────────────┐    ┌──────────────────────▼──────────────────────┐   │
│   │  EMBEDDINGS  │───►│            QUANTITATIVE AGENT               │   │
│   │ (Few-Shot)   │    │  Predicts PHQ-8 item scores (0-3) or N/A    │   │
│   └──────────────┘    └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                       ┌─────────────────────────────────────────────┐   │
│                       │             META-REVIEW AGENT               │   │
│                       │  Integrates all assessments                 │   │
│                       │  Outputs final severity (0-4)               │   │
│                       └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                       ┌─────────────────────────────────────────────┐   │
│                       │              FINAL ASSESSMENT               │   │
│                       │  Severity: MINIMAL|MILD|MODERATE|           │   │
│                       │            MOD_SEVERE|SEVERE                │   │
│                       └─────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
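
The final severity labels in the diagram correspond to the conventional PHQ-8 total-score bands (the total is the sum of eight 0–3 items, so 0–24 overall). The sketch below shows that standard mapping for orientation; it is not the Meta-Review agent's actual logic, which integrates all agent assessments and must also cope with N/A items.

```python
# Standard published PHQ-8 severity bands over the 0-24 total score.
SEVERITY_BANDS = [
    (0, 4, "MINIMAL"),
    (5, 9, "MILD"),
    (10, 14, "MODERATE"),
    (15, 19, "MOD_SEVERE"),
    (20, 24, "SEVERE"),
]

def severity_label(total_score: int) -> str:
    """Map a PHQ-8 total score to its conventional severity band."""
    for lo, hi, label in SEVERITY_BANDS:
        if lo <= total_score <= hi:
            return label
    raise ValueError(f"PHQ-8 total out of range: {total_score}")

print(severity_label(12))  # -> MODERATE
```

The five labels are also what "final severity (0–4)" refers to: an index into these bands, not a raw questionnaire score.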

Technology Stack

  • Package Management: uv (fast Python dependency management)
  • LLM Backend: Ollama / HuggingFace (optional) (local inference via Ollama; optional Transformers backend for official weights)
  • Framework: FastAPI (REST API server)
  • Validation: Pydantic v2 (configuration and data validation)
  • Logging: structlog (structured JSON logging)
  • Testing: pytest (unit, integration, and E2E tests)
  • Linting: Ruff (fast Python linting and formatting)
  • Types: mypy (static type checking, strict mode)

Project Status

This codebase is an engineering-focused refactor of the original research implementation. Key improvements:

  • Full test coverage (80%+ target)
  • Type hints throughout (mypy strict mode)
  • Clean architecture with dependency injection
  • Structured logging for observability
  • Comprehensive configuration management
  • Local-first deployment (Ollama + FastAPI); containerization TBD

Contributing

See CLAUDE.md in the repository root for development guidelines and commands.

# Quick development setup
make dev          # Install dependencies + pre-commit hooks
make test         # Run all tests with coverage
make ci           # Full CI pipeline (format, lint, typecheck, test)

License

Licensed under Apache 2.0. See LICENSE and NOTICE in the repository root for details and attribution.

This project is a clean-room reimplementation based on research from Georgia State University. See the paper for academic citation.