AI Psychiatrist Documentation

LLM-based Multi-Agent System for Depression Assessment from Clinical Interviews


What is AI Psychiatrist?

AI Psychiatrist is an engineering-focused, reproducible implementation of a research paper that uses large language models (LLMs) in a multi-agent architecture to assess depression severity from clinical interview transcripts. Using a four-agent pipeline, the system analyzes each transcript and predicts PHQ-8 item scores (0–3) only when the transcript provides supporting evidence, abstaining (N/A) otherwise.

Clinical disclaimer: This repository is intended for paper reproduction and experimentation. It is not a medical device and should not be used for clinical diagnosis or treatment decisions.

Task validity note: PHQ-8 is a two-week frequency self-report instrument, while DAIC-WOZ transcripts are not structured as a PHQ administration. Transcript-only item-level scoring is often underdetermined; the system may return N/A and must be evaluated with coverage-aware metrics (AURC/AUGRC). See: Task Validity.
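
The coverage-aware idea can be sketched with a toy AURC computation. This is an illustrative implementation of the general risk-coverage recipe, not the repository's evaluation code: predictions are sorted by confidence, and at each coverage level the error rate ("risk") among the retained predictions is measured; AURC is the average risk across coverage levels. The `aurc` function and its inputs are hypothetical names for illustration.

```python
def aurc(confidences: list[float], correct: list[bool]) -> float:
    """Area under the risk-coverage curve (discrete approximation).

    Sort predictions from most to least confident, then at each
    coverage level n/N compute the error rate among the n predictions
    kept so far. Abstentions (N/A) would simply be absent from the lists.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors, risks = 0, []
    for n, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / n)  # risk at coverage n/N
    return sum(risks) / len(risks)  # mean risk ~ area under the curve

# Toy example: four predictions, only the least confident one is wrong,
# so risk stays at 0 until full coverage.
print(aurc([0.9, 0.8, 0.7, 0.2], [True, True, True, False]))  # -> 0.0625
```

A lower AURC means the model's confidence ordering concentrates errors at low coverage, which is exactly what selective prediction rewards.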

Key Features

  • Four-Agent Pipeline: Qualitative, Judge, Quantitative, and Meta-Review agents collaborate for comprehensive assessment
  • Embedding-Based Few-Shot Retrieval: Optional few-shot references; retrieval quality is controlled by guardrails, item-tag filtering, chunk-level score attachment, and CRAG validation (see results docs)
  • Iterative Self-Refinement: Judge agent feedback loop improves assessment quality
  • Selective Prediction Evaluation: AURC/AUGRC + bootstrap confidence intervals (coverage-aware evaluation)
  • Engineering-Focused Architecture: Clean architecture, type safety, structured logging, and comprehensive testing
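
The self-refinement loop from the feature list can be sketched as follows. This is a minimal illustration of the control flow (judge score gates acceptance, capped at 10 iterations per the paper); the function names and the toy judge/refine callables are hypothetical, not the repository's API.

```python
JUDGE_THRESHOLD = 4   # accept assessments scored >= 4 on the 1-5 scale
MAX_ITERATIONS = 10   # iteration cap taken from the paper

def refine_until_accepted(draft, judge, refine):
    """Loop the draft through the judge, refining while the score is low."""
    for _ in range(MAX_ITERATIONS):
        score = judge(draft)
        if score >= JUDGE_THRESHOLD:
            break
        draft = refine(draft, score)  # incorporate judge feedback
    return draft

# Toy stand-ins: the "judge" scores by length, "refinement" appends detail.
result = refine_until_accepted(
    "short",
    judge=lambda text: 5 if len(text) > 10 else 2,
    refine=lambda text, score: text + " + detail",
)
print(result)  # -> short + detail
```

The key design point is that the loop always terminates: either the judge accepts, or the iteration cap is hit and the last draft is used as-is.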

Paper Reference

Greene et al., "AI Psychiatrist Assistant: An LLM-based Multi-Agent System for Depression Assessment from Clinical Interviews." OpenReview.


Quick Navigation

Getting Started

  • Quickstart: Get running in 5 minutes
  • Zero-Shot Preflight: Pre-run verification for zero-shot reproduction
  • Few-Shot Preflight: Pre-run verification for few-shot reproduction

Architecture

  • Architecture: System layers and design patterns
  • Pipeline: How the four-agent pipeline works
  • Future Architecture: LangGraph integration roadmap

Clinical Domain

  • PHQ-8: Understanding PHQ-8 depression assessment
  • Task Validity: What can/cannot be inferred from transcripts
  • Clinical Understanding: How the system works clinically
  • Glossary: Terms and definitions

Configuration

  • Configuration Reference: All configuration options
  • Configuration Philosophy: Why defaults are what they are
  • Agent Sampling Registry: Sampling parameters per agent

Models

  • Model Registry: Supported models and backends
  • Model Wiring: How agents connect to models

RAG (Few-Shot Retrieval)

  • RAG Overview: Core embedding + retrieval concepts (plain language)
  • Design Rationale: Why few-shot is built this way, known limitations
  • Artifact Generation: Embeddings + item tags (Specs 34, 40)
  • Chunk Scoring: Chunk-level PHQ-8 scoring (Spec 35)
  • Runtime Features: Prompt format, CRAG validation, batch embedding (Specs 36, 37)
  • Debugging: Interpret retrieval logs, troubleshoot issues
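
The core retrieval concept behind few-shot referencing can be sketched in a few lines: embed the query, rank stored chunks by cosine similarity, and keep the top k. This is a conceptual illustration only; the corpus layout, field names, and `top_k` helper are hypothetical, and the real system adds guardrails, item-tag filtering, and CRAG validation on top.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True)
    return [item["chunk"] for item in ranked[:k]]

# Toy 2-d "embeddings" standing in for real model outputs.
corpus = [
    {"chunk": "sleep", "embedding": [1.0, 0.0]},
    {"chunk": "appetite", "embedding": [0.0, 1.0]},
    {"chunk": "fatigue", "embedding": [0.9, 0.1]},
]
print(top_k([1.0, 0.0], corpus, k=2))  # -> ['sleep', 'fatigue']
```

Retrieved chunks (with their attached chunk-level scores) are then formatted into the quantitative agent's prompt as few-shot references.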

Data

  • DAIC-WOZ Schema: Dataset schema for development without data access
  • DAIC-WOZ Preprocessing: Transcript cleaning, participant-only variants, ground truth integrity
  • Data Splits Overview: AVEC2017 vs paper splits + exact participant IDs
  • Artifact Namespace Registry: Embedding artifact naming conventions

Pipeline Internals

  • Feature Reference: Implemented features + defaults
  • Evidence Extraction: How quotes are extracted from transcripts

Statistics & Evaluation

  • Metrics and Evaluation: Exact metric definitions
  • Coverage Explained: What coverage means and why it matters
  • AURC/AUGRC Methodology: Selective prediction metrics

Results & Reproduction

  • Run History: Canonical history of reproduction runs
  • Reproduction Results: Current reproduction status
  • Run Output Schema: Output JSON format

Developer Reference

  • API Endpoints: REST API reference
  • Testing: Markers, fixtures, and test-doubles policy
  • Error Handling: Exception handling patterns
  • Exceptions: Exception class hierarchy
  • Dependency Registry: Third-party dependencies

Archive

  • Spec 20: Keyword Fallback — deferred; intentionally not implementing

System Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                         AI PSYCHIATRIST PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌──────────────┐    ┌─────────────────────────────────────────────┐   │
│   │  TRANSCRIPT  │───►│              QUALITATIVE AGENT              │   │
│   │   (Input)    │    │  Analyzes social, biological, risk factors  │   │
│   └──────────────┘    └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                       ┌─────────────────────────────────────────────┐   │
│                       │                JUDGE AGENT                  │   │
│                       │  Evaluates coherence, completeness,         │   │
│           ┌──────────►│  specificity, accuracy (1-5 scale)          │   │
│           │           └──────────────────────┬──────────────────────┘   │
│           │                                  │                          │
│           │           ┌──────────────────────▼──────────────────────┐   │
│           │           │            FEEDBACK LOOP SERVICE            │   │
│           └───────────┤  If score < 4: refine and re-evaluate       │   │
│                       │  Max 10 iterations per paper                │   │
│                       └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│   ┌──────────────┐    ┌──────────────────────▼──────────────────────┐   │
│   │  EMBEDDINGS  │───►│            QUANTITATIVE AGENT               │   │
│   │ (Few-Shot)   │    │  Predicts PHQ-8 item scores (0-3) or N/A    │   │
│   └──────────────┘    └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                       ┌─────────────────────────────────────────────┐   │
│                       │             META-REVIEW AGENT               │   │
│                       │  Integrates all assessments                 │   │
│                       │  Outputs final severity (0-4)               │   │
│                       └──────────────────────┬──────────────────────┘   │
│                                              │                          │
│                                              ▼                          │
│                       ┌─────────────────────────────────────────────┐   │
│                       │              FINAL ASSESSMENT               │   │
│                       │  Severity: MINIMAL|MILD|MODERATE|           │   │
│                       │            MOD_SEVERE|SEVERE                │   │
│                       └─────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
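
The final severity labels in the diagram correspond to the conventional PHQ-8 total-score bands (the total is the sum of eight 0–3 items, so 0–24 overall). The sketch below shows that standard mapping for orientation; it is not the Meta-Review agent's actual logic, which integrates all agent assessments and must also cope with N/A items.

```python
# Standard published PHQ-8 severity bands over the 0-24 total score.
SEVERITY_BANDS = [
    (0, 4, "MINIMAL"),
    (5, 9, "MILD"),
    (10, 14, "MODERATE"),
    (15, 19, "MOD_SEVERE"),
    (20, 24, "SEVERE"),
]

def severity_label(total_score: int) -> str:
    """Map a PHQ-8 total score to its conventional severity band."""
    for lo, hi, label in SEVERITY_BANDS:
        if lo <= total_score <= hi:
            return label
    raise ValueError(f"PHQ-8 total out of range: {total_score}")

print(severity_label(12))  # -> MODERATE
```

The five labels are also what "final severity (0–4)" refers to: an index into these bands, not a raw questionnaire score.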

Technology Stack

  • Package Management: uv (fast Python dependency management)
  • LLM Backend: Ollama / HuggingFace (optional) (local inference via Ollama; optional Transformers backend for official weights)
  • Framework: FastAPI (REST API server)
  • Validation: Pydantic v2 (configuration and data validation)
  • Logging: structlog (structured JSON logging)
  • Testing: pytest (unit, integration, and E2E tests)
  • Linting: Ruff (fast Python linting and formatting)
  • Types: mypy (static type checking, strict mode)

Project Status

This codebase is an engineering-focused refactor of the original research implementation. Key improvements:

  • Full test coverage (80%+ target)
  • Type hints throughout (mypy strict mode)
  • Clean architecture with dependency injection
  • Structured logging for observability
  • Comprehensive configuration management
  • Local-first deployment (Ollama + FastAPI); containerization TBD

Contributing

See CLAUDE.md in the repository root for development guidelines and commands.

# Quick development setup
make dev          # Install dependencies + pre-commit hooks
make test         # Run all tests with coverage
make ci           # Full CI pipeline (format, lint, typecheck, test)

License

Licensed under Apache 2.0. See LICENSE and NOTICE in the repository root for details and attribution.

This project is a clean-room reimplementation based on research from Georgia State University. See the paper for academic citation.