Pipeline

This document explains how the four-agent pipeline assesses depression from clinical interview transcripts.

Overview

The AI Psychiatrist pipeline processes a transcript through four specialized agents, with an iterative refinement loop that improves the qualitative assessment before scoring:

Transcript → Qualitative → [Judge ↔ Refinement] → Quantitative → Meta-Review → Severity

Each agent serves a specific purpose, and their outputs feed into subsequent stages.

Pipeline Stages

Stage 1: Qualitative Assessment

Agent: QualitativeAssessmentAgent
Model: Gemma 3 27B (default)
Paper Reference: Section 2.3.1

The qualitative agent analyzes the transcript to identify clinical factors across four domains:

Domain	Description	Example Findings
PHQ-8 Symptoms	Symptom presence and frequency	"Reports low energy nearly every day"
Social Factors	Relationships, support systems	"Limited social support, lives alone"
Biological Factors	Medical history, family history	"Family history of depression"
Risk Factors	Stressors, warning signs	"Recent job loss, financial stress"

Output: QualitativeAssessment entity with structured sections and supporting quotes.

Prompt Structure:

System: You are a clinical psychologist analyzing interview transcripts...
User: <transcript>
{transcript_text}
</transcript>

Please analyze this interview and provide:
1. Overall assessment
2. PHQ-8 symptom analysis with frequencies
3. Social factors
4. Biological factors
5. Risk factors

Stage 2: Judge Evaluation

Agent: JudgeAgent
Model: Gemma 3 27B (default, temperature=0.0)
Paper Reference: Section 2.3.1, Appendix B

The judge agent evaluates the qualitative assessment on four quality metrics:

Metric	Description	Scoring Guide
Coherence	Logical consistency	5=No contradictions, 1=Major logical errors
Completeness	Symptom coverage	5=All symptoms addressed, 1=Major gaps
Specificity	Concrete vs vague	5=Specific quotes/frequencies, 1=Generic statements
Accuracy	PHQ-8/DSM-5 alignment	5=Clinically correct, 1=Major misinterpretations

Scoring: 1-5 Likert scale per metric

Decision Logic:
- If ALL scores ≥ 4: Assessment is acceptable, proceed to quantitative
- If ANY score ≤ 3: Trigger refinement loop
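The decision rule above can be sketched in a few lines. This is a minimal illustration, not the project's JudgeAgent code: the function names and the plain-dict score format are assumptions made for the example.

```python
# Hypothetical sketch of the judge decision rule; names are illustrative only.
METRICS = ("coherence", "completeness", "specificity", "accuracy")


def low_scoring_metrics(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return the metrics whose 1-5 Likert score is at or below the threshold."""
    return [metric for metric in METRICS if scores[metric] <= threshold]


def needs_refinement(scores: dict[str, int], threshold: int = 3) -> bool:
    """True if ANY metric scored <= threshold, i.e. not ALL scores are >= 4."""
    return bool(low_scoring_metrics(scores, threshold))
```

With scores of 4, 3, 4, 4 for coherence, completeness, specificity, and accuracy, only completeness is flagged, so the assessment would go to refinement.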

Stage 3: Feedback Loop

Service: FeedbackLoopService
Paper Reference: Section 2.3.1

When judge scores are below threshold, the feedback loop refines the assessment:

┌─────────────────────────────────────────────────────────────┐
│                     FEEDBACK LOOP                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────┐                                       │
│   │   Qualitative   │◄──────────────────────────────┐       │
│   │     Agent       │                               │       │
│   └────────┬────────┘                               │       │
│            │                                        │       │
│            ▼                                        │       │
│   ┌─────────────────┐    Low scores?    ┌───────────┴──────┐│
│   │   Judge Agent   │─────Yes──────────►│ Extract Feedback ││
│   │  (Evaluate)     │                   │ for low metrics  ││
│   └────────┬────────┘                   └──────────────────┘│
│            │                                                │
│            │ All scores ≥ 4?                                │
│            │ OR max iterations?                             │
│            │                                                │
│            ▼ Yes                                            │
│   ┌─────────────────┐                                       │
│   │     EXIT        │                                       │
│   │  (Proceed to    │                                       │
│   │  Quantitative)  │                                       │
│   └─────────────────┘                                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Configuration:
- max_iterations: 10 (paper Section 2.3.1)
- score_threshold: 3 (scores ≤ 3 trigger refinement)
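As a rough sketch of how these two settings bound the loop, the control flow can be written as below. The `assess` and `judge` callables are hypothetical stand-ins for the qualitative and judge agents, and the feedback string format is invented for the example. This is not the FeedbackLoopService implementation.

```python
from typing import Callable, Optional

Scores = dict[str, int]


def run_feedback_loop(
    assess: Callable[[Optional[str]], str],  # hypothetical: feedback (or None) -> assessment text
    judge: Callable[[str], Scores],          # hypothetical: assessment text -> metric scores
    max_iterations: int = 10,
    score_threshold: int = 3,
) -> tuple[str, Scores, int]:
    """Refine until every score exceeds the threshold or the iteration cap is hit."""
    assessment = assess(None)  # initial assessment, no feedback yet
    scores = judge(assessment)
    iterations = 0
    while iterations < max_iterations:
        low = {m: s for m, s in scores.items() if s <= score_threshold}
        if not low:
            break  # all scores >= 4: exit to the quantitative stage
        feedback = "; ".join(f"{m}: scored {s}/5" for m, s in low.items())
        assessment = assess(feedback)
        scores = judge(assessment)
        iterations += 1
    return assessment, scores, iterations
```

The loop returns the last assessment even when the cap is reached with low scores remaining, which matches the "OR max iterations?" exit in the diagram.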

Refinement Prompt:

The judge evaluated your assessment and found issues:

Coherence: Scored 2/5. "The assessment contradicts itself..."
Specificity: Scored 3/5. "More specific quotes needed..."

Please revise your assessment addressing these concerns:
<original_assessment>
{previous_assessment}
</original_assessment>

<transcript>
{transcript_text}
</transcript>
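One way to assemble a prompt of this shape from the judge's output is sketched below. The `feedback` structure (metric name mapped to a score and an explanation) and the function name are assumptions for illustration. The project's actual template may differ.

```python
def build_refinement_prompt(
    feedback: dict[str, tuple[int, str]],  # hypothetical: metric -> (score, judge explanation)
    previous_assessment: str,
    transcript_text: str,
) -> str:
    """Render only the low-scoring metrics plus the original assessment and transcript."""
    issues = "\n".join(
        f'{metric.capitalize()}: Scored {score}/5. "{explanation}"'
        for metric, (score, explanation) in feedback.items()
    )
    return (
        "The judge evaluated your assessment and found issues:\n\n"
        f"{issues}\n\n"
        "Please revise your assessment addressing these concerns:\n"
        f"<original_assessment>\n{previous_assessment}\n</original_assessment>\n\n"
        f"<transcript>\n{transcript_text}\n</transcript>"
    )
```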

Paper Results (Figure 2, 142 participants): The paper reports mean ± SD judge scores before and after the feedback loop:

Metric	Before	After
Coherence	4.96 ± 0.20	5.00 ± 0.00
Specificity	4.37 ± 0.62	4.38 ± 0.58
Accuracy	4.33 ± 0.53	4.36 ± 0.48
Completeness	3.61 ± 0.85	3.72 ± 0.61

Stage 4: Quantitative Assessment

Agent: QuantitativeAssessmentAgent
Model: Gemma 3 27B (default)
Paper Reference: Section 2.3.2, Section 2.4.2

The quantitative agent predicts PHQ-8 item scores (0-3) or abstains (N/A) when the transcript lacks sufficient evidence.

PHQ-8 item scores are defined by symptom frequency over the past two weeks, whereas DAIC-WOZ transcripts are not structured as a PHQ-8 administration, so the interview may never state that frequency directly. Treat item scoring as a selective, evidence-limited task; see docs/clinical/task-validity.md.

Evidence Extraction

First, the agent extracts evidence quotes for each PHQ-8 item:

{
  "PHQ8_NoInterest": ["i don't enjoy anything anymore", "nothing seems fun"],
  "PHQ8_Depressed": ["i feel really down most days"],
  "PHQ8_Sleep": ["i can't fall asleep until 3am"],
  ...
}

Few-Shot Reference Retrieval

For each item with evidence:

1. Build per-item evidence text (one string per PHQ-8 item)
2. Embed the evidence text (Spec 37: batch query embedding is default; 1 embedding op per participant)
3. Search the reference store for similar chunks
4. Apply retrieval post-processing (all optional, configured via EmbeddingSettings):
   - Spec 33: similarity threshold + per-item context budget
   - Spec 34: item-tag filtering (requires {emb}.tags.json)
   - Spec 35: chunk-level score attachment (requires {emb}.chunk_scores.json)
   - Spec 36: CRAG reference validation (accept/reject)
5. Format the references into a unified <Reference Examples> block (Spec 31 + Spec 33 XML fix)

See: docs/pipeline-internals/features.md and docs/rag/runtime-features.md.

Query: "i don't enjoy anything anymore, nothing seems fun"
              │
              ▼ Embedding + Similarity Search
┌─────────────────────────────────────────────────┐
│ Reference 1 (similarity: 0.89, score: 2)        │
│ "haven't felt like doing my hobbies lately"     │
├─────────────────────────────────────────────────┤
│ Reference 2 (similarity: 0.85, score: 3)        │
│ "nothing brings me joy anymore"                 │
└─────────────────────────────────────────────────┘
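The search step in the diagram can be illustrated with a small cosine-similarity ranking. This is a sketch under stated assumptions: embeddings are plain lists of floats, the reference store is an in-memory list of `(vector, text, score)` tuples, and `min_similarity` stands in loosely for a similarity threshold. It does not reproduce the project's embedding backend or the Spec 33-36 post-processing.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_references(
    query_vec: list[float],
    store: list[tuple[list[float], str, int]],  # hypothetical: (embedding, chunk text, PHQ-8 item score)
    top_k: int = 2,
    min_similarity: float = 0.0,
) -> list[tuple[float, str, int]]:
    """Return up to top_k (similarity, text, score) tuples, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text, score) for vec, text, score in store]
    kept = [ref for ref in scored if ref[0] >= min_similarity]
    kept.sort(key=lambda ref: ref[0], reverse=True)
    return kept[:top_k]
```

Each returned tuple carries the reference's known item score, which is what lets the references serve as few-shot examples during scoring.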

Scoring

The agent generates scores with reasoning:

{
  "PHQ8_NoInterest": {
    "evidence": "i don't enjoy anything anymore",
    "reason": "Clear anhedonia, consistent with nearly every day",
    "score": 3
  },
  "PHQ8_Sleep": {
    "evidence": "i can't fall asleep until 3am",
    "reason": "Significant sleep onset insomnia",
    "score": 2
  },
  "PHQ8_Appetite": {
    "evidence": "No relevant evidence found",
    "reason": "Transcript does not discuss eating habits",
    "score": "N/A"
  }
}

Output: PHQ8Assessment with all 8 item scores, total score (0-24), and severity level.
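The step from item scores to a total and severity level might look like the sketch below. It assumes N/A items are represented as `None` and are left out of the total (consistent with the worked example in the Complete Pipeline Flow diagram, where six scored items sum to 9), and it uses the standard PHQ-8 cutoffs of 0-4, 5-9, 10-14, 15-19, and 20-24. Whether the project handles N/A this way is an assumption, and the function name is hypothetical.

```python
from typing import Optional

# Standard PHQ-8 severity bands as (inclusive upper bound, label).
SEVERITY_BANDS = [
    (4, "MINIMAL"),
    (9, "MILD"),
    (14, "MODERATE"),
    (19, "MOD_SEVERE"),
    (24, "SEVERE"),
]


def summarize_phq8(item_scores: dict[str, Optional[int]]) -> tuple[int, str, int]:
    """Return (total, severity label, number of scored items); None means N/A."""
    scored = [s for s in item_scores.values() if s is not None]
    if any(s < 0 or s > 3 for s in scored):
        raise ValueError("PHQ-8 item scores must be in the range 0-3")
    total = sum(scored)  # assumption: N/A items contribute nothing to the total
    severity = next(label for upper, label in SEVERITY_BANDS if total <= upper)
    return total, severity, len(scored)
```

Returning the count of scored items alongside the total makes it visible when a low total reflects missing evidence and not an absence of symptoms.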

Paper-reported Results (not a guarantee of reproduction):
- Zero-shot MAE: 0.796
- Few-shot MAE: 0.619 (22% lower item-level MAE vs zero-shot)
- MedGemma few-shot MAE: 0.505 (Appendix F alternative; better MAE but fewer predictions overall)

Note: these MAE values are conditional on non-N/A items. When coverage differs across modes, system-level comparisons should use coverage-aware selective prediction metrics (AURC/AUGRC); see docs/statistics/statistical-methodology-aurc-augrc.md.
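To make "conditional on non-N/A items" concrete, the sketch below computes MAE over answered items only and reports coverage alongside it. The function name and the use of `None` for N/A are assumptions for illustration. It does not compute AURC or AUGRC.

```python
from typing import Optional


def selective_mae(
    predicted: list[Optional[int]], actual: list[int]
) -> tuple[Optional[float], float]:
    """Return (MAE over non-N/A predictions, coverage); None in predicted means N/A."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    coverage = len(pairs) / len(predicted) if predicted else 0.0
    mae = sum(abs(p - a) for p, a in pairs) / len(pairs) if pairs else None
    return mae, coverage
```

A system that abstains on its hardest items can lower its MAE while also lowering coverage, which is why the two numbers need to be read together.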

Stage 5: Meta-Review

Agent: MetaReviewAgent
Model: Gemma 3 27B (default)
Paper Reference: Section 2.3.3

The meta-review agent integrates all previous outputs to determine final severity:

Inputs:
1. Original transcript
2. Qualitative assessment (social, biological, risk factors)
3. Quantitative scores (PHQ-8 item scores)

Output:
- Final severity level (0-4: MINIMAL, MILD, MODERATE, MOD_SEVERE, SEVERE)
- Explanation of determination
- MDD indicator (true if severity ≥ MODERATE)

Prompt Structure:

You are integrating multiple assessments to determine depression severity.

<transcript>
{transcript_text}
</transcript>

<qualitative_assessment>
{qualitative_text}
</qualitative_assessment>

<quantitative_scores>
{phq8_scores}
</quantitative_scores>

Provide:
<severity>0-4</severity>
<explanation>Your integrated reasoning...</explanation>
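A response in this tagged format could be parsed along the following lines. This is a sketch, not the MetaReviewAgent parser: the function name, the returned dictionary shape, and the choice to raise on a missing severity tag are assumptions. The MDD rule (severity ≥ MODERATE, i.e. level 2 or higher) follows the output description above.

```python
import re

SEVERITY_LABELS = ["MINIMAL", "MILD", "MODERATE", "MOD_SEVERE", "SEVERE"]


def parse_meta_review(response: str) -> dict:
    """Extract the severity level and explanation from a tagged model response."""
    sev = re.search(r"<severity>\s*([0-4])\s*</severity>", response)
    if sev is None:
        raise ValueError("response has no <severity> tag with a value in 0-4")
    exp = re.search(r"<explanation>(.*?)</explanation>", response, re.DOTALL)
    level = int(sev.group(1))
    return {
        "severity": level,
        "label": SEVERITY_LABELS[level],
        "is_mdd": level >= SEVERITY_LABELS.index("MODERATE"),
        "explanation": exp.group(1).strip() if exp else "",
    }
```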

Paper Results: 78% accuracy on severity prediction, comparable to human experts.

Complete Pipeline Flow

┌────────────────────────────────────────────────────────────────────────┐
│                        COMPLETE PIPELINE                               │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  INPUT                                                                 │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ Transcript: "Ellie: How are you? Participant: I feel down..."     │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                         │
│                              ▼                                         │
│  QUALITATIVE (Gemma 3 27B)                                             │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ Overall: Participant shows signs of depression...                 │ │
│  │ PHQ-8: Anhedonia (several days), low mood (most days)...          │ │
│  │ Social: Limited support network...                                │ │
│  │ Biological: No family history mentioned...                        │ │
│  │ Risk: Recent stressors...                                         │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                         │
│                              ▼                                         │
│  JUDGE (Gemma 3 27B, temp=0)                                           │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ Coherence: 4/5  |  Completeness: 3/5  |  Specificity: 4/5  |      │ │
│  │ Accuracy: 4/5   |  → Completeness low, trigger refinement         │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                         │
│                              ▼                                         │
│  FEEDBACK LOOP (1 iteration)                                           │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ Refined assessment with better completeness...                    │ │
│  │ Judge re-evaluation: All scores ≥ 4 ✓                             │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                         │
│                              ▼                                         │
│  QUANTITATIVE (Gemma 3 27B)                                            │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ PHQ8_NoInterest: 2  |  PHQ8_Depressed: 2  |  PHQ8_Sleep: 1        │ │
│  │ PHQ8_Tired: 2       |  PHQ8_Appetite: N/A |  PHQ8_Failure: 1      │ │
│  │ PHQ8_Concentrating: 1  |  PHQ8_Moving: N/A                        │ │
│  │ Total: 9 → MILD severity                                          │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                         │
│                              ▼                                         │
│  META-REVIEW (Gemma 3 27B)                                             │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ Severity: 1 (MILD)                                                │ │
│  │ Explanation: While the participant reports several symptoms,      │ │
│  │ their frequency is mostly "several days" rather than daily.       │ │
│  │ The qualitative assessment notes limited but present coping...    │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                         │
│                              ▼                                         │
│  OUTPUT                                                                │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │ FullAssessment {                                                  │ │
│  │   severity: MILD                                                  │ │
│  │   is_mdd: false                                                   │ │
│  │   phq8_total: 9                                                   │ │
│  │   ...                                                             │ │
│  │ }                                                                 │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Timing

The paper reports the full pipeline runs in ~1 minute on a MacBook Pro with an Apple M3 Pro chipset (Section 2.3.5 / Discussion). Real-world timing varies significantly with:

- Backend (Ollama vs HuggingFace)
- Model quantization and device (CPU/GPU)
- Whether the feedback loop triggers refinements

Note: The paper text emphasizes consumer hardware (M3 Pro / no GPU requirement), but the public repo also includes SLURM scripts configured for A100 GPUs (_reference/slurm/job_ollama.sh). We cannot determine what hardware/precision produced the reported metrics from the paper text alone.

For local reproduction runtime measurements, see docs/results/reproduction-results.md.

Configuration Impact

Setting	Effect on Pipeline
`FEEDBACK_ENABLED=false`	Skip refinement loop entirely
`FEEDBACK_MAX_ITERATIONS=5`	Cap refinement attempts
`EMBEDDING_TOP_K_REFERENCES=4`	More reference examples per item
`LLM_BACKEND=huggingface` + `MODEL_QUANTITATIVE_MODEL=medgemma:27b`	Use Appendix F alternative (official weights via HuggingFace; may reduce prediction availability)
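For orientation, environment variables like those in the table could be read as in the sketch below. The `PipelineSettings` class and `load_settings` function are hypothetical, not the project's settings module. The default of 10 iterations comes from the feedback loop configuration described earlier. The other defaults are placeholders chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSettings:
    feedback_enabled: bool
    feedback_max_iterations: int
    embedding_top_k_references: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Read pipeline settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    enabled = env.get("FEEDBACK_ENABLED", "true").strip().lower()  # placeholder default
    return PipelineSettings(
        feedback_enabled=enabled in {"1", "true", "yes"},
        feedback_max_iterations=int(env.get("FEEDBACK_MAX_ITERATIONS", "10")),
        # Placeholder default; the project's real default is not stated in this document.
        embedding_top_k_references=int(env.get("EMBEDDING_TOP_K_REFERENCES", "2")),
    )
```

Passing a plain dictionary in place of the process environment keeps the function easy to exercise in tests.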