Preflight Checklist: Zero-Shot Run
Purpose: Comprehensive pre-run verification for zero-shot evaluation runs
Last Updated: 2026-01-04
Related: Few-Shot Checklist | Configuration Reference
Overview
This checklist prevents run failures by verifying ALL known gotchas before running. Use this every time you start a zero-shot run.
Zero-shot mode uses NO reference embeddings — the model scores symptoms from transcript alone.
Validity note: PHQ-8 item scores are defined by 2-week frequency, which is often not explicit in DAIC-WOZ transcripts. Expect `N/A` outputs and coverage well below 100%; evaluate with coverage-aware metrics (AURC/AUGRC). See docs/clinical/task-validity.md.
CRITICAL RUN-MODE NOTE (do not skip):
scripts/reproduce_results.py runs both modes (zero-shot + few-shot) by default. If you intend a true zero-shot run, you must pass --zero-shot-only or the run will attempt few-shot and require embeddings + embedding backend deps.
TL;DR (No-Excuses Preflight)
```bash
make dev
cp .env.example .env
uv run python scripts/reproduce_results.py --split paper-test --dry-run
```
Verify the dry-run header shows:
- Embedding Backend: huggingface (or your intended backend)
- Consistency: ENABLED (n=5, temp=0.2) if you are using .env.example (confidence suite)
Then run zero-shot only:
```bash
uv run python scripts/reproduce_results.py --split paper-test --zero-shot-only
```
Phase 1: Environment Setup
1.1 Dependencies
- [ ] Install all dependencies: `make dev` (NOT `uv sync --dev`)
  - Gotcha (BUG-021): `uv sync --dev` does NOT install `[project.optional-dependencies].dev`
- [ ] Verify installation: `uv run pytest --co -q | head -5` (should show the collected test count)
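If you want a quick spot-check that the dev extras actually landed, a minimal sketch (assumes `ruff` and `pytest` are among the dev dependencies, which the `make lint` / `make test-unit` targets below suggest):

```bash
# Spot-check dev tooling (assumption: ruff and pytest are dev extras)
uv run ruff --version && uv run pytest --version \
  || echo "Dev extras missing - run 'make dev', not 'uv sync --dev' (BUG-021)"
```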
1.2 Configuration File
- [ ] Copy template: `cp .env.example .env`
  - Gotcha (BUG-018b): `.env` OVERRIDES code defaults! Always start fresh.
- [ ] Review the `.env` file manually - open it and verify the active keys:
```bash
grep -E "^[^#]" .env | sort
```
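To see exactly where your `.env` diverges from the template (useful given BUG-018b, since any stray key silently overrides code defaults), a minimal sketch:

```bash
# Show uncommented keys that differ between template and your .env
diff <(grep -E "^[^#]" .env.example | sort) <(grep -E "^[^#]" .env | sort)
# No output = your .env matches the template exactly
```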
1.3 Ollama Status
- [ ] Ollama running: `curl -s http://localhost:11434/api/tags | head`
  - Should return JSON with a model list
- [ ] Required model pulled: `ollama list | grep -E "gemma3:27b|gemma3:27b-it-qat"`
  - If missing:
```bash
# Production-recommended (QAT-quantized, faster):
ollama pull gemma3:27b-it-qat
# Standard Ollama tag (GGUF Q4_K_M):
ollama pull gemma3:27b
```
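If you prefer checking via the API instead of the CLI, a sketch (assumes `jq` is installed; `/api/tags` is the standard Ollama endpoint):

```bash
# List pulled models via the Ollama API (requires jq)
curl -s http://localhost:11434/api/tags | jq -r '.models[].name' | grep gemma3 \
  || echo "No gemma3 model found - pull one first"
```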
Phase 2: Model Configuration (CRITICAL)
2.1 Quantitative Model Selection
Reference: Paper Section 2.2, BUG-018a
- [ ] Verify quantitative model is Gemma3 (NOT MedGemma):
grep "MODEL_QUANTITATIVE_MODEL" .env # Acceptable values: # MODEL_QUANTITATIVE_MODEL=gemma3:27b-it-qat (QAT-optimized, faster inference) # MODEL_QUANTITATIVE_MODEL=gemma3:27b (standard Ollama quantization)
Note on quantization: The paper authors likely used full-precision BF16 weights. Ollama's gemma3:27b uses Q4_K_M quantization; -it-qat adds QAT optimization for faster inference. Both are acceptable for reproduction (neither is true BF16).
Gotcha (BUG-018a): MedGemma produces ALL N/A scores due to being too conservative. Appendix F says it "detected fewer relevant chunks, making fewer predictions overall."
- [ ] Check for MedGemma contamination:
grep -i "medgemma" .env # Should return NOTHING or only commented lines
2.2 Sampling Parameters
Reference: GAP-001b/c, Agent Sampling Registry
- [ ] Temperature is zero (clinical AI best practice):
grep "MODEL_TEMPERATURE" .env # Should show: MODEL_TEMPERATURE=0.0
Note: We use temp=0 for all agents. top_k/top_p are not set (irrelevant at temp=0).
2.3 Pydantic AI (Structured Validation)
Reference: Spec 13 - Enabled by default since 2025-12-26
- [ ] Pydantic AI is enabled (recommended for structured output validation):
grep "PYDANTIC_AI_ENABLED" .env # Should show: PYDANTIC_AI_ENABLED=true (or be absent, as true is the default)
What it does: Adds structured validation + automatic retries (default PYDANTIC_AI_RETRIES=5) for quantitative scoring, judge metrics, and meta-review.
There is no legacy parsing fallback; failures after retries remain failures.
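For reference, the relevant `.env` lines would look like this (a sketch; the names and defaults are those documented above):

```bash
# Pydantic AI structured validation (documented defaults)
PYDANTIC_AI_ENABLED=true
PYDANTIC_AI_RETRIES=5
```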
Phase 3: Quantitative Settings
3.1 N/A Reason Tracking
- [ ] N/A tracking enabled (for debugging):
grep "QUANTITATIVE_TRACK_NA_REASONS" .env # Should show: QUANTITATIVE_TRACK_NA_REASONS=true
Phase 4: Data Integrity
4.1 Transcripts Present
- [ ] Transcripts directory exists:
```bash
ls data/transcripts_participant_only/ | wc -l
# Should show ~189 (or your participant count)
```
Note: This repo recommends `DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only` for runs. If you are using raw transcripts, adjust the path accordingly.
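Counting entries only proves the directory is populated; to confirm each participant folder actually contains its transcript, a sketch (assumes the `<ID>_P/<ID>_TRANSCRIPT.csv` layout used throughout this checklist):

```bash
# Flag participant folders missing their transcript CSV
for d in data/transcripts_participant_only/*_P; do
  id=$(basename "$d" _P)
  [ -f "$d/${id}_TRANSCRIPT.csv" ] || echo "MISSING: $d"
done
```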
4.2 Participant 487 Validation
Reference: BUG-003, BUG-022
- [ ] Participant 487 is NOT corrupted:
```bash
file data/transcripts_participant_only/487_P/487_TRANSCRIPT.csv
# MUST show: ASCII text, or UTF-8 Unicode text
# NOT: AppleDouble encoded, or binary
```
- [ ] Correct file size (~20KB, not 4KB):
```bash
ls -lh data/transcripts_participant_only/487_P/487_TRANSCRIPT.csv
# Should be ~18-25KB, NOT 4KB
```
Gotcha (BUG-003): macOS ZIP extraction can extract AppleDouble resource forks instead of real files.
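To scan the whole dataset for the same corruption rather than just participant 487, a minimal sketch using the same `file` heuristic:

```bash
# Any transcript NOT detected as text is suspect (AppleDouble/binary)
find data/transcripts_participant_only -name "*TRANSCRIPT.csv" -exec file {} + \
  | grep -v "text" || echo "OK: all transcripts look like plain text"
```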
4.3 Ground Truth Labels
- [ ] AVEC2017 labels exist (for item-level MAE):
```bash
ls data/*_split_Depression_AVEC2017.csv
# Should show: dev_split, train_split, test_split files
```
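A quick row-count sanity check (a sketch; the expected counts assume the standard DAIC-WOZ AVEC2017 partition of 107 train / 35 dev / 47 test, which matches the ~189 transcripts and 47-participant test figures quoted elsewhere in this checklist):

```bash
wc -l data/*_split_Depression_AVEC2017.csv
# Assumed totals (each file includes a header row):
# train ~108, dev ~36, test ~48 lines
```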
Phase 5: Timeout Configuration
5.1 Timeout Setting
Reference: BUG-018e, BUG-027
- [ ] Set generous timeout for GPU-safe operation:
```bash
grep -E "^(OLLAMA_TIMEOUT_SECONDS|PYDANTIC_AI_TIMEOUT_SECONDS)=" .env
# Recommended: 600 (10 min, safe default)
# For slow GPU: 3600 (1 hour, research runs)
```
Gotcha (BUG-027): the Pydantic AI timeout is configurable via `PYDANTIC_AI_TIMEOUT_SECONDS`. If you set only one of `OLLAMA_TIMEOUT_SECONDS` and `PYDANTIC_AI_TIMEOUT_SECONDS`, Settings syncs the other; if you set both, keep them equal to avoid fallback timeouts.
Gotcha: 6/47 participants (13%) timed out on first run with 300s. Large transcripts (~24KB+) need 600s+ (often 3600s on slow GPUs).
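Concretely, a safe `.env` configuration per BUG-027 (a sketch; keep the two values equal, as noted above):

```bash
# Keep both timeouts equal so neither layer falls back early
OLLAMA_TIMEOUT_SECONDS=600
PYDANTIC_AI_TIMEOUT_SECONDS=600
# Slow GPU / research runs: raise both to 3600
```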
5.2 Check for Long Transcripts
- [ ] Identify large transcripts that may time out:
```bash
find data/transcripts -name "*TRANSCRIPT.csv" -exec wc -c {} + | sort -n | tail -10
# Note any files > 25KB - these may need extra timeout
```
Phase 6: Paper Split (If Using Paper Methodology)
6.1 Create or Verify Splits
- [ ] Paper splits exist OR will be created:
```bash
ls data/paper_splits/
# If empty or missing:
uv run python scripts/create_paper_split.py --verify
```
6.2 Verify Split Sizes
Reference: Paper Section 2.4.1
- [ ] Split sizes match paper (58/43/41):
```bash
wc -l data/paper_splits/paper_split_*.csv
# Should show: 59 train (58+header), 44 val, 42 test
```
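Or, subtracting headers so the numbers line up with the paper's 58/43/41 directly, a minimal sketch:

```bash
# Participant counts per split (header row subtracted)
for f in data/paper_splits/paper_split_*.csv; do
  echo "$f: $(($(wc -l < "$f") - 1))"
done
# Expected: 58 / 43 / 41
```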
Phase 7: Pre-Run Verification
7.1 Quick Sanity Check
- [ ] Run linter: `make lint`
- [ ] Run type checker: `make typecheck`
- [ ] Run unit tests: `make test-unit`
7.2 Configuration Summary Check
Run this to dump your effective configuration:
```bash
uv run python -c "
from ai_psychiatrist.config import get_settings
s = get_settings()
print('=== CRITICAL SETTINGS ===')
print(f'Quantitative Model: {s.model.quantitative_model}')
print(f'Temperature: {s.model.temperature}')
print(f'Timeout: {s.ollama.timeout_seconds}s')
print(f'Pydantic AI Enabled: {s.pydantic_ai.enabled}')
print(f'Embedding Dimension: {s.embedding.dimension}')
"
```
Expected output:
```text
=== CRITICAL SETTINGS ===
Quantitative Model: gemma3:27b-it-qat (or gemma3:27b)
Temperature: 0.0
Timeout: 600s (or higher for research runs)
Pydantic AI Enabled: True
Embedding Dimension: 4096
```
Phase 8: Execute Zero-Shot Run
8.1 Use tmux for Long-Running Processes
CRITICAL: Reproduction runs take ~5 min/participant. Use tmux to prevent losing progress if your terminal disconnects.
- [ ] Start or attach to a tmux session:
```bash
# Start a new session
tmux new -s reproduction
# Or attach to an existing session
tmux attach -t reproduction
```
- [ ] Verify you're inside tmux - look for the green status bar at the bottom, or run:
```bash
echo $TMUX
# Should show something like: /private/tmp/tmux-501/default,12345,0
# If empty, you're NOT in tmux!
```
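To step away without killing the run, detach rather than closing the terminal:

```bash
# Detach (run keeps going): press Ctrl-b, then d
# Reattach later:
tmux attach -t reproduction
```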
8.2 Run Command
```bash
# Zero-shot on AVEC dev split (has per-item labels)
uv run python scripts/reproduce_results.py --split dev --zero-shot-only

# Zero-shot on paper test split
uv run python scripts/reproduce_results.py --split paper-test --zero-shot-only
```
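Optionally tee the output to a log file so the patterns in 8.3 below are grep-able afterwards (a sketch; the `logs/` path is arbitrary):

```bash
mkdir -p logs
uv run python scripts/reproduce_results.py --split paper-test --zero-shot-only \
  2>&1 | tee "logs/zero_shot_$(date +%Y%m%d_%H%M%S).log"
```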
8.3 Monitor for Issues
Watch for these log patterns:

| Log Pattern | Issue | Action |
|---|---|---|
| `LLM request timed out` | Transcript too long | Increase `OLLAMA_TIMEOUT_SECONDS` |
| `Failed to parse evidence JSON` | LLM output malformed | Check JSON repair (Spec 043) and inspect the raw response |
| `na_count = 8` for all participants | MedGemma contamination | Ensure model is Gemma3 (`gemma3:27b-it-qat` or `gemma3:27b`), not MedGemma |
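If you logged to a file as sketched in 8.2, you can watch for all three patterns at once (hypothetical log path - adjust to yours):

```bash
tail -f logs/zero_shot_*.log | grep -E "timed out|Failed to parse|na_count"
```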
Phase 9: Post-Run Validation
9.1 Output File Created
- [ ] Results file exists:
```bash
ls -lt data/outputs/*.json | head -1
```
9.2 Verify item_signals Present (Required for AURC/AUGRC)
Reference: Spec 25 - Required for selective prediction evaluation
- [ ] Output includes `item_signals` for each participant:
```bash
python3 -c "
import json
from pathlib import Path
f = sorted(Path('data/outputs').glob('*.json'))[-1]
data = json.loads(f.read_text())
results = data['experiments'][0]['results']['results']
success = next((r for r in results if r.get('success')), None)
if success:
    has_signals = 'item_signals' in success
    print(f'Has item_signals: {has_signals}')
    if has_signals:
        print(f'Signal keys: {list(success[\"item_signals\"].keys())[:3]}...')
    assert has_signals, 'FAIL: Missing item_signals! Re-run with latest code.'
else:
    print('WARNING: No successful results found')
"
```
Gotcha: Outputs created before 2025-12-27 lack `item_signals`. The AURC/AUGRC evaluation script requires this field. Re-run if missing.
9.3 Metrics Sanity Check
- [ ] Coverage is ~50-60% (baseline defaults):
```bash
# Check the output JSON for run mode and success/failure counts
python3 -c "
import json
from pathlib import Path
f = sorted(Path('data/outputs').glob('*.json'))[-1]
data = json.loads(f.read_text())
exp = data['experiments'][0]['results']
print(f'Mode: {exp.get(\"mode\", \"unknown\")}')
print(f'Success: {sum(1 for r in exp[\"results\"] if r.get(\"success\"))}')
print(f'Failed: {sum(1 for r in exp[\"results\"] if not r.get(\"success\"))}')
"
```
Expected (zero-shot):
- Coverage: ~50-60%
- MAE: ~0.72-0.80 (paper reports 0.796)
9.4 Run AURC/AUGRC Evaluation (Optional)
Reference: Spec 25 - Proper selective prediction evaluation
- [ ] Evaluate with risk-coverage metrics:
```bash
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/<your_output>.json \
  --mode zero_shot \
  --seed 42
```
This computes AURC, AUGRC, and MAE@coverage with bootstrap CIs, the statistically valid way to evaluate selective prediction systems.
Common Failure Modes Quick Reference
| Symptom | Cause | Fix |
|---|---|---|
| All items N/A | MedGemma model | Change to `gemma3:27b` |
| Timeouts on ~13% of participants | Long transcripts | Increase `OLLAMA_TIMEOUT_SECONDS` to 600 or higher |
| Participant 487 fails | macOS resource fork | Re-extract with `unzip -x '._*'` |
| Config not applying | `.env` override | Start fresh: `cp .env.example .env` |
| MAE ~4.0 (wrong scale) | Old script | Use current `scripts/reproduce_results.py` |
| Missing `item_signals` | Old output file | Re-run with code from 2025-12-27+ |
| AURC eval fails | No `item_signals` | Re-run reproduction to generate new outputs |
Checklist Complete?
If ALL items are checked:
1. You're ready to run the zero-shot reproduction.
2. Expected runtime: ~5 min/participant (varies by hardware).
3. Reference target (paper-reported, item-level MAE): 0.796 (use coverage-aware metrics for fair comparisons).
Related Documentation
- Few-Shot Preflight - For few-shot runs (includes embedding setup)
- Configuration Philosophy - Why we've moved beyond the legacy baseline
- Model Registry - Model configuration and backend options