Preflight Checklist: Few-Shot Reproduction
Purpose: Comprehensive pre-run verification for few-shot paper reproduction
Last Updated: 2026-01-04
Related: Zero-Shot Checklist | Configuration Reference
Overview
This checklist prevents reproduction failures by verifying ALL known gotchas before running. Use this every time you start a few-shot reproduction run.
Few-shot mode uses reference embeddings to retrieve similar transcript chunks as examples for the LLM. This requires:

1. Pre-computed reference embeddings
2. Matching embedding dimensions
3. Correct embedding model
Validity note: Few-shot can only help when there is grounded, item-relevant evidence to embed at runtime. Because PHQ-8 is a 2-week frequency instrument and DAIC-WOZ transcripts are not structured as PHQ administration, references may be sparse and N/A outputs are expected. Evaluate with AURC/AUGRC (coverage-aware) and see docs/clinical/task-validity.md.
TL;DR (No-Excuses Preflight)
```shell
make dev
cp .env.example .env
# If EMBEDDING_BACKEND=huggingface, verify deps load (required for runtime query embeddings)
uv run python -c "import torch, transformers, sentence_transformers; print(torch.__version__)"
# Sanity: verify the run header shows FOUND sidecars + chunk scoring enabled
uv run python scripts/reproduce_results.py --split paper-test --dry-run
```
Run Modes and Flags (Do Not Guess)
scripts/reproduce_results.py behavior:
- Default (no mode flags): runs both modes (zero-shot + few-shot).
- --few-shot-only: runs few-shot only.
- --zero-shot-only: runs zero-shot only.
If you want all confidence-suite signals in one artifact, run both modes (default). .env.example enables consistency by default:
```shell
uv run python scripts/reproduce_results.py \
  --split paper-test
```
Phase 1: Environment Setup
1.1 Dependencies
- [ ] Install all dependencies: `make dev` (NOT `uv sync --dev`)
  - Gotcha (BUG-021): `uv sync --dev` does NOT install `[project.optional-dependencies].dev`
- [ ] Verify installation: `uv run pytest --co -q | head -5` (should show test count)
1.2 Configuration File
- [ ] Copy template: `cp .env.example .env`
  - Gotcha (BUG-018b): `.env` OVERRIDES code defaults! Always start fresh.
- [ ] Review the .env file manually - open it and verify:

```shell
grep -E "^[^#]" .env | sort
```
1.3 Ollama Status
- [ ] Ollama running:

```shell
curl -s http://localhost:11434/api/tags | head
# Should return JSON with a model list
```

- [ ] Required models pulled:

```shell
ollama list | grep -E "gemma3:27b|qwen3-embedding"
```

If missing:

```shell
# Production-recommended (QAT-quantized, faster):
ollama pull gemma3:27b-it-qat
# Standard Ollama tag (GGUF Q4_K_M):
ollama pull gemma3:27b
# Embedding model:
ollama pull qwen3-embedding:8b
```
Phase 2: Model Configuration (CRITICAL)
2.1 Quantitative Model Selection
Reference: Paper Section 2.2, BUG-018a
- [ ] Verify quantitative model is Gemma3 (NOT MedGemma):
```shell
grep "MODEL_QUANTITATIVE_MODEL" .env
# Acceptable values:
# MODEL_QUANTITATIVE_MODEL=gemma3:27b-it-qat  (QAT-optimized, faster inference)
# MODEL_QUANTITATIVE_MODEL=gemma3:27b         (standard Ollama quantization)
```
Note on quantization: The paper authors likely used full-precision BF16 weights. Ollama's gemma3:27b uses Q4_K_M quantization; -it-qat adds QAT optimization for faster inference. Both are acceptable for reproduction (neither is true BF16).
Gotcha (BUG-018a): MedGemma produces ALL N/A scores due to being too conservative. Appendix F says it "detected fewer relevant chunks, making fewer predictions overall."
- [ ] Check for MedGemma contamination:
```shell
grep -i "medgemma" .env
# Should return NOTHING or only commented lines
```
2.2 Embedding Model Selection
Reference: Paper Section 2.2, Appendix D
- [ ] Verify embedding model:

```shell
grep "MODEL_EMBEDDING_MODEL" .env
# Should show: MODEL_EMBEDDING_MODEL=qwen3-embedding:8b
```

- [ ] Verify embedding backend (HF recommended for higher quality):

```shell
grep "EMBEDDING_BACKEND" .env
# Recommended (default): EMBEDDING_BACKEND=huggingface  (FP16, higher quality)
# Alternative: EMBEDDING_BACKEND=ollama  (Q4_K_M, legacy baseline)
```
Note: HuggingFace backend requires make dev to install dependencies.
IMPORTANT: Precomputed data/embeddings/*.npz files are reference embeddings only. Few-shot also embeds the query (participant evidence) at runtime in the same embedding space. If HF deps are missing, the run fails fast with MissingHuggingFaceDependenciesError before wasting hours.
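As a quick pre-run guard you can check importability of the backend dependencies yourself. This is a generic stdlib sketch; the three module names are the ones the TL;DR check imports, and this is not a substitute for the project's own fail-fast error:

```python
import importlib.util

def missing_deps(required=("torch", "transformers", "sentence_transformers")):
    """Return the subset of required modules that will not import."""
    return [name for name in required if importlib.util.find_spec(name) is None]

missing = missing_deps()
if missing:
    print(f"HuggingFace backend not ready - run `make dev` (missing: {missing})")
else:
    print("HuggingFace backend deps present")
```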
2.3 Sampling Parameters
Reference: GAP-001b/c, Agent Sampling Registry
- [ ] Temperature is zero (clinical AI best practice):
```shell
grep "MODEL_TEMPERATURE" .env
# Should show: MODEL_TEMPERATURE=0.0
```
Note: We use temp=0 for all agents. top_k/top_p are not set (irrelevant at temp=0).
2.4 Pydantic AI (Structured Validation)
Reference: Spec 13 - Enabled by default since 2025-12-26
- [ ] Pydantic AI is enabled (recommended for structured output validation):
```shell
grep "PYDANTIC_AI_ENABLED" .env
# Should show: PYDANTIC_AI_ENABLED=true (or absent; true is the default)
```
What it does: Adds structured validation + automatic retries (default PYDANTIC_AI_RETRIES=5) for quantitative scoring, judge metrics, and meta-review.
There is no legacy parsing fallback; failures after retries remain failures.
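The retry semantics can be pictured with a generic validate-then-retry loop. This is an illustrative sketch of the documented behavior (no fallback parser; an exhausted retry budget stays a failure), not Pydantic AI's actual internals:

```python
import json

def call_with_retries(generate, validate, retries=5):
    """Re-invoke `generate` until `validate` accepts the output or retries run out."""
    last_error = None
    for _ in range(retries):
        raw = generate()
        try:
            return validate(raw)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
    # No legacy parsing fallback: failures after retries remain failures.
    raise RuntimeError("validation failed after retries") from last_error

attempts = {"n": 0}
def flaky_llm():
    """Hypothetical model call that returns malformed JSON twice, then valid JSON."""
    attempts["n"] += 1
    return "{not json" if attempts["n"] < 3 else '{"PHQ8_Sleep": 2}'

print(call_with_retries(flaky_llm, json.loads))  # succeeds on the third attempt
```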
- [ ] Verify in config summary:
```shell
uv run python -c "
from ai_psychiatrist.config import get_settings
s = get_settings()
print(f'Pydantic AI Enabled: {s.pydantic_ai.enabled}')
print(f'Pydantic AI Retries: {s.pydantic_ai.retries}')
"
# Expected: Enabled=True, Retries=5
```
Phase 3: Embedding Hyperparameters (CRITICAL)
3.1 Appendix D Hyperparameters (Baseline)
Reference: Paper Appendix D
- [ ] Chunk size = 8 (N_chunk):

```shell
grep "EMBEDDING_CHUNK_SIZE" .env
# MUST show: EMBEDDING_CHUNK_SIZE=8
```

- [ ] Chunk step = 2 (overlap):

```shell
grep "EMBEDDING_CHUNK_STEP" .env
# MUST show: EMBEDDING_CHUNK_STEP=2
```

- [ ] Dimension = 4096 (N_dimension):

```shell
grep "EMBEDDING_DIMENSION" .env
# MUST show: EMBEDDING_DIMENSION=4096
```

- [ ] Top-k references = 2 (N_example):

```shell
grep "EMBEDDING_TOP_K_REFERENCES" .env
# MUST show: EMBEDDING_TOP_K_REFERENCES=2
```
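The windowing that chunk size 8 and step 2 define can be sketched in a few lines. This is an illustration of the arithmetic, assuming line-level chunking; it is not the project's actual chunking code:

```python
def chunk_transcript(lines, chunk_size=8, chunk_step=2):
    """Sliding windows of `chunk_size` lines, advancing `chunk_step` lines at a time."""
    last_start = max(len(lines) - chunk_size, 0)
    return [lines[i:i + chunk_size] for i in range(0, last_start + 1, chunk_step)]

lines = [f"utterance {i}" for i in range(20)]
chunks = chunk_transcript(lines)
print(len(chunks))   # (20 - 8) // 2 + 1 = 7 overlapping chunks
print(chunks[1][0])  # second chunk starts 2 lines in: 'utterance 2'
```

A 20-line transcript yields 7 overlapping chunks, which is why the reference NPZ holds far more chunks than participants.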
3.2 Dimension Mismatch Check
Reference: BUG-009
Gotcha: Dimension mismatches can result in skipped chunks. If all chunks are mismatched, the system fails loudly. If only some chunks are mismatched, retrieval quality degrades. Always validate dimensions pre-run.
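To see why a partial mismatch degrades retrieval instead of crashing, consider a minimal cosine-similarity sketch (illustrative only; this is not the project's retrieval code):

```python
import numpy as np

def safe_cosine(query, ref):
    """Cosine similarity, or None when dimensions disagree (the chunk gets skipped)."""
    if query.shape[-1] != ref.shape[-1]:
        return None
    denom = float(np.linalg.norm(query) * np.linalg.norm(ref))
    return float(query @ ref) / denom

q = np.ones(4096, dtype=np.float32)
good_chunk = np.ones(4096, dtype=np.float32)
stale_chunk = np.ones(1024, dtype=np.float32)  # artifact built with the wrong dimension

print(safe_cosine(q, good_chunk))   # 1.0
print(safe_cosine(q, stale_chunk))  # None - silently skipped if only some chunks mismatch
```

If every chunk returns None the system fails loudly; if only a few do, the run proceeds with a quietly impoverished reference pool, which is why the pre-run dimension check matters.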
- [ ] Verify dimension consistency:
```shell
# If embeddings exist, check their dimension
uv run python -c "
import numpy as np
from ai_psychiatrist.config import get_settings, resolve_reference_embeddings_path
s = get_settings()
p = resolve_reference_embeddings_path(s.data, s.embedding)
if p.exists():
    data = np.load(str(p))
    # NPZ uses per-participant keys: emb_302, emb_304, etc.
    dim = data[data.files[0]].shape[1]
    print(f'Embedding dimension: {dim}')
    print(f'Config expects: 4096')
    assert dim == 4096, 'DIMENSION MISMATCH!'
    print('OK - dimensions match')
else:
    print('Embeddings not found - will need to generate')
"
```
Phase 4: Reference Embeddings (FEW-SHOT SPECIFIC)
4.1 Embeddings Exist
- [ ] Check for embedding file:
```shell
ls -lh data/embeddings/*.npz
```
Default embedding artifact: huggingface_qwen3_8b_paper_train_participant_only.npz (FP16, participant-only transcripts; recommended)
Alternative: ollama_qwen3_8b_paper_train_participant_only.npz (Ollama Q4_K_M, legacy baseline)
If missing, generate (takes ~65 min for 58 participants):
```shell
# Generate HuggingFace FP16 embeddings (recommended, collision-free naming)
DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only \
uv run python scripts/generate_embeddings.py \
  --backend huggingface \
  --split paper-train \
  --output data/embeddings/huggingface_qwen3_8b_paper_train_participant_only.npz

# Optional (Spec 34): also write a per-chunk PHQ-8 item tags sidecar
# (recommended for retrieval): add --write-item-tags

# Or generate Ollama embeddings (legacy baseline)
DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only \
EMBEDDING_BACKEND=ollama uv run python scripts/generate_embeddings.py \
  --backend ollama \
  --split paper-train \
  --output data/embeddings/ollama_qwen3_8b_paper_train_participant_only.npz
```
4.2 Verify Embedding Integrity
- [ ] Embedding file is valid:
```shell
uv run python -c "
import numpy as np
from pathlib import Path
from ai_psychiatrist.config import get_settings, resolve_reference_embeddings_path
s = get_settings()
candidates = [
    resolve_reference_embeddings_path(s.data, s.embedding),
    Path('data/embeddings/reference_embeddings.npz'),
]
for p in candidates:
    if p.exists():
        data = np.load(str(p))
        # NPZ uses per-participant keys: emb_302, emb_304, etc.
        pids = [int(k.split('_')[1]) for k in data.keys()]
        total_chunks = sum(data[k].shape[0] for k in data.keys())
        dim = data[data.files[0]].shape[1]
        print(f'File: {p.name}')
        print(f'  Participants: {len(pids)}')
        print(f'  Total chunks: {total_chunks}')
        print(f'  Dimension: {dim}')
        break
else:
    print('ERROR: No embedding file found!')
    print('Run: uv run python scripts/generate_embeddings.py --split paper-train')
"
```
Expected output for paper-train:
```
File: <your configured embeddings artifact>
  Participants: 58
  Total chunks: ~7000
  Dimension: 4096
```
4.3 Sidecar File Check
- [ ] JSON sidecar exists (for chunk text) and (optional) tags sidecar:

```shell
uv run python -c "
from ai_psychiatrist.config import get_settings, resolve_reference_embeddings_path
s = get_settings()
npz = resolve_reference_embeddings_path(s.data, s.embedding)
paths = [
    ('json', npz.with_suffix('.json')),
    ('meta', npz.with_suffix('.meta.json')),
    ('tags', npz.with_suffix('.tags.json')),
]
print(f'NPZ: {npz}')
for name, path in paths:
    status = 'OK' if path.exists() else 'MISSING'
    print(f'{name}: {path.name} ({status})')
"
```

- `.tags.json` is only required if you set `EMBEDDING_ENABLE_ITEM_TAG_FILTER=true`; otherwise it is ignored.
Phase 5: Quantitative Settings
5.1 N/A Reason Tracking
- [ ] N/A tracking enabled (for debugging):
```shell
grep "QUANTITATIVE_TRACK_NA_REASONS" .env
# Should show: QUANTITATIVE_TRACK_NA_REASONS=true
```
Phase 6: Data Integrity
6.1 Transcripts Present
- [ ] Transcripts directory exists:
```shell
ls data/transcripts_participant_only/ | wc -l
# Should show ~189 (or your participant count)
```
6.2 Participant 487 Validation
Reference: BUG-003, BUG-022
- [ ] Participant 487 is NOT corrupted:

```shell
file data/transcripts_participant_only/487_P/487_TRANSCRIPT.csv
# MUST show: ASCII text, or UTF-8 Unicode text
# NOT: AppleDouble encoded, or binary
```

- [ ] Correct file size (~20KB, not 4KB):

```shell
ls -lh data/transcripts_participant_only/487_P/487_TRANSCRIPT.csv
# Should be ~18-25KB, NOT 4KB
```
Gotcha (BUG-003): macOS ZIP extraction can extract AppleDouble resource forks instead of real files.
6.3 Ground Truth Labels
- [ ] AVEC2017 labels exist (for item-level MAE):
```shell
ls data/*_split_Depression_AVEC2017.csv
# Should show: dev_split, train_split, test_split files
```
Phase 7: Timeout Configuration
7.1 Timeout Setting
Reference: BUG-018e, BUG-027
- [ ] Set generous timeout for GPU-safe operation:
```shell
grep -E "^(OLLAMA_TIMEOUT_SECONDS|PYDANTIC_AI_TIMEOUT_SECONDS)=" .env
# Recommended: 600 (10 min, safe default)
# For slow GPU: 3600 (1 hour, research runs)
```
Gotcha (BUG-027): Pydantic AI timeout is configurable via PYDANTIC_AI_TIMEOUT_SECONDS.
If you set only one of {OLLAMA_TIMEOUT_SECONDS, PYDANTIC_AI_TIMEOUT_SECONDS}, Settings syncs the other; if you set both, keep them equal to avoid fallback timeouts.
Gotcha: 6/47 participants (13%) timed out on first run with 300s. Large transcripts (~24KB+) need 600s+ (often 3600s on slow GPUs).
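The sync rule for the two timeout variables can be modeled as follows. This is an illustrative sketch of the documented behavior only; the real logic lives in Settings and may differ in detail, and the 600-second default is the recommendation from this checklist, not a verified code default:

```python
def resolve_timeouts(env, default=600):
    """If only one timeout is set, mirror it to the other; if both, keep both."""
    ollama = env.get("OLLAMA_TIMEOUT_SECONDS")
    pydantic = env.get("PYDANTIC_AI_TIMEOUT_SECONDS")
    if ollama is None and pydantic is None:
        return default, default
    if ollama is None:
        return int(pydantic), int(pydantic)
    if pydantic is None:
        return int(ollama), int(ollama)
    # Both set: they are taken as-is, so keep them equal yourself.
    return int(ollama), int(pydantic)

print(resolve_timeouts({"OLLAMA_TIMEOUT_SECONDS": "3600"}))  # (3600, 3600)
print(resolve_timeouts({}))                                  # (600, 600)
```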
7.2 Check for Long Transcripts
- [ ] Identify large transcripts that may timeout:
```shell
find data/transcripts -name "*TRANSCRIPT.csv" -exec wc -c {} + | sort -n | tail -10
# Note any files > 25KB - these may need extra timeout
```
Phase 8: Paper Split
8.1 Create or Verify Splits
- [ ] Paper splits exist OR will be created:
```shell
ls data/paper_splits/
# If empty or missing:
uv run python scripts/create_paper_split.py --verify
```
8.2 Verify Split Sizes
Reference: Paper Section 2.4.1
- [ ] Split sizes match paper (58/43/41):
```shell
wc -l data/paper_splits/paper_split_*.csv
# Should show: 59 train (58 + header), 44 val, 42 test
```
8.3 Embedding-Split Alignment
CRITICAL for few-shot: Embeddings must be from TRAINING set only!
- [ ] Verify embeddings are from paper-train:
```shell
uv run python -c "
import numpy as np
import csv
from ai_psychiatrist.config import get_settings, resolve_reference_embeddings_path
s = get_settings()
npz_path = resolve_reference_embeddings_path(s.data, s.embedding)
# Load embedding participant IDs from NPZ keys (emb_302, emb_304, etc.)
emb = np.load(str(npz_path))
emb_pids = {int(k.split('_')[1]) for k in emb.keys()}
print(f'Embedding participants: {len(emb_pids)}')
print(f'Embeddings file: {npz_path}')
# Load paper train split (column is Participant_ID)
with open('data/paper_splits/paper_split_train.csv') as f:
    train_pids = {int(row['Participant_ID']) for row in csv.DictReader(f)}
print(f'Paper train participants: {len(train_pids)}')
# Check alignment
if emb_pids == train_pids:
    print('OK - embeddings match paper train split')
else:
    print('WARNING: Embeddings do not match train split!')
    print(f'  In emb but not train: {sorted(emb_pids - train_pids)}')
    print(f'  In train but not emb: {sorted(train_pids - emb_pids)}')
"
```
Phase 9: Pre-Run Verification
9.1 Quick Sanity Check
- [ ] Run linter: `make lint`
- [ ] Run type checker: `make typecheck`
- [ ] Run unit tests: `make test-unit`
9.2 Configuration Summary Check
Run this to dump your effective configuration:
```shell
uv run python -c "
from ai_psychiatrist.config import get_settings
s = get_settings()
print('=== CRITICAL SETTINGS ===')
print(f'Quantitative Model: {s.model.quantitative_model}')
print(f'Embedding Model: {s.model.embedding_model}')
print(f'Temperature: {s.model.temperature}')
print(f'Timeout: {s.ollama.timeout_seconds}s')
print(f'Pydantic AI Enabled: {s.pydantic_ai.enabled}')
print()
print('=== EMBEDDING SETTINGS (Appendix D) ===')
print(f'Dimension: {s.embedding.dimension} (paper: 4096)')
print(f'Chunk Size: {s.embedding.chunk_size} (paper: 8)')
print(f'Chunk Step: {s.embedding.chunk_step} (paper: 2)')
print(f'Top-K References: {s.embedding.top_k_references} (paper: 2)')
"
```
Expected output:
```
=== CRITICAL SETTINGS ===
Quantitative Model: gemma3:27b-it-qat (or gemma3:27b for legacy baseline)
Embedding Model: qwen3-embedding:8b
Temperature: 0.0
Timeout: 600s (or higher for research runs)
Pydantic AI Enabled: True

=== EMBEDDING SETTINGS (Appendix D) ===
Dimension: 4096 (paper: 4096)
Chunk Size: 8 (paper: 8)
Chunk Step: 2 (paper: 2)
Top-K References: 2 (paper: 2)
```
9.3 Few-Shot Readiness Final Check
- [ ] Embeddings match config dimension (from Phase 3.2)
- [ ] Embeddings are from train split (from Phase 8.3)
- [ ] Config matches paper Appendix D (from Phase 9.2)
9.4 Retrieval/RAG Features (Required for Current SSOT)
These must be enabled for current “validated configuration” runs:
```shell
grep -E "^(EMBEDDING_REFERENCE_SCORE_SOURCE|EMBEDDING_ENABLE_ITEM_TAG_FILTER|EMBEDDING_ENABLE_RETRIEVAL_AUDIT|EMBEDDING_MIN_REFERENCE_SIMILARITY|EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM)=" .env
# Expected (baseline):
# EMBEDDING_REFERENCE_SCORE_SOURCE=chunk
# EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
# EMBEDDING_ENABLE_RETRIEVAL_AUDIT=true
# EMBEDDING_MIN_REFERENCE_SIMILARITY=0.3
# EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=500
```
Evidence grounding (prevents ungrounded quotes contaminating retrieval):
```shell
grep -E "^QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_" .env
# Expected:
# QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_ENABLED=true
# QUANTITATIVE_EVIDENCE_QUOTE_FAIL_ON_ALL_REJECTED=false  # strict mode = true
```
Phase 10: Execute Few-Shot Run
10.1 Use tmux for Long-Running Processes
CRITICAL: Reproduction runs take ~5-6 min/participant. Use tmux to prevent losing progress if your terminal disconnects.
- [ ] Start or attach to a tmux session:

```shell
# Start new session
tmux new -s reproduction
# Or attach to existing session
tmux attach -t reproduction
```

- [ ] Verify you're inside tmux: look for the green status bar at the bottom, or run:

```shell
echo $TMUX
# Should show something like: /private/tmp/tmux-501/default,12345,0
# If empty, you're NOT in tmux!
```
10.2 Run Command
```shell
# Few-shot only on paper test split
uv run python scripts/reproduce_results.py --split paper-test --few-shot-only

# Few-shot on AVEC dev split (sanity check)
uv run python scripts/reproduce_results.py --split dev --few-shot-only
```
Note: --few-shot-only ensures few-shot mode. If you omit it, the script runs both modes by default.
10.3 Monitor for Issues
Watch for these log patterns:
| Log Pattern | Issue | Action |
|---|---|---|
| `LLM request timed out` | Transcript too long | Increase `OLLAMA_TIMEOUT_SECONDS` |
| `Failed to parse evidence JSON` | LLM output malformed | Check JSON repair (Spec 043) and inspect the raw response |
| `na_count = 8` for all | MedGemma contamination | Ensure model is Gemma3 (`gemma3:27b-it-qat` or `gemma3:27b`), not MedGemma |
| `No reference embeddings found` | Missing/wrong embeddings | Generate: `scripts/generate_embeddings.py` |
| `Embedding dimension mismatch` | Dimension inconsistency | Regenerate embeddings with correct dimension |
| `0 similar chunks found` | Silent dimension mismatch | Check `EMBEDDING_DIMENSION` matches NPZ |
Phase 11: Post-Run Validation
11.1 Output File Created
- [ ] Results file exists:
```shell
ls -lt data/outputs/*.json | head -1
```
11.2 Verify item_signals Present (Required for AURC/AUGRC)
Reference: Spec 25 - Required for selective prediction evaluation
- [ ] Output includes `item_signals` for each participant:

```shell
python3 -c "
import json
from pathlib import Path
f = sorted(Path('data/outputs').glob('*.json'))[-1]
data = json.loads(f.read_text())
results = data['experiments'][0]['results']['results']
success = next((r for r in results if r.get('success')), None)
if success:
    has_signals = 'item_signals' in success
    print(f'Has item_signals: {has_signals}')
    if has_signals:
        print(f'Signal keys: {list(success[\"item_signals\"].keys())[:3]}...')
    assert has_signals, 'FAIL: Missing item_signals! Re-run with latest code.'
else:
    print('WARNING: No successful results found')
"
```
Gotcha: Outputs created before 2025-12-27 lack item_signals. The AURC/AUGRC
evaluation script requires this field. Re-run if missing.
11.3 Metrics Sanity Check
- [ ] Coverage is ~50-70% (few-shot typically higher than zero-shot):
```shell
python3 -c "
import json
from pathlib import Path
f = sorted(Path('data/outputs').glob('*.json'))[-1]
data = json.loads(f.read_text())
exp = data['experiments'][0]['results']
print(f'Mode: {exp.get(\"mode\", \"unknown\")}')
print(f'Success: {sum(1 for r in exp[\"results\"] if r.get(\"success\"))}')
print(f'Failed: {sum(1 for r in exp[\"results\"] if not r.get(\"success\"))}')
"
```
Expected (few-shot):

- Coverage: ~50-72% (higher than zero-shot due to few-shot examples)
- MAE: ~0.62-0.90 (varies with coverage - see BUG-029)
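Coverage here simply means the fraction of PHQ-8 items that received a score rather than N/A. A minimal sketch (the item layout below is illustrative, not the actual output schema):

```python
def coverage(item_scores):
    """Fraction of items scored; None marks an N/A abstention."""
    scored = sum(1 for v in item_scores if v is not None)
    return scored / len(item_scores)

# One participant, 8 PHQ-8 items, three abstentions:
items = [0, 1, None, 2, None, 3, 0, None]
print(coverage(items))  # 5/8 = 0.625
```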
11.4 Compare to Paper Targets
| Metric | Paper | Our Actual | Notes |
|---|---|---|---|
| MAE (few-shot) | 0.619 | ~0.86 | Higher coverage = higher MAE (expected) |
| Coverage (few-shot) | ~50% | ~72% | Our system predicts more items |
Note: Paper compares MAE at different coverages (invalid per Spec 25). Use AURC/AUGRC for fair comparison.
11.5 Run AURC/AUGRC Evaluation (Recommended)
Reference: Spec 25 - Proper selective prediction evaluation
- [ ] Evaluate with risk-coverage metrics:

```shell
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/<your_output>.json \
  --mode few_shot \
  --seed 42
```

- [ ] Compare zero-shot vs few-shot (paired analysis):

```shell
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/<zero_shot>.json \
  --input data/outputs/<few_shot>.json \
  --seed 42
```
This computes AURC, AUGRC, and paired Δ with bootstrap CIs - the statistically valid way to compare selective prediction systems.
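For intuition, AURC can be computed from per-prediction errors and confidence scores as the mean of the running risk while coverage grows from the most-confident prediction outward. This is one common formulation sketched generically; the project's evaluation script may use a different estimator:

```python
import numpy as np

def aurc(errors, confidences):
    """Area under the risk-coverage curve (lower is better)."""
    order = np.argsort(-np.asarray(confidences, dtype=float))  # most confident first
    risk = np.asarray(errors, dtype=float)[order]
    running_mean_risk = np.cumsum(risk) / np.arange(1, len(risk) + 1)
    return float(running_mean_risk.mean())

errs = [0.0, 0.0, 1.0, 2.0]
well_ranked = aurc(errs, confidences=[4, 3, 2, 1])  # confidence tracks accuracy
anti_ranked = aurc(errs, confidences=[1, 2, 3, 4])  # confidence inverted
print(well_ranked < anti_ranked)  # True: better ranking -> lower AURC
```

Unlike comparing raw MAE at different coverages, this metric rewards a system for abstaining on exactly the items it would have gotten wrong.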
Common Failure Modes Quick Reference
| Symptom | Cause | Fix |
|---|---|---|
| All items N/A | MedGemma model | Change to `gemma3:27b` |
| Timeouts on 13% | Long transcripts | Increase `OLLAMA_TIMEOUT_SECONDS=600` or higher |
| Participant 487 fails | macOS resource fork | Re-extract with `unzip -x '._*'` |
| Config not applying | `.env` override | Start fresh: `cp .env.example .env` |
| MAE ~4.0 (wrong scale) | Old script | Use current `scripts/reproduce_results.py` |
| No few-shot effect | Missing embeddings | Generate: `scripts/generate_embeddings.py` |
| Silent zero-shot | Dimension mismatch | Check `EMBEDDING_DIMENSION=4096` matches NPZ |
| Wrong participants | AVEC vs paper split | Use `--split paper` for paper methodology |
| Missing `item_signals` | Old output file | Re-run with code from 2025-12-27+ |
| AURC eval fails | No `item_signals` | Re-run reproduction to generate new outputs |
| Embedding hash mismatch | Wrong split used | Regenerate embeddings for paper-train split |
Checklist Complete?
If ALL items are checked:

1. You're ready to run few-shot reproduction
2. Expected runtime: ~5-6 min/participant (varies by hardware)
3. Embedding generation: ~65 min for 58 participants (one-time)
4. Paper few-shot MAE target: 0.619
Remember: The paper acknowledges stochasticity - results within ±0.1 MAE are considered consistent.
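That tolerance can be encoded as a one-line check; the target and band are the ones stated in this checklist:

```python
def consistent_with_paper(our_mae, paper_mae=0.619, tol=0.1):
    """True if our MAE is within the paper's stated stochasticity band."""
    return abs(our_mae - paper_mae) <= tol

print(consistent_with_paper(0.68))  # True  (0.061 away)
print(consistent_with_paper(0.86))  # False (0.241 away - check coverage differences first)
```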
Complete Reproduction Workflow
```shell
# 1. Setup (first time only)
make dev              # Install with HuggingFace deps (recommended)
cp .env.example .env

# 2. Pull required Ollama models
# Production-recommended (QAT-quantized, faster):
ollama pull gemma3:27b-it-qat
# Standard Ollama tag (GGUF Q4_K_M):
ollama pull gemma3:27b
# Embedding model:
ollama pull qwen3-embedding:8b

# 3. Create paper ground truth split
uv run python scripts/create_paper_split.py --verify

# 4. Generate embeddings from paper-train (takes ~65 min)
# Default uses HuggingFace FP16 (higher quality)
uv run python scripts/generate_embeddings.py --split paper-train

# 5. Run few-shot reproduction (Pydantic AI enabled by default)
uv run python scripts/reproduce_results.py --split paper --few-shot-only

# 6. (Optional) Compare with zero-shot
uv run python scripts/reproduce_results.py --split paper
```
Note: Pydantic AI is enabled by default, providing structured validation with automatic retries. No additional configuration needed.
Related Documentation
- Zero-Shot Preflight - Simpler, no embeddings
- Configuration Philosophy - Why we use validated baselines
- Model Registry - Model configuration and backend options