
Model Wiring: Current State

Purpose: Document exactly how models and backends are wired in the codebase.
Last Updated: 2026-01-02
Status: Implemented. LLM_BACKEND selects the chat backend; EMBEDDING_BACKEND selects the embedding backend.


TL;DR - The Simple Truth

| Component | Backend | Default Model | Precision |
|---|---|---|---|
| Chat (all agents) | LLM_BACKEND=ollama | gemma3:27b | Q4_K_M (4-bit) |
| Chat (quant alt) | LLM_BACKEND=huggingface | medgemma:27b | FP16 (16-bit) |
| Embedding | EMBEDDING_BACKEND=huggingface | qwen3-embedding:8b → Qwen/Qwen3-Embedding-8B | FP16 (16-bit) |

Key decisions:
- Chat: Ollama default (validated baseline). MedGemma is a hard toggle for the quantitative agent.
- Embedding: HuggingFace default (better precision). Ollama is an opt-out fallback.

Default vs Hard Toggle (The Simple Version)

| Component | Default | Hard Toggle Option |
|---|---|---|
| Qualitative Agent | Ollama (gemma3:27b) | none |
| Judge Agent | Ollama (gemma3:27b) | none |
| Meta-Review Agent | Ollama (gemma3:27b) | none |
| Quant Agent | Ollama (gemma3:27b) | HF (medgemma:27b) |
| Embeddings | HF (Qwen/Qwen3-Embedding-8B) | Ollama (qwen3-embedding:8b) |

Why this mix?
- Ollama: local, no external deps, good baseline.
- HF embeddings: FP16 quality matters for similarity scores.
- MedGemma: only available officially on HF (the Ollama version is a community upload).


Gemma 3 27B: All Official Options (Dec 2025)

Hardware Requirements

| Hardware | VRAM/Memory | Max Model Size |
|---|---|---|
| M1 Max 64GB | 64GB unified | ~54GB (BF16) ✅ |
| M1 Pro 32GB | 32GB unified | ~29GB (Q8_0) ✅ |
| RTX 4090 24GB | 24GB VRAM | ~17GB (Q4) ✅ |

Ollama Options (Official Google Models)

| Tag | Quantization | Size | M1 Max 64GB | M1 Pro 32GB | RTX 4090 24GB | Quality |
|---|---|---|---|---|---|---|
| gemma3:27b | Q4_K_M | 17GB | ✅ | ✅ | ✅ | Good |
| gemma3:27b-it-qat | Q4_0 (QAT) | 17GB | ✅ | ✅ | ✅ | Better (QAT-trained) |
| gemma3:27b-it-q8_0 | Q8_0 | 29GB | ✅ | ✅ | ❌ | Better |

What Do These Abbreviations Mean?

| Abbreviation | Full Name | Bits | What It Means |
|---|---|---|---|
| BF16 | Brain Float 16 | 16-bit | Full precision. Each weight is a 16-bit float. No quality loss; huge memory. |
| Q8_0 | Quantized 8-bit | 8-bit | Weights compressed to 8-bit integers. ~2x smaller than BF16. Small quality loss. |
| Q4_K_M | Quantized 4-bit (K-quant Medium) | 4-bit | Weights compressed to 4-bit. ~4x smaller than BF16. Noticeable quality loss. |
| QAT | Quantization-Aware Training | 4-bit | Model was trained knowing it would be quantized. Same 4-bit size, but better quality than post-hoc Q4. |
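
As a sanity check on the sizes above: size is just parameter count × bits per weight, plus format overhead (per-block scales, some tensors kept at higher precision). A minimal Python sketch; the gap between the raw arithmetic and the on-disk sizes is real but format-specific:

import numpy as np  # not needed here, but used in later sketches

# Rough size arithmetic for a 27B-parameter model.
PARAMS = 27e9

def approx_size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"BF16:   ~{approx_size_gb(16):.0f} GB")   # ~54 GB, matches the table
print(f"Q8_0:   ~{approx_size_gb(8):.0f} GB")    # ~27 GB raw; ~29 GB on disk with scales
print(f"Q4_K_M: ~{approx_size_gb(4):.1f} GB")    # ~13.5 GB raw; ~17 GB on disk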

How Quantization Works (Simple Version)

Original model (BF16): each of 27 billion weights is stored as a 16-bit float → 54GB.

Post-hoc quantization (Q4_K_M, Q8_0): take the trained model and compress the weights after training.
- Like compressing a photo to JPEG after taking it
- Some information is lost in compression

Quantization-Aware Training (QAT): train the model knowing it will be compressed.
- Like shooting a photo knowing it will be JPEG: you optimize for the output format
- Google claims this preserves BF16 quality at Q4 size
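
A toy numpy illustration of the information loss in post-hoc quantization. This is plain symmetric rounding with one shared scale; the real Q4_K_M scheme quantizes in blocks with per-block scales, so this is deliberately cruder:

import numpy as np

w = np.random.randn(8).astype(np.float32)   # pretend these are trained weights
scale = np.abs(w).max() / 7                 # one shared scale; int4 codes live in [-7, 7]
q = np.clip(np.round(w / scale), -7, 7)     # quantize: float -> 4-bit integer code
w_hat = q * scale                           # dequantize: what the model computes with

print("max abs error:", np.abs(w - w_hat).max())   # the "JPEG artifacts"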

Quality Ranking (Best → Worst)

BF16 (54GB) > Q8_0 (29GB) > QAT Q4 (17GB) ≈ Q4_K_M (17GB)
     ↑              ↑              ↑              ↑
  Perfect      Very Good    Good (smart)   Good (dumb)

Bottom line: QAT is the sweet spot - same size as Q4_K_M but trained smarter.

HuggingFace Options (Full Precision)

| Model | HuggingFace ID | Precision | Size | M1 Max 64GB | RTX 4090 | Access |
|---|---|---|---|---|---|---|
| Gemma 3 27B | google/gemma-3-27b-it | BF16 | ~54GB | ✅ | ❌ | Open |
| MedGemma 27B | google/medgemma-27b-text-it | BF16 | ~54GB | ✅ | ❌ | Gated |

Other Models (Embedding + Community)

| Model | Backend | Tag/ID | Quantization | Size |
|---|---|---|---|---|
| Qwen3 Embedding 8B | Ollama | qwen3-embedding:8b | Q4_K_M | 4.7GB |
| Qwen3 Embedding 8B | HuggingFace | Qwen/Qwen3-Embedding-8B | FP16 | ~16GB |
| MedGemma 27B | Ollama | alibayram/medgemma:27b | Q4_K_M | ~17GB |

Note: MedGemma on Ollama is a community upload, NOT official Google.

Gemma 3 27B Options (All Agents)

| Model | Backend | Tag/ID | Bits | Size | M1 Max | 4090 | Speed | Quality |
|---|---|---|---|---|---|---|---|---|
| Gemma 3 27B | HF | google/gemma-3-27b-it | 16-bit | 54GB | ✅ | ❌ | Slow | Best (BF16) |
| Gemma 3 27B | Ollama | gemma3:27b-it-q8_0 | 8-bit | 29GB | ✅ | ❌ | Medium | Very Good |
| Gemma 3 27B | Ollama | gemma3:27b-it-qat | 4-bit | 17GB | ✅ | ✅ | Fast | Good (QAT-trained) |
| Gemma 3 27B | Ollama | gemma3:27b | 4-bit | 17GB | ✅ | ✅ | Fast | Good (Q4_K_M) |

Paper reality check: the paper text claims a MacBook M3 Pro, but the repo contains A100 SLURM scripts, so the reported 0.619 MAE was likely produced with BF16 on A100s. Our QAT 4-bit zero-shot run achieved 0.717 MAE (see docs/results/reproduction-results.md).

Which Model Should We Use?

| Goal | Model | Why |
|---|---|---|
| ⭐ RECOMMENDED | gemma3:27b-it-qat (4-bit) | QAT-trained; same speed as Q4, claims BF16-level quality |
| Closer to paper's likely setup | gemma3:27b-it-q8_0 (8-bit) | Paper likely used BF16 on A100s; Q8 is closest, but slow |
| Paper baseline (current default) | gemma3:27b (4-bit) | Post-hoc Q4_K_M; stable, widely available on Ollama |
| Maximum quality | HF google/gemma-3-27b-it (16-bit) | Full BF16; 54GB, very slow on M1 |
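
For example, switching every agent to the recommended QAT build is just the four model keys (a sketch; assumes the gemma3:27b-it-qat tag has been pulled in Ollama):

# .env - QAT everywhere (illustrative)
LLM_BACKEND=ollama
MODEL_QUALITATIVE_MODEL=gemma3:27b-it-qat
MODEL_JUDGE_MODEL=gemma3:27b-it-qat
MODEL_META_REVIEW_MODEL=gemma3:27b-it-qat
MODEL_QUANTITATIVE_MODEL=gemma3:27b-it-qat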

Estimated Run Times (Full Pipeline, 41 Transcripts)

| Model | Est. Time | Notes |
|---|---|---|
| Ollama 4-bit (17GB) | ~2-4 hours | Current default |
| Ollama 8-bit (29GB) | ~6-12 hours | Recommended for reproduction |
| HF BF16 (54GB) | ~12-24+ hours | Memory-bound on M1 |

MedGemma 27B (Quantitative Agent ONLY)

NOT a general model option. MedGemma is ONLY for the quantitative agent as a hard toggle.

| Model | Backend | Tag/ID | Size | Access | Notes |
|---|---|---|---|---|---|
| MedGemma 27B | HF | google/medgemma-27b-text-it | 54GB | Gated | Official, medically fine-tuned |
| MedGemma 27B | Ollama | alibayram/medgemma:27b | 17GB | Open | Community upload, NOT official |

Paper finding (Appendix F): MedGemma got better MAE (0.505 vs 0.619) but made fewer predictions. The paper chose Gemma 3 for main results because MedGemma was too conservative.

To enable MedGemma:

LLM_BACKEND=huggingface
MODEL_QUANTITATIVE_MODEL=medgemma:27b


Current Configuration (Code Defaults)

Backends

| Setting | Default | Purpose |
|---|---|---|
| LLM_BACKEND | ollama | Chat models (all agents) |
| EMBEDDING_BACKEND | huggingface | Embedding model only |

No runtime fallback. If configured backend fails → loud error with instructions.

Chat Models (All Agents)

| Agent | Config Key | Default | Paper Reference |
|---|---|---|---|
| Qualitative | MODEL_QUALITATIVE_MODEL | gemma3:27b | Section 2.2 |
| Judge | MODEL_JUDGE_MODEL | gemma3:27b | Section 2.2 |
| Meta-Review | MODEL_META_REVIEW_MODEL | gemma3:27b | Section 2.2 |
| Quantitative | MODEL_QUANTITATIVE_MODEL | gemma3:27b | Section 2.2 |

MedGemma (medgemma:27b) is an ALTERNATIVE for the quantitative agent only (Appendix F). The official weights require LLM_BACKEND=huggingface; the Ollama community version may behave differently.

Embedding Model

| Setting | Default | Backend | Precision |
|---|---|---|---|
| MODEL_EMBEDDING_MODEL | qwen3-embedding:8b | Resolved per backend | Q4 (Ollama) / FP16 (HF) |

Why HF default for embeddings? FP16 embeddings produce better similarity scores than Q4_K_M. To use Ollama instead, set EMBEDDING_BACKEND=ollama (this runs qwen3-embedding:8b at Q4_K_M).
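
A toy numpy sketch of why precision matters here. (In reality the quality loss comes from running a quantized model, not from quantizing output vectors; this only illustrates that small perturbations shift cosine similarities, which can reorder nearest-neighbor matches.)

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096)
b = a + 0.3 * rng.standard_normal(4096)      # a genuinely similar vector

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def crude_quant(x, bits=4):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

print("full precision:", round(cos(a, b), 4))
print("4-bit vectors: ", round(cos(crude_quant(a), crude_quant(b)), 4))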

Embedding Artifacts

| Setting | Default | Purpose |
|---|---|---|
| EMBEDDING_EMBEDDINGS_FILE | huggingface_qwen3_8b_paper_train_participant_only | Selects {DATA_BASE_DIR}/embeddings/{name}.npz (+ .json, optional .meta.json, optional .tags.json) |
| DATA_EMBEDDINGS_PATH | data/embeddings/huggingface_qwen3_8b_paper_train_participant_only.npz | Full-path override (takes precedence over EMBEDDING_EMBEDDINGS_FILE) |
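
A minimal sketch of the precedence rule in the table above (the function and argument names are hypothetical, not the actual config.py API):

from pathlib import Path

def resolve_embeddings_path(full_path_override: str | None,
                            embeddings_file: str,
                            data_base_dir: str = "data") -> Path:
    # DATA_EMBEDDINGS_PATH wins if set...
    if full_path_override:
        return Path(full_path_override)
    # ...otherwise EMBEDDING_EMBEDDINGS_FILE is resolved under {DATA_BASE_DIR}/embeddings/.
    return Path(data_base_dir) / "embeddings" / f"{embeddings_file}.npz"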

When Embeddings Are Generated

1. Data Prep (Once)

Script: scripts/generate_embeddings.py
Output: data/embeddings/{backend}_{model_slug}_{split_slug}.npz (+ .json, .meta.json, optional .tags.json)

Generates reference embeddings for training set transcripts. Run once before few-shot mode.
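
The resulting artifact can be inspected with plain numpy; note that the array keys inside the .npz are not documented here, so check the script's actual output:

import json
import numpy as np

name = "huggingface_qwen3_8b_paper_train_participant_only"
arrays = np.load(f"data/embeddings/{name}.npz")      # reference embedding arrays
with open(f"data/embeddings/{name}.json") as f:
    companion = json.load(f)                         # companion metadata

print("npz keys:", arrays.files)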

2. Runtime (Every Assessment in Few-Shot Mode)

Location: EmbeddingService.embed_text() called from QuantitativeAssessmentAgent

Flow:

Transcript → Extract Evidence → Embed Evidence → Cosine Similarity → Reference Matches
                                     ↑                    ↑
                              (runtime embed)    (pre-computed refs)

Consistency requirement: reference embeddings and runtime embeddings must come from the same backend; mixing precisions skews the similarity scores.
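
A minimal numpy sketch of this lookup (hypothetical shapes and names; the real logic lives in EmbeddingService and QuantitativeAssessmentAgent):

import numpy as np

def top_k_references(query: np.ndarray, refs: np.ndarray, k: int = 3) -> np.ndarray:
    # refs: (n_refs, dim) pre-computed reference embeddings, L2-normalized.
    # query: (dim,) runtime embedding of the extracted evidence.
    q = query / np.linalg.norm(query)
    sims = refs @ q                        # cosine similarity against every reference
    return np.argsort(sims)[::-1][:k]      # indices of the k most similar references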


Pipeline Flow

┌────────────────────────────────────────────────────────────────────┐
│                             PIPELINE                               │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  Transcript                                                        │
│      │                                                             │
│      ▼                                                             │
│  ┌─────────────────────┐                                           │
│  │ QUALITATIVE AGENT   │  Model: gemma3:27b (chat)                 │
│  │ (assess symptoms)   │  Backend: LLM_BACKEND (default: ollama)   │
│  └─────────────────────┘                                           │
│      │                                                             │
│      ▼                                                             │
│  ┌─────────────────────┐                                           │
│  │ JUDGE AGENT         │  Model: gemma3:27b (chat)                 │
│  │ (evaluate + refine) │  Backend: LLM_BACKEND                     │
│  └─────────────────────┘                                           │
│      │  ↺ feedback loop (max 10 iterations)                        │
│      ▼                                                             │
│  ┌─────────────────────┐                                           │
│  │ QUANTITATIVE AGENT  │  Model: gemma3:27b OR medgemma:27b        │
│  │ (PHQ-8 scoring)     │  Backend: LLM_BACKEND (default: ollama)   │
│  │                     │                                           │
│  │  Few-shot mode:     │                                           │
│  │  - Embed evidence   │  Model: qwen3-embedding:8b (resolved)     │
│  │  - Find references  │  Backend: EMBEDDING_BACKEND (default: hf) │
│  └─────────────────────┘                                           │
│      │                                                             │
│      ▼                                                             │
│  ┌─────────────────────┐                                           │
│  │ META-REVIEW AGENT   │  Model: gemma3:27b (chat)                 │
│  │ (final severity)    │  Backend: LLM_BACKEND                     │
│  └─────────────────────┘                                           │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Factory Logic (Current)

# factory.py - NO FALLBACK
def create_llm_client(settings: Settings) -> LLMClient:
    backend = settings.backend.backend
    if backend == LLMBackend.OLLAMA:
        return OllamaClient(settings.ollama)
    if backend == LLMBackend.HUGGINGFACE:
        return HuggingFaceClient(...)  # HF deps are loaded lazily at first use
    raise ValueError(f"Unsupported backend: {backend}")


def create_embedding_client(settings: Settings) -> EmbeddingClient:
    backend = settings.embedding_config.backend
    if backend == EmbeddingBackend.OLLAMA:
        return OllamaClient(settings.ollama)
    if backend == EmbeddingBackend.HUGGINGFACE:
        return HuggingFaceClient(...)  # HF deps are loaded lazily at first use
    raise ValueError(f"Unsupported embedding backend: {backend}")
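
Usage is a single call per backend at startup, so a misconfiguration fails before any transcript is processed (a sketch; Settings construction details are elided):

settings = Settings()                        # reads LLM_BACKEND / EMBEDDING_BACKEND from the environment
llm_client = create_llm_client(settings)     # raises ValueError on an unsupported backend
embedding_client = create_embedding_client(settings)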

Configuration Scenarios

Scenario 1: Default (Validated Baseline)

# .env (validated baseline - better embeddings)
LLM_BACKEND=ollama                # Chat: Ollama Q4_K_M
EMBEDDING_BACKEND=huggingface     # Embed: HuggingFace FP16

Requires: pip install 'ai-psychiatrist[hf]'

Scenario 2: Pure Ollama (Legacy Baseline, Lower-Quality Similarity)

LLM_BACKEND=ollama
EMBEDDING_BACKEND=ollama          # Opt-out of HF embeddings

All models Q4_K_M. Matches Paper Section 2.3.5 exactly.

Scenario 3: MedGemma for Quant Agent (Appendix F)

LLM_BACKEND=huggingface           # Required for official MedGemma
MODEL_QUANTITATIVE_MODEL=medgemma:27b
EMBEDDING_BACKEND=huggingface     # Keep FP16 embeddings

Requires approved MedGemma access on HuggingFace. Result: 18% better item MAE (0.505 vs 0.619), but fewer predictions.

Scenario 4: Full HuggingFace (Maximum Precision)

LLM_BACKEND=huggingface           # Chat: FP16
EMBEDDING_BACKEND=huggingface     # Embed: FP16

Everything FP16. Requires ~54GB VRAM for chat + ~16GB for embeddings.


What We Do NOT Support (By Design)

  1. Runtime fallback: HF unavailable → silently use Ollama (breaks reproducibility)
  2. Model substitution: medgemma → gemma3 (different clinical behavior)
  3. Mixed embedding precision: FP16 refs + Q4_K_M runtime (breaks similarity scores)

Final Architecture

Default Configuration

# .env (defaults)
LLM_BACKEND=ollama                        # Chat: Ollama (baseline)
EMBEDDING_BACKEND=huggingface             # Embedding: HuggingFace (better precision)
MODEL_QUANTITATIVE_MODEL=gemma3:27b

Startup Validation

  1. HuggingFace deps (EMBEDDING_BACKEND=huggingface):
     - Transformers/torch deps are loaded lazily on the first embed/chat call.
     - If missing, the request fails with an ImportError containing: pip install 'ai-psychiatrist[hf]'

  2. Reference embedding validation:
     - If {artifact}.meta.json exists, ReferenceStore validates: backend, dimension, chunk_size, chunk_step.
     - If metadata is missing, validation is skipped (and a warning is logged when EMBEDDING_BACKEND != ollama).
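
A sketch of that metadata check (the field names follow the list above; the function name and warning text are hypothetical, not the actual ReferenceStore API):

import json
from pathlib import Path

def validate_reference_meta(artifact: Path, expected: dict) -> None:
    meta_path = artifact.with_suffix(".meta.json")   # foo.npz -> foo.meta.json
    if not meta_path.exists():
        print("warning: no .meta.json next to artifact; skipping validation")
        return
    meta = json.loads(meta_path.read_text())
    for key in ("backend", "dimension", "chunk_size", "chunk_step"):
        if meta.get(key) != expected.get(key):
            raise ValueError(f"embedding metadata mismatch on {key!r}: "
                             f"artifact={meta.get(key)!r} expected={expected.get(key)!r}")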

MedGemma: Hard Toggle

LLM_BACKEND=huggingface
MODEL_QUANTITATIVE_MODEL=medgemma:27b

If HF unavailable → FAIL LOUDLY. No silent substitution.

Why This Design?

| Decision | Reason |
|---|---|
| HF embeddings default | FP16 > Q4_K_M for similarity quality |
| Ollama chat default | Local baseline (Section 2.3.5) |
| No runtime fallback | Reproducibility > convenience |
| Precision validation | Prevents silent embedding mismatch |
| Hard toggle for MedGemma | Different clinical behavior ≠ drop-in replacement |

References

  • Paper Section 2.2: Model specification (Gemma 3 27B, Qwen 3 8B Embedding)
  • Paper Appendix F: MedGemma evaluation
  • src/ai_psychiatrist/config.py: All defaults
  • src/ai_psychiatrist/infrastructure/llm/factory.py: Client creation
  • src/ai_psychiatrist/infrastructure/llm/model_aliases.py: Backend mapping