Model Wiring: Current State
Purpose: Document exactly how models and backends are wired in the codebase.
Last Updated: 2026-01-02
Status: Implemented. LLM_BACKEND for chat, EMBEDDING_BACKEND for embeddings.
TL;DR - The Simple Truth
| Component | Backend | Default Model | Precision |
|---|---|---|---|
| Chat (all agents) | `LLM_BACKEND=ollama` | ⭐ `gemma3:27b` | Q4_K_M (4-bit) |
| Chat (quant alt) | `LLM_BACKEND=huggingface` | `medgemma:27b` | FP16 (16-bit) |
| Embedding | `EMBEDDING_BACKEND=huggingface` | `qwen3-embedding:8b` → `Qwen/Qwen3-Embedding-8B` | FP16 (16-bit) |
Key decisions:
- Chat: Ollama is the default (validated baseline). MedGemma is a hard toggle for the quant agent.
- Embedding: HuggingFace is the default (better precision). Ollama is an opt-out fallback.
Default vs Hard Toggle (The Simple Version)
| Component | Default | Hard Toggle Option |
|---|---|---|
| Qualitative Agent | Ollama (`gemma3:27b`) | — |
| Judge Agent | Ollama (`gemma3:27b`) | — |
| Meta-Review Agent | Ollama (`gemma3:27b`) | — |
| Quant Agent | Ollama (`gemma3:27b`) | HF (`medgemma:27b`) |
| Embeddings | HF (`qwen3-embedding:8b` → `Qwen/Qwen3-Embedding-8B`) | Ollama (`qwen3-embedding:8b`) |
Why this mix?
- Ollama = local, no external deps, good baseline
- HF embeddings = FP16 quality matters for similarity scores
- MedGemma = only officially available on HF (the Ollama version is a community upload)
Gemma 3 27B: All Official Options (Dec 2025)
Hardware Requirements
| Hardware | VRAM/Memory | Max Model Size |
|---|---|---|
| M1 Max 64GB | 64GB unified | ~54GB (BF16) ✅ |
| M1 Pro 32GB | 32GB unified | ~29GB (Q8_0) ✅ |
| RTX 4090 | 24GB VRAM | ~17GB (Q4) ✅ |
Ollama Options (Official Google Models)
| Tag | Quantization | Size | M1 Max 64GB | M1 Pro 32GB | RTX 4090 24GB | Quality |
|---|---|---|---|---|---|---|
| `gemma3:27b` | Q4_K_M | 17GB | ✅ | ✅ | ✅ | Good |
| `gemma3:27b-it-qat` | Q4_0 (QAT) | 17GB | ✅ | ✅ | ✅ | Better (QAT-trained) |
| `gemma3:27b-it-q8_0` | Q8_0 | 29GB | ✅ | ❌ | ❌ | Better |
What Do These Abbreviations Mean?
| Abbreviation | Full Name | Bits | What It Means |
|---|---|---|---|
| BF16 | Brain Float 16 | 16-bit | Full precision. Each weight is a 16-bit float. No quality loss. Huge memory. |
| Q8_0 | Quantized 8-bit | 8-bit | Weights compressed to 8-bit integers. ~2x smaller than BF16. Small quality loss. |
| Q4_K_M | Quantized 4-bit (K-quant Medium) | 4-bit | Weights compressed to 4-bit. ~4x smaller than BF16. Noticeable quality loss. |
| QAT | Quantization-Aware Training | 4-bit | Model was trained knowing it would be quantized. Same 4-bit size but better quality than post-hoc Q4. |
How Quantization Works (Simple Version)
Original model (BF16): each of 27 billion weights is stored as a 16-bit float → 54GB.

Post-hoc quantization (Q4_K_M, Q8_0): take the trained model and compress the weights after training.
- Like compressing a JPEG after taking the photo
- Some information is lost in compression

Quantization-Aware Training (QAT): train the model knowing it will be compressed.
- Like shooting a photo knowing it will be a JPEG: you optimize for the output format
- Google claims this preserves BF16 quality at Q4 size
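The sizes in the tables above follow from simple arithmetic. A sketch, where the bits-per-weight figures are rough effective values (including quantization scale overhead) assumed for illustration, not measured from the actual model files:

```python
# Back-of-envelope, weights-only memory estimate for a 27B-parameter
# model at different precisions. The bits-per-weight values below are
# approximate assumptions (Q8_0 and Q4_K_M store quantization scales
# alongside the weights), so real file sizes differ slightly.
PARAMS = 27e9  # 27 billion weights

def model_size_gb(bits_per_weight: float) -> float:
    """Approximate weights-only size in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bpw in [("BF16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 5.0)]:
    print(f"{name}: ~{model_size_gb(bpw):.0f} GB")
```

With these assumed effective bit widths the estimates land on the ~54GB / ~29GB / ~17GB figures quoted in the tables.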
Quality Ranking (Best → Worst)
```
BF16 (54GB)  >  Q8_0 (29GB)  >  QAT Q4 (17GB)  ≈  Q4_K_M (17GB)
     ↑              ↑                ↑                  ↑
  Perfect       Very Good       Good (smart)       Good (dumb)
```
Bottom line: QAT is the sweet spot - same size as Q4_K_M but trained smarter.
HuggingFace Options (Full Precision)
| Model | HuggingFace ID | Precision | Size | M1 Max 64GB | RTX 4090 | Access |
|---|---|---|---|---|---|---|
| Gemma 3 27B | `google/gemma-3-27b-it` | BF16 | ~54GB | ✅ | ❌ | Open |
| MedGemma 27B | `google/medgemma-27b-text-it` | BF16 | ~54GB | ✅ | ❌ | Gated |
Other Models (Embedding + Community)
| Model | Backend | Tag/ID | Quantization | Size |
|---|---|---|---|---|
| Qwen3 Embedding 8B | Ollama | `qwen3-embedding:8b` | Q4_K_M | 4.7GB |
| Qwen3 Embedding 8B | HuggingFace | `Qwen/Qwen3-Embedding-8B` | FP16 | ~16GB |
| MedGemma 27B | Ollama | `alibayram/medgemma:27b` | Q4_K_M | ~17GB |
Note: MedGemma on Ollama is a community upload, NOT official Google.
Gemma 3 27B Options (All Agents)
| Model | Backend | Tag/ID | Bits | Size | M1 Max | 4090 | Speed | Quality |
|---|---|---|---|---|---|---|---|---|
| Gemma 3 27B | HF | `google/gemma-3-27b-it` | 16-bit | 54GB | ✅ | ❌ | Slow | Best (BF16) |
| Gemma 3 27B | Ollama | `gemma3:27b-it-q8_0` | 8-bit | 29GB | ✅ | ❌ | Medium | Very Good |
| Gemma 3 27B | Ollama | `gemma3:27b-it-qat` | 4-bit | 17GB | ✅ | ✅ | Fast | Good (QAT-trained) |
| Gemma 3 27B | Ollama | `gemma3:27b` | 4-bit | 17GB | ✅ | ✅ | Fast | Good (Q4_K_M) |
Paper reality check: the paper text claims a MacBook M3 Pro, but the repo contains A100 SLURM scripts. The paper likely ran BF16 on A100s for the reported 0.619 MAE; our QAT 4-bit zero-shot run achieved 0.717 MAE (see docs/results/reproduction-results.md).
Which Model Should We Use?
| Goal | Model | Why |
|---|---|---|
| ⭐ RECOMMENDED | `gemma3:27b-it-qat` (4-bit) | QAT-trained, same speed as Q4, claims BF16 quality |
| Closer to paper's likely setup | `gemma3:27b-it-q8_0` (8-bit) | Paper likely used BF16 on A100s; Q8 is closest but slow |
| Paper baseline (current default) | `gemma3:27b` (4-bit) | Post-hoc Q4_K_M; stable, widely available on Ollama |
| Maximum quality | HF `gemma-3-27b-it` (16-bit) | Full BF16, 54GB, very slow on M1 |
Estimated Run Times (Full Pipeline, 41 Transcripts)
| Model | Est. Time | Notes |
|---|---|---|
| Ollama 4-bit (17GB) | ~2-4 hours | Current default |
| Ollama 8-bit (29GB) | ~6-12 hours | Recommended for reproduction |
| HF BF16 (54GB) | ~12-24+ hours | Memory-bound on M1 |
MedGemma 27B (Quantitative Agent ONLY)
NOT a general model option. MedGemma is ONLY for the quantitative agent as a hard toggle.
| Model | Backend | Tag/ID | Size | Access | Notes |
|---|---|---|---|---|---|
| MedGemma 27B | HF | `google/medgemma-27b-text-it` | 54GB | Gated | Official, medical fine-tuned |
| MedGemma 27B | Ollama | `alibayram/medgemma:27b` | 17GB | Open | Community upload, NOT official |
Paper finding (Appendix F): MedGemma got better MAE (0.505 vs 0.619) but made fewer predictions. The paper chose Gemma 3 for main results because MedGemma was too conservative.
To enable MedGemma:

```bash
LLM_BACKEND=huggingface
MODEL_QUANTITATIVE_MODEL=medgemma:27b
```
Current Configuration (Code Defaults)
Backends
| Setting | Default | Purpose |
|---|---|---|
| `LLM_BACKEND` | `ollama` | Chat models (all agents) |
| `EMBEDDING_BACKEND` | `huggingface` | Embedding model only |
No runtime fallback: if the configured backend fails, you get a loud error with instructions.
Chat Models (All Agents)
| Agent | Config Key | Default | Paper Reference |
|---|---|---|---|
| Qualitative | `MODEL_QUALITATIVE_MODEL` | `gemma3:27b` | Section 2.2 |
| Judge | `MODEL_JUDGE_MODEL` | `gemma3:27b` | Section 2.2 |
| Meta-Review | `MODEL_META_REVIEW_MODEL` | `gemma3:27b` | Section 2.2 |
| Quantitative | `MODEL_QUANTITATIVE_MODEL` | `gemma3:27b` | Section 2.2 |
MedGemma (`medgemma:27b`) is an ALTERNATIVE for the quantitative agent only (Appendix F). It requires `LLM_BACKEND=huggingface` for official weights; the Ollama community version may behave differently.
Embedding Model
| Setting | Default | Backend | Precision |
|---|---|---|---|
| `MODEL_EMBEDDING_MODEL` | `qwen3-embedding:8b` | Resolved per backend | Q4 (Ollama) / FP16 (HF) |
Why HF default for embeddings? FP16 embeddings produce better similarity scores than Q4_K_M.
To use Ollama instead: set `EMBEDDING_BACKEND=ollama` (uses `qwen3-embedding:8b` Q4_K_M).
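The alias behavior above (the same configured name resolving to an Ollama tag or a HuggingFace hub ID depending on backend) can be sketched as a simple lookup. This dict is an illustrative assumption built from the mappings listed in this document; the real mapping lives in `model_aliases.py`:

```python
# Illustrative per-backend alias table (an assumption, not the actual
# contents of model_aliases.py). Ollama tags pass through unchanged;
# HuggingFace names are mapped to hub IDs.
ALIASES = {
    "huggingface": {
        "qwen3-embedding:8b": "Qwen/Qwen3-Embedding-8B",
        "gemma3:27b": "google/gemma-3-27b-it",
        "medgemma:27b": "google/medgemma-27b-text-it",
    },
}

def resolve_model(backend: str, name: str) -> str:
    """Resolve a configured model name for the given backend."""
    return ALIASES.get(backend, {}).get(name, name)

print(resolve_model("huggingface", "qwen3-embedding:8b"))  # Qwen/Qwen3-Embedding-8B
print(resolve_model("ollama", "qwen3-embedding:8b"))       # qwen3-embedding:8b
```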
Embedding Artifacts
| Setting | Default | Purpose |
|---|---|---|
| `EMBEDDING_EMBEDDINGS_FILE` | `huggingface_qwen3_8b_paper_train_participant_only` | Selects `{DATA_BASE_DIR}/embeddings/{name}.npz` (+ `.json`, optional `.meta.json`, optional `.tags.json`) |
| `DATA_EMBEDDINGS_PATH` | `data/embeddings/huggingface_qwen3_8b_paper_train_participant_only.npz` | Full-path override (takes precedence over `EMBEDDING_EMBEDDINGS_FILE`) |
When Embeddings Are Generated
1. Data Prep (Once)
Script: `scripts/generate_embeddings.py`
Output: `data/embeddings/{backend}_{model_slug}_{split_slug}.npz` (+ `.json`, `.meta.json`, optional `.tags.json`)
Generates reference embeddings for training set transcripts. Run once before few-shot mode.
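The naming convention above can be sketched as follows; the helper function is hypothetical, only the `{backend}_{model_slug}_{split_slug}` pattern and the `data/embeddings/` directory come from this document:

```python
# Sketch of the embedding-artifact naming convention. artifact_base()
# is an illustrative helper, not part of the codebase.
from pathlib import PurePosixPath

def artifact_base(backend: str, model_slug: str, split_slug: str) -> str:
    """Compose the artifact stem: {backend}_{model_slug}_{split_slug}."""
    return f"{backend}_{model_slug}_{split_slug}"

base = artifact_base("huggingface", "qwen3_8b", "paper_train_participant_only")
npz = PurePosixPath("data/embeddings") / f"{base}.npz"
print(npz)  # data/embeddings/huggingface_qwen3_8b_paper_train_participant_only.npz
```

Note that this reproduces the `DATA_EMBEDDINGS_PATH` default exactly.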
2. Runtime (Every Assessment in Few-Shot Mode)
Location: `EmbeddingService.embed_text()` called from `QuantitativeAssessmentAgent`
Flow:
```
Transcript → Extract Evidence → Embed Evidence → Cosine Similarity → Reference Matches
                                      ↑                                      ↑
                               (runtime embed)                      (pre-computed refs)
```
Consistency requirement: Reference embeddings and runtime embeddings should use the same backend for precision consistency.
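The matching step in the flow above can be sketched with cosine similarity over pre-computed reference vectors. This is a hypothetical illustration, not the codebase's actual API; the function name and the toy data are assumptions:

```python
# Hypothetical sketch of few-shot matching: embed evidence at runtime,
# then cosine-compare against pre-computed reference embeddings (e.g.
# the rows loaded from the .npz artifact).
import numpy as np

def top_k_references(evidence_vec: np.ndarray, refs: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k most similar reference rows."""
    # Normalize both sides so the dot product equals cosine similarity.
    ev = evidence_vec / np.linalg.norm(evidence_vec)
    rn = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    sims = rn @ ev
    order = np.argsort(sims)[::-1][:k]  # highest similarity first
    return order, sims[order]

refs = np.random.default_rng(0).normal(size=(10, 8))  # 10 toy reference vectors
query = refs[4] + 0.01                                # nearly identical to row 4
idx, scores = top_k_references(query, refs)
print(idx[0])  # row 4 is the best match
```

This also shows why mixed precision is forbidden: if `refs` were computed at Q4 and `query` at FP16, the similarity scores would shift systematically.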
Pipeline Flow
┌────────────────────────────────────────────────────────────────────┐
│ PIPELINE │
├────────────────────────────────────────────────────────────────────┤
│ │
│ Transcript │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ QUALITATIVE AGENT │ Model: gemma3:27b (chat) │
│ │ (assess symptoms) │ Backend: LLM_BACKEND (default: ollama) │
│ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ JUDGE AGENT │ Model: gemma3:27b (chat) │
│ │ (evaluate + refine) │ Backend: LLM_BACKEND │
│ └─────────────────────┘ │
│ │ ↺ feedback loop (max 10 iterations) │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ QUANTITATIVE AGENT │ Model: gemma3:27b OR medgemma:27b │
│ │ (PHQ-8 scoring) │ Backend: LLM_BACKEND (default: ollama) │
│ │ │ │
│ │ Few-shot mode: │ │
│ │ - Embed evidence │ Model: qwen3-embedding:8b (resolved) │
│ │ - Find references │ Backend: EMBEDDING_BACKEND (default: hf) │
│ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ META-REVIEW AGENT │ Model: gemma3:27b (chat) │
│ │ (final severity) │ Backend: LLM_BACKEND │
│ └─────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────┘
Factory Logic (Current)
```python
# factory.py - NO FALLBACK
def create_llm_client(settings: Settings) -> LLMClient:
    backend = settings.backend.backend
    if backend == LLMBackend.OLLAMA:
        return OllamaClient(settings.ollama)
    if backend == LLMBackend.HUGGINGFACE:
        return HuggingFaceClient(...)  # HF deps are loaded lazily at first use
    raise ValueError(f"Unsupported backend: {backend}")


def create_embedding_client(settings: Settings) -> EmbeddingClient:
    backend = settings.embedding_config.backend
    if backend == EmbeddingBackend.OLLAMA:
        return OllamaClient(settings.ollama)
    if backend == EmbeddingBackend.HUGGINGFACE:
        return HuggingFaceClient(...)  # HF deps are loaded lazily at first use
    raise ValueError(f"Unsupported embedding backend: {backend}")
```
Configuration Scenarios
Scenario 1: Default (Recommended)
```bash
# .env (validated baseline - better embeddings)
LLM_BACKEND=ollama              # Chat: Ollama Q4_K_M
EMBEDDING_BACKEND=huggingface   # Embed: HuggingFace FP16
```

Requires: `pip install 'ai-psychiatrist[hf]'`
Scenario 2: Pure Ollama (Legacy Baseline, Lower-Quality Similarity)
```bash
LLM_BACKEND=ollama
EMBEDDING_BACKEND=ollama   # Opt out of HF embeddings
```

All models Q4_K_M. Matches Paper Section 2.3.5 exactly.
Scenario 3: MedGemma for Quant Agent (Appendix F)
```bash
LLM_BACKEND=huggingface         # Required for official MedGemma
MODEL_QUANTITATIVE_MODEL=medgemma:27b
EMBEDDING_BACKEND=huggingface   # Keep FP16 embeddings
```

Requires MedGemma access approved on HuggingFace. Result: 18% better item MAE (0.505 vs 0.619) but fewer predictions.
Scenario 4: Full HuggingFace (Maximum Precision)
```bash
LLM_BACKEND=huggingface         # Chat: FP16
EMBEDDING_BACKEND=huggingface   # Embed: FP16
```

Everything FP16. Requires ~54GB VRAM for chat plus ~16GB for embeddings.
What We Do NOT Support (By Design)
- Runtime fallback: HF unavailable → silently use Ollama (breaks reproducibility)
- Model substitution: medgemma → gemma3 (different clinical behavior)
- Mixed embedding precision: FP16 refs + Q4_K_M runtime (breaks similarity scores)
Final Architecture
Default Configuration
```bash
# .env (defaults)
LLM_BACKEND=ollama              # Chat: Ollama (baseline)
EMBEDDING_BACKEND=huggingface   # Embedding: HuggingFace (better precision)
MODEL_QUANTITATIVE_MODEL=gemma3:27b
```
Startup Validation
- HuggingFace deps (`EMBEDDING_BACKEND=huggingface`):
  - Transformers/torch deps are loaded lazily on the first embed/chat call.
  - If missing, the request fails with an `ImportError` containing: `pip install 'ai-psychiatrist[hf]'`
- Reference embedding validation:
  - If `{artifact}.meta.json` exists, `ReferenceStore` validates: `backend`, `dimension`, `chunk_size`, `chunk_step`.
  - If metadata is missing, validation is skipped (and a warning is logged when `EMBEDDING_BACKEND != ollama`).
MedGemma: Hard Toggle
```bash
LLM_BACKEND=huggingface
MODEL_QUANTITATIVE_MODEL=medgemma:27b
```

If HF is unavailable → FAIL LOUDLY. No silent substitution.
Why This Design?
| Decision | Reason |
|---|---|
| HF embeddings default | FP16 > Q4_K_M for similarity quality |
| Ollama chat default | Local baseline (Section 2.3.5) |
| No runtime fallback | Reproducibility > convenience |
| Precision validation | Prevent silent embedding mismatch |
| Hard toggle for MedGemma | Different clinical behavior ≠ drop-in replacement |
References
- Paper Section 2.2: Model specification (Gemma 3 27B, Qwen 3 8B Embedding)
- Paper Appendix F: MedGemma evaluation
- `src/ai_psychiatrist/config.py`: All defaults
- `src/ai_psychiatrist/infrastructure/llm/factory.py`: Client creation
- `src/ai_psychiatrist/infrastructure/llm/model_aliases.py`: Backend mapping