Agent Sampling Parameter Registry
Purpose: Single Source of Truth (SSOT) for all agent sampling parameters.
Last Updated: 2026-01-04
Related: Configuration Reference | GitHub Issue #46 | BUG-027
Pipeline Architecture
```text
Transcript
          │
          ▼
┌─────────────────────┐
│ 1. QUALITATIVE      │ ← Analyzes transcript, outputs narrative
└─────────────────────┘   (social, biological, risk factors)
          │
          ▼
┌─────────────────────┐
│ 2. JUDGE            │ ← Scores qualitative output (4 metrics)
└─────────────────────┘   (coherence, completeness, specificity, accuracy)
          │ ↺ loops back if any score < 4 (max 10 iterations)
          ▼
┌─────────────────────┐
│ 3. QUANTITATIVE     │ ← Selectively scores PHQ-8 items (0-3) or N/A
└─────────────────────┘   (zero-shot OR few-shot with embeddings)
          │
          ▼
┌─────────────────────┐
│ 4. META-REVIEW      │ ← Integrates all assessments
└─────────────────────┘   → Final severity (0-4) + MDD indicator
```
Validity note: PHQ-8 item scores are frequency-based; transcript-only item scoring is often underdetermined, so N/A outputs and <100% coverage are expected. Interpret quantitative results as selective prediction (coverage + AURC/AUGRC); see docs/clinical/task-validity.md.
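The control flow above, including the judge loop in stage 2, can be sketched in Python. `run_agent` is a hypothetical stand-in for the real model call (it is not a function from this codebase); the sketch only shows the retry-until-passing loop:

```python
def judged_qualitative(transcript, run_agent, max_iterations=10):
    """Stage 1 + 2: re-run the qualitative agent until the judge
    scores all four metrics >= 4, capped at max_iterations.

    run_agent("qualitative", ...) -> narrative string
    run_agent("judge", ...)       -> dict of 4 metric scores (1-5)
    """
    narrative = run_agent("qualitative", transcript)
    for iteration in range(1, max_iterations + 1):
        scores = run_agent("judge", narrative)
        # Loop back only if any of the 4 metrics scores below 4.
        if min(scores.values()) >= 4:
            return narrative, iteration
        narrative = run_agent("qualitative", transcript)
    return narrative, max_iterations
```

The accepted narrative then feeds stages 3 and 4 unchanged; capping at 10 iterations guarantees termination even if the judge never passes the output.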
Our Implementation (Evidence-Based)
The Config
```python
# All agents: temperature=0.0, nothing else
temperature = 0.0
```
That's it.
Parameter Table
| Agent | temp | top_k | top_p | Rationale |
|---|---|---|---|---|
| Qualitative | 0.0 | — | — | Clinical extraction, not creative writing |
| Judge | 0.0 | — | — | Classification (1-5 scoring) |
| Quantitative (zero-shot) | 0.0 | — | — | Selective scoring (0-3 or N/A) |
| Quantitative (few-shot) | 0.0 | — | — | Selective scoring (0-3 or N/A) |
| Meta-Review | 0.0 | — | — | Classification (severity 0-4) |
Note on "—": Don't set these. At temp=0 they're irrelevant (greedy decoding), and best practice is "use temperature only."
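In code, the rule above reduces to a request options payload that carries only `temperature`. A minimal sketch (the helper name is illustrative, not from this codebase):

```python
def sampling_options(temperature: float = 0.0) -> dict:
    """Build a sampling-options payload carrying temperature only.

    top_k / top_p are deliberately omitted: at temperature=0 greedy
    decoding makes them inert, and best practice is to tune
    temperature alone.
    """
    # Per BUG-027: 0.0 for primary inference, 0.2 for consistency
    # sampling; 0.3+ is "unpredictable", so reject it here.
    if not 0.0 <= temperature < 0.3:
        raise ValueError("temperature must be in [0.0, 0.3)")
    return {"temperature": temperature}
```

Keeping the payload to a single key also sidesteps the newer Claude API error for requests that set both temperature and top_p.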
Consistency Sampling (Spec 050)
Multi-sample scoring for agreement-based confidence signals.
Why Consistency Needs Non-Zero Temperature
| Mode | Temperature | Rationale |
|---|---|---|
| Primary inference | 0.0 | Deterministic, reproducible scores |
| Consistency sampling | 0.2 | Needs variance to measure agreement |
At temp=0.0, all N consistency samples would be identical (greedy decoding). You need temp>0 to generate diverse reasoning paths for self-consistency signals.
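One way to turn the N samples into an agreement-based confidence signal is a majority vote plus agreement rate. A sketch, assuming each sample yields a discrete PHQ-8 item score (function name is illustrative):

```python
from collections import Counter

def consistency_signal(samples):
    """Majority vote plus agreement rate over N sampled scores.

    samples: discrete scores (e.g. PHQ-8 item values 0-3 or "N/A")
    returns: (majority score, fraction of samples agreeing with it)
    """
    counts = Counter(samples)
    score, votes = counts.most_common(1)[0]
    return score, votes / len(samples)

# e.g. 5 samples drawn at temperature 0.2
score, agreement = consistency_signal([2, 2, 2, 3, 2])
```

High agreement (here 4 of 5 samples) suggests the score is stable under resampling; low agreement flags an item worth abstaining on or escalating.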
Temperature Selection (BUG-027)
| Value | Use Case | Research Basis |
|---|---|---|
| 0.0 | Primary inference | Med-PaLM, clinical best practice |
| 0.2 | Consistency sampling | Clinical studies define "low" threshold |
| 0.3+ | Not recommended | Performance becomes "unpredictable" |
Research Evidence:
- How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment (arXiv 2509.19329): defines 0.2 as "low"
- GPT-4 Clinical Depression Assessment (arXiv 2501.00199): notes unpredictability at ≥0.3
Environment Variables
```shell
# Primary inference: temp=0 (deterministic)
MODEL_TEMPERATURE=0.0

# Consistency sampling: temp=0.2 (low-variance for clinical)
CONSISTENCY_ENABLED=true
CONSISTENCY_N_SAMPLES=5
CONSISTENCY_TEMPERATURE=0.2  # BUG-027: updated from 0.3
```
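A hedged sketch of loading these variables into a typed config; defaults mirror the values above, while the dataclass and function names are illustrative, not from this codebase:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    model_temperature: float = 0.0        # primary inference
    consistency_enabled: bool = True
    consistency_n_samples: int = 5
    consistency_temperature: float = 0.2  # BUG-027: was 0.3

def load_sampling_config(env=None) -> SamplingConfig:
    """Read sampling settings from the environment, with the
    registry defaults as fallbacks."""
    env = os.environ if env is None else env
    return SamplingConfig(
        model_temperature=float(env.get("MODEL_TEMPERATURE", "0.0")),
        consistency_enabled=env.get("CONSISTENCY_ENABLED", "true").lower() == "true",
        consistency_n_samples=int(env.get("CONSISTENCY_N_SAMPLES", "5")),
        consistency_temperature=float(env.get("CONSISTENCY_TEMPERATURE", "0.2")),
    )
```

Freezing the dataclass keeps the loaded values from drifting at runtime, which is the point of having an SSOT for these parameters.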
Why These Settings (With Citations)
| Source | What It Says | Link |
|---|---|---|
| Med-PaLM | "sampled with temperature 0.0" for clinical answers | Nature Medicine |
| medRxiv 2025 | "Lower temperatures promote diagnostic accuracy and consistency" | Study |
| Anthropic | "Use temperature closer to 0.0 for analytical / multiple choice" | PromptHub |
| Anthropic | "top_k is recommended for advanced use cases only. You usually only need to use temperature" | PromptHub |
| Anthropic/AWS | "You should alter either temperature or top_p, but not both" | AWS Bedrock |
| Claude 4.x APIs | Returns ERROR: "temperature and top_p cannot both be specified" | GitHub |
| IBM | Low temperature reduces randomness but should be combined with RAG, calibration, and human oversight for clinical safety | IBM Think |
Best Practice: Use Temperature Only (2025)
Why Not Both top_k AND top_p?
"OpenAI and most AI companies recommend changing one or the other, not both." — OpenAI Community
"Newer Claude models return an error: 'temperature and top_p cannot both be specified'" — GitHub Issue
Why Not top_k At All?
"Top K is not a terribly useful parameter... not as well-supported, notably missing from OpenAI's API." — Vellum
2025 Reality:
- Use temperature only (preferred)
- OR top_p only (alternative)
- Never both: newer Claude models literally error
- top_k is obsolete: not even in OpenAI's API
At temp=0, Do top_k/top_p Matter?
No. At temperature=0, you get greedy decoding (argmax). There's no sampling, so sampling filters (top_k, top_p) have nothing to filter.
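This falls out of temperature-scaled softmax directly: as the temperature approaches 0, all probability mass collapses onto the argmax token, leaving nothing for top_k/top_p to filter. A small numerical illustration (stdlib only):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; temperature=0 means greedy
    decoding (pure argmax), the convention most APIs follow."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
# At T=0 the distribution is [1.0, 0.0, 0.0]: every token except the
# argmax already has zero probability, so a top_k or top_p filter
# has nothing left to remove.
```

At any T > 0 the distribution spreads over all tokens, which is exactly why consistency sampling (Spec 050) needs a non-zero temperature to produce variance.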
What the Paper Says
| Parameter | Paper Says | Section |
|---|---|---|
| Model | Gemma 3 27B | Section 2.2 |
| Temperature | "fairly deterministic" | Section 4 |
| top_k / top_p | NOT SPECIFIED | — |
| Per-agent sampling | NOT SPECIFIED | — |
The paper only tuned embedding hyperparameters (Appendix D):
- chunk_size = 8
- top_k_references = 2
- dimension = 4096
Sampling parameters were NOT tuned or specified.
What Their Code Does (Reference Only)
Their Codebase is Internally Contradictory
| Source | temp | top_k | top_p | Notes |
|---|---|---|---|---|
| basic_quantitative_analysis.ipynb:207 | 0 | 1 | 1.0 | Zero-shot |
| basic_quantitative_analysis.ipynb:599 | 0.1 | 10 | — | Experimental? |
| embedding_quantitative_analysis.ipynb:1174 | 0.2 | 20 | 0.8 | Few-shot |
| qual_assessment.py | 0 | 20 | 0.9 | Wrong model default |
| meta_review.py | 0 | 20 | 1.0 | Wrong model default |
The Python files also ship wrong model defaults (llama3 instead of gemma3), so the upstream code cannot be trusted as an SSOT.
Our decision: use evidence-based clinical AI defaults rather than reverse-engineering contradictory upstream code.
Environment Variables
```shell
# Clinical AI default: temp=0 only
MODEL_TEMPERATURE=0.0

# top_k and top_p: DO NOT SET
# - At temp=0 they're irrelevant (greedy decoding)
# - Best practice: "use temperature only, not both"
# - Claude 4.x APIs error if you set both temp and top_p
```