
Agent Sampling Parameter Registry

Purpose: Single Source of Truth (SSOT) for all agent sampling parameters.
Last Updated: 2026-01-04
Related: Configuration Reference | GitHub Issue #46 | BUG-027


Pipeline Architecture

Transcript
    │
    ▼
┌─────────────────────┐
│ 1. QUALITATIVE      │  ← Analyzes transcript, outputs narrative
└─────────────────────┘     (social, biological, risk factors)
    │
    ▼
┌─────────────────────┐
│ 2. JUDGE            │  ← Scores qualitative output (4 metrics)
└─────────────────────┘     (coherence, completeness, specificity, accuracy)
    │                       ↺ loops back if any score < 4 (max 10 iterations)
    ▼
┌─────────────────────┐
│ 3. QUANTITATIVE     │  ← Selectively scores PHQ-8 items (0-3) or N/A
└─────────────────────┘     (zero-shot OR few-shot with embeddings)
    │
    ▼
┌─────────────────────┐
│ 4. META-REVIEW      │  ← Integrates all assessments
└─────────────────────┘     → Final severity (0-4) + MDD indicator

Validity note: PHQ-8 item scores are frequency-based; transcript-only item scoring is often underdetermined, so N/A outputs and <100% coverage are expected. Interpret quantitative results as selective prediction (coverage + AURC/AUGRC); see docs/clinical/task-validity.md.
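The judge loop in stage 2 can be sketched as follows. This is a minimal illustration of the control flow only (re-run the qualitative agent until every metric passes, capped at 10 iterations); the function and argument names are hypothetical, not the real API.

```python
# Hypothetical sketch of the stage-1/stage-2 loop: the qualitative
# agent's narrative is re-generated until all four judge metrics
# score >= 4, with a hard cap of 10 iterations.

MAX_ITERATIONS = 10
PASS_THRESHOLD = 4  # each metric is scored 1-5; any score < 4 triggers a retry

def run_with_judge(qualitative_agent, judge_agent, transcript):
    """Loop qualitative -> judge until all metrics pass or the cap is hit."""
    feedback = None
    for _ in range(MAX_ITERATIONS):
        narrative = qualitative_agent(transcript, feedback)
        scores = judge_agent(narrative)  # e.g. {"coherence": 5, "accuracy": 4, ...}
        if all(score >= PASS_THRESHOLD for score in scores.values()):
            return narrative, scores
        feedback = scores  # feed the low scores back for revision
    return narrative, scores  # best effort after max iterations
```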


Our Implementation (Evidence-Based)

The Config

# All agents: temperature=0.0, nothing else
temperature = 0.0

That's it.

Parameter Table

| Agent | temp | top_k | top_p | Rationale |
|---|---|---|---|---|
| Qualitative | 0.0 | — | — | Clinical extraction, not creative writing |
| Judge | 0.0 | — | — | Classification (1-5 scoring) |
| Quantitative (zero-shot) | 0.0 | — | — | Selective scoring (0-3 or N/A) |
| Quantitative (few-shot) | 0.0 | — | — | Selective scoring (0-3 or N/A) |
| Meta-Review | 0.0 | — | — | Classification (severity 0-4) |

Note on "—": Don't set these. At temp=0 they're irrelevant (greedy decoding), and best practice is "use temperature only."
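One way to make the "temperature only" policy hard to violate is to build every agent's sampling kwargs through a single helper. A minimal sketch (the helper name is hypothetical, not part of the codebase):

```python
# Minimal sketch: one helper builds sampling parameters for every agent
# and rejects top_k/top_p instead of silently forwarding them.

def sampling_params(temperature: float = 0.0, **extra) -> dict:
    forbidden = {"top_k", "top_p"} & extra.keys()
    if forbidden:
        raise ValueError(f"do not set {sorted(forbidden)}; use temperature only")
    return {"temperature": temperature}
```

Callers then pass `**sampling_params()` to the model client, so a stray `top_k=20` fails loudly at the call site.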


Consistency Sampling (Spec 050)

Multi-sample scoring for agreement-based confidence signals.

Why Consistency Needs Non-Zero Temperature

| Mode | Temperature | Rationale |
|---|---|---|
| Primary inference | 0.0 | Deterministic, reproducible scores |
| Consistency sampling | 0.2 | Needs variance to measure agreement |

At temp=0.0, all N consistency samples would be identical (greedy decoding). You need temp>0 to generate diverse reasoning paths for self-consistency signals.
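The agreement signal itself can be sketched as a majority vote over N samples. This is illustrative only; `sample_score` stands in for one model call at the given temperature, and the function names are assumptions.

```python
# Hypothetical sketch of consistency sampling: draw N scores at a low
# non-zero temperature and use inter-sample agreement as a confidence
# signal.
from collections import Counter

def consistency(sample_score, n_samples: int = 5, temperature: float = 0.2):
    """Return (majority score, agreement fraction) over n_samples draws."""
    scores = [sample_score(temperature=temperature) for _ in range(n_samples)]
    majority, count = Counter(scores).most_common(1)[0]
    return majority, count / n_samples
```

Note that at temperature=0.0 every draw would be identical, so the agreement fraction would always be 1.0 and carry no information; that is exactly why this mode needs temp>0.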

Temperature Selection (BUG-027)

| Value | Use Case | Research Basis |
|---|---|---|
| 0.0 | Primary inference | Med-PaLM, clinical best practice |
| 0.2 | Consistency sampling | Clinical studies define "low" threshold |
| 0.3+ | Not recommended | Performance becomes "unpredictable" |

Research Evidence:

- How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment (arXiv 2509.19329): defines 0.2 as "low"
- GPT-4 Clinical Depression Assessment (arXiv 2501.00199): notes unpredictability at temperatures ≥ 0.3

Environment Variables

# Primary inference: temp=0 (deterministic)
MODEL_TEMPERATURE=0.0

# Consistency sampling: temp=0.2 (low-variance for clinical)
CONSISTENCY_ENABLED=true
CONSISTENCY_N_SAMPLES=5
CONSISTENCY_TEMPERATURE=0.2  # BUG-027: updated from 0.3
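These variables can be loaded in one place with the documented values as defaults. A minimal sketch (the loader name is an assumption; the variable names match this section):

```python
# Sketch: read the sampling settings documented above, falling back to
# the registry defaults when a variable is unset.
import os

def load_consistency_config(environ=os.environ) -> dict:
    return {
        "model_temperature": float(environ.get("MODEL_TEMPERATURE", "0.0")),
        "enabled": environ.get("CONSISTENCY_ENABLED", "false").lower() == "true",
        "n_samples": int(environ.get("CONSISTENCY_N_SAMPLES", "5")),
        "temperature": float(environ.get("CONSISTENCY_TEMPERATURE", "0.2")),
    }
```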

Why These Settings (With Citations)

| Source | What It Says | Link |
|---|---|---|
| Med-PaLM | "sampled with temperature 0.0" for clinical answers | Nature Medicine |
| medRxiv 2025 | "Lower temperatures promote diagnostic accuracy and consistency" | Study |
| Anthropic | "Use temperature closer to 0.0 for analytical / multiple choice" | PromptHub |
| Anthropic | "top_k is recommended for advanced use cases only. You usually only need to use temperature" | PromptHub |
| Anthropic/AWS | "You should alter either temperature or top_p, but not both" | AWS Bedrock |
| Claude 4.x APIs | Returns ERROR: "temperature and top_p cannot both be specified" | GitHub |
| IBM | Low temperature reduces randomness but should be combined with RAG, calibration, and human oversight for clinical safety | IBM Think |

Best Practice: Use Temperature Only (2025)

Why Not Both top_k AND top_p?

"OpenAI and most AI companies recommend changing one or the other, not both." — OpenAI Community

"Newer Claude models return an error: 'temperature and top_p cannot both be specified'" — GitHub Issue

Why Not top_k At All?

"Top K is not a terribly useful parameter... not as well-supported, notably missing from OpenAI's API." — Vellum

2025 Reality:

- Use temperature only (preferred)
- OR top_p only (alternative)
- Never both: newer Claude models literally error
- top_k is obsolete: not even in OpenAI's API

At temp=0, Do top_k/top_p Matter?

No. At temperature=0, you get greedy decoding (argmax). There's no sampling, so sampling filters (top_k, top_p) have nothing to filter.
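A toy demonstration of why: greedy decoding takes the argmax of the logits, and truncating the distribution with top_k beforehand can never change which token has the highest logit. (Pure illustration with made-up logits, not any model's API.)

```python
# Toy demonstration that sampling filters are inert at temperature=0:
# greedy decoding is argmax over the logits, so top_k truncation
# cannot change the selected token.

def greedy_token(logits):
    """temperature=0 -> argmax; there is no distribution left to filter."""
    return max(range(len(logits)), key=lambda i: logits[i])

def greedy_token_with_top_k(logits, top_k):
    """Applying top_k first provably yields the same greedy choice."""
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    return max(kept, key=lambda i: logits[i])
```

The argmax token is always inside the top-k set for any k ≥ 1, so both functions agree for every k; the same argument applies to top_p for any p > 0.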


What the Paper Says

| Parameter | Paper Says | Section |
|---|---|---|
| Model | Gemma 3 27B | Section 2.2 |
| Temperature | "fairly deterministic" | Section 4 |
| top_k / top_p | NOT SPECIFIED | — |
| Per-agent sampling | NOT SPECIFIED | — |

The paper only tuned embedding hyperparameters (Appendix D):

- chunk_size = 8
- top_k_references = 2
- dimension = 4096

Sampling parameters were NOT tuned or specified.
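For orientation, the three tuned hyperparameters fit together roughly like this in few-shot retrieval: split the transcript into windows of chunk_size utterances, embed each window (4096-dimensional vectors in the paper), and fetch the top_k_references most similar reference examples. This is an illustrative sketch under those assumptions; the chunking granularity and similarity measure are guesses, and `top_references` is not the real code.

```python
# Illustrative sketch of chunk_size=8 / top_k_references=2 retrieval.
# Embeddings are stand-in vectors; cosine similarity is assumed.
import math

CHUNK_SIZE = 8
TOP_K_REFERENCES = 2

def chunk(utterances, size=CHUNK_SIZE):
    """Split a transcript into windows of `size` utterances."""
    return [utterances[i:i + size] for i in range(0, len(utterances), size)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_references(query_vec, reference_vecs, k=TOP_K_REFERENCES):
    """Indices of the k reference embeddings most similar to the query."""
    ranked = sorted(range(len(reference_vecs)),
                    key=lambda i: cosine(query_vec, reference_vecs[i]),
                    reverse=True)
    return ranked[:k]
```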


What Their Code Does (Reference Only)

Their Codebase is Internally Contradictory

| Source | temp | top_k | top_p | Notes |
|---|---|---|---|---|
| basic_quantitative_analysis.ipynb:207 | 0 | 1 | 1.0 | Zero-shot |
| basic_quantitative_analysis.ipynb:599 | 0.1 | 10 | — | Experimental? |
| embedding_quantitative_analysis.ipynb:1174 | 0.2 | 20 | 0.8 | Few-shot |
| qual_assessment.py | 0 | 20 | 0.9 | Wrong model default |
| meta_review.py | 0 | 20 | 1.0 | Wrong model default |

The Python files also ship the wrong model default (llama3 instead of gemma3), so the code cannot be trusted as an SSOT.

Our decision: Use evidence-based clinical AI defaults, not reverse-engineered contradictory code.


Environment Variables

# Clinical AI default: temp=0 only
MODEL_TEMPERATURE=0.0

# top_k and top_p: DO NOT SET
# - At temp=0 they're irrelevant (greedy decoding)
# - Best practice: "use temperature only, not both"
# - Claude 4.x APIs error if you set both temp and top_p

References