Agent Sampling Parameter Registry
Purpose: Single Source of Truth (SSOT) for all agent sampling parameters.
Last Updated: 2026-01-04
Related: Configuration Reference | GitHub Issue #46 | BUG-027
Pipeline Architecture
```text
Transcript
          │
          ▼
┌─────────────────────┐
│ 1. QUALITATIVE      │ ← Analyzes transcript, outputs narrative
└─────────────────────┘   (social, biological, risk factors)
          │
          ▼
┌─────────────────────┐
│ 2. JUDGE            │ ← Scores qualitative output (4 metrics)
└─────────────────────┘   (coherence, completeness, specificity, accuracy)
          │ ↺ loops back if any score < 4 (max 10 iterations)
          ▼
┌─────────────────────┐
│ 3. QUANTITATIVE     │ ← Selectively scores PHQ-8 items (0-3) or N/A
└─────────────────────┘   (zero-shot OR few-shot with embeddings)
          │
          ▼
┌─────────────────────┐
│ 4. META-REVIEW      │ ← Integrates all assessments
└─────────────────────┘   → Final severity (0-4) + MDD indicator
```
Validity note: PHQ-8 item scores are frequency-based; transcript-only item scoring is often underdetermined, so N/A outputs and <100% coverage are expected. Interpret quantitative results as selective prediction (coverage + AURC/AUGRC); see docs/clinical/task-validity.md.
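The control flow above, including the judge loop in stage 2, can be sketched in Python. `run_agent` is a hypothetical stand-in for the real model call (it is not a function from this codebase); the sketch only shows the retry-until-passing loop:

```python
def judged_qualitative(transcript, run_agent, max_iterations=10):
    """Stage 1 + 2: re-run the qualitative agent until the judge
    scores all four metrics >= 4, capped at max_iterations.

    run_agent("qualitative", ...) -> narrative string
    run_agent("judge", ...)       -> dict of 4 metric scores (1-5)
    """
    narrative = run_agent("qualitative", transcript)
    for iteration in range(1, max_iterations + 1):
        scores = run_agent("judge", narrative)
        # Loop back only if any of the 4 metrics scores below 4.
        if min(scores.values()) >= 4:
            return narrative, iteration
        narrative = run_agent("qualitative", transcript)
    return narrative, max_iterations
```

The accepted narrative then feeds stages 3 and 4 unchanged; capping at 10 iterations guarantees termination even if the judge never passes the output.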
Our Implementation (Evidence-Based)
The Config
```python
# All agents: temperature=0.0, nothing else
temperature = 0.0
```
That's it.
Parameter Table
| Agent | temp | top_k | top_p | Rationale |
|---|---|---|---|---|
| Qualitative | 0.0 | — | — | Clinical extraction, not creative writing |
| Judge | 0.0 | — | — | Classification (1-5 scoring) |
| Quantitative (zero-shot) | 0.0 | — | — | Selective scoring (0-3 or N/A) |
| Quantitative (few-shot) | 0.0 | — | — | Selective scoring (0-3 or N/A) |
| Meta-Review | 0.0 | — | — | Classification (severity 0-4) |
Note on "—": Don't set these. At temp=0 they're irrelevant (greedy decoding), and best practice is "use temperature only."
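In code, the rule above reduces to a request options payload that carries only `temperature`. A minimal sketch (the helper name is illustrative, not from this codebase):

```python
def sampling_options(temperature: float = 0.0) -> dict:
    """Build a sampling-options payload carrying temperature only.

    top_k / top_p are deliberately omitted: at temperature=0 greedy
    decoding makes them inert, and best practice is to tune
    temperature alone.
    """
    # Per BUG-027: 0.0 for primary inference, 0.2 for consistency
    # sampling; 0.3+ is "unpredictable", so reject it here.
    if not 0.0 <= temperature < 0.3:
        raise ValueError("temperature must be in [0.0, 0.3)")
    return {"temperature": temperature}
```

Keeping the payload to a single key also sidesteps the newer Claude API error for requests that set both temperature and top_p.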
Consistency Sampling (Spec 050)
Multi-sample scoring for agreement-based confidence signals.
Why Consistency Needs Non-Zero Temperature
| Mode | Temperature | Rationale |
|---|---|---|
| Primary inference | 0.0 | Deterministic, reproducible scores |
| Consistency sampling | 0.2 | Needs variance to measure agreement |
At temp=0.0, all N consistency samples would be identical (greedy decoding). You need temp>0 to generate diverse reasoning paths for self-consistency signals.
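One way to turn the N samples into an agreement-based confidence signal is a majority vote plus agreement rate. A sketch, assuming each sample yields a discrete PHQ-8 item score (function name is illustrative):

```python
from collections import Counter

def consistency_signal(samples):
    """Majority vote plus agreement rate over N sampled scores.

    samples: discrete scores (e.g. PHQ-8 item values 0-3 or "N/A")
    returns: (majority score, fraction of samples agreeing with it)
    """
    counts = Counter(samples)
    score, votes = counts.most_common(1)[0]
    return score, votes / len(samples)

# e.g. 5 samples drawn at temperature 0.2
score, agreement = consistency_signal([2, 2, 2, 3, 2])
```

High agreement (here 4 of 5 samples) suggests the score is stable under resampling; low agreement flags an item worth abstaining on or escalating.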
Temperature Selection (BUG-027)
| Value | Use Case | Research Basis |
|---|---|---|
| 0.0 | Primary inference | Med-PaLM, clinical best practice |
| 0.2 | Consistency sampling | Clinical studies define "low" threshold |
| 0.3+ | Not recommended | Performance becomes "unpredictable" |
Research Evidence:
- How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment (arXiv 2509.19329): defines 0.2 as "low"
- GPT-4 Clinical Depression Assessment (arXiv 2501.00199): notes unpredictability at ≥0.3
Environment Variables
```shell
# Primary inference: temp=0 (deterministic)
MODEL_TEMPERATURE=0.0

# Consistency sampling: temp=0.2 (low-variance for clinical)
CONSISTENCY_ENABLED=true
CONSISTENCY_N_SAMPLES=5
CONSISTENCY_TEMPERATURE=0.2  # BUG-027: updated from 0.3
```
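A hedged sketch of loading these variables into a typed config; defaults mirror the values above, while the dataclass and function names are illustrative, not from this codebase:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    model_temperature: float = 0.0        # primary inference
    consistency_enabled: bool = True
    consistency_n_samples: int = 5
    consistency_temperature: float = 0.2  # BUG-027: was 0.3

def load_sampling_config(env=None) -> SamplingConfig:
    """Read sampling settings from the environment, with the
    registry defaults as fallbacks."""
    env = os.environ if env is None else env
    return SamplingConfig(
        model_temperature=float(env.get("MODEL_TEMPERATURE", "0.0")),
        consistency_enabled=env.get("CONSISTENCY_ENABLED", "true").lower() == "true",
        consistency_n_samples=int(env.get("CONSISTENCY_N_SAMPLES", "5")),
        consistency_temperature=float(env.get("CONSISTENCY_TEMPERATURE", "0.2")),
    )
```

Freezing the dataclass keeps the loaded values from drifting at runtime, which is the point of having an SSOT for these parameters.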
Why These Settings (With Citations)
| Source | What It Says | Link |
|---|---|---|
| Med-PaLM | "sampled with temperature 0.0" for clinical answers | Nature Medicine |
| medRxiv 2025 | "Lower temperatures promote diagnostic accuracy and consistency" | Study |
| Anthropic | "Use temperature closer to 0.0 for analytical / multiple choice" | PromptHub |
| Anthropic | "top_k is recommended for advanced use cases only. You usually only need to use temperature" | PromptHub |
| Anthropic/AWS | "You should alter either temperature or top_p, but not both" | AWS Bedrock |
| Claude 4.x APIs | Returns ERROR: "temperature and top_p cannot both be specified" | GitHub |
| IBM | Low temperature reduces randomness but should be combined with RAG, calibration, and human oversight for clinical safety | IBM Think |
Best Practice: Use Temperature Only (2025)
Why Not Both top_k AND top_p?
"OpenAI and most AI companies recommend changing one or the other, not both." — OpenAI Community
"Newer Claude models return an error: 'temperature and top_p cannot both be specified'" — GitHub Issue
Why Not top_k At All?
"Top K is not a terribly useful parameter... not as well-supported, notably missing from OpenAI's API." — Vellum
2025 Reality:
- Use temperature only (preferred)
- OR top_p only (alternative)
- Never both: newer Claude models literally error
- top_k is obsolete: not even in OpenAI's API
At temp=0, Do top_k/top_p Matter?
No. At temperature=0, you get greedy decoding (argmax). There's no sampling, so sampling filters (top_k, top_p) have nothing to filter.
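This falls out of temperature-scaled softmax directly: as the temperature approaches 0, all probability mass collapses onto the argmax token, leaving nothing for top_k/top_p to filter. A small numerical illustration (stdlib only):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; temperature=0 means greedy
    decoding (pure argmax), the convention most APIs follow."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
# At T=0 the distribution is [1.0, 0.0, 0.0]: every token except the
# argmax already has zero probability, so a top_k or top_p filter
# has nothing left to remove.
```

At any T > 0 the distribution spreads over all tokens, which is exactly why consistency sampling (Spec 050) needs a non-zero temperature to produce variance.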
What the Paper Says
| Parameter | Paper Says | Section |
|---|---|---|
| Model | Gemma 3 27B | Section 2.2 |
| Temperature | "fairly deterministic" | Section 4 |
| top_k / top_p | NOT SPECIFIED | — |
| Per-agent sampling | NOT SPECIFIED | — |
The paper only tuned embedding hyperparameters (Appendix D):
- chunk_size = 8
- top_k_references = 2
- dimension = 4096
Sampling parameters were NOT tuned or specified.
What Their Code Does (Reference Only)
Their Codebase is Internally Contradictory
| Source | temp | top_k | top_p | Notes |
|---|---|---|---|---|
| basic_quantitative_analysis.ipynb:207 | 0 | 1 | 1.0 | Zero-shot |
| basic_quantitative_analysis.ipynb:599 | 0.1 | 10 | — | Experimental? |
| embedding_quantitative_analysis.ipynb:1174 | 0.2 | 20 | 0.8 | Few-shot |
| qual_assessment.py | 0 | 20 | 0.9 | Wrong model default |
| meta_review.py | 0 | 20 | 1.0 | Wrong model default |
The Python files also ship wrong model defaults (llama3 instead of gemma3), so the upstream code cannot be trusted as an SSOT.
Our decision: use evidence-based clinical AI defaults rather than reverse-engineering contradictory upstream code.
Environment Variables
```shell
# Clinical AI default: temp=0 only
MODEL_TEMPERATURE=0.0

# top_k and top_p: DO NOT SET
# - At temp=0 they're irrelevant (greedy decoding)
# - Best practice: "use temperature only, not both"
# - Claude 4.x APIs error if you set both temp and top_p
```