BUG-027: Consistency Sampling Temperature May Be Suboptimal for Clinical Accuracy
Status: ✅ Resolved (Implemented)
Severity: P2 (Optimization)
Filed: 2026-01-04
Resolved: 2026-01-04
Component: src/ai_psychiatrist/config.py, .env.example
Summary
The consistency sampling feature (Spec 050) used CONSISTENCY_TEMPERATURE=0.3, which is likely higher than optimal for clinical diagnostic accuracy. 2025-2026 medical research suggests that temperatures in the 0.1-0.2 range provide better diagnostic accuracy while still enabling sufficient variance for self-consistency signals.
Resolution: Update the baseline to CONSISTENCY_TEMPERATURE=0.2 (docs + code + tests) to align with low-variance clinical scoring guidance while preserving multi-sample diversity.
Fix Implemented
- Baseline default updated:
src/ai_psychiatrist/config.py:ConsistencySettings.temperaturedefault0.3 → 0.2- Runbook/config updated:
.env.example:CONSISTENCY_TEMPERATURE=0.2AGENTS.md,CLAUDE.md,GEMINI.md,NEXT-STEPS.md: updated examples and checklists- Regression prevention:
tests/unit/test_bootstrap.py: asserts.env.examplecontainsCONSISTENCY_TEMPERATURE=0.2tests/unit/test_config.py: asserts schema default is0.2
Current Configuration
# .env.example:114
MODEL_TEMPERATURE=0.0 # Primary inference (CORRECT)
# .env.example:138
CONSISTENCY_TEMPERATURE=0.2 # Consistency sampling (low-variance baseline)
Primary inference at 0.0: Correct per Med-PaLM best practices.
Consistency sampling at 0.3: May introduce unnecessary diagnostic variance.
Research Evidence
2025 Medical Research Findings
| Study | Finding | Source |
|---|---|---|
| Emergency Diagnostic Accuracy (medRxiv 2025) | "At temperature 0.0, GPT-4o achieved 100% leading diagnosis accuracy. As temperature increased, accuracy declined systematically to 89.4% at temperature 1.0." | medRxiv 2025.06.04.25328288 |
| Same study | "Diagnostic divergence increased 583% from temp 0.0 to 1.0" | Same |
| GPT-4 Clinical Depression Assessment | "Optimal performance observed at lower temperature values (0.0-0.2) for complex prompts. Beyond temperature >= 0.3, the relationship becomes unpredictable." | arXiv 2501.00199 |
Self-Consistency Trade-off
| Context | Temperature | Rationale |
|---|---|---|
| Original self-consistency papers | 0.5-0.7 | General reasoning tasks, maximize diversity |
| Clinical accuracy (single-shot) | 0.0-0.2 | Maximize diagnostic accuracy |
| Our consistency sampling | 0.3 | Middle ground, but may lean too high |
The Trade-off
Self-consistency REQUIRES non-zero temperature to generate diverse reasoning paths:
If temperature = 0.0:
All N samples are identical (deterministic)
No variance = no consistency signal
Feature becomes useless
If temperature too high (0.5+):
High variance = diverse paths
BUT: Introduces diagnostic errors
Medical research shows accuracy drops significantly
The question: Is 0.3 the right balance, or should we use 0.1-0.2?
Impact Analysis
Current Behavior
- 5 samples at temperature 0.3
- Generates variance for agreement-based confidence
- May introduce 5-10% additional diagnostic variance vs. 0.0
Potential Improvement
- Lower to 0.1-0.2
- Still generates variance for consistency signals
- Better alignment with medical best practices
- Potential accuracy improvement (needs empirical validation)
Proposed Fix
Recommended: Temperature 0.2
Based on 2025 clinical research, 0.2 is the recommended value:
# .env.example
CONSISTENCY_TEMPERATURE=0.2 # Clinical best practice for low-variance sampling
Rationale: 1. A 2025 clinical reasoning assessment study explicitly categorized temperatures as: - Low: 0.2 (recommended for accuracy) - High: 0.7 (for creativity/exploration)
Source: How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
- The Emergency Diagnostic Accuracy study (medRxiv 2025) tested:
- 0.0, 0.25, 0.50, 0.75, 1.0
- Best accuracy at 0.0, declining at each step
- 0.25 is the next tested point after 0.0
Our current 0.3 sits between 0.25 and 0.50 - slightly suboptimal.
-
ECG diagnostic studies used temperature 0.2 for consistency across runs.
-
Anthropic's official guidance recommends 0.0-0.2 for analytical tasks.
Why Not 0.1?
- Too low may not generate sufficient variance across 5 samples
- 0.2 is the established "low" threshold in clinical research
- Provides balance between variance and accuracy
Why Not Keep 0.3?
- 0.3 is above the "low" threshold used in clinical studies
- Research specifically tested 0.2 vs 0.7 as low/high categories
- GPT-4 depression study noted performance becomes "unpredictable" at >= 0.3
Code Locations
| File | Line | Current Value |
|---|---|---|
.env.example |
138 | CONSISTENCY_TEMPERATURE=0.2 |
.env |
137 | CONSISTENCY_TEMPERATURE=0.2 |
src/ai_psychiatrist/config.py |
440-444 | temperature: float = Field(default=0.2, ...) |
Validation (Optional)
Empirical testing can still be run (0.2 vs 0.3) to quantify impact on both accuracy and calibration, but the baseline is now aligned with the low-temperature clinical guidance cited below.
- Empirical testing: Run comparison at 0.2 vs 0.3 on paper-test split
- Variance sufficiency: Verify 0.2 still produces meaningful variance across 5 samples
- Consistency signal quality: Check if lower temp degrades agreement-based confidence calibration
Decision Points (If Revisited)
- [ ] Keep
0.2(baseline) and tune other confidence signals first - [ ] Evaluate adaptive temperature selection (see arXiv 2502.05234)
- [ ] Increase
n_samplesand reduce temperature further (if variance remains sufficient)
References
- Temperature-Driven Variability in Emergency Diagnostic Accuracy (medRxiv 2025) - Primary clinical evidence
- How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment (arXiv 2509.19329) - Defines 0.2 as "low" clinical threshold
- GPT-4 on Clinic Depression Assessment (arXiv 2501.00199) - PHQ-8 specific, notes unpredictability at >= 0.3
- Optimizing Temperature for Multi-Sample Inference (arXiv 2502.05234) - TURN automated optimization
- Calibration Study in Biomedical Research (bioRxiv 2025) - Self-consistency calibration
- Statistical Framework for LLM Consistency in Medical Diagnosis (medRxiv 2025)