Spec 050: Consistency-Based Confidence (Multi-Sample)
Status: Implemented (2026-01-03)
Priority: Medium (alternative to verbalized confidence)
Depends on: None (standalone)
Estimated effort: Medium-High
Research basis: CoCoA (TACL 2025), Semantic Entropy (2024)
0. Problem Statement
LLMs exhibit aleatoric uncertainty (irreducible randomness) and epistemic uncertainty (model uncertainty). When an LLM is uncertain about a prediction, running multiple inference passes with temperature > 0 will yield inconsistent outputs.
Hypothesis: Predictions with high consistency across samples are more likely to be correct.
This approach is validated by:

- CoCoA (TACL 2025): "Confidence-Consistency Aggregation yields best overall reliability"
- Semantic Entropy (2024): Clustering semantically equivalent outputs captures uncertainty better than token-level entropy
- LM-Polygraph Benchmark: Consistency-based methods among top performers
Trade-off
Cost: N inference passes per item (N typically 3-10)
Benefit: Strong uncertainty signal independent of LLM self-assessment
1. Goals / Non-Goals
1.1 Goals
- Implement multi-sample inference for quantitative assessment
- Compute consistency metrics across samples:
- Score agreement (exact match rate)
- Score variance (standard deviation)
- Modal score (most common prediction)
- Modal confidence (frequency of modal score)
- Persist consistency signals in run artifacts
- Add consistency-based confidence variants in evaluation
- Support configurable sample count (N) and temperature
1.2 Non-Goals
- Semantic clustering of evidence (would require NLP pipeline)
- Monte Carlo Dropout (requires model modifications, not applicable to Ollama)
- Changing the single-pass inference mode (consistency mode is opt-in)
2. Proposed Solution
2.1 Multi-Sample Inference Mode
Add a `--consistency-samples` flag to `scripts/reproduce_results.py`:

```bash
# Standard inference (N=1)
uv run python scripts/reproduce_results.py --split paper-test

# Consistency mode (N=5)
uv run python scripts/reproduce_results.py \
  --split paper-test \
  --consistency-samples 5 \
  --temperature 0.3
```
Implementation in `QuantitativeAssessmentAgent`:

```python
def assess_with_consistency(
    self,
    transcript: str,
    qualitative: QualitativeAssessment,
    *,
    n_samples: int = 5,
    temperature: float = 0.3,
) -> PHQ8Assessment:
    """Run multiple inference passes and aggregate."""
    samples = []
    for _ in range(n_samples):
        # Each call uses temperature > 0 for diversity
        sample = self._assess_single(transcript, qualitative, temperature=temperature)
        samples.append(sample)

    # Aggregate across samples
    return self._aggregate_samples(samples)
```
2.2 Consistency Metrics
For each PHQ-8 item, compute:
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| `consistency_modal_score` | Mode of predicted scores | 0-3 or None | Most common prediction |
| `consistency_modal_count` | Count of modal score | 1-N | How many samples agreed |
| `consistency_modal_confidence` | modal_count / N | 0-1 | Confidence from agreement |
| `consistency_score_std` | Std of scores | 0-1.5 | Lower = more consistent |
| `consistency_na_rate` | Fraction of N/A predictions | 0-1 | High = uncertain |
Final prediction selection:

- If `consistency_modal_confidence >= 0.5`: use the modal score
- Else: use the single-sample fallback or N/A
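A minimal sketch of the per-item aggregation, assuming each sample yields an optional integer score per item; the `aggregate_item_scores` helper and its lowest-score tie-breaking rule are illustrative, not the final implementation:

```python
from collections import Counter
from statistics import stdev


def aggregate_item_scores(scores: list[int | None]) -> dict:
    """Compute consistency metrics for one PHQ-8 item across N samples."""
    n = len(scores)
    numeric = [s for s in scores if s is not None]
    na_rate = (n - len(numeric)) / n

    if not numeric:
        # Every sample abstained: no modal score, zero confidence
        return {
            "consistency_modal_score": None,
            "consistency_modal_count": 0,
            "consistency_modal_confidence": 0.0,
            "consistency_score_std": None,
            "consistency_na_rate": na_rate,
        }

    counts = Counter(numeric)
    max_count = max(counts.values())
    # Deterministic tie-break: prefer the lowest of the tied scores
    modal_score = min(s for s, c in counts.items() if c == max_count)

    return {
        "consistency_modal_score": modal_score,
        "consistency_modal_count": max_count,
        "consistency_modal_confidence": max_count / n,
        "consistency_score_std": stdev(numeric) if len(numeric) > 1 else 0.0,
        "consistency_na_rate": na_rate,
    }
```

For the Section 2.3 example `[2, 2, 2, 2, 1]` this yields a modal score of 2, modal confidence 0.8, and a sample std of ~0.45.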
2.3 Run Artifact Schema Extension
```json
{
  "item_signals": {
    "Sleep": {
      "llm_evidence_count": 2,
      "retrieval_similarity_mean": 0.82,
      "verbalized_confidence": 4,
      "consistency_modal_score": 2,
      "consistency_modal_count": 4,
      "consistency_modal_confidence": 0.8,
      "consistency_score_std": 0.45,
      "consistency_na_rate": 0.0,
      "consistency_samples": [2, 2, 2, 2, 1]
    }
  },
  "provenance": {
    "consistency_enabled": true,
    "consistency_n_samples": 5,
    "consistency_temperature": 0.3
  }
}
```
2.4 New Confidence Variants
Add to `scripts/evaluate_selective_prediction.py`:

```python
CONFIDENCE_VARIANTS = {
    # Existing...
    # NEW (Spec 050)
    "consistency",              # = consistency_modal_confidence
    "consistency_inverse_std",  # = 1 / (1 + consistency_score_std)
    "hybrid_consistency",       # Combine consistency with other signals
}
```
Formula for `consistency_inverse_std`:

```
confidence = 1 / (1 + consistency_score_std)
```

Formula for `hybrid_consistency`:

```
c = consistency_modal_confidence
e = min(llm_evidence_count, 3) / 3
s = retrieval_similarity_mean or 0.0
confidence = 0.4 * c + 0.3 * e + 0.3 * s
```
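A sketch of how the evaluator could map persisted `item_signals` to these variants; the dispatch function is an assumption about structure, not the script's actual layout:

```python
def consistency_confidence(signals: dict, variant: str) -> float:
    """Map persisted item signals to a confidence value in [0, 1]."""
    if variant == "consistency":
        return signals["consistency_modal_confidence"]
    if variant == "consistency_inverse_std":
        # Equals 1.0 when all samples agree (std == 0)
        return 1.0 / (1.0 + signals["consistency_score_std"])
    if variant == "hybrid_consistency":
        c = signals["consistency_modal_confidence"]
        e = min(signals["llm_evidence_count"], 3) / 3
        s = signals.get("retrieval_similarity_mean") or 0.0
        return 0.4 * c + 0.3 * e + 0.3 * s
    raise ValueError(f"Unknown consistency variant: {variant}")
```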
2.5 Configuration
Environment variables / .env:
```bash
# Consistency mode (default: disabled)
CONSISTENCY_ENABLED=false
CONSISTENCY_N_SAMPLES=5
CONSISTENCY_TEMPERATURE=0.2  # Updated from 0.3 per BUG-027 (2025 clinical research)
```
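A possible shape for the corresponding settings in `src/ai_psychiatrist/config.py`, assuming a pydantic-settings style config object (an assumption; field names mirror the environment variables above):

```python
from pydantic_settings import BaseSettings


class ConsistencySettings(BaseSettings):
    """Consistency-mode configuration, loaded from the environment / .env."""

    consistency_enabled: bool = False
    consistency_n_samples: int = 5
    consistency_temperature: float = 0.2  # See BUG-027 for the 0.3 -> 0.2 change
```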
3. Implementation Plan
Phase 1: Multi-Sample Infrastructure

- Add `--consistency-samples` and `--temperature` flags to `reproduce_results.py`
- Implement `assess_with_consistency()` in `QuantitativeAssessmentAgent`
- Implement sample aggregation logic
- Add a `ConsistencyMetrics` dataclass (a possible shape is sketched below)
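A possible shape for `ConsistencyMetrics`, mirroring the per-item fields in the Section 2.3 artifact schema (the exact layout is a sketch, not the final API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencyMetrics:
    """Per-item consistency signals aggregated over N samples."""

    modal_score: int | None          # Most common predicted score, None if all N/A
    modal_count: int                 # Number of samples that agreed on the mode
    modal_confidence: float          # modal_count / n_samples, in [0, 1]
    score_std: float | None          # Sample std of numeric scores, None if all N/A
    na_rate: float                   # Fraction of samples that predicted N/A
    samples: tuple[int | None, ...]  # Raw per-sample scores, kept for debugging
```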
Phase 2: Run Artifact Changes

- Extend the `item_signals` schema with consistency fields
- Persist the `consistency_samples` array for debugging
- Add provenance fields for consistency configuration
Phase 3: Evaluation Support

- Add `consistency`, `consistency_inverse_std`, and `hybrid_consistency` variants
- Handle missing consistency signals gracefully (raise a clear error if a variant is requested but the signals are absent)
4. Test Plan
4.1 Unit Tests
- `test_consistency_aggregation`: Modal score computation
- `test_consistency_metrics`: Std, NA rate calculations
- `test_consistency_edge_cases`: All N/A, single sample, tie-breaking
4.2 Integration Tests
- Mock LLM with deterministic outputs, verify aggregation
- Mock LLM with varying outputs, verify consistency metrics
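As a sketch, the deterministic and varying-output cases above (and the 4.1 edge cases) could be pinned down with tests against the illustrative `aggregate_item_scores` helper from Section 2.2:

```python
from math import isclose


def test_modal_aggregation_with_disagreement():
    # Matches the Section 2.3 artifact example: four samples say 2, one says 1
    metrics = aggregate_item_scores([2, 2, 2, 2, 1])
    assert metrics["consistency_modal_score"] == 2
    assert metrics["consistency_modal_count"] == 4
    assert isclose(metrics["consistency_modal_confidence"], 0.8)
    assert metrics["consistency_na_rate"] == 0.0


def test_all_na_samples_yield_zero_confidence():
    metrics = aggregate_item_scores([None, None, None])
    assert metrics["consistency_modal_score"] is None
    assert metrics["consistency_modal_confidence"] == 0.0
    assert metrics["consistency_na_rate"] == 1.0
```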
4.3 Ablation Run
```bash
# Generate consistency run
uv run python scripts/reproduce_results.py \
  --split paper-test \
  --consistency-samples 5 \
  --temperature 0.3 \
  2>&1 | tee data/outputs/run_consistency.log

# Evaluate
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/run_consistency.json \
  --mode few_shot \
  --confidence consistency

uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/run_consistency.json \
  --mode few_shot \
  --confidence hybrid_consistency
```
5. Expected Outcomes
Based on CoCoA and semantic entropy literature:
| Confidence Signal | Expected AUGRC | vs Baseline |
|---|---|---|
| `llm` (baseline) | 0.031 | — |
| `consistency` | ~0.025 | -19% |
| `consistency_inverse_std` | ~0.024 | -23% |
| `hybrid_consistency` | ~0.020 | -35% |
Trade-off: 5x inference time for potential 20-35% AUGRC improvement
6. Acceptance Criteria
- [ ] `--consistency-samples` flag works in `reproduce_results.py`
- [ ] Consistency metrics computed and persisted in run artifacts
- [ ] `evaluate_selective_prediction.py` supports consistency variants
- [ ] Provenance tracks consistency configuration
- [ ] Documentation in `docs/statistics/metrics-and-evaluation.md`
- [ ] Tests pass: `make ci`
7. Performance Considerations
7.1 Inference Time
| N Samples | Estimated Time (paper-test) | vs Single-Pass |
|---|---|---|
| 1 | ~2 hours | 1x |
| 3 | ~6 hours | 3x |
| 5 | ~10 hours | 5x |
| 10 | ~20 hours | 10x |
Recommendation: Use N=5 for evaluation runs, N=3 for development.
7.2 Parallelization
Samples for the same participant can be parallelized if Ollama has sufficient GPU memory. Consider adding a `--consistency-parallel` flag, as sketched below.
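A sketch of bounded concurrency with asyncio; `_assess_single_async` is a hypothetical async variant of the single-pass method, and the concurrency bound should track available GPU memory:

```python
import asyncio


async def sample_concurrently(
    agent,
    transcript: str,
    qualitative,
    *,
    n_samples: int = 5,
    temperature: float = 0.3,
    max_concurrency: int = 2,
) -> list:
    """Run N sampling passes with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one_pass():
        async with sem:
            # Hypothetical async counterpart of _assess_single
            return await agent._assess_single_async(
                transcript, qualitative, temperature=temperature
            )

    return await asyncio.gather(*(one_pass() for _ in range(n_samples)))
```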
7.3 Caching
Cache intermediate samples to allow resumption if run is interrupted.
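One possible shape: append each completed sample to a JSONL sidecar so an interrupted run can skip finished work (the file layout and key names are assumptions):

```python
import json
from pathlib import Path


def load_cached_samples(cache_path: Path, participant_id: str) -> list[dict]:
    """Return samples already completed for this participant, if any."""
    if not cache_path.exists():
        return []
    with cache_path.open() as f:
        rows = [json.loads(line) for line in f]
    return [r for r in rows if r["participant_id"] == participant_id]


def append_sample(
    cache_path: Path, participant_id: str, sample_index: int, payload: dict
) -> None:
    """Persist one completed sample immediately, enabling resumption."""
    record = {"participant_id": participant_id, "sample_index": sample_index, **payload}
    with cache_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```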
8. File Changes
New Files
- `src/ai_psychiatrist/domain/value_objects.py` (extend with `ConsistencyMetrics`)
- `tests/unit/agents/test_consistency.py`
Modified Files
- `scripts/reproduce_results.py` (add consistency flags)
- `src/ai_psychiatrist/agents/quantitative.py` (add `assess_with_consistency`)
- `scripts/evaluate_selective_prediction.py` (add consistency variants)
- `src/ai_psychiatrist/config.py` (add consistency settings)