Spec 048: Verbalized Confidence for AUGRC Improvement
Status: Implemented (2026-01-03)
Priority: High (next AUGRC improvement lever)
Depends on: Spec 046 (retrieval signals)
Estimated effort: Medium
Research basis: LLM Uncertainty Survey 2025; ICLR 2025, "Do LLMs Estimate Uncertainty Well?"
0. Problem Statement
Run 9 (Spec 046) achieved a 5.4% AURC improvement using `retrieval_similarity_mean` as a confidence signal, but AUGRC remains at ~0.031 (target: < 0.020).
Current confidence signals are external to the LLM's reasoning:
- Evidence count (how many quotes were extracted)
- Retrieval similarity (how similar the retrieved references were)
Neither signal captures the LLM's internal uncertainty about its own prediction. Research shows that asking LLMs to verbalize their confidence—while imperfect—provides complementary signal that improves calibration.
Key Research Findings
| Source | Finding |
|---|---|
| LLM Uncertainty Survey 2025 | Verbalized confidence is overconfident (80-100% range) but still useful when calibrated |
| ICLR 2025 | normalized p(true) is a reliable uncertainty method across settings |
| CoCoA (TACL 2025) | Hybrid confidence-consistency aggregation yields best overall reliability |
Expected improvement: 20-40% AUGRC reduction (literature-based estimate)
1. Goals / Non-Goals
1.1 Goals
- Add verbalized confidence field to LLM output schema (per item)
- Persist verbalized confidence in run artifacts (`item_signals`)
- Add new confidence variants in `evaluate_selective_prediction.py`:
  - `verbalized`: raw verbalized confidence
  - `verbalized_calibrated`: temperature-scaled verbalized confidence
  - `hybrid_verbalized`: combination of verbalized, retrieval, and evidence signals
- Provide calibration infrastructure to fit temperature scaling on paper-train
- Maintain backward compatibility with existing run artifacts
1.2 Non-Goals
- Changing the scoring logic (this spec targets confidence/ranking quality only)
- Ensemble methods requiring multiple inference passes (see Spec 050)
- Training a full ML calibrator (see Spec 049)
2. Proposed Solution
2.1 Extend LLM Output Schema
Current `ItemAssessment` output from the LLM:

```json
{
  "item": "Sleep",
  "score": 2,
  "evidence": ["Quote 1...", "Quote 2..."],
  "explanation": "..."
}
```
New output with verbalized confidence:

```json
{
  "item": "Sleep",
  "score": 2,
  "confidence": 4,
  "evidence": ["Quote 1...", "Quote 2..."],
  "explanation": "..."
}
```
Where `confidence` is an integer from 1 to 5:
- 1 = Very uncertain (guessing)
- 2 = Somewhat uncertain
- 3 = Moderately confident
- 4 = Fairly confident
- 5 = Very confident
2.2 Prompt Modification
Add to the quantitative assessment prompt (after the scoring instructions):

```text
For each item, also provide a confidence rating from 1 to 5:
- 1: Very uncertain - I am guessing based on minimal evidence
- 2: Somewhat uncertain - Evidence is weak or ambiguous
- 3: Moderately confident - Some supporting evidence
- 4: Fairly confident - Clear supporting evidence
- 5: Very confident - Strong, unambiguous evidence

If you cannot assess an item (N/A), do not include a confidence rating for that item.
```
2.3 Domain Model Changes
Extend `ItemAssessment` in `src/ai_psychiatrist/domain/value_objects.py`:

```python
@dataclass(frozen=True)
class ItemAssessment:
    item: PHQ8Item
    score: int | None
    evidence: tuple[str, ...]
    explanation: str
    na_reason: str | None = None
    # Existing (Spec 046)
    retrieval_reference_count: int | None = None
    retrieval_similarity_mean: float | None = None
    retrieval_similarity_max: float | None = None
    # NEW (Spec 048)
    verbalized_confidence: int | None = None  # 1-5 scale
```
2.4 Export in Run Artifacts
Add to `item_signals` in the run output JSON:

```json
{
  "item_signals": {
    "Sleep": {
      "llm_evidence_count": 2,
      "retrieval_reference_count": 1,
      "retrieval_similarity_mean": 0.82,
      "retrieval_similarity_max": 0.82,
      "verbalized_confidence": 4
    }
  }
}
```
2.5 New Confidence Variants
Add to `scripts/evaluate_selective_prediction.py`:

```python
CONFIDENCE_VARIANTS = {
    # Existing
    "llm",
    "total_evidence",
    "retrieval_similarity_mean",
    "retrieval_similarity_max",
    "hybrid_evidence_similarity",
    # NEW (Spec 048)
    "verbalized",
    "verbalized_calibrated",
    "hybrid_verbalized",
}
```
Formula for `verbalized`:

```python
confidence = (verbalized_confidence - 1) / 4  # Normalize to [0, 1]
```
Formula for `verbalized_calibrated`:

```python
# Temperature scaling learned from paper-train (probability-space temperature scaling)
p = (verbalized_confidence - 1) / 4  # Normalize to [0, 1] (use 0.5 if null)
confidence = sigmoid(logit(p) / T)
# where T > 0 is fit by minimizing binary negative log-likelihood
```
Formula for `hybrid_verbalized`:

```python
e = min(llm_evidence_count, 3) / 3
s = retrieval_similarity_mean or 0.0
v = (verbalized_confidence - 1) / 4 if verbalized_confidence else 0.5
confidence = 0.4 * v + 0.3 * e + 0.3 * s
```
2.6 Calibration Infrastructure
New script: `scripts/calibrate_verbalized_confidence.py`

```bash
# Fit temperature scaling on paper-train
uv run python scripts/calibrate_verbalized_confidence.py \
    --input data/outputs/run_paper_train.json \
    --mode few_shot \
    --output data/calibration/temperature_scaling.json

# Apply calibration to evaluation
uv run python scripts/evaluate_selective_prediction.py \
    --input data/outputs/run_paper_test.json \
    --mode few_shot \
    --confidence verbalized_calibrated \
    --calibration data/calibration/temperature_scaling.json
```
Calibration artifact schema:

```json
{
  "method": "temperature_scaling",
  "temperature": 2.3,
  "fitted_on": {
    "run_id": "...",
    "mode": "few_shot",
    "n_samples": 464
  },
  "metrics": {
    "nll_before": 1.23,
    "nll_after": 0.98,
    "ece_before": 0.15,
    "ece_after": 0.08
  }
}
```
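The core of the fitting step is a one-dimensional optimization of T against binary NLL, which `scipy.optimize` (already named in Phase 3) handles directly. A sketch, assuming item-level 0/1 correctness labels; the function name and the choice to search over log T are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar


def fit_temperature(p_raw: np.ndarray, correct: np.ndarray) -> float:
    """Fit T > 0 by minimizing binary NLL of temperature-scaled confidences.

    p_raw: normalized verbalized confidences in [0, 1].
    correct: 0/1 array (whether the item score was correct).
    Searching over log T keeps T positive without explicit constraints.
    """
    eps = 1e-6
    p_clipped = np.clip(p_raw, eps, 1 - eps)
    logits = np.log(p_clipped / (1 - p_clipped))

    def nll(log_t: float) -> float:
        p = 1 / (1 + np.exp(-logits / np.exp(log_t)))
        p = np.clip(p, eps, 1 - eps)
        return -np.mean(correct * np.log(p) + (1 - correct) * np.log(1 - p))

    result = minimize_scalar(nll, bounds=(-3, 3), method="bounded")
    return float(np.exp(result.x))
```

The before/after NLL and ECE values recorded in the artifact would be computed with the same clipped probabilities.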
3. Implementation Plan
Phase 1: Prompt & Schema Changes
- Update `src/ai_psychiatrist/agents/prompts/quantitative.py` with confidence instructions
- Update `ItemAssessment` dataclass with `verbalized_confidence` field
- Update `QuantitativeAssessmentAgent` to parse and validate confidence from the LLM response
- Update `scripts/reproduce_results.py` to export `verbalized_confidence` in `item_signals`
Phase 2: Evaluation Support
- Add `verbalized` confidence variant to `evaluate_selective_prediction.py`
- Add CLI flag `--calibration` to load the calibration artifact
- Add `verbalized_calibrated` and `hybrid_verbalized` variants
Phase 3: Calibration Script
- Create `scripts/calibrate_verbalized_confidence.py`
- Implement temperature scaling optimization (scipy.optimize or sklearn)
- Add unit tests for calibration fitting and application
4. Test Plan
4.1 Unit Tests
- `test_verbalized_confidence_parsing`: validates 1-5 range, handles missing values gracefully
- `test_verbalized_confidence_normalization`: verifies [0, 1] output
- `test_temperature_scaling_calibration`: verifies NLL reduction
- `test_hybrid_verbalized_formula`: verifies bounded output
4.2 Integration Tests
- Mock LLM response with confidence field
- Verify end-to-end flow from assessment to evaluation
4.3 Ablation Run
After implementation, run on paper-test and compare:
```bash
# Baseline
uv run python scripts/evaluate_selective_prediction.py \
    --input data/outputs/run10.json --mode few_shot --confidence llm

# Verbalized raw
uv run python scripts/evaluate_selective_prediction.py \
    --input data/outputs/run10.json --mode few_shot --confidence verbalized

# Verbalized calibrated
uv run python scripts/evaluate_selective_prediction.py \
    --input data/outputs/run10.json --mode few_shot --confidence verbalized_calibrated \
    --calibration data/calibration/temperature_scaling.json

# Hybrid
uv run python scripts/evaluate_selective_prediction.py \
    --input data/outputs/run10.json --mode few_shot --confidence hybrid_verbalized
```
5. Expected Outcomes
Based on literature:
| Confidence Signal | Expected AUGRC | vs Current |
|---|---|---|
| `llm` (baseline) | 0.031 | — |
| `verbalized` (raw) | ~0.028 | -10% |
| `verbalized_calibrated` | ~0.024 | -23% |
| `hybrid_verbalized` | ~0.020 | -35% |
Target: AUGRC < 0.020 with `hybrid_verbalized`
6. Acceptance Criteria
- [ ] LLM outputs include `confidence` field (1-5)
- [ ] `ItemAssessment` has `verbalized_confidence` field
- [ ] Run artifacts include `verbalized_confidence` in `item_signals`
- [ ] `evaluate_selective_prediction.py` supports `verbalized`, `verbalized_calibrated`, `hybrid_verbalized`
- [ ] `calibrate_verbalized_confidence.py` produces a valid calibration artifact
- [ ] Documentation updated in `docs/statistics/metrics-and-evaluation.md`
- [ ] Tests pass: `make ci`
7. Risks and Mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| LLM ignores confidence instruction | Medium | Add examples in prompt; validate output |
| Verbalized confidence too noisy | Medium | Calibration reduces noise; hybrid signal provides fallback |
| Calibration overfits paper-train | Low | Use proper train/val split; check generalization |