Metrics and Evaluation (Exact Definitions + Output Schema)
Audience: Researchers who need implementation-accurate metrics
Last Updated: 2026-01-03
This page is the canonical (non-archive) reference for how this repo computes and reports:
- coverage
- risk-coverage curves
- AURC / AUGRC
- MAE@coverage
- bootstrap confidence intervals
SSOT implementations:
- src/ai_psychiatrist/metrics/selective_prediction.py
- src/ai_psychiatrist/metrics/bootstrap.py
- scripts/evaluate_selective_prediction.py
Unit of Evaluation (Critical)
We evaluate item instances: one PHQ-8 item for one participant.
For each (participant_id, item):
- gt is always present (0–3)
- pred is either (0–3) or None (abstain)
- confidence is a scalar ranking signal (higher = more confident)
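For concreteness, here is a minimal sketch of one item instance as it might look once parsed. The container and the `participant_id`/`item` field values are illustrative, not the repo's schema; only `gt`, `pred`, and `confidence` are defined above.

```python
# Illustrative item instance (hypothetical container; the real run-artifact
# schema is defined by scripts/reproduce_results.py).
item = {
    "participant_id": "P123",   # hypothetical ID format
    "item": "phq8_sleep",       # hypothetical item key
    "gt": 2,                    # ground truth, always present (0-3)
    "pred": None,               # 0-3, or None on abstention
    "confidence": 0.0,          # scalar ranking signal (higher = more confident)
}
```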
Participant Failures
scripts/reproduce_results.py records per-participant success:
- success=True participants are included in selective prediction metrics.
- success=False participants are counted as reliability failures and excluded from AURC/AUGRC (by design).
This is implemented in scripts/evaluate_selective_prediction.py:parse_items().
Coverage and Cmax
Let:
- P = number of included participants (success=True)
- N = P * 8 total item instances
- K = number of predicted items (pred is not None) across the N items
Then:
- coverage = K / N
- Cmax = K / N (the same value as overall coverage; it is named "max achievable coverage" because abstentions bound how far the risk-coverage curve can extend)
SSOT: compute_cmax() in src/ai_psychiatrist/metrics/selective_prediction.py.
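A minimal sketch of this computation, assuming `preds` is a list of per-item predictions with `None` marking abstentions (the SSOT remains `compute_cmax()`):

```python
def coverage_and_cmax(preds: list[int | None]) -> float:
    """Fraction of item instances with a non-None prediction.

    Coverage and Cmax coincide: both equal K / N, where K is the number
    of predicted items and N = P * 8 is the total item instances.
    """
    n = len(preds)                                # N
    k = sum(1 for p in preds if p is not None)    # K = predicted items
    return k / n
```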
Confidence Variants
scripts/evaluate_selective_prediction.py supports the following confidence variants:
- `llm`: `confidence = llm_evidence_count`
- `total_evidence`: `confidence = llm_evidence_count` (legacy alias; keyword backfill removed in Spec 047)
- `retrieval_similarity_mean` (Spec 046): `confidence = retrieval_similarity_mean if not null else 0.0`
- `retrieval_similarity_max` (Spec 046): `confidence = retrieval_similarity_max if not null else 0.0`
- `hybrid_evidence_similarity` (Spec 046):
  - `e = min(llm_evidence_count, 3) / 3`
  - `s = retrieval_similarity_mean if not null else 0.0`
  - `confidence = 0.5 * e + 0.5 * s`
- `verbalized` (Spec 048):
  - `v = verbalized_confidence` (1–5 scale)
  - `confidence = (v - 1) / 4` (normalized to 0–1; uses 0.5 if null)
- `verbalized_calibrated` (Spec 048):
  - Requires `--calibration` pointing to a `method=temperature_scaling` artifact
  - `v = (verbalized_confidence - 1) / 4` (0–1; uses 0.5 if null)
  - `confidence = sigmoid(logit(v) / T)`, where `T` is fitted on a training run
- `hybrid_verbalized` (Spec 048):
  - `v = (verbalized_confidence - 1) / 4` (0–1; uses 0.5 if null)
  - `e = min(llm_evidence_count, 3) / 3`
  - `s = retrieval_similarity_mean if not null else 0.0`
  - `confidence = 0.4 * v + 0.3 * e + 0.3 * s`
- `calibrated` (Spec 049):
  - Requires `--calibration` pointing to a calibrator artifact (e.g., `method=logistic`)
  - Extracts features from per-item `item_signals` using the artifact's `features` list
  - `confidence = p_correct` (or a calibrated confidence score) produced by the calibrator
- `token_msp` (Spec 051):
  - Requires per-item `token_msp` in `item_signals`
  - `confidence = token_msp` (0–1, higher = more confident)
- `token_pe` (Spec 051):
  - Requires per-item `token_pe` in `item_signals`
  - Stored value is entropy (lower = more confident)
  - `confidence = 1 / (1 + token_pe)` (maps to (0, 1])
- `token_energy` (Spec 051):
  - Requires per-item `token_energy` in `item_signals`
  - Stored value is `logsumexp(top_logprobs.logprob)` over tokens
  - `confidence = exp(token_energy)` (interpretable as cumulative mass captured by `top_logprobs`)
- `secondary:<csf1>+<csf2>:<average|product>` (Spec 051):
  - Combines two base CSFs on the fly
  - Example: `secondary:token_msp+retrieval_similarity_mean:average`
- `consistency` (Spec 050):
  - Requires per-item `consistency_modal_confidence` in `item_signals`
  - `confidence = consistency_modal_confidence`
- `consistency_inverse_std` (Spec 050):
  - Requires per-item `consistency_score_std` in `item_signals`
  - `confidence = 1 / (1 + consistency_score_std)`
- `hybrid_consistency` (Spec 050):
  - Requires per-item `consistency_modal_confidence` in `item_signals`
  - `e = min(llm_evidence_count, 3) / 3`
  - `s = retrieval_similarity_mean if not null else 0.0`
  - `confidence = 0.4 * consistency_modal_confidence + 0.3 * e + 0.3 * s`
These are derived from item_signals in the run output JSON.
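As an illustration, here is a sketch of how the `hybrid_evidence_similarity` variant could be computed from one item's signals, following the formulas above. The function name and the `signals` dict are assumptions for the example; the SSOT is `scripts/evaluate_selective_prediction.py`.

```python
def hybrid_evidence_similarity(signals: dict) -> float:
    """Spec 046 hybrid: equal-weight blend of evidence count and retrieval similarity."""
    e = min(signals.get("llm_evidence_count", 0), 3) / 3
    sim = signals.get("retrieval_similarity_mean")
    s = sim if sim is not None else 0.0   # null similarity falls back to 0.0
    return 0.5 * e + 0.5 * s
```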
Calibration Artifacts
Calibration maps raw confidence to calibrated probabilities (typically P(correct)). Artifacts are JSON files generated by training scripts.
Generating calibrators:
```bash
# Temperature scaling for verbalized confidence (Spec 048)
uv run python scripts/calibrate_verbalized_confidence.py \
  --input data/outputs/train_run.json \
  --mode few_shot \
  --output data/outputs/calibration_verbalized_temperature_scaling_fewshot.json

# Supervised calibrator (Spec 049)
uv run python scripts/train_confidence_calibrator.py \
  --input data/outputs/train_run.json \
  --mode few_shot \
  --method logistic \
  --features verbalized_confidence,retrieval_similarity_mean \
  --output data/outputs/calibration_logistic_fewshot.json
```
Calibrator types (SSOT: src/ai_psychiatrist/calibration/calibrators.py):
| Type | Description | Use Case |
|---|---|---|
| `TemperatureScalingCalibrator` | Single-param scaling: `sigmoid(logit(p)/T)` | Verbalized confidence |
| `LogisticCalibrator` | Logistic regression on feature vector | Multi-feature supervised |
| `LinearCalibrator` | Linear regression (for continuous targets) | Regression-style calibration |
| `IsotonicCalibrator` | Piecewise-linear monotonic | Non-parametric calibration |
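A minimal sketch of the temperature-scaling transform from the table (the SSOT is `TemperatureScalingCalibrator` in `calibrators.py`; this standalone version assumes `p` is already normalized to (0, 1), and the clamping epsilon is an illustrative choice):

```python
import math

def temperature_scale(p: float, T: float, eps: float = 1e-6) -> float:
    """Apply sigmoid(logit(p) / T). T > 1 flattens confidences, T < 1 sharpens."""
    p = min(max(p, eps), 1 - eps)        # clamp away from 0/1 so logit is finite
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit / T))
```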
Calibration metrics:
- ECE (Expected Calibration Error): Mean |accuracy - confidence| across bins. SSOT: compute_ece() in calibrators.py.
- NLL (Negative Log-Likelihood): Log-loss for binary correctness. SSOT: compute_binary_nll() in calibrators.py.
Feature extraction: CalibratorFeatureExtractor in src/ai_psychiatrist/calibration/feature_extraction.py extracts numeric features from item_signals with conservative defaults for missing values.
Retrieval-Signal Availability (Spec 046)
The retrieval-based confidence variants require the following per-item keys inside
item_signals:
- retrieval_reference_count
- retrieval_similarity_mean
- retrieval_similarity_max
Run artifacts produced by older versions of scripts/reproduce_results.py will not
contain these keys. In that case, scripts/evaluate_selective_prediction.py will raise
a clear error if a retrieval-based confidence variant is requested (it will not silently
substitute missing values).
Verbalized-Confidence Availability (Spec 048)
The verbalized confidence variants require the following per-item key inside
item_signals:
- verbalized_confidence
Artifacts produced by older versions of scripts/reproduce_results.py will not contain
this key. In that case, scripts/evaluate_selective_prediction.py will raise a clear
error if a verbalized-confidence variant is requested.
Token-Confidence Availability (Spec 051)
The token-level confidence variants require the following per-item keys inside
item_signals:
- token_msp
- token_pe
- token_energy
These values are only populated when the quantitative scorer backend returns logprobs.
If the backend does not support logprobs, the keys may be present but null; requesting
token_* confidence variants will then raise a clear error (no silent fallback).
Consistency-Signal Availability (Spec 050)
The consistency-based confidence variants require the following per-item keys inside
item_signals:
- consistency_modal_confidence
- consistency_score_std
These are only populated when the run was produced with multi-sample scoring enabled
(--consistency-samples > 1 in scripts/reproduce_results.py or CONSISTENCY_ENABLED=true).
Loss Functions
Two loss functions are supported:
- `abs`: `|pred - gt|`
- `abs_norm`: `|pred - gt| / 3` (range 0–1)
SSOT: _compute_loss() in src/ai_psychiatrist/metrics/selective_prediction.py.
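A sketch of the two losses (the SSOT is `_compute_loss()`; the function name here is illustrative):

```python
def compute_loss(pred: int, gt: int, name: str = "abs_norm") -> float:
    """Absolute error on the 0-3 PHQ-8 item scale, optionally normalized to 0-1."""
    err = abs(pred - gt)
    return err / 3 if name == "abs_norm" else err
```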
Risk-Coverage Curve (RC Curve)
Inputs
Given all N item instances:
1. Filter to predicted items S = {i | pred_i is not None}.
2. Compute loss for each i ∈ S.
3. Sort S by confidence descending.
Plateau (Tie) Handling
Confidence is often discrete (evidence counts). We compute working points by grouping equal confidence values:
- Each unique confidence value defines a working point.
- We add all items from that confidence plateau at once.
SSOT: compute_risk_coverage_curve() in src/ai_psychiatrist/metrics/selective_prediction.py.
Working Point Metrics
At working point j after accepting k_j items:
- coverage_j = k_j / N
- selective_risk_j = (sum loss of accepted) / k_j
- generalized_risk_j = (sum loss of accepted) / N
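A sketch of the working-point computation with plateau handling, following the definitions above. The SSOT is `compute_risk_coverage_curve()`; here `items` is a hypothetical list of `(confidence, loss)` pairs for predicted items, and `n_total` is N including abstentions.

```python
from itertools import groupby

def risk_coverage_points(items: list[tuple[float, float]], n_total: int) -> list[dict]:
    """Working points on the RC curve, grouping tied confidences into plateaus."""
    ordered = sorted(items, key=lambda t: t[0], reverse=True)  # confidence descending
    points, k, loss_sum = [], 0, 0.0
    for conf, group in groupby(ordered, key=lambda t: t[0]):
        plateau = list(group)
        k += len(plateau)                             # accept the whole plateau at once
        loss_sum += sum(loss for _, loss in plateau)
        points.append({
            "threshold": conf,
            "coverage": k / n_total,
            "selective_risk": loss_sum / k,           # loss over accepted items
            "generalized_risk": loss_sum / n_total,   # loss over all N items
        })
    return points
```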
AURC and AUGRC (Integration Semantics)
We integrate using trapezoidal rule over [0, Cmax] with an explicit augmentation at coverage=0.
SSOT: _integrate_curve() in src/ai_psychiatrist/metrics/selective_prediction.py.
AURC
- x-axis: coverage
- y-axis: selective risk
- augmentation: right-continuous at 0, i.e. `risk(0) = risk(coverage_1)`
AUGRC
- x-axis: coverage
- y-axis: generalized risk
- augmentation: `generalized_risk(0) = 0`
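A sketch of the trapezoidal integration with the coverage=0 augmentation (the SSOT is `_integrate_curve()`; `points` is the working-point list from the previous sketch):

```python
def integrate_curve(points: list[dict], risk_key: str) -> float:
    """Trapezoidal area under (coverage, risk) over [0, Cmax].

    AURC uses risk_key="selective_risk" with risk(0) = risk at the first
    working point; AUGRC uses risk_key="generalized_risk" with risk(0) = 0.
    """
    xs = [0.0] + [p["coverage"] for p in points]
    y0 = points[0][risk_key] if risk_key == "selective_risk" else 0.0
    ys = [y0] + [p[risk_key] for p in points]
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2  # trapezoid strip
    return area
```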
Optimal and Excess Metrics (Spec 052)
Added in Jan 2026 to measure distance from the theoretical limit.
Optimal Baselines (Oracle CSF)
- AURC_optimal: The AURC achievable if items were perfectly ranked by loss (ascending).
- AUGRC_optimal: The AUGRC achievable under perfect ranking.
Excess Metrics
- `e-AURC = AURC - AURC_optimal`
- `e-AUGRC = AUGRC - AUGRC_optimal`
Interpretation
- `e-AURC = 0` implies the confidence signal perfectly ranks correctness.
- `aurc_gap_pct = (e-AURC / AURC_optimal) * 100` shows the percentage room for improvement.
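A sketch of the oracle baseline and the excess metric, reusing the `risk_coverage_points` and `integrate_curve` functions from the earlier sketches (the SSOT is `src/ai_psychiatrist/metrics/selective_prediction.py`):

```python
def excess_aurc(items: list[tuple[float, float]], n_total: int) -> float:
    """e-AURC = AURC - AURC_optimal under a loss-oracle ranking."""
    actual = integrate_curve(risk_coverage_points(items, n_total), "selective_risk")
    # Oracle CSF: confidence = -loss, so sorting by confidence descending
    # is equivalent to ranking items by loss ascending (perfect ranking).
    oracle = [(-loss, loss) for _, loss in items]
    optimal = integrate_curve(risk_coverage_points(oracle, n_total), "selective_risk")
    return actual - optimal
```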
Achievable AURC (Convex Hull)
- AURC_achievable: The AURC of the lower convex hull of the risk-coverage curve.
- Represents the performance achievable by optimally selecting working points (filtering out suboptimal confidence thresholds).
Truncated Areas and MAE@Coverage
Truncated AURC/AUGRC
We compute truncated areas up to a requested maximum coverage C':
- AURC@C'
- AUGRC@C'
If C' > Cmax, the effective C' becomes Cmax.
SSOT: _integrate_truncated() in src/ai_psychiatrist/metrics/selective_prediction.py (includes linear interpolation to land exactly on C').
MAE@Coverage
MAE@coverage=c is defined as:
- take the first working point where coverage >= c
- return its selective risk
If no working point reaches the requested coverage (i.e., c > Cmax), the value is None.
SSOT: compute_risk_at_coverage() in src/ai_psychiatrist/metrics/selective_prediction.py.
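A sketch of the lookup (the SSOT is `compute_risk_at_coverage()`; `points` is the working-point list from the RC-curve sketch, already in increasing-coverage order):

```python
def risk_at_coverage(points: list[dict], c: float) -> float | None:
    """Selective risk at the first working point with coverage >= c.

    Returns None when no working point reaches c (i.e. c > Cmax).
    """
    for p in points:
        if p["coverage"] >= c:
            return p["selective_risk"]
    return None
```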
Bootstrap Confidence Intervals
We use a participant-cluster bootstrap:
- resample participants with replacement
- include all 8 items per sampled participant
- recompute metrics on the resampled set
SSOT: bootstrap_by_participant() in src/ai_psychiatrist/metrics/bootstrap.py.
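A sketch of the resampling loop (the SSOT is `bootstrap_by_participant()`; `by_participant` is a hypothetical mapping from participant ID to that participant's 8 item records, and `metric_fn` recomputes the statistic of interest on a resampled item list):

```python
import random

def participant_bootstrap(by_participant: dict, metric_fn,
                          n_resamples: int = 10_000, seed: int = 42):
    """95% percentile CI from a participant-cluster bootstrap."""
    rng = random.Random(seed)
    ids = list(by_participant)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(ids, k=len(ids))   # participants, with replacement
        items = [it for pid in sample for it in by_participant[pid]]  # all 8 items each
        stats.append(metric_fn(items))          # recompute metric on the resample
    stats.sort()
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples)]
```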
Paired Deltas (Mode Comparisons)
When evaluating two modes on the same run artifact, we can compute paired deltas:
- delta = metric_right - metric_left
- bootstrap resamples are applied at the participant level across both inputs
SSOT: paired_bootstrap_delta_by_participant() in src/ai_psychiatrist/metrics/bootstrap.py.
Metrics Artifact Output Schema
scripts/evaluate_selective_prediction.py produces a JSON artifact:
```json
{
"schema_version": "1",
"created_at": "2026-01-03T00:00:00Z",
"inputs": [
{"path": "...", "run_id": "...", "git_commit": "...", "mode": "few_shot"}
],
"population": {
"participants_total": 41,
"participants_included": 40,
"participants_failed": 1,
"items_total": 320
},
"loss": {
"name": "abs_norm",
"definition": "abs(pred - gt) / 3",
"raw_multiplier": 3
},
"confidence_variants": {
"llm": {
"cmax": 0.655,
"aurc_full": 0.192,
"augrc_full": 0.058,
"aurc_optimal": 0.110,
"augrc_optimal": 0.035,
"eaurc": 0.082,
"eaugrc": 0.023,
"aurc_achievable": 0.170,
"interpretation": {
"aurc_gap_pct": 74.3,
"augrc_gap_pct": 65.7,
"achievable_gain_pct": 11.5
},
"aurc_at_c": {"requested": 0.5, "used": 0.5, "value": 0.123},
"augrc_at_c": {"requested": 0.5, "used": 0.5, "value": 0.041},
"mae_at_coverage": {"0.10": {"requested": 0.1, "achieved": 0.123, "value": 0.5}},
"bootstrap": {
"seed": 42,
"n_resamples": 10000,
"ci95": {"cmax": [0.6, 0.7], "aurc_full": [0.1, 0.2]}
},
"curve": {
"coverage": [0.123, 0.234],
"selective_risk": [0.500, 0.700],
"generalized_risk": [0.062, 0.164],
"threshold": [3.0, 2.0]
}
}
},
"comparison": {
"enabled": false,
"intersection_only": false,
"deltas": null
}
}
```
Exact keys and nesting are defined in scripts/evaluate_selective_prediction.py (constructs artifact near the end of main()).
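For example, a small snippet for pulling headline numbers out of such an artifact. The key paths follow the schema above; the file path is hypothetical, and keys may differ across artifact versions.

```python
import json

with open("data/outputs/selective_metrics.json") as f:  # hypothetical artifact path
    artifact = json.load(f)

llm = artifact["confidence_variants"]["llm"]
print(f"Cmax:  {llm['cmax']:.3f}")
print(f"AURC:  {llm['aurc_full']:.3f}  (95% CI {llm['bootstrap']['ci95']['aurc_full']})")
print(f"AUGRC: {llm['augrc_full']:.3f}")
```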
How To Run
```bash
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/your_run.json \
  --mode few_shot \
  --confidence default \
  --loss abs \
  --bootstrap-resamples 10000 \
  --seed 42
```
For paired comparisons:
```bash
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/your_run.json --mode zero_shot \
  --input data/outputs/your_run.json --mode few_shot \
  --loss abs \
  --seed 42
```
Related Docs
- Why AURC/AUGRC matter: `docs/statistics/statistical-methodology-aurc-augrc.md`
- Run output format / provenance: `docs/results/run-history.md`