Metrics and Evaluation (Exact Definitions + Output Schema)
Audience: Researchers who need implementation-accurate metrics
Last Updated: 2026-01-03
This page is the canonical (non-archive) reference for how this repo computes and reports:
- coverage
- risk-coverage curves
- AURC / AUGRC
- MAE@coverage
- bootstrap confidence intervals
SSOT implementations:
- src/ai_psychiatrist/metrics/selective_prediction.py
- src/ai_psychiatrist/metrics/bootstrap.py
- scripts/evaluate_selective_prediction.py
Unit of Evaluation (Critical)
We evaluate item instances: one PHQ-8 item for one participant.
For each (participant_id, item):
- gt is always present (0–3)
- pred is either (0–3) or None (abstain)
- confidence is a scalar ranking signal (higher = more confident)
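For concreteness, here is a minimal sketch of one item instance as it might look once parsed. The container and the `participant_id`/`item` field values are illustrative, not the repo's schema; only `gt`, `pred`, and `confidence` are defined above.

```python
# Illustrative item instance (hypothetical container; the real run-artifact
# schema is defined by scripts/reproduce_results.py).
item = {
    "participant_id": "P123",   # hypothetical ID format
    "item": "phq8_sleep",       # hypothetical item key
    "gt": 2,                    # ground truth, always present (0-3)
    "pred": None,               # 0-3, or None on abstention
    "confidence": 0.0,          # scalar ranking signal (higher = more confident)
}
```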
Participant Failures
scripts/reproduce_results.py records per-participant success:
- success=True participants are included in selective prediction metrics.
- success=False participants are counted as reliability failures and excluded from AURC/AUGRC (by design).
This is implemented in scripts/evaluate_selective_prediction.py:parse_items().
Coverage and Cmax
Let:
- P = number of included participants (success=True)
- N = P * 8 total item instances
- K = number of predicted items (pred is not None) across the N items
Then:
- coverage = K / N
- Cmax = K / N (the same value as overall coverage; it is named "max achievable coverage" because abstentions bound how far the risk-coverage curve can extend)
SSOT: compute_cmax() in src/ai_psychiatrist/metrics/selective_prediction.py.
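A minimal sketch of this computation, assuming `preds` is a list of per-item predictions with `None` marking abstentions (the SSOT remains `compute_cmax()`):

```python
def coverage_and_cmax(preds: list[int | None]) -> float:
    """Fraction of item instances with a non-None prediction.

    Coverage and Cmax coincide: both equal K / N, where K is the number
    of predicted items and N = P * 8 is the total item instances.
    """
    n = len(preds)                                # N
    k = sum(1 for p in preds if p is not None)    # K = predicted items
    return k / n
```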
Confidence Variants
scripts/evaluate_selective_prediction.py supports the following confidence variants:
- `llm`: `confidence = llm_evidence_count`
- `total_evidence`: `confidence = llm_evidence_count` (legacy alias; keyword backfill removed in Spec 047)
- `retrieval_similarity_mean` (Spec 046): `confidence = retrieval_similarity_mean if not null else 0.0`
- `retrieval_similarity_max` (Spec 046): `confidence = retrieval_similarity_max if not null else 0.0`
- `hybrid_evidence_similarity` (Spec 046):
  - `e = min(llm_evidence_count, 3) / 3`
  - `s = retrieval_similarity_mean if not null else 0.0`
  - `confidence = 0.5 * e + 0.5 * s`
- `verbalized` (Spec 048):
  - `v = verbalized_confidence` (1–5 scale)
  - `confidence = (v - 1) / 4` (normalized to 0–1; uses 0.5 if null)
- `verbalized_calibrated` (Spec 048):
  - Requires `--calibration` pointing to a `method=temperature_scaling` artifact
  - `v = (verbalized_confidence - 1) / 4` (0–1; uses 0.5 if null)
  - `confidence = sigmoid(logit(v) / T)`, where `T` is fitted on a training run
- `hybrid_verbalized` (Spec 048):
  - `v = (verbalized_confidence - 1) / 4` (0–1; uses 0.5 if null)
  - `e = min(llm_evidence_count, 3) / 3`
  - `s = retrieval_similarity_mean if not null else 0.0`
  - `confidence = 0.4 * v + 0.3 * e + 0.3 * s`
- `calibrated` (Spec 049):
  - Requires `--calibration` pointing to a calibrator artifact (e.g., `method=logistic`)
  - Extracts features from per-item `item_signals` using the artifact's `features` list
  - `confidence = p_correct` (or a calibrated confidence score) produced by the calibrator
- `token_msp` (Spec 051):
  - Requires per-item `token_msp` in `item_signals`
  - `confidence = token_msp` (0–1, higher = more confident)
- `token_pe` (Spec 051):
  - Requires per-item `token_pe` in `item_signals`
  - Stored value is entropy (lower = more confident)
  - `confidence = 1 / (1 + token_pe)` (maps to (0, 1])
- `token_energy` (Spec 051):
  - Requires per-item `token_energy` in `item_signals`
  - Stored value is `logsumexp(top_logprobs.logprob)` over tokens
  - `confidence = exp(token_energy)` (interpretable as cumulative mass captured by `top_logprobs`)
- `secondary:<csf1>+<csf2>:<average|product>` (Spec 051):
  - Combines two base CSFs on the fly
  - Example: `secondary:token_msp+retrieval_similarity_mean:average`
- `consistency` (Spec 050):
  - Requires per-item `consistency_modal_confidence` in `item_signals`
  - `confidence = consistency_modal_confidence`
- `consistency_inverse_std` (Spec 050):
  - Requires per-item `consistency_score_std` in `item_signals`
  - `confidence = 1 / (1 + consistency_score_std)`
- `hybrid_consistency` (Spec 050):
  - Requires per-item `consistency_modal_confidence` in `item_signals`
  - `e = min(llm_evidence_count, 3) / 3`
  - `s = retrieval_similarity_mean if not null else 0.0`
  - `confidence = 0.4 * consistency_modal_confidence + 0.3 * e + 0.3 * s`
These are derived from item_signals in the run output JSON.
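As an illustration, here is a sketch of how the `hybrid_evidence_similarity` variant could be computed from one item's signals, following the formulas above. The function name and the `signals` dict are assumptions for the example; the SSOT is `scripts/evaluate_selective_prediction.py`.

```python
def hybrid_evidence_similarity(signals: dict) -> float:
    """Spec 046 hybrid: equal-weight blend of evidence count and retrieval similarity."""
    e = min(signals.get("llm_evidence_count", 0), 3) / 3
    sim = signals.get("retrieval_similarity_mean")
    s = sim if sim is not None else 0.0   # null similarity falls back to 0.0
    return 0.5 * e + 0.5 * s
```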
Calibration Artifacts
Calibration maps raw confidence to calibrated probabilities (typically P(correct)). Artifacts are JSON files generated by training scripts.
Generating calibrators:
```bash
# Temperature scaling for verbalized confidence (Spec 048)
uv run python scripts/calibrate_verbalized_confidence.py \
  --input data/outputs/train_run.json \
  --mode few_shot \
  --output data/outputs/calibration_verbalized_temperature_scaling_fewshot.json

# Supervised calibrator (Spec 049)
uv run python scripts/train_confidence_calibrator.py \
  --input data/outputs/train_run.json \
  --mode few_shot \
  --method logistic \
  --features verbalized_confidence,retrieval_similarity_mean \
  --output data/outputs/calibration_logistic_fewshot.json
```
Calibrator types (SSOT: src/ai_psychiatrist/calibration/calibrators.py):
| Type | Description | Use Case |
|---|---|---|
| `TemperatureScalingCalibrator` | Single-param scaling: `sigmoid(logit(p)/T)` | Verbalized confidence |
| `LogisticCalibrator` | Logistic regression on feature vector | Multi-feature supervised |
| `LinearCalibrator` | Linear regression (for continuous targets) | Regression-style calibration |
| `IsotonicCalibrator` | Piecewise-linear monotonic | Non-parametric calibration |
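A minimal sketch of the temperature-scaling transform from the table (the SSOT is `TemperatureScalingCalibrator` in `calibrators.py`; this standalone version assumes `p` is already normalized to (0, 1), and the clamping epsilon is an illustrative choice):

```python
import math

def temperature_scale(p: float, T: float, eps: float = 1e-6) -> float:
    """Apply sigmoid(logit(p) / T). T > 1 flattens confidences, T < 1 sharpens."""
    p = min(max(p, eps), 1 - eps)        # clamp away from 0/1 so logit is finite
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit / T))
```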
Calibration metrics:
- ECE (Expected Calibration Error): Mean |accuracy - confidence| across bins. SSOT: compute_ece() in calibrators.py.
- NLL (Negative Log-Likelihood): Log-loss for binary correctness. SSOT: compute_binary_nll() in calibrators.py.
Feature extraction: CalibratorFeatureExtractor in src/ai_psychiatrist/calibration/feature_extraction.py extracts numeric features from item_signals with conservative defaults for missing values.
Retrieval-Signal Availability (Spec 046)
The retrieval-based confidence variants require the following per-item keys inside
item_signals:
- retrieval_reference_count
- retrieval_similarity_mean
- retrieval_similarity_max
Run artifacts produced by older versions of scripts/reproduce_results.py will not
contain these keys. In that case, scripts/evaluate_selective_prediction.py will raise
a clear error if a retrieval-based confidence variant is requested (it will not silently
substitute missing values).
Verbalized-Confidence Availability (Spec 048)
The verbalized confidence variants require the following per-item key inside
item_signals:
- verbalized_confidence
Artifacts produced by older versions of scripts/reproduce_results.py will not contain
this key. In that case, scripts/evaluate_selective_prediction.py will raise a clear
error if a verbalized-confidence variant is requested.
Token-Confidence Availability (Spec 051)
The token-level confidence variants require the following per-item keys inside
item_signals:
- token_msp
- token_pe
- token_energy
These values are only populated when the quantitative scorer backend returns logprobs.
If the backend does not support logprobs, the keys may be present but null; requesting
token_* confidence variants will then raise a clear error (no silent fallback).
Consistency-Signal Availability (Spec 050)
The consistency-based confidence variants require the following per-item keys inside
item_signals:
- consistency_modal_confidence
- consistency_score_std
These are only populated when the run was produced with multi-sample scoring enabled
(--consistency-samples > 1 in scripts/reproduce_results.py or CONSISTENCY_ENABLED=true).
Loss Functions
Two loss functions are supported:
- `abs`: `|pred - gt|`
- `abs_norm`: `|pred - gt| / 3` (range 0–1)
SSOT: _compute_loss() in src/ai_psychiatrist/metrics/selective_prediction.py.
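A sketch of the two losses (the SSOT is `_compute_loss()`; the function name here is illustrative):

```python
def compute_loss(pred: int, gt: int, name: str = "abs_norm") -> float:
    """Absolute error on the 0-3 PHQ-8 item scale, optionally normalized to 0-1."""
    err = abs(pred - gt)
    return err / 3 if name == "abs_norm" else err
```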
Risk-Coverage Curve (RC Curve)
Inputs
Given all N item instances:
1. Filter to predicted items S = {i | pred_i is not None}.
2. Compute loss for each i ∈ S.
3. Sort S by confidence descending.
Plateau (Tie) Handling
Confidence is often discrete (evidence counts). We compute working points by grouping equal confidence values:
- Each unique confidence value defines a working point.
- We add all items from that confidence plateau at once.
SSOT: compute_risk_coverage_curve() in src/ai_psychiatrist/metrics/selective_prediction.py.
Working Point Metrics
At working point j after accepting k_j items:
- coverage_j = k_j / N
- selective_risk_j = (sum loss of accepted) / k_j
- generalized_risk_j = (sum loss of accepted) / N
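A sketch of the working-point computation with plateau handling, following the definitions above. The SSOT is `compute_risk_coverage_curve()`; here `items` is a hypothetical list of `(confidence, loss)` pairs for predicted items, and `n_total` is N including abstentions.

```python
from itertools import groupby

def risk_coverage_points(items: list[tuple[float, float]], n_total: int) -> list[dict]:
    """Working points on the RC curve, grouping tied confidences into plateaus."""
    ordered = sorted(items, key=lambda t: t[0], reverse=True)  # confidence descending
    points, k, loss_sum = [], 0, 0.0
    for conf, group in groupby(ordered, key=lambda t: t[0]):
        plateau = list(group)
        k += len(plateau)                             # accept the whole plateau at once
        loss_sum += sum(loss for _, loss in plateau)
        points.append({
            "threshold": conf,
            "coverage": k / n_total,
            "selective_risk": loss_sum / k,           # loss over accepted items
            "generalized_risk": loss_sum / n_total,   # loss over all N items
        })
    return points
```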
AURC and AUGRC (Integration Semantics)
We integrate using trapezoidal rule over [0, Cmax] with an explicit augmentation at coverage=0.
SSOT: _integrate_curve() in src/ai_psychiatrist/metrics/selective_prediction.py.
AURC
- x-axis: coverage
- y-axis: selective risk
- augmentation: right-continuous at 0, i.e. `risk(0) = risk(coverage_1)`
AUGRC
- x-axis: coverage
- y-axis: generalized risk
- augmentation: `generalized_risk(0) = 0`
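A sketch of the trapezoidal integration with the coverage=0 augmentation (the SSOT is `_integrate_curve()`; `points` is the working-point list from the previous sketch):

```python
def integrate_curve(points: list[dict], risk_key: str) -> float:
    """Trapezoidal area under (coverage, risk) over [0, Cmax].

    AURC uses risk_key="selective_risk" with risk(0) = risk at the first
    working point; AUGRC uses risk_key="generalized_risk" with risk(0) = 0.
    """
    xs = [0.0] + [p["coverage"] for p in points]
    y0 = points[0][risk_key] if risk_key == "selective_risk" else 0.0
    ys = [y0] + [p[risk_key] for p in points]
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2  # trapezoid strip
    return area
```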
Optimal and Excess Metrics (Spec 052)
Added in Jan 2026 to measure distance from the theoretical limit.
Optimal Baselines (Oracle CSF)
- AURC_optimal: The AURC achievable if items were perfectly ranked by loss (ascending).
- AUGRC_optimal: The AUGRC achievable under perfect ranking.
Excess Metrics
- `e-AURC = AURC - AURC_optimal`
- `e-AUGRC = AUGRC - AUGRC_optimal`
Interpretation
- `e-AURC = 0` implies the confidence signal perfectly ranks correctness.
- `aurc_gap_pct = (e-AURC / AURC_optimal) * 100` shows the percentage room for improvement.
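A sketch of the oracle baseline and the excess metric, reusing the `risk_coverage_points` and `integrate_curve` functions from the earlier sketches (the SSOT is `src/ai_psychiatrist/metrics/selective_prediction.py`):

```python
def excess_aurc(items: list[tuple[float, float]], n_total: int) -> float:
    """e-AURC = AURC - AURC_optimal under a loss-oracle ranking."""
    actual = integrate_curve(risk_coverage_points(items, n_total), "selective_risk")
    # Oracle CSF: confidence = -loss, so sorting by confidence descending
    # is equivalent to ranking items by loss ascending (perfect ranking).
    oracle = [(-loss, loss) for _, loss in items]
    optimal = integrate_curve(risk_coverage_points(oracle, n_total), "selective_risk")
    return actual - optimal
```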
Achievable AURC (Convex Hull)
- AURC_achievable: The AURC of the lower convex hull of the risk-coverage curve.
- Represents the performance achievable by optimally selecting working points (filtering out suboptimal confidence thresholds).
Truncated Areas and MAE@Coverage
Truncated AURC/AUGRC
We compute truncated areas up to a requested maximum coverage C':
- AURC@C'
- AUGRC@C'
If C' > Cmax, the effective C' becomes Cmax.
SSOT: _integrate_truncated() in src/ai_psychiatrist/metrics/selective_prediction.py (includes linear interpolation to land exactly on C').
MAE@Coverage
MAE@coverage=c is defined as:
- take the first working point where coverage >= c
- return its selective risk
If no working point reaches the requested coverage (i.e., c > Cmax), the value is None.
SSOT: compute_risk_at_coverage() in src/ai_psychiatrist/metrics/selective_prediction.py.
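A sketch of the lookup (the SSOT is `compute_risk_at_coverage()`; `points` is the working-point list from the RC-curve sketch, already in increasing-coverage order):

```python
def risk_at_coverage(points: list[dict], c: float) -> float | None:
    """Selective risk at the first working point with coverage >= c.

    Returns None when no working point reaches c (i.e. c > Cmax).
    """
    for p in points:
        if p["coverage"] >= c:
            return p["selective_risk"]
    return None
```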
Bootstrap Confidence Intervals
We use a participant-cluster bootstrap:
- resample participants with replacement
- include all 8 items per sampled participant
- recompute metrics on the resampled set
SSOT: bootstrap_by_participant() in src/ai_psychiatrist/metrics/bootstrap.py.
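A sketch of the resampling loop (the SSOT is `bootstrap_by_participant()`; `by_participant` is a hypothetical mapping from participant ID to that participant's 8 item records, and `metric_fn` recomputes the statistic of interest on a resampled item list):

```python
import random

def participant_bootstrap(by_participant: dict, metric_fn,
                          n_resamples: int = 10_000, seed: int = 42):
    """95% percentile CI from a participant-cluster bootstrap."""
    rng = random.Random(seed)
    ids = list(by_participant)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(ids, k=len(ids))   # participants, with replacement
        items = [it for pid in sample for it in by_participant[pid]]  # all 8 items each
        stats.append(metric_fn(items))          # recompute metric on the resample
    stats.sort()
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples)]
```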
Paired Deltas (Mode Comparisons)
When evaluating two modes on the same run artifact, we can compute paired deltas:
- delta = metric_right - metric_left
- bootstrap resamples are applied at the participant level across both inputs
SSOT: paired_bootstrap_delta_by_participant() in src/ai_psychiatrist/metrics/bootstrap.py.
Metrics Artifact Output Schema
scripts/evaluate_selective_prediction.py produces a JSON artifact:
```json
{
"schema_version": "1",
"created_at": "2026-01-03T00:00:00Z",
"inputs": [
{"path": "...", "run_id": "...", "git_commit": "...", "mode": "few_shot"}
],
"population": {
"participants_total": 41,
"participants_included": 40,
"participants_failed": 1,
"items_total": 320
},
"loss": {
"name": "abs_norm",
"definition": "abs(pred - gt) / 3",
"raw_multiplier": 3
},
"confidence_variants": {
"llm": {
"cmax": 0.655,
"aurc_full": 0.192,
"augrc_full": 0.058,
"aurc_optimal": 0.110,
"augrc_optimal": 0.035,
"eaurc": 0.082,
"eaugrc": 0.023,
"aurc_achievable": 0.170,
"interpretation": {
"aurc_gap_pct": 74.3,
"augrc_gap_pct": 65.7,
"achievable_gain_pct": 11.5
},
"aurc_at_c": {"requested": 0.5, "used": 0.5, "value": 0.123},
"augrc_at_c": {"requested": 0.5, "used": 0.5, "value": 0.041},
"mae_at_coverage": {"0.10": {"requested": 0.1, "achieved": 0.123, "value": 0.5}},
"bootstrap": {
"seed": 42,
"n_resamples": 10000,
"ci95": {"cmax": [0.6, 0.7], "aurc_full": [0.1, 0.2]}
},
"curve": {
"coverage": [0.123, 0.234],
"selective_risk": [0.500, 0.700],
"generalized_risk": [0.062, 0.164],
"threshold": [3.0, 2.0]
}
}
},
"comparison": {
"enabled": false,
"intersection_only": false,
"deltas": null
}
}
```
Exact keys and nesting are defined in scripts/evaluate_selective_prediction.py (constructs artifact near the end of main()).
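For example, a small snippet for pulling headline numbers out of such an artifact. The key paths follow the schema above; the file path is hypothetical, and keys may differ across artifact versions.

```python
import json

with open("data/outputs/selective_metrics.json") as f:  # hypothetical artifact path
    artifact = json.load(f)

llm = artifact["confidence_variants"]["llm"]
print(f"Cmax:  {llm['cmax']:.3f}")
print(f"AURC:  {llm['aurc_full']:.3f}  (95% CI {llm['bootstrap']['ci95']['aurc_full']})")
print(f"AUGRC: {llm['augrc_full']:.3f}")
```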
How To Run
```bash
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/your_run.json \
  --mode few_shot \
  --confidence default \
  --loss abs \
  --bootstrap-resamples 10000 \
  --seed 42
```
For paired comparisons:
```bash
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/your_run.json --mode zero_shot \
  --input data/outputs/your_run.json --mode few_shot \
  --loss abs \
  --seed 42
```
Related Docs
- Why AURC/AUGRC matter: `docs/statistics/statistical-methodology-aurc-augrc.md`
- Run output format / provenance: `docs/results/run-history.md`