Complete Run History & Statistical Analysis

Purpose: Comprehensive record of all reproduction runs, code changes, and statistical analyses for posterity.

Last Updated: 2026-01-08

⚠️ CRITICAL: Run Integrity Warnings

Prompt Confound Bug (BUG-035) - Fixed 2026-01-06

A prompt confound was discovered and fixed on 2026-01-06 where few-shot mode produced different prompts than zero-shot even when retrieval returned zero references.

What happened: When format_for_prompt() had no valid references, it returned:

<Reference Examples>
No valid evidence found
</Reference Examples>

Instead of an empty string. This meant few-shot prompts always differed from zero-shot, even when retrieval contributed nothing.

Impact on Comparative Claims: - Any claim that "few-shot is worse/better than zero-shot" is confounded - The observed difference could be due to: (1) actual retrieval effect, (2) the "No valid evidence found" message anchoring the model, or (3) interaction of both - The message may have caused the model to be more conservative/abstain more in few-shot mode

Status by Run:

Run	Affected?	Notes
Run 1-12	Yes	All comparative claims between modes are confounded
Future runs	No	Fix deployed: empty retrieval = identical to zero-shot

Fix Applied: Commit on 2026-01-06 - format_for_prompt() now returns "" when no valid entries - Few-shot with no retrieval results now produces identical prompt to zero-shot

Recommendation: Re-run comparative experiments post-fix to measure true retrieval effect.

See: BUG-035

Silent Fallback Bug (ANALYSIS-026) - Fixed 2026-01-03

A critical bug was discovered and fixed on 2026-01-03 where _extract_evidence() would silently return {} on JSON parse failure instead of raising an exception.

Impact on Mode Isolation:

Few-shot mode with empty evidence {} → no reference bundle → effectively zero-shot
This violated the independence of zero-shot and few-shot as research methodologies
Published results claiming "few-shot" could have been partially zero-shot

Status by Run:

Run	Code Version	Affected?	Notes
Run 1-9	Pre-fix	Unknown	Bug was SILENT - no way to know without re-running
Run 10	Pre-fix (git dirty)	Yes	Completed but invalid (zero-shot partial, few-shot failed entirely)
Future runs	Post-fix	No	Will fail loudly if JSON parsing fails

Why we can't be certain about Run 1-9:

The bug only triggers if LLM returns malformed JSON
If LLM always returned valid JSON, bug never triggered
Results looked plausible at the time (e.g., some runs reported few-shot < zero-shot MAE after chunk scoring), but those runs are pre-BUG-035 and are confounded for cross-mode comparisons
But we have NO PROOF the bug never triggered

Fix Applied: Commit on 2026-01-03

_extract_evidence() now raises json.JSONDecodeError on failure
Uses format="json" for grammar-level JSON constraint
All parsers use canonical parse_llm_json() function

Recommendation: For publication-quality results, consider re-running with post-fix code.

See: docs/_archive/bugs/ANALYSIS-026_JSON_PARSING_ARCHITECTURE_AUDIT.md

Invalid JSON Output Bug (BUG-048) - Fixed 2026-01-08

Some historical run artifacts may contain NaN/Infinity floating-point literals in the JSON output when a metric is undefined (e.g., an evaluation subset is empty). These outputs are not strict JSON and will fail parsers like jq.

Fix Applied: - The runner now serializes strict JSON (allow_nan=False). - Non-finite aggregate metrics are emitted as null instead of NaN/Infinity.

See: docs/_bugs/BUG-048-invalid-json-output-nan-metrics.md

Quick Reference: Current Best Results

All values below use loss=abs_norm and 1,000 participant-level bootstrap resamples.

Run 13: First Clean Run POST BUG-035 Fix ✅

This is the authoritative baseline for zero-shot vs few-shot comparisons (no prompt confound).

Mode	MAE_item	AURC (`llm`)	Best AURC	Best AUGRC	Cmax
Zero-shot	0.6079	0.107	0.098 (`consistency_inverse_std`)	0.024 (`consistency_inverse_std`)	48.8%
Few-shot	0.6571	0.115	0.091 (`token_pe`)	0.025 (`token_pe`)	48.5%

Paper Comparison (MAE_item)

Mode	Paper	Run 13	Delta
Zero-shot	0.796	0.6079	-24% (we're better)
Few-shot	0.619	0.6571	+6% (paper's better)

Historical Reference (Run 12, pre-BUG-035 fix)

Mode	AURC	AUGRC	Cmax	Notes
Zero-shot	0.102 [0.081-0.121]	0.025 [0.019-0.032]	48.5%	Pre-fix baseline
Few-shot	0.109 [0.084-0.133]	0.024 [0.018-0.032]	46.0%	Confounded (BUG-035)

Note: Run 1-12 few-shot results are confounded by BUG-035 (prompt contained "No valid evidence found" message). Use Run 13+ for valid zero-shot vs few-shot comparisons.

Note: Cmax is the max coverage in the risk–coverage curve (counts participants with 8/8 N/A as 0 coverage). MAE_w is computed over evaluated subjects only.

Run 14: Spec 063 Severity Inference (`infer`) Ablation (Coverage ↑; Risk ↑)

Run 14 (data/outputs/both_paper-test_20260108_114058.json) enables severity inference (--severity-inference infer) while keeping consistency sampling enabled.

Mode	MAE_item	AURC (`llm`)	Best AURC	Best AUGRC	Cmax
Zero-shot	0.7030	0.129	0.126 (`hybrid_consistency`)	0.038 (`hybrid_consistency`)	60.1%
Few-shot	0.7843	0.147	0.139 (`consistency_inverse_std`)	0.039 (`token_energy`)	57.5%

Key result: Compared to Run 13 (strict baseline), infer increases Cmax by ~8–11 points, but worsens AURC/AUGRC significantly (paired deltas are positive for most confidence variants).

Why AURC/AUGRC Instead of MAE?

MAE comparisons are not coverage-adjusted when coverages differ.

Run 7 Cmax: zero-shot 56.9%, few-shot 65.9%
Run 8 Cmax: zero-shot 48.8%, few-shot 50.9%

When one system predicts on more items, those additional items are inherently harder cases that another system abstained from. Comparing raw MAE without a coverage-adjusted metric is like comparing a surgeon who only takes easy cases vs one who takes hard cases.

AURC/AUGRC integrate over the entire risk-coverage curve, providing a fair comparison regardless of coverage differences.

See: docs/statistics/statistical-methodology-aurc-augrc.md

Run Timeline (Chronological)

Run 1: Dec 26, 2025 - Initial Validated Runs

Artifacts: Not retained in this repo snapshot (early outputs used different naming and were not committed). Treat this run as historical context only; later runs include stored JSON artifacts under data/outputs/.

Git Commits: Various (5b8f588, f6d2653)

Code State: - Pre-Spec 31/32 (old reference format) - 8 separate <Reference Examples> blocks per PHQ-8 item - Per-item headers like [Sleep] - XML-style closing tags </Reference Examples> - Empty items showed "No valid evidence found"

Results:

Mode	AURC	AUGRC	Cmax	MAE_w
Zero-shot	~0.134	~0.037	55.5%	0.698
Few-shot	~0.21	~0.07	71.6%	0.860

Notes: Initial baseline. Few-shot significantly worse than zero-shot.

Run 2: Dec 27, 2025 - Pre-Spec 31/32 Full Run

File: paper_test_full_run_20251228.json (filename misleading - actually Dec 27)

Git Commit: 0a98662

Timestamp: 2025-12-27T23:10:45

Code State: Same as Run 1 (pre-Spec 31/32)

Results:

Mode	AURC	AUGRC	Cmax	MAE_w	MAE_item
Zero-shot	0.134	0.037	55.5%	0.698	0.717
Few-shot	0.214	0.074	71.9%	0.804	N/A

Statistical Analysis: AURC computed via scripts/evaluate_selective_prediction.py

Run 3: Dec 29, 2025 - Post-Spec 31/32 (Legacy Prompt Format)

File: both_paper-test_backfill-off_20251229_003543.json

Git Commit: 7d54d98

Timestamp: 2025-12-28T21:39:32

Code Changes (Spec 31/32): - Single unified <Reference Examples> block - Inline labels: (PHQ8_Sleep Score: 2) instead of (Score: 2) - Empty items skipped entirely (no per-item blocks) - Same opening/closing tag: <Reference Examples> (not XML-style)

Results:

Mode	AURC	AUGRC	Cmax	MAE_w	MAE_item	MAE_subj
Zero-shot	0.134	0.037	55.5%	0.698	0.717	0.640
Few-shot	0.193	0.065	70.1%	0.774	0.762	0.712

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI
Zero-shot	[0.094, 0.176]	[0.024, 0.053]	[0.473, 0.640]
Few-shot	[0.142, 0.244]	[0.043, 0.091]	[0.604, 0.799]

Statistical Analysis: - Computed 2025-12-29 via scripts/evaluate_selective_prediction.py --seed 42 - Metrics files: selective_prediction_metrics_20251229T164344Z.json (zero-shot), selective_prediction_metrics_20251229T164403Z.json (few-shot) - Paired comparison: selective_prediction_metrics_20251229T1644_paired.json (ΔAURC = +0.058 [0.016, 0.107], few-shot − zero-shot)

Run 4: Dec 29, 2025 - Spec 33 Development Snapshot (Pre-merge)

File: both_paper-test_backfill-off_20251229_173727.json

Git Commit: 5e62455 (pre-merge dev commit; not on main)

Timestamp: 2025-12-29T14:41:44

Code Changes (Spec 33): - Retrieval quality guardrails (similarity threshold + per-item reference budget) - XML-style closing tag: </Reference Examples> (deviates from notebook tag mirroring)

Results (single-run metrics; note different included-N due to one zero-shot failure):

Mode	AURC	AUGRC	Cmax	MAE_w	N_included (AURC)
Zero-shot	0.138	0.039	56.9%	0.698	40
Few-shot	0.192	0.058	65.5%	0.777	41

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI	N_included (AURC)
Zero-shot	[0.097, 0.180]	[0.025, 0.055]	[0.491, 0.650]	40
Few-shot	[0.144, 0.243]	[0.039, 0.081]	[0.555, 0.753]	41

Statistical Analysis: - Computed 2025-12-29 via scripts/evaluate_selective_prediction.py --seed 42 - Metrics files: selective_prediction_metrics_20251229T231237Z.json (zero-shot), selective_prediction_metrics_20251229T231302Z.json (few-shot) - Paired comparison (overlap N=40 due to one zero-shot failure): selective_prediction_metrics_20251229T233314Z.json (ΔAURC = +0.058 [0.010, 0.109], few-shot − zero-shot)

Note on comparability: The paired comparison recomputes both modes on the overlap only (N=40). On that overlap, few-shot is slightly worse than the single-mode table above (AURC ≈ 0.196, AUGRC ≈ 0.060) because the dropped participant only affects the paired analysis, not the standalone few-shot evaluation.

Note: This was a pre-merge development snapshot. See Run 5 for the clean, post-merge Spec 33+34 ablation run.

Run 4b: Dec 30, 2025 - Post-Spec 34 Regression (Query Embedding Timeouts)

File: both_paper_backfill-off_20251230_053108.json

Git Commit: be35e35 (dirty)

Timestamp: 2025-12-29T23:34:42

What went wrong: - Few-shot had 9/41 failures (22%), all "LLM request timed out after 120s". - Runtime roughly doubled vs the expected ~95 minutes.

Root cause (since fixed): - Spec 37 was required (batch query embedding + configurable query embedding timeout).

Results (includes failures; do not treat as a valid baseline):

Mode	AURC	AUGRC	Cmax	MAE_w	N_included (AURC)	Failed
Zero-shot	0.138	0.039	56.9%	0.698	40	1
Few-shot	0.163	0.037	53.5%	0.745	32	9

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI
Zero-shot	[0.097, 0.180]	[0.025, 0.055]	[0.491, 0.650]
Few-shot	[0.098, 0.217]	[0.020, 0.060]	[0.426, 0.648]

Paired comparison (overlap N=31 due to failures): ΔAURC = +0.037 [-0.028, +0.087] (few-shot − zero-shot).

Run 5: Dec 30, 2025 - Post-Spec 33+34 (Full Ablation)

File: both_paper-test_backfill-off_20251230_230349.json

Git Commit: 36995f0 (clean)

Timestamp: 2025-12-30T20:27:38

Code Changes (Spec 33+34): - Spec 33: Retrieval quality guardrails (min_similarity=0.3, max_chars_per_item=500) - Spec 34: Item-tag filtering (only retrieve domain-matched chunks) - Spec 35/36: NOT enabled (chunk scores file doesn't exist)

Results:

Mode	AURC	AUGRC	Cmax	MAE_w	N_included
Zero-shot	0.138	0.039	56.9%	0.698	40
Few-shot	0.213	0.073	71.0%	0.807	41

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI
Zero-shot	[0.097, 0.180]	[0.025, 0.055]	[0.491, 0.650]
Few-shot	[0.153, 0.276]	[0.047, 0.103]	[0.610, 0.805]

Statistical Analysis: - Computed 2025-12-30 via scripts/evaluate_selective_prediction.py --seed 42 - Metrics files: selective_prediction_metrics_run5_zero_shot.json, selective_prediction_metrics_run5_few_shot.json

Comparison vs Run 3 (Spec 31/32 baseline):

Metric	Run 3	Run 5	Delta	% Change
few_shot AURC	0.193	0.213	+0.020	+10% (worse)
few_shot AUGRC	0.065	0.073	+0.008	+12% (worse)
zero_shot AURC	0.134	0.138	+0.004	+3% (noise)

Key Finding: Spec 33+34 did NOT improve few-shot. Performance regressed ~10%.

Interpretation: Domain filtering (Spec 34) and quality guardrails (Spec 33) cannot fix the fundamental chunk-scoring problem documented in HYPOTHESIS-FEWSHOT-DESIGN-FLAW.md. Chunks still have participant-level scores, not chunk-specific scores. Filtering by domain helps retrieval precision but doesn't fix the misleading score labels.

Conclusion: Spec 35 (chunk-level scoring) is required before further ablations are meaningful.

Spec 31/32 Impact Analysis

What Changed

Aspect	Before (Old Format)	After (Spec 31/32)
Block structure	8 separate blocks	1 unified block
Item labels	`[Sleep]` header	`(PHQ8_Sleep Score: X)` inline
Empty items	"No valid evidence found"	Omitted entirely
Closing tag	`</Reference Examples>`	`<Reference Examples>`

Impact on Metrics

Metric	Pre-Spec 31	Post-Spec 31	Delta	% Change
Zero-shot AURC	0.134	0.134	0	0%
Zero-shot AUGRC	0.037	0.037	0	0%
Few-shot AURC	0.214	0.193	-0.021	-10%
Few-shot AUGRC	0.074	0.065	-0.009	-12%
Few-shot MAE_w	0.804	0.774	-0.030	-3.7%
Few-shot Cmax	71.9%	70.1%	-1.8%	-2.5%

Interpretation

Zero-shot unchanged: Expected - doesn't use reference examples
Few-shot improved 10-12%: Legacy prompt format helps
Gap remains ~30%: Zero-shot still significantly better (0.134 vs 0.193)
Paired bootstrap delta excludes 0: Statistically significant difference at α=0.05

Key Findings

1. Few-Shot vs Zero-Shot (Paper Claim)

The paper claims few-shot beats zero-shot (by item-level MAE).

Update (Run 8): With participant-only transcript preprocessing + chunk scoring enabled, few-shot matches the paper’s reported MAE_item and slightly beats it:

Metric	Paper (reported)	Run 8 (participant-only)
Better mode (by MAE_item)	Few-shot	Few-shot
Few-shot MAE_item	0.619	0.609
Zero-shot MAE_item	0.796	0.776

Note: Earlier runs (Run 3 / Run 7) still showed zero-shot as better on AURC due to the confidence/coverage tradeoff; Run 8 changes the retrieval setting but lowers Cmax substantially.

Possible explanations (partially addressed by Specs 33-35 and transcript preprocessing): 1. Reference example quality issues 2. Embedding similarity matches topic, not severity 3. Low-similarity references inject noise 4. Model overconfidence with few-shot

2. Paper's MAE Comparison Was Not Coverage-Adjusted

The paper compared MAE at different coverages without analyzing the risk–coverage tradeoff. MAE alone does not establish dominance when abstention rates differ.

3. Formatting Matters But Isn't Everything

Spec 31/32 improved few-shot by ~10%, proving formatting matters. Retrieval quality still dominates: chunk scoring (Spec 35) and participant-only transcripts (Run 8) substantially change outcomes, but coverage/confidence tradeoffs remain.

Pending Work

Specs 33-36: Retrieval Quality Fixes

Spec	Description	Status	Result
33	Similarity threshold + context budget	✅ Implemented + tested	No improvement (Run 5)
34	Item-tagged reference embeddings	✅ Implemented + tested	No improvement (Run 5)
35	Offline chunk-level PHQ-8 scoring	✅ Implemented + tested	29% improvement (Run 7)
36	CRAG reference validation	✅ Implemented (optional)	Pending ablation (runtime cost)

Run 5 Conclusion: Spec 33+34 alone did not improve few-shot.

Run 7 Conclusion: Spec 35 chunk-level scoring improved few-shot AURC by 29% (0.213 → 0.151). Gap to zero-shot closed to 9% (CIs overlap).

Run 8 Conclusion: Participant-only transcript preprocessing reaches paper MAE_item parity, but reduces Cmax substantially; next work is improving confidence signals for AURC/AUGRC (Spec 046: docs/_specs/spec-046-selective-prediction-confidence-signals.md) and then revisiting coverage.

Run 6: Dec 31, 2025 - Spec 35 Chunk Scoring Preprocessing

Log File: data/outputs/run6_spec35_20251231_122458.log

Purpose: Generate chunk-level PHQ-8 scores (Spec 35 preprocessing step)

Configuration: - Embeddings: ollama_qwen3_8b_paper_train.npz - Scorer model: gemma3:27b-it-qat - Backend: Ollama - Temperature: 0.0

Output: data/embeddings/ollama_qwen3_8b_paper_train.chunk_scores.json

Notes: This was a preprocessing run to generate chunk scores, not an evaluation run. See Run 7 for the subsequent evaluation.

Run 7: Jan 1, 2026 - Post-Spec 35 Chunk Scoring (Full Run)

File: both_paper-test_backfill-off_20260101_111354.json

Git Commit: Current dev branch

Timestamp: 2026-01-01T11:13:54

Code State: - Spec 33: Retrieval quality guardrails ✅ - Spec 34: Item-tag filtering ✅ - Spec 35: Chunk-level scoring ✅ (EMBEDDING_REFERENCE_SCORE_SOURCE=chunk) - Spec 37: Batch query embedding ✅

Results:

Mode	AURC	AUGRC	Cmax	MAE_w	MAE_item	MAE_subj	N_included	Failed
Zero-shot	0.138	0.039	56.9%	0.698	0.717	0.640	40	1
Few-shot	0.151	0.048	65.9%	0.639	0.636	0.606	41	0

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI
Zero-shot	[0.097, 0.180]	[0.025, 0.055]	[0.491, 0.650]
Few-shot	[0.109, 0.194]	[0.033, 0.065]	[0.570, 0.747]

Statistical Analysis: - Computed 2026-01-01 via scripts/evaluate_selective_prediction.py --seed 42 - Metrics files: selective_prediction_metrics_20260101T165303Z.json (zero-shot), selective_prediction_metrics_20260101T165328Z.json (few-shot)

Known Issue: Participant 339 failed in zero-shot mode due to JSON parsing error (missing comma). See GitHub Issue #84.

Comparison vs Run 5:

Metric	Run 5	Run 7	Delta	% Change
few_shot AURC	0.213	0.151	-0.062	-29% (better)
few_shot AUGRC	0.073	0.048	-0.025	-34% (better)
zero_shot AURC	0.138	0.138	0.000	0% (unchanged)

Key Finding: With Spec 35 chunk-level scoring enabled, few-shot improved 29% on AURC vs Run 5. Few-shot now has better MAE (0.639 vs 0.698) but AURC is still slightly worse due to confidence calibration.

Interpretation: Spec 35 significantly improved few-shot performance. The remaining gap is now within statistical noise (CIs overlap). The next lever was participant-only transcript preprocessing (implemented in Run 8).

Run 8: Jan 2, 2026 - Participant-Only Transcript Preprocessing (Full Run)

File: both_paper-test_backfill-off_20260102_065249.json

Log: repro_post_preprocessing_20260101_183533.log

Run ID: 19b42478

Git Commit: 1b48d7a (dirty)

Timestamp: 2026-01-02T04:22:43

Code State: - Spec 33: Retrieval quality guardrails ✅ - Spec 34: Item-tag filtering ✅ - Spec 35: Chunk-level scoring ✅ (EMBEDDING_REFERENCE_SCORE_SOURCE=chunk) - Spec 37: Batch query embedding ✅ - Transcript preprocessing: participant-only turns ✅ (data/transcripts_participant_only/)

Reference Artifacts: - Few-shot embeddings: data/embeddings/huggingface_qwen3_8b_paper_train_participant_only.npz - Chunk scores sidecar: data/embeddings/huggingface_qwen3_8b_paper_train_participant_only.chunk_scores.json (loaded; train participants=58)

Results:

Mode	AURC	AUGRC	Cmax	MAE_w	MAE_item	MAE_subj	N_included	Failed
Zero-shot	0.141	0.031	48.8%	0.744	0.776	0.736	41	0
Few-shot	0.125	0.031	50.9%	0.706	0.609	0.688	40	1

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI
Zero-shot	[0.108, 0.174]	[0.022, 0.043]	[0.412, 0.567]
Few-shot	[0.099, 0.151]	[0.022, 0.041]	[0.447, 0.575]

Statistical Analysis: - Computed 2026-01-02 via scripts/evaluate_selective_prediction.py --loss abs_norm --seed 42 - Metrics files: selective_prediction_metrics_20260102T132843Z.json (zero-shot), selective_prediction_metrics_20260102T132902Z.json (few-shot) - Paired comparison (overlap N=40; --intersection-only): selective_prediction_metrics_20260102T132930Z_paired.json (ΔAURC = -0.020 [-0.053, +0.014], few-shot − zero-shot)

Paper MAE comparison (MAE_item): - Zero-shot: 0.776 vs paper 0.796 (better) - Few-shot: 0.609 vs paper 0.619 (better)

Interpretation (first principles): - Accuracy vs abstention: In Run 8, both modes abstain at similar rates (Cmax ~49% vs ~51%), so the large MAE_item gap (0.776 → 0.609) is less likely to be an artifact of one mode simply “skipping harder items”. - Calibration unchanged: AURC/AUGRC CIs overlap, and the paired ΔAURC CI includes 0. This suggests few-shot improves scores on predicted items but does not materially improve the model’s ranking of confidence / abstention decisions. - Practical takeaway: If the goal is “predict more items correctly”, retrieval helps; if the goal is “know when not to predict”, focus on evidence availability + confidence signals (e.g., evaluate participant_qa, tune thresholds, improve confidence estimation).

Known Issues: - Few-shot had 1/41 participant failure (PID 383): Exceeded maximum retries (3) for output validation. - Zero-shot excluded 1/41 participant from MAE aggregation due to 8/8 N/A (counted as 0 coverage for Cmax).

Run 9: Jan 2-3, 2026 - Spec 046 Confidence Signals Ablation

File: both_paper-test_backfill-off_20260102_215843.json

Log: data/outputs/run9_spec046_20260102_181114.log

Git Commit: Post Spec 046 + 047 (retrieval signals + keyword backfill removal)

Timestamp: 2026-01-03T02:58:43

Code State: - Spec 33-35: Full retrieval stack ✅ - Spec 37: Batch query embedding ✅ - Spec 046: Retrieval similarity fields ✅ - Spec 047: Keyword backfill removal ✅

Results:

Mode	AURC	AUGRC	Cmax	MAE_w	MAE_item	N_included
Zero-shot	0.144	0.032	48.8%	0.744	0.776	40
Few-shot	0.135	0.035	53.0%	0.718	0.662	41

95% Bootstrap CIs (10,000 resamples, participant-level):

Mode	AURC CI	AUGRC CI	Cmax CI
Zero-shot	[0.110, 0.178]	[0.022, 0.045]	[0.412, 0.567]
Few-shot	[0.107, 0.165]	[0.025, 0.047]	[0.460, 0.604]

Spec 046 Confidence Signal Ablation (few-shot):

Confidence Signal	AURC	AUGRC	vs llm baseline
`llm` (evidence count)	0.135	0.035	—
`retrieval_similarity_mean`	0.128	0.034	-5.4% AURC
`retrieval_similarity_max`	0.128	0.034	-5.4% AURC
`hybrid_evidence_similarity`	0.135	0.035	+0.2% AURC

Key Findings: 1. Retrieval similarity improves AURC 5.4%: retrieval_similarity_mean provides better ranking than evidence count alone 2. AUGRC unchanged: Improvement within noise (0.034 vs 0.035) 3. Hybrid signal not helpful: Multiplying evidence × similarity doesn't improve over either alone 4. GitHub Issue #86 hypothesis partially validated: Retrieval signals help AURC but don't substantially move AUGRC

Interpretation: The retrieval similarity signal provides modest but measurable improvement in selective prediction ranking. However, the AUGRC target of <0.020 (from Issue #86) was not achieved. Further improvements would require Phase 2 (verbalized confidence) or Phase 3 (multi-signal calibration) approaches.

Run 10: Jan 3, 2026 - Confidence Suite (Specs 048–051) Attempt (INVALID)

File: data/outputs/both_paper-test_20260103_182316.json

Log: data/outputs/run10_confidence_suite_20260103_111959.log

Run ID: 3186a50d

Git Commit: 064ed30 (dirty)

Timestamp: 2026-01-03T11:20:01

Goal: Emit confidence-suite signals (verbalized confidence, token-level CSFs, consistency) and re-evaluate AURC/AUGRC.

What went wrong (why this run is invalid for comparisons):

Zero-shot had 2/41 hard failures (PIDs 383, 427): Exceeded maximum retries (3) for output validation.
This was caused by deterministic malformed “JSON-like” outputs in the scoring step (pre-ANALYSIS-026 JSON hardening).
Few-shot evaluated 0/41 participants: every participant failed with:
HuggingFace backend requires optional dependencies. Install with: pip install 'ai-psychiatrist[hf]'
Root cause: the run used EMBEDDING_BACKEND=huggingface but torch was not installed, so query embeddings could not be computed.

Results (retain for debugging only; not a publication-quality run):

Mode	N_eval	MAE_w	MAE_item	Coverage	Notes
Zero-shot	39/41	0.632	0.597	48.7%	Partial; biased by failures
Few-shot	0/41	n/a	n/a	n/a	Invalid (missing HF deps)

Selective prediction (zero-shot only; 39 participants):

Computed via: uv run python scripts/evaluate_selective_prediction.py --input data/outputs/both_paper-test_20260103_182316.json --mode zero_shot

Confidence	AURC	AUGRC	Cmax	Notes
`llm`	0.101	0.026	48.7%	Baseline for this partial run
`verbalized`	0.092	0.026	48.7%	Lower AURC than `llm`
`token_pe`	0.100	0.024	48.7%	Lower AUGRC than `llm`

Action items before Run 11: - Use a clean git state for the run (commit or stash). - If using HuggingFace embeddings (EMBEDDING_BACKEND=huggingface), install deps first: make dev (or uv sync --extra hf) and verify uv run python -c "import torch". - Re-run the confidence suite on a valid run artifact (both modes evaluated) before interpreting deltas.

Run 11: Jan 4, 2026 - Confidence Suite (Specs 048–051) (DIAGNOSTIC; NOT COMPARABLE)

File: data/outputs/both_paper-test_20260104_102031.json

Log: data/outputs/run11_confidence_suite_20260103_215102.log

Run ID: d4c78527

Git Commit: 056d3be (clean)

Timestamp: 2026-01-03T21:51:02

Goal: Emit confidence-suite signals (verbalized confidence, token-level CSFs, consistency) and re-evaluate AURC/AUGRC for both modes.

What went wrong (why this run is not comparable to prior baselines):

5/41 participants failed in both modes due to evidence_hallucination (10 total failures, all fatal).
Failure artifact: data/outputs/failures_d4c78527.json
Most failing participants: 367, 386, 409, 456, 487 (each failed in both modes)

This creates selection bias (N=36 instead of N=41). Treat this run as diagnostic-only for confidence-signal ranking, not as a publication-quality benchmark.

Results (diagnostic-only; N=36):

Mode	N_eval	MAE_w	MAE_item	Coverage
Zero-shot	36/41	0.617	0.534	49.0%
Few-shot	36/41	0.715	0.663	47.6%

Selective prediction (Run 11):

Computed via: - data/outputs/selective_prediction_metrics_run11_zero_shot_all.json - data/outputs/selective_prediction_metrics_run11_few_shot_all.json - Paired (few − zero, overlap only): data/outputs/selective_prediction_metrics_run11_paired_default.json

Key takeaways (abs_norm):

Mode	Confidence	AURC	AUGRC	Cmax
Zero-shot	`llm`	0.1035	0.0253	48.96%
Zero-shot	`verbalized`	0.0878	0.0257	48.96%
Few-shot	`llm`	0.1184	0.0270	47.57%
Few-shot	`token_pe`	0.0861	0.0235	47.57%

Paired deltas (few-shot − zero-shot, confidence=llm): ΔAURC = +0.0149 [-0.0136, +0.0445], ΔAUGRC = +0.0017 [-0.0069, +0.0114].

Run 12: Jan 4-5, 2026 - Confidence Suite (Specs 048–052) ✅ VALID (N=41)

File: data/outputs/both_paper-test_20260105_072303.json

Log: data/outputs/run12_confidence_suite_20260104_115021.log

Run ID: 05621949

Git Commit: c0d79c5 (clean)

Timestamp: 2026-01-04T11:50:22

What changed vs Run 11: - Evidence grounding failures are recorded as non-fatal (failure registry) instead of aborting participant evaluation, eliminating selection bias (N=41/41). - JSON parsing hardening and retry improvements are present at run start; the run completes with 0 JSON parse failures (telemetry records fixups without failures).

Results:

Mode	N_eval	MAE_w	MAE_item	Coverage
Zero-shot	41/41	0.642	0.572	48.5%
Few-shot	41/41	0.676	0.616	46.0%

Selective prediction (Run 12, confidence=llm):

Mode	AURC	AUGRC	Cmax
Zero-shot	0.1019 [0.0806-0.1214]	0.0252 [0.0186-0.0323]	48.5%
Few-shot	0.1085 [0.0835-0.1327]	0.0242 [0.0175-0.0319]	46.0%

Best artifact-free confidence variants (within the same run): - Zero-shot (best AURC): verbalized (AURC 0.0917) - Zero-shot (best AUGRC): token_pe (AUGRC 0.0234) - Few-shot (best AURC/AUGRC): token_energy (AURC 0.0862, AUGRC 0.0216)

Artifacts: - Failures: data/outputs/failures_05621949.json (8 non-fatal evidence_hallucination events) - Telemetry: data/outputs/telemetry_05621949.json (json_fixups_applied) - Selective metrics (all variants): data/outputs/selective_prediction_metrics_run12_zero_shot_all.json, data/outputs/selective_prediction_metrics_run12_few_shot_all.json - Paired (few − zero, default): data/outputs/selective_prediction_metrics_run12_paired_default.json - Paired (Run 11 → Run 12, overlap only): data/outputs/selective_prediction_metrics_run11_vs_run12_zero_shot_llm.json, data/outputs/selective_prediction_metrics_run11_vs_run12_few_shot_llm.json

Interpretation: - The confidence-suite signals are working and measurably reduce AURC/AUGRC relative to llm within a fixed run (selective prediction improvement without changing the underlying predictions). - Few-shot does not outperform zero-shot on MAE_item in this run; however, few-shot slightly improves AUGRC at the cost of lower Cmax and slightly worse AURC under confidence=llm. Prefer paired + confidence-variant comparisons for selective prediction claims. - See Few-Shot Analysis for first-principles explanation of why few-shot may not outperform zero-shot with strict evidence grounding.

Run 13: Jan 6-7, 2026 - POST BUG-035 (First Clean Comparative Run) ✅ VALID

File: data/outputs/both_paper-test_20260107_134730.json

Log: data/outputs/run13_20260106_175051.log

Run ID: 7d5eadf0

Git Commit: 01d3124 (clean)

Timestamp: 2026-01-06T17:50:52 (started) → 2026-01-07T18:47:30 (completed)

Why this run is significant: - ✅ First run POST BUG-035 fix (prompt confound resolved) - ✅ Clean git state - ✅ All 41 participants evaluated in both modes (no selection bias) - ❌ Does NOT include Spec 061-063 (total score, binary classification, severity inference)

Code State: - Spec 032-037: Full retrieval stack ✅ - Spec 046-050: Confidence signals ✅ - Spec 051-052: Token-level CSFs ✅ - BUG-035 fix: Empty retrieval → identical prompt to zero-shot ✅ - Consistency: ENABLED (n=5, temp=0.2)

Results:

Mode	N_eval	MAE_w	MAE_item	Coverage	Time
Zero-shot	40	0.6750	0.6079	50.0%	~9.8h
Few-shot	41	0.7107	0.6571	48.5%	~10.2h

Selective Prediction (Run 13):

Mode	Confidence	AURC	AUGRC	Cmax
Zero-shot	`llm`	0.1066 [0.087-0.125]	0.0267 [0.020-0.034]	48.8%
Zero-shot	`consistency_inverse_std`	0.0977 [0.077-0.121]	0.0244 [0.017-0.032]	48.8%
Few-shot	`llm`	0.1153 [0.088-0.143]	0.0279 [0.020-0.038]	48.5%
Few-shot	`token_pe`	0.0906 [0.070-0.118]	0.0246 [0.018-0.034]	48.5%

Comparison to Paper (MAE_item):

Mode	Paper	Run 13	Delta
Zero-shot	0.796	0.6079	-24% (better)
Few-shot	0.619	0.6571	+6% (worse)

Key Findings: 1. Zero-shot beats paper by 24% (0.6079 vs 0.796) - substantial improvement 2. Zero-shot beats few-shot (0.6079 vs 0.6571) - consistent with prior runs 3. Few-shot slightly worse than paper - retrieval may still be introducing noise 4. Token-level CSFs work well for few-shot (token_pe AURC 0.0906) 5. Consistency signals work well for zero-shot (consistency_inverse_std AUGRC 0.0244)

Robustness: - Failures: 8 non-fatal evidence_hallucination events (recorded in failure registry) - Telemetry: 13 json_fixups_applied, 1 json_repair_fallback (healthy)

Artifacts: - Failures: data/outputs/failures_7d5eadf0.json - Telemetry: data/outputs/telemetry_7d5eadf0.json - Selective metrics: data/outputs/selective_prediction_metrics_run13_zero_shot_all.json, data/outputs/selective_prediction_metrics_run13_few_shot_all.json

Interpretation: This is the first clean comparative run after the BUG-035 prompt confound fix. The result confirms that zero-shot outperforms few-shot even when few-shot prompts are no longer contaminated by "No valid evidence found" messages. The few-shot underperformance is therefore due to retrieval quality issues, not prompt confounding.

Run 14: Jan 7-8, 2026 - Spec 063 Severity Inference (`infer`) ✅ VALID (Coverage ↑; Risk ↑)

File: data/outputs/both_paper-test_20260108_114058.json

Log: data/outputs/run14_infer_20260107_172234.log

Run ID: 02a0d65e

Git Commit: e55c00f (clean)

Timestamp: 2026-01-07T17:22:35 (started) → 2026-01-08T11:40:58 (completed)

Code State: - Spec 063 enabled via CLI: --severity-inference infer (default remains strict) - Consistency: ENABLED (n=5, temp=0.2) - Prediction mode: item (Specs 061/062 are implemented, but not invoked in this run)

Results:

Mode	N_eval	MAE_w	MAE_item	Coverage	Time
Zero-shot	41	0.7056	0.7030	60.1%	~9.9h
Few-shot	40	0.7772	0.7843	57.5%	~8.4h

Selective Prediction (Run 14, all variants; abs_norm, 1,000 bootstrap resamples): - Zero-shot: data/outputs/selective_prediction_metrics_run14_infer_zero_shot_all.json - Few-shot: data/outputs/selective_prediction_metrics_run14_infer_few_shot_all.json - Paired (few − zero, default confidences; overlap only): data/outputs/selective_prediction_metrics_run14_infer_paired_default.json

Mode	Confidence	AURC	AUGRC	Cmax
Zero-shot	`llm`	0.1292 [0.104-0.161]	0.0409 [0.030-0.057]	60.1%
Zero-shot	`hybrid_consistency`	0.1258 [0.102-0.155]	0.0377 [0.028-0.052]	60.1%
Few-shot	`llm`	0.1467 [0.114-0.180]	0.0421 [0.029-0.059]	57.5%
Few-shot	`consistency_inverse_std`	0.1391 [0.107-0.176]	0.0404 [0.028-0.057]	57.5%
Few-shot	`token_energy`	0.1455 [0.111-0.176]	0.0394 [0.028-0.054]	57.5%

Robustness: - Failures: 9 total (8 evidence_hallucination in evidence extraction; 1 HTTP 500 causing a single few-shot participant failure) - Failures file: data/outputs/failures_02a0d65e.json - Telemetry file: data/outputs/telemetry_02a0d65e.json (22 pydantic_retry, 4 json_fixups_applied, 1 json_python_literal_fallback)

Comparison to Run 13 (strict baseline): - Zero-shot (paired, overlap N=41): Cmax +0.113; AURC(llm) +0.023; AUGRC(llm) +0.014 - Few-shot (paired, overlap N=40): Cmax +0.084; AURC(llm) +0.029; AUGRC(llm) +0.013

Paired deltas artifacts: - Zero-shot: data/outputs/selective_prediction_metrics_run13_vs_run14_zero_shot_all.json - Few-shot: data/outputs/selective_prediction_metrics_run13_vs_run14_few_shot_all.json

Interpretation: Severity inference increases coverage as intended, but in this first ablation it materially increases risk (AURC/AUGRC) relative to the strict baseline. Treat infer as an experimental setting until additional prompt/guardrail iterations show coverage gains without degrading coverage-aware metrics.

Reproduction Commands

Run Evaluation

# Full reproduction (both modes)
uv run python scripts/reproduce_results.py --split paper-test

# Zero-shot only
uv run python scripts/reproduce_results.py --split paper-test --zero-shot-only

# Few-shot only (requires embeddings)
uv run python scripts/reproduce_results.py --split paper-test --few-shot-only

Compute AURC/AUGRC

# Single mode (writes `data/outputs/selective_prediction_metrics_*.json`)
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/YOUR_OUTPUT.json \
  --mode zero_shot \
  --seed 42

# Paired comparison (recommended): pass the same run file twice with different modes
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/YOUR_OUTPUT.json \
  --mode zero_shot \
  --input data/outputs/YOUR_OUTPUT.json \
  --mode few_shot \
  --seed 42

# Or run separately for each mode
uv run python scripts/evaluate_selective_prediction.py \
  --input data/outputs/YOUR_OUTPUT.json \
  --mode few_shot \
  --seed 42

Generate Embeddings (for few-shot)

uv run python scripts/generate_embeddings.py --split paper-train
# Optional (Spec 34): add `--write-item-tags` to generate a `.tags.json` sidecar for item-tag filtering, then set `EMBEDDING_ENABLE_ITEM_TAG_FILTER=true` for runs.

File Locations

Type	Path
Run outputs	`data/outputs/*.json`
AURC metrics	`data/outputs/selective_prediction_metrics_*.json`
Run log (gitignored)	`data/outputs/RUN_LOG.md`
Embeddings	`data/embeddings/*.npz`
Experiment registry	`data/experiments/registry.yaml`

References

Statistical methodology: docs/statistics/statistical-methodology-aurc-augrc.md
Feature index + defaults: docs/pipeline-internals/features.md
RAG runtime features: docs/rag/runtime-features.md
RAG debugging: docs/rag/debugging.md
RAG artifact generation: docs/rag/artifact-generation.md
Paper analysis: docs/_archive/misc/paper-reproduction-analysis.md

Complete Run History & Statistical Analysis

⚠️ CRITICAL: Run Integrity Warnings

Prompt Confound Bug (BUG-035) - Fixed 2026-01-06

Silent Fallback Bug (ANALYSIS-026) - Fixed 2026-01-03

Invalid JSON Output Bug (BUG-048) - Fixed 2026-01-08

Quick Reference: Current Best Results

Run 13: First Clean Run POST BUG-035 Fix ✅

Paper Comparison (MAE_item)

Historical Reference (Run 12, pre-BUG-035 fix)

Run 14: Spec 063 Severity Inference (infer) Ablation (Coverage ↑; Risk ↑)

Why AURC/AUGRC Instead of MAE?

Run Timeline (Chronological)

Run 1: Dec 26, 2025 - Initial Validated Runs

Run 2: Dec 27, 2025 - Pre-Spec 31/32 Full Run

Run 3: Dec 29, 2025 - Post-Spec 31/32 (Legacy Prompt Format)

Run 4: Dec 29, 2025 - Spec 33 Development Snapshot (Pre-merge)

Run 4b: Dec 30, 2025 - Post-Spec 34 Regression (Query Embedding Timeouts)

Run 5: Dec 30, 2025 - Post-Spec 33+34 (Full Ablation)

Spec 31/32 Impact Analysis

What Changed

Impact on Metrics

Interpretation

Key Findings

1. Few-Shot vs Zero-Shot (Paper Claim)

2. Paper's MAE Comparison Was Not Coverage-Adjusted

3. Formatting Matters But Isn't Everything

Pending Work

Specs 33-36: Retrieval Quality Fixes

Run 6: Dec 31, 2025 - Spec 35 Chunk Scoring Preprocessing

Run 7: Jan 1, 2026 - Post-Spec 35 Chunk Scoring (Full Run)

Run 8: Jan 2, 2026 - Participant-Only Transcript Preprocessing (Full Run)

Run 9: Jan 2-3, 2026 - Spec 046 Confidence Signals Ablation

Run 10: Jan 3, 2026 - Confidence Suite (Specs 048–051) Attempt (INVALID)

Run 11: Jan 4, 2026 - Confidence Suite (Specs 048–051) (DIAGNOSTIC; NOT COMPARABLE)

Run 12: Jan 4-5, 2026 - Confidence Suite (Specs 048–052) ✅ VALID (N=41)

Run 13: Jan 6-7, 2026 - POST BUG-035 (First Clean Comparative Run) ✅ VALID

Run 14: Jan 7-8, 2026 - Spec 063 Severity Inference (infer) ✅ VALID (Coverage ↑; Risk ↑)

Reproduction Commands

Run Evaluation

Compute AURC/AUGRC

Generate Embeddings (for few-shot)

File Locations

References

Run 14: Spec 063 Severity Inference (`infer`) Ablation (Coverage ↑; Risk ↑)

Run 14: Jan 7-8, 2026 - Spec 063 Severity Inference (`infer`) ✅ VALID (Coverage ↑; Risk ↑)