
Reproduction Run Output Schema (JSON + Registry)

Audience: Researchers parsing outputs and maintaining run provenance
Last Updated: 2026-01-08

This repo writes four primary provenance artifacts for quantitative reproduction runs:

  1. data/outputs/{mode}_{split}_{YYYYMMDD_HHMMSS}.json
  2. data/experiments/registry.yaml (append/update registry of runs)
  3. data/outputs/failures_{run_id}.json (failure summary; Spec 056)
  4. data/outputs/telemetry_{run_id}.json (structured telemetry summary)

SSOT implementation:

- scripts/reproduce_results.py (writer)
- src/ai_psychiatrist/services/experiment_tracking.py (filename + provenance helpers)
- src/ai_psychiatrist/infrastructure/observability.py (failure registry; Spec 056)
- src/ai_psychiatrist/infrastructure/telemetry.py (telemetry registry)


Output Filename Format

generate_output_filename() produces:

{mode}_{split}_{YYYYMMDD_HHMMSS}.json

Where:

- mode is zero_shot, few_shot, or both
- split is one of: train, dev, train+dev, paper, paper-train, paper-val, paper-test
- paper is an alias for paper-test in scripts/reproduce_results.py

Note: {mode} in the filename refers to scoring mode (zero-shot vs few-shot vs both). Prediction evaluation mode (Specs 061/062: item vs total vs binary) is recorded inside the JSON as experiments[].results.prediction_mode.
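As an illustration only (the real generate_output_filename() lives in src/ai_psychiatrist/services/experiment_tracking.py and may take different arguments), a minimal sketch of how such a filename can be constructed:

# Illustrative sketch only; not the repo's actual helper signature.
from datetime import datetime

def build_output_filename(mode: str, split: str, now: datetime | None = None) -> str:
    """Return '{mode}_{split}_{YYYYMMDD_HHMMSS}.json' for a reproduction run."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{mode}_{split}_{stamp}.json"

# e.g. build_output_filename("few_shot", "paper-test") -> "few_shot_paper-test_20260108_143000.json"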


JSON Top-Level Shape

{
  "run_metadata": { "...": "..." },
  "experiments": [
    {
      "provenance": { "...": "..." },
      "results": { "...": "..." }
    }
  ]
}

run_metadata

Captured once per run:

- run_id
- timestamp
- git_commit, git_dirty
- python_version, platform
- ollama_base_url
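A minimal sketch for loading an output file and inspecting its run-level provenance, assuming only the top-level keys and run_metadata fields listed above (the filename below is hypothetical):

# Sketch: load a reproduction output and print run-level provenance.
import json
from pathlib import Path

path = Path("data/outputs/few_shot_paper-test_20260108_120000.json")  # hypothetical filename
with path.open() as f:
    output = json.load(f)

meta = output["run_metadata"]
print(meta["run_id"], meta["git_commit"], "dirty" if meta["git_dirty"] else "clean")
print("modes:", [exp["results"].get("prediction_mode") for exp in output["experiments"]])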

experiments[].provenance

Captured per mode (zero_shot / few_shot):

- split name
- model/backends used
- embedding artifact identity (path + checksums when applicable)
- participants requested vs evaluated
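The exact provenance key names are set by the writer in scripts/reproduce_results.py; the keys used in this sketch are placeholders for illustration, not the confirmed schema:

# Placeholder key names; consult scripts/reproduce_results.py for the real schema.
def describe_provenance(output: dict) -> None:
    for exp in output["experiments"]:
        prov = exp["provenance"]
        # 'split' and the participant-count keys below are assumptions.
        print(prov.get("split"), prov.get("participants_requested"), prov.get("participants_evaluated"))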

experiments[].results

Per-mode aggregated metrics + per-participant results:

- counts: total/success/failed/evaluated
- MAE variants (item_mae_weighted, item_mae_by_item, item_mae_by_subject) for item-level scoring
- coverage (prediction_coverage)
- evaluation settings (Specs 061/062): prediction_mode, total_score_min_coverage, binary_threshold, binary_strategy
- per-item breakdowns
- per-participant predictions
- item_signals (Spec 25/046 confidence signals)
  - includes Spec 063 severity inference annotations when enabled: inference_used, inference_type, inference_marker

When enabled via --prediction-mode (Specs 061/062), additional aggregated metrics are included:

- total_metrics (total score evaluation)
- binary_metrics (binary depression classification)

For downstream AURC/AUGRC evaluation, scripts/evaluate_selective_prediction.py consumes:

- per-participant success
- per-item predictions
- item_signals evidence counts (llm_evidence_count)
- (Spec 046) retrieval stats (retrieval_reference_count, retrieval_similarity_mean, retrieval_similarity_max)
- (Specs 048–051) confidence suite signals:
  - verbalized_confidence
  - token_msp, token_pe, token_energy
  - consistency_modal_confidence, consistency_score_std, consistency_na_rate, consistency_samples

Note: Outputs are written as strict JSON (no NaN/Infinity). Undefined aggregate metrics are emitted as null (BUG-048).
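A sketch of pulling these signals out of an output file for selective-prediction analysis. It uses only field names documented above; the assumption that item_signals is attached per participant and keyed by item is illustrative, and the authoritative extraction logic is in scripts/evaluate_selective_prediction.py:

# Sketch: collect confidence signals for AURC/AUGRC analysis.
import json

def iter_signal_rows(output_path: str):
    with open(output_path) as f:
        output = json.load(f)
    for exp in output["experiments"]:
        for participant in exp["results"]["results"]:
            if not participant["success"]:
                continue
            # item_signals nesting is an assumption; check evaluate_selective_prediction.py.
            for item, signals in (participant.get("item_signals") or {}).items():
                msp = signals.get("token_msp")
                yield {
                    "participant_id": participant["participant_id"],
                    "item": item,
                    "llm_evidence_count": signals.get("llm_evidence_count"),
                    "verbalized_confidence": signals.get("verbalized_confidence"),
                    # Strict JSON: undefined values arrive as null -> None; map to NaN here.
                    "token_msp": float("nan") if msp is None else msp,
                }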

experiments[].results.results[] (Per-participant)

Each participant result includes:

- participant_id, success, error
- per-item maps: ground_truth_items, predicted_items
- totals/bounds: ground_truth_total, predicted_total, predicted_total_min, predicted_total_max
- severity bounds: severity_lower_bound, severity_upper_bound (and severity when available)
- binary fields (Spec 062): ground_truth_binary, predicted_binary, binary_correct (populated when prediction_mode="binary")
- optional structured sections when enabled:
  - total_score (when prediction_mode="total")
  - binary_classification (when prediction_mode="binary")
  - severity_tier (when prediction_mode="total")
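For example, the documented per-item maps are enough to recompute a per-participant item MAE. A minimal sketch, assuming ground_truth_items and predicted_items are dicts keyed by the same item labels:

# Sketch: per-participant item MAE from the documented per-item maps.
def participant_item_mae(result: dict) -> float | None:
    gt = result.get("ground_truth_items") or {}
    pred = result.get("predicted_items") or {}
    shared = [k for k in gt if pred.get(k) is not None]
    if not shared:
        return None  # no scored items -> undefined, mirroring the null convention above
    return sum(abs(gt[k] - pred[k]) for k in shared) / len(shared)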


Experiment Registry (data/experiments/registry.yaml)

scripts/reproduce_results.py updates a YAML registry after each run to make it easy to:

- list historical runs
- associate output files with run IDs and git commits
- compare summary metrics without opening large JSON artifacts

Treat the registry as a convenience index; the JSON output is the authoritative raw artifact.
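The registry's entry schema is owned by scripts/reproduce_results.py and is not reproduced here; a minimal sketch of listing runs, assuming only that the file parses with a standard YAML loader:

# Sketch: index the run registry. The entry structure (mapping of run id -> entry)
# is an assumption; consult scripts/reproduce_results.py for the authoritative schema.
import yaml

with open("data/experiments/registry.yaml") as f:
    registry = yaml.safe_load(f) or {}

if isinstance(registry, dict):
    for run_id, entry in registry.items():
        print(run_id, entry)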


Failure Registry (data/outputs/failures_{run_id}.json) (Spec 056)

Runs can emit a compact failure summary to support postmortems and trend tracking without leaking transcript text.

Top-level shape:

{
  "summary": { "...": "..." },
  "failures": [
    {
      "category": "scoring_pydantic_retry_exhausted",
      "severity": "fatal",
      "message": "Exceeded maximum retries (3) for output validation",
      "participant_id": 383,
      "phq8_item": null,
      "stage": "scoring",
      "timestamp": "2026-01-03T23:23:16.828218+00:00",
      "context": { "...": "..." }
    }
  ]
}

Design constraints:

- No transcript text or evidence quotes should be written to this file.
- Prefer categorical fields (category/severity), numeric counts, stable hashes, and short messages.
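Because the file contains only categorical fields and short messages, it is cheap to aggregate for trend tracking. A sketch using only the fields shown in the example above (the filename is hypothetical):

# Sketch: summarize a failure registry by category and severity.
import json
from collections import Counter

with open("data/outputs/failures_20260103_232316.json") as f:  # hypothetical run_id
    failure_registry = json.load(f)

by_category = Counter(fail["category"] for fail in failure_registry["failures"])
fatal = [f for f in failure_registry["failures"] if f["severity"] == "fatal"]
print(by_category.most_common(), f"{len(fatal)} fatal")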


Telemetry Registry (data/outputs/telemetry_{run_id}.json)

Runs may also emit a compact telemetry summary for debugging/monitoring without leaking transcript text.

Design constraints:

- No transcript text, evidence quotes, or raw LLM outputs.
- Telemetry should be categorical + aggregate (counts, stable hashes), suitable for trend tracking.
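The telemetry schema itself is defined in src/ai_psychiatrist/infrastructure/telemetry.py and is not documented here; the sketch below only illustrates the "stable hashes, no raw text" constraint and is not the actual telemetry format:

# Illustration of the design constraint only (not the real telemetry schema):
# record a stable hash of an identifier instead of any raw text.
import hashlib

def stable_hash(value: str, length: int = 12) -> str:
    """Deterministic, truncated SHA-256 hex digest usable as a categorical key."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]

summary = {"prompt_hash": stable_hash("some prompt identifier"), "retry_count": 2}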