# Spec 061: Total PHQ-8 Score Prediction (0-24)

**Status:** IMPLEMENTED
**Created:** 2026-01-05
**Implemented:** 2026-01-07
**Rationale:** Item-level PHQ-8 frequency scoring (0-3 per item) is often underdetermined from DAIC-WOZ transcripts. Total score prediction (0-24) may be more defensible.
## Motivation

### Task Validity Problem
PHQ-8 item scores (0-3) encode 2-week frequency (0-1, 2-6, 7-11, 12-14 days). DAIC-WOZ interviews are not structured to elicit frequency information. This creates a fundamental construct mismatch (see docs/clinical/task-validity.md).
Run 12 evidence:
- Only 32% of item assessments have any grounded evidence
- ~50% abstention rate (N/A) is expected behavior
- Coverage stabilizes around 46-49%
### Why Total Score May Be More Valid
- Error averaging: Item-level errors partially cancel when summed
- Fewer degrees of freedom: 1 prediction vs 8 predictions per participant
- Prior art: Text-only PHQ-8 total regression exists (PubMed 37398577)
- Clinical utility: Total score determines severity tier (0-4, 5-9, 10-14, 15-19, 20-24)
## Design

### Prediction Modes

Add a new configuration option and CLI flag:

```python
# config.py
from typing import Literal
from pydantic_settings import BaseSettings  # Pydantic v2; under v1, `from pydantic import BaseSettings`

class PredictionSettings(BaseSettings):
    prediction_mode: Literal["item", "total", "binary"] = "item"
```

```bash
# CLI usage
uv run python scripts/reproduce_results.py --prediction-mode total
```
### Mode Behaviors

| Mode | Output | Coverage Handling | Evaluation Metric |
|---|---|---|---|
| `item` | 8 scores (0-3) or N/A per item | Per-item abstention | MAE_item, AURC |
| `total` | 1 score (0-24) per participant | Participant-level abstention | MAE_total, RMSE |
| `binary` | 1 label (depressed/not) | Participant-level abstention | Accuracy, F1 |
### Total Score Prediction Strategy

#### Option A: Sum of Item Predictions (Default)

Use the existing item-level pipeline and sum the non-N/A scores:
```python
def predict_total_score(item_scores: dict[str, int | None]) -> int | None:
    """Sum the non-abstained item scores, or abstain when coverage is too low."""
    scored_items = [v for v in item_scores.values() if v is not None]
    if len(scored_items) < 4:  # Require at least 50% coverage (4 of 8 items)
        return None  # Abstain at the participant level
    return sum(scored_items)  # Partial sum (underestimates when items are missing)
```
Note: Partial sums underestimate true total when items are missing.
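For illustration, a minimal usage sketch; the item keys below are hypothetical stand-ins for whatever keys the pipeline emits:

```python
# Hypothetical item keys; the real pipeline defines its own.
item_scores = {
    "interest": 2, "mood": 3, "sleep": 1, "energy": 2,
    "appetite": None, "self_worth": None, "concentration": 1, "psychomotor": None,
}
assert predict_total_score(item_scores) == 9  # 5 of 8 items scored -> partial sum
assert predict_total_score(dict.fromkeys(item_scores)) is None  # Full abstention
```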
#### Option B: Direct Total Prediction (Optional)

Add a new prompt that predicts the total score directly, without item decomposition:

```text
Based on this clinical interview transcript, estimate the participant's
overall PHQ-8 depression severity score (0-24).

Consider all observable indicators of depression symptoms:
- Mood and affect
- Sleep and energy
- Interest and pleasure
- Self-perception
- Concentration

Output a single integer 0-24, or "N/A" if insufficient evidence.
```
Trade-off: Less interpretable (no item breakdown) but avoids compounding item abstentions.
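If Phase 2 is implemented, the raw reply still has to be mapped onto a score or an abstention. A minimal parsing sketch, assuming the model follows the output format above (the function name is illustrative):

```python
import re

def parse_total_response(raw: str) -> int | None:
    """Map a model reply to a 0-24 total score, or None for N/A / unparseable output."""
    text = raw.strip()
    if text.upper().startswith("N/A"):
        return None  # Explicit abstention
    if re.fullmatch(r"\d{1,2}", text) and int(text) <= 24:
        return int(text)
    return None  # Out-of-range or free-form output is treated as abstention
```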
## Implementation

### Implemented Scope (2026-01-07)

- Phase 1 (Sum-of-Items): Implemented via `PREDICTION_MODE=total` / `--prediction-mode total`, with coverage gating via `TOTAL_SCORE_MIN_COVERAGE` / `--total-min-coverage`.
- Phase 2 (Direct Total Prediction): Deferred (not implemented).
### Phase 1: Sum-of-Items (Low Effort)

- Add `--prediction-mode` CLI flag to `reproduce_results.py` (see the argparse sketch after this list)
- In output generation, compute the total from item scores
- Add `total_score` and `total_score_predicted` fields to the output JSON
- Update the evaluation script to compute MAE_total when mode=total
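A minimal sketch of the flag wiring, assuming `reproduce_results.py` parses arguments with argparse (the surrounding parser setup is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Reproduce PHQ-8 prediction results.")
parser.add_argument(
    "--prediction-mode",
    choices=["item", "total", "binary"],
    default=None,  # Fall back to PREDICTION_MODE from the environment when unset
    help="Predict per-item scores, a 0-24 total, or a binary depression label.",
)
parser.add_argument(
    "--total-min-coverage",
    type=float,
    default=None,  # Fall back to TOTAL_SCORE_MIN_COVERAGE when unset
    help="Minimum fraction of items that must be scored before summing to a total.",
)
args = parser.parse_args()
```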
### Phase 2: Direct Prediction (Medium Effort)

- Add a new prompt template in `agents/prompts/quantitative.py`
- Add a `DirectTotalAgent` or extend `QuantitativeAgent` with a mode switch
- Output format: `{"total_score": int | "N/A", "confidence": float, "reason": str}` (see the model sketch below)
## Evaluation

### Metrics for Total Score

| Metric | Formula | Notes |
|---|---|---|
| MAE_total | mean(\|predicted - actual\|) | Primary metric |
| RMSE | sqrt(mean((predicted - actual)^2)) | Penalizes large errors |
| Correlation | Pearson r | Linear relationship |
| Severity Tier Accuracy | sum(tier_pred == tier_actual) / N | Clinically meaningful |
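A self-contained sketch of the first three metrics over paired totals for non-abstained participants, using only the standard library (the function name is illustrative; `statistics.correlation` requires Python 3.10+):

```python
import math
from statistics import correlation, mean

def total_score_metrics(predicted: list[int], actual: list[int]) -> dict[str, float]:
    """MAE_total, RMSE, and Pearson r over non-abstained participants."""
    errors = [p - a for p, a in zip(predicted, actual, strict=True)]
    return {
        "mae_total": mean(abs(e) for e in errors),
        "rmse": math.sqrt(mean(e * e for e in errors)),
        "pearson_r": correlation(predicted, actual),
    }
```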
### Severity Tiers (PHQ-8)
| Tier | Range | Label |
|---|---|---|
| 0 | 0-4 | Minimal/None |
| 1 | 5-9 | Mild |
| 2 | 10-14 | Moderate |
| 3 | 15-19 | Moderately Severe |
| 4 | 20-24 | Severe |
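Mapping a total onto its tier is a threshold lookup; a minimal sketch (the function name is illustrative):

```python
def severity_tier(total_score: int) -> int:
    """Map a PHQ-8 total (0-24) to its severity tier (0-4)."""
    return sum(total_score >= t for t in (5, 10, 15, 20))  # Lower bounds of tiers 1-4
```

Severity tier accuracy from the metrics table is then the fraction of participants with `severity_tier(predicted) == severity_tier(actual)`.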
## Configuration

### New Settings

```bash
# .env
PREDICTION_MODE=total         # item | total | binary
TOTAL_SCORE_MIN_COVERAGE=0.5  # Minimum item coverage for sum-of-items
```

### CLI Override

```bash
uv run python scripts/reproduce_results.py \
    --prediction-mode total \
    --total-min-coverage 0.5
```
## Output Schema Changes
Add to participant results:
```json
{
  "participant_id": "303",
  "prediction_mode": "total",
  "total_score": {
    "predicted": 12,
    "actual": 14,
    "method": "sum_of_items",
    "items_covered": 6,
    "confidence": 0.75
  },
  "severity_tier": {
    "predicted": 2,
    "actual": 2,
    "correct": true
  }
}
```
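A sketch of assembling that block on the Phase 1 path, reusing `severity_tier` from above and omitting `items_covered` and `confidence` for brevity (the helper name is illustrative):

```python
def build_total_result(participant_id: str, predicted: int | None, actual: int) -> dict:
    """Assemble the total-score block of the participant results JSON."""
    result: dict = {
        "participant_id": participant_id,
        "prediction_mode": "total",
        "total_score": {"predicted": predicted, "actual": actual, "method": "sum_of_items"},
    }
    if predicted is not None:  # Tier comparison only applies when not abstaining
        tier_pred, tier_actual = severity_tier(predicted), severity_tier(actual)
        result["severity_tier"] = {
            "predicted": tier_pred,
            "actual": tier_actual,
            "correct": tier_pred == tier_actual,
        }
    return result
```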
## Testing

- Unit tests for total score computation from items (see the pytest sketch after this list)
- Integration test with `--prediction-mode total`
- Verify the output JSON schema includes the total fields
- Compare MAE_total to MAE_item on the same run
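A starting point for the unit tests, assuming `predict_total_score` from Option A is importable (the module path is hypothetical):

```python
from phq8.total import predict_total_score  # Hypothetical module path

def test_abstains_below_coverage_threshold():
    scores = {f"item_{i}": (1 if i < 3 else None) for i in range(8)}  # 3 of 8 scored
    assert predict_total_score(scores) is None

def test_sums_fully_covered_items():
    assert predict_total_score({f"item_{i}": 2 for i in range(8)}) == 16
```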
## Dependencies

- Phase 1: none (uses the existing pipeline)
- Phase 2: requires new prompt design