# Spec 061: Total PHQ-8 Score Prediction (0-24)

**Status:** IMPLEMENTED
**Created:** 2026-01-05
**Implemented:** 2026-01-07
**Rationale:** Item-level PHQ-8 frequency scoring (0-3 per item) is often underdetermined from DAIC-WOZ transcripts. Total score prediction (0-24) may be more defensible.
## Motivation

### Task Validity Problem
PHQ-8 item scores (0-3) encode 2-week frequency (0-1, 2-6, 7-11, 12-14 days). DAIC-WOZ interviews are not structured to elicit frequency information. This creates a fundamental construct mismatch (see docs/clinical/task-validity.md).
Run 12 evidence:
- Only 32% of item assessments have any grounded evidence
- ~50% abstention rate (N/A) is expected behavior
- Coverage stabilizes around 46-49%
### Why Total Score May Be More Valid
- Error averaging: Item-level errors partially cancel when summed
- Fewer degrees of freedom: 1 prediction vs 8 predictions per participant
- Prior art: Text-only PHQ-8 total regression exists (PubMed 37398577)
- Clinical utility: Total score determines severity tier (0-4, 5-9, 10-14, 15-19, 20-24)
## Design

### Prediction Modes

Add a new configuration option and CLI flag:

```python
# config.py
from typing import Literal
from pydantic_settings import BaseSettings  # Pydantic v2; under v1, `from pydantic import BaseSettings`

class PredictionSettings(BaseSettings):
    prediction_mode: Literal["item", "total", "binary"] = "item"
```

```bash
# CLI usage
uv run python scripts/reproduce_results.py --prediction-mode total
```
### Mode Behaviors

| Mode | Output | Coverage Handling | Evaluation Metric |
|---|---|---|---|
| `item` | 8 scores (0-3) or N/A per item | Per-item abstention | MAE_item, AURC |
| `total` | 1 score (0-24) per participant | Participant-level abstention | MAE_total, RMSE |
| `binary` | 1 label (depressed/not) | Participant-level abstention | Accuracy, F1 |
### Total Score Prediction Strategy

#### Option A: Sum of Item Predictions (Default)

Use the existing item-level pipeline and sum the non-N/A scores:
```python
def predict_total_score(item_scores: dict[str, int | None]) -> int | None:
    """Sum the non-abstained item scores, or abstain when coverage is too low."""
    scored_items = [v for v in item_scores.values() if v is not None]
    if len(scored_items) < 4:  # Require at least 50% coverage (4 of 8 items)
        return None  # Abstain at the participant level
    return sum(scored_items)  # Partial sum (underestimates when items are missing)
```
Note: Partial sums underestimate true total when items are missing.
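For illustration, a minimal usage sketch; the item keys below are hypothetical stand-ins for whatever keys the pipeline emits:

```python
# Hypothetical item keys; the real pipeline defines its own.
item_scores = {
    "interest": 2, "mood": 3, "sleep": 1, "energy": 2,
    "appetite": None, "self_worth": None, "concentration": 1, "psychomotor": None,
}
assert predict_total_score(item_scores) == 9  # 5 of 8 items scored -> partial sum
assert predict_total_score(dict.fromkeys(item_scores)) is None  # Full abstention
```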
#### Option B: Direct Total Prediction (Optional)

Add a new prompt that predicts the total score directly, without item decomposition:

```text
Based on this clinical interview transcript, estimate the participant's
overall PHQ-8 depression severity score (0-24).

Consider all observable indicators of depression symptoms:
- Mood and affect
- Sleep and energy
- Interest and pleasure
- Self-perception
- Concentration

Output a single integer 0-24, or "N/A" if insufficient evidence.
```
Trade-off: Less interpretable (no item breakdown) but avoids compounding item abstentions.
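If Phase 2 is implemented, the raw reply still has to be mapped onto a score or an abstention. A minimal parsing sketch, assuming the model follows the output format above (the function name is illustrative):

```python
import re

def parse_total_response(raw: str) -> int | None:
    """Map a model reply to a 0-24 total score, or None for N/A / unparseable output."""
    text = raw.strip()
    if text.upper().startswith("N/A"):
        return None  # Explicit abstention
    if re.fullmatch(r"\d{1,2}", text) and int(text) <= 24:
        return int(text)
    return None  # Out-of-range or free-form output is treated as abstention
```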
## Implementation

### Implemented Scope (2026-01-07)

- Phase 1 (Sum-of-Items): Implemented via `PREDICTION_MODE=total` / `--prediction-mode total`, with coverage gating via `TOTAL_SCORE_MIN_COVERAGE` / `--total-min-coverage`.
- Phase 2 (Direct Total Prediction): Deferred (not implemented).
### Phase 1: Sum-of-Items (Low Effort)

- Add `--prediction-mode` CLI flag to `reproduce_results.py` (see the argparse sketch after this list)
- In output generation, compute the total from item scores
- Add `total_score` and `total_score_predicted` fields to the output JSON
- Update the evaluation script to compute MAE_total when mode=total
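A minimal sketch of the flag wiring, assuming `reproduce_results.py` parses arguments with argparse (the surrounding parser setup is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Reproduce PHQ-8 prediction results.")
parser.add_argument(
    "--prediction-mode",
    choices=["item", "total", "binary"],
    default=None,  # Fall back to PREDICTION_MODE from the environment when unset
    help="Predict per-item scores, a 0-24 total, or a binary depression label.",
)
parser.add_argument(
    "--total-min-coverage",
    type=float,
    default=None,  # Fall back to TOTAL_SCORE_MIN_COVERAGE when unset
    help="Minimum fraction of items that must be scored before summing to a total.",
)
args = parser.parse_args()
```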
### Phase 2: Direct Prediction (Medium Effort)

- Add a new prompt template in `agents/prompts/quantitative.py`
- Add a `DirectTotalAgent` or extend `QuantitativeAgent` with a mode switch
- Output format: `{"total_score": int | "N/A", "confidence": float, "reason": str}` (see the model sketch below)
## Evaluation

### Metrics for Total Score

| Metric | Formula | Notes |
|---|---|---|
| MAE_total | mean(\|predicted - actual\|) | Primary metric |
| RMSE | sqrt(mean((predicted - actual)^2)) | Penalizes large errors |
| Correlation | Pearson r | Linear relationship |
| Severity Tier Accuracy | sum(tier_pred == tier_actual) / N | Clinically meaningful |
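A self-contained sketch of the first three metrics over paired totals for non-abstained participants, using only the standard library (the function name is illustrative; `statistics.correlation` requires Python 3.10+):

```python
import math
from statistics import correlation, mean

def total_score_metrics(predicted: list[int], actual: list[int]) -> dict[str, float]:
    """MAE_total, RMSE, and Pearson r over non-abstained participants."""
    errors = [p - a for p, a in zip(predicted, actual, strict=True)]
    return {
        "mae_total": mean(abs(e) for e in errors),
        "rmse": math.sqrt(mean(e * e for e in errors)),
        "pearson_r": correlation(predicted, actual),
    }
```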
### Severity Tiers (PHQ-8)
| Tier | Range | Label |
|---|---|---|
| 0 | 0-4 | Minimal/None |
| 1 | 5-9 | Mild |
| 2 | 10-14 | Moderate |
| 3 | 15-19 | Moderately Severe |
| 4 | 20-24 | Severe |
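Mapping a total onto its tier is a threshold lookup; a minimal sketch (the function name is illustrative):

```python
def severity_tier(total_score: int) -> int:
    """Map a PHQ-8 total (0-24) to its severity tier (0-4)."""
    return sum(total_score >= t for t in (5, 10, 15, 20))  # Lower bounds of tiers 1-4
```

Severity tier accuracy from the metrics table is then the fraction of participants with `severity_tier(predicted) == severity_tier(actual)`.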
## Configuration

### New Settings

```bash
# .env
PREDICTION_MODE=total         # item | total | binary
TOTAL_SCORE_MIN_COVERAGE=0.5  # Minimum item coverage for sum-of-items
```

### CLI Override

```bash
uv run python scripts/reproduce_results.py \
    --prediction-mode total \
    --total-min-coverage 0.5
```
## Output Schema Changes
Add to participant results:
```json
{
  "participant_id": "303",
  "prediction_mode": "total",
  "total_score": {
    "predicted": 12,
    "actual": 14,
    "method": "sum_of_items",
    "items_covered": 6,
    "confidence": 0.75
  },
  "severity_tier": {
    "predicted": 2,
    "actual": 2,
    "correct": true
  }
}
```
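A sketch of assembling that block on the Phase 1 path, reusing `severity_tier` from above and omitting `items_covered` and `confidence` for brevity (the helper name is illustrative):

```python
def build_total_result(participant_id: str, predicted: int | None, actual: int) -> dict:
    """Assemble the total-score block of the participant results JSON."""
    result: dict = {
        "participant_id": participant_id,
        "prediction_mode": "total",
        "total_score": {"predicted": predicted, "actual": actual, "method": "sum_of_items"},
    }
    if predicted is not None:  # Tier comparison only applies when not abstaining
        tier_pred, tier_actual = severity_tier(predicted), severity_tier(actual)
        result["severity_tier"] = {
            "predicted": tier_pred,
            "actual": tier_actual,
            "correct": tier_pred == tier_actual,
        }
    return result
```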
## Testing

- Unit tests for total score computation from items (see the pytest sketch after this list)
- Integration test with `--prediction-mode total`
- Verify the output JSON schema includes the total fields
- Compare MAE_total to MAE_item on the same run
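A starting point for the unit tests, assuming `predict_total_score` from Option A is importable (the module path is hypothetical):

```python
from phq8.total import predict_total_score  # Hypothetical module path

def test_abstains_below_coverage_threshold():
    scores = {f"item_{i}": (1 if i < 3 else None) for i in range(8)}  # 3 of 8 scored
    assert predict_total_score(scores) is None

def test_sums_fully_covered_items():
    assert predict_total_score({f"item_{i}": 2 for i in range(8)}) == 16
```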
## Dependencies

- Phase 1: none (uses the existing pipeline)
- Phase 2: requires new prompt design