Coverage Explained: What It Is and Why It Matters

Audience: Anyone trying to understand what "coverage" means in PHQ-8 assessment
Last Updated: 2026-01-02

What is Coverage?

Coverage is the percentage of PHQ-8 items that received an actual score (0, 1, 2, or 3) instead of "N/A" (Not Applicable / Cannot Assess).

Simple Example

The PHQ-8 has 8 items. If a patient's assessment looks like this:

Item	Score
No Interest	2
Depressed	1
Sleep	1
Tired	N/A
Appetite	N/A
Failure	0
Concentrating	N/A
Moving	N/A

Coverage = 4/8 = 50%

Only 4 items got scores; 4 were marked "N/A" (cannot assess).
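The calculation above can be reproduced in a few lines of Python. This is a minimal sketch, not the repository's implementation: the `coverage` function and the use of `None` to represent "N/A" are assumptions made for illustration.

```python
# Scores from the example table above; None stands in for "N/A" (an assumption).
scores = {
    "No Interest": 2,
    "Depressed": 1,
    "Sleep": 1,
    "Tired": None,
    "Appetite": None,
    "Failure": 0,
    "Concentrating": None,
    "Moving": None,
}


def coverage(item_scores):
    """Fraction of items that received a numeric score rather than N/A."""
    if not item_scores:
        raise ValueError("no items to evaluate")
    # Test against None explicitly: a score of 0 is a real score, not an abstention.
    scored = sum(1 for s in item_scores.values() if s is not None)
    return scored / len(item_scores)


print(f"Coverage = {coverage(scores):.0%}")  # Coverage = 50%
```

Note that "Failure" counts toward coverage even though its score is 0, which is why the check is `is not None` rather than a truthiness test.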

Why Is Coverage Below 100%?

The system says "N/A" when it cannot find enough evidence in the interview transcript to make a prediction.

Reasons for N/A

Symptom not discussed: If the patient never mentioned sleep, the system can't score the sleep item
Vague mentions: "I've been okay" doesn't give enough information
Evidence extraction failed: Sometimes the LLM fails to extract relevant quotes
Conservative thresholds: Some models are more cautious about making predictions

Clinical Parallel

This mirrors real clinical practice. If a patient never discussed their appetite during an interview, a clinician wouldn't score that item either—they'd mark it as "not assessed."

Coverage vs. Accuracy Tradeoff

This is the key insight:

Higher coverage → More predictions → Includes harder items → Potentially higher MAE
Lower coverage → Fewer predictions → Only "easy" items → Potentially lower MAE

Example Run vs. Paper

The paper reports item-level MAE and notes that in ~50% of cases the model could not provide a prediction due to insufficient evidence. The paper does not fully specify what the denominator for “cases” is (item-level vs subject-level), but it is clearly describing substantial abstention due to missing evidence.

This repository also computes item-level MAE excluding N/A, but the exact coverage/MAE depends on model weights/quantization, backend, and prompt behavior.
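To see why the unspecified denominator matters, the sketch below computes an abstention rate two ways on the same toy data. The data, the function names, and both definitions are assumptions for illustration; neither is claimed to be the paper's definition.

```python
# Hypothetical predictions for 4 subjects x 8 PHQ-8 items; None stands in for N/A.
subjects = [
    [2, 1, 1, None, None, 0, None, None],
    [None] * 8,
    [1, 0, None, None, None, None, 2, None],
    [0, 1, 2, 1, None, 0, 1, None],
]


def item_level_abstention(subjects):
    """Share of all (subject, item) cells that are N/A."""
    cells = [score for subject in subjects for score in subject]
    return sum(score is None for score in cells) / len(cells)


def subject_level_abstention(subjects):
    """Share of subjects for whom no item at all was scored."""
    fully_abstained = sum(all(s is None for s in subject) for subject in subjects)
    return fully_abstained / len(subjects)


print(f"item-level:    {item_level_abstention(subjects):.1%}")
print(f"subject-level: {subject_level_abstention(subjects):.1%}")
```

On this toy data the item-level rate (19 of 32 cells) is more than twice the subject-level rate (1 of 4 subjects), so a single "~50%" figure can describe quite different behavior depending on which denominator is meant.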

Metric	Paper (reported)	Example Run (paper-test, few-shot, participant-only transcripts)
Coverage (Cmax)	~50% abstention (“unable to provide a prediction”)	50.9%
MAE_item	0.619	0.609

Run details and metric definitions live in:
- docs/results/run-history.md
- docs/results/reproduction-results.md
- docs/statistics/metrics-and-evaluation.md

Interpretation: higher coverage often increases MAE because the model attempts more items (including harder-to-evidence symptoms). This is a general tradeoff; attributing cause requires ablations (e.g., retrieval thresholds, validation, model choice).

Per-Item Coverage Patterns

Not all PHQ-8 items are created equal. Some are discussed more often in interviews:

Item	Typical Coverage Pattern	Why
Depressed	High	Often directly discussed
Sleep	High	Common topic, clear evidence
Appetite	Low	Often not discussed explicitly
Moving	Low	Hard to infer from text alone (psychomotor change)
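Per-item coverage like that in the table above is the same calculation taken per item across participants rather than across items within one participant. The sketch below uses invented predictions for four hypothetical participants; the numbers are not results from any run, and `per_item_coverage` is an illustrative name.

```python
ITEMS = ["Depressed", "Sleep", "Appetite", "Moving"]

# Hypothetical predictions for 4 participants; None stands in for N/A.
predictions = [
    {"Depressed": 2, "Sleep": 1, "Appetite": None, "Moving": None},
    {"Depressed": 0, "Sleep": None, "Appetite": None, "Moving": None},
    {"Depressed": 1, "Sleep": 3, "Appetite": 1, "Moving": None},
    {"Depressed": None, "Sleep": 2, "Appetite": None, "Moving": 0},
]


def per_item_coverage(predictions, items):
    """For each item, the fraction of participants with a numeric score."""
    n = len(predictions)
    return {
        item: sum(p.get(item) is not None for p in predictions) / n
        for item in items
    }


for item, cov in per_item_coverage(predictions, ITEMS).items():
    print(f"{item:<10} {cov:.0%}")
```

Items with low per-item coverage contribute few scored subjects to their MAE, so their per-item MAE estimates rest on small samples.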

The paper reports observations consistent with these per-item patterns:

"PHQ-8-Appetite had no successfully retrieved reference chunks"

Note: this quote is about few-shot reference retrieval (no retrieved reference chunks), not “coverage” directly.

And:

"For symptoms such as poor appetite and moving slowly, MAE performance was highly variable due to substantially fewer subjects with available scores"

What's Better: High or Low Coverage?

It depends on your goal:

High Coverage is Better If:

You want to assess as many symptoms as possible
You're willing to accept some error on harder items
Clinical utility matters (a partial assessment is better than no assessment)

Low Coverage is Better If:

You only want high-confidence predictions
You prefer to say "I don't know" rather than risk being wrong
You're measuring MAE and want it to look good

Our Approach

We prioritize pure LLM measurement—the system only scores items where the LLM found sufficient evidence. This matches the paper's methodology and provides a clean measure of model capability.

How Coverage Affects MAE Calculation

MAE (Mean Absolute Error) is calculated only on items that received a score; N/A items are left out of both the numerator and the denominator.

Example

Item	Ground Truth	Prediction	Error
No Interest	2	1	1
Depressed	1	2	1
Sleep	1	1	0
Tired	2	N/A	(excluded)
Appetite	0	N/A	(excluded)
Failure	0	0	0
Concentrating	1	N/A	(excluded)
Moving	0	N/A	(excluded)

MAE = (1 + 1 + 0 + 0) / 4 = 0.5

Note: Only 4 items counted because 4 were N/A.
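The same arithmetic as a minimal Python sketch, using the values from the table above. The function name `mae_excluding_na` and the `None`-for-N/A convention are assumptions for illustration, not the repository's API.

```python
# Ground truth and predictions from the example table; None stands in for N/A.
truth = {
    "No Interest": 2, "Depressed": 1, "Sleep": 1, "Tired": 2,
    "Appetite": 0, "Failure": 0, "Concentrating": 1, "Moving": 0,
}
pred = {
    "No Interest": 1, "Depressed": 2, "Sleep": 1, "Tired": None,
    "Appetite": None, "Failure": 0, "Concentrating": None, "Moving": None,
}


def mae_excluding_na(truth, pred):
    """Mean absolute error over items with a numeric prediction only."""
    errors = [abs(truth[item] - p) for item, p in pred.items() if p is not None]
    if not errors:
        return None  # nothing was scored, so MAE is undefined
    return sum(errors) / len(errors)


print(mae_excluding_na(truth, pred))  # 0.5
```

Returning `None` when every item is N/A is one possible design choice; it avoids a division by zero and keeps "undefined" distinct from a perfect MAE of 0.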

The Trick

If the system skips hard items (where it would have made errors) and only predicts easy items (where it's accurate), MAE looks artificially good.
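A toy comparison makes the effect concrete. The ground truth below matches the example table; the "forced" predictions for the four abstained items are invented purely for illustration, since the real system produced no scores for them.

```python
truth = {
    "No Interest": 2, "Depressed": 1, "Sleep": 1, "Tired": 2,
    "Appetite": 0, "Failure": 0, "Concentrating": 1, "Moving": 0,
}
# Hypothetical: what the model might have guessed if it had been forced to
# score every item. Values for the four abstained items are made up.
forced = {
    "No Interest": 1, "Depressed": 2, "Sleep": 1, "Tired": 0,
    "Appetite": 2, "Failure": 0, "Concentrating": 3, "Moving": 1,
}
abstained = {"Tired", "Appetite", "Concentrating", "Moving"}


def mae(truth, pred, skip=frozenset()):
    """Mean absolute error over all predicted items not listed in `skip`."""
    errors = [abs(truth[item] - pred[item]) for item in pred if item not in skip]
    return sum(errors) / len(errors)


selective = mae(truth, forced, skip=abstained)  # 50% coverage
full = mae(truth, forced)                       # 100% coverage
print(f"selective MAE: {selective}, full MAE: {full}")
```

With these made-up guesses, MAE is 0.5 at 50% coverage and 1.125 at 100% coverage. The model is identical in both cases; only the set of items it is evaluated on has changed.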

What Drives Coverage in Our System?

1. Evidence Extraction

The LLM reads the transcript and extracts quotes for each PHQ-8 item. If it finds quotes, it can make a prediction.

2. Model Confidence

The LLM decides when to say "N/A". Some models are more conservative than others.

3. Transcript Richness

Longer, more detailed interviews → more evidence → higher coverage.

Why Our Coverage May Differ from the Paper

Paper Section 3.2 explicitly notes that subjects without sufficient evidence were excluded, and that in ~50% of cases the model was unable to provide a prediction due to insufficient evidence.

Plausible contributors to coverage differences include:

Prompt wording and parsing behavior differences
Model weights and quantization differences (paper does not specify quantization)
Backend/runtime differences (Ollama vs HuggingFace)

Summary

Concept	Definition
Coverage	% of PHQ-8 items that got scores (not N/A)
N/A	Item not scored due to insufficient evidence
Tradeoff	Higher coverage → more items → may include harder predictions
MAE impact	Only scored items count; N/A items are excluded

Key takeaway: Coverage and MAE must be interpreted together. A system with 0.619 MAE at ~50% abstention is not directly comparable to a system with ~0.78 MAE at ~69% coverage—they’re making different tradeoffs.

Related Documentation

Clinical Understanding - How the system works
Reproduction Results - Historical run notes
Agent Sampling Registry - Sampling parameters (paper leaves some unspecified)
Metrics and Evaluation - Exact metric definitions + output schema