Data Splits Overview
Purpose: Definitive reference for all data split configurations
Last Updated: 2026-01-03
This document explains the relationship between AVEC2017 competition splits, the paper's custom splits, and our implementation.
The Core Problem
The AVEC2017 test set does NOT have per-item PHQ-8 scores. You cannot compute item-level MAE without per-item ground truth.
| Split | Per-Item PHQ-8 | Total Score | Can Compute Item MAE? |
|---|---|---|---|
| Train | ✅ Yes | ✅ Yes | ✅ Yes |
| Dev | ✅ Yes | ✅ Yes | ✅ Yes |
| Test | ❌ No | ✅ Yes | ❌ NO |
This is why the paper created their own split.
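To make the constraint concrete, here is a minimal sketch of how item-level MAE is computed; `item_level_mae` is a hypothetical helper (not a function from our scripts), and the toy arrays are illustrative only. Without per-item ground truth (the `y_true` matrix), this quantity simply cannot be evaluated.

```python
import numpy as np

def item_level_mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error averaged over all 8 PHQ-8 items and all participants.

    y_true, y_pred: shape (n_participants, 8) arrays of per-item scores (0-3).
    Requires per-item ground truth -- unavailable for the AVEC2017 test set.
    """
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example: 2 participants, 8 items each; one prediction is off by 1.
truth = np.array([[0, 1, 2, 3, 0, 1, 2, 3],
                  [1, 1, 1, 1, 2, 2, 2, 2]])
preds = np.array([[0, 1, 2, 3, 0, 1, 2, 3],
                  [1, 1, 1, 1, 2, 2, 2, 3]])
print(item_level_mae(truth, preds))  # 1 error of 1 over 16 values -> 0.0625
```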
Part 1: AVEC2017 Official Splits
What is AVEC2017?
- AVEC = Audio/Visual Emotion Challenge (annual competition)
- 2017 = The year this depression detection challenge ran
- DAIC-WOZ = The dataset used (Distress Analysis Interview Corpus - Wizard of Oz)
AVEC2017 defined official train/dev/test splits of the DAIC-WOZ dataset.
Official Split Counts
From the original challenge (Ringeval et al., 2019):
| Split | Participants | Per-Item PHQ-8 | Purpose |
|---|---|---|---|
| Train | 107 | ✅ Available | Model training |
| Dev | 35 | ✅ Available | Hyperparameter tuning |
| Test | 47 | ❌ Not available | Competition evaluation |
| Total | 189 | | |
Our Local Data
data/train_split_Depression_AVEC2017.csv → 107 participants
data/dev_split_Depression_AVEC2017.csv → 35 participants
data/test_split_Depression_AVEC2017.csv → 47 participants (matches)
data/full_test_split.csv → 47 participants (total score only)
Note: data/ is gitignored due to DAIC-WOZ licensing. These files exist locally after
running the dataset preparation step (scripts/prepare_dataset.py), but are not
committed to the repository.
Test Set Schema Comparison
Train/Dev CSVs (have everything):
Participant_ID,PHQ8_Binary,PHQ8_Score,Gender,PHQ8_NoInterest,PHQ8_Depressed,PHQ8_Sleep,PHQ8_Tired,PHQ8_Appetite,PHQ8_Failure,PHQ8_Concentrating,PHQ8_Moving
Test Split CSV (NO PHQ-8 at all):
participant_ID,Gender
Full Test CSV (total score only, NO per-item):
Participant_ID,PHQ_Binary,PHQ_Score,Gender
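The schema difference can be checked programmatically from the CSV headers alone. The sketch below uses only the stdlib; `has_item_labels` is a hypothetical helper, and the header strings are copied from the schemas above rather than read from the (gitignored) local files.

```python
import csv
import io

# The 8 per-item PHQ-8 columns present only in the train/dev CSVs.
ITEM_COLS = {
    "PHQ8_NoInterest", "PHQ8_Depressed", "PHQ8_Sleep", "PHQ8_Tired",
    "PHQ8_Appetite", "PHQ8_Failure", "PHQ8_Concentrating", "PHQ8_Moving",
}

def has_item_labels(csv_text: str) -> bool:
    """True if the CSV header contains all eight per-item PHQ-8 columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return ITEM_COLS.issubset(header)

train_header = ("Participant_ID,PHQ8_Binary,PHQ8_Score,Gender,PHQ8_NoInterest,"
                "PHQ8_Depressed,PHQ8_Sleep,PHQ8_Tired,PHQ8_Appetite,"
                "PHQ8_Failure,PHQ8_Concentrating,PHQ8_Moving\n")
test_header = "participant_ID,Gender\n"

print(has_item_labels(train_header))  # True
print(has_item_labels(test_header))   # False
```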
Part 2: What the Paper Did
The Problem
The paper wanted to report item-level MAE (error on each of the 8 PHQ-8 items).
But the AVEC2017 test set doesn't have per-item labels. So they couldn't use it.
Their Solution
From Paper Section 2.4.1:
"We split 142 subjects with eight-item PHQ-8 scores from the DAIC-WOZ database into training, validation, and test sets."
They:
1. Combined train + dev = 142 participants (all have per-item labels)
2. Created a new 58/43/41 stratified split
3. Used their 58 for the few-shot embedding knowledge base
4. Reported MAE on their 41-participant custom test set
Paper's Custom Split
| Split | Count | Percentage | Purpose |
|---|---|---|---|
| Paper Train | 58 | 41% | Few-shot embedding knowledge base |
| Paper Val | 43 | 30% | Hyperparameter tuning (Appendix D) |
| Paper Test | 41 | 29% | Final MAE evaluation (Table 1, Figure 4/5) |
Stratification Algorithm (Appendix C)
"We stratified 142 subjects [...] based on PHQ-8 total scores and gender information."

"For PHQ-8 total scores with two participants, we put one in the validation set and one in the test set. For PHQ-8 total scores with one participant, we put that one participant in the training set."
This ensures balanced distribution of severity levels across splits.
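The stated heuristics can be sketched as a small grouping routine. This is an illustrative reconstruction, not the authors' code: `stratify` is a hypothetical function, gender stratification is omitted, and the VAL-before-TEST ordering within two-participant strata is an assumption (the paper only says one goes to each).

```python
from collections import defaultdict

def stratify(participants: dict) -> tuple:
    """Apply the paper's stated heuristics to {participant_id: phq8_total}.

    Strata with 1 member go to TRAIN; strata with 2 split into VAL and TEST;
    strata with 3 go (by ascending ID) to TRAIN, VAL, TEST.
    Gender stratification and larger strata are omitted from this sketch.
    """
    strata = defaultdict(list)
    for pid, total in participants.items():
        strata[total].append(pid)

    train, val, test = [], [], []
    for total, pids in strata.items():
        pids = sorted(pids)
        if len(pids) == 1:
            train += pids
        elif len(pids) == 2:
            val.append(pids[0])
            test.append(pids[1])
        elif len(pids) == 3:
            train.append(pids[0])
            val.append(pids[1])
            test.append(pids[2])
    return train, val, test

# Toy data: score 5 has one participant, score 7 has two, score 10 has three.
toy = {301: 5, 310: 7, 311: 7, 320: 10, 321: 10, 322: 10}
print(stratify(toy))  # ([301, 320], [310, 321], [311, 322])
```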
How We Obtained the Exact Split IDs
The paper does not publish the exact participant IDs. We reconstructed them by extracting IDs from the paper authors' published output files in _reference/analysis_output/:
| Source | Split | Count |
|---|---|---|
| quan_gemma_few_shot/TEST_analysis_output/*.jsonl | TEST | 41 |
| quan_gemma_few_shot/VAL_analysis_output/*.jsonl | VAL | 43 |
| quan_gemma_zero_shot.jsonl minus TEST minus VAL | TRAIN | 58 |
This reconstruction is authoritative because these are the actual IDs the paper used. The exact IDs are documented in Appendix A below.
We also verified the paper's stated heuristics against the reconstructed splits:
- Single-participant scores → TRAIN: 2/2 verified (100%)
- Two-participant scores → 1 VAL + 1 TEST: 4/4 verified (100%)
- Three-participant strata: sorted by ID, first→TRAIN, second→VAL, third→TEST (8/8 verified)
Part 3: Our Implementation Options
Option A: Use AVEC2017 Dev (Current Approach)
uv run python scripts/reproduce_results.py --split dev
| Aspect | Value |
|---|---|
| Evaluation set | AVEC2017 dev split (35 participants) |
| Knowledge base | AVEC2017 train split (107 participants) |
| Comparable to paper? | ⚠️ Different participants |
| Valid methodology? | ✅ Yes |
Pros: Simpler, no data leakage concerns, uses official splits.
Cons: Results not directly comparable to the paper's 0.619 MAE.
Option B: Use Paper Ground Truth Split (Recommended for Reproduction)
uv run python scripts/create_paper_split.py # uses --mode ground-truth by default
uv run python scripts/generate_embeddings.py --split paper-train
uv run python scripts/reproduce_results.py --split paper
| Aspect | Value |
|---|---|
| Evaluation set | Custom test split (41 participants) |
| Knowledge base | Custom train split (58 participants) |
| Comparable to paper? | ✅ Exact match (IDs reverse-engineered from output files) |
| Valid methodology? | ✅ Yes |
Pros: Matches paper participants exactly.
Cons: Requires generating the paper split and paper embeddings artifact locally (not committed).
Option C: Full Train+Dev Evaluation
uv run python scripts/reproduce_results.py --split train+dev
| Aspect | Value |
|---|---|
| Evaluation set | All 142 participants |
| Knowledge base | Same participants (⚠️ data leakage!) |
| Comparable to paper? | ❌ No |
| Valid methodology? | ⚠️ Only for debugging |
Warning: This has data leakage because you're testing on the same participants used for few-shot retrieval.
Part 4: Data Leakage Considerations
What is Data Leakage?
If you use participant X's transcript to build the few-shot knowledge base, then test on participant X, the model has "seen" the answer. This artificially inflates performance.
How to Avoid It
| Scenario | Knowledge Base | Test Set | Leakage? |
|---|---|---|---|
| AVEC2017 approach | Train only (107) | Dev only (35) | ✅ No leakage |
| Paper approach | Paper train (58) | Paper test (41) | ✅ No leakage |
| Our current | Train only (107) | Dev (35) | ✅ No leakage |
| Dangerous | Train+Dev | Train+Dev | ❌ LEAKAGE |
Our scripts/generate_embeddings.py Uses Train Only
From scripts/generate_embeddings.py:
# Uses ONLY training IDs to avoid leakage:
# - AVEC2017: data/train_split_Depression_AVEC2017.csv
# - Paper: data/paper_splits/paper_split_train.csv
This is correct. We never include dev participants in the knowledge base.
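This invariant is cheap to enforce defensively. Below is a minimal sketch of such a guard; `assert_no_leakage` is a hypothetical helper, not a function from `scripts/generate_embeddings.py`, and the IDs are illustrative.

```python
def assert_no_leakage(kb_ids, eval_ids) -> None:
    """Raise if any evaluation participant appears in the few-shot knowledge base."""
    overlap = set(kb_ids) & set(eval_ids)
    if overlap:
        raise ValueError(f"Data leakage: {sorted(overlap)} in both sets")

# Safe: disjoint ID sets (e.g. train-only knowledge base vs dev evaluation).
assert_no_leakage([303, 304, 305], [302, 307])

# Leaky: participant 303 appears in both -> raises ValueError.
try:
    assert_no_leakage([303, 304], [303, 307])
except ValueError as e:
    print(e)
```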
Part 5: Recommended Approach
For Paper Reproduction
1. Generate paper ground truth splits (this uses the exact participant IDs reverse-engineered from the paper authors' output files):

uv run python scripts/create_paper_split.py

2. Regenerate embeddings using the paper train split only:

uv run python scripts/generate_embeddings.py --split paper-train

3. Evaluate on the paper test split:

uv run python scripts/reproduce_results.py --split paper --few-shot-only
For General Use
Use AVEC2017 official splits. They're simpler and avoid any ambiguity:
uv run python scripts/reproduce_results.py --split dev
Part 6: Summary Table
| Split Source | Train | Val/Dev | Test | Total | Per-Item Labels |
|---|---|---|---|---|---|
| AVEC2017 Official | 107 | 35 | 47 | 189 | Train+Dev only |
| Paper Custom | 58 | 43 | 41 | 142 | All (from train+dev) |
| Our Local Data | 107 | 35 | 47 | 189 | Train+Dev only |
Files Reference
| File | Source | Count | Has Per-Item PHQ-8 |
|---|---|---|---|
| train_split_Depression_AVEC2017.csv | AVEC2017 | 107 | ✅ Yes |
| dev_split_Depression_AVEC2017.csv | AVEC2017 | 35 | ✅ Yes |
| test_split_Depression_AVEC2017.csv | AVEC2017 | 47 | ❌ No |
| full_test_split.csv | AVEC2017 | 47 | ❌ Total only |
| paper_splits/paper_split_train.csv | Our impl | 58 | ✅ Yes |
| paper_splits/paper_split_val.csv | Our impl | 43 | ✅ Yes |
| paper_splits/paper_split_test.csv | Our impl | 41 | ✅ Yes |
See Also
- DAIC-WOZ Schema - Full dataset schema
- Agent Sampling Registry - Sampling parameters (paper leaves some unspecified)
- Reproduction Results - Reproduction status
Appendix A: Paper Split Participant IDs
These splits were reconstructed by extracting participant IDs from the paper authors' output files in _reference/analysis_output/. This is the ground truth for reproducing the paper's results.
TRAIN (58 participants)
Source: Derived from _reference/analysis_output/quan_gemma_zero_shot.jsonl minus TEST minus VAL
303, 304, 305, 310, 312, 313, 315, 317, 318, 321, 324, 327, 335, 338, 340, 343,
344, 346, 347, 350, 352, 356, 363, 368, 369, 388, 391, 395, 397, 400, 402, 404,
406, 412, 414, 415, 416, 418, 426, 429, 433, 434, 437, 439, 444, 458, 463, 464,
473, 474, 475, 476, 477, 478, 483, 486, 488, 491
VAL (43 participants)
Source: _reference/analysis_output/quan_gemma_few_shot/VAL_analysis_output/*.jsonl
302, 307, 320, 322, 325, 326, 328, 331, 333, 336, 341, 348, 351, 353, 355, 358,
360, 364, 366, 371, 372, 374, 376, 380, 381, 382, 392, 401, 403, 419, 420, 425,
440, 443, 446, 448, 454, 457, 471, 479, 482, 490, 492
TEST (41 participants)
Source: _reference/analysis_output/quan_gemma_few_shot/TEST_analysis_output/*.jsonl
316, 319, 330, 339, 345, 357, 362, 367, 370, 375, 377, 379, 383, 385, 386, 389,
390, 393, 409, 413, 417, 422, 423, 427, 428, 430, 436, 441, 445, 447, 449, 451,
455, 456, 459, 468, 472, 484, 485, 487, 489
Verification
| Check | Result |
|---|---|
| TRAIN + VAL + TEST | 58 + 43 + 41 = 142 ✓ |
| TRAIN ∩ VAL | 0 ✓ |
| TRAIN ∩ TEST | 0 ✓ |
| VAL ∩ TEST | 0 ✓ |
| TEST == metareview IDs | ✓ |
| TEST == medgemma IDs | ✓ |
| TEST == DIM_TEST IDs | ✓ |
Output File Consistency
All output files use these same splits. Paths below are relative to _reference/analysis_output/:
| File | Split Used | Count | Consistent |
|---|---|---|---|
| quan_gemma_zero_shot.jsonl | ALL | 142 | ✓ |
| quan_gemma_few_shot/TEST_analysis_output/*.jsonl | TEST | 41 | ✓ |
| quan_gemma_few_shot/VAL_analysis_output/*.jsonl | VAL | 43 | ✓ |
| quan_gemma_few_shot/DIM_TEST_analysis_output/*.jsonl | TEST | 41 | ✓ |
| quan_medgemma_few_shot.jsonl | TEST | 41 | ✓ |
| quan_medgemma_zero_shot.jsonl | TEST | 41 | ✓ |
| metareview_gemma_few_shot.csv | TEST | 41 | ✓ |
| qual_gemma.csv | ALL (142 unique IDs; one duplicated row) | 142 | ✓ |
Python Usage
To reproduce the paper's results, use these exact participant IDs:
TRAIN_IDS = [303, 304, 305, 310, 312, 313, 315, 317, 318, 321, 324, 327, 335, 338, 340, 343, 344, 346, 347, 350, 352, 356, 363, 368, 369, 388, 391, 395, 397, 400, 402, 404, 406, 412, 414, 415, 416, 418, 426, 429, 433, 434, 437, 439, 444, 458, 463, 464, 473, 474, 475, 476, 477, 478, 483, 486, 488, 491]
VAL_IDS = [302, 307, 320, 322, 325, 326, 328, 331, 333, 336, 341, 348, 351, 353, 355, 358, 360, 364, 366, 371, 372, 374, 376, 380, 381, 382, 392, 401, 403, 419, 420, 425, 440, 443, 446, 448, 454, 457, 471, 479, 482, 490, 492]
TEST_IDS = [316, 319, 330, 339, 345, 357, 362, 367, 370, 375, 377, 379, 383, 385, 386, 389, 390, 393, 409, 413, 417, 422, 423, 427, 428, 430, 436, 441, 445, 447, 449, 451, 455, 456, 459, 468, 472, 484, 485, 487, 489]
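The Verification-table invariants can be re-checked mechanically. The sketch below uses a hypothetical `verify_splits` helper (not part of our scripts); call it with the TRAIN_IDS/VAL_IDS/TEST_IDS lists above. The demo uses toy ranges with the same 58/43/41 shape so the snippet is self-contained.

```python
def verify_splits(train, val, test, expected_total=142) -> bool:
    """Check the split invariants: counts sum to the expected total
    and the three ID sets are pairwise disjoint."""
    assert len(train) + len(val) + len(test) == expected_total, "bad total"
    sets = {"TRAIN": set(train), "VAL": set(val), "TEST": set(test)}
    names = list(sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            assert not (sets[a] & sets[b]), f"{a} and {b} overlap"
    return True

# Toy demo with the same 58/43/41 shape as the paper split.
print(verify_splits(list(range(58)), list(range(58, 101)), list(range(101, 142))))  # True
```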
Last verified: 2025-12-25
Reconstructed from: _reference/analysis_output/ (snapshot of paper authors' published outputs; upstream: trendscenter/ai-psychiatrist)