DAIC-WOZ Dataset Schema

Purpose: Enable development without direct data access Dataset: Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) Access: Requires EULA from USC ICT Reference: AVEC 2017 Challenge

Overview

DAIC-WOZ is a clinical interview dataset for depression detection research. It contains semi-structured interviews conducted by an animated virtual interviewer named Ellie with participants who may or may not have depression.

Key Statistics

Metric	Value
Total participants	189
Labeled participants	142 (train + dev)
Unlabeled participants	47 (test)
ID range	300-492 (with gaps)
Interview duration	5-25 minutes
Total size	~86 GB (with all modalities)

Missing Participant IDs

Not all IDs in range 300-492 exist. Known gaps include:

342, 394, 398, 460, ...

Always validate participant existence before processing.

Directory Structure

Expected Layout (after `scripts/prepare_dataset.py`)

data/
├── transcripts/                         # Extracted transcripts
│   ├── 300_P/
│   │   └── 300_TRANSCRIPT.csv
│   ├── 301_P/
│   │   └── 301_TRANSCRIPT.csv
│   └── .../
├── transcripts_participant_only/        # Deterministic participant-only variant (recommended)
│   ├── 300_P/300_TRANSCRIPT.csv
│   └── ...
├── transcripts_both_speakers_clean/     # Cleaned but keeps Ellie + Participant
├── transcripts_participant_qa/          # Participant + minimal question context
├── embeddings/                          # Pre-computed (Spec 08)
│   ├── huggingface_qwen3_8b_paper_train_participant_only.npz       # Participant-only paper-train knowledge base (TRAIN=58)
│   ├── huggingface_qwen3_8b_paper_train_participant_only.json
│   ├── huggingface_qwen3_8b_paper_train_participant_only.meta.json  # Provenance metadata
│   ├── paper_reference_embeddings.npz             # Optional legacy/compat filename (paper-train)
│   ├── paper_reference_embeddings.json
│   ├── paper_reference_embeddings.meta.json       # Optional provenance metadata
│   ├── reference_embeddings.npz         # Optional: AVEC train knowledge base
│   └── reference_embeddings.json
├── paper_splits/                        # Optional: paper 58/43/41 split (ground truth for reproduction)
│   ├── paper_split_train.csv
│   ├── paper_split_val.csv
│   ├── paper_split_test.csv
│   └── paper_split_metadata.json
├── train_split_Depression_AVEC2017.csv  # Ground truth (train)
├── dev_split_Depression_AVEC2017.csv    # Ground truth (dev)
├── test_split_Depression_AVEC2017.csv   # Identifiers only
└── full_test_split.csv                  # Test totals (if available)

Configuration Paths

Defined in src/ai_psychiatrist/config.py:

class DataSettings(BaseSettings):
    base_dir: Path = Path("data")
    transcripts_dir: Path = Path("data/transcripts")
    embeddings_path: Path = Path("data/embeddings/huggingface_qwen3_8b_paper_train.npz")
    train_csv: Path = Path("data/train_split_Depression_AVEC2017.csv")
    dev_csv: Path = Path("data/dev_split_Depression_AVEC2017.csv")

Note: .env.example and DATA-PIPELINE-SPEC.md use participant-only artifacts (e.g., huggingface_qwen3_8b_paper_train_participant_only.*) via env overrides.

To use a preprocessed transcript variant, set:

DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only

See: DAIC-WOZ Preprocessing.

Transcript Format

File Location

data/transcripts/{id}_P/{id}_TRANSCRIPT.csv

Example: data/transcripts/300_P/300_TRANSCRIPT.csv

Schema

Column	Type	Description	Example
`start_time`	float	Utterance start (seconds)	`36.588`
`stop_time`	float	Utterance end (seconds)	`39.668`
`speaker`	string	Speaker identifier	`"Ellie"` or `"Participant"`
`value`	string	Transcript text	`"hi i'm ellie thanks for coming in today"`

Format Details

Separator: Tab (\t)
Encoding: UTF-8
Text style: Lowercase, minimal punctuation
Headers: First row is header
Typical size: ~100-300 rows per transcript

Synthetic Example

start_time  stop_time   speaker value
0.000   2.500   Ellie   hi i'm ellie thanks for coming in today
3.100   4.200   Participant hello
5.000   8.500   Ellie   how are you doing today
9.200   12.800  Participant i'm doing okay i guess
13.500  18.000  Ellie   tell me about the last time you felt really happy
19.200  28.500  Participant um i don't know it's been a while i guess maybe when i saw my family last month
30.000  35.500  Ellie   that sounds nice can you tell me more about that visit

How It's Loaded

TranscriptService._parse_daic_woz_transcript() in src/ai_psychiatrist/services/transcript.py:

df = pd.read_csv(path, sep="\t")
df = df.dropna(subset=["speaker", "value"])
df["dialogue"] = df["speaker"] + ": " + df["value"]
return "\n".join(df["dialogue"].tolist())

Output format (what agents see):

Ellie: hi i'm ellie thanks for coming in today
Participant: hello
Ellie: how are you doing today
Participant: i'm doing okay i guess
...

Ground Truth Format

Train/Dev Split CSVs

Files: - train_split_Depression_AVEC2017.csv (107 participants) - dev_split_Depression_AVEC2017.csv (35 participants)

Schema

Column	Type	Range	Description
`Participant_ID`	int	300-492	Unique identifier
`PHQ8_Binary`	int	0-1	MDD indicator (1 if score >= 10)
`PHQ8_Score`	int	0-24	Total PHQ-8 score
`Gender`	int	0-1	0 = male, 1 = female
`PHQ8_NoInterest`	int	0-3	Item 1: Little interest or pleasure
`PHQ8_Depressed`	int	0-3	Item 2: Feeling down, depressed
`PHQ8_Sleep`	int	0-3	Item 3: Sleep problems
`PHQ8_Tired`	int	0-3	Item 4: Feeling tired
`PHQ8_Appetite`	int	0-3	Item 5: Appetite changes
`PHQ8_Failure`	int	0-3	Item 6: Feeling bad about self
`PHQ8_Concentrating`	int	0-3	Item 7: Trouble concentrating
`PHQ8_Moving`	int	0-3	Item 8: Moving/speaking slowly or fidgety

PHQ-8 Item Score Meaning

Each item scored 0-3 based on frequency over past 2 weeks:

Score	Meaning
0	Not at all
1	Several days
2	More than half the days
3	Nearly every day

Severity Levels

Derived from total PHQ-8 score (0-24):

Score Range	Severity Level	MDD Classification
0-4	None/Minimal	No MDD
5-9	Mild	No MDD
10-14	Moderate	MDD
15-19	Moderately Severe	MDD
20-24	Severe	MDD

Synthetic Example (train CSV)

Participant_ID,PHQ8_Binary,PHQ8_Score,Gender,PHQ8_NoInterest,PHQ8_Depressed,PHQ8_Sleep,PHQ8_Tired,PHQ8_Appetite,PHQ8_Failure,PHQ8_Concentrating,PHQ8_Moving
300,0,3,1,0,0,1,1,0,0,1,0
301,0,7,0,1,1,1,1,1,0,1,1
302,1,15,1,2,2,2,2,1,2,2,2
303,0,0,0,0,0,0,0,0,0,0,0
304,1,20,0,2,3,3,3,3,3,3,0

Column Mapping in Code

GroundTruthService.COLUMN_MAPPING in src/ai_psychiatrist/services/ground_truth.py:

COLUMN_MAPPING = {
    "PHQ8_NoInterest": PHQ8Item.NO_INTEREST,
    "PHQ8_Depressed": PHQ8Item.DEPRESSED,
    "PHQ8_Sleep": PHQ8Item.SLEEP,
    "PHQ8_Tired": PHQ8Item.TIRED,
    "PHQ8_Appetite": PHQ8Item.APPETITE,
    "PHQ8_Failure": PHQ8Item.FAILURE,
    "PHQ8_Concentrating": PHQ8Item.CONCENTRATING,
    "PHQ8_Moving": PHQ8Item.MOVING,
}

Test Split Format

AVEC2017 Test Split

File: test_split_Depression_AVEC2017.csv

Note: Does NOT include PHQ-8 scores (evaluation set).

Column	Type	Description
`participant_ID`	int	Note: lowercase 'p' (column name differs)
`Gender`	int	0 = male, 1 = female

Full Test Split (if available)

File: full_test_split.csv

Some distributions include total scores but NOT item-wise scores:

Column	Type	Description
`Participant_ID`	int	Note: uppercase 'P'
`PHQ_Binary`	int	Note: no '8' in column name
`PHQ_Score`	int	Note: no '8' in column name
`Gender`	int

Data Splits

AVEC2017 Official Splits

Split	Count	PHQ-8 Items	Purpose
Train	107	Available	Model training, few-shot retrieval
Dev	35	Available	Hyperparameter tuning
Test	47	Not available	Final evaluation

Paper Re-Split (Section 2.4.1)

The paper creates a custom 58/43/41 split from the 142 labeled participants:

Split	Count	Percentage	Purpose
Train	58	41%	Few-shot reference store
Dev	43	30%	Hyperparameter tuning
Test	41	29%	Final evaluation

Implementation: - scripts/create_paper_split.py generates data/paper_splits/paper_split_{train,val,test}.csv from the paper's ground truth IDs in Data Splits Overview (default), or can generate an algorithmic seeded split with --mode algorithmic. - scripts/generate_embeddings.py --split paper-train generates data/embeddings/{backend}_{model_slug}_paper_train.{npz,json,meta.json} by default (and an optional .tags.json sidecar if --write-item-tags is set), or use --output data/embeddings/paper_reference_embeddings.npz for the legacy filename. - scripts/reproduce_results.py --split paper evaluates on the 41-participant paper test set and computes item-level MAE excluding N/A, matching the paper’s metric definition.

Embeddings Format

File Structure

data/embeddings/
├── paper_reference_embeddings.npz   # NumPy compressed archive
├── paper_reference_embeddings.json  # Text sidecar (participant IDs, chunks)
├── paper_reference_embeddings.meta.json  # Optional: provenance metadata
├── paper_reference_embeddings.tags.json  # Optional: per-chunk PHQ-8 item tags (Spec 34)
├── paper_reference_embeddings.chunk_scores.json       # Optional: per-chunk PHQ-8 item scores (Spec 35)
├── paper_reference_embeddings.chunk_scores.meta.json  # Optional: scorer provenance + prompt hash (Spec 35)
├── reference_embeddings.npz         # Optional: AVEC train knowledge base
├── reference_embeddings.json
├── reference_embeddings.tags.json   # Optional: per-chunk PHQ-8 item tags (Spec 34)
├── reference_embeddings.chunk_scores.json       # Optional: per-chunk PHQ-8 item scores (Spec 35)
└── reference_embeddings.chunk_scores.meta.json  # Optional: scorer provenance + prompt hash (Spec 35)

NPZ Format

The NPZ stores one array per participant:

Key: emb_{participant_id} (example: emb_300)
Value: float32 array of shape (num_chunks, EMBEDDING_DIMENSION)

This matches ReferenceStore._load_embeddings() in src/ai_psychiatrist/services/reference_store.py.

JSON Sidecar

{
  "300": [
    "Ellie: ...\nParticipant: ...",
    "Ellie: ...\nParticipant: ...",
    "... (chunk text strings in the same order as the NPZ rows)"
  ],
  "301": ["..."],
  "...": ["..."]
}

The JSON maps participant ID (string) → list of chunk texts. The list order must match the corresponding NPZ array row order for that participant.

Configuration

From EmbeddingSettings in src/ai_psychiatrist/config.py:

Setting	Default	Paper Reference
`EMBEDDING_DIMENSION`	4096	Appendix D
`EMBEDDING_CHUNK_SIZE`	8	Appendix D
`EMBEDDING_CHUNK_STEP`	2	Section 2.4.2
`EMBEDDING_TOP_K_REFERENCES`	2	Appendix D

Raw Download Structure

Before Preparation

downloads/
├── participants/
│   ├── 300_P.zip           # ~475MB each
│   ├── 301_P.zip
│   └── .../
├── train_split_Depression_AVEC2017.csv
├── dev_split_Depression_AVEC2017.csv
├── test_split_Depression_AVEC2017.csv
├── full_test_split.csv
└── DAICWOZDepression_Documentation_AVEC2017.pdf

Zip Contents (per participant)

File	Size	Used by System
`{id}_TRANSCRIPT.csv`	~10KB	YES - Primary input
`{id}_AUDIO.wav`	~20MB	Future (multimodal)
`{id}_COVAREP.csv`	~37MB	Future
`{id}_FORMANT.csv`	~2MB	Future
`{id}_CLNF_AUs.txt`	~2MB	Future
`{id}_CLNF_features.txt`	~24MB	Future
`{id}_CLNF_features3D.txt`	~36MB	Future
`{id}_CLNF_gaze.txt`	~3MB	Future
`{id}_CLNF_hog.txt`	~350MB	Future
`{id}_CLNF_pose.txt`	~2MB	Future

Domain Model Mapping

Transcript → Entity

src/ai_psychiatrist/domain/entities.py:

@dataclass
class Transcript:
    participant_id: int      # From directory name ({id}_P)
    text: str                # Formatted dialogue
    created_at: datetime     # Load timestamp
    id: UUID                 # Instance UUID

Ground Truth → Entity

@dataclass
class PHQ8Assessment:
    items: Mapping[PHQ8Item, ItemAssessment]  # All 8 items
    mode: AssessmentMode                       # ZERO_SHOT or FEW_SHOT
    participant_id: int

PHQ8Item Enum

src/ai_psychiatrist/domain/enums.py:

class PHQ8Item(StrEnum):
    NO_INTEREST = "NoInterest"      # Item 1
    DEPRESSED = "Depressed"         # Item 2
    SLEEP = "Sleep"                 # Item 3
    TIRED = "Tired"                 # Item 4
    APPETITE = "Appetite"           # Item 5
    FAILURE = "Failure"             # Item 6
    CONCENTRATING = "Concentrating" # Item 7
    MOVING = "Moving"               # Item 8

Known Data Issues

Participant	Issue	Status	Reference
487	Corrupted transcript (AppleDouble file, not CSV)	Resolved ✓	Avoid AppleDouble files; re-download and re-extract cleanly

Note: Issue was caused by macOS AppleDouble extraction, not source data. Re-download and careful extraction fixed it.

Validation Checklist

When working with data, verify:

[ ] Participant ID exists (not all 300-492 are present)
[ ] Transcript file is tab-separated, not comma-separated
[ ] Transcript is valid UTF-8 (not AppleDouble metadata)
[ ] Speaker column contains "Ellie" or "Participant"
[ ] Ground truth CSV uses Participant_ID (uppercase P)
[ ] Test split uses participant_ID (lowercase p)
[ ] PHQ-8 scores are in range 0-3 (items) or 0-24 (total)
[ ] Embeddings dimension matches model (4096 for qwen3-embedding:8b)

DAIC-WOZ Dataset Schema

Overview

Key Statistics

Missing Participant IDs

Directory Structure

Expected Layout (after scripts/prepare_dataset.py)

Configuration Paths

Transcript Format

File Location

Schema

Format Details

Synthetic Example

How It's Loaded

Ground Truth Format

Train/Dev Split CSVs

Schema

PHQ-8 Item Score Meaning

Severity Levels

Synthetic Example (train CSV)

Column Mapping in Code

Test Split Format

AVEC2017 Test Split

Full Test Split (if available)

Data Splits

AVEC2017 Official Splits

Paper Re-Split (Section 2.4.1)

Embeddings Format

File Structure

NPZ Format

JSON Sidecar

Configuration

Raw Download Structure

Before Preparation

Zip Contents (per participant)

Domain Model Mapping

Transcript → Entity

Ground Truth → Entity

PHQ8Item Enum

Known Data Issues

Validation Checklist

See Also

Expected Layout (after `scripts/prepare_dataset.py`)