
DAIC-WOZ Dataset Schema

Purpose: Enable development without direct data access
Dataset: Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ)
Access: Requires EULA from USC ICT
Reference: AVEC 2017 Challenge


Overview

DAIC-WOZ is a clinical interview dataset for depression detection research. It contains semi-structured interviews conducted by an animated virtual interviewer named Ellie with participants who may or may not have depression.

Key Statistics

Metric Value
Total participants 189
Labeled participants 142 (train + dev)
Unlabeled participants 47 (test)
ID range 300-492 (with gaps)
Interview duration 5-25 minutes
Total size ~86 GB (with all modalities)

Missing Participant IDs

Not all IDs in range 300-492 exist. Known gaps include:

342, 394, 398, 460, ...

Always validate participant existence before processing.
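Because the ID range has gaps, code should check the filesystem rather than iterate 300-492 blindly. A minimal sketch (the helper name `existing_participant_ids` is hypothetical, not part of the codebase):

```python
from pathlib import Path


def existing_participant_ids(transcripts_dir: str = "data/transcripts") -> set[int]:
    """Collect IDs that actually have a transcript on disk (hypothetical helper)."""
    ids = set()
    for d in Path(transcripts_dir).glob("*_P"):
        stem = d.name.removesuffix("_P")
        # Only count directories that contain the expected transcript file
        if stem.isdigit() and (d / f"{stem}_TRANSCRIPT.csv").exists():
            ids.add(int(stem))
    return ids
```

Iterating `range(300, 493)` and skipping IDs absent from this set avoids FileNotFoundError on the known gaps.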


Directory Structure

Expected Layout (after scripts/prepare_dataset.py)

data/
├── transcripts/                         # Extracted transcripts
│   ├── 300_P/
│   │   └── 300_TRANSCRIPT.csv
│   ├── 301_P/
│   │   └── 301_TRANSCRIPT.csv
│   └── .../
├── transcripts_participant_only/        # Deterministic participant-only variant (recommended)
│   ├── 300_P/300_TRANSCRIPT.csv
│   └── ...
├── transcripts_both_speakers_clean/     # Cleaned but keeps Ellie + Participant
├── transcripts_participant_qa/          # Participant + minimal question context
├── embeddings/                          # Pre-computed (Spec 08)
│   ├── huggingface_qwen3_8b_paper_train_participant_only.npz       # Participant-only paper-train knowledge base (TRAIN=58)
│   ├── huggingface_qwen3_8b_paper_train_participant_only.json
│   ├── huggingface_qwen3_8b_paper_train_participant_only.meta.json  # Provenance metadata
│   ├── paper_reference_embeddings.npz             # Optional legacy/compat filename (paper-train)
│   ├── paper_reference_embeddings.json
│   ├── paper_reference_embeddings.meta.json       # Optional provenance metadata
│   ├── reference_embeddings.npz         # Optional: AVEC train knowledge base
│   └── reference_embeddings.json
├── paper_splits/                        # Optional: paper 58/43/41 split (ground truth for reproduction)
│   ├── paper_split_train.csv
│   ├── paper_split_val.csv
│   ├── paper_split_test.csv
│   └── paper_split_metadata.json
├── train_split_Depression_AVEC2017.csv  # Ground truth (train)
├── dev_split_Depression_AVEC2017.csv    # Ground truth (dev)
├── test_split_Depression_AVEC2017.csv   # Identifiers only
└── full_test_split.csv                  # Test totals (if available)

Configuration Paths

Defined in src/ai_psychiatrist/config.py:

class DataSettings(BaseSettings):
    base_dir: Path = Path("data")
    transcripts_dir: Path = Path("data/transcripts")
    embeddings_path: Path = Path("data/embeddings/huggingface_qwen3_8b_paper_train.npz")
    train_csv: Path = Path("data/train_split_Depression_AVEC2017.csv")
    dev_csv: Path = Path("data/dev_split_Depression_AVEC2017.csv")

Note: .env.example and DATA-PIPELINE-SPEC.md use participant-only artifacts (e.g., huggingface_qwen3_8b_paper_train_participant_only.*) via env overrides.

To use a preprocessed transcript variant, set:

DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only

See: DAIC-WOZ Preprocessing.


Transcript Format

File Location

data/transcripts/{id}_P/{id}_TRANSCRIPT.csv

Example: data/transcripts/300_P/300_TRANSCRIPT.csv

Schema

Column Type Description Example
start_time float Utterance start (seconds) 36.588
stop_time float Utterance end (seconds) 39.668
speaker string Speaker identifier "Ellie" or "Participant"
value string Transcript text "hi i'm ellie thanks for coming in today"

Format Details

  • Separator: Tab (\t)
  • Encoding: UTF-8
  • Text style: Lowercase, minimal punctuation
  • Headers: First row is header
  • Typical size: ~100-300 rows per transcript

Synthetic Example

start_time  stop_time   speaker value
0.000   2.500   Ellie   hi i'm ellie thanks for coming in today
3.100   4.200   Participant hello
5.000   8.500   Ellie   how are you doing today
9.200   12.800  Participant i'm doing okay i guess
13.500  18.000  Ellie   tell me about the last time you felt really happy
19.200  28.500  Participant um i don't know it's been a while i guess maybe when i saw my family last month
30.000  35.500  Ellie   that sounds nice can you tell me more about that visit

How It's Loaded

TranscriptService._parse_daic_woz_transcript() in src/ai_psychiatrist/services/transcript.py:

import pandas as pd  # (excerpt from the method body; path is the transcript file)

df = pd.read_csv(path, sep="\t")                     # tab-separated, not comma
df = df.dropna(subset=["speaker", "value"])          # drop rows missing speaker or text
df["dialogue"] = df["speaker"] + ": " + df["value"]  # "Speaker: utterance"
return "\n".join(df["dialogue"].tolist())

Output format (what agents see):

Ellie: hi i'm ellie thanks for coming in today
Participant: hello
Ellie: how are you doing today
Participant: i'm doing okay i guess
...


Ground Truth Format

Train/Dev Split CSVs

Files:

  • train_split_Depression_AVEC2017.csv (107 participants)
  • dev_split_Depression_AVEC2017.csv (35 participants)

Schema

Column Type Range Description
Participant_ID int 300-492 Unique identifier
PHQ8_Binary int 0-1 MDD indicator (1 if score >= 10)
PHQ8_Score int 0-24 Total PHQ-8 score
Gender int 0-1 0 = male, 1 = female
PHQ8_NoInterest int 0-3 Item 1: Little interest or pleasure
PHQ8_Depressed int 0-3 Item 2: Feeling down, depressed
PHQ8_Sleep int 0-3 Item 3: Sleep problems
PHQ8_Tired int 0-3 Item 4: Feeling tired
PHQ8_Appetite int 0-3 Item 5: Appetite changes
PHQ8_Failure int 0-3 Item 6: Feeling bad about self
PHQ8_Concentrating int 0-3 Item 7: Trouble concentrating
PHQ8_Moving int 0-3 Item 8: Moving/speaking slowly or fidgety

PHQ-8 Item Score Meaning

Each item scored 0-3 based on frequency over past 2 weeks:

Score Meaning
0 Not at all
1 Several days
2 More than half the days
3 Nearly every day

Severity Levels

Derived from total PHQ-8 score (0-24):

Score Range Severity Level MDD Classification
0-4 None/Minimal No MDD
5-9 Mild No MDD
10-14 Moderate MDD
15-19 Moderately Severe MDD
20-24 Severe MDD
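The severity bands and the MDD cutoff above translate directly into code. A small sketch (function names `phq8_severity` / `phq8_binary` are illustrative, not from the codebase):

```python
def phq8_severity(total: int) -> str:
    """Map a PHQ-8 total (0-24) to the severity band in the table above."""
    if not 0 <= total <= 24:
        raise ValueError(f"PHQ-8 total out of range: {total}")
    if total <= 4:
        return "None/Minimal"
    if total <= 9:
        return "Mild"
    if total <= 14:
        return "Moderate"
    if total <= 19:
        return "Moderately Severe"
    return "Severe"


def phq8_binary(total: int) -> int:
    """MDD indicator as defined in the ground-truth CSVs: 1 if total >= 10."""
    return int(total >= 10)
```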

Synthetic Example (train CSV)

Participant_ID,PHQ8_Binary,PHQ8_Score,Gender,PHQ8_NoInterest,PHQ8_Depressed,PHQ8_Sleep,PHQ8_Tired,PHQ8_Appetite,PHQ8_Failure,PHQ8_Concentrating,PHQ8_Moving
300,0,3,1,0,0,1,1,0,0,1,0
301,0,7,0,1,1,1,1,1,0,1,1
302,1,15,1,2,2,2,2,1,2,2,2
303,0,0,0,0,0,0,0,0,0,0,0
304,1,20,0,2,3,3,3,3,3,3,0
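Ground-truth rows are internally consistent: each item is 0-3, `PHQ8_Score` equals the sum of the eight items, and `PHQ8_Binary` flags totals of 10 or more. A stdlib-only consistency check over two rows of the synthetic example (the helper `check_ground_truth_row` is illustrative):

```python
import csv
import io

# Two rows copied from the synthetic example above
CSV = """Participant_ID,PHQ8_Binary,PHQ8_Score,Gender,PHQ8_NoInterest,PHQ8_Depressed,PHQ8_Sleep,PHQ8_Tired,PHQ8_Appetite,PHQ8_Failure,PHQ8_Concentrating,PHQ8_Moving
300,0,3,1,0,0,1,1,0,0,1,0
302,1,15,1,2,2,2,2,1,2,2,2
"""

ITEM_COLUMNS = [
    "PHQ8_NoInterest", "PHQ8_Depressed", "PHQ8_Sleep", "PHQ8_Tired",
    "PHQ8_Appetite", "PHQ8_Failure", "PHQ8_Concentrating", "PHQ8_Moving",
]


def check_ground_truth_row(row: dict) -> None:
    """Consistency checks: items in 0-3, total = item sum, binary = (total >= 10)."""
    items = [int(row[c]) for c in ITEM_COLUMNS]
    assert all(0 <= v <= 3 for v in items), "item score out of range"
    assert sum(items) == int(row["PHQ8_Score"]), "total != item sum"
    assert int(row["PHQ8_Binary"]) == (sum(items) >= 10), "binary flag mismatch"


for row in csv.DictReader(io.StringIO(CSV)):
    check_ground_truth_row(row)
```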

Column Mapping in Code

GroundTruthService.COLUMN_MAPPING in src/ai_psychiatrist/services/ground_truth.py:

COLUMN_MAPPING = {
    "PHQ8_NoInterest": PHQ8Item.NO_INTEREST,
    "PHQ8_Depressed": PHQ8Item.DEPRESSED,
    "PHQ8_Sleep": PHQ8Item.SLEEP,
    "PHQ8_Tired": PHQ8Item.TIRED,
    "PHQ8_Appetite": PHQ8Item.APPETITE,
    "PHQ8_Failure": PHQ8Item.FAILURE,
    "PHQ8_Concentrating": PHQ8Item.CONCENTRATING,
    "PHQ8_Moving": PHQ8Item.MOVING,
}

Test Split Format

AVEC2017 Test Split

File: test_split_Depression_AVEC2017.csv

Note: Does NOT include PHQ-8 scores (evaluation set).

Column Type Description
participant_ID int Note: lowercase 'p' (differs from the train/dev CSVs)
Gender int 0 = male, 1 = female

Full Test Split (if available)

File: full_test_split.csv

Some distributions include total scores but NOT item-wise scores:

Column Type Description
Participant_ID int Note: uppercase 'P'
PHQ_Binary int Note: no '8' in column name
PHQ_Score int Note: no '8' in column name
Gender int
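Because the ID column is spelled `Participant_ID` in train/dev but `participant_ID` in the test split, loaders should match the column name case-insensitively. A sketch (the helper `read_participant_ids` is hypothetical):

```python
import csv
import io


def read_participant_ids(csv_text: str) -> list[int]:
    """Read participant IDs regardless of 'Participant_ID' vs 'participant_ID'
    header spelling (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    id_col = next(
        (c for c in reader.fieldnames or [] if c.lower() == "participant_id"),
        None,
    )
    if id_col is None:
        raise ValueError("no participant ID column found")
    return [int(row[id_col]) for row in reader]


# Works for both header variants:
train_like = "Participant_ID,Gender\n300,1\n301,0\n"
test_like = "participant_ID,Gender\n400,0\n"
```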

Data Splits

AVEC2017 Official Splits

Split Count PHQ-8 Items Purpose
Train 107 Available Model training, few-shot retrieval
Dev 35 Available Hyperparameter tuning
Test 47 Not available Final evaluation

Paper Re-Split (Section 2.4.1)

The paper creates a custom 58/43/41 split from the 142 labeled participants:

Split Count Percentage Purpose
Train 58 41% Few-shot reference store
Dev 43 30% Hyperparameter tuning
Test 41 29% Final evaluation

Implementation:

  • scripts/create_paper_split.py generates data/paper_splits/paper_split_{train,val,test}.csv from the paper's ground-truth IDs in Data Splits Overview (default), or an algorithmic seeded split with --mode algorithmic.
  • scripts/generate_embeddings.py --split paper-train generates data/embeddings/{backend}_{model_slug}_paper_train.{npz,json,meta.json} by default (plus an optional .tags.json sidecar if --write-item-tags is set); use --output data/embeddings/paper_reference_embeddings.npz for the legacy filename.
  • scripts/reproduce_results.py --split paper evaluates on the 41-participant paper test set and computes item-level MAE excluding N/A, matching the paper's metric definition.


Embeddings Format

File Structure

data/embeddings/
├── paper_reference_embeddings.npz   # NumPy compressed archive
├── paper_reference_embeddings.json  # Text sidecar (participant IDs, chunks)
├── paper_reference_embeddings.meta.json  # Optional: provenance metadata
├── paper_reference_embeddings.tags.json  # Optional: per-chunk PHQ-8 item tags (Spec 34)
├── paper_reference_embeddings.chunk_scores.json       # Optional: per-chunk PHQ-8 item scores (Spec 35)
├── paper_reference_embeddings.chunk_scores.meta.json  # Optional: scorer provenance + prompt hash (Spec 35)
├── reference_embeddings.npz         # Optional: AVEC train knowledge base
├── reference_embeddings.json
├── reference_embeddings.tags.json   # Optional: per-chunk PHQ-8 item tags (Spec 34)
├── reference_embeddings.chunk_scores.json       # Optional: per-chunk PHQ-8 item scores (Spec 35)
└── reference_embeddings.chunk_scores.meta.json  # Optional: scorer provenance + prompt hash (Spec 35)

NPZ Format

The NPZ stores one array per participant:

  • Key: emb_{participant_id} (example: emb_300)
  • Value: float32 array of shape (num_chunks, EMBEDDING_DIMENSION)

This matches ReferenceStore._load_embeddings() in src/ai_psychiatrist/services/reference_store.py.
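The key convention above can be reproduced with plain NumPy. A sketch writing and reading a demo archive (the temp-file path and the 5/3-chunk counts are illustrative only):

```python
import tempfile
from pathlib import Path

import numpy as np

EMBEDDING_DIMENSION = 4096  # EmbeddingSettings default

# One float32 array per participant, keyed emb_{participant_id}
rng = np.random.default_rng(0)
arrays = {
    "emb_300": rng.standard_normal((5, EMBEDDING_DIMENSION)).astype(np.float32),
    "emb_301": rng.standard_normal((3, EMBEDDING_DIMENSION)).astype(np.float32),
}

npz_path = Path(tempfile.mkdtemp()) / "demo_embeddings.npz"
np.savez_compressed(npz_path, **arrays)

# Reading back: key -> (num_chunks, EMBEDDING_DIMENSION) matrix per participant
with np.load(npz_path) as npz:
    loaded = {int(k.removeprefix("emb_")): npz[k] for k in npz.files}
```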

JSON Sidecar

{
  "300": [
    "Ellie: ...\nParticipant: ...",
    "Ellie: ...\nParticipant: ...",
    "... (chunk text strings in the same order as the NPZ rows)"
  ],
  "301": ["..."],
  "...": ["..."]
}

The JSON maps participant ID (string) → list of chunk texts. The list order must match the corresponding NPZ array row order for that participant.
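Since retrieval pairs NPZ rows with JSON chunk texts by index, a misaligned sidecar silently corrupts results. A validation sketch (the function name `check_sidecar_alignment` is hypothetical):

```python
import json

import numpy as np


def check_sidecar_alignment(npz_path: str, json_path: str) -> None:
    """Verify each participant's JSON chunk list matches the NPZ row count (sketch)."""
    with open(json_path, encoding="utf-8") as f:
        chunks_by_id = json.load(f)
    with np.load(npz_path) as npz:
        for key in npz.files:
            pid = key.removeprefix("emb_")
            texts = chunks_by_id.get(pid)
            assert texts is not None, f"participant {pid} missing from JSON"
            assert len(texts) == npz[key].shape[0], (
                f"participant {pid}: {len(texts)} chunk texts vs "
                f"{npz[key].shape[0]} embedding rows"
            )
```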

Configuration

From EmbeddingSettings in src/ai_psychiatrist/config.py:

Setting Default Paper Reference
EMBEDDING_DIMENSION 4096 Appendix D
EMBEDDING_CHUNK_SIZE 8 Appendix D
EMBEDDING_CHUNK_STEP 2 Section 2.4.2
EMBEDDING_TOP_K_REFERENCES 2 Appendix D
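The chunking parameters describe a sliding window over transcript lines: windows of EMBEDDING_CHUNK_SIZE lines advancing by EMBEDDING_CHUNK_STEP. A sketch of that windowing (the project's actual chunker may handle boundaries differently):

```python
def chunk_transcript(lines: list[str], size: int = 8, step: int = 2) -> list[str]:
    """Sliding-window chunking using the defaults above
    (EMBEDDING_CHUNK_SIZE=8, EMBEDDING_CHUNK_STEP=2)."""
    chunks = []
    # If the transcript is shorter than one window, emit a single chunk
    for start in range(0, max(len(lines) - size + 1, 1), step):
        chunks.append("\n".join(lines[start:start + size]))
    return chunks
```

Each chunk is later embedded as one row of the participant's NPZ matrix, so a 12-line transcript with these defaults yields 3 overlapping chunks.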

Raw Download Structure

Before Preparation

downloads/
├── participants/
│   ├── 300_P.zip           # ~475MB each
│   ├── 301_P.zip
│   └── .../
├── train_split_Depression_AVEC2017.csv
├── dev_split_Depression_AVEC2017.csv
├── test_split_Depression_AVEC2017.csv
├── full_test_split.csv
└── DAICWOZDepression_Documentation_AVEC2017.pdf

Zip Contents (per participant)

File Size Used by System
{id}_TRANSCRIPT.csv ~10KB YES - Primary input
{id}_AUDIO.wav ~20MB Future (multimodal)
{id}_COVAREP.csv ~37MB Future
{id}_FORMANT.csv ~2MB Future
{id}_CLNF_AUs.txt ~2MB Future
{id}_CLNF_features.txt ~24MB Future
{id}_CLNF_features3D.txt ~36MB Future
{id}_CLNF_gaze.txt ~3MB Future
{id}_CLNF_hog.txt ~350MB Future
{id}_CLNF_pose.txt ~2MB Future

Domain Model Mapping

Transcript → Entity

src/ai_psychiatrist/domain/entities.py:

@dataclass
class Transcript:
    participant_id: int      # From directory name ({id}_P)
    text: str                # Formatted dialogue
    created_at: datetime     # Load timestamp
    id: UUID                 # Instance UUID

Ground Truth → Entity

@dataclass
class PHQ8Assessment:
    items: Mapping[PHQ8Item, ItemAssessment]  # All 8 items
    mode: AssessmentMode                       # ZERO_SHOT or FEW_SHOT
    participant_id: int

PHQ8Item Enum

src/ai_psychiatrist/domain/enums.py:

class PHQ8Item(StrEnum):
    NO_INTEREST = "NoInterest"      # Item 1
    DEPRESSED = "Depressed"         # Item 2
    SLEEP = "Sleep"                 # Item 3
    TIRED = "Tired"                 # Item 4
    APPETITE = "Appetite"           # Item 5
    FAILURE = "Failure"             # Item 6
    CONCENTRATING = "Concentrating" # Item 7
    MOVING = "Moving"               # Item 8

Known Data Issues

Participant Issue Status Reference
487 Corrupted transcript (AppleDouble file, not CSV) Resolved ✓ Avoid AppleDouble files; re-download and re-extract cleanly

Note: Issue was caused by macOS AppleDouble extraction, not source data. Re-download and careful extraction fixed it.


Validation Checklist

When working with data, verify:

  • [ ] Participant ID exists (not all 300-492 are present)
  • [ ] Transcript file is tab-separated, not comma-separated
  • [ ] Transcript is valid UTF-8 (not AppleDouble metadata)
  • [ ] Speaker column contains "Ellie" or "Participant"
  • [ ] Ground truth CSV uses Participant_ID (uppercase P)
  • [ ] Test split uses participant_ID (lowercase p)
  • [ ] PHQ-8 scores are in range 0-3 (items) or 0-24 (total)
  • [ ] Embeddings dimension matches model (4096 for qwen3-embedding:8b)
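Several checklist items for a single transcript can be automated with the stdlib. A sketch (the function name `validate_transcript` is illustrative; it covers the encoding, separator, header, and speaker checks only):

```python
import csv
from pathlib import Path


def validate_transcript(path: Path) -> list[str]:
    """Check one transcript file against the checklist above; return problems found."""
    try:
        text = path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return ["not valid UTF-8 (possibly AppleDouble metadata)"]
    reader = csv.DictReader(text.splitlines(), delimiter="\t")
    if reader.fieldnames != ["start_time", "stop_time", "speaker", "value"]:
        # A comma-separated file shows up here as one fused column name
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for row in reader:
        if row["speaker"] not in ("Ellie", "Participant"):
            problems.append(f"unexpected speaker: {row['speaker']!r}")
    return problems
```

An empty return list means the file passed these checks; a non-empty list names each violation.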

See Also