DAIC-WOZ Transcript Preprocessing (Bias-Aware, Deterministic)

Spec: docs/_specs/daic-woz-transcript-preprocessing.md

Purpose: Produce clean, reproducible DAIC-WOZ transcript variants (especially participant-only) without overwriting raw data, so that:

  • Few-shot retrieval embeddings are not biased by interviewer prompt patterns
  • Known dataset issues (interruptions, missing Ellie transcripts) are handled deterministically
  • Inputs/outputs never collide (raw vs processed vs derived artifacts)

This document is written to be implementation-ready: it specifies file layouts, edge cases, and exact rules.


Why This Exists

1) Interviewer prompt leakage (retrieval bias)

Recent analysis shows models can exploit Ellie’s follow-up prompts as a shortcut signal for depression classification, inflating performance in ways that may not generalize.

This repo’s few-shot pipeline can inherit this bias before the LLM sees the prompt:

  • Evidence extraction → query embedding → similarity search
  • If interviewer prompts are embedded, retrieval can return interviewer-driven “shortcuts”

See the internal analysis in docs/_brainstorming/daic-woz-preprocessing.md for background.

2) DAIC-WOZ has known “mechanical” issues

The Bailey/Plumbley preprocessing tool (mirrored in _reference/daic_woz_process/) documents known transcript problems:

  • Session interruptions (e.g., 373, 444)
  • Missing Ellie transcripts (451, 458, 480)
  • Audio timing misalignment (318, 321, 341, 362) — only relevant if using audio
  • A known PHQ8_Binary label bug (409)
  • Certain participant IDs are absent from DAIC-WOZ (e.g., 342, 394, 398, 460)

3) AVEC2017 split CSVs can contain deterministic integrity bugs

This repo includes deterministic integrity checks/repairs for common AVEC2017 CSV issues (see scripts/patch_missing_phq8_values.py), e.g.:

  • A single missing PHQ-8 item cell reconstructable from PHQ8_Score (observed in some upstream copies)
  • PHQ8_Binary inconsistency (e.g., participant 409 had PHQ8_Score=10 but PHQ8_Binary=0 upstream)


Inputs (Raw, Untouched)

Canonical raw layout (see DAIC-WOZ Schema):

data/
  transcripts/
    300_P/300_TRANSCRIPT.csv
    ...

Raw transcript files are tab-separated with columns:

start_time    stop_time    speaker    value

Raw inputs must never be overwritten. Preprocessing always writes to a new directory.


Outputs (Processed Variants)

Preprocessing produces a new transcripts root that still matches the expected folder/file conventions:

data/
  transcripts_participant_only/
    300_P/300_TRANSCRIPT.csv
    ...
  transcripts_both_speakers_clean/
    300_P/300_TRANSCRIPT.csv
    ...
  transcripts_participant_qa/
    300_P/300_TRANSCRIPT.csv
    ...

Each variant is selectable via configuration:

  • DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only

No code changes are required: TranscriptService already accepts a configurable transcripts_dir.


Variant Definitions (What “Participant-Only” Means)

All variants apply the same deterministic cleaning rules first (next section). Then:

Variant A: both_speakers_clean

  • Keep both speakers after cleaning.
  • Use for legacy baseline comparisons where you want to minimize non-clinical noise without changing speaker content.

Variant B: participant_only

  • Drop all Ellie rows after cleaning.
  • Use for embedding generation + retrieval to reduce interviewer-protocol leakage.

Variant C: participant_qa

  • Keep participant rows, plus only the immediately preceding Ellie prompt (one Ellie row) for each block of participant responses.
  • Intended as a compromise to preserve minimal question context while avoiding “Ellie-only region” leakage.

Rule (deterministic):

  • When a participant row is kept, include the most recent prior Ellie row once (do not repeat it before every consecutive participant line).
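The Variant C rule reduces to a single pass over the rows. The sketch below is illustrative only: it assumes rows are dicts with speaker/value keys, and filter_participant_qa is a hypothetical name, not the script's actual API.

```python
def filter_participant_qa(rows):
    """Keep participant rows plus the single most recent prior Ellie
    prompt before each block of participant responses (Variant C)."""
    out = []
    last_ellie = None
    for row in rows:
        if row["speaker"] == "Ellie":
            last_ellie = row  # remember the prompt, but don't emit it yet
        else:  # participant row
            if last_ellie is not None:
                out.append(last_ellie)  # emit the prompt once per block
                last_ellie = None
            out.append(row)
    return out

rows = [
    {"speaker": "Ellie", "value": "how are you"},
    {"speaker": "Participant", "value": "fine"},
    {"speaker": "Participant", "value": "mostly"},
    {"speaker": "Ellie", "value": "tell me more"},  # no reply follows
]
qa = filter_participant_qa(rows)
```

Note that a trailing Ellie prompt with no participant response after it is dropped, which avoids "Ellie-only region" leakage.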


Deterministic Cleaning Rules (Applied to All Variants)

These rules are designed to be loss-minimizing for LLM use while removing clearly non-interview artifacts.

1) Parse + basic validation

For each {pid}_TRANSCRIPT.csv:

  • Must contain start_time, stop_time, speaker, value
  • Drop rows where speaker or value is missing/NaN

Fail-fast if required columns are missing.
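A minimal parsing sketch under these rules, using the stdlib csv module. load_transcript is a hypothetical helper; the real script may differ in how it represents missing values.

```python
import csv
import io

REQUIRED = ("start_time", "stop_time", "speaker", "value")

def load_transcript(fileobj):
    """Parse a tab-separated transcript; fail fast on missing columns
    and drop rows with an empty/missing speaker or value."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    missing = set(REQUIRED) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return [
        row for row in reader
        if (row["speaker"] or "").strip() and (row["value"] or "").strip()
    ]

# In-memory example; a real call would open a {pid}_TRANSCRIPT.csv file.
raw = (
    "start_time\tstop_time\tspeaker\tvalue\n"
    "0.0\t1.2\tEllie\thi i'm ellie\n"
    "1.5\t2.0\t\t\n"  # row with missing speaker/value is dropped
)
rows = load_transcript(io.StringIO(raw))
```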

2) Remove “pre-interview chatter”

Goal: drop the preamble interaction that occurs before the interview begins.

Rule:

  • If the file contains any speaker == "Ellie" row, find the first such row and drop all rows before it.

Missing Ellie sessions:

  • Sessions 451/458/480 are known to contain only participant rows.
  • For “no Ellie present” files: drop leading sync markers (see next rule) and keep the remaining rows.
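The rule amounts to trimming everything before the first Ellie row, with a pass-through for Ellie-less sessions. A sketch (drop_preamble is a hypothetical name; sync-marker removal is handled separately):

```python
def drop_preamble(rows):
    """Drop all rows before the first Ellie row, when Ellie is present.
    Sessions with no Ellie row (451/458/480) are returned unchanged."""
    for i, row in enumerate(rows):
        if row["speaker"] == "Ellie":
            return rows[i:]
    return rows

rows = [
    {"speaker": "Participant", "value": "(pre-interview chatter)"},
    {"speaker": "Ellie", "value": "hi i'm ellie"},
    {"speaker": "Participant", "value": "hi"},
]
trimmed = drop_preamble(rows)
```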

3) Remove sync markers (where present)

Drop rows whose value is a sync marker, e.g.:

<sync>, <synch>, [sync], [synching], ...

Implementation rule:

  • Normalize with strip().lower() and tolerate trailing punctuation (e.g., <sync.).
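One possible implementation of this normalization: a regex accepting the <...> and [...] forms of sync/synch/synching plus punctuation-damaged variants. The exact pattern is an assumption for illustration, not the script's verbatim rule.

```python
import re

# Accepts <sync>, <synch>, [sync], [synching], and punctuation-damaged
# forms such as "<sync." after strip().lower() normalization.
SYNC_RE = re.compile(r"[<\[]\s*synch?(?:ing)?\s*[>\]\.]*")

def is_sync_marker(value):
    """True if the row's value is a sync marker to be dropped."""
    return SYNC_RE.fullmatch(value.strip().lower()) is not None
```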

4) Remove known interruption windows (text-safe)

Drop rows whose time range overlaps the interruption window:

  • 373: [395, 428] seconds
  • 444: [286, 387] seconds

Overlap definition:

row_start < window_end AND row_end > window_start

Rationale: these spans contain non-interview events (“person enters room”, alarms, etc.) and are explicitly treated as noise by the preprocessing reference tool.
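The overlap definition above translates directly to code; a sketch with the two documented windows (the constant and function names are illustrative):

```python
# Documented interruption windows, in seconds (session id -> (start, end)).
INTERRUPTION_WINDOWS = {"373": (395.0, 428.0), "444": (286.0, 387.0)}

def overlaps(row_start, row_end, window):
    """Strict overlap per the definition above:
    row_start < window_end AND row_end > window_start."""
    window_start, window_end = window
    return row_start < window_end and row_end > window_start
```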

5) Preserve nonverbal annotations (default)

By default, do not delete nonverbal tags like <laughter> / <sigh> because they can carry affective signal for LLM reasoning.

If you want a “classical ML” style cleanup, make it an explicit variant/flag and ablate it.

Note on reference parity: the Bailey/Plumbley tool removes placeholder/unknown tokens (e.g., xxx, xxxx) and strips tokens containing < > [ ] when building Word2Vec features. This repo’s preprocessing keeps nonverbal tags by default for LLM use; treat additional stripping as an explicit ablation.


Ground Truth Integrity (PHQ-8 CSVs)

The AVEC2017-derived ground-truth CSVs occasionally contain integrity issues. These are deterministic fixes, not statistical imputation. The reproduction runner requires complete per-item ground truth and will fail fast if issues remain.

A) Missing PHQ-8 Item Cells

The dataset includes:

  • PHQ8_Score (total score; authoritative)
  • 8 item columns PHQ8_* (0–3 each)

For valid rows, the invariant must hold:

PHQ8_Score == sum(PHQ8 item columns)

If exactly one item cell is missing and the total is present, the missing cell is uniquely determined:

missing_item = PHQ8_Score - sum(known_items)

This is not statistical imputation. It is deterministic reconstruction of a single missing cell required for the invariant to hold.
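The reconstruction plus its fail-fast guards can be sketched as follows (reconstruct_missing_item is a hypothetical name; items use None for the missing cell):

```python
def reconstruct_missing_item(total, items):
    """Deterministically reconstruct exactly one missing PHQ-8 item
    (represented as None) from the authoritative PHQ8_Score total.
    Fails fast whenever the fix is not uniquely determined."""
    missing = [i for i, v in enumerate(items) if v is None]
    if len(missing) != 1:
        raise ValueError(f"expected exactly one missing item, got {len(missing)}")
    value = total - sum(v for v in items if v is not None)
    if not 0 <= value <= 3:
        raise ValueError(f"reconstructed value {value} outside 0..3")
    fixed = list(items)
    fixed[missing[0]] = value
    return fixed
```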

How to patch:

1) Preview what would change:

uv run python scripts/patch_missing_phq8_values.py --dry-run

2) Apply the patch:

uv run python scripts/patch_missing_phq8_values.py --apply

3) Regenerate paper splits (so paper CSVs reflect corrected values):

uv run python scripts/create_paper_split.py --verify

4) Re-run a quick validation:

uv run python scripts/reproduce_results.py --split paper --zero-shot-only --limit 3

Failure semantics:

If a ground-truth CSV has:

  • more than one missing PHQ-8 item in a row, or
  • an invariant violation (sum != total), or
  • a reconstructed value outside 0..3

the patch script will fail fast, because it cannot be corrected deterministically.

B) PHQ8_Binary Consistency

This repo treats:

PHQ8_Binary = 1 iff PHQ8_Score >= 10

Known upstream issue:

  • Participant 409 had PHQ8_Score=10 but PHQ8_Binary=0 (now corrected; see data/DATA_PROVENANCE.md).
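A sketch of the consistency check implied by this rule (find_binary_inconsistencies and the row dicts are illustrative, not the repo's actual CSV handling):

```python
def find_binary_inconsistencies(rows):
    """Return participant IDs whose PHQ8_Binary disagrees with the rule
    PHQ8_Binary = 1 iff PHQ8_Score >= 10."""
    return [
        row["Participant_ID"]
        for row in rows
        if row["PHQ8_Binary"] != (1 if row["PHQ8_Score"] >= 10 else 0)
    ]

# Illustrative rows; participant 409 reproduces the known upstream bug.
rows = [
    {"Participant_ID": 409, "PHQ8_Score": 10, "PHQ8_Binary": 0},
    {"Participant_ID": 300, "PHQ8_Score": 2, "PHQ8_Binary": 0},
]
```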


Recommended Workflow (Avoiding Variant Mixing)

To avoid mixing artifacts from different transcript variants:

1) Keep raw transcripts in data/transcripts/
2) Generate a processed variant in data/transcripts_<variant>/
3) Point config to it:
   • DATA_TRANSCRIPTS_DIR=data/transcripts_<variant>
4) Generate embeddings with an explicit, variant-stamped name:
   • uv run python scripts/generate_embeddings.py --split paper-train --output data/embeddings/<backend>_<model>_paper_train_<variant>.npz
5) Set:
   • EMBEDDING_EMBEDDINGS_FILE=<backend>_<model>_paper_train_<variant>

Also ensure any .tags.json / .chunk_scores.json sidecars are generated from the same embeddings base name.

See:

  • Artifact Namespace Registry
  • RAG Artifact Generation


Preprocessing CLI (Implemented)

Script: scripts/preprocess_daic_woz_transcripts.py

Examples:

# 1) Bias-aware variant for retrieval (recommended default)
uv run python scripts/preprocess_daic_woz_transcripts.py \
  --variant participant_only \
  --output-dir data/transcripts_participant_only \
  --overwrite

# 2) Keep both speakers, but remove mechanical noise
uv run python scripts/preprocess_daic_woz_transcripts.py \
  --variant both_speakers_clean \
  --output-dir data/transcripts_both_speakers_clean \
  --overwrite

# 3) Minimal Q/A context
uv run python scripts/preprocess_daic_woz_transcripts.py \
  --variant participant_qa \
  --output-dir data/transcripts_participant_qa \
  --overwrite

The script:

  • Refuses to run if --output-dir equals --input-dir
  • Writes a machine-readable manifest at preprocess_manifest.json (counts only; no transcript text)


Validation Checklist (What “Done” Means)

Dataset integrity

  • uv run python scripts/patch_missing_phq8_values.py --dry-run reports no missing item cells
  • PHQ8_Binary matches PHQ8_Score >= 10 for train+dev

Transcript preprocessing

  • Output directory contains *_P/ folders and *_TRANSCRIPT.csv files
  • No output transcript is empty after preprocessing
  • Sessions 451/458/480 are handled without failure (no Ellie speaker present)
  • Sessions 373/444 have rows removed in the specified time windows
  • Input audit (optional): confirm expected speakers + known anomalies exist in data/transcripts/:
      • 189 transcript files
      • Only speakers: Ellie, Participant
      • Missing Ellie sessions: 451, 458, 480
      • Interruption-window overlaps: 373 (5 rows), 444 (37 rows)

Downstream consistency

  • Regenerate embeddings using the processed transcripts dir
  • Run mkdocs build --strict and ensure no new warnings are introduced by doc changes