Spec: DAIC-WOZ Transcript Preprocessing (Bias-Aware, Deterministic Variants)

Status: Implemented
Primary implementation: scripts/preprocess_daic_woz_transcripts.py
Integration points: src/ai_psychiatrist/config.py (DATA_TRANSCRIPTS_DIR), src/ai_psychiatrist/services/transcript.py
Verification: uv run pytest tests/ --tb=short (2026-01-02)

0. Problem Statement

DAIC-WOZ transcripts contain:

1) Interviewer prompt leakage: Ellie’s prompts can leak protocol patterns into embedding-based retrieval, biasing few-shot selection before the LLM is prompted.
2) Known "mechanical" transcript issues: e.g., interruption windows and missing Ellie transcripts (sessions 451, 458, 480).
3) Potential integrity issues in split CSVs (depending on upstream copy): missing PHQ-8 item cells and known label inconsistencies (e.g., PHQ8_Binary mismatch).

We need a deterministic, reproducible preprocessing workflow that creates collision-free transcript variants without modifying raw data.

1. Goals / Non-Goals

1.1 Goals

  • Produce bias-aware transcript variants (notably participant-only) for embeddings/retrieval.
  • Apply deterministic cleanup for known transcript mechanical issues (sync markers, interruptions).
  • Guarantee raw vs processed inputs never collide (no in-place overwrites).
  • Maintain the directory + filename convention expected by the codebase.
  • Provide a machine-readable manifest (counts + warnings; no transcript text) for auditability.

1.2 Non-Goals

  • Audio preprocessing / audio-text alignment fixes (reference tool flags misaligned audio sessions; not required for text-only runs).
  • “Classical ML” token stripping (e.g., removing <laughter> tokens) by default; this is an explicit ablation, not the default.
  • Downloading/unzipping DAIC-WOZ data (handled by dataset prep tooling; this spec focuses on transcript variants once data/transcripts/ exists).

2. Inputs (Raw, Untouched)

2.1 Canonical raw layout

Raw transcripts are expected in:

data/
  transcripts/
    300_P/300_TRANSCRIPT.csv
    ...

The transcript file is tab-separated with required columns:

start_time    stop_time    speaker    value

See: docs/data/daic-woz-schema.md.

2.2 Raw data must not be modified

  • The preprocessing workflow must never overwrite anything under data/transcripts/.
  • Processed variants must be written to a distinct directory root (see Section 3).

3. Outputs (Processed Variants)

3.1 Output directory convention

Processed transcripts are written to a new transcripts root that preserves the same on-disk convention:

data/
  transcripts_<variant_name>/
    300_P/300_TRANSCRIPT.csv
    ...

3.2 Variant selection in runtime code

The runtime transcript loader is already configurable via DATA_TRANSCRIPTS_DIR:

  • src/ai_psychiatrist/config.py: DataSettings.transcripts_dir (env prefix DATA_)
  • src/ai_psychiatrist/services/transcript.py: TranscriptService reads data_settings.transcripts_dir

Example:

export DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only

No code changes are required to select a variant: only configuration changes.
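
As a minimal sketch of how the directory convention and the environment variable fit together (this is not the project's TranscriptService; the helper name transcript_path and the fallback to data/transcripts when the variable is unset are assumptions made for illustration):

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def transcript_path(pid: int, env: Optional[Mapping[str, str]] = None) -> Path:
    """Build the expected transcript path for a participant ID.

    Hypothetical helper: reads DATA_TRANSCRIPTS_DIR and assumes a fallback
    of data/transcripts when the variable is not set.
    """
    env = os.environ if env is None else env
    root = Path(env.get("DATA_TRANSCRIPTS_DIR", "data/transcripts"))
    # Same on-disk convention for raw and processed roots.
    return root / f"{pid}_P" / f"{pid}_TRANSCRIPT.csv"
```

Because raw and processed roots share the {pid}_P/{pid}_TRANSCRIPT.csv layout, only the root changes between variants.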

4. Dataset Facts (Reference Tool + Local Audit)

This implementation is aligned with the widely used Bailey/Plumbley DAIC-WOZ preprocessing tool mirrored under _reference/daic_woz_process/.

4.1 Known transcript mechanical issues (reference tool config)

From _reference/daic_woz_process/config_files/config_process.py:

  • Interruption windows (drop rows overlapping these time ranges):
      • 373: [395, 428] seconds
      • 444: [286, 387] seconds
  • Missing Ellie transcripts (participant-only transcripts exist):
      • 451, 458, 480
  • Audio-text misalignment offsets (text is usable; offsets for audio sync):
      • 318: 34.319917 seconds
      • 321: 3.8379167 seconds
      • 341: 6.1892 seconds
      • 362: 16.8582 seconds
  • Known label issue:
      • wrong_labels = {409: 1} (PHQ8_Binary mismatch for score ≥ 10)
  • Absent sessions (no data at all):
      • excluded_sessions = [342, 394, 398, 460]

4.2 Local raw transcript audit expectations

On a complete DAIC-WOZ transcript dump under data/transcripts/, the following checks should hold:

  • Transcript file count: 189
  • Speakers present: only Ellie and Participant
  • Missing Ellie sessions: 451, 458, 480 contain no Ellie rows
  • Interruption-window overlap counts (rows removed if applying interruption rule):
      • 373: 5 rows overlap [395, 428]
      • 444: 37 rows overlap [286, 387]

These are data-validation expectations, not hard-coded invariants: preprocessing should handle deviations by warning or failing fast, depending on severity (see Section 6).

5. Variant Definitions

All variants apply the same deterministic cleaning rules (Section 6) first, then apply a variant-specific speaker selection rule.

5.1 both_speakers_clean

  • Keep all cleaned rows for both speakers.
  • Intended for “paper-parity-ish” text runs where you want noise removal without removing Ellie entirely.

5.2 participant_only

  • Keep only rows where speaker == "Participant" after cleaning.
  • Rationale: minimizes interviewer protocol leakage in embedding generation and retrieval.

5.3 participant_qa (minimal question context)

  • Keep all participant rows, plus the most recent prior Ellie prompt once per contiguous participant block.

Deterministic rule: when a participant row is kept, include the most recent prior Ellie row (if any) exactly once until another Ellie row appears.
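
The three speaker selection rules can be sketched as follows. This is an illustrative sketch, not the code in scripts/preprocess_daic_woz_transcripts.py; the function name select_rows and the dict-based row representation are assumptions. It expects rows that have already been cleaned and whose speaker values are normalized.

```python
from typing import Dict, List

Row = Dict[str, str]  # assumed keys: start_time, stop_time, speaker, value


def select_rows(rows: List[Row], variant: str) -> List[Row]:
    """Apply a variant-specific speaker selection rule to cleaned rows."""
    if variant == "both_speakers_clean":
        return list(rows)
    if variant == "participant_only":
        return [r for r in rows if r["speaker"] == "Participant"]
    if variant == "participant_qa":
        kept: List[Row] = []
        pending_prompt = None  # most recent Ellie row not yet emitted
        for row in rows:
            if row["speaker"] == "Ellie":
                # A newer prompt replaces any older, unanswered one.
                pending_prompt = row
            else:
                if pending_prompt is not None:
                    kept.append(pending_prompt)
                    pending_prompt = None  # emit each prompt at most once
                kept.append(row)
        return kept
    raise ValueError(f"unknown variant: {variant!r}")
```

Under this reading of the rule, Ellie rows that are never followed by a participant row are dropped, and consecutive Ellie rows contribute only the last one.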

6. Deterministic Cleaning Rules (Applied to All Variants)

6.1 Parse + schema validation (fail-fast)

For each {pid}_TRANSCRIPT.csv:

  • Must contain columns: start_time, stop_time, speaker, value
  • If required columns are missing: fail preprocessing for that transcript (do not silently continue)
  • Drop rows where speaker or value is missing/NaN
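
A minimal sketch of the parse and schema check, using only the standard library csv module. The function name parse_transcript is hypothetical, and treating empty or whitespace-only cells as "missing" is an assumption; the actual script may rely on a different parser and NaN handling.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("start_time", "stop_time", "speaker", "value")


def parse_transcript(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated transcript text and validate its schema."""
    # QUOTE_NONE keeps quote characters inside utterances untouched.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    present = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in present]
    if missing:
        # Fail fast: never continue silently with a malformed transcript.
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        speaker = (row.get("speaker") or "").strip()
        value = (row.get("value") or "").strip()
        if not speaker or not value:
            continue  # drop rows with a missing speaker or value
        rows.append(row)
    return rows
```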

6.2 Speaker normalization + validation (fail-fast)

Normalize speaker values by trimming and case-folding:

  • "ellie" → "Ellie"
  • "participant" → "Participant"

After normalization, if any speaker value is not in {Ellie, Participant}, fail preprocessing for that transcript.
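
The normalization and validation step might look like this; the function name normalize_speaker is hypothetical.

```python
CANONICAL_SPEAKERS = {"ellie": "Ellie", "participant": "Participant"}


def normalize_speaker(raw: str) -> str:
    """Trim and case-fold a speaker label, then map it to its canonical form."""
    key = raw.strip().casefold()
    if key not in CANONICAL_SPEAKERS:
        # Fail fast on any label outside {Ellie, Participant}.
        raise ValueError(f"unexpected speaker: {raw!r}")
    return CANONICAL_SPEAKERS[key]
```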

6.3 Pre-interview removal (drop “preamble”)

If Ellie is present:

  • Find the first row where speaker == "Ellie".
  • Drop all rows before it.

If Ellie is not present:

  • This is expected only for sessions {451, 458, 480}.
  • Drop leading sync markers / empty rows until the first non-empty, non-sync row.
  • If Ellie is absent for a session not in the known list: do not fail; emit a warning (to avoid hard-coding assumptions that may vary across dataset copies).
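
A sketch of the preamble rule under stated assumptions: rows are dicts with normalized speaker values, the names drop_preamble and _is_skippable are hypothetical, and sync markers are detected with the simple prefix test from Section 6.4.

```python
from typing import Dict, FrozenSet, List, Tuple

Row = Dict[str, str]
KNOWN_NO_ELLIE: FrozenSet[int] = frozenset({451, 458, 480})


def _is_skippable(value: str) -> bool:
    """True for empty values and for values that look like sync markers."""
    normalized = value.strip().casefold()
    return not normalized or normalized.startswith(("<sync", "[sync"))


def drop_preamble(rows: List[Row], pid: int) -> Tuple[List[Row], List[str]]:
    """Return (rows without the pre-interview preamble, warnings)."""
    warnings: List[str] = []
    for index, row in enumerate(rows):
        if row["speaker"] == "Ellie":
            return rows[index:], warnings  # interview starts at first Ellie row
    if pid not in KNOWN_NO_ELLIE:
        # Unexpected, but dataset copies may differ: warn instead of failing.
        warnings.append(f"session {pid}: no Ellie rows found")
    start = 0
    while start < len(rows) and _is_skippable(rows[start]["value"]):
        start += 1
    return rows[start:], warnings
```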

6.4 Sync marker removal

Drop rows whose value is a sync marker.

A value counts as a sync marker if, after whitespace trimming and case-insensitive comparison (tolerating minor punctuation), it matches one of these canonical markers:

  • <sync>, <synch>
  • [sync], [synch], [syncing], [synching]
  • plus any value whose normalized form starts with <sync or [sync
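
Since every canonical marker listed above begins with <sync or [sync, the prefix rule alone is enough to detect them. A minimal sketch (the name is_sync_marker is hypothetical, and how much punctuation the real implementation tolerates is not specified here; this version only tolerates trailing characters):

```python
def is_sync_marker(value: str) -> bool:
    """Return True if a transcript value looks like a sync marker."""
    normalized = value.strip().casefold()
    # Covers <sync>, <synch>, [sync], [synch], [syncing], [synching]
    # and anything else that starts with the same prefixes.
    return normalized.startswith(("<sync", "[sync"))
```

Ordinary speech that merely contains the word "sync" is not matched, because the test requires the bracketed prefix at the start of the value.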

6.5 Interruption window removal (text-safe)

Drop rows overlapping known interruption windows:

  • 373: [395, 428]
  • 444: [286, 387]

Row/window overlap definition:

row_start < window_end AND row_end > window_start
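
The overlap test and window lookup can be sketched as below. The names overlaps and drop_interrupted are hypothetical, and rows are assumed to be dicts whose start_time and stop_time parse as seconds.

```python
from typing import Dict, List, Tuple

# Windows in seconds, taken from the reference tool config cited in Section 4.1.
INTERRUPTION_WINDOWS: Dict[int, Tuple[float, float]] = {
    373: (395.0, 428.0),
    444: (286.0, 387.0),
}


def overlaps(row_start: float, row_end: float, window_start: float, window_end: float) -> bool:
    """Strict interval overlap: rows that only touch a boundary are kept."""
    return row_start < window_end and row_end > window_start


def drop_interrupted(rows: List[Dict[str, str]], pid: int) -> List[Dict[str, str]]:
    """Remove rows that overlap the known interruption window for a session."""
    window = INTERRUPTION_WINDOWS.get(pid)
    if window is None:
        return list(rows)
    return [
        row for row in rows
        if not overlaps(float(row["start_time"]), float(row["stop_time"]), *window)
    ]
```

Because both comparisons are strict, a row ending exactly at 395 or starting exactly at 428 in session 373 is retained.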

6.6 Preserve nonverbal annotations and original case (default)

Do not delete tokens like <laughter> / <sigh> by default, because these can carry affective signal for LLM reasoning.

Do not lowercase text. The reference tool lowercases all text (for Word2Vec), but LLMs can use case as a signal (e.g., "I'm REALLY tired" vs "i'm really tired").

Differences from reference tool (intentional):

Behavior                Reference Tool       This Spec (Default)    Rationale
Lowercase text          ✓ .lower()           ✗ Preserve case        LLM affective signal
Strip xxx/xxxx          ✓                    ✗ Preserve             Placeholder may indicate hesitation
Strip < > [ ] tokens    ✓ All such tokens    ✗ Preserve             <laughter> is affectively informative

The reference tool (_reference/daic_woz_process/utils/utilities.py, remove_words_symbols()) strips:

  • words_to_remove = {'xxx', 'xxxx', ' ', ' ', ' ', ' ', ' '}
  • Any token containing symbols_to_remove = ['<', '>', '[', ']']

These behaviors are appropriate for Word2Vec/classical ML but not for LLM-based reasoning. If an ablation requires reference-parity stripping, implement it as a separate variant (participant_only_stripped).

7. Preprocessing CLI Contract

7.1 Script entrypoint

Provide a deterministic CLI at:

  • scripts/preprocess_daic_woz_transcripts.py

7.2 Required flags / behavior

  • --input-dir (default data/transcripts)
  • --output-dir (required)
  • --variant in {both_speakers_clean, participant_only, participant_qa} (default participant_only)
  • --overwrite to delete an existing output dir (explicit opt-in)
  • --dry-run to validate and compute stats without writing outputs

Safety constraints:

  • Refuse to run if --output-dir resolves to the same path as --input-dir.

Atomicity:

  • Write outputs to a staging directory (e.g., output_dir.tmp) and rename to output_dir only on success.
  • On failure, remove staging output to avoid partial/corrupt processed datasets.
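
The safety and atomicity constraints above can be sketched as follows. This is not the script's actual code: the function name run_atomically and the write_outputs callback are hypothetical, and refusing to touch an existing output directory unless overwrite is set is an assumption derived from the --overwrite description.

```python
import shutil
from pathlib import Path
from typing import Callable


def run_atomically(
    input_dir: Path,
    output_dir: Path,
    write_outputs: Callable[[Path], None],
    overwrite: bool = False,
) -> None:
    """Write processed outputs via a staging directory, publishing only on success."""
    if input_dir.resolve() == output_dir.resolve():
        raise ValueError("--output-dir must not resolve to --input-dir")
    if output_dir.exists() and not overwrite:
        raise FileExistsError(f"{output_dir} exists; pass --overwrite to replace it")

    staging = output_dir.with_name(output_dir.name + ".tmp")
    if staging.exists():
        shutil.rmtree(staging)  # clear leftovers from an earlier aborted run
    staging.mkdir(parents=True)
    try:
        write_outputs(staging)
    except BaseException:
        # Never leave a partial dataset behind.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    if output_dir.exists():
        shutil.rmtree(output_dir)
    staging.rename(output_dir)
```

Keeping the staging directory next to the final one means the rename stays on the same filesystem, which is what makes the final step cheap.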

Audit output:

  • When writing outputs (non-dry-run), write preprocess_manifest.json containing:
  • counts (rows in/out)
  • per-file removals by category
  • warnings
  • no transcript text
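
One possible shape for the manifest, shown only as a sketch: the field names (variant, transcript_count, rows_in, rows_out, per_file, warnings) and the helper names are assumptions, not the actual schema of preprocess_manifest.json.

```python
import json
from typing import Dict, List


def build_manifest(variant: str, per_file: Dict[str, dict], warnings: List[str]) -> dict:
    """Aggregate per-file statistics into a manifest with no transcript text."""
    return {
        "variant": variant,
        "transcript_count": len(per_file),
        "rows_in": sum(stats["rows_in"] for stats in per_file.values()),
        "rows_out": sum(stats["rows_out"] for stats in per_file.values()),
        "per_file": per_file,  # e.g. removals by category for each session
        "warnings": list(warnings),
    }


def dump_manifest(manifest: dict) -> str:
    """Serialize with sorted keys so identical runs yield identical bytes."""
    return json.dumps(manifest, indent=2, sort_keys=True) + "\n"
```

Only counts, category names, and warning strings are stored, which keeps the manifest safe to share alongside results.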

8. Ground Truth Integrity (PHQ-8 CSVs)

These are deterministic repairs and should be treated as integrity fixes, not statistical imputation.

8.1 Missing PHQ-8 item cell repair (when applicable)

If exactly one PHQ-8 item is missing and PHQ8_Score is present:

missing_item = PHQ8_Score - sum(known_items)
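
A minimal sketch of the repair rule, separate from scripts/patch_missing_phq8_values.py. The function name repair_missing_item is hypothetical, and the extra range check (each PHQ-8 item is scored 0 to 3) is an added safeguard that the rule above does not state.

```python
from typing import List, Optional


def repair_missing_item(items: List[Optional[int]], total: int) -> List[int]:
    """Fill exactly one missing PHQ-8 item from the total score."""
    missing = [index for index, value in enumerate(items) if value is None]
    if len(missing) != 1:
        raise ValueError("repair requires exactly one missing item")
    value = total - sum(v for v in items if v is not None)
    if not 0 <= value <= 3:
        # Assumed safeguard: an out-of-range result means the row is inconsistent.
        raise ValueError(f"derived item value {value} is outside 0..3")
    repaired = list(items)
    repaired[missing[0]] = value
    return repaired
```

The result is fully determined by the other seven items and the total, which is why this counts as an integrity fix rather than imputation.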

Tooling:

  • uv run python scripts/patch_missing_phq8_values.py --dry-run
  • uv run python scripts/patch_missing_phq8_values.py --apply

Doc: docs/data/patch-missing-phq8-values.md

8.2 PHQ8_Binary consistency rule

Treat:

PHQ8_Binary = 1 iff PHQ8_Score >= 10

Known upstream issue to account for: Participant 409 has been observed with PHQ8_Score=10 but PHQ8_Binary=0 in some copies.
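
The consistency rule is simple enough to check mechanically. In this sketch the function names and the (participant_id, score, binary) tuple layout are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def expected_binary(score: int) -> int:
    """PHQ8_Binary is 1 exactly when PHQ8_Score is at least 10."""
    return 1 if score >= 10 else 0


def find_binary_mismatches(records: Iterable[Tuple[int, int, int]]) -> List[int]:
    """Return participant IDs whose stored binary label disagrees with the score."""
    return [pid for pid, score, binary in records if binary != expected_binary(score)]
```

A record such as (409, 10, 0), matching the upstream issue described above, would be reported as a mismatch.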

9. Collision-Free Artifact Workflow (Embeddings, Tags, Chunk Scores)

To avoid mixing artifacts from different transcript variants:

1) Keep raw transcripts in data/transcripts/
2) Generate a processed variant in data/transcripts_<variant>/
3) Set DATA_TRANSCRIPTS_DIR to that variant
4) Generate embeddings with a variant-stamped artifact name
5) Ensure .tags.json and .chunk_scores.json correspond to the same embeddings base name

See:

  • docs/data/artifact-namespace-registry.md
  • docs/embeddings/embedding-generation.md

10. Acceptance Criteria / Validation

10.1 Preprocessing correctness

  • Preprocessing completes without error across all transcripts in data/transcripts/.
  • Output transcript files preserve the {pid}_P/{pid}_TRANSCRIPT.csv convention.
  • Every output transcript contains at least one participant utterance.
  • Sessions 451/458/480 are processed without failure (Ellie absent).
  • Sessions 373/444 have rows removed overlapping the specified windows.

10.2 Reproducibility and auditability

  • Output directory includes preprocess_manifest.json (no transcript text).
  • Re-running with identical inputs and settings produces identical outputs.
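
One way to check the reproducibility criterion is to compare a content digest of two output directories. This is a sketch, not part of the described tooling: the helper name directory_digest is hypothetical, and it assumes the manifest itself contains no run-specific fields such as timestamps.

```python
import hashlib
from pathlib import Path


def directory_digest(root: Path) -> str:
    """Hash relative file paths and file contents in a stable, sorted order."""
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between name and content
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Two runs with identical inputs and settings should then yield the same digest for their output directories.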

10.3 Downstream compatibility

  • Setting DATA_TRANSCRIPTS_DIR to the output directory results in successful transcript loads via TranscriptService.
  • Embeddings generation succeeds against the processed transcripts directory when configured.

11. References

  • User-facing guide (overview + rationale): docs/data/daic-woz-preprocessing.md
  • Local audit notes: docs/_brainstorming/daic-woz-preprocessing.md
  • DAIC-WOZ schema: docs/data/daic-woz-schema.md
  • Reference preprocessing repo mirror: _reference/daic_woz_process/