Spec: DAIC-WOZ Transcript Preprocessing (Bias-Aware, Deterministic Variants)
Status: Implemented
Primary implementation: scripts/preprocess_daic_woz_transcripts.py
Integration points: src/ai_psychiatrist/config.py (DATA_TRANSCRIPTS_DIR), src/ai_psychiatrist/services/transcript.py
Verification: uv run pytest tests/ --tb=short (2026-01-02)
0. Problem Statement
DAIC-WOZ transcripts contain:
1) Interviewer prompt leakage: Ellie’s prompts can leak protocol patterns into embedding-based retrieval, biasing few-shot selection before the LLM is prompted.
2) Known "mechanical" transcript issues: e.g., interruption windows and missing Ellie transcripts (sessions 451, 458, 480).
3) Potential integrity issues in split CSVs (depending on upstream copy): missing PHQ-8 item cells and known label inconsistencies (e.g., PHQ8_Binary mismatch).
We need a deterministic, reproducible preprocessing workflow that creates collision-free transcript variants without modifying raw data.
1. Goals / Non-Goals
1.1 Goals
- Produce bias-aware transcript variants (notably participant-only) for embeddings/retrieval.
- Apply deterministic cleanup for known transcript mechanical issues (sync markers, interruptions).
- Guarantee raw vs processed inputs never collide (no in-place overwrites).
- Maintain the directory + filename convention expected by the codebase.
- Provide a machine-readable manifest (counts + warnings; no transcript text) for auditability.
1.2 Non-Goals
- Audio preprocessing / audio-text alignment fixes (the reference tool flags misaligned audio sessions; not required for text-only runs).
- "Classical ML" token stripping (e.g., removing `<laughter>` tokens) by default; this is an explicit ablation, not the default.
- Downloading/unzipping DAIC-WOZ data (handled by dataset prep tooling; this spec focuses on transcript variants once `data/transcripts/` exists).
2. Inputs (Raw, Untouched)
2.1 Canonical raw layout
Raw transcripts are expected in:
data/
transcripts/
300_P/300_TRANSCRIPT.csv
...
The transcript file is tab-separated with required columns:
start_time stop_time speaker value
See: docs/data/daic-woz-schema.md.
2.2 Raw data must not be modified
- The preprocessing workflow must never overwrite anything under `data/transcripts/`.
- Processed variants must be written to a distinct directory root (see Section 3).
3. Outputs (Processed Variants)
3.1 Output directory convention
Processed transcripts are written to a new transcripts root that preserves the same on-disk convention:
data/
transcripts_<variant_name>/
300_P/300_TRANSCRIPT.csv
...
3.2 Variant selection in runtime code
The runtime transcript loader is already configurable via `DATA_TRANSCRIPTS_DIR`:
- `src/ai_psychiatrist/config.py`: `DataSettings.transcripts_dir` (env prefix `DATA_`)
- `src/ai_psychiatrist/services/transcript.py`: `TranscriptService` reads `data_settings.transcripts_dir`
Example:
export DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only
No code changes are required to select a variant: only configuration changes.
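As an illustration only (not the actual `DataSettings` implementation, which is pydantic-based), variant selection reduces to resolving one environment variable with a raw-layout default:

```python
import os
from pathlib import Path

def resolve_transcripts_dir(default: str = "data/transcripts") -> Path:
    """Resolve the transcripts root from DATA_TRANSCRIPTS_DIR, falling back
    to the raw layout. Mirrors the DataSettings behavior in spirit only."""
    return Path(os.environ.get("DATA_TRANSCRIPTS_DIR", default))

# With the env var set, a processed variant is picked up transparently:
os.environ["DATA_TRANSCRIPTS_DIR"] = "data/transcripts_participant_only"
assert resolve_transcripts_dir() == Path("data/transcripts_participant_only")
```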
4. Dataset Facts (Reference Tool + Local Audit)
This implementation is aligned with the widely used Bailey/Plumbley DAIC-WOZ preprocessing tool mirrored under _reference/daic_woz_process/.
4.1 Known transcript mechanical issues (reference tool config)
From _reference/daic_woz_process/config_files/config_process.py:
- Interruption windows (drop rows overlapping these time ranges):
  - 373: [395, 428] seconds
  - 444: [286, 387] seconds
- Missing Ellie transcripts (participant-only transcripts exist): 451, 458, 480
- Audio-text misalignment offsets (text is usable; offsets are for audio sync):
  - 318: 34.319917 seconds
  - 321: 3.8379167 seconds
  - 341: 6.1892 seconds
  - 362: 16.8582 seconds
- Known label issue: `wrong_labels = {409: 1}` (`PHQ8_Binary` mismatch for score ≥ 10)
- Absent sessions (no data at all): `excluded_sessions = [342, 394, 398, 460]`
4.2 Local raw transcript audit expectations
On a complete DAIC-WOZ transcript dump under data/transcripts/, the following checks should hold:
- Transcript file count: 189
- Speakers present: only `Ellie` and `Participant`
- Missing Ellie sessions: 451, 458, 480 contain no Ellie rows
- Interruption-window overlap counts (rows removed if applying the interruption rule):
  - 373: 5 rows overlap [395, 428]
  - 444: 37 rows overlap [286, 387]
These are data-validation expectations, not hard-coded invariants: the preprocessing should handle deviations by warning/fail-fast depending on severity (see Section 6).
5. Variant Definitions
All variants apply the same deterministic cleaning rules (Section 6) first, then apply a variant-specific speaker selection rule.
5.1 both_speakers_clean
- Keep all cleaned rows for both speakers.
- Intended for “paper-parity-ish” text runs where you want noise removal without removing Ellie entirely.
5.2 participant_only (recommended for embeddings/retrieval)
- Keep only rows where `speaker == "Participant"` after cleaning.
- Rationale: minimizes interviewer protocol leakage in embedding generation and retrieval.
5.3 participant_qa (minimal question context)
- Keep all participant rows, plus the most recent prior Ellie prompt, included once per contiguous participant block.
- Deterministic rule: when a participant row is kept, first include the most recent prior Ellie row (if any) exactly once; it is not repeated until a new Ellie row appears.
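The rule above can be sketched as a single pass over rows (illustrative; assumes rows are `(speaker, value)` tuples in transcript order, not the actual script's data model):

```python
def participant_qa(rows):
    """Sketch of the participant_qa selection rule.

    Keeps every Participant row; prefixes each contiguous Participant block
    with the most recent prior Ellie prompt, emitted at most once."""
    kept = []
    last_ellie = None       # most recent Ellie row seen so far
    ellie_emitted = False   # has last_ellie already been included?
    for speaker, value in rows:
        if speaker == "Ellie":
            last_ellie = (speaker, value)
            ellie_emitted = False  # a newer prompt may precede the next block
        else:  # Participant
            if last_ellie is not None and not ellie_emitted:
                kept.append(last_ellie)
                ellie_emitted = True
            kept.append((speaker, value))
    return kept

rows = [
    ("Ellie", "how are you"),
    ("Participant", "fine"),
    ("Participant", "thanks"),
    ("Ellie", "tell me more"),
    ("Ellie", "about your week"),
    ("Participant", "it was long"),
]
# Only the most recent prompt before each participant block survives:
assert participant_qa(rows) == [
    ("Ellie", "how are you"),
    ("Participant", "fine"),
    ("Participant", "thanks"),
    ("Ellie", "about your week"),
    ("Participant", "it was long"),
]
```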
6. Deterministic Cleaning Rules (Applied to All Variants)
6.1 Parse + schema validation (fail-fast)
For each `{pid}_TRANSCRIPT.csv`:
- Must contain columns: `start_time`, `stop_time`, `speaker`, `value`.
- If required columns are missing: fail preprocessing for that transcript (do not silently continue).
- Drop rows where `speaker` or `value` is missing/NaN.
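A minimal fail-fast parse sketch, using the stdlib `csv` module for illustration (the real script may use pandas; column semantics follow docs/data/daic-woz-schema.md):

```python
import csv
import io

REQUIRED_COLUMNS = {"start_time", "stop_time", "speaker", "value"}

def parse_transcript(tsv_text: str) -> list[dict]:
    """Parse one tab-separated transcript; raise on missing columns and
    drop rows whose speaker or value is missing/blank."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    rows = []
    for row in reader:
        if not (row.get("speaker") or "").strip():
            continue  # missing speaker
        if not (row.get("value") or "").strip():
            continue  # missing utterance text
        rows.append(row)
    return rows

sample = "start_time\tstop_time\tspeaker\tvalue\n0.0\t1.0\tEllie\thi\n1.0\t2.0\t\t\n"
assert len(parse_transcript(sample)) == 1
```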
6.2 Speaker normalization + validation (fail-fast)
Normalize speaker values by trimming and case-folding:
- `"ellie"` → `"Ellie"`
- `"participant"` → `"Participant"`
After normalization:
- If any speaker value is not in {Ellie, Participant}: fail preprocessing for that transcript.
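The normalize-then-validate step is small enough to sketch directly (illustrative; the actual script's helper names may differ):

```python
CANONICAL_SPEAKERS = {"ellie": "Ellie", "participant": "Participant"}

def normalize_speaker(raw: str) -> str:
    """Trim + case-fold a speaker value; fail fast on anything outside
    {Ellie, Participant}."""
    key = raw.strip().casefold()
    if key not in CANONICAL_SPEAKERS:
        raise ValueError(f"unexpected speaker value: {raw!r}")
    return CANONICAL_SPEAKERS[key]

assert normalize_speaker("  ELLIE ") == "Ellie"
assert normalize_speaker("participant") == "Participant"
```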
6.3 Pre-interview removal (drop “preamble”)
If Ellie is present:
- Find the first row where speaker == "Ellie".
- Drop all rows before it.
If Ellie is not present:
- This is expected only for sessions {451, 458, 480}.
- Drop leading sync markers / empty rows until the first non-empty, non-sync row.
- If Ellie is absent for a session not in the known list: do not fail; emit a warning (to avoid hard-coding assumptions that may vary across dataset copies).
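Both branches of the preamble rule can be sketched as one function (illustrative; assumes `(speaker, value)` tuples and reuses the Section 6.4 sync-prefix convention):

```python
def drop_preamble(rows):
    """Drop the pre-interview preamble.

    With Ellie present: drop everything before her first row. Without Ellie
    (expected only for 451/458/480): drop leading sync markers/empty rows."""
    speakers = [s for s, _ in rows]
    if "Ellie" in speakers:
        return rows[speakers.index("Ellie"):]
    for i, (_, value) in enumerate(rows):
        v = value.strip().casefold()
        if v and not (v.startswith("<sync") or v.startswith("[sync")):
            return rows[i:]  # first non-empty, non-sync row
    return []

rows = [("Participant", "[sync]"), ("Participant", "hello")]
assert drop_preamble(rows) == [("Participant", "hello")]
```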
6.4 Sync marker removal
Drop rows whose value is a sync marker.
Must match these canonical markers (case-insensitive, whitespace-trimmed), tolerating minor punctuation:
- `<sync>`, `<synch>`
- `[sync]`, `[synch]`, `[syncing]`, `[synching]`
- plus any value whose normalized form starts with `<sync` or `[sync`
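Because every canonical marker shares the `<sync` / `[sync` prefix after normalization, the whole rule collapses to a prefix test (illustrative sketch):

```python
def is_sync_marker(value: str) -> bool:
    """Case-insensitive, whitespace-trimmed sync-marker test: matches any
    value whose normalized form starts with <sync or [sync."""
    v = value.strip().casefold()
    return v.startswith("<sync") or v.startswith("[sync")

assert is_sync_marker("  <SYNCH>  ")
assert is_sync_marker("[syncing]")
assert not is_sync_marker("<laughter>")  # nonverbal annotations are kept
```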
6.5 Interruption window removal (text-safe)
Drop rows overlapping known interruption windows:
- 373: [395, 428]
- 444: [286, 387]
Row/window overlap definition:
row_start < window_end AND row_end > window_start
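The overlap definition above translates directly into a predicate (illustrative; window boundaries are treated as exclusive, so a row that merely touches a window edge is kept):

```python
INTERRUPTION_WINDOWS = {373: (395.0, 428.0), 444: (286.0, 387.0)}

def overlaps(row_start: float, row_end: float, window: tuple[float, float]) -> bool:
    """Section 6.5 overlap test:
    row_start < window_end AND row_end > window_start."""
    window_start, window_end = window
    return row_start < window_end and row_end > window_start

# A row from 420s to 430s overlaps session 373's [395, 428] window:
assert overlaps(420.0, 430.0, INTERRUPTION_WINDOWS[373])
# A row ending exactly at a window start does not overlap:
assert not overlaps(280.0, 286.0, INTERRUPTION_WINDOWS[444])
```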
6.6 Preserve nonverbal annotations and original case (default)
Do not delete tokens like <laughter> / <sigh> by default, because these can carry affective signal for LLM reasoning.
Do not lowercase text. The reference tool lowercases all text (for Word2Vec), but LLMs can use case as a signal (e.g., "I'm REALLY tired" vs "i'm really tired").
Differences from reference tool (intentional):
| Behavior | Reference Tool | This Spec (Default) | Rationale |
|---|---|---|---|
| Lowercase text | ✓ `.lower()` | ✗ Preserve case | LLM affective signal |
| Strip `xxx`/`xxxx` | ✓ | ✗ Preserve | Placeholder may indicate hesitation |
| Strip `< > [ ]` tokens | ✓ All such tokens | ✗ Preserve | `<laughter>` is affectively informative |
The reference tool (_reference/daic_woz_process/utils/utilities.py, remove_words_symbols()) strips:
- `words_to_remove = {'xxx', 'xxxx', ' ', ' ', ' ', ' ', ' '}`
- Any token containing `symbols_to_remove = ['<', '>', '[', ']']`
These behaviors are appropriate for Word2Vec/classical ML but not for LLM-based reasoning. If an ablation requires reference-parity stripping, implement it as a separate variant (participant_only_stripped).
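A rough approximation of what such a stripped ablation variant would do to utterance text (a sketch only; the reference tool's exact token handling lives in `remove_words_symbols()`):

```python
def strip_reference_style(text: str) -> str:
    """Approximate reference-parity stripping for an ablation variant
    (participant_only_stripped); NOT the default pipeline behavior."""
    tokens = []
    for token in text.split():
        if any(sym in token for sym in "<>[]"):
            continue  # drop annotation tokens like <laughter>
        if token.casefold() in {"xxx", "xxxx"}:
            continue  # drop unintelligible-speech placeholders
        tokens.append(token)
    return " ".join(tokens).lower()

assert strip_reference_style("I'm REALLY tired <laughter> xxx") == "i'm really tired"
```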
7. Preprocessing CLI Contract
7.1 Script entrypoint
Provide a deterministic CLI at:
scripts/preprocess_daic_woz_transcripts.py
7.2 Required flags / behavior
- `--input-dir` (default `data/transcripts`)
- `--output-dir` (required)
- `--variant` in `{both_speakers_clean, participant_only, participant_qa}` (default `participant_only`)
- `--overwrite` to delete an existing output dir (explicit opt-in)
- `--dry-run` to validate and compute stats without writing outputs

Safety constraints:
- Refuse to run if `--output-dir` resolves to the same path as `--input-dir`.
Atomicity:
- Write outputs to a staging directory (e.g., `output_dir.tmp`) and rename to `output_dir` only on success.
- On failure, remove the staging output to avoid partial/corrupt processed datasets.
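A minimal stage-then-rename sketch (illustrative; the actual script's file layout and error handling may differ):

```python
import shutil
from pathlib import Path

def write_atomically(output_dir: Path, files: dict[str, str]) -> None:
    """Write `files` (relative path -> contents) to a staging directory and
    rename it into place; nothing appears at output_dir unless every write
    succeeds."""
    staging = output_dir.with_suffix(".tmp")
    try:
        for rel_path, content in files.items():
            dest = staging / rel_path
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(content)
        staging.rename(output_dir)  # atomic on the same filesystem
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)  # no partial outputs
        raise
```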
Audit output:
- When writing outputs (non-dry-run), write `preprocess_manifest.json` containing:
  - counts (rows in/out)
  - per-file removals by category
  - warnings
  - no transcript text
8. Ground Truth Integrity (PHQ-8 CSVs)
These are deterministic repairs and should be treated as integrity fixes, not statistical imputation.
8.1 Missing PHQ-8 item cell repair (when applicable)
If exactly one PHQ-8 item is missing and PHQ8_Score is present:
missing_item = PHQ8_Score - sum(known_items)
Tooling:
- `uv run python scripts/patch_missing_phq8_values.py --dry-run`
- `uv run python scripts/patch_missing_phq8_values.py --apply`
Doc: docs/data/patch-missing-phq8-values.md
8.2 PHQ8_Binary consistency rule
Treat:
PHQ8_Binary = 1 iff PHQ8_Score >= 10
Known upstream issue to account for: Participant 409 has been observed with PHQ8_Score=10 but PHQ8_Binary=0 in some copies.
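The consistency rule is trivially checkable per row (sketch):

```python
def expected_phq8_binary(score: int) -> int:
    """PHQ8_Binary consistency rule: 1 iff PHQ8_Score >= 10."""
    return 1 if score >= 10 else 0

# Participant 409's reported inconsistency (score 10, binary 0) is caught:
assert expected_phq8_binary(10) == 1
assert expected_phq8_binary(9) == 0
```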
9. Collision-Free Artifact Workflow (Embeddings, Tags, Chunk Scores)
To avoid mixing artifacts from different transcript variants:
1) Keep raw transcripts in data/transcripts/
2) Generate a processed variant in data/transcripts_<variant>/
3) Set DATA_TRANSCRIPTS_DIR to that variant
4) Generate embeddings with a variant-stamped artifact name
5) Ensure .tags.json and .chunk_scores.json correspond to the same embeddings base name
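As an illustration of step 5, deriving sidecar names from the embeddings base name keeps them in lockstep by construction. The `.npz` extension and helper below are hypothetical; the actual naming convention is defined in docs/data/artifact-namespace-registry.md:

```python
def sidecar_names(embeddings_path: str) -> tuple[str, str]:
    """Hypothetical helper: derive the .tags.json / .chunk_scores.json names
    that must share the embeddings artifact's base name."""
    base = embeddings_path.removesuffix(".npz")
    return f"{base}.tags.json", f"{base}.chunk_scores.json"

tags, scores = sidecar_names("embeddings_participant_only.npz")
assert tags == "embeddings_participant_only.tags.json"
assert scores == "embeddings_participant_only.chunk_scores.json"
```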
See:
- docs/data/artifact-namespace-registry.md
- docs/embeddings/embedding-generation.md
10. Acceptance Criteria / Validation
10.1 Preprocessing correctness
- Preprocessing completes without error across all transcripts in `data/transcripts/`.
- Output transcript files preserve the `{pid}_P/{pid}_TRANSCRIPT.csv` convention.
- Every output transcript contains at least one participant utterance.
- Sessions 451/458/480 are processed without failure (Ellie absent).
- Sessions 373/444 have rows removed overlapping the specified windows.
10.2 Reproducibility and auditability
- Output directory includes `preprocess_manifest.json` (no transcript text).
- Re-running with identical inputs and settings produces identical outputs.
10.3 Downstream compatibility
- Setting `DATA_TRANSCRIPTS_DIR` to the output directory results in successful transcript loads via `TranscriptService`.
- Embeddings generation succeeds against the processed transcripts directory when configured.
11. Related Docs
- User-facing guide (overview + rationale): docs/data/daic-woz-preprocessing.md
- Local audit notes: docs/_brainstorming/daic-woz-preprocessing.md
- DAIC-WOZ schema: docs/data/daic-woz-schema.md
- Reference preprocessing repo mirror: _reference/daic_woz_process/