# Artifact Namespace Registry

Purpose: Single source of truth for all data artifacts, scripts, and naming conventions.

Last Updated: 2026-01-03
## Naming Convention Summary

| Prefix | Meaning | Example |
|---|---|---|
| (none) | AVEC2017 official splits | `reference_embeddings.npz` |
| `paper_` | Paper-style custom splits | `paper_reference_embeddings.npz` |
| `{backend}_...` | Embedding generator output | `huggingface_qwen3_8b_paper_train_participant_only.npz` |

Note: `scripts/generate_embeddings.py` defaults to `{backend}_{model_slug}_{split}.npz` naming and writes an optional `.meta.json`. For collision-free runs, include a transcript-variant suffix (e.g., `_participant_only`) in the output name.

Legacy filenames like `paper_reference_embeddings.npz` are still supported (use `--output` to regenerate with a specific name).
## Data Splits

See Data Splits Overview for the authoritative reference on AVEC2017 vs paper splits.

Quick Reference:

- AVEC2017: 107 train / 35 dev / 47 test (test has no per-item labels)
- Paper custom: 58 train / 43 val / 41 test (all have per-item labels)
- Ground truth IDs: Data Splits Overview
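The split sizes above can be kept as constants for quick sanity checks (the names here are illustrative, not repo identifiers):

```python
# Split sizes from the quick reference above (illustrative constants).
AVEC2017_SPLITS = {"train": 107, "dev": 35, "test": 47}
PAPER_SPLITS = {"train": 58, "val": 43, "test": 41}

def total_participants(sizes: dict[str, int]) -> int:
    """Total participants covered by a split scheme."""
    return sum(sizes.values())
```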
## Transcript Artifacts

### Raw (Extraction Output)

`scripts/prepare_dataset.py` writes raw transcripts to:

`data/transcripts/{id}_P/{id}_TRANSCRIPT.csv`

These are not speaker-filtered and may contain known DAIC-WOZ issues (interruptions, sync markers, missing Ellie transcripts).
### Preprocessed Variants (Recommended for Bias-Aware Retrieval)

`scripts/preprocess_daic_woz_transcripts.py` writes deterministic variants under:

`data/transcripts_{variant}/{id}_P/{id}_TRANSCRIPT.csv`

Recommended variants:

- `participant_only` (bias-aware retrieval default)
- `both_speakers_clean` (clean baseline, keeps Ellie + Participant)
- `participant_qa` (participant + minimal question context)

Select a variant via:

`DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only`

See: DAIC-WOZ Preprocessing.
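A minimal sketch of how a loader might honor the `DATA_TRANSCRIPTS_DIR` override (the helper is illustrative, not repo code):

```python
import os
from pathlib import Path

def transcript_path(participant_id: str,
                    default_dir: str = "data/transcripts") -> Path:
    """Resolve a participant's transcript CSV, preferring the
    DATA_TRANSCRIPTS_DIR override that selects a preprocessed variant."""
    base = Path(os.environ.get("DATA_TRANSCRIPTS_DIR", default_dir))
    return base / f"{participant_id}_P" / f"{participant_id}_TRANSCRIPT.csv"
```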
## Embeddings Artifacts

### Legacy (Backward Compatible)

| File | Source Split | Participants | Size | Notes |
|---|---|---|---|---|
| `data/embeddings/paper_reference_embeddings.npz` | Paper-train | 58 | ~101 MB | NPZ embeddings |
| `data/embeddings/paper_reference_embeddings.json` | Paper-train | 58 | ~2.9 MB | Text chunks sidecar |
### Current Generator Output (Default)

`scripts/generate_embeddings.py` writes:

- `{output}.npz` (embeddings)
- `{output}.json` (text chunks)
- `{output}.meta.json` (provenance metadata)
- `{output}.tags.json` (optional, with the `--write-item-tags` flag)
- `{output}.partial.json` (debug-only, with `--allow-partial`; Spec 40)

Additional optional sidecars (separate preprocessing steps):

- `{output}.chunk_scores.json` + `{output}.chunk_scores.meta.json` (Spec 35; from `scripts/score_reference_chunks.py`)
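A sketch for discovering which of the sidecars above exist for a given output stem (suffix list mirrors this section; the helper is illustrative):

```python
from pathlib import Path

# Sidecar suffixes described above, in the order this section lists them.
SIDECAR_SUFFIXES = (
    ".npz", ".json", ".meta.json", ".tags.json", ".partial.json",
    ".chunk_scores.json", ".chunk_scores.meta.json",
)

def existing_sidecars(output_stem: str) -> list[str]:
    """Return the sidecar suffixes that exist on disk for an output stem."""
    return [s for s in SIDECAR_SUFFIXES if Path(output_stem + s).exists()]
```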
### Item Tags Sidecar (Spec 34)

When generated with `--write-item-tags`, the `.tags.json` sidecar contains per-chunk PHQ-8 item tags:

```json
{
  "303": [
    ["PHQ8_Sleep", "PHQ8_Tired"],
    [],
    ["PHQ8_Depressed"]
  ],
  "304": []
}
```

Purpose: Enables item-level filtering at retrieval time (`EMBEDDING_ENABLE_ITEM_TAG_FILTER=true`).

Validation: `ReferenceStore` validates that:

- Participant IDs match the texts sidecar
- Per-participant list length equals chunk count
- Tag values are valid `PHQ8_*` strings
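These checks can be sketched as follows (the item-name set follows the standard DAIC-WOZ PHQ-8 label columns; confirm it against the repo's canonical list, and the function name is illustrative):

```python
# Standard DAIC-WOZ PHQ-8 item labels (verify against the repo's canonical set).
PHQ8_ITEMS = {
    "PHQ8_NoInterest", "PHQ8_Depressed", "PHQ8_Sleep", "PHQ8_Tired",
    "PHQ8_Appetite", "PHQ8_Failure", "PHQ8_Concentrating", "PHQ8_Moving",
}

def validate_item_tags(tags: dict, chunk_counts: dict) -> None:
    """Apply the three ReferenceStore checks described above to a parsed
    .tags.json mapping, given per-participant chunk counts."""
    if set(tags) != set(chunk_counts):
        raise ValueError("participant IDs do not match the texts sidecar")
    for pid, per_chunk_tags in tags.items():
        if len(per_chunk_tags) != chunk_counts[pid]:
            raise ValueError(f"{pid}: tag-list length != chunk count")
        for chunk_tags in per_chunk_tags:
            unknown = set(chunk_tags) - PHQ8_ITEMS
            if unknown:
                raise ValueError(f"{pid}: invalid tags {sorted(unknown)}")
```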
### Chunk Scores Sidecar (Spec 35)

Chunk scoring produces per-chunk estimated PHQ-8 item scores aligned with `{output}.json`:

- `{output}.chunk_scores.json`
- `{output}.chunk_scores.meta.json`

Purpose: Enables chunk-level labels when `EMBEDDING_REFERENCE_SCORE_SOURCE=chunk`.

Validation: `ReferenceStore` validates that:

- Participant IDs match the embeddings/text sidecars exactly
- Per-participant list length equals chunk count
- Keys are exactly the 8 `PHQ8_*` strings
- Values are 0..3 or null
- `prompt_hash` matches the current scorer prompt (unless explicitly overridden as unsafe)

See: Chunk-level scoring.
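A sketch of the per-entry checks (illustrative; the real validation also compares `prompt_hash`, which needs the scorer prompt and is omitted here):

```python
def validate_chunk_score_entry(entry: dict) -> None:
    """Check one per-chunk score dict: exactly 8 PHQ8_*-prefixed keys
    (a proxy for the exact item-name check), each value 0..3 or None
    (JSON null)."""
    if len(entry) != 8 or not all(k.startswith("PHQ8_") for k in entry):
        raise ValueError("keys must be exactly the 8 PHQ8_* item names")
    for item, score in entry.items():
        if score is not None and score not in (0, 1, 2, 3):
            raise ValueError(f"{item}: score must be 0..3 or null, got {score!r}")
```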
### Partial Output Manifest (Spec 40)

If embeddings are generated with `--allow-partial`, the script writes `{output}.partial.json` when skips occur.

Rule: Partial artifacts are debug-only and must not be used for final evaluation.
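The rule can be enforced with a small guard before evaluation (the helper name is illustrative):

```python
from pathlib import Path

def assert_not_partial(output_stem: str) -> None:
    """Refuse to evaluate a partial artifact: per Spec 40, the presence
    of {output}.partial.json marks a debug-only run."""
    manifest = Path(output_stem + ".partial.json")
    if manifest.exists():
        raise RuntimeError(
            f"{manifest} exists; regenerate without --allow-partial "
            "before final evaluation"
        )
```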
## Embedding Auto-Selection Logic

Reference embeddings are selected via config, not `--split`:

- If `DATA_EMBEDDINGS_PATH` is explicitly set, use it.
- Otherwise use `EMBEDDING_EMBEDDINGS_FILE` resolved under `{DATA_BASE_DIR}/embeddings/`.

If `{artifact}.meta.json` exists, `ReferenceStore` validates metadata (backend, model, dimension, chunking, `min_evidence_chars`, split CSV hash) against config at load time.
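The resolution order can be sketched as follows (illustrative; the real loader lives in the repo's config code, and the `"data"` fallback for `DATA_BASE_DIR` is an assumption):

```python
import os
from pathlib import Path

def resolve_reference_embeddings() -> Path:
    """Apply the selection order above: an explicit DATA_EMBEDDINGS_PATH
    wins; otherwise EMBEDDING_EMBEDDINGS_FILE is resolved under
    {DATA_BASE_DIR}/embeddings/."""
    explicit = os.environ.get("DATA_EMBEDDINGS_PATH")
    if explicit:
        return Path(explicit)
    base = os.environ.get("DATA_BASE_DIR", "data")  # default is assumed
    return Path(base) / "embeddings" / os.environ["EMBEDDING_EMBEDDINGS_FILE"]
```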
## Scripts

### Split Creation

| Script | Output | Purpose |
|---|---|---|
| `scripts/create_paper_split.py` | `data/paper_splits/paper_split_*.csv` | Create paper-style 58/43/41 split |
### Embedding Generation

| Script | Input Split | Output | Purpose |
|---|---|---|---|
| `scripts/generate_embeddings.py --split avec-train` | `train_split_Depression_AVEC2017.csv` | `{backend}_{model_slug}_avec_train.*` | AVEC embeddings |
| `scripts/generate_embeddings.py --split paper-train` | `paper_split_train.csv` | `{backend}_{model_slug}_paper_train.*` | Paper embeddings |
### Reproduction

| Script | Eval Split | Embeddings Used | Purpose |
|---|---|---|---|
| `scripts/reproduce_results.py --split dev` | AVEC dev (35) | Configured reference artifact (`EMBEDDING_EMBEDDINGS_FILE` / `DATA_EMBEDDINGS_PATH`) | Default evaluation |
| `scripts/reproduce_results.py --split paper` | Paper test (41) | Configured reference artifact (`EMBEDDING_EMBEDDINGS_FILE` / `DATA_EMBEDDINGS_PATH`) | Paper reproduction |
## Output Artifacts

| File Pattern | Purpose |
|---|---|
| `data/outputs/{mode}_{split}_{YYYYMMDD_HHMMSS}.json` | Reproduction results with run + per-experiment provenance (from `scripts/reproduce_results.py`) |
| `data/outputs/selective_prediction_metrics_*.json` | AURC/AUGRC + bootstrap CIs (from `scripts/evaluate_selective_prediction.py`) |
| `data/outputs/RUN_LOG.md` | Human-maintained run history log (append-only) |
| `data/outputs/*.log` | Console log captures for long runs / tmux sessions (optional) |
| `data/experiments/registry.yaml` | Registry of run metadata + summary metrics (updated by `scripts/reproduce_results.py`) |
## Related Documentation

- Data Splits Overview - AVEC2017 vs paper splits
- DAIC-WOZ Schema - Dataset format and domain model
- DAIC-WOZ Preprocessing - Transcript variant generation
- RAG Artifact Generation - Embedding generation
- Configuration - Environment variables
- Model Registry - Model options and precision