Skip to content

RAG Artifact Generation

Audience: Researchers generating few-shot reference artifacts Last Updated: 2026-01-03

This guide describes how to generate embedding artifacts safely and reproducibly, including optional item tags (Spec 34).

SSOT implementation: - scripts/generate_embeddings.py - src/ai_psychiatrist/services/reference_store.py (loads/validates artifacts)


Output Artifacts

Given an output basename {name}, embedding generation produces:

File Contents Required
{name}.npz Embedding vectors (per-participant keys like emb_303) Yes
{name}.json Chunk texts (participant id → list[str]) Yes
{name}.meta.json Provenance metadata for fail-fast mismatch detection (legacy artifacts may omit; validation is then skipped with a warning) Yes
{name}.tags.json PHQ-8 item tags per chunk (only if --write-item-tags) Optional

For chunk-level scoring artifacts, see chunk-scoring.md.


Basic Generation (Strict Mode)

Strict mode is fail-fast and recommended for production: - transcript load failures → crash - empty transcript → crash - embedding failures → crash - NaN/Inf/zero embeddings → crash (Spec 055) - dimension mismatch → crash (Spec 057)

uv run python scripts/generate_embeddings.py --split paper-train

Embedding Validation (Spec 055)

All generated embeddings are validated for: - NaN values: Corrupted embedding vectors - Inf values: Numerical overflow - Zero vectors: Invalid for cosine similarity

If any validation fails, generation crashes with a clear error message including the chunk index.

Dimension Invariants (Spec 057)

If the embedding backend returns fewer dimensions than EMBEDDING_DIMENSION: - Strict mode (default): Crashes immediately - Partial mode (--allow-partial): Skips chunk, records dimension_mismatch in .partial.json

Escape hatch (runtime only, not for generation):

EMBEDDING_ALLOW_INSUFFICIENT_DIMENSION_EMBEDDINGS=true

This allows loading legacy artifacts with dimension mismatches but should only be used for forensics.


Generation With Item Tags (Spec 34)

Item tagging adds a {name}.tags.json sidecar so retrieval can filter candidate chunks to the target PHQ-8 item.

uv run python scripts/generate_embeddings.py \
  --split paper-train \
  --write-item-tags \
  --tagger keyword

This writes {name}.tags.json aligned with {name}.json.

Enable Tag Filtering at Runtime

EMBEDDING_ENABLE_ITEM_TAG_FILTER=true

Fail-Fast Semantics (Spec 38)

  • If EMBEDDING_ENABLE_ITEM_TAG_FILTER=false:
  • {name}.tags.json is ignored (no load, no validation)
  • If EMBEDDING_ENABLE_ITEM_TAG_FILTER=true:
  • missing {name}.tags.json → crash
  • invalid {name}.tags.json → crash

This is intentional: enabling a feature must not silently run a different method.


Partial Mode (Debug Only)

Partial mode is opt-in for debugging:

uv run python scripts/generate_embeddings.py --split paper-train --allow-partial

Behavior: - skips failed participants/chunks - exits with code 2 - writes a skip manifest {output}.partial.json if any skips occur

Manifest schema:

{
  "output_npz": "data/embeddings/....npz",
  "skipped_participants": [487],
  "skipped_participant_count": 1,
  "skipped_chunks": 12
}

Rule: any artifact produced in partial mode with skips is not valid for final evaluation.


Artifact Schemas

Tags Sidecar ({name}.tags.json)

Top-level JSON object: - keys: participant id strings (e.g., "303") - values: list of per-chunk tag lists aligned with {name}.json

Example:

{
  "303": [
    ["PHQ8_Sleep", "PHQ8_Tired"],
    [],
    ["PHQ8_Depressed"]
  ]
}

Constraints: - participant ids must match {name}.json - per-participant list length must equal the chunk count in {name}.json - each tag must be one of the 8 PHQ8_* strings


Verification Checklist

After generation:

# Base artifacts
ls -la data/embeddings/{name}.npz data/embeddings/{name}.json data/embeddings/{name}.meta.json

# If using item tags
ls -la data/embeddings/{name}.tags.json

# If using chunk scores (see chunk-scoring.md)
ls -la data/embeddings/{name}.chunk_scores.json data/embeddings/{name}.chunk_scores.meta.json

  • Chunk-level scoring: chunk-scoring.md
  • Few-shot preflight: docs/preflight-checklist/preflight-checklist-few-shot.md
  • Feature index: docs/pipeline-internals/features.md