Feature Reference (Non-Archive Canonical)

Audience: Researchers and maintainers
Last Updated: 2026-01-01

This page is the canonical, non-archive reference for implemented features that affect:
- few-shot retrieval behavior
- artifact formats
- evaluation metrics
- fail-fast / reliability semantics

If docs/_archive/ disappeared tomorrow, this page (and the linked docs under docs/) should still be sufficient to run, debug, and interpret experiments.

SSOT + Defaults

SSOT for config names + code defaults: src/ai_psychiatrist/config.py
Recommended baseline for research runs: .env.example (copy to .env)
Run provenance: scripts/reproduce_results.py writes run_metadata (timestamp, git commit, run id, settings snapshot)

When this page says “default”, it refers to code defaults unless explicitly marked as “.env.example baseline”.

Few-Shot Retrieval Features

Feature	Spec	Config	Code Default	Artifact Requirement	What It Changes
Reference Examples prompt format	31 (+33 XML)	(none)	ON	(none)	How references are formatted in the prompt
Retrieval audit logs	32	`EMBEDDING_ENABLE_RETRIEVAL_AUDIT`	`false`	(none)	Adds structured logs per retrieved reference
Similarity threshold	33	`EMBEDDING_MIN_REFERENCE_SIMILARITY`	`0.0`	(none)	Drops low-similarity references before top-k
Per-item context budget	33	`EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM`	`0`	(none)	Caps total chars per item after top-k
Item-tag filtering	34 (+38 semantics)	`EMBEDDING_ENABLE_ITEM_TAG_FILTER`	`false`	`{emb}.tags.json`	Filters candidate chunks by PHQ-8 item tags
Chunk-level score attachment	35	`EMBEDDING_REFERENCE_SCORE_SOURCE`	`participant`	`{emb}.chunk_scores.json` + `{emb}.chunk_scores.meta.json`	Uses per-chunk estimated labels instead of participant-level labels
CRAG-style reference validation	36 (+38 semantics)	`EMBEDDING_ENABLE_REFERENCE_VALIDATION`	`false`	(none)	LLM validates each retrieved reference (`accept`/`reject`)
Batch query embedding	37	`EMBEDDING_ENABLE_BATCH_QUERY_EMBEDDING`	`true`	(none)	Uses 1 embedding call per participant (vs 8)
Query embedding timeout	37	`EMBEDDING_QUERY_EMBED_TIMEOUT_SECONDS`	`300`	(none)	Bounds embedding latency; replaces older hardcoded timeouts
Skip-if-disabled, crash-if-broken	38	(automatic)	ON	(varies)	Disabled optional features do no I/O; enabled features crash on invalid/missing artifacts
Preserve exception types	39	(automatic)	ON	(none)	Avoids masking errors as `ValueError` so failures are diagnosable

Notes:
- “{emb}” means the resolved embeddings NPZ path: resolve_reference_embeddings_path(...) in src/ai_psychiatrist/config.py.
- Spec 31’s original notebook used the same string as both opening and closing tag (<Reference Examples> … <Reference Examples>). Spec 33 intentionally changed the closing delimiter to a conventional XML-style closing tag: </Reference Examples>.
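The table fixes an order of operations: the similarity threshold applies before top-k, and the per-item character budget applies after top-k. The sketch below illustrates that ordering only. The `Candidate` type, the `select_references` function, the `>=` comparison, the reading of `0` as "no cap", and the whole-reference (rather than truncating) budget rule are all assumptions made for illustration, not the project's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical retrieved chunk; field names are illustrative only."""

    text: str
    similarity: float


def select_references(
    candidates: list[Candidate],
    top_k: int,
    min_similarity: float = 0.0,  # mirrors EMBEDDING_MIN_REFERENCE_SIMILARITY
    max_chars_per_item: int = 0,  # mirrors EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM
) -> list[Candidate]:
    # Step 1: drop low-similarity candidates BEFORE top-k (assumed ">=").
    kept = [c for c in candidates if c.similarity >= min_similarity]

    # Step 2: keep the top-k most similar survivors.
    kept = sorted(kept, key=lambda c: c.similarity, reverse=True)[:top_k]

    # Step 3: apply the character budget AFTER top-k.
    # Assumption: 0 (the code default) means "no cap".
    if max_chars_per_item <= 0:
        return kept

    selected: list[Candidate] = []
    used = 0
    for c in kept:
        # Assumption: whole references are kept or dropped, never truncated.
        if used + len(c.text) > max_chars_per_item:
            break
        selected.append(c)
        used += len(c.text)
    return selected
```

With the code defaults (`0.0`, `0`) this reduces to plain top-k selection. With a nonzero threshold and budget, fewer than `top_k` references may be returned.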

Embedding Artifact Safety

Feature	Spec	Where	Behavior
Fail-fast embedding generation	40	`scripts/generate_embeddings.py`	Default strict mode crashes on missing/corrupt transcripts or embedding failures; `--allow-partial` is debug-only and exits `2` with a `{output}.partial.json` skip manifest
Embedding NaN/Inf/zero detection	55	`infrastructure/validation.py`	Validates embeddings at generation, load, and similarity computation
Dimension strict mode	57	`reference_store.py`	Default: fail on `len(emb) < dimension`; escape hatch: `EMBEDDING_ALLOW_INSUFFICIENT_DIMENSION_EMBEDDINGS=true`

See: Artifact generation.
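As a rough illustration of the checks in the table (NaN/Inf/zero detection plus the dimension strict mode), a minimal sketch could look like the following. `EmbeddingValidationError` and `validate_embedding` are hypothetical names, and the real checks in `infrastructure/validation.py` and `reference_store.py` may differ in detail.

```python
import math
from typing import Sequence


class EmbeddingValidationError(Exception):
    """Hypothetical error type used only in this sketch."""


def validate_embedding(
    vec: Sequence[float],
    dimension: int,
    allow_insufficient_dimension: bool = False,  # mirrors the escape-hatch setting
) -> None:
    # Dimension strict mode: fail when the vector is shorter than configured.
    if len(vec) < dimension and not allow_insufficient_dimension:
        raise EmbeddingValidationError(
            f"embedding has {len(vec)} dims, expected at least {dimension}"
        )
    # NaN/Inf detection: every component must be a finite number.
    if not all(math.isfinite(x) for x in vec):
        raise EmbeddingValidationError("embedding contains NaN or Inf")
    # Zero detection: an all-zero vector has no direction, so cosine
    # similarity against it is undefined.
    if all(x == 0.0 for x in vec):
        raise EmbeddingValidationError("embedding is all zeros")
```

Running the same function at generation, load, and similarity computation would match the three checkpoints the table lists.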

Evidence Extraction Validation (Specs 053-054)

Feature	Spec	Config	Code Default	What It Does
Evidence schema validation	54	(automatic)	ON	Raises `EvidenceSchemaError` on wrong types (e.g., a string where a list is expected)
Evidence hallucination detection	53	`QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_ENABLED`	`true`	Validates extracted quotes exist in transcript
Grounding mode	53	`QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_MODE`	`substring`	`substring` (conservative) or `fuzzy` (requires rapidfuzz)
Fail on all rejected	53	`QUANTITATIVE_EVIDENCE_QUOTE_FAIL_ON_ALL_REJECTED`	`false`	When enabled, raises if LLM returned evidence but none grounded (strict mode)

SSOT: src/ai_psychiatrist/services/evidence_validation.py
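To make the `substring` grounding mode and the fail-on-all-rejected switch concrete, here is a minimal sketch. It is not the code in `evidence_validation.py`. The whitespace and case normalization is an assumption, `EvidenceSchemaError` is redefined locally as a stand-in, and `EvidenceGroundingError` and `validate_evidence` are hypothetical names.

```python
import re


class EvidenceSchemaError(Exception):
    """Local stand-in for the project's error of the same name."""


class EvidenceGroundingError(Exception):
    """Hypothetical error for the 'all quotes rejected' strict mode."""


def _normalize(text: str) -> str:
    # Assumption: collapse whitespace and ignore case before matching.
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_evidence(
    quotes: object,
    transcript: str,
    fail_on_all_rejected: bool = False,  # mirrors ..._FAIL_ON_ALL_REJECTED
) -> list[str]:
    # Schema check: evidence must be a list of strings, not a bare string.
    if not isinstance(quotes, list) or not all(isinstance(q, str) for q in quotes):
        raise EvidenceSchemaError("evidence must be a list of strings")

    haystack = _normalize(transcript)
    # Substring grounding: keep only quotes that occur in the transcript.
    grounded = [q for q in quotes if _normalize(q) and _normalize(q) in haystack]

    # Strict mode: the LLM returned evidence, but none of it was grounded.
    if fail_on_all_rejected and quotes and not grounded:
        raise EvidenceGroundingError("all extracted quotes were rejected")
    return grounded
```

For example, with the transcript `"I have been feeling   tired all day."`, the quote `"feeling tired all day"` is kept under this normalization, while a quote that never appears in the transcript is dropped.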

Failure Pattern Observability (Spec 056)

Feature	Spec	Config	Code Default	What It Does
Failure registry	56	(automatic)	ON	Captures all failures with consistent taxonomy
Failure JSON artifact	56	(automatic)	ON	Writes `failures_{run_id}.json` per evaluation run

SSOT: src/ai_psychiatrist/infrastructure/observability.py

Privacy: only counts, lengths, hashes, and error codes are stored; raw transcript text is never stored.
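One way to honor that privacy rule is to derive only a length and a hash from any offending text before it reaches the registry. The sketch below assumes SHA-256 and invents the record field names and helper functions. The actual schema of `failures_{run_id}.json` is defined in `observability.py` and is not reproduced here.

```python
import hashlib
from collections import Counter


def make_failure_record(error_code: str, stage: str, offending_text: str) -> dict:
    """Build a privacy-safe record (hypothetical field names)."""
    return {
        "error_code": error_code,
        "stage": stage,
        # Store only derived values, never the text itself.
        "text_length": len(offending_text),
        "text_sha256": hashlib.sha256(offending_text.encode("utf-8")).hexdigest(),
    }


def count_by_error_code(records: list[dict]) -> dict:
    """Aggregate failure counts per error code."""
    return dict(Counter(r["error_code"] for r in records))
```

The hash still lets maintainers tell whether two failures involved the same input without exposing what the input said.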

Evaluation / Metrics

Feature	Spec	Where	Why It Exists
Selective prediction metrics	25	`scripts/evaluate_selective_prediction.py`, `src/ai_psychiatrist/metrics/*`	Comparing MAE across different coverages is invalid; we report AURC/AUGRC + bootstrap CIs

See:
- Statistical methodology (AURC/AUGRC) (why AURC/AUGRC)
- Metrics and evaluation (exact definitions + output schema)
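MAE at different coverages is not comparable because a system that abstains on its hardest cases gets a lower MAE over fewer items. Area-under-curve metrics integrate risk over all coverage levels instead. The sketch below shows one common empirical form, averaging over the n coverage levels k/n, and is only illustrative: the function names are hypothetical, and the project's exact definitions, tie handling, and bootstrap procedure are those in the linked Metrics and evaluation doc.

```python
from typing import Sequence


def risk_coverage_curve(
    abs_errors: Sequence[float], confidences: Sequence[float]
) -> list[tuple[float, float, float]]:
    """Return (coverage, selective_risk, generalized_risk) per coverage level."""
    n = len(abs_errors)
    if n == 0 or n != len(confidences):
        raise ValueError("inputs must be non-empty and of equal length")

    # Accept predictions from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    curve = []
    running_error = 0.0
    for k, i in enumerate(order, start=1):
        running_error += abs_errors[i]
        coverage = k / n
        selective_risk = running_error / k    # mean error over accepted items
        generalized_risk = running_error / n  # accepted error over ALL items
        curve.append((coverage, selective_risk, generalized_risk))
    return curve


def aurc(curve: list[tuple[float, float, float]]) -> float:
    # Average selective risk across coverage levels.
    return sum(risk for _, risk, _ in curve) / len(curve)


def augrc(curve: list[tuple[float, float, float]]) -> float:
    # Average generalized risk across coverage levels.
    return sum(grisk for _, _, grisk in curve) / len(curve)
```

With absolute errors `[0, 1, 2, 3]` and confidences `[0.9, 0.8, 0.7, 0.1]`, the selective risks are 0, 0.5, 1.0, and 1.5, so this AURC is 0.75 while the full-coverage MAE is 1.5.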

Recommended Profiles (Research Workflow)

Legacy Baseline (Historical)

Goal: reproduce the paper’s method as described, even if it is noisy.

EMBEDDING_REFERENCE_SCORE_SOURCE=participant
EMBEDDING_ENABLE_ITEM_TAG_FILTER=false
EMBEDDING_MIN_REFERENCE_SIMILARITY=0.0
EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=0
EMBEDDING_ENABLE_REFERENCE_VALIDATION=false

Research-Honest Retrieval (Post-Ablation Target)

Goal: minimize known failure modes (label mismatch, wrong-item retrieval, irrelevant references).

EMBEDDING_REFERENCE_SCORE_SOURCE=chunk
EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
EMBEDDING_MIN_REFERENCE_SIMILARITY=0.3
EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=500
EMBEDDING_ENABLE_REFERENCE_VALIDATION=true

See: Preflight checklist (few-shot).
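Because enabled features crash on missing artifacts (Spec 38), it can help to list what a profile requires before launching a run. The sketch below derives that list from the Artifact Requirement column of the retrieval table. `required_artifacts` and `preflight` are hypothetical helpers, `emb` stands for whatever "{emb}" resolves to in your setup, and parsing the settings from a plain string mapping is an assumption.

```python
from pathlib import Path
from typing import Mapping


def required_artifacts(env: Mapping[str, str], emb: str) -> list[str]:
    """List artifact paths the given settings require (per the feature table)."""
    required: list[str] = []
    # Item-tag filtering needs the tags sidecar.
    if env.get("EMBEDDING_ENABLE_ITEM_TAG_FILTER", "false").lower() == "true":
        required.append(f"{emb}.tags.json")
    # Chunk-level scores need the scores file and its metadata.
    if env.get("EMBEDDING_REFERENCE_SCORE_SOURCE", "participant") == "chunk":
        required.append(f"{emb}.chunk_scores.json")
        required.append(f"{emb}.chunk_scores.meta.json")
    return required


def preflight(env: Mapping[str, str], emb: str) -> None:
    """Fail early if any required artifact is missing on disk."""
    missing = [p for p in required_artifacts(env, emb) if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"missing required artifacts: {missing}")
```

Under the Legacy Baseline settings this requires nothing beyond the embeddings themselves. Under the Research-Honest profile it requires all three sidecar files.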