Feature Reference (Non-Archive Canonical)
Audience: Researchers and maintainers
Last Updated: 2026-01-01
This page is the canonical, non-archive reference for implemented features that affect:
- few-shot retrieval behavior
- artifact formats
- evaluation metrics
- fail-fast / reliability semantics
If docs/_archive/ disappeared tomorrow, this page (and the linked docs under docs/) should still be sufficient to run, debug, and interpret experiments.
SSOT + Defaults
- SSOT for config names + code defaults: src/ai_psychiatrist/config.py
- Recommended baseline for research runs: .env.example (copy to .env)
- Run provenance: scripts/reproduce_results.py writes run_metadata (timestamp, git commit, run id, settings snapshot)
When this page says “default”, it refers to code defaults unless explicitly marked as “.env.example baseline”.
Few-Shot Retrieval Features
| Feature | Spec | Config | Code Default | Artifact Requirement | What It Changes |
|---|---|---|---|---|---|
| Reference Examples prompt format | 31 (+33 XML) | (none) | ON | (none) | How references are formatted in the prompt |
| Retrieval audit logs | 32 | EMBEDDING_ENABLE_RETRIEVAL_AUDIT | false | (none) | Adds structured logs per retrieved reference |
| Similarity threshold | 33 | EMBEDDING_MIN_REFERENCE_SIMILARITY | 0.0 | (none) | Drops low-similarity references before top-k |
| Per-item context budget | 33 | EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM | 0 | (none) | Caps total chars per item after top-k |
| Item-tag filtering | 34 (+38 semantics) | EMBEDDING_ENABLE_ITEM_TAG_FILTER | false | {emb}.tags.json | Filters candidate chunks by PHQ-8 item tags |
| Chunk-level score attachment | 35 | EMBEDDING_REFERENCE_SCORE_SOURCE | participant | {emb}.chunk_scores.json + {emb}.chunk_scores.meta.json | Uses per-chunk estimated labels instead of participant-level labels |
| CRAG-style reference validation | 36 (+38 semantics) | EMBEDDING_ENABLE_REFERENCE_VALIDATION | false | (none) | LLM validates each retrieved reference (accept/reject) |
| Batch query embedding | 37 | EMBEDDING_ENABLE_BATCH_QUERY_EMBEDDING | true | (none) | Uses 1 embedding call per participant (vs 8) |
| Query embedding timeout | 37 | EMBEDDING_QUERY_EMBED_TIMEOUT_SECONDS | 300 | (none) | Bounds embedding latency; replaces older hardcoded timeouts |
| Skip-if-disabled, crash-if-broken | 38 | (automatic) | ON | (varies) | Disabled optional features do no I/O; enabled features crash on invalid/missing artifacts |
| Preserve exception types | 39 | (automatic) | ON | (none) | Avoids masking errors as ValueError so failures are diagnosable |
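The ordering stated in the table (similarity threshold before top-k, character budget after top-k) can be sketched as follows. This is a minimal illustration, not the project's implementation: `Candidate`, `select_references`, and the parameter names are hypothetical. Two behaviors are assumptions made for the sketch: a budget of 0 is treated as "no budget", and the budget is enforced by dropping references that would overflow it rather than truncating them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical retrieved chunk; not the project's actual type."""

    text: str
    similarity: float


def select_references(
    candidates: list[Candidate],
    top_k: int,
    min_similarity: float = 0.0,
    max_chars_per_item: int = 0,
) -> list[Candidate]:
    # Step 1: the similarity threshold is applied BEFORE top-k,
    # so low-similarity chunks never compete for a slot.
    kept = [c for c in candidates if c.similarity >= min_similarity]

    # Step 2: top-k by descending similarity.
    kept.sort(key=lambda c: c.similarity, reverse=True)
    kept = kept[:top_k]

    # Step 3: the per-item character budget is applied AFTER top-k.
    # Assumption: 0 (the code default) means the budget is disabled.
    if max_chars_per_item <= 0:
        return kept

    selected: list[Candidate] = []
    used = 0
    for c in kept:
        # Assumption: stop at the first reference that would overflow.
        if used + len(c.text) > max_chars_per_item:
            break
        selected.append(c)
        used += len(c.text)
    return selected
```

With the code defaults (threshold 0.0, budget 0) the function reduces to plain top-k, which matches the "Legacy Baseline" settings later on this page.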
Notes:
- “{emb}” means the resolved embeddings NPZ path: resolve_reference_embeddings_path(...) in src/ai_psychiatrist/config.py.
- Spec 31’s original notebook used an unusual “same open/close tag” (<Reference Examples> … <Reference Examples>). Spec 33 intentionally changed the closing delimiter to proper XML: </Reference Examples>.
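As a rough illustration of the delimiter change described in the last note, the sketch below wraps references in the Spec 33 style block. Only the opening and closing delimiters come from this page; the function name `format_reference_block` and the inner layout (references separated by blank lines) are assumptions.

```python
def format_reference_block(references: list[str]) -> str:
    """Wrap reference texts in the Spec 33 style delimiters.

    The inner layout here is hypothetical; only the delimiters
    are taken from the note above.
    """
    body = "\n\n".join(references)
    # Spec 33: the closing delimiter carries a slash, unlike the
    # "same open/close tag" used in the original notebook.
    return f"<Reference Examples>\n{body}\n</Reference Examples>"
```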
Embedding Artifact Safety
| Feature | Spec | Where | Behavior |
|---|---|---|---|
| Fail-fast embedding generation | 40 | scripts/generate_embeddings.py | Default strict mode crashes on missing/corrupt transcripts or embedding failures; --allow-partial is debug-only and exits 2 with a {output}.partial.json skip manifest |
| Embedding NaN/Inf/zero detection | 55 | infrastructure/validation.py | Validates embeddings at generation, load, and similarity computation |
| Dimension strict mode | 57 | reference_store.py | Default: fail on len(emb) < dimension; escape hatch: EMBEDDING_ALLOW_INSUFFICIENT_DIMENSION_EMBEDDINGS=true |
See: Artifact generation.
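The NaN/Inf/zero checks and the dimension strict mode from the table above can be sketched in pure Python. This is an illustration under stated assumptions, not the code in infrastructure/validation.py or reference_store.py: `EmbeddingValidationError`, `validate_embedding`, and the `allow_insufficient_dimension` parameter (standing in for EMBEDDING_ALLOW_INSUFFICIENT_DIMENSION_EMBEDDINGS) are hypothetical names.

```python
import math
from typing import Sequence


class EmbeddingValidationError(Exception):
    """Hypothetical error type used only in this sketch."""


def validate_embedding(
    emb: Sequence[float],
    dimension: int,
    allow_insufficient_dimension: bool = False,
) -> None:
    """Raise if the vector is unusable; return None otherwise."""
    # NaN/Inf would silently poison any similarity computed later.
    if any(not math.isfinite(x) for x in emb):
        raise EmbeddingValidationError("embedding contains NaN or Inf")

    # An all-zero (or empty) vector has no direction, so cosine
    # similarity against it is undefined.
    if all(x == 0.0 for x in emb):
        raise EmbeddingValidationError("embedding is empty or all zeros")

    # Dimension strict mode: fail on len(emb) < dimension unless the
    # escape hatch is explicitly enabled.
    if len(emb) < dimension and not allow_insufficient_dimension:
        raise EmbeddingValidationError(
            f"embedding has {len(emb)} values, expected at least {dimension}"
        )
```

Running the same check at generation, load, and similarity time means a corrupt vector is caught at the earliest stage that sees it.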
Evidence Extraction Validation (Specs 053-054)
| Feature | Spec | Config | Code Default | What It Does |
|---|---|---|---|---|
| Evidence schema validation | 54 | (automatic) | ON | Raises EvidenceSchemaError on wrong types (string instead of list) |
| Evidence hallucination detection | 53 | QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_ENABLED | true | Validates extracted quotes exist in transcript |
| Grounding mode | 53 | QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_MODE | substring | substring (conservative) or fuzzy (requires rapidfuzz) |
| Fail on all rejected | 53 | QUANTITATIVE_EVIDENCE_QUOTE_FAIL_ON_ALL_REJECTED | false | When enabled, raises if LLM returned evidence but none grounded (strict mode) |
SSOT: src/ai_psychiatrist/services/evidence_validation.py
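A minimal sketch of substring-mode grounding and the "fail on all rejected" switch follows. It is not the code in evidence_validation.py: `ground_quotes` and `EvidenceGroundingError` are hypothetical names, and the normalization step (lowercasing and collapsing whitespace) is an assumption about what "substring" matching might tolerate. The fuzzy mode is not shown.

```python
class EvidenceGroundingError(Exception):
    """Hypothetical error type used only in this sketch."""


def _normalize(text: str) -> str:
    # Assumption: matching ignores case and whitespace differences.
    return " ".join(text.lower().split())


def ground_quotes(
    quotes: list[str],
    transcript: str,
    fail_on_all_rejected: bool = False,
) -> list[str]:
    """Return only the quotes that occur in the transcript."""
    haystack = _normalize(transcript)
    accepted = [
        q for q in quotes
        if _normalize(q) and _normalize(q) in haystack
    ]

    # Strict mode: the LLM returned evidence, but none of it is grounded.
    if fail_on_all_rejected and quotes and not accepted:
        raise EvidenceGroundingError(
            f"all {len(quotes)} extracted quotes were rejected"
        )
    return accepted
```

With the strict switch off, an ungrounded quote is simply dropped, so a hallucinated quote cannot reach later stages as evidence.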
Failure Pattern Observability (Spec 056)
| Feature | Spec | Config | Code Default | What It Does |
|---|---|---|---|---|
| Failure registry | 56 | (automatic) | ON | Captures all failures with consistent taxonomy |
| Failure JSON artifact | 56 | (automatic) | ON | Writes failures_{run_id}.json per evaluation run |
SSOT: src/ai_psychiatrist/infrastructure/observability.py
Privacy: only counts, lengths, hashes, and error codes are stored; raw transcript text is never written to the failure artifact.
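The privacy rule can be illustrated with a record builder that keeps a length and a hash in place of the text itself. The function name `build_failure_record` and every field name are hypothetical; the real schema of failures_{run_id}.json is defined in observability.py and is not reproduced here.

```python
import hashlib


def build_failure_record(error_code: str, stage: str, raw_text: str) -> dict:
    """Describe a failure without storing the offending text.

    All field names are hypothetical and chosen for this sketch.
    """
    digest = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
    return {
        "error_code": error_code,
        "stage": stage,
        # Length and hash let two failures be compared or deduplicated
        # without ever persisting transcript content.
        "text_length": len(raw_text),
        "text_sha256": digest,
    }
```

Two failures triggered by the same input produce the same hash, which is enough to spot repeats across runs.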
Evaluation / Metrics
| Feature | Spec | Where | Why It Exists |
|---|---|---|---|
| Selective prediction metrics | 25 | scripts/evaluate_selective_prediction.py, src/ai_psychiatrist/metrics/* | Comparing MAE across different coverages is invalid; we report AURC/AUGRC + bootstrap CIs |
See:
- Statistical methodology (AURC/AUGRC) (why AURC/AUGRC)
- Metrics and evaluation (exact definitions + output schema)
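To make the coverage argument concrete, the sketch below computes a risk-coverage curve and one common discrete form of AURC: the mean, over every coverage level, of the average absolute error among the most confident predictions. The project's exact definitions live in the linked docs; this is only an illustration, the function names are hypothetical, and AUGRC and bootstrap CIs are omitted.

```python
def risk_coverage_curve(
    abs_errors: list[float], confidences: list[float]
) -> list[tuple[float, float]]:
    """Return (coverage, risk) points, most confident predictions first."""
    if not abs_errors or len(abs_errors) != len(confidences):
        raise ValueError("inputs must be non-empty and of equal length")

    order = sorted(
        range(len(abs_errors)), key=lambda i: confidences[i], reverse=True
    )
    curve = []
    running = 0.0
    for k, i in enumerate(order, start=1):
        running += abs_errors[i]
        # Risk at this coverage = mean absolute error of the k kept items.
        curve.append((k / len(order), running / k))
    return curve


def aurc(abs_errors: list[float], confidences: list[float]) -> float:
    """Average risk over all coverage levels (lower is better)."""
    curve = risk_coverage_curve(abs_errors, confidences)
    return sum(risk for _, risk in curve) / len(curve)
```

For absolute errors [0.0, 1.0, 2.0, 3.0], both confidence orderings share the same full-coverage MAE of 1.5. When confidence ranks the accurate predictions first, this AURC is 0.75; with the ranking reversed it is 2.25. A single MAE reported at whatever coverage a system happens to reach hides that difference.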
Recommended Profiles (Research Workflow)
Legacy Baseline (Historical)
Goal: reproduce the paper’s method as described, even if it is noisy.
- EMBEDDING_REFERENCE_SCORE_SOURCE=participant
- EMBEDDING_ENABLE_ITEM_TAG_FILTER=false
- EMBEDDING_MIN_REFERENCE_SIMILARITY=0.0
- EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=0
- EMBEDDING_ENABLE_REFERENCE_VALIDATION=false
Research-Honest Retrieval (Post-Ablation Target)
Goal: minimize known failure modes (label mismatch, wrong-item retrieval, irrelevant references).
- EMBEDDING_REFERENCE_SCORE_SOURCE=chunk
- EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
- EMBEDDING_MIN_REFERENCE_SIMILARITY=0.3
- EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=500
- EMBEDDING_ENABLE_REFERENCE_VALIDATION=true
See: Preflight checklist (few-shot).
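As a rough preflight aid, the two profiles can be written as plain mappings and compared against the code defaults listed in the retrieval table. The sketch also lists the artifacts each profile would need, following the "Artifact Requirement" column. The helper names are hypothetical, values are kept as strings as they would appear in a .env file, and `emb` is a placeholder for whatever resolve_reference_embeddings_path(...) returns, which is not reproduced here.

```python
# Code defaults for the five settings, as listed in the retrieval table.
CODE_DEFAULTS = {
    "EMBEDDING_REFERENCE_SCORE_SOURCE": "participant",
    "EMBEDDING_ENABLE_ITEM_TAG_FILTER": "false",
    "EMBEDDING_MIN_REFERENCE_SIMILARITY": "0.0",
    "EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM": "0",
    "EMBEDDING_ENABLE_REFERENCE_VALIDATION": "false",
}

LEGACY_BASELINE = dict(CODE_DEFAULTS)

RESEARCH_HONEST = {
    "EMBEDDING_REFERENCE_SCORE_SOURCE": "chunk",
    "EMBEDDING_ENABLE_ITEM_TAG_FILTER": "true",
    "EMBEDDING_MIN_REFERENCE_SIMILARITY": "0.3",
    "EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM": "500",
    "EMBEDDING_ENABLE_REFERENCE_VALIDATION": "true",
}


def overrides(profile: dict[str, str]) -> dict[str, str]:
    """Settings in the profile that differ from the code defaults."""
    return {k: v for k, v in profile.items() if CODE_DEFAULTS.get(k) != v}


def required_artifacts(profile: dict[str, str], emb: str) -> list[str]:
    """Artifacts the enabled features need (hypothetical helper)."""
    required = []
    if profile.get("EMBEDDING_ENABLE_ITEM_TAG_FILTER") == "true":
        required.append(f"{emb}.tags.json")
    if profile.get("EMBEDDING_REFERENCE_SCORE_SOURCE") == "chunk":
        required.append(f"{emb}.chunk_scores.json")
        required.append(f"{emb}.chunk_scores.meta.json")
    return required
```

The legacy profile overrides nothing and needs no extra artifacts. The research-honest profile changes all five settings and depends on three artifact files, which enabled features are expected to crash on if they are missing or invalid.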