Configuration Philosophy
Date: 2026-01-02
Purpose: Define what should be configurable vs baked-in defaults.
Core Principle
Correct behavior is the default. Broken behavior requires explicit opt-in.
Not everything needs a flag. Flags add cognitive load and misconfiguration risk.
SSOT + Terminology
- SSOT for config names + code defaults: src/ai_psychiatrist/config.py.
- Recommended runtime baseline: .env.example (what most runs use once copied to .env).
- When this doc says "default", it should be read as:
  - Code default = what happens with no .env overrides (or in tests where .env is ignored).
  - Recommended .env baseline = what we expect for normal research runs.
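The distinction between a code default and a .env override can be sketched as below. This is a minimal illustration, not the project's loader: `CODE_DEFAULTS` and `effective_value` are hypothetical names, and only two settings from this doc are included.

```python
import os
from typing import Mapping, Optional

# Hypothetical subset of code defaults; the real ones live in config.py.
CODE_DEFAULTS = {
    "EMBEDDING_MIN_REFERENCE_SIMILARITY": "0.0",
    "EMBEDDING_ENABLE_ITEM_TAG_FILTER": "false",
}


def effective_value(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the environment override if one is set, else the code default."""
    source = os.environ if env is None else env
    return source.get(name, CODE_DEFAULTS[name])
```

With an empty environment (as in tests that ignore .env) the code default applies; once .env.example is copied to .env, the baseline value overrides it.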
On the Legacy Baseline (Paper-Derived)
The paper's few-shot method (as described) introduces a fundamental label mismatch.
We initially aimed to match the paper's reported methodology. Through rigorous investigation, we discovered critical issues:
- Participant-level scores attached to retrieved chunks: In the legacy baseline pipeline, the score shown for a retrieved reference chunk is a participant-level PHQ-8 item score. This creates label noise: a chunk about "career goals" can be shown as (PHQ8_Sleep Score: 2) even if it contains no sleep evidence. This is not chunk-level ground truth.
- Keyword backfill was a heuristic: Keyword triggers ("sleep", "tired", etc.) can increase apparent coverage but may introduce false positives and distort selective-prediction metrics. This feature was removed in Spec 047; historical context is kept under docs/_archive/.
- Reproducibility is ambiguous: Despite extensive effort, we have not reproduced the paper's headline improvements in our environment. This could be due to methodology gaps (under-specified prompts, artifacts, split details) and/or implementation differences.
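The label mismatch in the first point can be made concrete with a small sketch. The `Chunk` type, its `chunk_sleep_score` field, and `label_for_prompt` are hypothetical; how the real pipeline represents chunk-level scores is not shown in this doc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    # Hypothetical chunk-level label; None means the chunk holds no evidence for the item.
    chunk_sleep_score: Optional[int]


def label_for_prompt(chunk: Chunk, participant_sleep_score: int, source: str) -> Optional[int]:
    """Pick the score that would be displayed next to a retrieved chunk."""
    if source == "participant":
        # Legacy behavior: every chunk inherits the participant's item score.
        return participant_sleep_score
    if source == "chunk":
        # Chunk-level behavior: the label reflects this chunk only.
        return chunk.chunk_sleep_score
    raise ValueError(f"unknown score source: {source!r}")
```

For a "career goals" chunk from a participant whose sleep item score is 2, the participant source yields 2 while the chunk source yields no label, which is the noise described above.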
Our stance: the legacy baseline is still useful as a historical baseline, but should not be the default behavior. We aim for research-honest behavior: minimize label noise, avoid silent heuristics, and fail fast when enabled features are broken.
See _archive/misc/HYPOTHESIS-FEWSHOT-DESIGN-FLAW.md for the full analysis.
Configuration Categories
1. Always-On Correctness Invariants (Do Not "Tune")
These are correctness behaviors. Some have knobs in the codebase, but treating them as "tunable" creates misconfiguration risk and can corrupt research runs.
| Behavior | Where Enforced | Config Knob? | Notes |
|---|---|---|---|
| Skip-if-disabled, crash-if-broken (Spec 38) | ReferenceStore + ReferenceValidation | No (automatic) | Disabled feature = no file I/O; enabled feature = strict load + validate |
| Preserve exception types (Spec 39) | Agents | No (automatic) | Log error_type, then raise to preserve the original exception |
| Fail-fast embedding generation (Spec 40) | scripts/generate_embeddings.py | CLI (--allow-partial) | Strict-by-default; partial is for debugging only |
| Evidence schema validation (Spec 54) | _extract_evidence() | No (automatic) | Raises EvidenceSchemaError on wrong types |
| Evidence grounding validation (Spec 53) | _extract_evidence() | Yes (QUANTITATIVE_EVIDENCE_QUOTE_VALIDATION_*) | Default ON; validates quotes exist in transcript |
| Embedding NaN/Inf/zero detection (Spec 55) | Query + reference embeddings | No (automatic) | Raises EmbeddingValidationError |
| Dimension strict mode (Spec 57) | ReferenceStore | Yes (EMBEDDING_ALLOW_INSUFFICIENT_DIMENSION_EMBEDDINGS) | Default: fail on dimension < expected |
| Failure pattern observability (Spec 56) | reproduce_results.py | No (automatic) | Writes failures_{run_id}.json |
| Pydantic AI structured output | Agents | Yes (PYDANTIC_AI_ENABLED) | Disabling is not supported (agents will raise; legacy fallback removed) |
| Track N/A reasons | Quantitative agent | Yes (QUANTITATIVE_TRACK_NA_REASONS) | Default ON; small runtime cost but improves run diagnostics |
Rule: if disabling a "correctness invariant" is possible, it must either:
- be clearly documented as unsupported, or
- be restricted to a debug-only escape hatch (explicit, noisy, and off by default).
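As an illustration of the "skip-if-disabled, crash-if-broken" semantics, here is a minimal sketch. It is not the ReferenceStore implementation; `load_optional_artifact` is a hypothetical helper and the JSON-object check stands in for whatever validation the real loader performs.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_optional_artifact(enabled: bool, path: Path) -> Optional[dict[str, Any]]:
    """Skip all I/O when the feature is off; fail loudly when it is on but broken."""
    if not enabled:
        # Disabled feature: never touch the filesystem.
        return None
    # Enabled feature: strict load. Missing or malformed files raise
    # (FileNotFoundError, json.JSONDecodeError) instead of degrading silently.
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict) or not data:
        raise ValueError(f"{path} is empty or not a JSON object")
    return data
```

The key property is that there is no third state: a feature that is enabled but unusable never falls back to "act as if disabled".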
2. Post-Ablation Defaults (Will Be Baked In)
After ablations complete, these become baked-in defaults:
| Setting | Code Default | .env.example Baseline | Post-Ablation Default | Why |
|---|---|---|---|---|
| EMBEDDING_REFERENCE_SCORE_SOURCE | participant | participant | chunk | Fixes participant-score-on-chunk mismatch (Spec 35) |
| EMBEDDING_ENABLE_ITEM_TAG_FILTER | false | true | true | Improves item-level retrieval precision (Spec 34) |
| EMBEDDING_MIN_REFERENCE_SIMILARITY | 0.0 | 0.3 | 0.3 | Drops low-similarity references (Spec 33) |
| EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM | 0 | 500 | 500 | Prevents context bloat (Spec 33) |
| EMBEDDING_ENABLE_REFERENCE_VALIDATION | false | false | true | CRAG validation to reject irrelevant references (Spec 36) |
Post-ablation: These become defaults. Flags remain ONLY for legacy baseline reproduction.
Why CRAG (Spec 36) Should Be Default ON
If our goal is research-honest retrieval (not the legacy baseline), reference validation is part of the "correct" pipeline:
- Spec 34 (item tags) is a static heuristic and will miss symptom mentions that don't match keywords.
- Spec 33 (similarity threshold/budget) is a quality guardrail, not a relevance proof.
- Spec 35 fixes label correctness, but does not prevent "semantically similar but clinically irrelevant" chunks.
- Spec 36 is the only layer that asks an LLM directly: "Is this reference actually about the target PHQ-8 item?"
In this repo's research workflow (local Ollama, long-running ablations), correctness outweighs latency.
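The layering described above can be sketched as a single selection pass. This is an assumption-heavy illustration: `Reference`, `select_references`, and the ordering of the checks are hypothetical, the `is_relevant` callback stands in for the LLM validation call, and treating `max_chars == 0` as "no budget" is inferred from the code default of 0.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass
class Reference:
    text: str
    similarity: float
    item_tags: FrozenSet[str]


def select_references(
    candidates: Iterable[Reference],
    item: str,
    min_similarity: float,
    max_chars: int,
    is_relevant: Callable[[Reference, str], bool],
) -> List[Reference]:
    """Apply tag filter, similarity threshold, char budget, then validation."""
    kept: List[Reference] = []
    used_chars = 0
    for ref in sorted(candidates, key=lambda r: r.similarity, reverse=True):
        if item not in ref.item_tags:  # Spec 34-style static tag filter
            continue
        if ref.similarity < min_similarity:  # Spec 33-style threshold
            continue
        if max_chars > 0 and used_chars + len(ref.text) > max_chars:  # Spec 33-style budget
            continue
        if not is_relevant(ref, item):  # Spec 36-style validation (LLM stand-in)
            continue
        kept.append(ref)
        used_chars += len(ref.text)
    return kept
```

Only the last check inspects whether the chunk is actually about the target item; the earlier ones can all pass for a chunk that is similar but clinically irrelevant.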
3. Tunable Hyperparameters (Keep Configurable)
Researchers should experiment with these. They affect results, not correctness.
Important: Some "hyperparameters" are index-time and require regenerating artifacts. Changing them without regenerating embeddings/tags/chunk-scores will either crash or silently change the retrieval universe.
| Setting | Code Default | .env.example Baseline | Runtime-Only? | Notes |
|---|---|---|---|---|
| EMBEDDING_DIMENSION | 4096 | 4096 | No | Must match embedding model + stored artifact dimension |
| EMBEDDING_CHUNK_SIZE | 8 | 8 | No | Requires regenerating .npz + sidecars |
| EMBEDDING_CHUNK_STEP | 2 | 2 | No | Requires regenerating .npz + sidecars |
| EMBEDDING_TOP_K_REFERENCES | 2 | 2 | Yes | Paper Appendix D chose 2; can be tuned without reindex |
| EMBEDDING_ENABLE_BATCH_QUERY_EMBEDDING | true | true | Yes | Spec 37 stability/perf default |
| EMBEDDING_QUERY_EMBED_TIMEOUT_SECONDS | 300 | 300 | Yes | Stability knob (Spec 37) |
| EMBEDDING_ENABLE_RETRIEVAL_AUDIT | false | true | Yes | Diagnostics only (Spec 32); recommended ON for research runs |
| EMBEDDING_MIN_REFERENCE_SIMILARITY | 0.0 | 0.3 | Yes | Retrieval-time filter; safe to tune |
| EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM | 0 | 500 | Yes | Retrieval-time budget; safe to tune |
| FEEDBACK_MAX_ITERATIONS | 10 | 10 | Yes | More iterations increases runtime and can change outputs |
| FEEDBACK_SCORE_THRESHOLD | 3 | 3 | Yes | Controls when refinement triggers |
| EMBEDDING_VALIDATION_MAX_REFS_PER_ITEM | 2 | 2 | Yes | Bounds CRAG keep-set per item |
These stay as env vars. Researchers tune them for ablation studies.
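For the index-time settings (the "No" rows above), a guard along these lines would turn a silent change of the retrieval universe into an explicit failure. This is a sketch only: the metadata key names and both function names are hypothetical, since the layout of the real *.meta.json sidecar is not shown here.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical metadata keys for the index-time settings.
INDEX_TIME_KEYS = ("dimension", "chunk_size", "chunk_step")


def index_time_mismatches(
    configured: Mapping[str, int], artifact_meta: Mapping[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map each differing key to (configured value, value stored in the artifact)."""
    return {
        key: (configured[key], artifact_meta.get(key))
        for key in INDEX_TIME_KEYS
        if configured[key] != artifact_meta.get(key)
    }


def require_matching_index(configured: Mapping[str, int], artifact_meta: Mapping[str, int]) -> None:
    """Raise if any index-time setting differs from what the artifact was built with."""
    mismatches = index_time_mismatches(configured, artifact_meta)
    if mismatches:
        raise ValueError(f"Regenerate embeddings; index-time settings differ: {mismatches}")
```

Runtime-only settings such as EMBEDDING_TOP_K_REFERENCES are deliberately excluded from the comparison because they can change without a reindex.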
4. Model Selection (Always Configurable)
Users must be able to swap models:
| Setting | Code Default | .env.example Baseline | Purpose |
|---|---|---|---|
| MODEL_QUALITATIVE_MODEL | gemma3:27b | gemma3:27b-it-qat | Qualitative agent |
| MODEL_JUDGE_MODEL | gemma3:27b | gemma3:27b-it-qat | Judge agent |
| MODEL_META_REVIEW_MODEL | gemma3:27b | gemma3:27b-it-qat | Meta-review agent |
| MODEL_QUANTITATIVE_MODEL | gemma3:27b | gemma3:27b-it-qat | Quantitative agent |
| MODEL_EMBEDDING_MODEL | qwen3-embedding:8b | qwen3-embedding:8b | Embedding model |
| MODEL_TEMPERATURE | 0.0 | 0.0 | Keep 0.0 for reproducibility |
| EMBEDDING_VALIDATION_MODEL | "" (falls back) | (unset) | Effective default is MODEL_JUDGE_MODEL when validation is enabled |
These are always configurable. Different hardware = different models.
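The EMBEDDING_VALIDATION_MODEL fallback described in the table amounts to an "empty string means use the judge model" rule. A minimal sketch, with a hypothetical function name:

```python
def effective_validation_model(validation_model: str, judge_model: str) -> str:
    """Use the explicit validation model if set, else fall back to the judge model."""
    # An empty string is the "unset" sentinel, so `or` selects the fallback.
    return validation_model or judge_model
```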
5. Infrastructure (Always Configurable)
Environment-specific setup:
| Setting | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1 | Ollama server |
| OLLAMA_PORT | 11434 | Ollama port |
| OLLAMA_TIMEOUT_SECONDS | 600 | Request timeout |
| PYDANTIC_AI_TIMEOUT_SECONDS | (unset) | Timeout for Pydantic AI calls |
| LLM_BACKEND | ollama | Chat backend |
| EMBEDDING_BACKEND | huggingface | Embedding backend |
| HF_DEFAULT_CHAT_TIMEOUT | 180 | HuggingFace chat timeout |
| HF_DEFAULT_EMBED_TIMEOUT | 120 | HuggingFace embed timeout |
| DATA_* paths | data/... | Data locations |
| LOG_LEVEL | INFO | Logging verbosity |
| API_HOST, API_PORT | 0.0.0.0:8000 | Server binding |
These are always configurable. Infrastructure varies by deployment.
6. Embedding Artifact Selection (Critical)
Problem Identified: Embedding artifacts and chunk scores must be generated separately for each backend. HuggingFace embeddings (FP16) produce higher-quality similarity scores than Ollama embeddings (Q4_K_M).
| Embeddings File | Backend | Precision | Quality |
|---|---|---|---|
| huggingface_qwen3_8b_paper_train_participant_only | HuggingFace | FP16 | Higher |
| ollama_qwen3_8b_paper_train_participant_only | Ollama | Q4_K_M | Lower |
Recommendation:
- For best quality: Use HuggingFace embeddings (EMBEDDING_BACKEND=huggingface)
- For accessibility: Use Ollama if HuggingFace dependencies are unavailable
Important: EMBEDDING_EMBEDDINGS_FILE and EMBEDDING_BACKEND should be coherent:
- HuggingFace backend → huggingface_* embeddings file
- Ollama backend → ollama_* embeddings file
Chunk Scores Dependency: If EMBEDDING_REFERENCE_SCORE_SOURCE=chunk, the corresponding .chunk_scores.json file must exist for the selected embeddings file.
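The backend/file coherence rule can be checked mechanically if the naming convention above (a `huggingface_` or `ollama_` prefix) holds. The sketch below assumes exactly that convention; `check_backend_coherence` is a hypothetical helper, not an existing function in the codebase.

```python
import os


def check_backend_coherence(backend: str, embeddings_file: str) -> None:
    """Raise if the embeddings file name does not carry the backend's prefix."""
    # Accept either a bare artifact name or a path to it.
    name = os.path.basename(embeddings_file)
    if not name.startswith(f"{backend}_"):
        raise ValueError(
            f"EMBEDDING_BACKEND={backend!r} does not match embeddings file {name!r}; "
            f"expected a name starting with '{backend}_'"
        )
```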
7. Removed Features
These are no longer present in the codebase (historical context is kept under docs/_archive/):
- Keyword backfill (Spec 047)
8. Safety Overrides (Danger Zone)
These bypass safety checks. Require explicit acknowledgment:
| Setting | Default | What It Bypasses |
|---|---|---|
| EMBEDDING_ALLOW_CHUNK_SCORES_PROMPT_HASH_MISMATCH | false | Prompt change detection |
| EMBEDDING_ALLOW_INSUFFICIENT_DIMENSION_EMBEDDINGS | false | Dimension strict mode (Spec 57) |
| --allow-same-model (CLI) | N/A | Scorer circularity check |
| --allow-partial (CLI) | N/A | Fail-fast embedding generation |
These should be OFF by default. Explicit opt-in for known risks.
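A safety override that is "explicit, noisy, and off by default" might look like the following sketch. The function name, logger name, and hash-comparison shape are assumptions for illustration; the real prompt-hash check is not reproduced here.

```python
import logging

logger = logging.getLogger("config_overrides")  # hypothetical logger name


def check_prompt_hash(expected: str, actual: str, allow_mismatch: bool = False) -> bool:
    """Return True on a match; raise on mismatch unless the override is set."""
    if expected == actual:
        return True
    if not allow_mismatch:
        # Default path: a mismatch is a hard failure.
        raise ValueError(f"prompt hash mismatch: expected {expected}, got {actual}")
    # Override path: proceed, but leave an unmissable trace in the logs.
    logger.warning(
        "SAFETY OVERRIDE ACTIVE: prompt hash mismatch ignored (expected %s, got %s)",
        expected,
        actual,
    )
    return False
```

Returning False on the override path lets callers record in run metadata that the result was produced with a bypassed check.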
Decision Framework
When adding a new setting, ask:
1. Is this CORRECT BEHAVIOR vs BROKEN BEHAVIOR?
→ Correct = bake it in, no flag
→ Broken = require explicit opt-in (for legacy only)
2. Is this a RESEARCH HYPERPARAMETER?
→ Yes = make it tunable with env var
→ No = don't add a flag
3. Is this INFRASTRUCTURE-SPECIFIC?
→ Yes = make it configurable
→ No = use sensible default
4. Does this BYPASS SAFETY CHECKS?
→ Yes = default OFF, require explicit opt-in
→ No = default ON if it's correct behavior
Anti-Patterns
DON'T: Add flags for "flexibility"
# BAD: Flag for something that should always be true
ENABLE_STRUCTURED_OUTPUT = True # Why would you disable this?
DON'T: Default broken behavior
# BAD: Broken behavior as default
reference_score_source: str = "participant" # Known to be wrong
DON'T: Hide correctness behind opt-in
# BAD: Correct behavior requires user action
enable_reference_validation: bool = False # CRAG is "gold standard" but OFF?
DO: Make correct behavior the default
# GOOD: Correct behavior, opt-out for legacy
reference_score_source: str = "chunk" # Correct default
# Legacy baseline (paper-derived, known flawed): EMBEDDING_REFERENCE_SCORE_SOURCE=participant
Current State vs Target State
Current (Pre-Ablation)
# To run the full "correct" retrieval pipeline today, ensure these are set.
EMBEDDING_REFERENCE_SCORE_SOURCE=chunk
EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
EMBEDDING_MIN_REFERENCE_SIMILARITY=0.3
EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=500
EMBEDDING_ENABLE_REFERENCE_VALIDATION=true
Problem: 5 flags to get correct behavior. Easy to misconfigure.
Target (Post-Ablation)
# User runs system - it works correctly by default
# (no flags needed)
# ONLY if reproducing paper (broken) baseline:
EMBEDDING_REFERENCE_SCORE_SOURCE=participant
EMBEDDING_ENABLE_ITEM_TAG_FILTER=false
EMBEDDING_MIN_REFERENCE_SIMILARITY=0.0
EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=0
EMBEDDING_ENABLE_REFERENCE_VALIDATION=false
Better: Correct by default. Flags only for legacy reproduction.
Post-Ablation Migration
Validation Gates (Before Consolidation)
Do NOT consolidate defaults until all of these are verified:
- [ ] Spec 35 ablation complete: Chunk scoring vs participant scoring comparison
- [ ] chunk_scores.json artifact exists: Generated by score_reference_chunks.py
- [ ] No regressions: Primary metrics (AURC/AUGRC + MAE + coverage) meet or beat baseline
- [ ] CI passes: All tests green with new defaults
config.py Changes Required
# === EmbeddingSettings ===
# Spec 35: Chunk-level scoring (label-noise reduction)
reference_score_source: Literal["participant", "chunk"] = Field(
    default="chunk",  # CHANGED from "participant"
)
# Spec 34: Item-tag filtering
enable_item_tag_filter: bool = Field(
    default=True,  # CHANGED from False
)
# Spec 33: Retrieval quality guardrails
min_reference_similarity: float = Field(
    default=0.3,  # CHANGED from 0.0
)
max_reference_chars_per_item: int = Field(
    default=500,  # CHANGED from 0
)
# Spec 36: CRAG-style runtime reference validation
enable_reference_validation: bool = Field(
    default=True,  # CHANGED from False
)
Artifact Requirements
For the consolidated defaults to work, these artifacts MUST exist:
| Artifact | Required For | Generated By |
|---|---|---|
| *.npz | All few-shot | generate_embeddings.py |
| *.json | All few-shot | generate_embeddings.py |
| *.meta.json | All few-shot | generate_embeddings.py |
| *.tags.json | Spec 34 | generate_embeddings.py --write-item-tags |
| *.chunk_scores.json | Spec 35 | score_reference_chunks.py |
| *.chunk_scores.meta.json | Spec 35 | score_reference_chunks.py |
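A pre-flight check for the artifacts listed above could be sketched as follows. It assumes that all sidecars share one base name with the .npz file, which the table suggests but does not state; `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def missing_artifacts(base: Path, need_tags: bool, need_chunk_scores: bool) -> List[str]:
    """List required artifact paths that do not exist for the given base name."""
    suffixes = [".npz", ".json", ".meta.json"]  # required for all few-shot runs
    if need_tags:  # Spec 34: item-tag filter
        suffixes.append(".tags.json")
    if need_chunk_scores:  # Spec 35: chunk-level scoring
        suffixes += [".chunk_scores.json", ".chunk_scores.meta.json"]
    # Append suffixes to the name rather than using with_suffix(), so a base
    # name that already contains dots is left intact.
    candidates = [base.parent / (base.name + suffix) for suffix in suffixes]
    return [str(path) for path in candidates if not path.exists()]
```

An empty return value means the consolidated defaults can be enabled for that embeddings file; a non-empty one names exactly which generator script still needs to run.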
Summary Table
| Category | Example | Configurable? | Default |
|---|---|---|---|
| Invariants | Spec 38/39/40 semantics | No | Always ON |
| Post-ablation retrieval | CRAG, chunk scores, tag filter | Yes (for now) | Will be ON |
| Performance/stability | batch query embedding, timeouts | Yes | Default ON |
| Hyperparameters | top_k, thresholds, feedback | Yes | Baseline values |
| Models | quantitative_model | Yes | Code: gemma3:27b |
| Infrastructure | OLLAMA_HOST | Yes | 127.0.0.1 |
| Removed | keyword_backfill (Spec 047) | No | Removed |
| Safety overrides | allow_prompt_mismatch | Yes | Always OFF |
Related Documentation
- Configuration Reference: Full settings reference
- .env.example (repository root): Example configuration
- _archive/misc/HYPOTHESIS-FEWSHOT-DESIGN-FLAW.md: Why participant-level scoring is broken
"Make the right thing easy and the wrong thing hard."