RAG Debugging (Audit Logs + Guardrails)
Audience: Researchers debugging few-shot performance
Last Updated: 2026-01-07
This guide explains how to debug few-shot retrieval using the built-in diagnostic tools:
- retrieval audit logs (Spec 32)
- similarity threshold + per-item budgets (Spec 33)
- item tag filtering (Spec 34 + Spec 38 semantics)
- CRAG validation decisions (Spec 36)
Step 0: Confirm What Method You Ran
Before debugging retrieval quality, confirm run configuration:
- scripts/reproduce_results.py prints the effective settings at startup
- your output JSON stores per-experiment settings in experiments[*].provenance (run_metadata is run-level environment info)
If you can’t explain exactly which features were enabled, do not interpret the results.
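For a finished run, both can be inspected directly from the output JSON. A sketch, assuming run_metadata and experiments are top-level keys and using a placeholder output filename:

```bash
jq '.run_metadata, .experiments[0].provenance' data/outputs/results_19b42478.json
```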
Step 1: Enable Retrieval Audit Logs (Spec 32)
Set:
EMBEDDING_ENABLE_RETRIEVAL_AUDIT=true
You should see retrieved_reference log events with fields:
- item, evidence_key
- rank, similarity
- participant_id, reference_score
- chunk_hash, chunk_chars (no raw transcript text)
Each event is emitted after retrieval post-processing (threshold + top-k + budgets + CRAG filtering), so the log reflects only the references that actually reached the prompt.
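For quick triage, the events can be aggregated per item. A minimal sketch, assuming the events are serialized as JSON lines to run.log (the path and serialization format are assumptions about your logging setup):

```python
# Sketch: per-item similarity summary from retrieved_reference audit events.
# Assumes JSON-lines log output at run.log; adapt path/parsing to your setup.
import json
from collections import defaultdict

sims = defaultdict(list)
with open("run.log") as fh:
    for line in fh:
        if '"retrieved_reference"' not in line:
            continue
        event = json.loads(line)
        sims[event["item"]].append(event["similarity"])

for item, values in sorted(sims.items()):
    print(f"{item}: n={len(values)} min={min(values):.3f} max={max(values):.3f}")
```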
Step 2: Triage Common Failure Modes
A) Low Similarity References (Spec 33)
Symptom: top references have low similarity (e.g., < 0.3)
Mitigations:
- raise EMBEDDING_MIN_REFERENCE_SIMILARITY
- ensure the embedding backend is correct (EMBEDDING_BACKEND=huggingface is higher precision)
B) Prompt Bloat / Drowning (Spec 33)
Symptom: many long chunks dominate the prompt, drowning out evidence
Mitigations:
- set EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM (e.g., 500–2000)
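For example, a configuration applying both Spec 33 guardrails might look like this (the values are illustrative, not recommended defaults):

```bash
export EMBEDDING_MIN_REFERENCE_SIMILARITY=0.3       # drop references below this similarity
export EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM=1000  # cap reference chars per PHQ-8 item
```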
C) Wrong-Item Retrieval (Spec 34)
Symptom: references talk about the wrong PHQ-8 item (a “sleep” query pulls “failure” content)
Mitigations:
- regenerate embeddings with --write-item-tags to produce {emb}.tags.json
- set EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
Fail-fast note (Spec 38): if filtering is enabled and {emb}.tags.json is missing or invalid, the run should crash rather than silently proceed without filtering.
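Put together, the fix might look like this (any additional arguments scripts/generate_embeddings.py needs for your setup are omitted):

```bash
# Regenerate artifacts so {emb}.tags.json is written alongside the embeddings,
# then enable filtering for the next run.
python scripts/generate_embeddings.py --write-item-tags
export EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
```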
D) Semantically Irrelevant References (Spec 36)
Symptom: similarity is high but the chunk is clinically irrelevant or contradictory
Mitigations:
- enable CRAG validation: EMBEDDING_ENABLE_REFERENCE_VALIDATION=true
- optionally set EMBEDDING_VALIDATION_MODEL (if unset, runners fall back to MODEL_JUDGE_MODEL)
Step 3: Check Artifact Preconditions
Few-shot retrieval requires:
- {emb}.npz and {emb}.json
- {emb}.meta.json (expected for modern artifacts; enables fail-fast mismatch detection)
Conditionally required (only when the corresponding feature is enabled):
- {emb}.tags.json if EMBEDDING_ENABLE_ITEM_TAG_FILTER=true
- {emb}.chunk_scores.json + {emb}.chunk_scores.meta.json if EMBEDDING_REFERENCE_SCORE_SOURCE=chunk
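A pre-flight check along these lines catches missing artifacts before a run spends any tokens (the {emb} prefix below is a hypothetical example path):

```python
# Sketch: verify artifact preconditions before a few-shot run.
# The emb prefix is a hypothetical example; adapt to your artifact layout.
import os
from pathlib import Path

emb = "data/embeddings/train"
required = [f"{emb}.npz", f"{emb}.json", f"{emb}.meta.json"]
if os.environ.get("EMBEDDING_ENABLE_ITEM_TAG_FILTER") == "true":
    required.append(f"{emb}.tags.json")
if os.environ.get("EMBEDDING_REFERENCE_SCORE_SOURCE") == "chunk":
    required += [f"{emb}.chunk_scores.json", f"{emb}.chunk_scores.meta.json"]

missing = [p for p in required if not Path(p).is_file()]
if missing:
    raise FileNotFoundError(f"missing embedding artifacts: {missing}")
```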
See:
- Artifact generation
- Chunk scoring
Step 4: Interpret Empty Reference Bundles (Missing <Reference Examples> Block)
Current behavior (post BUG-035): if no reference entries survive filtering, the reference bundle
formats to an empty string and the <Reference Examples> block is omitted from the scoring prompt.
If the scoring prompt contains no <Reference Examples> section, it can mean:
- evidence extraction produced no usable evidence (no query embeddings)
- retrieval found no matches above the similarity threshold
- all matches had reference_score=None (common with chunk scores when the chunk is non-evidentiary)
- CRAG validation rejected all candidate references
Historical note: older runs (pre BUG-035) may show a sentinel wrapper containing the string “No valid evidence found”. Treat that as a prompt confound artifact, not current behavior.
Use retrieval audit logs + the failure/telemetry registries to disambiguate which case occurred.
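A per-participant cross-check can narrow it down. A sketch, where the log path and the registry's internal field names (failures, participant_id, stage, category) are assumptions:

```python
# Sketch: cross-check one participant across the audit log and failure registry.
# Zero logged events confirms the bundle was empty; registry entries reveal
# whether evidence extraction or another stage failed for that participant.
import json

participant = "373"

retrieved = 0
with open("run.log") as fh:
    for line in fh:
        if '"retrieved_reference"' in line and f'"{participant}"' in line:
            retrieved += 1
print(f"post-filter references logged for {participant}: {retrieved}")

with open("data/outputs/failures_19b42478.json") as fh:
    registry = json.load(fh)
for failure in registry.get("failures", []):  # top-level key is an assumption
    if str(failure.get("participant_id")) == participant:
        print(failure.get("stage"), failure.get("category"))
```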
Step 5: Check Failure Registry (Spec 056)
After each evaluation run, check data/outputs/failures_{run_id}.json:
jq '.summary' data/outputs/failures_19b42478.json
The failure registry categorizes failures by:
- Category: evidence_json_parse, embedding_nan, scoring_pydantic_retry_exhausted, etc.
- Severity: fatal, error, warning, info
- Stage: evidence_extraction, embedding_generation, scoring
- Participant: which participants fail most often
Use this to identify systematic issues (e.g., "participant 373 always fails on evidence extraction").
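To surface the worst offenders, a jq aggregation over the registry works; note that the .failures array and participant_id field are assumptions about the file's internal schema:

```bash
jq '[.failures[].participant_id] | group_by(.) | map({id: .[0], n: length}) | sort_by(-.n)' \
  data/outputs/failures_19b42478.json
```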
Step 5b: Check Retry Telemetry (Spec 060)
Even when a run succeeds, the system can be “quietly brittle” (many retries, frequent JSON repair).
After each run, check data/outputs/telemetry_{run_id}.json:
jq '.summary' data/outputs/telemetry_19b42478.json
This captures:
- PydanticAI retry triggers (ModelRetry) by extractor
- JSON repair usage (tolerant_json_fixups, python-literal fallback, json-repair)
If dropped_events is non-zero, the run hit the telemetry event cap (defaults to 5,000). Treat that as a sign of extreme brittleness.
If these counts spike, treat it as a regression risk even if MAE/AUGRC look good.
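Since the useful signal is a spike relative to a previous run, diffing two telemetry summaries is a natural check. A sketch, assuming only that each file has a top-level summary object of counters (the baseline filename is a placeholder):

```python
# Sketch: diff telemetry summaries between a baseline and a new run to catch
# retry/repair spikes. Treats .summary as flat counters; the isinstance guard
# skips any nested (non-numeric) values.
import json

def summary(path: str) -> dict:
    with open(path) as fh:
        return json.load(fh).get("summary", {})

base = summary("data/outputs/telemetry_baseline.json")  # hypothetical baseline run
new = summary("data/outputs/telemetry_19b42478.json")

for key in sorted(set(base) | set(new)):
    b, n = base.get(key, 0), new.get(key, 0)
    if isinstance(b, (int, float)) and isinstance(n, (int, float)) and n > 2 * max(b, 1):
        print(f"spike in {key}: {b} -> {n}")
```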
Step 6: Diagnose Embedding Failures (Spec 055)
If you see EmbeddingValidationError:
| Error Pattern | Likely Cause | Fix |
|---|---|---|
| NaN detected | Malformed input to embedding backend | Check transcript preprocessing |
| Inf detected | Numerical overflow | Check embedding model/backend |
| All-zero vector | Empty or whitespace-only input | Check chunking configuration |
- At generation time: regenerate artifacts with scripts/generate_embeddings.py
- At runtime: check the query embedding input (evidence text may be empty or corrupted)
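The three patterns in the table correspond to simple NumPy checks; a sketch for inspecting a suspect artifact offline (the array key inside the .npz is an assumption, so the sketch just takes the first one):

```python
# Sketch: reproduce the three validation checks against an embeddings artifact.
import numpy as np

npz = np.load("data/embeddings/train.npz")  # hypothetical artifact path
vectors = npz[npz.files[0]]                 # key name is an assumption; check npz.files

print("rows with NaN:", int(np.isnan(vectors).any(axis=1).sum()))
print("rows with Inf:", int(np.isinf(vectors).any(axis=1).sum()))
print("all-zero rows:", int((~vectors.any(axis=1)).sum()))
```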
Related Docs
- Feature index: docs/pipeline-internals/features.md
- Runtime features: runtime-features.md
- Error-handling philosophy: docs/developer/error-handling.md
- Failure registry: docs/developer/error-handling.md#failure-pattern-observability-spec-056