RAG Runtime Features

Audience: Researchers configuring few-shot retrieval behavior
Last Updated: 2026-01-07

This document covers runtime features that affect how few-shot retrieval operates: prompt formatting, batch embedding, and CRAG validation.

SSOT implementations:
- src/ai_psychiatrist/services/embedding.py (retrieval + formatting)
- src/ai_psychiatrist/services/reference_validation.py (CRAG validation)

Prompt Format (Reference Examples)

Few-shot mode retrieves reference chunks from a training split and inserts them into the scoring prompt as "reference examples".

Reference Entry Format

Each included reference is formatted as:

({EVIDENCE_KEY} Score: {SCORE})
{CHUNK_TEXT}

Where:
- {EVIDENCE_KEY} is PHQ8_{item.value} (e.g., PHQ8_Sleep)
- {SCORE} is an integer 0..3
- {CHUNK_TEXT} is the raw chunk text (may contain internal newlines)

References with reference_score=None are omitted.
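The entry format can be illustrated with a minimal sketch. The `Reference` dataclass and `format_entry` helper below are hypothetical names introduced for illustration; they are not taken from embedding.py, and the real implementation may structure this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    """Hypothetical stand-in for a retrieved reference chunk."""
    item_value: str                  # e.g. "Sleep", rendered as PHQ8_Sleep
    chunk_text: str                  # raw text, may contain newlines
    reference_score: Optional[int]   # 0..3, or None when no label is available


def format_entry(ref: Reference) -> Optional[str]:
    """Return the formatted entry, or None when the reference is omitted."""
    if ref.reference_score is None:
        return None  # unlabeled references never reach the prompt
    header = f"(PHQ8_{ref.item_value} Score: {ref.reference_score})"
    return f"{header}\n{ref.chunk_text}"


entry = format_entry(Reference("Sleep", "I barely sleep.\nMaybe three hours.", 2))
skipped = format_entry(Reference("Sleep", "unlabeled chunk", None))
```

Here `entry` starts with the `(PHQ8_Sleep Score: 2)` header line followed by the untouched chunk text, and `skipped` is `None`.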

Reference Bundle Format

All reference entries across all items are merged into a single block:

<Reference Examples>

{entry_1}

{entry_2}

...

</Reference Examples>

If no entries survive filtering:

- Current behavior (post BUG-035): emit an empty string (the <Reference Examples> block is omitted).
- Historical behavior (pre BUG-035): some runs inserted a sentinel wrapper containing “No valid evidence found”.
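A rough sketch of the bundling rule, including the post BUG-035 empty case, might look like the following. `format_bundle` is a hypothetical helper, and the exact blank-line layout is an assumption read off the template above.

```python
def format_bundle(entries: list[str]) -> str:
    """Wrap already-formatted entries in a single <Reference Examples> block."""
    if not entries:
        # Post BUG-035 behavior: no wrapper and no sentinel text, just "".
        return ""
    body = "\n\n".join(entries)  # one blank line between entries (assumed)
    return f"<Reference Examples>\n\n{body}\n\n</Reference Examples>"


empty = format_bundle([])
bundle = format_bundle(["(PHQ8_Sleep Score: 2)\nchunk a", "(PHQ8_Sleep Score: 1)\nchunk b"])
```

Returning an empty string means the caller can concatenate the result into the scoring prompt without a special case for "no references".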

Ordering Rules

Ordering is deterministic:
1. Items are iterated in PHQ8Item.all_items() order.
2. Within an item, references are emitted in retrieval order (similarity-sorted).
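The two ordering rules can be sketched as below. `ITEM_ORDER` is a hypothetical placeholder for `PHQ8Item.all_items()`, and the sketch assumes "similarity-sorted" means most similar first; neither detail is confirmed by the source.

```python
# Hypothetical item names standing in for PHQ8Item.all_items(); the real order may differ.
ITEM_ORDER = ["ItemA", "ItemB", "ItemC"]


def emission_order(per_item: dict[str, list[tuple[float, str]]]) -> list[tuple[str, str]]:
    """Flatten per-item (similarity, text) pairs into emission order."""
    ordered = []
    for item in ITEM_ORDER:  # rule 1: fixed item order, not dict insertion order
        refs = per_item.get(item, [])
        # rule 2: most similar first (assumed); sorted() is stable, so ties keep their order
        for _similarity, text in sorted(refs, key=lambda r: r[0], reverse=True):
            ordered.append((item, text))
    return ordered


retrieved = {
    "ItemC": [(0.91, "c1")],
    "ItemA": [(0.62, "a2"), (0.88, "a1")],
}
order = emission_order(retrieved)
```

Even though `ItemC` was inserted into the dict first, `ItemA` references come out first, which is what makes the prompt reproducible across runs.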

Paper Notebook vs Current Code

The paper notebook used an unusual delimiter style (<Reference Examples>...<Reference Examples>). Current code uses proper XML-style closing tags (</Reference Examples>). This was an intentional fix in Spec 33.

Batch Query Embedding (Spec 37)

Spec 37 is a performance + reliability fix:
- Before: up to 8 sequential query embeddings per participant
- After: 1 batch query embedding per participant

This fixes timeout failures from repeated embedding calls.

Configuration

# Enable batch embedding (default: true)
EMBEDDING_ENABLE_BATCH_QUERY_EMBEDDING=true

# Query embedding timeout in seconds (default: 300)
EMBEDDING_QUERY_EMBED_TIMEOUT_SECONDS=300

Why This Exists

Few-shot retrieval embeds the query evidence to find similar reference chunks. Evidence is extracted per PHQ-8 item, so a participant can produce up to 8 evidence texts.

Historically, these were embedded one-by-one:
- Up to 8 embeddings × 41 participants = up to 328 calls
- High timeout exposure, since every call is a separate chance to time out

Spec 37 reduces this to 1 embedding operation per participant.
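To make the call-count difference concrete, here is a toy sketch. `CountingEmbedder` is a hypothetical client that only counts backend calls and returns a fake two-number vector; it is not the project's embedding API.

```python
class CountingEmbedder:
    """Hypothetical embedding client that counts backend calls."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        self.calls += 1  # one backend round-trip per invocation
        # Toy vector: text length and number of spaces (not a real embedding).
        return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_sequential(embedder: CountingEmbedder, texts: list[str]) -> list[list[float]]:
    """Pre-Spec 37 pattern: one call per evidence text."""
    return [embedder.embed_batch([t])[0] for t in texts]


def embed_batched(embedder: CountingEmbedder, texts: list[str]) -> list[list[float]]:
    """Spec 37 pattern: one call for all of a participant's evidence texts."""
    return embedder.embed_batch(texts)


evidence = [f"evidence for item {i}" for i in range(8)]  # worst case: 8 items
seq, bat = CountingEmbedder(), CountingEmbedder()
seq_vecs = embed_sequential(seq, evidence)
bat_vecs = embed_batched(bat, evidence)
```

Both paths produce the same vectors in this sketch; only the number of round-trips differs (8 versus 1 for a participant with evidence on every item).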

Verification

Run few-shot on a small limit:

uv run python scripts/reproduce_results.py --split paper-test --few-shot-only --limit 3

If you see "LLM request timed out after …s", confirm that EMBEDDING_QUERY_EMBED_TIMEOUT_SECONDS matches that value and that EMBEDDING_ENABLE_BATCH_QUERY_EMBEDDING=true.

CRAG Reference Validation (Spec 36)

CRAG-style validation adds a second LLM step after retrieval:
1. Retrieve candidate reference chunks
2. Validate each reference against the item + evidence (accept / reject / unsure)
3. Include only accept references in the few-shot prompt

Enable CRAG Validation

EMBEDDING_ENABLE_REFERENCE_VALIDATION=true

# Optional: specify validation model (defaults to MODEL_JUDGE_MODEL)
EMBEDDING_VALIDATION_MODEL=gemma3:27b-it-qat

# Optional: max accepted refs per item after validation (default: 2)
EMBEDDING_VALIDATION_MAX_REFS_PER_ITEM=2
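The accept-only filter and the per-item cap could be sketched as follows. `filter_references` and the `validate` callback are hypothetical; in particular, whether the real code validates every candidate before capping or stops early is an assumption here.

```python
from typing import Callable


def filter_references(
    refs: list[str],
    validate: Callable[[str], str],  # returns "accept", "reject", or "unsure"
    max_refs: int = 2,               # mirrors EMBEDDING_VALIDATION_MAX_REFS_PER_ITEM
) -> list[str]:
    """Keep only accepted references for one item, capped at max_refs."""
    accepted = [ref for ref in refs if validate(ref) == "accept"]
    return accepted[:max_refs]  # retrieval order is preserved


# Canned validator decisions standing in for the second LLM call.
decisions = {"r1": "accept", "r2": "unsure", "r3": "reject", "r4": "accept", "r5": "accept"}
kept = filter_references(list(decisions), decisions.__getitem__)
```

With the default cap of 2, `r5` is dropped even though it was accepted, because `r1` and `r4` already fill the budget.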

Fail-Fast Semantics (Spec 38)

If validation is enabled, it must work or crash:
- invalid JSON responses raise LLMResponseParseError
- network/backend failures propagate (preserve exception type)

unsure is a first-class validator output and is treated like reject (filtered out). There is no silent fallback for validation failures.
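A minimal sketch of these semantics is shown below. The `LLMResponseParseError` class is a local stand-in for the project's exception of the same name, and the `"decision"` JSON key is a hypothetical response shape; the actual validator schema is not given in this document.

```python
import json


class LLMResponseParseError(Exception):
    """Local stand-in for the project's exception of the same name."""


VALID_DECISIONS = {"accept", "reject", "unsure"}


def parse_decision(raw: str) -> str:
    """Parse a validator response, raising instead of guessing on bad input."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise LLMResponseParseError(f"invalid JSON from validator: {raw!r}") from exc
    decision = payload.get("decision") if isinstance(payload, dict) else None
    if not isinstance(decision, str) or decision not in VALID_DECISIONS:
        raise LLMResponseParseError(f"unexpected validator output: {raw!r}")
    return decision


def is_kept(raw: str) -> bool:
    """Only "accept" survives; "unsure" is filtered out exactly like "reject"."""
    return parse_decision(raw) == "accept"
```

Note that there is no `except Exception: return True` style fallback anywhere: a malformed response stops the run rather than silently letting an unvalidated reference through.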

What CRAG Can and Cannot Fix

CRAG validation is a filter, not a relabeler:
- It can reject irrelevant or contradictory references
- It cannot correct a wrong reference_score label (that's Spec 35's job)

Recommended Layering

Spec 35 (chunk scores) for label correctness
Spec 34 (item tags) for candidate set precision
Spec 33 (threshold/budget) for quality guardrails
Spec 36 (CRAG) for semantic validation

Pipeline Flow Summary

1. Extract evidence per PHQ-8 item from qualitative assessment
2. Batch embed all evidence texts (Spec 37)
3. For each item with evidence:
   a. Compute similarities against all reference chunks (vectorized cosine)
   b. If enabled, filter candidates to chunks tagged for that item (Spec 34)
   c. Attach reference scores (participant-level or chunk-level per Spec 35)
   d. Drop references below `EMBEDDING_MIN_REFERENCE_SIMILARITY` (Spec 33)
   e. Take top-k references (`EMBEDDING_TOP_K_REFERENCES`)
   f. Apply per-item char budget (`EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM`) (Spec 33)
   g. Apply CRAG validation if enabled; keep only `accept` references (Spec 36)
4. Format unified <Reference Examples> block
5. Insert into quantitative scoring prompt
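Steps 3a, 3d, 3e, and 3f can be sketched for a single item as below. This is a pure-Python illustration under stated assumptions: `select_references` is a hypothetical helper, the real code is described as vectorized, and whether the char budget stops at the first overflowing chunk or skips it and continues is a guess. Item-tag filtering, score attachment, and CRAG validation are left out.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_references(
    query_vec: list[float],
    chunks: list[tuple[str, list[float]]],  # (chunk text, chunk embedding)
    min_similarity: float,                  # cf. EMBEDDING_MIN_REFERENCE_SIMILARITY
    top_k: int,                             # cf. EMBEDDING_TOP_K_REFERENCES
    max_chars: int,                         # cf. EMBEDDING_MAX_REFERENCE_CHARS_PER_ITEM
) -> list[str]:
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]  # step 3a
    scored = [s for s in scored if s[0] >= min_similarity]             # step 3d
    scored.sort(key=lambda s: s[0], reverse=True)
    selected, used = [], 0
    for _similarity, text in scored[:top_k]:                           # step 3e
        if used + len(text) > max_chars:                               # step 3f
            break  # assumption: stop at the first chunk that would exceed the budget
        selected.append(text)
        used += len(text)
    return selected


chunks = [("aaaa", [1.0, 0.0]), ("bbbb", [1.0, 1.0]), ("cccc", [0.0, 1.0]), ("dddd", [2.0, 1.0])]
roomy = select_references([1.0, 0.0], chunks, min_similarity=0.5, top_k=2, max_chars=100)
tight = select_references([1.0, 0.0], chunks, min_similarity=0.5, top_k=2, max_chars=6)
```

In this toy data, "cccc" is orthogonal to the query and falls below the threshold, "bbbb" passes the threshold but loses the top-k cut, and the tighter character budget leaves room for only the best match.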

Artifact generation: artifact-generation.md
Chunk-level scoring: chunk-scoring.md
Debugging: debugging.md
Feature index: docs/pipeline-internals/features.md