BUG-035: Few-Shot vs Zero-Shot Prompt Confound

Date: 2026-01-06 Status: FIXED Severity: HIGH - Invalidates few-shot vs zero-shot causal claims Discovered By: External agent review (validation audit) Affects: All historical runs comparing zero-shot vs few-shot Fixed In: This commit (2026-01-06)

Executive Summary

When few-shot retrieval returns zero usable references, the prompt still differs from zero-shot because format_for_prompt() returns a wrapper block:

<Reference Examples>
No valid evidence found
</Reference Examples>

Zero-shot passes an empty string, which results in no reference section at all.

Impact: Any claim that "few-shot performs differently than zero-shot" cannot be cleanly attributed to retrieval benefit, because the prompts themselves differ even when retrieval contributes nothing.

Root Cause

File: `src/ai_psychiatrist/services/embedding.py`

# Lines 112-115
if entries:
    return "<Reference Examples>\n\n" + "\n\n".join(entries) + "\n\n</Reference Examples>"

return "<Reference Examples>\nNo valid evidence found\n</Reference Examples>"  # <-- BUG

When no valid reference entries exist, this returns a non-empty wrapper instead of an empty string.

File: `src/ai_psychiatrist/agents/prompts/quantitative.py`

# Lines 102
reference_section = f"\n{reference_bundle}\n" if reference_bundle else ""

This correctly handles empty string (no section), but the embedding service never returns empty string.

Result

Condition	`reference_text`	Prompt includes
Zero-shot	`""`	No reference section
Few-shot (with refs)	`<Reference Examples>...</Reference Examples>`	Reference section with examples
Few-shot (no refs)	`<Reference Examples>\nNo valid evidence found\n</Reference Examples>`	Reference section with "No valid evidence" message

The third row is the bug. Few-shot with no refs should produce the same prompt as zero-shot.

Impact Assessment

Runs Affected

All runs that compare zero-shot vs few-shot modes: - Run 12 (and prior runs) - Any ablation claiming "few-shot helps/hurts"

Claims Invalidated

From docs/results/few-shot-analysis.md:

"Few-shot underperforms zero-shot (MAE 0.616 vs 0.572)"

This claim is confounded. The observed difference could be due to: 1. Actual retrieval effect (intended measurement) 2. Prompt difference from the "No valid evidence found" block (confound) 3. Interaction between the two

We cannot determine which without fixing the confound.

Claims That Remain Valid

Coverage statistics (abstention rates)
Individual mode performance (not comparative)
AURC/AUGRC within each mode

Fix

Option A: Return Empty String (Recommended)

# src/ai_psychiatrist/services/embedding.py
def format_for_prompt(self) -> str:
    entries: list[str] = []
    # ... existing entry building ...

    if entries:
        return "<Reference Examples>\n\n" + "\n\n".join(entries) + "\n\n</Reference Examples>"

    return ""  # Was: "<Reference Examples>\nNo valid evidence found\n</Reference Examples>"

Pros: Clean experimental design. Few-shot with no refs = identical to zero-shot. Cons: Loses observability that retrieval returned nothing (but this is logged elsewhere).

Option B: Keep Wrapper, Add to Zero-Shot

Add the same empty wrapper to zero-shot prompts so both conditions have identical structure.

Cons: Adds unnecessary prompt tokens to zero-shot. Not recommended.

Verification

After fix:

# Verify prompt equality when retrieval is empty
uv run python -c "
from ai_psychiatrist.services.embedding import ReferenceBundle

# Empty bundle
bundle = ReferenceBundle(item_references={})
assert bundle.format_for_prompt() == '', 'Should return empty string'
print('PASS: Empty bundle returns empty string')
"

docs/results/few-shot-analysis.md: Add caveat about confound in historical runs
HYPOTHESES-EXPLAINED.md: Note that "few-shot vs zero-shot" claims need revalidation post-fix
MASTER_BUG_AUDIT.md: Add this bug to the findings table
docs/_specs/index.md: No changes needed (specs 061-063 are independent of this confound)

Testing Plan

Unit test: ReferenceBundle.format_for_prompt() returns "" when no entries
Integration test: Verify prompt content is identical between zero-shot and few-shot-with-no-refs
Regression: Re-run comparison after fix to measure true retrieval effect

Historical Note

This bug has existed since the initial few-shot implementation. It was discovered during an external validation audit on 2026-01-06. The original implementation may have intended the wrapper as an observability feature, but it creates a confound that invalidates comparative claims.

References

src/ai_psychiatrist/services/embedding.py:119 (fix location: empty bundle returns empty string)
src/ai_psychiatrist/agents/prompts/quantitative.py:102 (prompt construction)
tests/unit/services/test_embedding.py:73 (unit regression test)
tests/unit/agents/test_quantitative.py:630 (prompt equality regression test)
docs/results/few-shot-analysis.md (affected analysis)
External validation prompt (discovery source)