SPEC-042: LLM Synthesizer Implementation

Status: 🟡 Phase 1 implemented (2026-01-19)
Priority: P1 (High - Required for agent system value)
Created: 2026-01-18
Promoted From: FUTURE-007
Owner: Solo
Effort: ~2-3 days (Phase 1)

Summary

Implement a real LLM-based synthesizer to replace MockSynthesizer in the agent analysis workflow. Without this, kalshi agent analyze returns meaningless "+5% from market" predictions.

This spec resolved DEBT-037, which previously blocked the entire agent system value proposition.

Model Choice: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) - pinned model ID for reproducibility (confirmed in Anthropic model docs).

Goals

Implement Claude Sonnet 4.5 synthesizer using Anthropic's native structured outputs
Ensure structured output validation via Pydantic + tool-use JSON schema validation
Make backend configurable via environment variable (default: anthropic)
Add cost tracking for LLM calls
Preserve MockSynthesizer so CI and tests can run without API calls
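
As a rough illustration of the validation goal, the sketch below checks a tool-call payload before it is trusted. It is a dependency-free stand-in, not the Pydantic model itself: `AnalysisSketch` and `validate_tool_input` are hypothetical names, and the four fields are assumptions borrowed from the unit-test fixture later in this spec. The point is that tool input is model-generated, so range and enum checks still have to run on the receiving side.

```python
from dataclasses import dataclass
from typing import Any

ALLOWED_CONFIDENCE = {"low", "medium", "high"}
REQUIRED_FIELDS = {"ticker", "predicted_prob", "confidence", "reasoning"}


@dataclass(frozen=True)
class AnalysisSketch:
    """Stand-in for the Pydantic AnalysisResult (field set is an assumption)."""

    ticker: str
    predicted_prob: float
    confidence: str
    reasoning: str


def validate_tool_input(raw: dict[str, Any]) -> AnalysisSketch:
    """Reject tool input that is missing fields or out of range."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    prob = raw["predicted_prob"]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)):
        raise ValueError(f"predicted_prob must be numeric, got {prob!r}")
    if not 0 <= prob <= 100:
        raise ValueError(f"predicted_prob must be in [0, 100], got {prob!r}")

    confidence = raw["confidence"]
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"confidence must be one of {sorted(ALLOWED_CONFIDENCE)}")

    return AnalysisSketch(
        ticker=str(raw["ticker"]),
        predicted_prob=float(prob),
        confidence=confidence,
        reasoning=str(raw["reasoning"]),
    )
```

In the real implementation `AnalysisResult.model_validate(...)` plays this role; the sketch only shows which kinds of failures that step is expected to catch.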

Non-Goals

Fine-tuning models
Multi-agent debate/consensus
Real-time streaming responses
Multi-provider support in Phase 1 (stick with Anthropic until stable)

SSOT (What's True Today)

Protocol defined: StructuredSynthesizer in src/kalshi_research/agent/providers/llm.py
Mock implementation: MockSynthesizer returns market_price + 5%
CLI configurable: src/kalshi_research/cli/agent.py uses get_synthesizer() (env-controlled backend)
Schemas exist: SynthesisInput in providers/llm.py, AnalysisResult in schemas.py
Warnings: When mock is active, JSON output includes a "warning" field (and human output prints a warning)

Model Selection Rationale

Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`)

Pinned model ID: Use a dated model ID for reproducibility (claude-sonnet-4-5-20250929) and optionally allow an alias via configuration.
Structured output: Use Anthropic tool use with a forced tool call, then validate the tool input with Pydantic, so every response is machine-readable and schema-checked.
Cost tracking: Track token usage and compute USD cost using pricing from Anthropic vendor docs at implementation time (do not hardcode numbers in the spec).
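
The cost-tracking point can be sketched as a small accumulator with an optional spending cap. Everything here is illustrative: `UsageTracker` and `BudgetExceededError` are hypothetical names, `max_cost_usd` mirrors the budget option hinted at in the factory comment below, and the per-million rates are passed in by the caller because this spec deliberately does not hardcode pricing.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceededError(RuntimeError):
    """Raised when accumulated spend reaches the configured cap."""


@dataclass
class UsageTracker:
    input_usd_per_m: float   # USD per 1M input tokens (from vendor docs)
    output_usd_per_m: float  # USD per 1M output tokens (from vendor docs)
    max_cost_usd: Optional[float] = None
    total_tokens: int = 0
    total_cost_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one call's usage and return that call's cost in USD."""
        cost = (
            input_tokens * self.input_usd_per_m
            + output_tokens * self.output_usd_per_m
        ) / 1_000_000
        self.total_tokens += input_tokens + output_tokens
        self.total_cost_usd += cost
        return cost

    def check_budget(self) -> None:
        """Raise if the cap is set and has been reached."""
        if self.max_cost_usd is not None and self.total_cost_usd >= self.max_cost_usd:
            raise BudgetExceededError(
                f"spent ${self.total_cost_usd:.4f} of ${self.max_cost_usd:.4f} budget"
            )
```

Calling `check_budget()` before each request, and `record(...)` after each response, would stop a run once the cap is hit while still counting the call that crossed it.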

Architecture

Synthesizer Selection

# src/kalshi_research/agent/providers/llm.py

import os

def get_synthesizer(backend: str | None = None) -> StructuredSynthesizer:
    """Factory function to create synthesizer based on config."""
    backend = backend or os.getenv("KALSHI_SYNTHESIZER_BACKEND", "anthropic")

    if backend == "mock":
        return MockSynthesizer()
    elif backend == "anthropic":
        return ClaudeSynthesizer()  # optional: budget via max_cost_usd
    else:
        raise ValueError(f"Unknown synthesizer backend: {backend}")
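
The selection order in the factory (explicit argument, then environment variable, then the `anthropic` default) can be checked in isolation. The sketch below is a simplified stand-in: `resolve_backend` is a hypothetical helper, and it takes the environment as a parameter so it can be tested without touching `os.environ`. The whitespace and case normalization is an addition for illustration, not something the factory above does.

```python
import os
from typing import Mapping, Optional

KNOWN_BACKENDS = ("mock", "anthropic")


def resolve_backend(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the backend name: explicit argument > env var > 'anthropic'."""
    env = os.environ if env is None else env
    name = explicit or env.get("KALSHI_SYNTHESIZER_BACKEND", "anthropic")
    name = name.strip().lower()  # normalization is an assumption of this sketch
    if name not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown synthesizer backend: {name}")
    return name
```

Passing `env` explicitly keeps tests hermetic; production code would call `resolve_backend()` with no arguments and read the real environment.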

ClaudeSynthesizer (Phase 1 - Primary)

from anthropic import AsyncAnthropic

# Frontier model - Claude Sonnet 4.5
CLAUDE_MODEL = "claude-sonnet-4-5-20250929"

# Pricing constants (USD per 1M tokens). Fill from vendor docs at implementation time.
INPUT_USD_PER_M: float = ...
OUTPUT_USD_PER_M: float = ...

class ClaudeSynthesizer:
    """LLM synthesizer using Claude Sonnet 4.5 with native structured outputs."""

    def __init__(self, model: str = CLAUDE_MODEL):
        self.client = AsyncAnthropic()
        self.model = model
        self._total_tokens = 0
        self._total_cost_usd = 0.0

    async def synthesize(self, *, input: SynthesisInput) -> AnalysisResult:
        """Synthesize probability estimate from market and research data."""
        response = await self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            # If required for structured outputs, set the beta header per Anthropic docs.
            # extra_headers={"anthropic-beta": "structured-outputs-YYYY-MM-DD"},
            tools=[{
                "name": "submit_analysis",
                "description": "Submit your probability analysis for this market",
                "input_schema": AnalysisResult.model_json_schema()
            }],
            tool_choice={"type": "tool", "name": "submit_analysis"},
            messages=[{
                "role": "user",
                "content": self._build_prompt(input)
            }],
            system=SYSTEM_PROMPT,
        )

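        # Track costs first: tokens are billed even if extraction or validation fails
        self._track_usage(response)

        # Extract the forced tool call. A bare next() would raise StopIteration,
        # which surfaces as an opaque RuntimeError inside a coroutine.
        tool_use = next(
            (block for block in response.content if block.type == "tool_use"),
            None,
        )
        if tool_use is None:
            raise ValueError("Claude response contained no tool_use block")

        # Validate against the Pydantic schema and return
        return AnalysisResult.model_validate(tool_use.input)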

    def _build_prompt(self, input: SynthesisInput) -> str:
        """Build prompt from market info, price snapshot, and research."""
        research_factors = input.research.factors if input.research else []
        return ANALYSIS_PROMPT_TEMPLATE.format(
            ticker=input.market.ticker,
            title=input.market.title,
            subtitle=input.market.subtitle,
            close_time=input.market.close_time.isoformat(),
            current_prob=f"{input.snapshot.midpoint_prob:.1%}",
            yes_bid=input.snapshot.yes_bid_cents,
            yes_ask=input.snapshot.yes_ask_cents,
            spread=input.snapshot.spread_cents,
            volume_24h=input.snapshot.volume_24h,
            factors=self._format_research_factors(research_factors),
        )

    def _track_usage(self, response) -> None:
        """Track token usage and costs."""
        self._total_tokens += response.usage.input_tokens + response.usage.output_tokens
        # Compute cost using pricing constants sourced from vendor docs at implementation time.
        self._total_cost_usd += (
            response.usage.input_tokens * INPUT_USD_PER_M / 1_000_000 +
            response.usage.output_tokens * OUTPUT_USD_PER_M / 1_000_000
        )

    def _format_research_factors(self, factors: list[Factor]) -> str:
        """Format ResearchSummary factors for prompt."""
        if not factors:
            return "No factors identified"
        return "\n".join(
            f"- {f.factor_text} (source: {f.source_url})"
            for f in factors
        )
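
The prompt-building helpers above can be tried without the API. The following is a cut-down sketch: `MINI_TEMPLATE` and `render_prompt` are hypothetical, factors are passed as plain `(text, url)` tuples instead of `Factor` objects, and the ticker and URL are made-up sample values. It shows the two formatting details that are easy to get wrong: the `:.1%` format spec and the empty-factors fallback.

```python
MINI_TEMPLATE = (
    "## Market: {ticker}\n"
    "- Current implied probability: {current_prob}\n"
    "- Yes bid/ask: {yes_bid}¢ / {yes_ask}¢ (spread: {spread}¢)\n"
    "### Research Factors\n"
    "{factors}\n"
)


def render_prompt(
    ticker: str,
    midpoint_prob: float,
    yes_bid: int,
    yes_ask: int,
    factors: list[tuple[str, str]],
) -> str:
    """Fill the template; midpoint_prob is a fraction in [0, 1]."""
    factor_lines = "\n".join(
        f"- {text} (source: {url})" for text, url in factors
    ) or "No factors identified"
    return MINI_TEMPLATE.format(
        ticker=ticker,
        current_prob=f"{midpoint_prob:.1%}",  # 0.625 renders as "62.5%"
        yes_bid=yes_bid,
        yes_ask=yes_ask,
        spread=yes_ask - yes_bid,
        factors=factor_lines,
    )


print(render_prompt("EXAMPLE-TICKER", 0.625, 61, 64, [("Factor A", "https://example.com/a")]))
```

Because `str.format` only interprets braces in the template itself, factor text containing `{` or `}` is inserted verbatim; literal braces in the template would need doubling.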

Prompt Template

SYSTEM_PROMPT = """You are a prediction market analyst specializing in probability estimation.
Given market information and research, estimate the probability of the YES outcome.

Key principles:
1. Be calibrated - your 70% predictions should resolve YES ~70% of the time
2. Use research evidence to inform estimates, but acknowledge uncertainty
3. Consider base rates and reference classes
4. Be aware that markets can be wrong - your edge comes from research
5. Express genuine uncertainty through your confidence level

You will use the submit_analysis tool to provide your structured analysis."""

ANALYSIS_PROMPT_TEMPLATE = """
## Market: {ticker}
**{title}**
{subtitle}

### Current Market State
- Market closes: {close_time}
- Current implied probability: {current_prob}
- Yes bid/ask: {yes_bid}¢ / {yes_ask}¢ (spread: {spread}¢)
- 24h volume: {volume_24h} contracts

### Research Factors
{factors}

---

Analyze this market and provide:
1. Your probability estimate (0-100) for YES
2. Your confidence level (low/medium/high) based on research quality
3. Clear reasoning citing specific evidence
4. Key sources that informed your estimate

Consider:
- What does the research suggest vs market price?
- What uncertainties or information gaps remain?
- Are there base rates or reference classes to consider?
"""

CLI Changes

# src/kalshi_research/cli/agent.py

import os

from kalshi_research.agent.providers.llm import get_synthesizer, MockSynthesizer

# In analyze command:
backend = os.getenv("KALSHI_SYNTHESIZER_BACKEND", "anthropic")
synthesizer = get_synthesizer(backend)

# Warn if mock
if isinstance(synthesizer, MockSynthesizer):
    if human or not output_json:
        console.print(
            "[yellow]Warning:[/yellow] Using MockSynthesizer. "
            "Set KALSHI_SYNTHESIZER_BACKEND=anthropic for real analysis."
        )
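
The snippet above covers the human-readable warning. For JSON output, the SSOT section states that a "warning" field is included when the mock is active; one way that could look is sketched below. `to_json_output` and the warning text are hypothetical, and the result is modeled as a plain dict for brevity.

```python
import json
from typing import Any

MOCK_WARNING = (
    "MockSynthesizer active: predictions are placeholders. "
    "Set KALSHI_SYNTHESIZER_BACKEND=anthropic for real analysis."
)


def to_json_output(result: dict[str, Any], *, is_mock: bool) -> str:
    """Serialize a result, attaching a warning field when the mock is active."""
    payload = dict(result)  # copy so the caller's dict is not mutated
    if is_mock:
        payload["warning"] = MOCK_WARNING
    return json.dumps(payload, indent=2)
```

Keeping the warning inside the JSON document, instead of printing it to stdout, means downstream consumers that parse the output still see it.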

Dependencies

Add to pyproject.toml as optional extras:

[project.optional-dependencies]
llm = [
    "anthropic>=0.40.0",  # For Claude Sonnet 4.5 + structured outputs
]

Installation:

uv sync --extra llm  # For Claude synthesizer

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `KALSHI_SYNTHESIZER_BACKEND` | `anthropic` | `mock` or `anthropic` |
| `ANTHROPIC_API_KEY` | - | Required for `anthropic` backend |
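
Since the default backend is `anthropic`, a missing API key is the most likely misconfiguration. A preflight check along the following lines could report it before any request is made. `check_synthesizer_env` is a hypothetical helper, not part of the spec; it returns a list of problems so the CLI can decide how to present them.

```python
import os
from typing import Mapping, Optional


def check_synthesizer_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return human-readable configuration problems (empty list means OK)."""
    env = os.environ if env is None else env
    problems: list[str] = []

    backend = env.get("KALSHI_SYNTHESIZER_BACKEND", "anthropic")
    if backend not in ("mock", "anthropic"):
        problems.append(
            f"KALSHI_SYNTHESIZER_BACKEND={backend!r} is not 'mock' or 'anthropic'"
        )
    # The key is only needed when the real backend is selected.
    if backend == "anthropic" and not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is required for the 'anthropic' backend")

    return problems
```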

Implementation Plan

Phase 1: Claude Sonnet 4.5 (This Spec)

Add anthropic>=0.40.0 to optional dependencies
Implement ClaudeSynthesizer class with structured outputs
Add get_synthesizer() factory function
Add KALSHI_SYNTHESIZER_BACKEND env var support (default: anthropic)
Create prompt template optimized for calibrated forecasting
Add cost tracking (input/output tokens)
Update CLI to use factory function
Add unit tests with mocked Anthropic responses

Phase 2: Calibration Layer (Future)

Track historical predictions vs outcomes
Apply statistical calibration adjustment
Store predictions in DB for backtesting
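
Phase 2 is not designed yet, but the first step (comparing stored predictions with outcomes) could start from two standard measures: the Brier score and a binned reliability table. The sketch below is one possible starting point, not a decision of this spec; function names are hypothetical, and probabilities are fractions in [0, 1] paired with the resolved YES/NO outcome.

```python
def brier_score(history: list[tuple[float, bool]]) -> float:
    """Mean squared error between predicted probability and outcome (lower is better)."""
    if not history:
        raise ValueError("history is empty")
    return sum((p - float(outcome)) ** 2 for p, outcome in history) / len(history)


def reliability_table(
    history: list[tuple[float, bool]], n_bins: int = 5
) -> list[tuple[float, float, int]]:
    """Per non-empty bin: (mean predicted prob, observed YES frequency, count)."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, outcome in history:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, outcome))

    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            yes_freq = sum(1 for _, outcome in members if outcome) / len(members)
            rows.append((mean_p, yes_freq, len(members)))
    return rows
```

A well-calibrated forecaster shows `mean_p` close to `yes_freq` in every bin; persistent gaps are what a later calibration adjustment would try to correct.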

Testing Strategy

Unit Tests (No API Calls)

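# tests/unit/agent/test_llm_synthesizer.py

import httpx
import pytest
import respx

from kalshi_research.agent.providers.llm import (
    ClaudeSynthesizer,
    MockSynthesizer,
    get_synthesizer,
)

# mock_input is assumed to be a SynthesisInput fixture defined elsewhere in this module.


def test_claude_synthesizer_builds_prompt(monkeypatch):
    """Prompt template includes all required fields."""
    # The Anthropic client expects a key to be configured; a dummy value is enough here.
    monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
    synth = ClaudeSynthesizer()
    prompt = synth._build_prompt(mock_input)
    assert mock_input.market.ticker in prompt
    assert "Current implied probability" in prompt

@pytest.mark.asyncio  # needs pytest-asyncio; redundant under asyncio_mode = "auto"
@respx.mock
async def test_claude_synthesizer_returns_valid_result(monkeypatch):
    """Mocked Anthropic returns valid AnalysisResult."""
    monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")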
    respx.post("https://api.anthropic.com/v1/messages").mock(
        return_value=httpx.Response(200, json={
            "content": [{
                "type": "tool_use",
                "name": "submit_analysis",
                "input": {
                    "ticker": "TEST",
                    "predicted_prob": 65,
                    "confidence": "medium",
                    "reasoning": "Test reasoning",
                    # ... other fields
                }
            }],
            "usage": {"input_tokens": 100, "output_tokens": 200}
        })
    )
    synth = ClaudeSynthesizer()
    result = await synth.synthesize(input=mock_input)
    assert 0 <= result.predicted_prob <= 100

def test_get_synthesizer_factory():
    """Factory returns correct synthesizer type."""
    assert isinstance(get_synthesizer("mock"), MockSynthesizer)
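
The mocked response body above is abbreviated. A fuller body, closer to what the Messages API returns for a forced tool call, can be produced by a small helper such as the one sketched here. `make_tool_use_response` is a hypothetical test utility, and the `id` values are made-up placeholders; the field names follow the Messages API response shape.

```python
from typing import Any


def make_tool_use_response(
    tool_input: dict[str, Any],
    *,
    model: str = "claude-sonnet-4-5-20250929",
    input_tokens: int = 100,
    output_tokens: int = 200,
) -> dict[str, Any]:
    """Build a Messages API style response body containing one tool_use block."""
    return {
        "id": "msg_test_0001",  # placeholder
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [
            {
                "type": "tool_use",
                "id": "toolu_test_0001",  # placeholder
                "name": "submit_analysis",
                "input": tool_input,
            }
        ],
        "stop_reason": "tool_use",
        "stop_sequence": None,
        "usage": {"input_tokens": input_tokens, "output_tokens": output_tokens},
    }
```

The returned dict can be passed as `json=` to `httpx.Response(200, ...)` in the respx mock, which keeps each test focused on the `input` payload it cares about.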

Integration Tests (Opt-In)

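# tests/integration/agent/test_llm_real.py

import os

import pytest

from kalshi_research.agent.providers.llm import ClaudeSynthesizer

# real_input is assumed to be a SynthesisInput built elsewhere from live market data.

@pytest.mark.asyncio  # needs pytest-asyncio; redundant under asyncio_mode = "auto"
@pytest.mark.skipif(not os.getenv("ANTHROPIC_API_KEY"), reason="No API key")
async def test_real_claude_synthesis():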
    """Real Claude call returns valid result."""
    synth = ClaudeSynthesizer()
    result = await synth.synthesize(input=real_input)
    assert result.reasoning  # Non-empty reasoning
    assert 0 <= result.predicted_prob <= 100

Acceptance Criteria

[x] ClaudeSynthesizer implemented using claude-sonnet-4-5-20250929
[x] Native structured outputs enabled via Anthropic tool use + JSON schema
[x] get_synthesizer() factory function works
[x] KALSHI_SYNTHESIZER_BACKEND env var controls backend (default: anthropic)
[x] CLI uses factory, warns when mock is active (including JSON output)
[x] Prompt template optimized for calibrated probability estimation
[x] Cost tracking for LLM calls (tokens used, USD spent)
[x] Unit tests with mocked API (no real calls in CI)
[x] Integration test with real API (opt-in via env var)
[x] Documentation updated with env var instructions (.env.example)

Files to Create/Modify

| File | Action |
|---|---|
| `src/kalshi_research/agent/providers/llm.py` | Add `ClaudeSynthesizer`, `get_synthesizer()`, prompt templates |
| `src/kalshi_research/cli/agent.py` | Use factory function, update warning logic |
| `pyproject.toml` | Add `[llm]` optional dependencies |
| `tests/unit/agent/test_llm_synthesizer.py` | New unit tests |
| `tests/integration/agent/test_llm_real.py` | New integration tests (opt-in) |
| `.env.example` | Add `KALSHI_SYNTHESIZER_BACKEND` |