Calibration Analysis (Explanation)
Calibration answers the most important question for any forecaster: "Are my probabilities accurate?"
A perfectly calibrated predictor assigns probabilities that match actual frequencies. If you say something has a 70% chance of happening, and you make 100 such predictions, about 70 of them should come true.
Why Calibration Matters
Most people are poorly calibrated:
- Overconfidence: Saying 90% when it's really 60%
- Underconfidence: Saying 50% when it's really 80%
- Extremity aversion: Avoiding 5% or 95% predictions
Poor calibration means you're systematically wrong, which bleeds money in prediction markets.
The Brier Score
The fundamental metric for probabilistic predictions:
brier_score = (1/N) * Σ(forecast - outcome)²
Where:
- `forecast` is your probability (0 to 1)
- `outcome` is what happened (0 for NO, 1 for YES)
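For illustration, here is a minimal NumPy version of this formula; the function name and example numbers are ours, not part of the project's API:

```python
import numpy as np

def brier_score(forecasts, outcomes) -> float:
    """Mean squared difference between probabilities and binary outcomes."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

# Three predictions: 80% YES (it happened), 60% YES (it didn't), 10% YES (it didn't)
print(brier_score([0.8, 0.6, 0.1], [1, 0, 0]))  # (0.04 + 0.36 + 0.01) / 3 ≈ 0.137
```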
Interpreting Brier Scores
| Score | Interpretation |
|---|---|
| 0.00 | Perfect (you predicted 1.0 for all YES, 0.0 for all NO) |
| 0.10 | Excellent |
| 0.15 | Good |
| 0.20 | Fair |
| 0.25 | Random guessing (always predict 50%) |
| 0.33 | Bad (worse than random) |
| 1.00 | Perfectly wrong |
Brier Decomposition
The Brier score can be broken down into three components that tell you why you're scoring the way you are:
Brier = Reliability - Resolution + Uncertainty
Reliability (Calibration Error)
How well your probabilities match actual frequencies.
reliability = Σ(n_k * (o_k - f_k)²) / N
Where:
- `n_k` = samples in bin k
- `o_k` = actual YES frequency in bin k
- `f_k` = mean predicted probability in bin k
Lower is better. Zero means perfect calibration.
Resolution (Discrimination)
How much your predictions vary from the base rate. Can you distinguish likely from unlikely events?
resolution = Σ(n_k * (o_k - base_rate)²) / N
Higher is better. If you always predict the base rate, resolution is zero (no discrimination).
Uncertainty
The inherent unpredictability of the events:
uncertainty = base_rate * (1 - base_rate)
This is fixed by the data - you can't control it.
The Relationship
Brier = Reliability - Resolution + Uncertainty
To get a good Brier score:
- Minimize reliability (be well-calibrated)
- Maximize resolution (make confident, varying predictions)
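The sketch below shows roughly how the three terms can be computed from binned data. It assumes equal-width bins over [0, 1] and is illustrative NumPy, not the project's `CalibrationAnalyzer`:

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_bins: int = 10):
    """Murphy decomposition: Brier ≈ reliability - resolution + uncertainty."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(forecasts)
    base_rate = outcomes.mean()

    # Assign each forecast to an equal-width probability bin
    bin_ids = np.clip((forecasts * n_bins).astype(int), 0, n_bins - 1)

    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        mask = bin_ids == k
        n_k = int(mask.sum())
        if n_k == 0:
            continue
        f_k = forecasts[mask].mean()  # mean predicted probability in bin k
        o_k = outcomes[mask].mean()   # actual YES frequency in bin k
        reliability += n_k * (o_k - f_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty
```

Note that with a finite number of bins the identity holds only approximately: it is exact when every forecast in a bin equals that bin's mean forecast.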
Brier Skill Score
Compares your Brier score to a baseline (always predicting the base rate):
skill_score = 1 - (brier / climatology_brier)
Where climatology_brier = base_rate * (1 - base_rate).
- Skill > 0: You're better than guessing the base rate
- Skill = 0: You're no better than always guessing the base rate
- Skill < 0: You're worse than the baseline (ouch)
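Putting the two formulas together, a tiny illustrative helper (the function name is ours, not the project's API):

```python
def brier_skill_score(brier: float, base_rate: float) -> float:
    """Skill relative to always predicting the base rate."""
    climatology_brier = base_rate * (1 - base_rate)
    return 1 - brier / climatology_brier

# With a base rate of 0.3, the climatology Brier is 0.21;
# a Brier score of 0.15 then gives a skill of about 0.29.
print(brier_skill_score(0.15, 0.3))
```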
Calibration Curves
A calibration curve plots predicted probabilities vs actual frequencies:
(Diagram: predicted probability on the x-axis from 0 to 1, actual frequency on the y-axis from 0 to 1, with the diagonal line y = x marking perfect calibration and the observed per-bin points scattered around it.)
The diagonal line is perfect calibration. Points above the line mean underconfidence; below means overconfidence.
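If you want to visualize such a curve, a minimal matplotlib sketch along these lines works; the bin values here are made-up numbers purely to illustrate the plot shape, not real Kalshi data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up per-bin values purely to show the shape of the plot
predicted_probs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # mean predicted prob per bin
actual_freqs = np.array([0.08, 0.33, 0.47, 0.74, 0.86])  # actual YES frequency per bin

plt.plot([0, 1], [0, 1], linestyle="--", label="Perfect calibration")
plt.scatter(predicted_probs, actual_freqs, label="Observed bins")
plt.xlabel("Predicted probability")
plt.ylabel("Actual frequency")
plt.legend()
plt.show()
```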
CalibrationResult Data
The analyzer returns:
```python
@dataclass
class CalibrationResult:
    brier_score: float
    brier_skill_score: float
    n_samples: int

    # Calibration curve data
    bins: NDArray             # Probability bin edges
    predicted_probs: NDArray  # Mean predicted prob per bin
    actual_freqs: NDArray     # Actual YES frequency per bin
    bin_counts: NDArray       # Samples per bin

    # Brier decomposition
    reliability: float
    resolution: float
    uncertainty: float
```
CLI Usage
Run calibration analysis on your historical data:
```bash
# Last 30 days
uv run kalshi analysis calibration --db data/kalshi.db --days 30

# Save results to file
uv run kalshi analysis calibration --db data/kalshi.db --days 30 --output calibration.json
```
What the Data Shows
The calibration analysis uses:
- Price snapshots: What the market predicted at various times
- Settlements: What actually happened
It bins market prices (e.g., 0-10%, 10-20%, ..., 90-100%) and compares to actual YES frequencies.
This tells you if Kalshi markets are well-calibrated (they generally are), and helps you spot where markets might be systematically miscalibrated.
Using for Your Theses
You can compute calibration on your personal theses:
```python
from kalshi_research.analysis.calibration import CalibrationAnalyzer
from kalshi_research.research.thesis import ThesisTracker

tracker = ThesisTracker()
resolved = tracker.list_resolved()
scored = [t for t in resolved if t.actual_outcome in {"yes", "no"}]

forecasts = [t.your_probability for t in scored]
outcomes = [1 if t.actual_outcome == "yes" else 0 for t in scored]

analyzer = CalibrationAnalyzer(n_bins=5)  # Fewer bins for small samples
result = analyzer.compute_calibration(forecasts, outcomes)
print(result)
```
Sample Size Considerations
Calibration analysis needs sufficient data:
- < 20 samples: Results are noise; don't trust them
- 20-50 samples: Directional signal, but wide confidence intervals
- 50-100 samples: Starting to be meaningful
- 100+ samples: Reliable metrics
With small samples, use fewer bins (5 instead of 10) to get enough data per bin.
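One way to see why small bins are unreliable: the standard error of an observed frequency shrinks only with the square root of the bin count. A back-of-the-envelope sketch, not part of the analyzer:

```python
import math

def freq_standard_error(p: float, n: int) -> float:
    """Approximate binomial standard error of an observed frequency from n outcomes."""
    return math.sqrt(p * (1 - p) / n)

# A bin with 10 samples and a true rate of 0.7 has a standard error of ~0.14;
# with 100 samples it drops to ~0.046.
print(freq_standard_error(0.7, 10), freq_standard_error(0.7, 100))
```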
Key Code
- Analyzer: `src/kalshi_research/analysis/calibration.py`
- CLI command: `src/kalshi_research/cli/analysis.py`
See Also
- Thesis System - Track your predictions
- Backtesting - Simulate P&L
- Usage: Analysis - CLI commands