# Benchmark Results
This document tracks our MultiPathQA benchmark results and compares them to the published GIANT paper.
## GTEx Organ Classification (20-way)
Date: 2025-12-27
Run ID: gtex_giant_openai_gpt-5.2
### Our Results vs Paper
| Metric | Our Result | Paper (GPT-5 GIANT) | Paper (GPT-5 GIANT x5) |
|---|---|---|---|
| Balanced Accuracy (scored items only) | 70.3% (bootstrap: 70.4% ± 3.0%) | 53.7% ± 3.4% | 60.7% ± 3.2% |
| Bootstrap CI (95%) | 64.3% - 76.1% | - | - |
| Items Processed | 191/191 | 191 | 191 |
| Scored Items | 185/191 | 191 | 191 |
| Parse Errors (excluded) | 6 | - | - |
| Total Cost | $7.21 | - | - |
### Analysis
Our scored-items-only point estimate of 70.3% exceeds both the paper's single-run GIANT result (53.7%) and its 5-run majority vote (60.7%). This suggests our implementation is working correctly, though the model upgrade noted below complicates a head-to-head comparison.
Note: The saved artifacts include 6 OpenAI parse failures (BUG-038-B2) where the prediction text was not saved; scoring those failures as incorrect yields the original 67.6% ± 3.1% paper-faithful result.
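The two conventions differ only in how parse failures enter the score. A minimal sketch of both policies (record field names are illustrative, not the project's actual schema):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def score(records, paper_faithful=False):
    """records: dicts with 'label' and 'prediction' (None on parse failure)."""
    if paper_faithful:
        # Paper-faithful: count parse failures as guaranteed-wrong predictions.
        pairs = [(r["label"],
                  r["prediction"] if r["prediction"] is not None else "<parse_error>")
                 for r in records]
    else:
        # Scored-items-only: drop parse failures before scoring.
        pairs = [(r["label"], r["prediction"])
                 for r in records if r["prediction"] is not None]
    return balanced_accuracy(*zip(*pairs))
```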
Possible reasons for improvement over paper:
- We used gpt-5.2 (latest) vs paper's gpt-5 baseline
- Minor implementation differences in prompting or crop selection
### Comparison to Baselines (from paper)
| Method | GTEx Balanced Accuracy |
|---|---|
| Our GIANT (gpt-5.2) | 70.3% (rescored) |
| Paper: GIANT x5 (GPT-5) | 60.7% ± 3.2% |
| Paper: GIANT x1 (GPT-5) | 53.7% ± 3.4% |
| Paper: Thumbnail (GPT-5) | 36.5% ± 3.4% |
| Paper: Patch (GPT-5) | 43.7% ± 2.4% |
| Paper: TITAN (zero-shot) | 96.3% ± 1.3% |
| Paper: SlideChat (zero-shot) | 5.0% ± 0.0% |
Our implementation significantly outperforms the paper's thumbnail and patch baselines, but remains below specialized models like TITAN.
## Artifacts
### GTEx Result Files
- Full Results JSON: `results/gtex_giant_openai_gpt-5.2_results.json`
- Checkpoint: `results/checkpoints/gtex_giant_openai_gpt-5.2.checkpoint.json`
- Log File: `results/gtex-benchmark-20251227-010151.log`
### TCGA Result Files
- Full Results JSON: `results/tcga_giant_openai_gpt-5.2_results.json`
- Checkpoint: `results/checkpoints/tcga_giant_openai_gpt-5.2.checkpoint.json`
### PANDA Result Files
- Full Results JSON: `results/panda_giant_openai_gpt-5.2_results.json`
- Checkpoint: `results/checkpoints/panda_giant_openai_gpt-5.2.checkpoint.json`
- Log File: `results/panda_benchmark.log`
### Trajectory Files
Individual slide trajectories with full LLM reasoning are saved in:
Each trajectory contains the following (a loading sketch follows the list):

- WSI path
- Question asked
- Turn-by-turn data:
    - Image shown to model (base64)
    - Model reasoning
    - Action taken (crop coordinates or answer)
- Final prediction
- Cost and token usage
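A minimal sketch of inspecting one saved trajectory, assuming one JSON file per slide with keys matching the fields above (the path and key names are hypothetical):

```python
import json

# Hypothetical path and schema; the actual file layout and key names may differ.
with open("results/trajectories/<slide_id>.json") as f:
    traj = json.load(f)

print("WSI:", traj["wsi_path"])
print("Question:", traj["question"])
for i, turn in enumerate(traj["turns"]):
    # Each turn records the image shown (base64), the model's reasoning,
    # and the action taken (crop coordinates or a final answer).
    print(f"turn {i}: action={turn['action']}")
print("Final prediction:", traj["final_prediction"])
print("Cost:", traj["cost"], "Tokens:", traj["token_usage"])
```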
### Summary Statistics (GTEx)
```json
{
  "metric_type": "balanced_accuracy",
  "point_estimate": 0.703,
  "bootstrap_mean": 0.704,
  "bootstrap_std": 0.030,
  "bootstrap_ci_lower": 0.643,
  "bootstrap_ci_upper": 0.761,
  "n_replicates": 1000,
  "n_scored": 185,
  "n_total": 191,
  "n_errors_excluded": 6,
  "n_extraction_failures": 0
}
```
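The bootstrap fields can be reproduced by resampling scored items with replacement. A sketch using a percentile bootstrap, reusing the `balanced_accuracy` helper from the earlier sketch (the actual implementation may seed or stratify differently):

```python
import random

def bootstrap_stats(y_true, y_pred, n_replicates=1000, seed=0):
    """Percentile-bootstrap mean, std, and 95% CI for balanced accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    reps = []
    for _ in range(n_replicates):
        # Resample item indices with replacement and rescore the replicate.
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(balanced_accuracy([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx]))
    reps.sort()
    mean = sum(reps) / n_replicates
    std = (sum((r - mean) ** 2 for r in reps) / n_replicates) ** 0.5
    return {"bootstrap_mean": mean, "bootstrap_std": std,
            "bootstrap_ci_lower": reps[int(0.025 * n_replicates)],
            "bootstrap_ci_upper": reps[int(0.975 * n_replicates) - 1]}
```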
## TCGA Cancer Diagnosis (30-way)
Date: 2025-12-27
Run ID: tcga_giant_openai_gpt-5.2
### Our Results vs Paper
| Metric | Our Result | Paper (GPT-5 GIANT) | Paper (GPT-5 GIANT x5) |
|---|---|---|---|
| Balanced Accuracy (scored items only) | 26.2% (bootstrap: 25.3% ± 3.1%) | 32.3% ± 3.5% | 29.3% ± 3.3% |
| Bootstrap CI (95%) | 19.2% - 31.1% | - | - |
| Items Processed | 221/221 | 221 | 221 |
| Scored Items | 215/221 | 221 | 221 |
| Parse Errors (excluded) | 6 | - | - |
| Total Cost | $15.14 | - | - |
### Analysis
Our scored-items-only point estimate of 26.2% is below the paper's single-run GIANT result (32.3%). This 30-way cancer classification task is significantly harder than GTEx's 20-way organ classification.
Possible reasons for underperformance:
- Task Difficulty: Cancer diagnosis requires fine-grained cellular features that may need more navigation steps or specialized prompts
- Class Imbalance: TCGA has 30 cancer types with uneven distribution
- Single-run variance: In the paper, x5 majority voting (29.3%) actually scored below x1 (32.3%) on this task, suggesting high run-to-run variance
### Comparison to Baselines (from paper)
| Method | TCGA Balanced Accuracy |
|---|---|
| Paper: GIANT x5 (GPT-5) | 29.3% ± 3.3% |
| Paper: GIANT x1 (GPT-5) | 32.3% ± 3.5% |
| Our GIANT (gpt-5.2) | 26.2% (rescored) |
| Paper: Thumbnail (GPT-5) | 9.2% ± 1.9% |
| Paper: Patch (GPT-5) | 12.8% ± 2.1% |
| Paper: TITAN (zero-shot) | 88.8% ± 1.7% |
| Paper: SlideChat (zero-shot) | 3.3% ± 1.2% |
Our implementation outperforms the paper's thumbnail and patch baselines (9.2% and 12.8%), indicating that agentic navigation adds value, though a gap to the paper's GIANT results remains.
### Cost Efficiency
- Cost per item: $15.14 / 221 ≈ $0.068/item
- Average tokens per item: 4,315,199 / 221 ≈ 19,525 tokens
- Roughly 1.8x the per-item cost of GTEx (likely due to more navigation steps); the sketch below recomputes these figures
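The per-item figures follow directly from the run totals reported above:

```python
# Per-item cost and token figures, recomputed from the run totals above.
tcga_cost_per_item = 15.14 / 221        # ≈ $0.0685
tcga_tokens_per_item = 4_315_199 / 221  # ≈ 19,525 tokens
gtex_cost_per_item = 7.21 / 191         # ≈ $0.0377
print(f"{tcga_cost_per_item / gtex_cost_per_item:.1f}x")  # ≈ 1.8x
```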
### Summary Statistics (TCGA)
```json
{
  "metric_type": "balanced_accuracy",
  "point_estimate": 0.262,
  "bootstrap_mean": 0.253,
  "bootstrap_std": 0.031,
  "bootstrap_ci_lower": 0.192,
  "bootstrap_ci_upper": 0.311,
  "n_replicates": 1000,
  "n_scored": 215,
  "n_total": 221,
  "n_errors_excluded": 6,
  "n_extraction_failures": 0
}
```
## PANDA Prostate Grading (6-way)
Date: 2025-12-29
Run ID: panda_giant_openai_gpt-5.2
### Our Results vs Paper
| Metric | Our Result | Paper (GPT-5 GIANT) | Paper (GPT-5 GIANT x5) |
|---|---|---|---|
| Balanced Accuracy (rescored; scored items only) | 20.3% (bootstrap: 20.4% ± 1.9%) | 23.2% ± 2.3% | 25.4% ± 2.0% |
| Bootstrap CI (95%) | 16.9% - 24.2% | - | - |
| Items Processed | 197/197 | 197 | 197 |
| Scored Items | 191/197 | 197 | 197 |
| Parse Errors (excluded) | 6 | - | - |
| Extraction Failures | 0 (rescored; 47 in pre-fix artifact) | - | - |
| Total Cost | $73.38 | - | - |
### Analysis
The original 2025-12-29 PANDA artifact scored only 9.7% on scored items (9.4% ± 2.2% paper-faithful, with errors counted as incorrect). This was largely due to BUG-038-B1: a PANDA answer of `"isup_grade": null` was not mapped to the benign label 0, which produced 47 extraction failures and many mis-parsed labels via the integer fallback.
Rescoring the saved PANDA predictions with the current, fixed extractor (no new LLM calls) yields 20.3% balanced accuracy on scored items only, excluding the 6 hard failures caused by BUG-038-B2 in the saved artifact. The equivalent paper-faithful score (counting those failures as incorrect) is 19.7% ± 1.9%.
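A sketch of the fixed extraction behavior (the function name and fallback details are illustrative; the actual extractor may differ):

```python
import json
import re

def extract_isup_grade(prediction_text):
    """Map a model answer to an ISUP grade 0-5; return None on hard failure."""
    try:
        payload = json.loads(prediction_text)
    except json.JSONDecodeError:
        # Pre-fix fallback: grab the first bare digit, which could pick up
        # unrelated integers in the answer text and mislabel the item.
        match = re.search(r"\b[0-5]\b", prediction_text)
        return int(match.group()) if match else None
    grade = payload.get("isup_grade")
    if grade is None:
        # BUG-038-B1 fix: an explicit null means benign, i.e. label 0.
        return 0
    return int(grade)
```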
The remaining gap versus the paper (23.2%) is now much smaller; likely contributors include the 6 hard failures plus model and prompt differences.
### Comparison to Baselines (from paper)
| Method | PANDA Balanced Accuracy |
|---|---|
| Paper: GIANT x5 (GPT-5) | 25.4% ± 2.0% |
| Paper: GIANT x1 (GPT-5) | 23.2% ± 2.3% |
| Our GIANT (gpt-5.2) | 20.3% (rescored) |
| Paper: Thumbnail (GPT-5) | 12.2% ± 2.2% |
| Paper: Patch (GPT-5) | 21.3% ± 2.4% |
| Paper: TITAN (zero-shot) | 27.5% ± 2.3% |
| Paper: SlideChat (zero-shot) | 17.0% ± 0.4% |
### Recommended Next Step
Re-run PANDA with the BUG-038 fixes applied, both to eliminate the 6 B2 hard failures and to get accurate cost accounting (the saved artifact's costs are a lower bound when parsing fails before usage is recorded).
### Summary Statistics (PANDA)
```json
{
  "metric_type": "balanced_accuracy",
  "point_estimate": 0.203,
  "bootstrap_mean": 0.204,
  "bootstrap_std": 0.019,
  "bootstrap_ci_lower": 0.169,
  "bootstrap_ci_upper": 0.242,
  "n_replicates": 1000,
  "n_scored": 191,
  "n_total": 197,
  "n_errors_excluded": 6,
  "n_extraction_failures": 0
}
```
## Benchmark Summary
| Benchmark | Status | Our Result | Paper (x1) | Paper (x5) | Cost |
|---|---|---|---|---|---|
| GTEx (Organ, 20-way) | COMPLETE ✓ | 70.3% | 53.7% ± 3.4% | 60.7% ± 3.2% | $7.21 |
| ExpertVQA (128 Q) | COMPLETE ✓ | 60.1% | 57.0% ± 4.5% | 62.5% ± 4.4% | $10.32 |
| SlideBench (197 Q) | COMPLETE ✓ | 51.8% | 58.9% ± 3.5% | 59.4% ± 3.4% | $18.59 |
| TCGA (Cancer Dx, 30-way) | COMPLETE ✓ | 26.2% | 32.3% ± 3.5% | 29.3% ± 3.3% | $15.14 |
| PANDA (Grading, 6-way) | COMPLETE ✓ | 20.3% | 23.2% ± 2.3% | 25.4% ± 2.0% | $73.38 |
Total Benchmark Cost: $124.64 (934 questions across 862 WSIs)
### Key Findings
- 2 benchmarks exceed paper results: GTEx (70.3% vs 53.7%) and ExpertVQA (60.1% vs 57.0%)
- 3 benchmarks below paper but above baselines: SlideBench, TCGA, PANDA all outperform thumbnail baselines
- Zero extraction failures after BUG-038/BUG-039 fixes
## Reproducibility
To reproduce these results:
```bash
# Ensure GTEx WSIs are in data/wsi/gtex/
uv run giant check-data gtex

# Run benchmark
source .env  # Load API keys
uv run giant benchmark gtex --provider openai --model gpt-5.2 -v
```
## Notes
- Cost Tracking: Reported costs in the saved artifacts are lower bounds; parse-failed calls raised before usage/cost was accumulated (BUG-038-B2, now fixed for future runs).
- Rescore Policy: “Rescored” results above exclude the 6 hard failures per benchmark due to BUG-038-B2 (now fixed in code), because the pre-fix artifacts do not include prediction text for those failures.
- WSI Format: Used DICOM-format WSIs from IDC (OpenSlide 4.0.0+ compatible); a minimal loading sketch follows below
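A minimal sketch of opening one of these DICOM slides with OpenSlide's Python bindings (the path is illustrative):

```python
import openslide

# Hypothetical path; IDC DICOM WSIs are readable with OpenSlide 4.0.0+.
slide = openslide.OpenSlide("data/wsi/gtex/slide-0001.dcm")
print(slide.dimensions)   # full-resolution (width, height) in pixels
print(slide.level_count)  # number of pyramid levels available for cropping
thumbnail = slide.get_thumbnail((1024, 1024))  # PIL overview image of the slide
```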