
Running Benchmarks

This guide covers running GIANT on the MultiPathQA benchmark suite.

Prerequisites

  1. Installation completed
  2. API key configured in .env
  3. WSI files downloaded (see Data Acquisition)
  4. MultiPathQA metadata downloaded:
    giant download multipathqa
    

Basic Usage

giant benchmark <dataset> [options]

Available Datasets

Dataset           Task                            Items   WSI Source
gtex              Organ Classification (20-way)   191     GTEx
tcga              Cancer Diagnosis (30-way)       221     TCGA
panda             Prostate Grading (6-way)        197     PANDA
tcga_expert_vqa   Pathologist-Authored VQA        128     TCGA
tcga_slidebench   SlideBench VQA                  197     TCGA

Example

# Full GTEx benchmark
giant benchmark gtex --provider openai -v

Command Options

Data Options

Option              Default                            Description
--csv-path          data/multipathqa/MultiPathQA.csv   Path to benchmark CSV
--wsi-root          data/wsi                           Root directory for WSI files
-o, --output-dir    results                            Output directory

Provider Options

Option           Default   Description
-p, --provider   openai    LLM provider
--model          gpt-5.2   Model ID

Run Options

Option                             Default          Description
-T, --max-steps                    20               Max navigation steps per item
-r, --runs                         1                Runs per item (for majority voting)
-c, --concurrency                  4                Max concurrent API calls
--max-items                        0 (all)          Limit on items to process
--budget-usd                       0 (disabled)     Total cost budget in USD
--skip-missing/--no-skip-missing   --skip-missing   Skip missing WSI files
--resume/--no-resume               --resume         Resume from checkpoint

Output Options

Option          Default   Description
--json          False     JSON output for scripting
-v, --verbose   0         Verbosity level
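
The --json flag is intended for scripting. Below is a minimal sketch of consuming it from Python; it assumes the command prints a single JSON object to stdout (the exact schema is not specified here, so the snippet only lists top-level keys):

import json
import subprocess

# Run a small benchmark and capture its machine-readable output.
# Assumes `giant` is on PATH and that --json emits one JSON object on stdout.
proc = subprocess.run(
    ["giant", "benchmark", "gtex", "--max-items", "5", "--json"],
    capture_output=True,
    text=True,
    check=True,
)

summary = json.loads(proc.stdout)
print(sorted(summary.keys()))  # inspect what the run reported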

Workflow Examples

Quick Validation (5 items)

giant benchmark gtex \
    --max-items 5 \
    --provider openai \
    -v

Full Benchmark with Resume

# Start benchmark (may take hours)
giant benchmark tcga --provider openai -v

# If interrupted, resume:
giant benchmark tcga --resume -v

High Concurrency

# Faster but higher API load
giant benchmark gtex --concurrency 8 -v

Multiple Runs (Majority Voting)

# 3 runs per item, report majority vote
giant benchmark gtex --runs 3 -v
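
For intuition, majority voting over runs amounts to picking the most frequent answer per item. A sketch of the idea (not necessarily the tool's exact aggregation; ties and failed runs may be handled differently):

from collections import Counter

def majority_vote(predictions: list[str]) -> str:
    """Return the most common prediction; ties go to the answer seen first."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. three runs on one GTEx slide
print(majority_vote(["Heart", "Heart", "Lung"]))  # -> "Heart"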

Cost-Limited Run

# Stop when budget is exhausted
giant benchmark tcga --budget-usd 10.00 -v

Output Structure

Results are saved to --output-dir (default: results/):

results/
├── gtex_giant_openai_gpt-5.2_results.json    # Full results
├── checkpoints/
│   └── gtex_giant_openai_gpt-5.2.checkpoint.json  # Resume state
└── trajectories/
    ├── GTEX-OIZH-0626_run0.json              # Per-item trajectories
    └── ...
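
A short sketch for walking the per-item trajectory files of a run, assuming the default results/ layout shown above (the trajectory schema itself is not documented here, so the snippet only loads each file and reports its top-level type):

import json
from pathlib import Path

traj_dir = Path("results/trajectories")

# Each file holds the navigation trace for one item/run pair.
for traj_path in sorted(traj_dir.glob("*_run0.json")):
    with traj_path.open() as f:
        trajectory = json.load(f)
    print(traj_path.name, type(trajectory).__name__)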

Results JSON

{
  "run_id": "gtex_giant_openai_gpt-5.2",
  "benchmark_name": "gtex",
  "model_name": "gpt-5.2",
  "config": {
    "mode": "giant",
    "max_steps": 20,
    "runs_per_item": 1,
    "max_concurrent": 4,
    "max_items": null,
    "skip_missing_wsis": true
  },
  "results": [
    {
      "item_id": "GTEX-OIZH-0626",
      "prediction": "Heart",
      "predicted_label": 1,
      "truth_label": 1,
      "correct": true,
      "cost_usd": 0.0378,
      "total_tokens": 1234,
      "trajectory_file": "results/trajectories/GTEX-OIZH-0626_run0.json",
      "error": null
    },
    ...
  ],
  "metrics": {
    "metric_type": "balanced_accuracy",
    "bootstrap_mean": 0.676,
    "bootstrap_std": 0.031,
    "bootstrap_ci_lower": 0.614,
    "bootstrap_ci_upper": 0.735,
    "n_replicates": 1000
  },
  "total_cost_usd": 7.21,
  "total_tokens": 1234567,
  "timestamp": "2025-12-27T00:00:00Z"
}
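
The per-item records can be re-aggregated as a quick sanity check. A minimal sketch assuming the field names shown above (results, error, correct, cost_usd); note it computes plain accuracy, whereas the reported metric for classification tasks is balanced accuracy (see Metrics below):

import json

with open("results/gtex_giant_openai_gpt-5.2_results.json") as f:
    run = json.load(f)

# Ignore items that errored out; they carry no prediction.
scored = [r for r in run["results"] if r["error"] is None]
accuracy = sum(r["correct"] for r in scored) / len(scored)
total_cost = sum(r["cost_usd"] for r in scored)

print(f"{len(scored)} scored items, raw accuracy {accuracy:.3f}, cost ${total_cost:.2f}")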

Metrics

Metric                                    Description
metric_type                               "balanced_accuracy" for classification, "accuracy" for VQA
bootstrap_mean                            Bootstrap mean of the metric
bootstrap_std                             Bootstrap standard deviation
bootstrap_ci_lower / bootstrap_ci_upper   95% bootstrap confidence interval

Classification tasks (tcga, gtex, panda) use balanced accuracy (per the paper). VQA tasks use accuracy.
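
For intuition, here is a sketch of how a bootstrap estimate of balanced accuracy can be computed from per-item labels. It illustrates the statistics behind the fields above (assuming 1,000 replicates and a 95% percentile interval), not necessarily GIANT's exact implementation:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_balanced_accuracy(y_true, y_pred, n_replicates=1000, seed=0):
    """Resample items with replacement and recompute the metric each time."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_replicates):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        scores.append(balanced_accuracy_score(y_true[idx], y_pred[idx]))
    scores = np.asarray(scores)
    return scores.mean(), scores.std(), np.percentile(scores, [2.5, 97.5])

# e.g. labels pulled from the truth_label / predicted_label fields
mean, std, (ci_lo, ci_hi) = bootstrap_balanced_accuracy([0, 1, 1, 2], [0, 1, 0, 2])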

Cost Estimates

Costs vary significantly with the provider/model and with how many navigation steps the agent takes per item. To estimate before committing to a full run:

  1. Run a small sample: giant benchmark <dataset> --max-items 5 --json
  2. Extrapolate from total_cost_usd to the full item count (see the sketch below)
  3. Use --budget-usd as a hard cap on longer runs
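
A sketch of steps 1-2: extrapolating from a 5-item sample run to the full dataset, assuming the sample writes the results file shown above and using the item counts from Available Datasets:

import json

SAMPLE_ITEMS = 5
FULL_ITEMS = 191  # gtex, from the Available Datasets table

with open("results/gtex_giant_openai_gpt-5.2_results.json") as f:
    sample_run = json.load(f)

# Observed cost per item on the sample, scaled to the full benchmark.
per_item = sample_run["total_cost_usd"] / SAMPLE_ITEMS
print(f"Estimated full-run cost: ~${per_item * FULL_ITEMS:.2f}")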

Checkpoint and Resume

Benchmarks automatically checkpoint progress:

  1. After each item completes
  2. On graceful shutdown (Ctrl+C)

To resume an interrupted run:

giant benchmark gtex --resume -v

The checkpoint file contains:

  - Completed item indices
  - Partial results
  - Run configuration
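
To see how far an interrupted run got, the checkpoint can be inspected directly. A minimal sketch that only reports top-level keys and sizes, since the exact field names are not documented here:

import json
from pathlib import Path

ckpt = Path("results/checkpoints/gtex_giant_openai_gpt-5.2.checkpoint.json")
state = json.loads(ckpt.read_text())

# The checkpoint stores completed item indices, partial results, and the run config.
for key, value in state.items():
    size = len(value) if isinstance(value, (list, dict)) else value
    print(key, "->", size)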

Handling Missing Files

By default, missing WSI files are skipped:

# Skip missing (default)
giant benchmark gtex --skip-missing -v

# Fail on missing
giant benchmark gtex --no-skip-missing -v

Check which files are missing:

giant check-data gtex -v

Comparing to Paper Results

Benchmark    Our Result   Paper (GIANT x1)   Paper (GIANT x5)
GTEx         70.3%        53.7% ± 3.4%       60.7% ± 3.2%
ExpertVQA    60.1%        57.0% ± 4.5%       62.5% ± 4.4%
SlideBench   51.8%        58.9% ± 3.5%       59.4% ± 3.4%
TCGA         26.2%        32.3% ± 3.5%       29.3% ± 3.3%
PANDA        20.3%        23.2% ± 2.3%       25.4% ± 2.0%

All 5 benchmarks have been run to completion. See Benchmark Results for detailed analysis.

Troubleshooting

"WSI file not found"

Check data availability:

giant check-data gtex -v

See Data Acquisition for download instructions.

Rate Limits

Reduce concurrency:

giant benchmark gtex --concurrency 1 -v

Memory Issues

Large WSIs can consume memory. Try:

# Process one at a time
giant benchmark gtex --concurrency 1 -v

Checkpoint Corruption

Delete the checkpoint and restart:

rm results/checkpoints/gtex_*.checkpoint.json
giant benchmark gtex -v

Next Steps