Navigation Algorithm¶
This page explains GIANT's core navigation algorithm, based on Algorithm 1 from the paper.
Overview¶
GIANT navigates gigapixel images through an iterative process:
- Show the LLM a low-resolution thumbnail with coordinate guides
- LLM reasons about what to examine and outputs a crop action
- Extract the requested region at high resolution
- Repeat until the LLM has enough information to answer
Algorithm 1: GIANT Navigation¶
Input: WSI W, question q, step limit T
Output: answer ŷ
 1. I₀ ← Thumbnail(W)                    # Generate thumbnail
 2. I₀ ← AddAxisGuides(I₀)               # Add coordinate markers
 3. C ← [(system_prompt, q, I₀)]         # Initialize context
 4.
 5. for t = 1 to T-1 do                  # At most T-1 crops (paper)
 6.     (rₜ, aₜ) ← LLM(C)                # Get reasoning + action
 7.
 8.     if aₜ.type == "answer" then
 9.         return aₜ.text               # Early termination
10.
11.     if aₜ.type == "crop" then
12.         Iₜ ← CropRegion(W, aₜ, S)    # Extract region
13.         C ← C ∪ [(rₜ, aₜ, Iₜ)]       # Add to context
14.
15. end for
16.
17. ŷ ← ForceAnswer(C)                   # Final step: must answer (with retries)
18. return ŷ
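The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not GIANT's actual implementation: `navigate`, `Action`, and the `llm`, `crop_region`, and `force_answer` callables are hypothetical stand-ins, and the thumbnail is assumed to already carry axis guides.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height) in Level-0 pixels


@dataclass
class Action:
    """Hypothetical action record: either a crop request or a final answer."""
    type: str                       # "crop" or "answer"
    text: Optional[str] = None      # set when type == "answer"
    region: Optional[Region] = None  # set when type == "crop"


def navigate(
    thumbnail: Any,
    question: str,
    max_steps: int,
    llm: Callable[[List[Any]], Tuple[str, Action]],
    crop_region: Callable[[Region], Any],
    force_answer: Callable[[List[Any]], str],
) -> str:
    # Step 3: initial context (system prompt omitted for brevity)
    context: List[Any] = [(question, thumbnail)]

    # Steps 5-15: at most T-1 crop steps
    for _ in range(1, max_steps):
        reasoning, action = llm(context)
        if action.type == "answer":
            return action.text or ""  # early termination
        if action.type == "crop" and action.region is not None:
            image = crop_region(action.region)
            context.append((reasoning, action, image))

    # Step 17: the step limit was reached without an answer
    return force_answer(context)
```

Passing the LLM call, the crop function, and the forced-answer step as callables keeps the loop testable without a slide or an API key.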
Step-by-Step Breakdown¶
Step 1: Thumbnail Generation¶
The thumbnail is a low-resolution overview of the entire slide. A 100,000 x 80,000 pixel slide becomes roughly 1024 x 820 pixels.
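The size arithmetic can be checked with a short sketch. The helper `thumbnail_size` is hypothetical and assumes the thumbnail preserves aspect ratio with its longer side capped at the maximum thumbnail dimension (1024 by default); GIANT's reader may round differently.

```python
from typing import Tuple


def thumbnail_size(width: int, height: int, max_dim: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dim, keeping aspect ratio."""
    scale = min(1.0, max_dim / max(width, height))  # never upscale small slides
    return max(1, round(width * scale)), max(1, round(height * scale))


print(thumbnail_size(100_000, 80_000))  # (1024, 819)
```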
Step 2: Axis Guides¶
# Add Level-0 coordinate markers
navigable = overlay_service.create_navigable_thumbnail(thumbnail, metadata)
Red lines with pixel coordinate labels are overlaid:
     0        25000      50000      75000     100000
     │          │          │          │          │
  ───┼──────────┼──────────┼──────────┼──────────┼───
     │          │          │          │          │
     │    ┌─────────────────────┐     │          │
     │    │   Tissue visible    │     │          │
25K ─┼────│   in this region    │─────┼──────────┼───
     │    │                     │     │          │
     │    └─────────────────────┘     │          │
     │          │          │          │          │
50K ─┼──────────┼──────────┼──────────┼──────────┼───
The LLM uses these markers to specify exact coordinates.
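To illustrate how guide labels relate to thumbnail pixels, here is a small sketch. The function `guide_positions` and the choice of four evenly spaced intervals are assumptions made for this example; the real overlay may place its guides differently.

```python
from typing import List, Tuple


def guide_positions(slide_extent: int, thumb_extent: int, intervals: int = 4) -> List[Tuple[int, int]]:
    """Return (level0_coordinate, thumbnail_pixel) pairs for evenly spaced guides."""
    positions = []
    for i in range(intervals + 1):
        level0 = slide_extent * i // intervals                 # label shown to the LLM
        thumb_px = round(level0 * thumb_extent / slide_extent)  # where the line is drawn
        positions.append((level0, thumb_px))
    return positions


for level0, px in guide_positions(100_000, 1024):
    print(f"label {level0:>6} drawn at thumbnail x = {px}")
```

The labels are in Level-0 units while the lines are drawn in thumbnail pixels, so the LLM can read off full-resolution coordinates directly.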
Step 3: Context Initialization¶
The initial context includes:
- System prompt (navigation instructions)
- User question
- Thumbnail image with axis guides
Steps 4-15: Navigation Loop¶
Each iteration:
- Build messages from context (system, user turns, assistant turns)
- Call LLM with multimodal input (text + images)
- Parse response into reasoning + action
- Execute action:
  - If crop: Extract region, add to context
  - If answer: Return immediately
Step 17: Force Answer¶
If the LLM reaches the step limit without answering:
force_prompt = """
You have reached the maximum number of navigation steps ({max_steps}).
Based on all the regions you have examined, you MUST now provide your final answer.
"""
The agent retries up to 3 times to get an answer action.
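A minimal sketch of this forced-answer step follows. The `force_answer` function, the `ask_llm` callable, and the dictionary-shaped action are hypothetical, and the sketch assumes "up to 3 times" means three attempts in total.

```python
from typing import Callable, List, Optional

FORCE_PROMPT = (
    "You have reached the maximum number of navigation steps ({max_steps}).\n"
    "Based on all the regions you have examined, you MUST now provide your final answer."
)


def force_answer(
    context: List[str],
    ask_llm: Callable[[List[str]], dict],
    max_steps: int,
    max_retries: int = 3,
) -> Optional[str]:
    """Append the force prompt and retry until the model returns an answer action."""
    messages = context + [FORCE_PROMPT.format(max_steps=max_steps)]
    for _ in range(max_retries):
        action = ask_llm(messages)
        if action.get("type") == "answer":
            return action.get("text")
    return None  # no answer after max_retries attempts; the caller decides how to report it
```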
Key Parameters¶
| Parameter | Default | Description |
|---|---|---|
| T (max_steps) | 20 | Maximum navigation steps |
| S (target_size) | 1000 (OpenAI) / 500 (Anthropic) | Output crop long-side in pixels (provider-specific) |
| Thumbnail size | 1024 | Maximum thumbnail dimension |
| Max retries | 3 | Retries for invalid coordinates |
| Oversampling bias | 0.85 | Bias toward finer pyramid levels |
Coordinate System¶
All coordinates use Level-0 (full resolution) pixel space:
- x: Horizontal position from left edge
- y: Vertical position from top edge
- width, height: Size of region to extract
Example for a 100,000 x 80,000 slide:
{
"x": 45000, // 45% from left
"y": 20000, // 25% from top
"width": 10000, // 10% of slide width
"height": 10000 // 12.5% of slide height
}
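The percentages in the comments above can be reproduced with a short sketch; `region_as_fractions` is a hypothetical helper written for this example.

```python
def region_as_fractions(region: dict, slide_width: int, slide_height: int) -> dict:
    """Express a Level-0 region as fractions of the slide's width and height."""
    return {
        "x": region["x"] / slide_width,
        "y": region["y"] / slide_height,
        "width": region["width"] / slide_width,
        "height": region["height"] / slide_height,
    }


region = {"x": 45000, "y": 20000, "width": 10000, "height": 10000}
print(region_as_fractions(region, 100_000, 80_000))
# {'x': 0.45, 'y': 0.25, 'width': 0.1, 'height': 0.125}
```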
Level Selection¶
WSIs are stored as image pyramids with multiple resolution levels:
Level 0: 100,000 x 80,000 (full resolution)
Level 1: 50,000 x 40,000 (2x downsampled)
Level 2: 25,000 x 20,000 (4x downsampled)
Level 3: 12,500 x 10,000 (8x downsampled)
GIANT automatically selects the optimal level to:
1. Avoid upsampling (blurry results)
2. Minimize downsampling (preserve detail)
3. Output at the target size S (e.g. 1000px)
from giant.core.level_selector import PyramidLevelSelector
from giant.geometry import Region
from giant.wsi import WSIReader

# Read slide metadata (dimensions and pyramid levels)
with WSIReader("slide.svs") as reader:
    metadata = reader.get_metadata()

# Choose the pyramid level for a 10,000 x 10,000 Level-0 region
selector = PyramidLevelSelector()
selected = selector.select_level(
    region=Region(x=45000, y=20000, width=10000, height=10000),
    metadata=metadata,
    target_size=1000,
    bias=0.85,
)
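The first two selection goals can be illustrated with a simplified, standalone sketch. It is not the `PyramidLevelSelector` implementation: the function below is hypothetical, assumes the downsample factors are sorted from finest to coarsest, and deliberately ignores the oversampling bias, whose exact semantics this page does not spell out.

```python
from typing import Sequence


def select_level(region_long_side: int, downsamples: Sequence[float], target_size: int = 1000) -> int:
    """Pick the coarsest level whose native crop is still at least target_size pixels.

    Falls back to level 0 when even full resolution is smaller than target_size.
    """
    best = 0
    for level, downsample in enumerate(downsamples):
        if region_long_side / downsample >= target_size:
            best = level  # a coarser level that still avoids upsampling
    return best


# A 10,000 px region with levels at 1x, 2x, 4x, 8x and a 1000 px target:
# level 3 yields 1250 px natively, which is then resized down to 1000 px.
print(select_level(10_000, [1, 2, 4, 8]))  # 3
```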
Error Handling¶
Invalid Coordinates¶
If the LLM provides out-of-bounds coordinates:
- Validate against slide dimensions
- Send error feedback to LLM
- Request corrected coordinates
- Retry up to max_retries times
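The validation and feedback steps might look like the sketch below. `validate_region` and its message wording are hypothetical; the returned string stands in for the error feedback sent to the LLM.

```python
from typing import Optional


def validate_region(x: int, y: int, width: int, height: int,
                    slide_width: int, slide_height: int) -> Optional[str]:
    """Return None if the region lies inside the slide, else an error message for the LLM."""
    if width <= 0 or height <= 0:
        return f"width and height must be positive, got {width} x {height}"
    if x < 0 or y < 0:
        return f"x and y must be non-negative, got ({x}, {y})"
    if x + width > slide_width or y + height > slide_height:
        return (f"region ends at ({x + width}, {y + height}), "
                f"outside the slide bounds ({slide_width}, {slide_height})")
    return None


print(validate_region(95000, 0, 10000, 10000, 100_000, 80_000))
# region ends at (105000, 10000), outside the slide bounds (100000, 80000)
```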
Parse Errors¶
If the LLM output can't be parsed:
- Log the raw output
- Increment error counter
- Retry with same context
- Fail after max_retries
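The parse-error path can be sketched as a retry wrapper. The names here are hypothetical, and the sketch assumes the model replies with a JSON object, which may not match GIANT's actual output format.

```python
import json
import logging
from typing import Callable, List

logger = logging.getLogger("giant.sketch")  # hypothetical logger name


def call_with_parse_retries(
    call_llm: Callable[[List[str]], str],
    context: List[str],
    max_retries: int = 3,
) -> dict:
    """Call the LLM with the same context until its output parses, or fail."""
    errors = 0
    while errors < max_retries:
        raw = call_llm(context)  # same context on every attempt
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed
            raise ValueError("expected a JSON object")
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            errors += 1
            logger.warning("unparseable LLM output (%d/%d): %r (%s)",
                           errors, max_retries, raw, exc)
    raise RuntimeError(f"LLM output could not be parsed after {max_retries} attempts")
```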
Cost Optimization¶
Each LLM call has a cost. GIANT optimizes by:
- Early termination: Answer as soon as evidence is sufficient
- Efficient context: Don't repeat full images in every turn
- Budget limits: Optional --budget-usd flag stops early
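A budget check of this kind could be sketched as below. `BudgetTracker` is a hypothetical class for illustration, and the per-call costs are made-up numbers; how GIANT actually computes cost is not described on this page.

```python
from typing import Optional


class BudgetTracker:
    """Accumulates per-call cost and reports when an optional USD budget is exhausted."""

    def __init__(self, budget_usd: Optional[float] = None) -> None:
        self.budget_usd = budget_usd  # None means no limit
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    @property
    def exhausted(self) -> bool:
        return self.budget_usd is not None and self.spent_usd >= self.budget_usd


tracker = BudgetTracker(budget_usd=0.10)
for cost in (0.04, 0.04, 0.04):  # illustrative per-call costs
    if tracker.exhausted:
        break  # stop navigating and move to the final answer
    tracker.record(cost)
```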
Visualization¶
After a run, visualize the trajectory:
Shows:
- Thumbnail with all crop regions overlaid
- Step-by-step reasoning
- Final answer
Implementation Reference¶
| Concept | File | Function/Class |
|---|---|---|
| Agent loop | agent/runner.py | GIANTAgent._navigation_loop |
| Context | agent/context.py | ContextManager |
| Thumbnail | wsi/reader.py | WSIReader.get_thumbnail |
| Axis guides | geometry/overlay.py | AxisGuideGenerator |
| Cropping | core/crop_engine.py | CropEngine.crop |
| Level selection | core/level_selector.py | PyramidLevelSelector |
Next Steps¶
- LLM Integration - How providers work
- Prompt Design - Navigation prompts
- Architecture - System design