Navigation Algorithm¶

This page explains GIANT's core navigation algorithm, based on Algorithm 1 from the paper.

Overview¶

GIANT navigates gigapixel images through an iterative process:

Show the LLM a low-resolution thumbnail with coordinate guides
LLM reasons about what to examine and outputs a crop action
Extract the requested region at high resolution
Repeat until the LLM has enough information to answer

Input: WSI W, question q, step limit T
Output: answer ŷ

1.  I₀ ← Thumbnail(W)                    # Generate thumbnail
2.  I₀ ← AddAxisGuides(I₀)               # Add coordinate markers
3.  C ← [(system_prompt, q, I₀)]         # Initialize context
4.
5.  for t = 1 to T-1 do                  # At most T-1 crops (paper)
6.      (rₜ, aₜ) ← LLM(C)                 # Get reasoning + action
7.
8.      if aₜ.type == "answer" then
9.          return aₜ.text               # Early termination
10.
11.     if aₜ.type == "crop" then
12.         Iₜ ← CropRegion(W, aₜ, S)    # Extract region
13.         C ← C ∪ [(rₜ, aₜ, Iₜ)]        # Add to context
14.
15. end for
16.
17. ŷ ← ForceAnswer(C)                   # Final step: must answer (with retries)
18. return ŷ

Step-by-Step Breakdown¶

Step 1: Thumbnail Generation¶

# Generate thumbnail fitting within max_size
thumbnail = reader.get_thumbnail((1024, 1024))

The thumbnail is a low-resolution overview of the entire slide. A 100,000 x 80,000 pixel slide becomes roughly 1024 x 820 pixels.

Step 2: Axis Guides¶

# Add Level-0 coordinate markers
navigable = overlay_service.create_navigable_thumbnail(thumbnail, metadata)

Red lines with pixel coordinate labels are overlaid:

     0        25000      50000      75000     100000
     │          │          │          │          │
  ───┼──────────┼──────────┼──────────┼──────────┼───
     │          │          │          │          │
     │    ┌─────────────────────┐     │          │
     │    │  Tissue visible     │     │          │
25K ─┼────│  in this region     │─────┼──────────┼───
     │    │                     │     │          │
     │    └─────────────────────┘     │          │
     │          │          │          │          │
50K ─┼──────────┼──────────┼──────────┼──────────┼───

The LLM uses these markers to specify exact coordinates.

Step 3: Context Initialization¶

context = ContextManager(
    wsi_path=wsi_path,
    question=question,
    max_steps=20,
)

The initial context includes: - System prompt (navigation instructions) - User question - Thumbnail image with axis guides

Each iteration:

Build messages from context (system, user turns, assistant turns)
Call LLM with multimodal input (text + images)
Parse response into reasoning + action
Execute action:
If crop: Extract region, add to context
If answer: Return immediately

Step 17: Force Answer¶

If the LLM reaches the step limit without answering:

force_prompt = """
You have reached the maximum number of navigation steps ({max_steps}).
Based on all the regions you have examined, you MUST now provide your final answer.
"""

The agent retries up to 3 times to get an answer action.

Key Parameters¶

Parameter	Default	Description
`T` (max_steps)	20	Maximum navigation steps
`S` (target_size)	1000 (OpenAI) / 500 (Anthropic)	Output crop long-side (provider-specific)
Thumbnail size	1024	Maximum thumbnail dimension
Max retries	3	Retries for invalid coordinates
Oversampling bias	0.85	Bias toward finer pyramid levels

Coordinate System¶

All coordinates use Level-0 (full resolution) pixel space:

x: Horizontal position from left edge
y: Vertical position from top edge
width, height: Size of region to extract

Example for a 100,000 x 80,000 slide:

{
  "x": 45000,      // 45% from left
  "y": 20000,      // 25% from top
  "width": 10000,  // 10% of slide width
  "height": 10000  // 12.5% of slide height
}

Level Selection¶

WSIs are stored as image pyramids with multiple resolution levels:

Level 0: 100,000 x 80,000 (full resolution)
Level 1:  50,000 x 40,000 (2x downsampled)
Level 2:  25,000 x 20,000 (4x downsampled)
Level 3:  12,500 x 10,000 (8x downsampled)

GIANT automatically selects the optimal level to: 1. Avoid upsampling (blurry results) 2. Minimize downsampling (preserve detail) 3. Output at target size (1000px)

from giant.core.level_selector import PyramidLevelSelector
from giant.geometry import Region
from giant.wsi import WSIReader

with WSIReader("slide.svs") as reader:
    metadata = reader.get_metadata()

selector = PyramidLevelSelector()
selected = selector.select_level(
    region=Region(x=45000, y=20000, width=10000, height=10000),
    metadata=metadata,
    target_size=1000,
    bias=0.85,
)

Error Handling¶

Invalid Coordinates¶

If the LLM provides out-of-bounds coordinates:

Validate against slide dimensions
Send error feedback to LLM
Request corrected coordinates
Retry up to max_retries times

Parse Errors¶

If the LLM output can't be parsed:

Log the raw output
Increment error counter
Retry with same context
Fail after max_retries

Cost Optimization¶

Each LLM call has a cost. GIANT optimizes by:

Early termination: Answer as soon as evidence is sufficient
Efficient context: Don't repeat full images in every turn
Budget limits: Optional --budget-usd flag stops early

Visualization¶

After a run, visualize the trajectory:

giant visualize trajectory.json

Shows: - Thumbnail with all crop regions overlaid - Step-by-step reasoning - Final answer

Implementation Reference¶

Concept	File	Function/Class
Agent loop	`agent/runner.py`	`GIANTAgent._navigation_loop`
Context	`agent/context.py`	`ContextManager`
Thumbnail	`wsi/reader.py`	`WSIReader.get_thumbnail`
Axis guides	`geometry/overlay.py`	`AxisGuideGenerator`
Cropping	`core/crop_engine.py`	`CropEngine.crop`
Level selection	`core/level_selector.py`	`PyramidLevelSelector`

Next Steps¶

LLM Integration - How providers work
Prompt Design - Navigation prompts
Architecture - System design