
GIANT Prompt Design

Status: Paper-derived (pending Supplementary Material verification)

This document captures all prompt requirements extractable from the GIANT paper. When the authors release official prompts, we will compare and update.

Enhancement Note: Section "2025 Pathology VLM Best Practices" contains domain enhancements beyond the paper. These are clearly marked and separable.

Paper Evidence Summary

| Requirement | Source | Line | Confidence |
| --- | --- | --- | --- |
| Crop budget communicated | Algorithm 1 | 156 | High |
| Level-0 coordinate system | Sec 4.1 | 134 | High |
| Axis guides explained | Sec 4.1 | 134 | High |
| Output format (x, y, w, h) | Algorithm 1 | 159 | High |
| Final answer enforcement | Fig 5 caption | 200 | High |
| Reasoning per step | Algorithm 1 | 159 | High |
| Thumbnail size 1024px | Sec 4.2.1 | 183 | High |

Required Prompt Components (High Confidence)

1. Crop Budget

Evidence (Algorithm 1, line 156):

P_0 = {q, nav instructions + "at most T-1 crops"};

The initial prompt must tell the model how many crops it can make.

Implementation:

You have at most {max_steps - 1} crops to explore before providing your final answer.
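As a concrete sketch, the budget line might be rendered like this (the helper name is hypothetical; the actual template lives in src/giant/prompts/templates.py):

```python
def render_budget_line(max_steps: int) -> str:
    """Render the crop-budget sentence for the initial prompt.

    Algorithm 1 allots T steps in total: up to T-1 crops, with the
    final step reserved for the answer action.
    """
    crop_budget = max_steps - 1
    return (
        f"You have at most {crop_budget} crops to explore "
        "before providing your final answer."
    )
```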

2. Level-0 Coordinate System

Evidence (Section 4.1, line 134):

"To orient the model, the thumbnail is overlaid with four evenly spaced axis guides along each dimension, labeled with absolute level-0 pixel coordinates."

The model must understand:

- Coordinates are absolute (Level-0 = the slide's full resolution)
- Axis guides show these coordinates visually
- All bounding boxes use this system

Implementation:

The image has AXIS GUIDES overlaid - red lines labeled with ABSOLUTE LEVEL-0 PIXEL COORDINATES.
All coordinates you output must use this Level-0 system (the slide's native resolution).
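The paper does not say how the guides are rendered. Below is one plausible implementation with Pillow, assuming the caller knows the slide's level-0 dimensions; all names are illustrative, not the repo's actual code:

```python
from PIL import Image, ImageDraw

def overlay_axis_guides(thumb: Image.Image, level0_w: int, level0_h: int) -> Image.Image:
    """Draw four evenly spaced guides per axis, labeled with
    absolute level-0 pixel coordinates (Sec 4.1)."""
    out = thumb.copy()
    draw = ImageDraw.Draw(out)
    tw, th = out.size
    for i in range(1, 5):  # four guides at 1/5 .. 4/5 of each axis
        frac = i / 5
        x_thumb, y_thumb = int(frac * tw), int(frac * th)
        # Vertical guide, labeled with its level-0 x coordinate
        draw.line([(x_thumb, 0), (x_thumb, th)], fill="red", width=1)
        draw.text((x_thumb + 2, 2), str(int(frac * level0_w)), fill="red")
        # Horizontal guide, labeled with its level-0 y coordinate
        draw.line([(0, y_thumb), (tw, y_thumb)], fill="red", width=1)
        draw.text((2, y_thumb + 2), str(int(frac * level0_h)), fill="red")
    return out
```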

3. Bounding Box Output Format

Evidence (Algorithm 1, line 159):

(r_t, a_t) ← LMM(C) ; // a_t = (x, y, w, h)

Each step produces reasoning + action, where action is a 4-tuple.

Implementation (this repo):

The paper does not specify a serialization format (e.g., JSON) for (x, y, w, h). In this repo, response structure is enforced externally by provider integrations (OpenAI structured output / Anthropic tool use) using StepResponse:

```json
{
  "reasoning": "I observe ...",
  "action": {
    "action_type": "crop",
    "x": 10000,
    "y": 20000,
    "width": 5000,
    "height": 5000
  }
}
```

See: src/giant/llm/protocol.py (StepResponse, BoundingBoxAction, FinalAnswerAction).
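For illustration, a minimal pydantic sketch of what that schema could look like (the real definitions are in src/giant/llm/protocol.py and may differ):

```python
from typing import Literal, Union
from pydantic import BaseModel

class BoundingBoxAction(BaseModel):
    action_type: Literal["crop"]
    x: int       # absolute level-0 pixel coordinates
    y: int
    width: int
    height: int

class FinalAnswerAction(BaseModel):
    action_type: Literal["answer"]
    text: str

class StepResponse(BaseModel):
    reasoning: str
    action: Union[BoundingBoxAction, FinalAnswerAction]
```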

4. Final Answer Enforcement

Evidence (Figure 5 caption, line 200):

"We use a system prompt to enforce that the model provide its final response after a specific number of iterations, marking a trial incorrect if the model exceeds this limit after 3 retries."

The prompt must explicitly require the model to answer on the final step.

Implementation:

On step {max_steps}, you MUST provide your final answer using the `answer` action.
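The paper does not give the enforcement mechanics beyond the caption; a minimal sketch of one plausible retry policy (all names are hypothetical):

```python
def enforce_final_answer(call_model, messages, max_retries: int = 3):
    """Re-prompt until the model emits an `answer` action on the
    final step; after 3 failed retries the trial is marked
    incorrect (Figure 5 caption)."""
    for attempt in range(max_retries + 1):
        response = call_model(messages)
        if response.action.action_type == "answer":
            return response
        messages = messages + [
            "Invalid action: you MUST answer now using the `answer` action."
        ]
    return None  # caller marks the trial incorrect
```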

5. Two Action Types

Evidence (Algorithm 1, lines 151-160; Section 4.1, line 140):

Algorithm 1 defines repeated crop selection via bounding boxes, with a final answer returned once navigation ends. Section 4.1 describes the loop explicitly:

"The environment returns the next image I_{t+1} = CropRegion(W, a_t, S), repeating until a step limit T or early stop."

This repo represents this contract with two action modes:

- crop(x, y, w, h): Continue navigation by selecting the next region of interest.
- answer(text): Terminate navigation and provide the final answer.

An environment-side sketch of CropRegion follows the implementation snippet below.

Implementation:

Your available actions:
1. crop(x, y, width, height) - Zoom into a region for more detail
2. answer(text) - Provide your final answer to the question
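On the environment side, CropRegion(W, a_t, S) maps naturally onto OpenSlide's read_region, whose location argument is already expressed in level-0 coordinates. A sketch (the resize policy is an assumption; a production version would pick an appropriate pyramid level instead of always reading level 0):

```python
import openslide
from PIL import Image

def crop_region(slide: openslide.OpenSlide, x: int, y: int,
                w: int, h: int, view_size: int = 1024) -> Image.Image:
    """Read the level-0 region a_t = (x, y, w, h) and downscale it
    to the viewing resolution S, mirroring CropRegion(W, a_t, S)."""
    region = slide.read_region((x, y), 0, (w, h)).convert("RGB")
    region.thumbnail((view_size, view_size))  # in-place, preserves aspect ratio
    return region
```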


Inferred Components (Medium Confidence)

These are not explicitly stated but are strongly implied by the paper's methodology.

6. Pathology Domain Context

The paper evaluates on pathology slides, so the model should know it is analyzing tissue.

Implementation:

You are analyzing a Whole Slide Image (WSI) - a gigapixel pathology slide.

7. Thumbnail vs Crop Distinction

Figure 4 shows the agent receives different image types:

- Initial: a low-resolution thumbnail (blurry)
- After a crop: a higher-resolution view of the selected region

Implementation:

The initial view is a low-resolution thumbnail. Cellular details require zooming in.
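Generating that overview is straightforward with OpenSlide; a sketch using the 1024px size reported at line 183:

```python
import openslide

def make_thumbnail(slide_path: str, max_side: int = 1024):
    """Produce the low-resolution overview the agent sees first;
    get_thumbnail fits the slide within (max_side, max_side)
    while preserving aspect ratio."""
    slide = openslide.OpenSlide(slide_path)
    return slide.get_thumbnail((max_side, max_side))
```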

8. Iterative Refinement

Algorithm 1 shows a loop: observe → reason → act → observe... A sketch of the full loop follows the implementation snippet below.

Implementation:

At each step:
1. Analyze the current image
2. Explain your reasoning
3. Choose to crop (for more detail) or answer (if sufficient evidence)
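Tying the pieces together, the loop reduces to a few lines. The helpers reference the hypothetical sketches above; call_model and initial_prompt are likewise placeholders, not the repo's actual API:

```python
import openslide

def navigate(slide_path: str, question: str, max_steps: int):
    """Observe -> reason -> act loop from Algorithm 1 (sketch)."""
    slide = openslide.OpenSlide(slide_path)
    w, h = slide.dimensions  # level-0 width and height
    image = overlay_axis_guides(make_thumbnail(slide_path), w, h)
    context = [initial_prompt(question, max_steps), image]
    for step in range(1, max_steps + 1):
        response = call_model(context)  # returns a StepResponse
        if response.action.action_type == "answer":
            return response.action.text  # early stop
        a = response.action
        image = crop_region(slide, a.x, a.y, a.width, a.height)
        context += [response.reasoning, image]
    return None  # budget exhausted; enforcement kicks in upstream
```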


2025 Pathology VLM Best Practices (Domain Enhancement)

Note: This section contains enhancements derived from Dec 2025 literature review. These are NOT from the GIANT paper and can be removed for strict paper reproduction.

Literature Sources

Derived from literature search: "whole slide image LLM prompting 2025", "GPT-4V medical image navigation"

  1. Anatomical Precision & Domain Vocabulary
     - Insight: Generic "describe the image" prompts perform poorly.
     - Action: Use specific pathology terms (e.g., "stroma", "nuclei", "architecture", "mitotic figures") in the system prompt to prime the model's vocabulary.

  2. Hierarchical Observation (Multi-Scale)
     - Insight: Pathologists scan at low power (architecture) before high power (cellular detail).
     - Action: Explicitly instruct the model to follow this "Architecture -> Cellular" observation flow.

  3. Visual Anchoring
     - Insight: Models hallucinate less when forced to reference specific coordinates or visual markers.
     - Action: Reinforce the "Axis Guides" instruction and ask the model to cite coordinates in its reasoning.

  4. Role & Goal Specificity
     - Insight: "You are a pathologist" is good; "You are a pathologist diagnosing cancer grade" is better.
     - Action: Adapt the prompt to the specific task type when known (e.g., QA vs. diagnosis).

Gap Analysis (Current Implementation)

| Current State | Gap | Fix Applied |
| --- | --- | --- |
| Generic "Analyze image" | Lacks domain specificity | Added "Scan for architectural patterns, then cellular details" |
| "Provide reasoning" | Unstructured thought process | Enforced "Observation -> Reasoning -> Action" structure |
| "Low-res thumbnail" | Doesn't explain why zoom is needed | Explained "Low-res = Architecture only; High-res = Cellular" |
| No coordinate citing | Hallucination risk | Ask the model to "Reference coordinates in reasoning" |

What We Cannot Determine

Without the Supplementary Material, we cannot verify:

  1. Exact wording - The precise phrases used
  2. Provider differences - Whether OpenAI/Anthropic prompts differ
  3. Additional constraints - Any rules not mentioned in the main text
  4. Retry instructions - How the 3-retry policy is communicated
  5. Question formatting - How the user's question is presented

Current Implementation

See: src/giant/prompts/templates.py

Our templates implement all high-confidence requirements:

- Crop budget (Step X of Y, N crops remaining)
- Level-0 coordinate system (axis guides explanation)
- Bounding box format (crop action with x, y, width, height)
- Final answer enforcement (MUST use answer on final step)
- Two action types (crop, answer)

Plus domain enhancements (currently always included; remove from templates for strict reproduction):

- Hierarchical analysis workflow
- Pathology-specific vocabulary
- Coordinate referencing in reasoning
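One way to make that separation mechanical rather than manual is a flag in templates.py; a hedged sketch (the constants and flag are hypothetical, not the current implementation):

```python
# Hypothetical split between paper-required text and 2025 enhancements.
PAPER_CORE = (
    "You are analyzing a Whole Slide Image (WSI).\n"
    "You have at most {crop_budget} crops before your final answer.\n"
    "All coordinates use the absolute Level-0 pixel system shown by "
    "the red axis guides."
)

DOMAIN_ENHANCEMENTS = (
    "Scan for architectural patterns first, then cellular details.\n"
    "Structure each step as Observation -> Reasoning -> Action.\n"
    "Reference axis-guide coordinates in your reasoning."
)

def build_system_prompt(crop_budget: int, strict_paper_mode: bool = False) -> str:
    """Assemble the system prompt; strict_paper_mode drops the
    enhancements for faithful paper reproduction."""
    prompt = PAPER_CORE.format(crop_budget=crop_budget)
    if not strict_paper_mode:
        prompt += "\n" + DOMAIN_ENHANCEMENTS
    return prompt
```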


Verification Plan

When Supplementary Material becomes available:

  1. Compare official prompts to our templates
  2. Document any differences
  3. Update templates to match (or document intentional divergence)
  4. Add regression tests for paper-required invariants
  5. Decide whether to keep or remove domain enhancements

References

  • GIANT paper: _literature/markdown/giant/giant.md
  • Algorithm 1: lines 144-161
  • Section 4.1 (Axis guides): line 134
  • Figure 5 caption (step enforcement): line 200
  • Baseline thumbnail size: line 183
  • Current implementation: src/giant/prompts/templates.py
  • Bug tracking: docs/_bugs/BUG-020-placeholder-prompts.md
  • 2025 Literature: Web search "whole slide image LLM prompting 2025"