Master Qualifications & Build Decisions

Companion document to docs/_specs/master-vision.md. This captures scope decisions, risk mitigations, and concrete choices that guide the actual v1 implementation. The master vision remains the vision; this document is the pragmatic filter.

Status (2026-01-23): v1.0 through v2.1 have shipped and the corresponding specs are archived. Treat this document as a record of v1-era scoping decisions (the “why”), not as the authoritative current roadmap (the “what”). For current status, use docs/_specs/README.md.


1) Scope Triage: What's Actually in V1

The master draft describes many capabilities. Here's what we're actually building first:

V1.0 (Must Ship)

| Component | Scope | Notes |
| --- | --- | --- |
| Dataset loading | Full | Parse enriched problems YAML, validate via Pydantic models |
| erdos list | Full | With filters (status, prize, tags) |
| erdos show | Full | Display single problem |
| erdos refs | Basic | Show refs from YAML only (no enrichment) |
| erdos lean init | Full | Set up Lean project, fetch mathlib |
| erdos lean check | Full | Compile file, parse errors to JSON |
| erdos lean formalize | Basic | Template-based skeleton generation |
| SQLite FTS | Basic | Problem statements only, no chunking |
| erdos search | Basic | BM25 only, no vectors |

V1.1 (After Core Works)

| Component | Scope | Notes |
| --- | --- | --- |
| erdos ingest | arXiv source + Crossref metadata | arXiv source tarball + extract; Crossref metadata for DOI; no PDF conversion |
| Reference enrichment | Crossref + arXiv API | Crossref for DOI metadata, arXiv for source content |
| Full-text indexing | Deferred | v1 index/search uses problem statements/notes only; ingested extracts are stored for a future indexing spec |
| erdos ask | Basic RAG | Retrieval + LLM prompt, no reranking |

V1.2+ (OpenAlex Integration)

| Component | Scope | Notes |
| --- | --- | --- |
| OpenAlex as primary | Metadata for all DOI/arXiv lookups | Replaces Crossref as primary; includes citations/topics |
| Crossref as fallback | Edge cases only | When OpenAlex returns nothing |
| arXiv for content | Source tarballs only | Default uses OpenAlex for metadata; legacy --source arxiv may still call the export API |
| MetadataProvider protocol | Abstraction layer | Enables pluggable sources (see DEBT-038) |

V1.3+ (Deferred at time of v1 planning)

| Component | Why Deferred |
| --- | --- |
| Vector embeddings | BM25 is surprisingly good for technical text; adds complexity |
| PDF conversion (deferred to v2.0+) | Heavy dependency + edge case hell; use arXiv HTML/LaTeX in v1; optional [pdf] extra (Marker) is v2.0+ |
| MCP server | CLI + Claude skills sufficient for now |
| erdos loop run automation | Need manual workflow validated first |
| Postgres/pgvector | SQLite handles our scale |
| Qdrant | Same as above |
| Semantic Scholar integration | "Good redundancy" - adds citation context not in OpenAlex |
| Exa Research API | Agentic literature synthesis - future enhancement |
| zbMATH integration | Math-specific metadata - low priority |

Decision: We ship when erdos show 6 && erdos lean init && erdos lean formalize 6 && erdos lean check works end-to-end. That's the real v1.0 milestone.

V4.2+ (Lead Enrichment Pipeline)

| Component | Scope | Notes |
| --- | --- | --- |
| erdos research lead enrich | Fetch metadata for leads | Uses FallbackProvider (OpenAlex → Crossref) |
| erdos research lead ingest | Add leads to manifest | Deduplication by DOI/arXiv ID |
| Unified pipeline docs | Architecture diagrams | Full discovery → proof flow documented |

Gap addressed: Discovery tools (Exa, zbMATH, S2) find papers and extract DOIs, but those leads weren't connected to the enrichment/manifest pipeline. SPEC-036 bridges this gap.


2) True Vertical Slice (Day 1-3)

The master draft's "vertical slice" was actually horizontal. Here's the real vertical slice:

# This should work after 2-3 days of coding:
erdos show 6                           # Display problem 6 from YAML
erdos lean init                        # Set up Lean project with mathlib
erdos lean formalize 6                 # Generate Problem006.lean skeleton
erdos lean check Erdos/Problem006.lean # Verify it compiles with sorry

No network. No APIs. No embeddings. No LLM.

Just: can we read the data and produce a Lean file that compiles?

Pre-Coding Manual Workflow

Before writing any code, manually execute this workflow for Problem 6:

  1. Open problems.yaml from teorth/erdosproblems, find Problem 6
  2. Note exact field names and structure (do they match our Pydantic models?)
  3. Create a minimal data/problems_enriched.yaml entry for Problem 6 (title + statement + status)
  4. Google its references, find the relevant arXiv paper
  5. Read the paper, understand what the theorem actually states
  6. Check if Problem 6 is in the Formal Conjectures Repository (verify current coverage at implementation time)
  7. Write a Lean file by hand that states the theorem
  8. Compile it against mathlib4, note what imports are needed
  9. Document friction points encountered

This manual pass will reveal problems the spec doesn't anticipate.
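
The workflow above can be rehearsed in code. Here is a minimal sketch of the validation behind step 3, using a stdlib dataclass as a stand-in for the real Pydantic ProblemRecord; the field names (id, title, statement, status) are assumptions taken from the steps above, not the settled schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemRecord:
    """Stand-in for the Pydantic model planned for core/models.py."""
    id: int
    title: str
    statement: str
    status: str


def load_problem(raw: dict) -> ProblemRecord:
    """Validate one enriched-YAML entry; fail loudly on missing fields."""
    missing = {"id", "title", "statement", "status"} - raw.keys()
    if missing:
        raise ValueError(f"problem entry missing fields: {sorted(missing)}")
    return ProblemRecord(id=int(raw["id"]), title=raw["title"],
                         statement=raw["statement"], status=raw["status"])
```

Running this against a hand-written Problem 6 entry is a cheap way to confirm the field names before committing to the Pydantic schema.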


3) Dependency Decisions (No More "Maybe")

The master draft has too many "possibly X or Y" statements. Here are hard decisions:

| Choice | Decision | Rationale |
| --- | --- | --- |
| CLI framework | Typer + Rich | Not Click. Typer is faster to write, Rich output is nice |
| Package manager | uv | PEP 621 + uv.lock, fast, single toolchain |
| Database | SQLite with FTS5 | Not Postgres. Not until we have 10k+ documents |
| Vector store | None for v1 | BM25 only. Add vectors in v1.2 if retrieval quality demands it |
| Embedding model | Skip for v1 | If we add later: SPECTER2 (scientific text) over MiniLM |
| PDF conversion | Skip for v1 | arXiv HTML + LaTeX source only. pdf extra reserved for v2.0+ (Marker, see Spec 019) |
| Lean version | leanprover/lean4:v4.12.0 | Pin exact version in lean-toolchain |
| mathlib4 | Pin to tag matching Lean | Use mathlib4 tag matching lean-toolchain (e.g. v4.12.0) |
| LLM integration | Claude Code environment | Not OpenAI API for v1. Claude skills + shell commands |
| MCP | Skip for v1 | CLI is sufficient |

4) PDF Conversion Strategy

PDF conversion is out of scope for v1. The [pdf] extra is reserved for v2.0+.

Content Acquisition Strategy (Ordered by Priority)

  1. arXiv HTML (post-Dec 2023 papers) - Clean, structured, math preserved
  2. arXiv source tarball - LaTeX → text conversion is tractable
  3. Abstract only (via OpenAlex/Crossref) - For non-arXiv papers
  4. Metadata only - For paywalled content

If a paper isn't on arXiv and isn't open access, we record metadata and move on. We don't try to solve PDF extraction in v1.
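
The priority chain above can be sketched as a first-success loop; the tier names and the fetcher signature (takes a reference id, returns extracted text or None on failure) are illustrative assumptions:

```python
from typing import Callable, List, Optional, Tuple

# A fetcher takes a reference id and returns extracted text, or None.
Fetcher = Callable[[str], Optional[str]]


def acquire_content(ref_id: str,
                    fetchers: List[Tuple[str, Fetcher]],
                    ) -> Tuple[str, Optional[str]]:
    """Try each tier in priority order; record which tier succeeded."""
    for tier, fetch in fetchers:
        text = fetch(ref_id)
        if text:
            return tier, text
    # Nothing worked: record metadata and move on, per the policy above.
    return "metadata-only", None
```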

v2.0+ PDF Tooling Decision (Updated 2026-01-19)

Selected: Marker (GPL) - best quality for math PDF with LLM enhancement

Rationale for GPL exception:

  • Marker is the only tool with both MIT-quality AND excellent math support
  • Docling (MIT) is blocked by a typer version conflict
  • PyMuPDF (AGPL) has stricter copyleft requirements
  • GPL is acceptable only as an opt-in extra; distributing builds that include it must comply with GPL obligations (core remains permissive)

Install path: uv sync --extra pdf will install Marker when Spec 019 is implemented.

See docs/_archive/specs/spec-019-pdf-conversion.md for full details.


5) API Orchestration Strategy (Robert C. Martin Principles)

Single Responsibility Principle (SRP)

Each data source has ONE job:

| API | Responsibility | NOT Responsible For |
| --- | --- | --- |
| OpenAlex | Metadata (title, authors, abstract, citations, topics) | Content extraction |
| arXiv | Source content (LaTeX/TeX tarballs, HTML) | Metadata (OpenAlex has it) |
| Crossref | DOI fallback (edge cases only) | Primary lookups |

Dependency Inversion Principle (DIP)

Implementation status: Target architecture. The MetadataProvider port is tracked as docs/_archive/debt/debt-038-metadata-provider-abstraction.md.

┌─────────────────────────────────────────────────────────────┐
│                    High-Level Policy                        │
│                  (IngestService, AskService)                │
└───────────────────────────┬─────────────────────────────────┘
                            │ depends on abstraction
┌─────────────────────────────────────────────────────────────┐
│               MetadataProvider (Protocol/Port)              │
│  get_by_doi(doi) -> ReferenceRecord                         │
│  get_by_arxiv(arxiv_id) -> ReferenceRecord                  │
│  search(query) -> List[ReferenceRecord]                     │
└───────────────────────────┬─────────────────────────────────┘
                            │ implemented by
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ OpenAlexProv │   │ CrossrefProv │   │  FutureProv  │
│  (Primary)   │   │  (Fallback)  │   │ (Pluggable)  │
└──────────────┘   └──────────────┘   └──────────────┘
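
The port in the diagram can be sketched as a Python Protocol. The method names mirror the diagram; the Optional returns (to model "OpenAlex returns nothing"), the ReferenceRecord fields, and the FallbackProvider wiring are our assumptions, not settled API:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class ReferenceRecord:
    """Placeholder fields; the real model would live in core/models.py."""
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    authors: List[str] = field(default_factory=list)


class MetadataProvider(Protocol):
    """The port: IngestService/AskService depend only on this."""
    def get_by_doi(self, doi: str) -> Optional[ReferenceRecord]: ...
    def get_by_arxiv(self, arxiv_id: str) -> Optional[ReferenceRecord]: ...
    def search(self, query: str) -> List[ReferenceRecord]: ...


class FallbackProvider:
    """Tries providers in order, e.g. OpenAlex first, then Crossref."""
    def __init__(self, providers: List[MetadataProvider]) -> None:
        self.providers = providers

    def get_by_doi(self, doi: str) -> Optional[ReferenceRecord]:
        for provider in self.providers:
            record = provider.get_by_doi(doi)
            if record is not None:
                return record
        return None
```

Adding a new source later means writing one more class that satisfies the Protocol; nothing upstream changes.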

Open/Closed Principle (OCP)

Adding new sources (Semantic Scholar, Exa, zbMATH) = new provider implementation. No changes to existing code.

Why Crossref is Demoted to Fallback

  1. OpenAlex already includes Crossref data - calling both is redundant
  2. OpenAlex has built-in deduplication - matches preprint ↔ published version
  3. OpenAlex adds value - citations, topics, institutions, concepts
  4. More generous anonymous quota than Crossref’s polite pool

Deduplication Strategy

We avoid heavy custom deduplication logic.

OpenAlex often resolves cross-source duplicates internally:

  • arXiv preprint ↔ journal version matching
  • Multiple DOIs for the same work
  • Cross-repository deduplication

We still enforce stable keys and duplicate guards within our own manifests/datasets to avoid accidental overwrites or churn.
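
The stable-key guard can be sketched as key normalization; the rules here (strip resolver prefixes, lowercase DOIs, drop arXiv version suffixes) are assumptions for illustration, not a spec:

```python
import re
from typing import Optional


def stable_key(doi: Optional[str] = None,
               arxiv_id: Optional[str] = None) -> str:
    """Build a manifest key so re-ingesting the same work is a no-op."""
    if doi:
        # DOIs are case-insensitive; strip https://doi.org/-style prefixes.
        bare = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip())
        return "doi:" + bare.lower()
    if arxiv_id:
        # Drop the version suffix so v1/v2 of a preprint collide.
        return "arxiv:" + re.sub(r"v\d+$", "", arxiv_id.strip().lower())
    raise ValueError("need a DOI or an arXiv id")
```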

"Good Redundancy" vs "Bad Redundancy"

Bad redundancy (calling multiple APIs for the same data):

  • ❌ Crossref for DOI metadata + OpenAlex for DOI metadata

Good redundancy (calling APIs for DIFFERENT data):

  • ✅ Semantic Scholar: Citation context (who cites what and WHY)
  • ✅ Exa Research: Natural language synthesis (agentic queries)
  • ✅ zbMATH: Math-specific metadata not in general databases

Migration Path

| Version | State |
| --- | --- |
| v1.1 | Crossref + arXiv (current) |
| v1.2 | OpenAlex primary + arXiv content + Crossref fallback |
| v1.3+ | Add "good redundancy" sources (Semantic Scholar, etc.) |
| v2.0 | Consider removing Crossref entirely (OpenAlex includes it) |

6) Lean Formalization Strategy (Expanded)

The master draft is light on formalization details. This is the hardest part.

Phase 1: Manual Template Development

Before automating anything:

  1. Pull Formal Conjectures Repository - Many statements already formalized (verify current coverage at implementation time)
  2. Manually write 5 Problem*.lean files for target problems (4, 6, 67, 123, 316)
  3. Document patterns:
     • What mathlib imports are commonly needed?
     • How are number theory statements structured?
     • How are combinatorics statements structured?
     • What's the minimal compilable skeleton?

Phase 2: Template-Based Generation

Use discovered patterns to build templates:

-- Template: Number Theory (prime-related)
import Mathlib.NumberTheory.Primes

/-- Problem {id}: {title}
{statement}
-/
theorem problem_{id} : {formal_statement} := by
  sorry
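
A stand-in renderer for the template above, using str.format where v1 proper would render templates/lean_skeleton.j2 with Jinja2; defaulting formal_statement to True is an assumption that keeps generated skeletons compilable until a human supplies the real statement:

```python
# The {id}/{title}/{statement}/{formal_statement} placeholders match the
# template shown above; str.format is a stand-in for the Jinja2 rendering.
LEAN_TEMPLATE = """\
-- Template: Number Theory (prime-related)
import Mathlib.NumberTheory.Primes

/-- Problem {id}: {title}
{statement}
-/
theorem problem_{id} : {formal_statement} := by
  sorry
"""


def render_skeleton(problem_id: int, title: str, statement: str,
                    formal_statement: str = "True") -> str:
    """Fill the skeleton; `sorry` leaves the proof open but compilable."""
    return LEAN_TEMPLATE.format(id=problem_id, title=title,
                                statement=statement,
                                formal_statement=formal_statement)
```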

Phase 3: LLM-Assisted (V1.2+)

Only after templates work:

  • Feed problem statement + template to LLM
  • LLM proposes the {formal_statement} portion
  • Human verifies it compiles and means what the problem says

Key insight: A skeleton that compiles but states the wrong theorem is worse than no skeleton. Correctness > automation.


7) GitHub Issues Breakdown

The master draft's 15 issues are too coarse. Here's a finer breakdown:

Epic 1: Project Scaffolding (Issues 1-4)

  1. Repo structure and uv setup - pyproject.toml, uv.lock, directory tree, .gitignore
  2. Typer CLI skeleton - erdos --version, --help, global flags
  3. Config file parser - YAML config loading, env var precedence
  4. Logging infrastructure - Structured JSON logs to logs/
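
For the config parser, one plausible precedence sketch (the resolution order and the ERDOS_ env var prefix are assumptions): environment variable beats config file beats built-in default:

```python
import os
from typing import Any, Dict


def resolve(key: str, file_config: Dict[str, Any], default: Any = None) -> Any:
    """Env var (ERDOS_<KEY>) beats config file beats built-in default."""
    env_value = os.environ.get("ERDOS_" + key.upper())
    if env_value is not None:
        return env_value
    return file_config.get(key, default)
```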

Epic 2: Dataset Integration (Issues 5-8)

  5. Git submodule setup - teorth/erdosproblems integration
  6. YAML parser for enriched problems - Load v1 enriched YAML; detect upstream metadata-only format with a clear error
  7. Pydantic models - ProblemRecord, ReferenceRecord
  8. erdos list command - With filters, table output, JSON output

Epic 3: Problem Display (Issues 9-10)

  9. erdos show command - Full problem display
  10. erdos refs command - Reference listing from YAML

Epic 4: Lean Integration (Issues 11-15)

  11. Lean project structure - lakefile.lean, lean-toolchain, directory setup
  12. erdos lean init command - Detect elan, run lake update
  13. erdos lean check command - Run lake build, capture stderr
  14. Lean error parser - Regex for line:col:message format
  15. erdos lean formalize command - Template-based skeleton generation
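
The Lean error parser above can be sketched with one regex, assuming diagnostics follow the common file:line:col: severity: message shape; verify the exact format against the pinned toolchain before relying on it:

```python
import re
from typing import Dict, List

# Assumed diagnostic shape: "<file>:<line>:<col>: <severity>: <message>".
ERROR_RE = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<severity>error|warning): (?P<message>.*)$",
    re.MULTILINE,
)


def parse_lean_errors(output: str) -> List[Dict[str, object]]:
    """Turn `lake build` output into JSON-ready dicts for erdos lean check."""
    return [
        {**m.groupdict(), "line": int(m["line"]), "col": int(m["col"])}
        for m in ERROR_RE.finditer(output)
    ]
```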

Epic 5: Basic Search (Issues 16-18)

  16. SQLite schema - Problems table, FTS5 virtual table
  17. Index builder - Populate from problem statements
  18. erdos search command - BM25 query, snippet extraction
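
The three Epic 5 pieces fit in a few lines of stdlib sqlite3 (assuming the build ships FTS5); the schema, column names, and sample rows here are illustrative, not the final spec:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the real index lives under index/
con.execute("CREATE VIRTUAL TABLE problems_fts USING fts5(id, title, statement)")
con.executemany(
    "INSERT INTO problems_fts VALUES (?, ?, ?)",
    [("6", "Covering systems",
      "Does a covering system with distinct moduli and arbitrarily "
      "large least modulus exist?"),
     ("4", "Prime gaps",
      "Estimate the size of gaps between consecutive primes.")],
)
rows = con.execute(
    # bm25() scores lower-is-better, so ascending ORDER BY ranks best first;
    # snippet() highlights matches in the statement column (index 2).
    "SELECT id, title, snippet(problems_fts, 2, '[', ']', '…', 8) "
    "FROM problems_fts WHERE problems_fts MATCH ? "
    "ORDER BY bm25(problems_fts)",
    ("covering",),
).fetchall()
```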

Epic 6: Ingestion (Issues 19-23) [V1.1]

  19. arXiv API client - Metadata fetch by ID
  20. arXiv HTML fetcher - Download ar5iv HTML
  21. arXiv source fetcher - Download and extract tarball
  22. LaTeX to text converter - Basic extraction, preserve math notation
  23. erdos ingest command - Orchestrate fetch + convert + manifest


8) What Makes This Agent-Ready

Before handing to a coding agent, ensure:

Hard Decisions Made

  • Typer, not Click
  • uv, not Poetry/PDM/pip-tools
  • SQLite, not Postgres
  • No vectors in v1
  • No PDF conversion in v1
  • Exact Lean version pinned

Exact File Tree (V1.0)

erdos-banger/
├── README.md
├── LICENSE                          # Apache-2.0
├── pyproject.toml                   # PEP 621 project config
├── uv.lock                          # Locked deps (committed)
├── .python-version                  # Python pin for dev
├── src/
│   └── erdos/
│       ├── __init__.py
│       ├── cli.py                   # Typer app definition
│       ├── commands/
│       │   ├── __init__.py
│       │   ├── list_cmd.py          # erdos list
│       │   ├── show.py              # erdos show
│       │   ├── refs.py              # erdos refs
│       │   ├── search.py            # erdos search
│       │   └── lean.py              # erdos lean init/check/formalize
│       ├── core/
│       │   ├── __init__.py
│       │   ├── models.py            # Pydantic models
│       │   ├── problem_loader.py    # YAML parsing
│       │   ├── search_index.py      # SQLite FTS
│       │   └── lean_runner.py       # Subprocess lean/lake calls
│       ├── templates/
│       │   └── lean_skeleton.j2     # Jinja2 template for Lean files
│       └── py.typed                 # PEP 561 marker
├── data/
│   └── erdosproblems/               # Git submodule
├── formal/
│   └── lean/
│       ├── lean-toolchain           # leanprover/lean4:v4.12.0
│       ├── lakefile.lean
│       └── Erdos/                   # Generated Lean files
├── index/
│   └── .gitkeep                     # SQLite file goes here (gitignored)
├── logs/
│   └── .gitkeep                     # Logs go here (gitignored)
├── tests/
│   ├── __init__.py
│   ├── test_cli.py
│   ├── test_models.py
│   └── test_lean.py
└── .gitignore

Exact Dependencies (pyproject.toml)

Single source of truth: pyproject.toml (and uv.lock) define the exact tool configuration and dependency versions. docs/_archive/specs/spec-001-dev-environment-tooling.md is a frozen snapshot of the v1 scaffolding spec.

At minimum, v1 depends on:

  • Runtime: typer, rich, pydantic, pyyaml, jinja2
  • Dev: pytest, pytest-cov, ruff, mypy, pre-commit


9) Timeline Implications

Given scope reduction:

| Milestone | Estimate | Deliverable |
| --- | --- | --- |
| Scaffolding | 1 day | Repo, CLI skeleton, config |
| Dataset loading | 1 day | YAML parser, models, list/show/refs |
| Lean integration | 2 days | lean init/check/formalize working |
| Basic search | 1 day | SQLite FTS, search command |
| V1.0 complete | ~5 days | Core workflow functional |
| Ingestion (arXiv + Crossref) | 2-3 days | ingest with arXiv source tarball + best-effort extract; Crossref metadata for DOI |
| Full-text search | 1 day | Chunked extracts in index |
| erdos ask | 1-2 days | Basic RAG with Claude |
| V1.1 complete | ~10 days total | Research-usable tool |

10) Open Questions to Resolve

Before coding starts:

  1. What's the actual schema of problems.yaml? - See Spec 005; upstream is metadata-only
  2. Which problems have Formal Conjectures entries? - Should we import those?
  3. What mathlib4 commit works with our chosen Lean version? - Pin this
  4. Do we need Windows support in v1? - Affects path handling, Lean setup
  5. Who's the first user beyond us? - Shapes documentation priorities

11) Summary: What Changed

| Master Draft Says | Qualification Says |
| --- | --- |
| "Possibly Docling or Nougat or GROBID" | Skip PDF in v1; PDF conversion is deferred to v2.0+; optional [pdf] extra uses Marker (GPL) |
| "Maybe SQLite or Postgres or Qdrant" | SQLite only, period |
| "Perhaps sentence-transformers" | No vectors in v1, BM25 only |
| Vertical slice includes ingestion + LLM | True vertical: show → lean init → formalize → check |
| 15 GitHub issues | 23+ finer-grained issues |
| MCP optional but described in detail | MCP deferred past v1 (implemented in v1.4; see Spec 017) |
| erdos loop run as v1 feature | Implemented in v1.2 after manual validation (Spec 012) |
| Crossref + arXiv as primary sources | OpenAlex as primary (v1.2+), Crossref as fallback |
| No API orchestration strategy | MetadataProvider protocol for pluggable sources |
| No deduplication discussion | OpenAlex handles deduplication internally |
The master draft is the vision. This document is the machete that clears the path to v1.