ADR-003: SQLite FTS5 as the Baseline Search Index

Status: Accepted
Date: 2026-01-25
Related: docs/_archive/specs/spec-006-search-index.md, src/erdos/core/search/

Context

erdos-banger needs fast, local search across:

  • problem statements and notes
  • ingested literature extracts and metadata
  • (v3+) research workspace artifacts

The project goals are:

  • local-first execution (no required external services)
  • deterministic behavior (search index can be rebuilt from SSOT files)
  • low operational overhead for contributors (no extra daemons)

Decision

Use an on-disk SQLite FTS5 index as the default, baseline search system.

  • Persisted DB path defaults to index/erdos.sqlite (gitignored).
  • erdos search --build-index builds/rebuilds the index deterministically from available SSOT artifacts.
  • Optional embedding-based search can be layered on (via the embeddings extra), but should not be required for core functionality.
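A deterministic rebuild of this kind can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `build_index`, the `docs` table and its columns, and the `*.md` glob are all assumptions made for the example, and it presumes the local SQLite build includes FTS5.

```python
import sqlite3
from pathlib import Path


def build_index(db_path: Path, source_dir: Path) -> int:
    """Rebuild a hypothetical FTS5 index from scratch; return the file count."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        # Drop and recreate so the result depends only on the source files,
        # never on whatever an earlier build left behind.
        conn.execute("DROP TABLE IF EXISTS docs")
        conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)")
        # Sorted traversal keeps insertion order (and rowids) stable across runs.
        files = sorted(p for p in source_dir.rglob("*.md") if p.is_file())
        rows = [
            (p.relative_to(source_dir).as_posix(), p.read_text(encoding="utf-8"))
            for p in files
        ]
        conn.executemany("INSERT INTO docs(path, body) VALUES (?, ?)", rows)
        conn.commit()
        return len(rows)
    finally:
        conn.close()
```

Because the table is dropped and refilled in sorted order, running the build twice over the same files yields the same rows with the same rowids, which is the property the decision relies on.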

Options Considered

Option A (Chosen): SQLite FTS5

Pros

  • Zero external services; works everywhere SQLite works
  • BM25 ranking gives good-enough relevance for technical text
  • Deterministic rebuilds from local artifacts
  • Easy test setup (temporary SQLite file or in-memory DB)
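The BM25 ranking and in-memory test setup mentioned above can be illustrated with a short sketch. The table layout, the sample rows, and the `search` helper are hypothetical and chosen only for this example; the sketch assumes the Python `sqlite3` module is linked against an SQLite build with FTS5 enabled.

```python
import sqlite3

# An in-memory database needs no files or cleanup, which suits unit tests.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs(title, body) VALUES (?, ?)",
    [
        ("sum free sets", "A note on subsets of the integers with no sums."),
        ("unit distances", "Bounds on unit distances in the plane."),
        ("covering systems", "Covering systems with distinct moduli."),
    ],
)


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list:
    """Return (title, score) pairs, best match first."""
    # FTS5's bm25() assigns lower (more negative) values to better matches,
    # so an ascending sort puts the most relevant rows first.
    return conn.execute(
        "SELECT title, bm25(docs) AS score FROM docs "
        "WHERE docs MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()


results = search(conn, "distances")
```

Here `results` holds a single row for the "unit distances" entry, since it is the only document containing the query term.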

Cons

  • Not a distributed search system
  • Relevance and ranking are less configurable than dedicated engines

Option B: Elasticsearch / OpenSearch

Pros

  • Powerful ranking and scaling capabilities

Cons

  • Operationally heavy for a CLI-first repo
  • Harder contributor onboarding and testing

Option C: Postgres + pgvector

Pros

  • Strong relational + vector querying in one system

Cons

  • Requires a running DB service; breaks local-first/no-daemon goal

Option D: Dedicated Vector DB (Qdrant, Pinecone, etc.)

Pros

  • Great vector similarity performance and tooling

Cons

  • Adds infrastructure and vendor dependencies
  • Not necessary at current scale; reduces determinism

Consequences

  • Search remains fast and usable without any ML dependencies.
  • Embeddings remain an optional enhancement and should degrade gracefully.
  • The canonical sources remain filesystem artifacts; the index is always a derived store that can be regenerated.
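One way to read "degrade gracefully" is that the search code checks for the optional dependency at runtime and falls back to FTS5 when it is missing. The sketch below assumes this interpretation; the helper names, the backend labels, and the default module name `sentence_transformers` are illustrative guesses, since this record does not say which package the embeddings extra installs.

```python
import importlib.util


def embeddings_available(module_name: str = "sentence_transformers") -> bool:
    """Report whether the (assumed) optional embeddings module is importable."""
    # find_spec locates a top-level module without importing it,
    # so the check stays cheap.
    return importlib.util.find_spec(module_name) is not None


def choose_backend(module_name: str = "sentence_transformers") -> str:
    """Pick a search backend label; FTS5 is always the fallback."""
    return "fts5+embeddings" if embeddings_available(module_name) else "fts5"
```

With this shape, core search never imports ML packages directly, so a plain install keeps working and only the ranking quality changes when the extra is absent.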