Architecture
Target Audience: Developers contributing to the codebase
Purpose: Understand the system architecture, core components, and design patterns used throughout the pipeline
When to Use This Guide
Use this guide if you're:
- ✅ New to the codebase (onboarding)
- ✅ Understanding how components interact (pipeline flow, module dependencies)
- ✅ Adding new features (need to know where code lives)
- ✅ Debugging issues across modules (tracing data flow)
Related Documentation
- User Guide: System Overview - High-level system introduction
- Developer Guides:
  - Development Workflow - Git, make commands, pre-commit hooks
  - Preprocessing Internals - Dataset preprocessing patterns
  - Testing Strategy - Test architecture and patterns
  - Type Checking - Type safety requirements
- Implementation Details: See source code in src/antibody_training_esm/
Core Pipeline Flow
- Data Loading (src/antibody_training_esm/data/loaders.py) → Load CSV datasets
- Embedding Extraction (src/antibody_training_esm/core/embeddings.py) → ESM-1v embeddings with batching and caching
- Classification (src/antibody_training_esm/core/classifier.py) → LogisticRegression on embeddings
- Training (src/antibody_training_esm/core/trainer.py) → 10-fold CV, model persistence, evaluation
- CLI (src/antibody_training_esm/cli/) → User-facing commands
Key Modules
core/embeddings.py
ESMEmbeddingExtractor handles:
- Loading ESM-1v from HuggingFace with pinned revisions
- Batch processing with GPU memory management
- Mean-pooling of last hidden states
- Device support: CPU, CUDA, MPS
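As a rough illustration of the mean-pooling step, the sketch below averages per-token vectors into one fixed-size embedding. It uses plain Python lists rather than tensors, and the function name mean_pool and the use of an attention mask to skip padding are assumptions for illustration, not the actual ESMEmbeddingExtractor code.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average per-token vectors, skipping positions where the mask is 0 (padding)."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions with 2-dimensional hidden states; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
```

Here the padded position is excluded, so pooled is the average of the first two vectors only. Without masking, padding tokens in a batch would distort the embeddings of shorter sequences.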
core/classifier.py
BinaryClassifier provides:
- Dual initialization API (dict-based legacy + sklearn kwargs)
- Assay-specific thresholds (ELISA: 0.5, PSR: 0.5495)
- Logistic regression hyperparameters from config
- Embedding extraction + classification pipeline
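One way a dual initialization API can be structured is sketched below: the constructor accepts either a single params dict (the legacy style) or keyword arguments (the sklearn style) and normalizes both into the same internal state. The class name DualInitExample, its default values, and its validation rules are hypothetical; the real BinaryClassifier signature may differ.

```python
from typing import Any, Dict, Optional


class DualInitExample:
    """Hypothetical stand-in showing a dict-or-kwargs constructor."""

    # Illustrative defaults only; not taken from the project's config.
    DEFAULTS: Dict[str, Any] = {"C": 1.0, "max_iter": 1000, "random_state": 42}

    def __init__(self, params: Optional[Dict[str, Any]] = None, **kwargs: Any) -> None:
        if params is not None and kwargs:
            raise TypeError("pass either a params dict or keyword arguments, not both")
        supplied = dict(params) if params is not None else kwargs
        unknown = set(supplied) - set(self.DEFAULTS)
        if unknown:
            raise TypeError(f"unknown parameters: {sorted(unknown)}")
        # Both call styles end up in the same normalized dictionary.
        self.params: Dict[str, Any] = {**self.DEFAULTS, **supplied}


legacy = DualInitExample({"C": 0.1})  # dict-based style
modern = DualInitExample(C=0.1)       # keyword style
```

Normalizing both styles into one dictionary keeps the rest of the class independent of how it was constructed.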
core/trainer.py
train_model orchestrates:
- Config loading from YAML
- Embedding caching (SHA-256 hashed paths)
- 10-fold stratified cross-validation on training set
- Train on full training set, test on hold-out test set
- Model persistence to .pkl files
- Comprehensive metrics (accuracy, precision, recall, F1, ROC-AUC)
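To make the stratified cross-validation step concrete, the following standard-library sketch assigns sample indices to 10 folds while preserving the class ratio in each fold. It is an illustration of the idea only: the function name stratified_folds and the seed are assumptions, and a real pipeline would more likely rely on scikit-learn's StratifiedKFold.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_folds(
    labels: Sequence[int], n_splits: int = 10, seed: int = 42
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, validation_indices) pairs with class balance kept per fold."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share of it.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    splits: List[Tuple[List[int], List[int]]] = []
    for held_out in range(n_splits):
        validation = sorted(folds[held_out])
        train = sorted(i for k, fold in enumerate(folds) if k != held_out for i in fold)
        splits.append((train, validation))
    return splits


# Toy labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
splits = stratified_folds(labels)
```

With these toy labels, every validation fold holds 6 negatives and 4 positives, matching the overall 60/40 ratio.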
datasets/base.py
AntibodyDataset abstract base class defines:
- Standard fragment types (VH, VL, CDRs, FWRs, Full)
- ANARCI annotation interface (IMGT numbering)
- Common preprocessing methods
- Fragment extraction for all datasets
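The division of labor between an abstract base class and its dataset-specific subclasses can be sketched as follows. Everything here is hypothetical and heavily simplified: the class names, the load_sequences and extract_fragment methods, and the record layout are assumptions, and ANARCI annotation is left out entirely.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class FragmentDatasetSketch(ABC):
    """Hypothetical, simplified analogue of an abstract dataset base class."""

    @abstractmethod
    def load_sequences(self) -> List[Dict[str, str]]:
        """Return one record per antibody with 'VH' and 'VL' keys."""

    def extract_fragment(self, fragment: str) -> List[str]:
        """Shared logic: build the requested fragment for every record."""
        records = self.load_sequences()
        if fragment == "VH":
            return [r["VH"] for r in records]
        if fragment == "VL":
            return [r["VL"] for r in records]
        if fragment == "VH+VL":
            return [r["VH"] + r["VL"] for r in records]
        raise ValueError(f"unsupported fragment: {fragment}")


class ToyDataset(FragmentDatasetSketch):
    """Subclass supplies only the dataset-specific loading step."""

    def load_sequences(self) -> List[Dict[str, str]]:
        return [{"VH": "EVQLV", "VL": "DIQMT"}]


toy = ToyDataset()
vh_only = toy.extract_fragment("VH")
paired = toy.extract_fragment("VH+VL")
```

The subclass implements only what differs per dataset, while fragment extraction is written once in the base class.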
datasets/{boughter,jain,harvey,shehata}.py
Dataset-specific loaders that:
- Implement AntibodyDataset interface
- Handle dataset-specific quirks
- Provide default paths to canonical CSV files
- Support fragment-level loading
Directory Structure
src/antibody_training_esm/ # Main package
├── core/ # Core ML pipeline
│ ├── embeddings.py # ESM-1v embedding extraction
│ ├── classifier.py # Binary classifier (LogReg + ESM)
│ ├── trainer.py # Training orchestration
│ └── config.py # Config constants
├── data/ # Data loading utilities
│ └── loaders.py # CSV loading
├── datasets/ # Dataset-specific loaders
│ ├── base.py # Abstract base class
│ ├── boughter.py # Training set
│ ├── jain.py # Test set (Novo parity)
│ ├── harvey.py # Test set (nanobodies)
│ └── shehata.py # Test set (PSR assay)
├── cli/ # Command-line interfaces
│ ├── train.py # Training CLI
│ ├── test.py # Testing CLI
│ └── preprocess.py # Preprocessing CLI
└── evaluation/ # Evaluation utilities
preprocessing/ # Dataset preprocessing pipelines
├── boughter/ # 3-stage: DNA translation → annotation → QC
├── jain/ # 2-step: Excel → CSV → P5e-S2
├── harvey/ # 2-step: Combine CSVs → fragments
└── shehata/ # 2-step: Excel → CSV → fragments
src/antibody_training_esm/conf/ # Hydra configuration directory (inside package)
├── config.yaml # Default Hydra config (Boughter train, Jain test)
experiments/ # Single source of truth for outputs
├── checkpoints/ # Trained model checkpoints (.pkl)
├── cache/ # Cached ESM embeddings
├── runs/ # Hydra outputs (gitignored)
└── benchmarks/ # Versioned benchmark artifacts
data/train/ # Training data CSVs
data/test/ # Test data CSVs
tests/ # Test suite
├── unit/ # Fast unit tests (< 1s each)
├── integration/ # Integration tests
└── e2e/ # End-to-end tests (expensive)
Preprocessing Directory Design
Why Preprocessing Lives at Project Root
Decision (2025-11-18): Preprocessing stays at root (preprocessing/), not inside src/.
Rationale - Factory vs Product Separation:
- src/antibody_training_esm/ = Product (Runtime Code)
  - Training and inference library
  - What you would pip install and import at runtime
  - Contains: Models, data loaders, classifiers, evaluation
- preprocessing/ = Factory (One-Time ETL)
  - Data manufacturing scripts (run once to create datasets)
  - Converts raw paper data (Excel, PDFs) → canonical CSVs
  - Never needed at runtime or in production
  - Contains: Dataset-specific ETL pipelines (Excel parsers, QC filters)
Key Principle: Separate "data creation" (Factory) from "data usage" (Product)
If preprocessing moved to src/:
- ❌ Bundles construction equipment (ETL scripts) inside the finished product
- ❌ Bloats pip package with dependencies like openpyxl (Excel parsing) - dead code in production
- ❌ Mixes "data creation" (run once) with "data usage" (run repeatedly)
Distinction:
- One-time ETL (preprocessing/) → Parse Jain Excel file, translate DNA to protein (ONCE)
- Runtime data loading (src/data/) → Load CSVs during training (REPEATEDLY)
Research Reproducibility:
- Top-level visibility signals scientific importance
- Data transformation methodology is part of the scientific contribution
- Makes ETL logic discoverable for peer review
When to Reconsider (RARE):
- Only if publishing as PyPI library where preprocessing utilities are imported by other projects
- Only if preprocessing evolves from one-time ETL to runtime utilities
- NOT for "production deployment" - current structure is architecturally correct
See Also: docs/archive/decisions/preprocessing-location-decision-2025-11-18.md for detailed analysis
Important Patterns & Conventions
Configuration System
- All training controlled via Hydra configs in src/antibody_training_esm/conf/ (inside package)
- Default config: src/antibody_training_esm/conf/config.yaml (Boughter → Jain)
- Override any parameter from CLI without editing files: antibody-train hardware.device=cuda
- Config structure: model, data, classifier, training, experiment, hardware
- HuggingFace model revisions pinned for reproducibility
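The effect of a dotted CLI override such as hardware.device=cuda can be pictured as a targeted update to a nested config. The sketch below is not Hydra's implementation: apply_overrides is a hypothetical helper, the training.n_splits key is invented for illustration, and values are kept as plain strings, whereas Hydra also parses types.

```python
import copy
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of config with each 'a.b.c=value' override applied."""
    result = copy.deepcopy(config)
    for override in overrides:
        dotted_key, _, value = override.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested dicts, creating missing levels as needed.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Hypothetical defaults mirroring two of the documented config groups.
defaults = {"hardware": {"device": "cpu"}, "training": {"n_splits": 10}}
merged = apply_overrides(defaults, ["hardware.device=cuda"])
```

Only the named leaf changes. Every other default is carried over untouched, which is why overrides can be combined freely without editing config.yaml.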
Dataset Organization
- Training data: data/train/{dataset}/canonical/*.csv
- Test data: data/test/{dataset}/canonical/*.csv or fragments/*.csv
- Raw data: Never committed to Git - stored in data/test/ and preprocessed locally
- Each dataset has dedicated preprocessing pipeline in preprocessing/{dataset}/
Embedding Caching
- ESM embeddings cached in experiments/cache/ as .pkl files (NumPy arrays + metadata)
- Cache key: SHA-256 hash of model_name + revision + max_length + sequences
  - Includes model metadata to prevent ESM2 from reusing ESM-1v embeddings (critical bug fix 2025-11-11)
  - Different backbones generate separate caches with unique hashes
- Cache validation: Stored metadata (model_name, revision, max_length) verified on load
  - Recomputes embeddings if metadata mismatch detected
  - Prevents silent cache collisions between different PLMs
- Performance: Prevents expensive re-computation during hyperparameter sweeps
- Location: experiments/cache/{dataset}_{hash}_embeddings.pkl
- Implementation: See src/antibody_training_esm/core/trainer.py:304-373
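The cache-key and validation rules above can be sketched with hashlib. This is an assumption-laden illustration rather than the code in trainer.py: the function names, the null-byte separator, and the example model names and revision are all chosen here for demonstration, and the real implementation may concatenate its inputs differently.

```python
import hashlib
from typing import Dict, List


def cache_key(model_name: str, revision: str, max_length: int, sequences: List[str]) -> str:
    """Hash model identity and inputs together so different backbones never share a key."""
    hasher = hashlib.sha256()
    for part in (model_name, revision, str(max_length), *sequences):
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\x00")  # separator, so ("ab", "c") and ("a", "bc") hash differently
    return hasher.hexdigest()


def metadata_matches(
    stored: Dict[str, object], model_name: str, revision: str, max_length: int
) -> bool:
    """Return True only if a cached file was produced with the same model settings."""
    expected = {"model_name": model_name, "revision": revision, "max_length": max_length}
    return all(stored.get(key) == value for key, value in expected.items())


# Illustrative inputs: same sequences, two different backbones.
seqs = ["EVQLVESGG", "DIQMTQSPS"]
key_esm1v = cache_key("facebook/esm1v_t33_650M_UR90S_1", "main", 1024, seqs)
key_esm2 = cache_key("facebook/esm2_t33_650M_UR50D", "main", 1024, seqs)
```

Because the model name feeds the hash, the two keys differ even though the sequences are identical. The metadata check is a second line of defense: a caller would recompute embeddings whenever metadata_matches returns False.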
Model Persistence
- Trained models saved as .pkl files in experiments/checkpoints/
- Pickle usage limited to trusted local artifacts only
- Threat model: No internet-exposed API, no untrusted pickle loading
- Production deployment should migrate to JSON + NPZ (see SECURITY_REMEDIATION_PLAN.md)
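To show why a data-only format is safer than pickle, here is a minimal sketch of the JSON half of such a migration: model state is written as plain data and checked on load, so loading cannot execute code. The field names (coef, intercept, assay) and helper functions are hypothetical and do not reflect the contents of SECURITY_REMEDIATION_PLAN.md. The NPZ half for large arrays is omitted.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union

ModelState = Dict[str, Union[List[float], float, str]]


def save_state(state: ModelState, path: Path) -> None:
    """Write model state as plain JSON (data only, nothing executable)."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_state(path: Path) -> ModelState:
    """Read model state back and reject files with missing fields."""
    state = json.loads(path.read_text(encoding="utf-8"))
    missing = {"coef", "intercept", "assay"} - set(state)
    if missing:
        raise ValueError(f"model file is missing fields: {sorted(missing)}")
    return state


# Hypothetical logistic-regression state with three coefficients.
original: ModelState = {"coef": [0.25, -1.5, 0.75], "intercept": 0.1, "assay": "ELISA"}
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.json"
    save_state(original, model_path)
    restored = load_state(model_path)
```

The round trip restores the same values, and a malformed file fails loudly with a ValueError.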
Type Safety
- 100% type coverage enforced via mypy with disallow_untyped_defs=true
- All public functions require complete type annotations
- Type failures block CI pipeline
- Track type remediation progress in .mypy_failures.txt
Testing Strategy
- Unit tests (tests/unit/): Fast, isolated, mocked dependencies
- Integration tests (tests/integration/): Multi-component interactions
- E2E tests (tests/e2e/): Full pipeline (expensive, scheduled runs)
- HuggingFace downloads mocked via tests/fixtures/mock_models.py (even in e2e)
- Coverage requirement: ≥70% enforced in CI
- Test fixtures in tests/fixtures/ with deterministic data
Preprocessing Philosophy
- Dataset-centric organization: All preprocessing for a dataset lives in preprocessing/{dataset}/
- Reproducibility: All preprocessing scripts are versioned and documented
- Validation: Each preprocessing step has validation script (e.g., validate_stages2_3.py)
- Intermediate outputs: Staged outputs (raw → processed → canonical/fragments)
Fragment Types
Standard fragments across all datasets:
- VH/VL: Variable heavy/light chains
- CDRs: H-CDR1/2/3, L-CDR1/2/3, H-CDRs, L-CDRs, All-CDRs
- FWRs: H-FWRs, L-FWRs, All-FWRs
- Combined: VH+VL, Full (VH+VL including linkers)
- Nanobody-specific: VHH_only, VHH-CDR1/2/3, VHH-CDRs, VHH-FWRs
Assay-Specific Thresholds
- ELISA (Boughter, Jain): threshold = 0.5 (standard)
- PSR (Harvey, Shehata): threshold = 0.5495 (Novo Nordisk's PSR threshold - near-parity)
- Thresholds configured in BinaryClassifier.ASSAY_THRESHOLDS
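A small sketch shows how the two thresholds change predictions for the same probabilities. The threshold values come from the list above, but the dictionary keys, the predict_labels helper, and the use of a strict greater-than comparison are assumptions rather than the actual BinaryClassifier behavior.

```python
from typing import Dict, List

# Values taken from the documented thresholds; the key names are assumed.
ASSAY_THRESHOLDS: Dict[str, float] = {"ELISA": 0.5, "PSR": 0.5495}


def predict_labels(probabilities: List[float], assay: str = "ELISA") -> List[int]:
    """Label a sequence 1 when its positive-class probability exceeds the assay threshold."""
    try:
        threshold = ASSAY_THRESHOLDS[assay]
    except KeyError:
        raise ValueError(f"unknown assay: {assay}") from None
    return [int(p > threshold) for p in probabilities]


probs = [0.40, 0.52, 0.60]
elisa_labels = predict_labels(probs, "ELISA")
psr_labels = predict_labels(probs, "PSR")
```

The middle probability (0.52) is positive under the ELISA threshold but negative under the PSR threshold, so the assay type must match the dataset being evaluated.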
Security & Best Practices
Pickle Usage
- Approved use cases: ML models, embedding caches, preprocessed datasets
- All files generated locally by trusted code
- Never load untrusted pickle files
- Run security scans: uv run bandit -r src/ (must remain clean)
Pre-commit Hooks
- Installed via uv run pre-commit install
- Auto-run on commit: ruff format, ruff lint, mypy
- Manual run: make hooks
- Failures block commits (intended behavior)
CI Pipeline
- Quality gate: ruff, mypy, bandit (all must pass)
- Unit tests: Fast tests with ≥70% coverage
- Integration tests: Multi-component tests
- E2E tests: Scheduled runs only (expensive)
- Security: Bandit scan must show 0 findings
Last Updated: 2025-11-28
Branch: main