Architecture

Target Audience: Developers contributing to the codebase

Purpose: Understand the system architecture, core components, and design patterns used throughout the pipeline

When to Use This Guide

Use this guide if you are:

- ✅ New to the codebase (onboarding)
- ✅ Trying to understand how components interact (pipeline flow, module dependencies)
- ✅ Adding new features (need to know where code lives)
- ✅ Debugging issues across modules (tracing data flow)

- User Guide: System Overview - High-level system introduction
- Developer Guides:
  - Development Workflow - Git, make commands, pre-commit hooks
  - Preprocessing Internals - Dataset preprocessing patterns
  - Testing Strategy - Test architecture and patterns
  - Type Checking - Type safety requirements
- Implementation Details: See source code in src/antibody_training_esm/

Core Pipeline Flow

Data Loading (src/antibody_training_esm/data/loaders.py) → Load CSV datasets
Embedding Extraction (src/antibody_training_esm/core/embeddings.py) → ESM-1v embeddings with batching and caching
Classification (src/antibody_training_esm/core/classifier.py) → LogisticRegression on embeddings
Training (src/antibody_training_esm/core/trainer.py) → 10-fold CV, model persistence, evaluation
CLI (src/antibody_training_esm/cli/) → User-facing commands

Key Modules

`core/embeddings.py`

ESMEmbeddingExtractor handles:

- Loading ESM-1v from HuggingFace with pinned revisions
- Batch processing with GPU memory management
- Mean-pooling of last hidden states
- Device support: CPU, CUDA, MPS
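The mean-pooling step can be illustrated with a minimal, dependency-free sketch. It assumes that padded token positions are excluded through an attention mask, which the list above does not confirm. `mean_pool` is a hypothetical helper, not part of the `ESMEmbeddingExtractor` API, and the real extractor works on tensors rather than Python lists.

```python
from typing import Sequence


def mean_pool(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> list[float]:
    """Average the token vectors whose mask entry is 1 (hypothetical helper)."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")

    # Keep only real tokens; masked-out (padding) positions are dropped.
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")

    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions; the last one is padding and is ignored (assumed behavior).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(tokens, mask)
```

The result is one fixed-size vector per sequence regardless of sequence length, which is what lets a plain logistic regression consume the embeddings.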

`core/classifier.py`

BinaryClassifier provides:

- Dual initialization API (dict-based legacy + sklearn kwargs)
- Assay-specific thresholds (ELISA: 0.5, PSR: 0.5495)
- Logistic regression hyperparameters from config
- Embedding extraction + classification pipeline

`core/trainer.py`

train_model orchestrates:

- Config loading from YAML
- Embedding caching (SHA-256 hashed paths)
- 10-fold stratified cross-validation on training set
- Train on full training set, test on hold-out test set
- Model persistence to .pkl files
- Comprehensive metrics (accuracy, precision, recall, F1, ROC-AUC)
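To make "10-fold stratified cross-validation" concrete, here is a small standard-library sketch of how stratified folds can be built. It is an illustration only: `stratified_folds` and `cv_splits` are hypothetical names, and the source does not state how `train_model` builds its folds. In scikit-learn, `sklearn.model_selection.StratifiedKFold` is the usual tool for this.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=42):
    """Assign sample indices to folds so each fold keeps the class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share per class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def cv_splits(labels, n_splits=10, seed=42):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    folds = stratified_folds(labels, n_splits, seed)
    for held_out in range(n_splits):
        val_idx = sorted(folds[held_out])
        train_idx = sorted(
            i for k, fold in enumerate(folds) if k != held_out for i in fold
        )
        yield train_idx, val_idx


# Toy labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
splits = list(cv_splits(labels))
```

Each validation fold in this toy example holds 6 negatives and 4 positives, so every fold sees the same 60/40 class balance as the full training set.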

`datasets/base.py`

AntibodyDataset abstract base class defines:

- Standard fragment types (VH, VL, CDRs, FWRs, Full)
- ANARCI annotation interface (IMGT numbering)
- Common preprocessing methods
- Fragment extraction for all datasets
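The abstract-base-class pattern described above might look roughly like the following. This is a sketch under stated assumptions: `AntibodyDatasetSketch`, `load_sequences`, `get_fragment`, and `ToyDataset` are hypothetical names chosen for illustration, and the real `AntibodyDataset` interface may differ. ANARCI annotation is omitted entirely.

```python
from abc import ABC, abstractmethod


class AntibodyDatasetSketch(ABC):
    """Hypothetical stand-in for AntibodyDataset; method names are assumptions."""

    # Fragment families named in the text.
    FRAGMENT_TYPES = ("VH", "VL", "CDRs", "FWRs", "Full")

    @abstractmethod
    def load_sequences(self) -> list[dict[str, str]]:
        """Return one record per antibody, keyed by fragment name."""

    def get_fragment(self, fragment: str) -> list[str]:
        """Shared fragment extraction that every subclass inherits."""
        if fragment not in self.FRAGMENT_TYPES:
            raise ValueError(f"Unknown fragment type: {fragment!r}")
        return [rec[fragment] for rec in self.load_sequences() if fragment in rec]


class ToyDataset(AntibodyDatasetSketch):
    """Minimal subclass; a real loader would read a canonical CSV instead."""

    def load_sequences(self) -> list[dict[str, str]]:
        return [{"id": "ab1", "VH": "EVQLVESGG", "VL": "DIQMTQSPS"}]


vh_sequences = ToyDataset().get_fragment("VH")
```

The base class owns the shared logic, so a dataset-specific loader only has to supply its own loading step.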

`datasets/{boughter,jain,harvey,shehata}.py`

Dataset-specific loaders that:

- Implement AntibodyDataset interface
- Handle dataset-specific quirks
- Provide default paths to canonical CSV files
- Support fragment-level loading

Directory Structure

src/antibody_training_esm/    # Main package
├── core/                     # Core ML pipeline
│   ├── embeddings.py        # ESM-1v embedding extraction
│   ├── classifier.py        # Binary classifier (LogReg + ESM)
│   ├── trainer.py           # Training orchestration
│   └── config.py            # Config constants
├── data/                     # Data loading utilities
│   └── loaders.py           # CSV loading
├── datasets/                 # Dataset-specific loaders
│   ├── base.py              # Abstract base class
│   ├── boughter.py          # Training set
│   ├── jain.py              # Test set (Novo parity)
│   ├── harvey.py            # Test set (nanobodies)
│   └── shehata.py           # Test set (PSR assay)
├── cli/                      # Command-line interfaces
│   ├── train.py             # Training CLI
│   ├── test.py              # Testing CLI
│   └── preprocess.py        # Preprocessing CLI
└── evaluation/               # Evaluation utilities

preprocessing/                # Dataset preprocessing pipelines
├── boughter/                # 3-stage: DNA translation → annotation → QC
├── jain/                    # 2-step: Excel → CSV → P5e-S2
├── harvey/                  # 2-step: Combine CSVs → fragments
└── shehata/                 # 2-step: Excel → CSV → fragments

src/antibody_training_esm/conf/        # Hydra configuration directory (inside package)
└── config.yaml              # Default Hydra config (Boughter train, Jain test)

experiments/                  # Single source of truth for outputs
├── checkpoints/              # Trained model checkpoints (.pkl)
├── cache/                    # Cached ESM embeddings
├── runs/                     # Hydra outputs (gitignored)
└── benchmarks/               # Versioned benchmark artifacts

data/train/              # Training data CSVs
data/test/               # Test data CSVs

tests/                       # Test suite
├── unit/                    # Fast unit tests (< 1s each)
├── integration/             # Integration tests
└── e2e/                     # End-to-end tests (expensive)

Preprocessing Directory Design

Why Preprocessing Lives at Project Root

Decision (2025-11-18): Preprocessing stays at root (preprocessing/), not inside src/.

Rationale - Factory vs Product Separation:

src/antibody_training_esm/ = Product (Runtime Code)

- Training and inference library
- What you would pip install and import at runtime
- Contains: Models, data loaders, classifiers, evaluation

preprocessing/ = Factory (One-Time ETL)

- Data manufacturing scripts (run once to create datasets)
- Converts raw paper data (Excel, PDFs) → canonical CSVs
- Never needed at runtime or in production
- Contains: Dataset-specific ETL pipelines (Excel parsers, QC filters)

Key Principle: Separate "data creation" (Factory) from "data usage" (Product)

If preprocessing moved to src/:

- ❌ Bundles construction equipment (ETL scripts) inside the finished product
- ❌ Bloats the pip package with dependencies like openpyxl (Excel parsing) that are dead code in production
- ❌ Mixes "data creation" (run once) with "data usage" (run repeatedly)

Distinction:

- One-time ETL (preprocessing/) → Parse Jain Excel file, translate DNA to protein (ONCE)
- Runtime data loading (src/antibody_training_esm/data/) → Load CSVs during training (REPEATEDLY)

Research Reproducibility:

- Top-level visibility signals scientific importance
- Data transformation methodology is part of the scientific contribution
- Makes ETL logic discoverable for peer review

When to Reconsider (RARE):

- Only if publishing as a PyPI library where preprocessing utilities are imported by other projects
- Only if preprocessing evolves from one-time ETL to runtime utilities
- NOT for "production deployment": the current structure is architecturally correct

See Also: docs/archive/decisions/preprocessing-location-decision-2025-11-18.md for detailed analysis

Important Patterns & Conventions

Configuration System

All training controlled via Hydra configs in src/antibody_training_esm/conf/ (inside package)
Default config: src/antibody_training_esm/conf/config.yaml (Boughter → Jain)
Override any parameter from CLI without editing files: antibody-train hardware.device=cuda
Config structure: model, data, classifier, training, experiment, hardware
HuggingFace model revisions pinned for reproducibility
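The way a dotted override such as `hardware.device=cuda` maps onto the nested config structure can be shown with a toy sketch. This is not Hydra's implementation: `apply_overrides` is a hypothetical helper, the `training.n_splits` key is invented for the example, and values stay plain strings here, whereas Hydra also parses types.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with 'a.b.c=value' overrides applied (toy helper)."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, separator, raw_value = item.partition("=")
        if not separator or not dotted_key:
            raise ValueError(f"Expected key=value, got {item!r}")

        # Walk down the nested dicts, creating intermediate levels as needed.
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = raw_value
    return merged


# Hypothetical defaults; only the top-level group names come from the text.
defaults = {"hardware": {"device": "cpu"}, "training": {"n_splits": "10"}}
cfg = apply_overrides(defaults, ["hardware.device=cuda"])
```

Because the defaults are copied rather than mutated, the file-based config stays the single source of truth and each run records only its own overrides.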

Dataset Organization

Training data: data/train/{dataset}/canonical/*.csv
Test data: data/test/{dataset}/canonical/*.csv or fragments/*.csv
Raw data: Never committed to Git - stored in data/test/ and preprocessed locally
Each dataset has dedicated preprocessing pipeline in preprocessing/{dataset}/

Embedding Caching

ESM embeddings cached in experiments/cache/ as .pkl files (NumPy arrays + metadata)
Cache key: SHA-256 hash of model_name + revision + max_length + sequences
Includes model metadata to prevent ESM2 from reusing ESM-1v embeddings (critical bug fix 2025-11-11)
Different backbones generate separate caches with unique hashes
Cache validation: Stored metadata (model_name, revision, max_length) verified on load
Recomputes embeddings if metadata mismatch detected
Prevents silent cache collisions between different PLMs
Performance: Prevents expensive re-computation during hyperparameter sweeps
Location: experiments/cache/{dataset}_{hash}_embeddings.pkl
Implementation: See src/antibody_training_esm/core/trainer.py:304-373
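The cache-key and validation scheme described above can be sketched with `hashlib`. The field order, the null-byte separator, the use of the full hex digest in the filename, and the function names are all assumptions made for illustration; the actual logic lives in `trainer.py` and may differ. The revision string `"rev-a"` is a placeholder, not a real pinned revision.

```python
import hashlib


def embedding_cache_key(model_name, revision, max_length, sequences):
    """SHA-256 over model metadata plus sequences (hypothetical layout)."""
    hasher = hashlib.sha256()
    for part in (model_name, revision, str(max_length), *sequences):
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\x00")  # separator so adjacent fields cannot run together
    return hasher.hexdigest()


def cache_is_valid(stored_meta, model_name, revision, max_length):
    """Check stored metadata on load; a mismatch means the cache must be rebuilt."""
    return (
        stored_meta.get("model_name") == model_name
        and stored_meta.get("revision") == revision
        and stored_meta.get("max_length") == max_length
    )


sequences = ["EVQLVESGG", "DIQMTQSPS"]
key_esm1v = embedding_cache_key("facebook/esm1v_t33_650M_UR90S_1", "rev-a", 1024, sequences)
key_esm2 = embedding_cache_key("facebook/esm2_t33_650M_UR50D", "rev-a", 1024, sequences)

# Path pattern from the text; "boughter" is just an example dataset name.
cache_path = f"experiments/cache/boughter_{key_esm1v}_embeddings.pkl"

stored = {"model_name": "facebook/esm1v_t33_650M_UR90S_1", "revision": "rev-a", "max_length": 1024}
reuse_ok = cache_is_valid(stored, "facebook/esm1v_t33_650M_UR90S_1", "rev-a", 1024)
reuse_blocked = cache_is_valid(stored, "facebook/esm2_t33_650M_UR50D", "rev-a", 1024)
```

Hashing the model name and revision together with the sequences means two backbones never share a key, and the metadata check on load is a second guard in case a file is renamed or copied.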

Model Persistence

Trained models saved as .pkl files in experiments/checkpoints/
Pickle usage limited to trusted local artifacts only
Threat model: No internet-exposed API, no untrusted pickle loading
Production deployment should migrate to JSON + NPZ (see SECURITY_REMEDIATION_PLAN.md)

Type Safety

100% type coverage enforced via mypy with disallow_untyped_defs=true
All public functions require complete type annotations
Type failures block CI pipeline
Track type remediation progress in .mypy_failures.txt

Testing Strategy

Unit tests (tests/unit/): Fast, isolated, mocked dependencies
Integration tests (tests/integration/): Multi-component interactions
E2E tests (tests/e2e/): Full pipeline (expensive, scheduled runs)
HuggingFace downloads mocked via tests/fixtures/mock_models.py (even in e2e)
Coverage requirement: ≥70% enforced in CI
Test fixtures in tests/fixtures/ with deterministic data

Preprocessing Philosophy

Dataset-centric organization: All preprocessing for a dataset lives in preprocessing/{dataset}/
Reproducibility: All preprocessing scripts are versioned and documented
Validation: Each preprocessing step has validation script (e.g., validate_stages2_3.py)
Intermediate outputs: Staged outputs (raw → processed → canonical/fragments)

Fragment Types

Standard fragments across all datasets:

- VH/VL: Variable heavy/light chains
- CDRs: H-CDR1/2/3, L-CDR1/2/3, H-CDRs, L-CDRs, All-CDRs
- FWRs: H-FWRs, L-FWRs, All-FWRs
- Combined: VH+VL, Full (VH+VL including linkers)
- Nanobody-specific: VHH_only, VHH-CDR1/2/3, VHH-CDRs, VHH-FWRs

Assay-Specific Thresholds

ELISA (Boughter, Jain): threshold = 0.5 (standard)
PSR (Harvey, Shehata): threshold = 0.5495 (Novo Nordisk's PSR threshold - near-parity)
Thresholds configured in BinaryClassifier.ASSAY_THRESHOLDS
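As a rough illustration of how an assay-specific threshold changes the predicted label, consider the sketch below. The threshold values come from the text; the dictionary keys, the `predict_labels` function, and the strict `>` comparison are assumptions, not the actual `BinaryClassifier` behavior.

```python
# Values from the text; the key names "ELISA" and "PSR" are assumed.
ASSAY_THRESHOLDS = {"ELISA": 0.5, "PSR": 0.5495}


def predict_labels(probabilities, assay_type="ELISA"):
    """Turn positive-class probabilities into 0/1 labels (hypothetical helper)."""
    if assay_type not in ASSAY_THRESHOLDS:
        raise ValueError(f"Unknown assay type: {assay_type!r}")
    threshold = ASSAY_THRESHOLDS[assay_type]
    # Strict '>' is an assumption; the real comparison may be '>='.
    return [int(p > threshold) for p in probabilities]


probs = [0.20, 0.52, 0.56]
elisa_labels = predict_labels(probs, "ELISA")
psr_labels = predict_labels(probs, "PSR")
```

A probability of 0.52 is labeled positive under the ELISA threshold but negative under the PSR threshold, which is why the assay type has to travel with the dataset at prediction time.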

Security & Best Practices

Pickle Usage

Approved use cases: ML models, embedding caches, preprocessed datasets
All files generated locally by trusted code
Never load untrusted pickle files
Run security scans: uv run bandit -r src/ (must remain clean)

Pre-commit Hooks

Installed via uv run pre-commit install
Auto-run on commit: ruff format, ruff lint, mypy
Manual run: make hooks
Failures block commits (intended behavior)

CI Pipeline

Quality gate: ruff, mypy, bandit (all must pass)
Unit tests: Fast tests with ≥70% coverage
Integration tests: Multi-component tests
E2E tests: Scheduled runs only (expensive)
Security: Bandit scan must show 0 findings

Last Updated: 2025-11-28 | Branch: main