# System Overview
Target Audience: Everyone (first-time readers, researchers, potential users)
Purpose: High-level introduction to the antibody non-specificity prediction pipeline
## What is This System?
This is a production-grade machine learning pipeline for predicting antibody non-specificity (polyreactivity) using protein language models. The system combines state-of-the-art deep learning (ESM-1v) with interpretable classical ML (logistic regression) to classify therapeutic antibodies as specific or non-specific.
### The Problem

Non-specific binding of therapeutic antibodies can lead to:

- Faster clearance from the body
- Reduced drug efficacy
- Unwanted side effects
- Failed clinical trials
Predicting polyreactivity from amino acid sequence is critical for drug development but traditionally requires expensive wet-lab experiments.
### Our Solution

A two-stage computational pipeline:

1. **ESM-1v Protein Language Model** → Converts antibody sequences to 1280-dimensional embeddings
2. **Logistic Regression Classifier** → Maps embeddings to binary predictions (specific/non-specific)
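As a rough sketch (not the exact API of `embeddings.py` or `classifier.py`; the mean pooling, hyperparameters, and toy data below are illustrative assumptions), the two stages look like this:

```python
# Minimal sketch of the two-stage pipeline; the real implementation lives in
# embeddings.py and classifier.py and differs in batching, pooling, and caching.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm1v_t33_650M_UR90S_1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed(sequence: str) -> torch.Tensor:
    """Stage 1: amino-acid sequence -> 1280-dim vector (simple mean over positions)."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 1280)
    return hidden.mean(dim=1).squeeze(0)             # (1280,)

# Stage 2: logistic regression on the embeddings (toy data for illustration only).
train_sequences = ["EVQLVESGGGLVQPGGSLRLSCAAS", "QVQLQESGPGLVKPSETLSLTCTVS"]
train_labels = [0, 1]                                # 0 = specific, 1 = non-specific
X_train = torch.stack([embed(s) for s in train_sequences]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(clf.predict(embed("EVQLVESGGGLVQPGRSLRLSCAAS").numpy().reshape(1, -1)))
```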
Key Achievement: Exactly reproduces Novo Nordisk's published result (68.60% accuracy on the Jain benchmark - EXACT PARITY).
## System Architecture

### High-Level Pipeline
```text
Antibody Sequence (FASTA/CSV)
            ↓
┌────────────────────────┐
│      Data Loading      │ ← Load datasets (Boughter, Jain, Harvey, Shehata)
│      (loaders.py)      │
└────────────────────────┘
            ↓
┌────────────────────────┐
│  Embedding Extraction  │ ← ESM-1v: Sequence → 1280-dim vector
│    (embeddings.py)     │   • Batch processing
└────────────────────────┘   • GPU/CPU support
            ↓                • Embedding caching (SHA-256)
┌────────────────────────┐
│     Classification     │ ← Logistic Regression on embeddings
│     (classifier.py)    │   • Assay-specific thresholds
└────────────────────────┘   • sklearn-compatible API
            ↓
┌────────────────────────┐
│  Training/Evaluation   │ ← 10-fold cross-validation
│      (trainer.py)      │   • Model persistence (.pkl)
└────────────────────────┘   • Comprehensive metrics
            ↓
Prediction: 0 (specific) or 1 (non-specific)
```
### Core Components
| Component | Purpose | Technology |
|---|---|---|
| Embedding Extractor | Convert sequences to vectors | ESM-1v (HuggingFace transformers) |
| Binary Classifier | Predict specificity | sklearn LogisticRegression |
| Dataset Loaders | Load & preprocess data | pandas, ANARCI (IMGT numbering) |
| Training Pipeline | Train & evaluate models | 10-fold stratified CV |
| CLI Tools | User-facing commands | `antibody-train`, `antibody-test` |
| Caching System | Speed up re-runs | SHA-256 hashed embeddings |
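The Training Pipeline row refers to 10-fold stratified cross-validation. A minimal scikit-learn sketch, with random stand-in data in place of the cached embeddings (`trainer.py` adds metrics, thresholds, and model persistence on top of this):

```python
# Sketch of 10-fold stratified CV over precomputed embeddings; stand-in data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(914, 1280)           # stand-in for cached ESM-1v embeddings
y = np.random.randint(0, 2, size=914)   # stand-in for Boughter labels

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=42).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"10-fold CV accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```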
## Key Capabilities

### 1. Multi-Dataset Training & Testing

Training Set:

- Boughter (2020): 914 VH sequences, ELISA polyreactivity assay

Test Sets:

- Jain (2017): 86 clinical antibodies, per-antigen ELISA (Novo parity benchmark)
- Harvey (2022): 141,021 nanobody sequences, PSR assay
- Shehata (2019): 398 human antibodies, PSR cross-validation
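The exact schema handled by `loaders.py` is not shown on this page; as a purely hypothetical illustration, a CSV-backed loader could look like the following (file path and column names are assumptions, not the real ones):

```python
# Illustrative only: a generic CSV loader for sequence/label pairs.
# The column names "vh_sequence" and "label" are hypothetical.
import pandas as pd

def load_sequences(csv_path: str) -> tuple[list[str], list[int]]:
    df = pd.read_csv(csv_path)
    df = df.dropna(subset=["vh_sequence", "label"])
    return df["vh_sequence"].tolist(), df["label"].astype(int).tolist()

sequences, labels = load_sequences("data/boughter.csv")  # hypothetical path
```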
### 2. Fragment-Level Predictions

The pipeline supports predictions on antibody fragments:

- Full sequences: VH, VL, VH+VL
- CDRs (Complementarity-Determining Regions): H-CDR1/2/3, L-CDR1/2/3, All-CDRs
- FWRs (Framework Regions): H-FWRs, L-FWRs, All-FWRs
- Nanobodies: VHH domain (Harvey dataset)
This enables ablation studies to determine which antibody regions drive non-specificity.
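Fragment extraction builds on ANARCI's IMGT numbering. A simplified sketch of slicing CDRs from an already-numbered chain, using the standard IMGT CDR boundaries and ignoring insertion codes (the pipeline's own fragment logic may differ):

```python
# Sketch: extract CDRs from an IMGT-numbered chain represented as a mapping of
# IMGT position -> residue. Insertion codes are ignored for simplicity.
IMGT_CDRS = {"CDR1": range(27, 39), "CDR2": range(56, 66), "CDR3": range(105, 118)}

def extract_cdrs(numbered: dict[int, str]) -> dict[str, str]:
    """Return CDR1/2/3 strings, skipping positions absent from the numbering."""
    return {
        name: "".join(numbered[pos] for pos in positions if pos in numbered)
        for name, positions in IMGT_CDRS.items()
    }

# Toy example: only positions 105-109 present -> a 5-residue CDR3 fragment.
print(extract_cdrs({105: "A", 106: "R", 107: "D", 108: "Y", 109: "W"}))
```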
### 3. Assay-Specific Calibration

Different experimental assays require different decision thresholds:

- ELISA assays: threshold = 0.5 (Boughter, Jain datasets)
- PSR assays: threshold = 0.5495 (Harvey, Shehata datasets; near-parity with Novo)
The classifier automatically applies the correct threshold based on assay type.
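In sketch form, assuming a fitted scikit-learn classifier (the threshold values are the ones quoted above; `classifier.py` may organize this differently):

```python
# Sketch: assay-specific decision thresholds applied to predicted probabilities.
import numpy as np

ASSAY_THRESHOLDS = {"elisa": 0.5, "psr": 0.5495}

def predict_with_threshold(clf, X: np.ndarray, assay: str) -> np.ndarray:
    """Return 1 (non-specific) where P(non-specific) >= the assay's threshold."""
    proba = clf.predict_proba(X)[:, 1]
    return (proba >= ASSAY_THRESHOLDS[assay]).astype(int)
```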
### 4. Reproducibility & Validation

- Jain Benchmark: 68.60% accuracy on Jain (EXACT NOVO PARITY, confusion matrix [[40, 17], [10, 19]])
- 10-fold CV: Stratified cross-validation on training set (67-71% accuracy)
- Embedding Caching: SHA-256-keyed cache prevents expensive re-computation
- Config-Driven: YAML configs for reproducible experiments
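The caching idea can be sketched as follows; only the SHA-256 keying is taken from the description above, the on-disk layout here is an assumption:

```python
# Sketch: SHA-256-keyed embedding cache so identical sequences are embedded once.
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path(".embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_embed(sequence: str, embed_fn) -> np.ndarray:
    key = hashlib.sha256(sequence.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)        # cache hit: skip the expensive forward pass
    vector = embed_fn(sequence)     # cache miss: compute and store
    np.save(path, vector)
    return vector
```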
### 5. Production-Ready Infrastructure
- CI/CD: 5 GitHub Actions workflows (quality gates, tests, Docker, security)
- Test Coverage: 90.80% (403 tests: unit, integration, E2E)
- Type Safety: 100% type coverage (mypy strict mode)
- Docker Support: Dev + prod containers for reproducible environments
- Code Quality: ruff (linting + formatting), bandit (security scanning)
## Technology Stack

### Machine Learning
| Component | Library | Details |
|---|---|---|
| Protein Language Model | ESM-1v (HuggingFace) | facebook/esm1v_t33_650M_UR90S_1 |
| Classifier | scikit-learn | LogisticRegression |
| Embeddings | PyTorch | CPU, CUDA, MPS support |
| CDR Annotation | ANARCI | IMGT numbering scheme |
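The CPU/CUDA/MPS support noted above typically comes down to a device-selection helper like the following sketch (the pipeline's actual logic may differ):

```python
# Sketch: pick the best available PyTorch device (CUDA > MPS > CPU).
import torch

def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()
# e.g. model.to(device); inputs = {k: v.to(device) for k, v in inputs.items()}
```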
### Development Tools
| Tool | Purpose |
|---|---|
| uv | Fast Python package manager |
| pytest | Test framework (unit/integration/e2e) |
| mypy | Static type checking (strict mode) |
| ruff | Linting + formatting (replaces black, isort, flake8) |
| pre-commit | Git hooks for code quality |
| Docker | Containerized environments |
| GitHub Actions | CI/CD (5 workflows) |
| Codecov | Coverage tracking |
### Data Processing
- pandas: DataFrame manipulation
- numpy: Numerical operations
- openpyxl/xlrd: Excel file parsing (dataset preprocessing)
- PyYAML: Config file management
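Experiment configs are plain YAML loaded with PyYAML; a minimal sketch (file path and keys are hypothetical, not the pipeline's actual schema):

```python
# Sketch: load a YAML experiment config with PyYAML. Keys are hypothetical.
import yaml

with open("configs/train_boughter.yaml") as fh:   # hypothetical path
    config = yaml.safe_load(fh)

n_folds = config.get("n_folds", 10)
threshold = config.get("threshold", 0.5)
```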
## Quick Navigation

### 👤 For Users (Running the Pipeline)
- Installation: Installation Guide
- Quick Start: Getting Started
- Training Models: Training Guide
- Testing Models: Testing Guide
- Preprocessing: Preprocessing Guide
- Troubleshooting: Troubleshooting Guide
### 👨‍💻 For Developers (Contributing)
- Architecture Deep Dive: Architecture Guide
- Development Workflow: Workflow Guide
- Testing Strategy: Testing Guide
- CI/CD Setup: CI/CD Guide
- Type Checking: Type Checking Guide
- Security: Security Guide
- Preprocessing Internals: Preprocessing Guide
- Docker: Docker Guide
### 🔬 For Researchers (Validating Methodology)
- Novo Parity Analysis: research/novo-parity.md
- Methodology & Divergences: research/methodology.md
- Assay Thresholds: research/assay-thresholds.md
- Benchmark Results: research/benchmark-results.md
### 📊 For Dataset Users
- Boughter (Training): datasets/boughter/
- Jain (Novo Parity): datasets/jain/
- Harvey (Nanobodies): datasets/harvey/
- Shehata (PSR): datasets/shehata/
## Key Results

### Novo Nordisk Parity (Jain Dataset)
Exact reproduction of published results:
```text
Confusion Matrix: [[40, 17], [10, 19]]
Accuracy:  68.60%
Precision: 0.528 (non-specific)
Recall:    0.655 (non-specific)
F1-Score:  0.584
```
Methodology: P5e-S2 subset (86 antibodies), PSR threshold 0.5495, ELISA 1-3 flags removed.
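The reported metrics follow directly from the confusion matrix, assuming the scikit-learn convention (rows = true class, columns = predicted, class order specific/non-specific); a quick check:

```python
# Recompute the reported metrics from the confusion matrix [[40, 17], [10, 19]].
tn, fp, fn, tp = 40, 17, 10, 19

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 59 / 86 ≈ 0.686
precision = tp / (tp + fp)                                  # 19 / 36 ≈ 0.528
recall    = tp / (tp + fn)                                  # 19 / 29 ≈ 0.655
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.584

print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
```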
### Cross-Dataset Validation
| Dataset | Sequences | Assay | Accuracy | Notes |
|---|---|---|---|---|
| Boughter (train) | 914 | ELISA | 67-71% | 10-fold CV |
| Jain (test) | 86 | ELISA | 68.60% | Novo parity ✅ EXACT |
| Shehata (test) | 398 | PSR | 58.8% | Threshold 0.5495 |
| Harvey (test) | 141,021 | PSR | 61.5-61.7% | Nanobodies |
## Scientific Context

### Publication

- Paper: Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters
- Authors: Sakhnini et al. (2025)
- Method: ESM-1v embeddings + Logistic Regression
- Key Finding: Protein language models capture non-specificity signals from sequence alone
### Why This Matters
- Speed: Computational prediction is 100x faster than wet-lab assays
- Cost: Eliminates expensive experimental screening for early-stage candidates
- Scale: Screen millions of sequences in silico before synthesis
- Interpretability: Linear model allows feature importance analysis
### Limitations
- Accuracy ceiling: 66-71% accuracy (better than random, but not perfect)
- Training data: Limited to 914 labeled sequences (Boughter)
- Assay dependency: Models trained on ELISA may not generalize to PSR
- No mechanistic insight: Black-box embeddings, opaque feature extraction
See research/methodology.md for detailed analysis.
## License & Citation

### License
This project is licensed under the Apache License 2.0 - see LICENSE for details.
### Citation
If you use this pipeline in your research, please cite:
Original Paper:
```bibtex
@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}
```
Datasets:

- Boughter: Boughter et al. (2020) - Training set
- Jain: Jain et al. (2017) - Novo parity benchmark
- Harvey: Harvey et al. (2022) - Nanobodies
- Shehata: Shehata et al. (2019) - PSR validation
See CITATIONS.md for full references.
Last Updated: 2025-11-19
Branch: dev
Version: v0.6.0+ (XGBoost classifier support)