
System Overview

Target Audience: Everyone (first-time readers, researchers, potential users)

Purpose: High-level introduction to the antibody non-specificity prediction pipeline


What is This System?

This is a production-grade machine learning pipeline for predicting antibody non-specificity (polyreactivity) using protein language models. The system combines state-of-the-art deep learning (ESM-1v) with interpretable classical ML (logistic regression) to classify therapeutic antibodies as specific or non-specific.

The Problem

Non-specific binding of therapeutic antibodies can lead to:

  • Faster clearance from the body
  • Reduced drug efficacy
  • Unwanted side effects
  • Failed clinical trials

Predicting polyreactivity from amino acid sequence is critical for drug development but traditionally requires expensive wet-lab experiments.

Our Solution

A two-stage computational pipeline:

  1. ESM-1v Protein Language Model → Converts antibody sequences to 1280-dimensional embeddings
  2. Logistic Regression Classifier → Maps embeddings to binary predictions (specific/non-specific)

Key Achievement: Exactly reproduces Novo Nordisk's published results (68.60% accuracy on the Jain benchmark).
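In code, the two stages reduce to an embedding step followed by a linear classifier. A minimal sketch, with random vectors standing in for the ESM-1v embeddings so it runs without downloading the 650M-parameter model (dataset sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1 (stand-in): ESM-1v maps each antibody sequence to a
# 1280-dimensional embedding. Random vectors are used here so the
# sketch runs without the actual protein language model.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(914, 1280))   # Boughter-sized training set
y_train = rng.integers(0, 2, size=914)   # 0 = specific, 1 = non-specific

# Stage 2: an interpretable linear classifier on top of the embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_new = rng.normal(size=(5, 1280))       # five new "antibodies"
labels = clf.predict(X_new)              # array of 0/1 predictions
```

The linear second stage is what keeps the system interpretable: each of the 1280 embedding dimensions gets a single learned weight.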


System Architecture

High-Level Pipeline

Antibody Sequence (FASTA/CSV)
         ↓
┌────────────────────────┐
│  Data Loading          │  ← Load datasets (Boughter, Jain, Harvey, Shehata)
│  (loaders.py)          │
└────────────────────────┘
         ↓
┌────────────────────────┐
│  Embedding Extraction  │  ← ESM-1v: Sequence → 1280-dim vector
│  (embeddings.py)       │     • Batch processing
└────────────────────────┘     • GPU/CPU support
         ↓                     • Embedding caching (SHA-256)
┌────────────────────────┐
│  Classification        │  ← Logistic Regression on embeddings
│  (classifier.py)       │     • Assay-specific thresholds
└────────────────────────┘     • sklearn-compatible API
         ↓
┌────────────────────────┐
│  Training/Evaluation   │  ← 10-fold cross-validation
│  (trainer.py)          │     • Model persistence (.pkl)
└────────────────────────┘     • Comprehensive metrics
         ↓
   Prediction: 0 (specific) or 1 (non-specific)

Core Components

| Component | Purpose | Technology |
|---|---|---|
| Embedding Extractor | Convert sequences to vectors | ESM-1v (HuggingFace transformers) |
| Binary Classifier | Predict specificity | sklearn LogisticRegression |
| Dataset Loaders | Load & preprocess data | pandas, ANARCI (IMGT numbering) |
| Training Pipeline | Train & evaluate models | 10-fold stratified CV |
| CLI Tools | User-facing commands | antibody-train, antibody-test |
| Caching System | Speed up re-runs | SHA-256 hashed embeddings |
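The caching idea can be sketched in a few lines: key each embedding by a SHA-256 hash of the model name and sequence, and only run the expensive forward pass on a cache miss. This is a minimal illustration, not the pipeline's actual implementation (the real cache directory and key scheme may differ):

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np

CACHE_DIR = Path(tempfile.mkdtemp())  # the real pipeline uses a persistent dir

def cache_key(sequence: str, model_name: str) -> str:
    """SHA-256 over model name + sequence uniquely keys one embedding."""
    return hashlib.sha256(f"{model_name}:{sequence}".encode()).hexdigest()

def get_embedding(sequence, model_name, compute_fn):
    """Return a cached embedding, computing and storing it on first request."""
    path = CACHE_DIR / f"{cache_key(sequence, model_name)}.npy"
    if path.exists():
        return np.load(path)        # cache hit: skip the expensive forward pass
    emb = compute_fn(sequence)      # cache miss: run the model once
    np.save(path, emb)
    return emb

# Demo with a counting stand-in for the ESM-1v forward pass.
calls = []
def fake_model(seq):
    calls.append(seq)
    return np.full(1280, len(seq), dtype=np.float32)

e1 = get_embedding("QVQLVQSG", "esm1v", fake_model)
e2 = get_embedding("QVQLVQSG", "esm1v", fake_model)  # served from cache
```

Keying on the model name as well as the sequence means a cache built for one ESM checkpoint is never silently reused for another.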

Key Capabilities

1. Multi-Dataset Training & Testing

Training Set:

  • Boughter (2020): 914 VH sequences, ELISA polyreactivity assay

Test Sets:

  • Jain (2017): 86 clinical antibodies, per-antigen ELISA (Novo parity benchmark)
  • Harvey (2022): 141,021 nanobody sequences, PSR assay
  • Shehata (2019): 398 human antibodies, PSR cross-validation

2. Fragment-Level Predictions

The pipeline supports predictions on antibody fragments:

  • Full sequences: VH, VL, VH+VL
  • CDRs (Complementarity-Determining Regions): H-CDR1/2/3, L-CDR1/2/3, All-CDRs
  • FWRs (Framework Regions): H-FWRs, L-FWRs, All-FWRs
  • Nanobodies: VHH domain (Harvey dataset)

This enables ablation studies to determine which antibody regions drive non-specificity.
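Such an ablation study amounts to training one classifier per fragment's embeddings and comparing cross-validated accuracy. A sketch under stand-in data (random vectors replace the per-fragment ESM-1v embeddings, and the fragment names are just the ones listed above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)  # shared labels across all ablations

# Stand-in embeddings per region: the real pipeline embeds each fragment
# (full VH, CDRs only, FWRs only, ...) with ESM-1v after ANARCI annotation.
fragments = {name: rng.normal(size=(200, 1280))
             for name in ["VH", "All-CDRs", "All-FWRs"]}

scores = {}
for name, X in fragments.items():
    clf = LogisticRegression(max_iter=500)
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
# Regions whose embeddings score best carry the most non-specificity signal.
```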

3. Assay-Specific Calibration

Different experimental assays require different decision thresholds:

  • ELISA assays: threshold = 0.5 (Boughter, Jain datasets)
  • PSR assays: threshold = 0.5495 (Harvey, Shehata datasets; near-parity with Novo)

The classifier automatically applies the correct threshold based on assay type.
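Applying an assay-specific threshold just means binarizing the classifier's probability for the non-specific class at the right cutoff. A minimal sketch (function and variable names are illustrative, not the pipeline's API; random vectors stand in for embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Decision thresholds on P(non-specific), as listed above.
THRESHOLDS = {"elisa": 0.5, "psr": 0.5495}

def predict_with_threshold(clf, X, assay: str):
    """Binarize P(class 1 = non-specific) with the assay-appropriate cutoff."""
    proba = clf.predict_proba(X)[:, 1]
    return (proba >= THRESHOLDS[assay]).astype(int)

# Toy demo on random stand-in embeddings.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1280))
y = rng.integers(0, 2, size=100)
clf = LogisticRegression(max_iter=500).fit(X, y)

elisa_preds = predict_with_threshold(clf, X, "elisa")
psr_preds = predict_with_threshold(clf, X, "psr")
# The stricter PSR cutoff can only flag a subset of the ELISA positives.
```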

4. Reproducibility & Validation

  • Jain Benchmark: 68.60% accuracy, exact parity with Novo's published confusion matrix [[40, 17], [10, 19]]
  • 10-fold CV: Stratified cross-validation on training set (67-71% accuracy)
  • Embedding Caching: SHA-256-keyed cache prevents expensive re-computation
  • Config-Driven: YAML configs for reproducible experiments
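A config-driven experiment might be described in YAML along these lines. The field names here are hypothetical, shown only to illustrate the idea; consult the repository's actual config files for the real schema:

```yaml
# experiment.yaml (illustrative; the project's actual schema may differ)
model: facebook/esm1v_t33_650M_UR90S_1
train_dataset: boughter
test_dataset: jain
classifier:
  type: logistic_regression
  threshold: 0.5        # ELISA; PSR assays use 0.5495
cv_folds: 10
seed: 42
cache_dir: embedding_cache/
```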

5. Production-Ready Infrastructure

  • CI/CD: 5 GitHub Actions workflows (quality gates, tests, Docker, security)
  • Test Coverage: 90.80% (403 tests: unit, integration, E2E)
  • Type Safety: 100% type coverage (mypy strict mode)
  • Docker Support: Dev + prod containers for reproducible environments
  • Code Quality: ruff (linting + formatting), bandit (security scanning)

Technology Stack

Machine Learning

| Component | Library | Details |
|---|---|---|
| Protein Language Model | ESM-1v (HuggingFace) | facebook/esm1v_t33_650M_UR90S_1 |
| Classifier | scikit-learn | LogisticRegression |
| Embeddings | PyTorch | CPU, CUDA, MPS support |
| CDR Annotation | ANARCI | IMGT numbering scheme |

Development Tools

| Tool | Purpose |
|---|---|
| uv | Fast Python package manager |
| pytest | Test framework (unit/integration/e2e) |
| mypy | Static type checking (strict mode) |
| ruff | Linting + formatting (replaces black, isort, flake8) |
| pre-commit | Git hooks for code quality |
| Docker | Containerized environments |
| GitHub Actions | CI/CD (5 workflows) |
| Codecov | Coverage tracking |

Data Processing

  • pandas: DataFrame manipulation
  • numpy: Numerical operations
  • openpyxl/xlrd: Excel file parsing (dataset preprocessing)
  • PyYAML: Config file management

Quick Navigation

👤 For Users (Running the Pipeline)

👨‍💻 For Developers (Contributing)

🔬 For Researchers (Validating Methodology)

📊 For Dataset Users


Key Results

Novo Nordisk Parity (Jain Dataset)

Exact reproduction of published results:

Confusion Matrix: [[40, 17], [10, 19]]
Accuracy: 68.60%
Precision: 0.528 (non-specific)
Recall: 0.655 (non-specific)
F1-Score: 0.584

Methodology: P5e-S2 subset (86 antibodies), PSR threshold 0.5495, ELISA 1-3 flags removed.
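The reported metrics follow directly from the confusion matrix (rows = actual class, columns = predicted, class 1 = non-specific). A quick sanity check:

```python
# Jain confusion matrix [[TN, FP], [FN, TP]] with class 1 = non-specific.
tn, fp, fn, tp = 40, 17, 10, 19

accuracy  = (tn + tp) / (tn + fp + fn + tp)        # 59 / 86
precision = tp / (tp + fp)                         # 19 / 36
recall    = tp / (tp + fn)                         # 19 / 29
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.686 0.5278 0.6552 0.5846
```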

Cross-Dataset Validation

| Dataset | Sequences | Assay | Accuracy | Notes |
|---|---|---|---|---|
| Boughter (train) | 914 | ELISA | 67-71% | 10-fold CV |
| Jain (test) | 86 | ELISA | 68.60% | Exact Novo parity ✅ |
| Shehata (test) | 398 | PSR | 58.8% | Threshold 0.5495 |
| Harvey (test) | 141,021 | PSR | 61.5-61.7% | Nanobodies |

Scientific Context

Publication

Paper: Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters
Authors: Sakhnini et al. (2025)
Method: ESM-1v embeddings + Logistic Regression
Key Finding: Protein language models capture non-specificity signals from sequence alone

Why This Matters

  1. Speed: Computational prediction is 100x faster than wet-lab assays
  2. Cost: Eliminates expensive experimental screening for early-stage candidates
  3. Scale: Screen millions of sequences in silico before synthesis
  4. Interpretability: Linear model allows feature importance analysis

Limitations

  • Accuracy ceiling: 66-71% accuracy (better than random, but not perfect)
  • Training data: Limited to 914 labeled sequences (Boughter)
  • Assay dependency: Models trained on ELISA may not generalize to PSR
  • No mechanistic insight: Black-box embeddings, opaque feature extraction

See research/methodology.md for detailed analysis.


License & Citation

License

This project is licensed under the Apache License 2.0 - see LICENSE for details.

Citation

If you use this pipeline in your research, please cite:

Original Paper:

@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}

Datasets:

  • Boughter: Boughter et al. (2020) - Training set
  • Jain: Jain et al. (2017) - Novo parity benchmark
  • Harvey: Harvey et al. (2022) - Nanobodies
  • Shehata: Shehata et al. (2019) - PSR validation

See CITATIONS.md for full references.


Last Updated: 2025-11-19
Branch: dev
Version: v0.6.0+ (XGBoost classifier support)