# Antibody Non-Specificity Prediction Pipeline using ESM

> "⏰ Times up, let's do this." - Leeroy Jenkins
This repository provides a machine learning pipeline to predict the non-specificity of antibodies using embeddings from the ESM-1v Protein Language Model (PLM). The project is an implementation of the methods described in the paper "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters" by Sakhnini et al.
## What is Antibody Non-Specificity?
Non-specific binding of therapeutic antibodies can lead to faster clearance from the body and other unwanted side effects, compromising their effectiveness and safety. Predicting this property, also known as polyreactivity, from an antibody's amino acid sequence is a critical step in drug development.
This project offers a computational pipeline to tackle this challenge. It leverages the power of ESM-1v, a state-of-the-art protein language model, to convert antibody amino acid sequences into meaningful numerical representations (embeddings). These embeddings capture complex biophysical and evolutionary information, which is then used to train a machine learning classifier to predict non-specificity.
## Model Architecture
The model's architecture is a two-stage process designed for both power and interpretability:
1. **Sequence Embedding with ESM-1v**: The amino acid sequence of an antibody's Variable Heavy (VH) domain is fed into the pre-trained ESM-1v model. Trained on millions of diverse protein sequences, ESM-1v generates a high-dimensional vector (embedding) for the antibody that represents the learned structural and functional properties of the sequence.
2. **Classification**: The generated embedding vector is then used as input to a classical machine learning model. The original paper found that a Logistic Regression classifier performed best, achieving up to 71% accuracy in 10-fold cross-validation. This second stage learns to map the sequence features captured by ESM-1v to a binary outcome: specific or non-specific.
This hybrid approach combines the deep contextual understanding of a protein language model with the efficiency and interpretability of a linear classifier.
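The two-stage design can be sketched in a few lines. In this illustration, random vectors stand in for real ESM-1v embeddings (which in practice are 1280-dimensional mean-pooled token representations of a VH sequence); everything else — class labels, cluster parameters — is invented for the demo.

```python
# Sketch of the two-stage architecture: a fixed embedding step followed by a
# linear classifier scored with 10-fold cross-validation, as in the paper.
# Random Gaussian clusters stand in for real ESM-1v embeddings here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in "embeddings": two separable clusters, one per class.
n_per_class, dim = 50, 32
specific = rng.normal(loc=-1.0, scale=0.5, size=(n_per_class, dim))
nonspecific = rng.normal(loc=+1.0, scale=0.5, size=(n_per_class, dim))
X = np.vstack([specific, nonspecific])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 1 = non-specific

# Stage 2: logistic regression on the (stand-in) embeddings.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.2f}")
```

Swapping the random vectors for actual ESM-1v embeddings is what turns this toy into the real pipeline; the classifier stage is unchanged.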
## Quick Start

### Installation
```bash
# Clone repository
git clone https://github.com/The-Obstacle-Is-The-Way/antibody_training_pipeline_ESM.git
cd antibody_training_pipeline_ESM

# Set up environment with uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv (Linux/macOS)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync --all-extras
```
### Training a Model
```bash
# Train with default config
make train

# Override parameters with Hydra
uv run antibody-train hardware.device=cuda training.batch_size=32

# Run hyperparameter sweeps
uv run antibody-train --multirun classifier.C=0.1,1.0,10.0
```
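The overrides above follow Hydra's dotted-path syntax, where each segment names a node in the config tree. The layout below is a hypothetical reconstruction inferred from those override keys — the actual file names, default values, and keys in this repository may differ.

```yaml
# Hypothetical conf/config.yaml layout inferred from the CLI overrides above.
hardware:
  device: cpu          # overridden with hardware.device=cuda
training:
  batch_size: 16       # overridden with training.batch_size=32
classifier:
  C: 1.0               # swept with --multirun classifier.C=0.1,1.0,10.0
```

With `--multirun`, Hydra launches one job per value in the comma-separated sweep, so `classifier.C=0.1,1.0,10.0` produces three training runs.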
### Making Predictions
```bash
# CLI prediction
uv run antibody-predict \
    input_file=data/test.csv \
    output_file=predictions.csv \
    classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl

# Web UI (Gradio)
uv run antibody-app classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
```
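Under the hood, a `.pkl` checkpoint like the one passed to `classifier.path` is plausibly just a pickled scikit-learn estimator. The sketch below assumes that format (it is not confirmed by this README) and builds its own toy pickle so the round trip is self-contained.

```python
# Minimal sketch of loading a checkpoint like model.pkl and predicting.
# The pickle is created on the fly from a toy classifier; the real
# checkpoint format used by this repo is an assumption.
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Train and serialize a toy classifier (stands in for the real checkpoint).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])  # 1 = non-specific
blob = pickle.dumps(LogisticRegression().fit(X_train, y_train))

# Later: load the checkpoint and predict labels for new embeddings.
clf = pickle.loads(blob)
X_new = np.array([[0.05, 0.1], [1.05, 0.95]])
preds = clf.predict(X_new)
print(preds)  # expected: [0 1]
```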
## Features

### Implemented ✅
- Data Processing: Load, clean, and process antibody datasets including the Boughter et al. (2020) training dataset
- Sequence Annotation: ANARCI-based annotation of CDRs and extraction of VH domains from full antibody sequences
- ESM-1v Embedding: Generate embeddings for antibody sequences using the ESM-1v protein language model
- Model Training: Complete training pipeline with 10-fold cross-validation using Logistic Regression or XGBoost
- Model Evaluation: Comprehensive metrics (accuracy, precision, recall, F1, ROC-AUC) across multiple test sets
- Prediction CLI: Get predictions for new antibody sequences from trained models
- Web Application: Gradio-based frontend for interactive prediction with device optimization
- Hydra Configuration: Flexible config management with CLI overrides and hyperparameter sweeps
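For reference, the evaluation metrics listed above have simple closed-form definitions over the binary confusion matrix. The pipeline itself presumably computes them with a library such as scikit-learn; the pure-Python version below just makes the definitions explicit.

```python
# Definitions of the binary-classification metrics reported by the pipeline,
# written out in pure Python for clarity.

def confusion(y_true, y_pred):
    """Counts of true/false positives and negatives (label 1 = non-specific)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# One non-specific antibody correctly flagged, one missed:
print(metrics([1, 1, 0, 0], [1, 0, 0, 0]))
# {'accuracy': 0.75, 'precision': 1.0, 'recall': 0.5, 'f1': 0.666...}
```

(ROC-AUC needs predicted probabilities rather than hard labels, so it is omitted from this sketch.)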
### Roadmap 🚧
- Biophysical Descriptor Module: Calculate and incorporate key biophysical parameters (e.g., isoelectric point)
- Support for Other PLMs: Integration of antibody-specific language models (AbLang, AntiBERTy)
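As a taste of the planned biophysical descriptor module, the isoelectric point (pI) mentioned above can be computed by finding the pH at which a sequence's net charge is zero. The sketch below uses approximate textbook pKa values and bisection; a real implementation would more likely rely on an established package (e.g. Biopython's `ProteinAnalysis`), and the pKa table here is an assumption, not the repo's.

```python
# Minimal pI sketch: bisect for the pH where net charge is zero.
# Approximate EMBOSS-style pKa values (an assumption for illustration).
PKA_POS = {"nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEG = {"cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

def net_charge(seq, ph):
    """Henderson-Hasselbalch net charge of a peptide at a given pH."""
    pos = sum(1 / (1 + 10 ** (ph - PKA_POS[aa])) for aa in seq if aa in PKA_POS)
    pos += 1 / (1 + 10 ** (ph - PKA_POS["nterm"]))
    neg = sum(1 / (1 + 10 ** (PKA_NEG[aa] - ph)) for aa in seq if aa in PKA_NEG)
    neg += 1 / (1 + 10 ** (PKA_NEG["cterm"] - ph))
    return pos - neg

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    # Net charge decreases monotonically with pH, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(isoelectric_point("ACDKLLEQK"), 2))
```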
## Datasets
This pipeline uses four antibody datasets for training and evaluation:
| Dataset | Size | Assay | Usage |
|---|---|---|---|
| Boughter | 914 VH sequences | ELISA | Primary training dataset |
| Jain | 86 clinical antibodies | Per-antigen ELISA | Clinical antibody benchmark (68.60% accuracy, exact Novo parity) |
| Harvey | 141k nanobody sequences | PSR assay | Large-scale nanobody test set |
| Shehata | 398 human antibodies | PSR assay | Cross-assay validation |
See the Datasets section for preprocessing details and data sources.
## Documentation Structure
- User Guide: Installation, training, testing, inference, troubleshooting
- Research: Novo parity analysis, methodology, benchmarks
- Developer Guide: Architecture, workflows, testing, CI/CD
- API Reference: Auto-generated API documentation from source code
- Datasets: Dataset-specific preprocessing and validation docs
## Development Workflow
This project uses modern Python tooling for a streamlined development experience:
```bash
# Install dependencies
make install

# Run full quality pipeline (format, lint, typecheck, test)
make all

# Run tests with coverage
make coverage

# Serve documentation locally
make docs-serve
```
See the Development Workflow guide for details.
## Citation
This work implements the methodology from:
Sakhnini et al. (2025) - Novo Nordisk & University of Cambridge
Sakhnini, L.I., Beltrame, L., Fulle, S., Sormanni, P., Henriksen, A., Lorenzen, N., Vendruscolo, M., & Granata, D. (2025). Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters. bioRxiv. https://doi.org/10.1101/2025.04.28.650927
```bibtex
@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  month={May},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}
```
For complete dataset citations, see CITATIONS.md.
## License
Apache License 2.0 - See LICENSE for details.