Antibody Non-Specificity Prediction Pipeline using ESM

This repository provides a machine learning pipeline to predict the non-specificity of antibodies using embeddings from the ESM-1v Protein Language Model (PLM). The project is an implementation of the methods described in the paper "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters" by Sakhnini et al.


What is Antibody Non-Specificity?

Non-specific binding of therapeutic antibodies can lead to faster clearance from the body and other unwanted side effects, compromising their effectiveness and safety. Predicting this property, also known as polyreactivity, from an antibody's amino acid sequence is a critical step in drug development.

This project offers a computational pipeline to tackle this challenge. It leverages the power of ESM-1v, a state-of-the-art protein language model, to convert antibody amino acid sequences into meaningful numerical representations (embeddings). These embeddings capture complex biophysical and evolutionary information, which is then used to train a machine learning classifier to predict non-specificity.


Model Architecture

The model's architecture is a two-stage process designed for both power and interpretability:

  1. Sequence Embedding with ESM-1v: The amino acid sequence of an antibody's Variable Heavy (VH) domain is fed into the pre-trained ESM-1v model. ESM-1v, trained on millions of diverse protein sequences, generates a high-dimensional vector (embedding) for the antibody. This vector represents the learned structural and functional properties of the sequence.

  2. Classification: The generated embedding vector is then used as input for a classical machine learning model. The original paper found that a Logistic Regression classifier performed best, achieving up to 71% accuracy in 10-fold cross-validation. This second stage learns to map the sequence features captured by ESM-1v to a binary outcome: specific or non-specific.

This hybrid approach combines the deep contextual understanding of a protein language model with the efficiency and interpretability of a linear classifier.
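
A minimal sketch of this two-stage flow is shown below, assuming the fair-esm and scikit-learn packages; the model variant (esm1v_t33_650M_UR90S_1), the mean-pooling strategy, and the classifier settings are illustrative assumptions rather than the pipeline's actual code.

# Stage 1: embed VH sequences with a pre-trained ESM-1v model (fair-esm package).
import esm
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(sequences: list[str]) -> np.ndarray:
    """Mean-pool the final-layer residue representations into one vector per sequence."""
    data = [(f"seq{i}", s) for i, s in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        reps = model(tokens, repr_layers=[33])["representations"][33]
    # Skip the BOS token at position 0 and average over residue positions only.
    return np.stack([reps[i, 1 : len(s) + 1].mean(dim=0).numpy() for i, s in enumerate(sequences)])

# Stage 2: map embeddings to a binary label (0 = specific, 1 = non-specific).
X_train = embed(train_vh_sequences)  # hypothetical list of VH amino acid strings
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)  # hypothetical labels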


Quick Start

Installation

# Clone repository
git clone https://github.com/The-Obstacle-Is-The-Way/antibody_training_pipeline_ESM.git
cd antibody_training_pipeline_ESM

# Set up environment with uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv (Linux/macOS)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync --all-extras

Training a Model

# Train with default config
make train

# Override parameters with Hydra
uv run antibody-train hardware.device=cuda training.batch_size=32

# Run hyperparameter sweeps
uv run antibody-train --multirun classifier.C=0.1,1.0,10.0
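
The training command is Hydra-driven, so dotted overrides such as hardware.device=cuda are merged into the run's configuration at launch, and --multirun expands comma-separated values into one run per combination. A Hydra entry point typically has roughly the shape below; the config path, config name, and field names are illustrative assumptions, not the repository's actual code.

# Illustrative Hydra entry point; the real antibody-train command may be structured differently.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # CLI overrides (e.g. training.batch_size=32) are already merged into cfg at this point.
    print(OmegaConf.to_yaml(cfg))
    # run_training(cfg)  # hypothetical call into the training pipeline

if __name__ == "__main__":
    main()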

Making Predictions

# CLI prediction
uv run antibody-predict \
    input_file=data/test.csv \
    output_file=predictions.csv \
    classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl

# Web UI (Gradio)
uv run antibody-app classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
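
Besides the CLI and the Gradio app, a trained checkpoint can in principle be used programmatically. The sketch below assumes model.pkl is a pickled scikit-learn classifier and reuses the embed() helper from the architecture sketch above; the actual checkpoint format and loading code may differ.

# Assumes model.pkl is a pickled scikit-learn classifier; the real checkpoint format may differ.
import pickle

with open("experiments/checkpoints/esm1v/logreg/model.pkl", "rb") as fh:
    clf = pickle.load(fh)

X_new = embed(new_vh_sequences)                 # hypothetical list of VH sequences, embedded as sketched above
p_nonspecific = clf.predict_proba(X_new)[:, 1]  # probability of the non-specific class
print(p_nonspecific)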

Features

Implemented ✅

  • Data Processing: Load, clean, and process antibody datasets including the Boughter et al. (2020) training dataset
  • Sequence Annotation: ANARCI-based annotation of CDRs and extraction of VH domains from full antibody sequences
  • ESM-1v Embedding: Generate embeddings for antibody sequences using the ESM-1v protein language model
  • Model Training: Complete training pipeline with 10-fold cross-validation using Logistic Regression or XGBoost (see the cross-validation sketch after this list)
  • Model Evaluation: Comprehensive metrics (accuracy, precision, recall, F1, ROC-AUC) across multiple test sets
  • Prediction CLI: Get predictions for new antibody sequences from trained models
  • Web Application: Gradio-based frontend for interactive prediction with device optimization
  • Hydra Configuration: Flexible config management with CLI overrides and hyperparameter sweeps
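
For the training and evaluation features above, 10-fold cross-validation over precomputed embeddings can be expressed with scikit-learn roughly as follows; this is a rough sketch with hypothetical variable names (X_embeddings, y_labels), not the repository's training code.

# Rough sketch of 10-fold cross-validation over ESM-1v embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_embeddings,  # hypothetical (n_samples, embedding_dim) array of ESM-1v embeddings
    y_labels,      # hypothetical binary labels: 0 = specific, 1 = non-specific
    cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
print({name: vals.mean() for name, vals in scores.items() if name.startswith("test_")})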

Roadmap 🚧

  • Biophysical Descriptor Module: Calculate and incorporate key biophysical parameters (e.g., isoelectric point)
  • Support for Other PLMs: Integration of antibody-specific language models (AbLang, AntiBERTy)

Datasets

This pipeline uses four antibody datasets for training and evaluation:

Dataset  | Size                    | Assay             | Usage
Boughter | 914 VH sequences        | ELISA             | Primary training dataset
Jain     | 86 clinical antibodies  | Per-antigen ELISA | Clinical antibody benchmark (68.60%, exact Novo parity)
Harvey   | 141k nanobody sequences | PSR assay         | Large-scale nanobody test set
Shehata  | 398 human antibodies    | PSR assay         | Cross-assay validation

See the Datasets section for preprocessing details and data sources.


Documentation Structure

  • User Guide: Installation, training, testing, inference, troubleshooting
  • Research: Novo parity analysis, methodology, benchmarks
  • Developer Guide: Architecture, workflows, testing, CI/CD
  • API Reference: Auto-generated API documentation from source code
  • Datasets: Dataset-specific preprocessing and validation docs

Development Workflow

This project uses modern Python tooling for a streamlined development experience:

# Install dependencies
make install

# Run full quality pipeline (format, lint, typecheck, test)
make all

# Run tests with coverage
make coverage

# Serve documentation locally
make docs-serve

See the Development Workflow guide for details.


Citation

This work implements the methodology from:

Sakhnini et al. (2025) - Novo Nordisk & University of Cambridge

Sakhnini, L.I., Beltrame, L., Fulle, S., Sormanni, P., Henriksen, A., Lorenzen, N., Vendruscolo, M., & Granata, D. (2025). Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters. bioRxiv. https://doi.org/10.1101/2025.04.28.650927

@article{sakhnini2025prediction,
  title={Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters},
  author={Sakhnini, Laila I. and Beltrame, Ludovica and Fulle, Simone and Sormanni, Pietro and Henriksen, Anette and Lorenzen, Nikolai and Vendruscolo, Michele and Granata, Daniele},
  journal={bioRxiv},
  year={2025},
  month={May},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2025.04.28.650927},
  url={https://www.biorxiv.org/content/10.1101/2025.04.28.650927v1}
}

For complete dataset citations, see CITATIONS.md.


License

Apache License 2.0 - See LICENSE for details.