Antibody Non-Specificity Prediction: Comprehensive Inference Guide
Target Audience: Researchers and developers using the prediction pipeline
Prerequisites: Trained model checkpoint (.pkl), Python environment set up
Quick Start: See root INFERENCE_GUIDE.md for fast reference
Table of Contents
- Overview
- Prerequisites
- Input Specification
- Command-Line Usage
- Output Specification
- Advanced Usage
- Batch Processing
- Troubleshooting
- Performance Tuning
- Integration with Downstream Tools
Overview
The antibody-predict CLI provides production-ready inference for screening antibody sequences for non-specificity (polyreactivity). It uses pre-trained ESM-1v protein language models combined with logistic regression or XGBoost classifiers to predict whether an antibody sequence is:
- Specific: Low predicted polyreactivity (fewer expected developability risks)
- Non-specific: High polyreactivity (potential developability issues)
Key Features
- ✅ Pre-trained models: No training required - use published checkpoints
- ✅ CSV input/output: Standard data formats
- ✅ Batch processing: Process hundreds of sequences efficiently
- ✅ GPU acceleration: Automatic device selection (CUDA, MPS, CPU)
- ✅ Assay-specific thresholds: Optimized for ELISA or PSR assays
- ✅ Flexible column names: Works with any CSV schema
Prerequisites
1. Trained Model Checkpoint
Obtain a trained classifier via:
Option A: Train Your Own Model
See training.md for detailed instructions.
Option B: Download Published Checkpoint
Supported formats:
- Development: .pkl (Pickle) - Standard single-file checkpoint
- Production: .npz + .json - Secure, pickle-free checkpoint (requires both files)
# Download from GitHub releases
wget https://github.com/the-obstacle-is-the-way/antibody_training_pipeline_ESM/releases/download/v0.6.0/boughter_vh_esm1v_logreg.pkl
mkdir -p experiments/checkpoints/esm1v/logreg/
mv boughter_vh_esm1v_logreg.pkl experiments/checkpoints/esm1v/logreg/
2. Python Environment
3. Hardware Recommendations
| Hardware | Batch Size | Throughput | Notes |
|---|---|---|---|
| CPU only | 8-16 | ~10 seq/min | Works, but slow |
| GPU (8GB VRAM) | 32 | ~100 seq/min | Recommended |
| GPU (16GB+ VRAM) | 64 | ~200 seq/min | Optimal |
| Apple Silicon (M1/M2) | 32 | ~80 seq/min | MPS acceleration |
Input Specification
File Format
Required: CSV file (.csv) with amino acid sequences
Column Requirements
| Column | Required? | Default Name | Description |
|---|---|---|---|
| Sequence | YES | sequence | Amino acid sequence (20 standard AAs + X) |
| ID | No | id | Unique identifier (auto-generated if missing) |
| Other columns | No | Any | Preserved in output |
Sequence Requirements
✅ Valid:
❌ Invalid:
EVQ-LVESGGGLVQPG---SLRLSCAASGFTFSSYAMSWVRQAPGKGKGLEWVSA... # Gap characters (-)
EVQLVESGGGLV*QPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGKGLEWVSA... # Stop codons (*)
EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGKGLEWVSA123 # Non-amino acid characters
Cleaning:
- Leading/trailing whitespace automatically stripped
- Case-insensitive (auto-normalized to uppercase)
- Gap characters (-) and stop codons (*) NOT supported - will cause validation errors
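The rules above can be checked before running the CLI. The following is a minimal sketch, not the pipeline's own validator: the function name `clean_sequence` and the error messages are assumptions made for illustration. Only the rules themselves (strip whitespace, normalize to uppercase, accept the 20 standard amino acids plus X) come from this guide.

```python
# Hypothetical pre-flight validator; the pipeline's own checks may differ.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}  # 20 standard AAs + X


def clean_sequence(raw: str) -> str:
    """Strip whitespace, uppercase, and reject unsupported characters."""
    seq = raw.strip().upper()
    if not seq:
        raise ValueError("Sequence is empty")
    invalid = sorted(set(seq) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"Sequence contains invalid characters: {invalid}")
    return seq


print(clean_sequence("  evqlvesggglvqpgg  "))
```

Gap characters, stop codons, and digits all raise `ValueError` here, which mirrors the invalid examples listed above.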
Example Input File
my_candidates.csv:
id,sequence,description,source
mAb-001,EVQLVESGGGLVQPGGSLRLSCAASGFTFSDYAMHWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAR,Primary candidate,Lab A
mAb-002,QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYGISWVRQAPGQGLEWMGWISAYNGNTNYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR,Backup sequence,Lab B
mAb-003,QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSPSRGLEWLGRTYYRSKWYNDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCAR,Negative control,Lab A
Command-Line Usage
Basic Command
uv run antibody-predict \
input_file="path/to/input.csv" \
output_file="path/to/output.csv" \
classifier.path="path/to/model.pkl"
All Arguments
| Argument | Type | Required? | Default | Description |
|---|---|---|---|---|
| input_file | Path | YES | — | Input CSV path |
| output_file | Path | No | predictions.csv | Output CSV path |
| classifier.path | Path | YES | — | Trained model checkpoint (.pkl or .npz) |
| classifier.config_path | Path | No | Auto | JSON config for .npz models |
| sequence_column | String | No | sequence | Column name for sequences |
| assay_type | String | No | — | ELISA or PSR (calibrated thresholds) |
| threshold | Float | No | 0.5 | Manual probability threshold (0.0-1.0) |
| model.name | String | No | Auto-detect | ESM model architecture |
| hardware.device | String | No | Auto | cuda, mps, or cpu |
| hardware.batch_size | Int | No | 32 | Embedding batch size |
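When the CLI is called from a script, it can help to assemble the key=value arguments programmatically. This is a minimal sketch that assumes the argument names in the table above; the helper `build_command` is hypothetical and not part of the package.

```python
import shlex


def build_command(input_file: str, classifier_path: str, **overrides) -> list[str]:
    """Build an argv list for antibody-predict from key=value arguments.

    Dotted keys (e.g. hardware.batch_size) are not valid Python keyword
    names, so pass them via dict unpacking: **{"hardware.batch_size": 16}.
    """
    args = {"input_file": input_file, "classifier.path": classifier_path, **overrides}
    return ["uv", "run", "antibody-predict"] + [f"{k}={v}" for k, v in args.items()]


cmd = build_command(
    "data/candidates.csv",
    "experiments/checkpoints/esm1v/logreg/model.pkl",
    assay_type="PSR",
    **{"hardware.batch_size": 16},
)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.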
Example Commands
1. Basic prediction (Pickle):
uv run antibody-predict \
input_file=data/candidates.csv \
classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
2. Production prediction (NPZ):
uv run antibody-predict \
input_file=data/candidates.csv \
classifier.path=experiments/checkpoints/esm1v/logreg/model.npz \
classifier.config_path=experiments/checkpoints/esm1v/logreg/model_config.json
3. Custom sequence column:
uv run antibody-predict \
input_file=data/canonical.csv \
sequence_column="vh_sequence" \
classifier.path=experiments/checkpoints/esm1v/logreg/model.pkl
4. PSR assay-specific threshold:
uv run antibody-predict \
input_file=data/nanobodies.csv \
assay_type="PSR" \
classifier.path=experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
5. Manual threshold tuning:
uv run antibody-predict \
input_file=data/high_specificity_batch.csv \
threshold=0.8 \
classifier.path=experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
6. Force CPU (no GPU):
uv run antibody-predict \
input_file=data/small_batch.csv \
hardware.device=cpu \
classifier.path=experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
Output Specification
Output CSV Format
The output CSV contains all original columns plus two new inference columns:
| Column Name | Type | Description |
|---|---|---|
| prediction | String | Binary classification: specific or non-specific |
| probability | Float (0.0-1.0) | Confidence score for non-specificity |
Probability Interpretation:
- 0.0 → 0.3: Highly specific (low polyreactivity)
- 0.3 → 0.5: Borderline specific
- 0.5 → 0.7: Borderline non-specific
- 0.7 → 1.0: Highly non-specific (high polyreactivity)
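The interpretation bands can be applied to the probability column in code. This is a minimal sketch; the function name `interpret_probability` is hypothetical, and because the guide does not say which band owns the boundary values (0.3, 0.5, 0.7), the sketch assumes each boundary belongs to the higher band.

```python
def interpret_probability(p: float) -> str:
    """Map a non-specificity probability to the interpretation bands.

    Assumption: boundary values (0.3, 0.5, 0.7) fall into the higher band.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {p}")
    if p < 0.3:
        return "highly specific"
    if p < 0.5:
        return "borderline specific"
    if p < 0.7:
        return "borderline non-specific"
    return "highly non-specific"


for p in (0.04, 0.45, 0.62, 0.89):
    print(p, interpret_probability(p))
```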
Example Output
predictions.csv:
id,sequence,description,source,prediction,probability
mAb-001,EVQL...YCAR,Primary candidate,Lab A,specific,0.04
mAb-002,QVQL...YCAR,Backup sequence,Lab B,non-specific,0.89
mAb-003,QVQL...YCAR,Negative control,Lab A,specific,0.12
Result Files
When running with default settings, the following files are created:
output_directory/
├── predictions.csv # Main output with predictions
├── prediction_log.txt # Detailed run log
└── prediction_metadata.json # Model and config metadata
Advanced Usage
Assay-Specific Thresholds
Different assays have different optimal thresholds:
ELISA (Standard):
PSR (Polyspecificity Reagent):
uv run antibody-predict \
assay_type="PSR" \
... # Uses threshold = 0.5495 (Novo Nordisk's PSR threshold)
Threshold Tuning for Precision/Recall Trade-offs
High Precision (Minimize False Positives):
High Recall (Minimize False Negatives):
uv run antibody-predict \
threshold=0.3 \
... # More sensitive - flags more potential non-specific sequences, at the cost of more false positives
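Because the output CSV includes the probability column, the effect of a different threshold can be explored offline without recomputing embeddings. The sketch below makes one assumption that this guide does not confirm: a sequence is labeled non-specific when its probability is greater than or equal to the threshold. The helper `relabel` is hypothetical.

```python
PSR_THRESHOLD = 0.5495  # PSR threshold quoted in this guide


def relabel(probabilities, threshold=0.5):
    """Re-derive labels from probabilities.

    Assumption: probability >= threshold means non-specific; the CLI's
    exact comparison is not documented here.
    """
    return ["non-specific" if p >= threshold else "specific" for p in probabilities]


probs = [0.04, 0.35, 0.52, 0.89]
print(relabel(probs))                 # default threshold of 0.5
print(relabel(probs, PSR_THRESHOLD))  # PSR-calibrated threshold
print(relabel(probs, 0.3))            # high-recall setting
```

Lowering the threshold moves borderline sequences into the non-specific class, and raising it does the opposite.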
Using Different Model Checkpoints
XGBoost Classifier (v0.6.0+):
uv run antibody-predict \
classifier.path=experiments/checkpoints/esm1v/xgboost/boughter_vh_esm1v_xgboost.pkl \
...
ESM-2 Backbone:
uv run antibody-predict \
classifier.path=experiments/checkpoints/esm2_650m/logreg/boughter_vh_esm2_650m_logreg.pkl \
...
Batch Processing
Processing Large Datasets
For datasets with >1000 sequences, use these strategies:
1. Increase batch size (GPU):
uv run antibody-predict \
input_file=large_dataset.csv \
hardware.batch_size=64 \
hardware.device=cuda \
classifier.path=...
2. Process in chunks (CPU):
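# Split the data rows into 500-row chunks, repeating the header in each chunk
header=$(head -n 1 large_dataset.csv)
tail -n +2 large_dataset.csv | split -l 500 - chunk_
for chunk in chunk_*; do
  { echo "$header"; cat "$chunk"; } > "$chunk.csv" && rm "$chunk"
done
# Process each chunk
for chunk in chunk_*.csv; do
  uv run antibody-predict \
    input_file="$chunk" \
    output_file="predictions_$chunk" \
    classifier.path=...
done
# Merge results, keeping the header row from the first file only
awk 'FNR==1 && NR!=1 {next} {print}' predictions_chunk_*.csv > all_predictions.csv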
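The same split-and-merge workflow can be sketched in Python with the standard csv module, which keeps the header row in every chunk and writes it only once when merging. The function names `split_csv` and `merge_csvs` and the `chunk_0000.csv` naming scheme are assumptions for illustration, not part of the pipeline.

```python
import csv
from pathlib import Path


def _write_chunk(out_dir, index, header, rows):
    """Write one chunk file with the header row followed by its data rows."""
    chunk_path = out_dir / f"chunk_{index:04d}.csv"
    with open(chunk_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return chunk_path


def split_csv(path, rows_per_chunk=500, out_dir="."):
    """Split a CSV into chunk files that each repeat the header row."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths, rows = [], []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            rows.append(row)
            if len(rows) == rows_per_chunk:
                chunk_paths.append(_write_chunk(out_dir, len(chunk_paths), header, rows))
                rows = []
    if rows:  # final partial chunk
        chunk_paths.append(_write_chunk(out_dir, len(chunk_paths), header, rows))
    return chunk_paths


def merge_csvs(paths, out_path):
    """Concatenate CSVs, writing the header from the first file only."""
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for i, path in enumerate(paths):
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader)
                if i == 0:
                    writer.writerow(header)
                writer.writerows(reader)
```

A typical run would call `split_csv("large_dataset.csv")`, run antibody-predict on each returned path, and then pass the per-chunk prediction files to `merge_csvs`.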
Parallel Processing
For maximum throughput on multi-GPU systems:
# GPU 0
CUDA_VISIBLE_DEVICES=0 uv run antibody-predict \
input_file=batch1.csv output_file=pred1.csv ... &
# GPU 1
CUDA_VISIBLE_DEVICES=1 uv run antibody-predict \
input_file=batch2.csv output_file=pred2.csv ... &
wait # Wait for both to finish
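The same pattern can be driven from Python when there are more than two input files. This is a sketch under stated assumptions: `plan_gpu_jobs` is a hypothetical helper, `path/to/model.pkl` is a placeholder, and input files are assigned to GPUs round-robin. It only prepares the environment and argument list for each job; launching is left as a commented example because it needs the CLI and GPUs.

```python
import os


def plan_gpu_jobs(batches, classifier_path, gpu_ids=(0, 1)):
    """Return (env, argv) pairs, assigning input files to GPUs round-robin."""
    jobs = []
    for i, (input_file, output_file) in enumerate(batches):
        # Each process sees only its assigned GPU.
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_ids[i % len(gpu_ids)])}
        argv = [
            "uv", "run", "antibody-predict",
            f"input_file={input_file}",
            f"output_file={output_file}",
            f"classifier.path={classifier_path}",
        ]
        jobs.append((env, argv))
    return jobs


jobs = plan_gpu_jobs(
    [("batch1.csv", "pred1.csv"), ("batch2.csv", "pred2.csv")],
    "path/to/model.pkl",
)
for env, argv in jobs:
    print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))

# To launch concurrently and wait for every job to finish:
# import subprocess
# procs = [subprocess.Popen(argv, env=env) for env, argv in jobs]
# exit_codes = [p.wait() for p in procs]
```

If there are more input files than GPUs, launching every job at once would place several processes on the same device, so run them in waves of `len(gpu_ids)` instead.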
Troubleshooting
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Input CSV must contain a 'sequence' column | Column name mismatch | Use sequence_column=your_column_name |
| FileNotFoundError: classifier.path | Model path incorrect | Check path exists: ls experiments/checkpoints/... |
| torch.cuda.OutOfMemoryError | GPU memory exceeded | Reduce batch size: hardware.batch_size=16 |
| ValueError: Sequence contains invalid characters | Gap characters or stop codons | Remove - and * from sequences |
| KeyError: 'sequence' | Missing sequence column | Add sequence_column argument |
Debugging Tips
1. Enable verbose logging:
uv run antibody-predict \
input_file=... \
+log_level=DEBUG # See detailed embedding extraction logs
2. Check model compatibility:
# Load model in Python to inspect metadata
# (only unpickle checkpoints you trust: pickle.load can execute arbitrary code)
python -c "
import pickle
with open('model.pkl', 'rb') as f:
    clf = pickle.load(f)
print(clf.model_name) # Should match ESM backbone
"
3. Validate input CSV:
# Check CSV schema
head -1 my_data.csv # Print headers
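# Check for gap characters in the sequence column only
# (assumes the sequence is column 2 and no field contains a quoted comma; IDs such as mAb-001 also contain hyphens)
cut -d, -f2 my_data.csv | grep -c -- '-' # Should print 0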
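A column-aware check avoids false alarms from hyphens in other fields, such as IDs like mAb-001. The following is a minimal sketch using the standard csv module; `find_invalid_rows` is a hypothetical helper, and its notion of validity (20 standard amino acids plus X, after stripping and uppercasing) is taken from the input specification in this guide rather than from the pipeline's code.

```python
import csv
import re

# Anything other than the 20 standard amino acids or X counts as invalid.
INVALID_RESIDUE = re.compile(r"[^ACDEFGHIKLMNPQRSTVWYX]")


def find_invalid_rows(csv_path, sequence_column="sequence"):
    """Return (row_number, offending_characters) for rows likely to fail validation."""
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if sequence_column not in (reader.fieldnames or []):
            raise KeyError(
                f"Column {sequence_column!r} not found; headers: {reader.fieldnames}"
            )
        for row_number, row in enumerate(reader, start=2):  # header is row 1
            seq = (row[sequence_column] or "").strip().upper()
            bad = sorted(set(INVALID_RESIDUE.findall(seq)))
            if bad or not seq:
                problems.append((row_number, bad))
    return problems
```

Calling `find_invalid_rows("my_data.csv")` returns an empty list when every sequence passes; an empty sequence is reported with an empty character list.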
Performance Tuning
GPU Memory Optimization
| Batch Size | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| 8 | ~2GB | Slow | Small GPU or testing |
| 16 | ~4GB | Moderate | 8GB VRAM cards |
| 32 | ~6GB | Good | Default (recommended) |
| 64 | ~12GB | Excellent | 16GB+ VRAM cards |
| 128 | ~24GB | Maximum | A100 GPUs only |
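The table can be turned into a simple starting-point heuristic. This sketch copies the approximate VRAM figures from the table; the function `pick_batch_size` is hypothetical, and it returns the largest tabulated batch size whose estimated usage fits in the available VRAM, with no headroom reserved.

```python
# Approximate VRAM usage (GB) per batch size, copied from the table above.
VRAM_GB_BY_BATCH_SIZE = {8: 2, 16: 4, 32: 6, 64: 12, 128: 24}


def pick_batch_size(available_vram_gb: float) -> int:
    """Return the largest tabulated batch size whose estimated VRAM fits.

    Falls back to the smallest tabulated batch size when nothing fits.
    """
    fitting = [bs for bs, gb in VRAM_GB_BY_BATCH_SIZE.items() if gb <= available_vram_gb]
    return max(fitting) if fitting else min(VRAM_GB_BY_BATCH_SIZE)


for vram in (4, 8, 16, 24):
    print(f"{vram} GB -> hardware.batch_size={pick_batch_size(vram)}")
```

The table's own recommendation is more conservative for 8GB cards (batch size 16), so treat the result as an upper bound and step down if torch.cuda.OutOfMemoryError appears.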
CPU vs GPU Performance
Benchmark (100 sequences, ESM-1v):
- CPU (12-core): ~15 minutes
- GPU (RTX 3090): ~45 seconds
- Apple M2 Max (MPS): ~90 seconds
Recommendation: Use GPU for >50 sequences
Integration with Downstream Tools
Exporting to Excel
# Run prediction
uv run antibody-predict input_file=data.csv output_file=predictions.csv ...
# Convert to Excel
python -c "
import pandas as pd
df = pd.read_csv('predictions.csv')
df.to_excel('predictions.xlsx', index=False)
"
Filtering Specific Sequences
# Get only high-confidence specific sequences
python -c "
import pandas as pd
df = pd.read_csv('predictions.csv')
specific = df[(df['prediction'] == 'specific') & (df['probability'] < 0.2)]
specific.to_csv('high_confidence_specific.csv', index=False)
"
Visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load predictions
df = pd.read_csv('predictions.csv')
# Plot probability distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='probability', hue='prediction', bins=20)
plt.xlabel('Non-Specificity Probability')
plt.ylabel('Count')
plt.title('Prediction Distribution')
plt.savefig('prediction_distribution.png')
See Also
- Quick Reference: INFERENCE_GUIDE.md (root)
- Training Guide: training.md
- Model Zoo: ../research/model-zoo-roadmap.md
- Troubleshooting: troubleshooting.md
Last Updated: 2025-11-19
Version: v0.6.0+