Testing Guide¶
This guide covers how to evaluate trained antibody non-specificity prediction models on test datasets.
Overview¶
Testing involves:
- Load Trained Model - Load model (pickle for research, NPZ+JSON for production)
- Load Test Data - Load test dataset (CSV format)
- Extract Embeddings - Generate ESM-1v embeddings for test sequences
- Predict - Classify sequences as specific (0) or non-specific (1)
- Evaluate - Compute performance metrics (accuracy, precision, recall, F1, ROC-AUC, confusion matrix)
Understanding Dataset File Types¶
Before testing, it's important to understand the two types of CSV files in the pipeline:
Canonical Files vs Fragment Files¶
Fragment Files (data/test/{dataset}/fragments/*.csv) - RECOMMENDED:
- Standardized column names: sequence, label
- Ready for testing with default CLI (no config override needed)
- Created by preprocessing scripts
- Use these for most testing workflows
Canonical Files (data/test/{dataset}/canonical/*.csv) - ADVANCED:
- Original column names from source data (vh_sequence, vl_sequence)
- Includes all metadata (flags, PSR scores, etc.)
- Requires config override with sequence_column: "vh_sequence"
- Use for custom analysis requiring full metadata
Which to use?
- Quick testing: Use fragment files (work with --model and --data CLI flags)
- Metadata analysis: Use canonical files with test config YAML
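The relationship between the two file types can be sketched in a few lines of Python. This is an illustration only, not the project's preprocessing script: it works on a tiny in-memory CSV, and the `psr_score` metadata column and the helper name `to_fragment_rows` are hypothetical.

```python
import csv
import io

# Toy canonical-style CSV (hypothetical rows; real canonical files carry more metadata).
CANONICAL_CSV = """id,vh_sequence,label,psr_score
ab1,QVQLQQSG,0,0.12
ab2,EVQLVESG,1,0.87
"""


def to_fragment_rows(canonical_text, sequence_column="vh_sequence", label_column="label"):
    """Keep only sequence and label, renamed to the standardized fragment columns."""
    reader = csv.DictReader(io.StringIO(canonical_text))
    return [
        {"sequence": row[sequence_column], "label": int(row[label_column])}
        for row in reader
    ]


fragment_rows = to_fragment_rows(CANONICAL_CSV)
print(fragment_rows)
```

The `sequence_column` argument plays the same role as the `sequence_column` config override: it names the column to read when the file does not use the standardized `sequence` header.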
Quick Testing Commands¶
Test with Model and Data Paths (Recommended)¶
# Test trained model on Jain dataset (using fragment file)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
Note: Fragment files have a standardized sequence column, so no config override is needed.
Test with Configuration File¶
# Create sample test config
uv run antibody-test --create-config
# Test using config
uv run antibody-test --config test_config.yaml
Example test_config.yaml (fragment file):
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/jain/fragments/VH_only_jain.csv" # Fragment file
output_dir: "./experiments/benchmarks"
device: "auto" # Auto-detects CUDA > MPS > CPU; override if needed
batch_size: 32 # Default embedding batch size
Note: The CLI automatically organizes results hierarchically by backbone/classifier/dataset under output_dir (e.g., experiments/benchmarks/esm1v/logreg/jain/…) when the model config JSON is present alongside the checkpoint. Specify only the base output_dir; the stratification is handled for you.
Thresholds: antibody-test now auto-detects assay type from the dataset name (harvey|shehata → PSR threshold 0.5495, jain|boughter → ELISA threshold 0.5). Override with --threshold or the threshold field in a config if you need explicit control.
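The name-based threshold rule described above can be pictured with a short sketch. The helper `guess_threshold` is hypothetical and is not the CLI's actual implementation; in particular, the fallback to 0.5 for unrecognized names is an assumption.

```python
from pathlib import Path

PSR_THRESHOLD = 0.5495   # from the guide: harvey | shehata
ELISA_THRESHOLD = 0.5    # from the guide: jain | boughter


def guess_threshold(data_path, override=None):
    """Hypothetical helper: choose a decision threshold from the dataset file name."""
    if override is not None:
        # An explicit value (like --threshold or the config field) takes precedence.
        return override
    name = Path(data_path).name.lower()
    if "harvey" in name or "shehata" in name:
        return PSR_THRESHOLD
    # Assumption: anything else, including jain and boughter, uses the ELISA default.
    return ELISA_THRESHOLD


print(guess_threshold("data/test/harvey/fragments/VHH_only_harvey.csv"))
print(guess_threshold("data/test/jain/fragments/VH_only_jain.csv"))
```

Because the rule keys on the file name, a renamed or custom dataset would silently fall back to the default in this sketch, which is one reason to pin the threshold explicitly for custom CSVs.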
Test Dataset Options¶
The pipeline includes three test datasets with preprocessed fragment files:
Jain Dataset (Novo Parity Benchmark)¶
# Using fragment file (recommended - standardized columns)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
Details:
- Size: 137 antibodies (fragment file includes full Jain dataset)
- 86 antibodies from P5e-S2 subset (Novo parity benchmark)
- Assay: ELISA (per-antigen binding)
- Fragment: VH
- File: data/test/jain/fragments/VH_only_jain.csv (standardized sequence column)
- Alternative: data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv (86 only, requires config override)
- Expected Accuracy: ~66% on P5e-S2 subset (Novo Nordisk parity)
Harvey Dataset (Nanobodies)¶
# Test on full VHH sequences
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/harvey/fragments/VHH_only_harvey.csv
Details:
- Size: 141,021 nanobody sequences
- Assay: PSR (polyspecific reagent)
- Fragment: VHH_only (full nanobody VHH domain)
- File: data/test/harvey/fragments/VHH_only_harvey.csv
- Note: Large-scale test, may take 10-30 minutes
Fragment-Level Testing:
# Test on VHH CDRs only
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/harvey/fragments/H-CDRs_harvey.csv
Available Harvey Fragments:
- VHH_only_harvey.csv - Full VHH domain
- H-CDR1_harvey.csv, H-CDR2_harvey.csv, H-CDR3_harvey.csv - Individual CDRs
- H-CDRs_harvey.csv - Concatenated CDRs
- H-FWRs_harvey.csv - Concatenated FWRs
Shehata Dataset (PSR Cross-Validation)¶
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/VH_only_shehata.csv
Details:
- Size: 398 human antibodies
- Assay: PSR (polyspecific reagent)
- Fragment: VH
- File: data/test/shehata/fragments/VH_only_shehata.csv
- Note: Cross-assay validation (train ELISA, test PSR)
Fragment-Level Testing¶
All datasets provide fragment-specific CSV files. Test on specific antibody regions:
Shehata Fragments (Most Complete)¶
# Test on H-CDRs (Heavy Chain CDRs)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/H-CDRs_shehata.csv
# Test on All-CDRs (Heavy + Light)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/All-CDRs_shehata.csv
# Test on H-FWRs (Heavy Framework Regions)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/H-FWRs_shehata.csv
# Test on combined VH+VL
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/VH+VL_shehata.csv
Available Shehata Fragments:
- VH_only_shehata.csv, VL_only_shehata.csv - Variable domains
- H-CDR1_shehata.csv, H-CDR2_shehata.csv, H-CDR3_shehata.csv - Heavy CDRs
- L-CDR1_shehata.csv, L-CDR2_shehata.csv, L-CDR3_shehata.csv - Light CDRs
- H-CDRs_shehata.csv, L-CDRs_shehata.csv, All-CDRs_shehata.csv - Concatenated CDRs
- H-FWRs_shehata.csv, L-FWRs_shehata.csv, All-FWRs_shehata.csv - Framework regions
- VH+VL_shehata.csv, Full_shehata.csv - Combined sequences
Boughter Fragments (Training Set)¶
# Test on training set fragments (for cross-validation)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/train/boughter/annotated/VH_only_boughter.csv
Available Boughter Fragments:
- data/train/boughter/annotated/VH_only_boughter.csv - VH domain (914 sequences)
- data/train/boughter/annotated/H-CDRs_boughter.csv - Heavy CDRs
- data/train/boughter/annotated/All-CDRs_boughter.csv - All CDRs
- (See data/train/boughter/annotated/ for all 16 fragments)
Using Canonical Files (Advanced)¶
Canonical files preserve original column names and full metadata. To use them, create a test config:
Example: Test with Jain canonical file
# test_config_jain_canonical.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv"
sequence_column: "vh_sequence" # Override for canonical file
label_column: "label"
output_dir: "./experiments/benchmarks"
device: "auto"
batch_size: 32
Then run:
uv run antibody-test --config test_config_jain_canonical.yaml
Why the override?
- Canonical files use vh_sequence instead of sequence (original source data columns)
- Fragment files use standardized sequence column (preprocessed for training/testing)
- Config override tells the CLI which column to read
When to use canonical files:
- Access to full metadata (ELISA flags, PSR scores, source annotations)
- Reproducing exact paper methodology with original data structure
- Custom analysis requiring features beyond sequence + label
Understanding Test Results¶
Standard Output¶
✅ Loaded model: experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
✅ Loaded test data: 86 samples
✅ Extracted embeddings (86 x 1280)
✅ Predictions complete
Test Set Performance:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Accuracy: 68.60%
Precision: 52.78%
Recall: 65.52%
F1 Score: 58.46%
ROC-AUC: 0.6860
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Confusion Matrix:
Predicted
Neg Pos
Actual Neg [40 17]
Pos [10 19]
Classification Report:
precision recall f1-score support
0 0.80 0.70 0.75 57
1 0.53 0.66 0.58 29
accuracy 0.69 86
macro avg 0.66 0.68 0.67 86
weighted avg 0.71 0.69 0.69 86
Interpreting Metrics¶
Accuracy: 68.60%
- Percentage of correct predictions
- Baseline: Random guessing = ~50%; always predicting the majority class (57 of 86 specific) = ~66%
- Novo Parity: 68.60% = EXACT match to Novo Figure S14A
Precision: 52.78%
- Of the antibodies predicted non-specific, 53% are truly non-specific
- Low precision = many false positives
- Interpretation: Model flags non-specificity liberally (36 predicted vs 29 actual)
Recall: 65.52%
- Of the truly non-specific antibodies, 66% were detected
- Moderate recall = misses some non-specific antibodies
- Interpretation: Model catches the majority but not all
F1 Score: 58.46%
- Harmonic mean of precision and recall
- Balances false positives and false negatives
- Interpretation: Moderate overall performance
ROC-AUC: 0.6860
- Area under ROC curve
- 0.5 = random, 1.0 = perfect
- 0.69 = modest discrimination (above random, far from perfect)
Confusion Matrix:
Predicted
Neg Pos
Actual Neg [40 17] ← True Neg: 40, False Pos: 17
Pos [10 19] ← False Neg: 10, True Pos: 19
Key Observations:
- True Negatives (40): Correctly identified specific antibodies
- False Positives (17): Specific antibodies mislabeled as non-specific
- False Negatives (10): Non-specific antibodies mislabeled as specific
- True Positives (19): Correctly identified non-specific antibodies
Class Imbalance: 57 specific vs 29 non-specific (2.0:1 ratio)
- High precision on class 0 (80%) vs low precision on class 1 (53%)
- Model labels most antibodies "specific" (50 of 86), yet it predicts non-specific more often than the class occurs (36 predicted vs 29 actual)
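As a cross-check, the headline metrics can be recomputed by hand from the confusion matrix printed under "Standard Output" (TN=40, FP=17, FN=10, TP=19). The sketch below uses plain Python arithmetic; the majority-class baseline is an added point of comparison, not something the CLI reports.

```python
# Counts taken from the confusion matrix printed under "Standard Output".
tn, fp, fn, tp = 40, 17, 10, 19
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of predicted non-specific, how many are right
recall = tp / (tp + fn)               # of truly non-specific, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy  {accuracy:.2%}")           # 68.60%
print(f"precision {precision:.2%}")          # 52.78%
print(f"recall    {recall:.2%}")             # 65.52%
print(f"f1        {f1:.2%}")                 # 58.46%
print(f"baseline  {majority_baseline:.2%}")  # 66.28%
```

The gap between accuracy and the majority-class baseline is small, which is why precision, recall, and ROC-AUC are worth reading alongside accuracy on an imbalanced test set like this one.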
Cross-Assay Testing¶
ELISA → PSR Prediction¶
Training on ELISA (Boughter) and testing on PSR (Harvey/Shehata) requires assay-specific threshold tuning. The CLI now auto-detects PSR datasets by name and applies threshold 0.5495; use the config/CLI overrides below if you want to pin a specific value.
Method 1: Test Configuration (Recommended)
Create a test config with PSR-specific threshold:
# test_config_psr.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/shehata/fragments/VH_only_shehata.csv"
output_dir: "./experiments/benchmarks"
device: "auto"
batch_size: 32
# PSR assay-specific threshold
threshold: 0.5495 # Novo Nordisk PSR threshold (default ELISA: 0.5)
Method 2: Manual Threshold Adjustment (Python)
Load model and adjust threshold manually:
import pickle

import pandas as pd
from antibody_training_esm.core import load_model_from_npz
from antibody_training_esm.core.embeddings import ESMEmbeddingExtractor

# Option A: Load from pickle (research)
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    classifier = pickle.load(f)

# Option B: Load from NPZ+JSON (production); use this instead of Option A, not in addition
classifier = load_model_from_npz(
    npz_path="experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.npz",
    json_path="experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg_config.json",
)

# Load test sequences from a fragment file (standardized "sequence" column)
test_df = pd.read_csv("data/test/shehata/fragments/VH_only_shehata.csv")
test_sequences = test_df["sequence"].tolist()

# Extract embeddings for test data
extractor = ESMEmbeddingExtractor(
    model_name="facebook/esm1v_t33_650M_UR90S_1",
    device="mps",  # Override if you need "cuda" or "cpu"
    batch_size=32,
)
test_embeddings = extractor.extract_embeddings(test_sequences)

# Get prediction probabilities: column 1 is P(non-specific)
probs = classifier.predict_proba(test_embeddings)[:, 1]

# Apply PSR-specific threshold
psr_threshold = 0.5495  # Novo Nordisk PSR threshold
predictions = (probs > psr_threshold).astype(int)
Why different thresholds?
- ELISA threshold: 0.5 (standard)
- PSR threshold: 0.5495 (empirically derived for Novo parity)
- Assays measure different binding properties
See Research Notes - Assay-Specific Thresholds for details.
Batch Testing (Multiple Datasets)¶
Method 1: Multiple Data Paths (Recommended)
Test a single model on multiple datasets in one command:
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data \
data/test/jain/fragments/VH_only_jain.csv \
data/test/shehata/fragments/VH_only_shehata.csv \
data/test/harvey/fragments/VHH_only_harvey.csv
Method 2: Test Configuration File
# test_config_multi.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/jain/fragments/VH_only_jain.csv"
- "data/test/shehata/fragments/VH_only_shehata.csv"
- "data/test/harvey/fragments/VHH_only_harvey.csv"
output_dir: "./experiments/benchmarks"
Method 3: Shell Loop
# Test on multiple datasets sequentially
for data_file in \
data/test/jain/fragments/VH_only_jain.csv \
data/test/shehata/fragments/VH_only_shehata.csv; do
echo "Testing on $data_file..."
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data "$data_file"
done
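The same loop can be driven from Python, which is convenient when the list of models and datasets is generated programmatically. This is a dry-run sketch: it only builds and prints the commands, using the `--model` and `--data` flags shown in this guide. The helper name `build_commands` is hypothetical.

```python
import itertools
import shlex

MODELS = [
    "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl",
    "experiments/checkpoints/esm2_650m/logreg/boughter_vh_esm2_650m_logreg.pkl",
]
DATASETS = [
    "data/test/jain/fragments/VH_only_jain.csv",
    "data/test/shehata/fragments/VH_only_shehata.csv",
]


def build_commands(models, datasets):
    """Return one antibody-test invocation per (model, dataset) pair."""
    return [
        ["uv", "run", "antibody-test", "--model", model, "--data", data]
        for model, data in itertools.product(models, datasets)
    ]


commands = build_commands(MODELS, DATASETS)
for cmd in commands:
    # Dry run: print only. To execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```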
Custom CSV Testing¶
Test on your own antibody dataset:
CSV Format Requirements¶
Required columns:
- sequence: Antibody amino acid sequence
- label: Ground truth (0=specific, 1=non-specific)
Optional columns:
- id: Antibody identifier
- name: Antibody name
- source: Data source
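Before running the CLI on a custom file, a quick pre-flight check can catch the most common format problems. The validator below is a hypothetical helper, not part of the pipeline. Restricting sequences to the 20 standard amino-acid letters is an assumption; the pipeline itself may be more or less permissive.

```python
import csv
import io

REQUIRED_COLUMNS = {"sequence", "label"}
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues (assumed alphabet)


def validate_test_csv(text):
    """Return a list of problems found; an empty list means the CSV looks usable."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing column(s): {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["label"] not in {"0", "1", "0.0", "1.0"}:
            problems.append(f"line {line_no}: label {row['label']!r} is not 0 or 1")
        unexpected = set(row["sequence"].upper()) - AMINO_ACIDS
        if unexpected:
            problems.append(f"line {line_no}: unexpected residue(s) {sorted(unexpected)}")
    return problems


good_csv = "id,sequence,label\nab1,QVQLQQSG,0\nab2,EVQLVESG,1\n"
bad_csv = "id,sequence,label\nab1,QVQLXQSG,2\n"
print(validate_test_csv(good_csv))
print(validate_test_csv(bad_csv))
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to `validate_test_csv`.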
Test Command¶
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/my_model.pkl \
--data /path/to/my_test_data.csv
Performance Benchmarking¶
Time and Memory Usage¶
Small Dataset (Jain, 86 sequences):
- CPU: ~30 seconds
- GPU (CUDA/MPS): ~10 seconds
- Memory: ~2 GB
Large Dataset (Harvey, 141k sequences):
- CPU: ~15-20 minutes
- GPU (CUDA/MPS): ~5-8 minutes
- Memory: ~8-12 GB
Tip: Use GPU for large-scale testing (roughly 3x faster, going by the figures above).
Embedding Caching¶
Test embeddings are cached, using the same cache as training.
Benefits:
- Second test run on same dataset = instant
- Cache shared with training (no duplication)
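The guide does not spell out how the cache is keyed, so the following is only a sketch of one common approach: derive the key from the model name and the sequence content, so that any run (training or testing) over the same sequences with the same backbone resolves to the same cache entry. The function name and key format are hypothetical and may differ from what the pipeline actually does.

```python
import hashlib


def embedding_cache_key(sequences, model_name):
    """Hypothetical content-based cache key: same model + same sequences -> same key."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    for seq in sequences:
        digest.update(b"\n")  # separator so ["AB", "C"] and ["A", "BC"] hash differently
        digest.update(seq.encode("utf-8"))
    return digest.hexdigest()[:16]


key_first_run = embedding_cache_key(["QVQLQQSG", "EVQLVESG"], "facebook/esm1v_t33_650M_UR90S_1")
key_second_run = embedding_cache_key(["QVQLQQSG", "EVQLVESG"], "facebook/esm1v_t33_650M_UR90S_1")
print(key_first_run == key_second_run)  # identical inputs map to the same cache entry
```

Under a scheme like this, editing even one sequence or switching backbone produces a different key, so stale embeddings are never reused by accident.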
Comparing Models¶
Method 1: Multiple Models on Same Dataset
Compare the performance of different models on the same test set:
uv run antibody-test \
--model \
experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
experiments/checkpoints/esm2_650m/logreg/boughter_vh_esm2_650m_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
Method 2: Test Configuration
# test_config_compare.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
- "experiments/checkpoints/esm2_650m/logreg/boughter_vh_esm2_650m_logreg.pkl"
data_paths:
- "data/test/jain/fragments/VH_only_jain.csv"
output_dir: "./experiments/benchmarks"
Method 3: Compare Fragment Performance
Test the same model on different fragments to evaluate which regions are most predictive:
# Compare VH vs CDRs vs FWRs performance
for fragment_file in \
VH_only_shehata.csv \
H-CDRs_shehata.csv \
H-FWRs_shehata.csv; do
echo "Testing on $fragment_file..."
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/$fragment_file
done
Expected Ranking (Novo Nordisk findings):
- VH - Best performance (full variable domain)
- H-CDRs - Moderate performance (binding sites only)
- H-FWRs - Lower performance (structural framework)
Troubleshooting¶
Issue: Model fails to load¶
Symptoms: FileNotFoundError or UnpicklingError
Solution:
# Check model exists
ls -lh experiments/checkpoints/esm1v/logreg/
# Inspect the file type (this cannot fully validate a pickle, but an empty or text file is a red flag)
file experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
Issue: Sequence column not found in test CSV¶
Symptoms: KeyError: 'sequence'
Solution: Ensure test CSV has standardized sequence column:
# Check CSV structure (use fragment file)
head -n 5 data/test/jain/fragments/VH_only_jain.csv
# Expected format:
# id,sequence,label,elisa_flags,source
# abituzumab,QVQLQQSGGELAKPGASVKVSCKASGYTFSSFWMHWVRQAPGQGLEWIGYINPRSGYTEYNEIFRDKATMTTDTSTSTAYMELSSLRSEDTAVYYCASFLGRGAMDYWGQGTTVTVSS,0.0,0,jain2017_pnas
Note: Fragment CSVs from preprocessing already have standardized sequence column. Canonical CSVs use vh_sequence instead (see "Using Canonical Files" section above for config override).
Issue: Poor test performance¶
Symptoms: Accuracy < 60% on Jain dataset
Possible causes:
- Model trained on different fragment: Train on VH, test on VH (not CDRs/FWRs)
- Cross-dataset generalization: Models trained on one dataset may not generalize to others
- Assay mismatch: ELISA ≠ PSR (adjust threshold to 0.5495 for PSR)
- Overfitting: High train CV, low test accuracy (increase regularization in training config)
See Troubleshooting Guide for detailed debugging.
Issue: Test takes too long (large datasets)¶
Solution: Use GPU acceleration:
# Verify GPU available
uv run python -c "import torch; print(torch.cuda.is_available())"
# Force GPU usage
export CUDA_VISIBLE_DEVICES=0 # Use GPU 0
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/harvey/fragments/VHH_only_harvey.csv \
--device cuda
Or reduce batch size:
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/harvey/fragments/VHH_only_harvey.csv \
--batch-size 8 # Reduce from default (32)
Advanced Testing¶
Prediction Probability Thresholds¶
Adjust prediction threshold for different precision/recall tradeoffs:
# Load model
from antibody_training_esm.core.classifier import BinaryClassifier
classifier = BinaryClassifier.load("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl")
# Get prediction probabilities (test_embeddings: ESM embeddings extracted as in Cross-Assay Testing)
probs = classifier.predict_proba(test_embeddings)[:, 1] # P(non-specific)
# Custom threshold
predictions = (probs > 0.6).astype(int) # Stricter cutoff: fewer positives, typically higher precision and lower recall
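The tradeoff can be seen on a small example. The labels and probabilities below are made up for illustration (they are not model output), and `precision_recall_at` is a hypothetical helper written in plain Python so it runs without the pipeline installed.

```python
# Toy ground truth and P(non-specific) scores, invented for illustration.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.25, 0.40, 0.52, 0.55, 0.58, 0.62, 0.70, 0.65, 0.90]


def precision_recall_at(labels, probs, threshold):
    """Precision and recall when predicting 1 for every score above the threshold."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.5, 0.5495, 0.6):
    precision, recall = precision_recall_at(labels, probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.6 lifts precision from 0.57 to 0.75 while recall drops from 1.00 to 0.75, which is the general shape of the tradeoff.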
ROC Curve Analysis¶
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Get probabilities (classifier and test_embeddings as above; y_test holds the matching 0/1 labels)
probs = classifier.predict_proba(test_embeddings)[:, 1]
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
# Plot
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--') # Random baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Jain Test Set')
plt.legend()
plt.savefig('roc_curve.png')
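To build intuition for the AUC number, it helps to know that ROC-AUC equals the probability that a randomly chosen non-specific antibody receives a higher score than a randomly chosen specific one, with ties counted as one half. The sketch below computes it that way on made-up data; `pairwise_auc` is a hypothetical helper and is quadratic in the number of samples, so it is meant for understanding rather than for large test sets.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair scores 1, a tie scores 0.5, a misordered pair scores 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy ground truth and scores, invented for illustration.
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.25, 0.40, 0.52, 0.55, 0.58, 0.62, 0.70, 0.65, 0.90]

print(pairwise_auc(toy_labels, toy_scores))  # 0.875: 21 of 24 pairs ordered correctly
```

Read this way, an AUC of 0.5 means the scores order positive/negative pairs no better than a coin flip, and 1.0 means every non-specific antibody outranks every specific one.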
Next Steps¶
- Preprocessing: See Preprocessing Guide to prepare new test datasets
- Training: See Training Guide to train new models
- Research Methodology: See Research Notes for scientific validation
Last Updated: 2025-11-18
Branch: dev