Preprocessing Guide¶
This guide covers how to preprocess antibody datasets for training and testing with the pipeline.
Overview¶
Preprocessing transforms raw antibody data into the canonical CSV format required by the pipeline:
Pipeline Steps:
- Data Acquisition - Download raw data (Excel, CSV, FASTA)
- Format Conversion - Convert to canonical CSV format
- Sequence Extraction - Extract antibody fragments (VH, CDRs, FWRs)
- Quality Control - Validate sequences, check labels
- Fragment Generation - Create fragment-specific datasets
When to Preprocess¶
You need to preprocess data if:
- ✅ Using provided datasets for the first time - Raw data → canonical format
- ✅ Adding a new dataset - Your own antibody data
- ✅ Extracting new fragments - VH, CDRs, FWRs from existing datasets
- ✅ Updating data - New version of published dataset
You DON'T need to preprocess if:
- ❌ Using pre-processed canonical files - Already in data/train/ or data/test/canonical/
- ❌ Using provided fragment files - Already in data/test/fragments/
Quick Preprocessing Commands¶
Boughter Dataset (Training Set)¶
# Stage 1: Translate DNA to protein sequences
python3 preprocessing/boughter/stage1_dna_translation.py
# Stage 2 & 3: Annotate sequences (ANARCI) + Quality control
python3 preprocessing/boughter/stage2_stage3_annotation_qc.py
Outputs:
- data/train/boughter/annotated/VH_only_boughter.csv - VH sequences (1,076 rows)
- data/train/boughter/annotated/*_boughter.csv - 16 fragment CSVs (H-CDRs, L-CDRs, etc.)
Note: Training subset (914 sequences) selected from VH_only_boughter.csv based on polyreactivity labels.
Jain Dataset (Test Set - Novo Parity)¶
# Step 1: Convert Excel to CSV
python3 preprocessing/jain/step1_convert_excel_to_csv.py
# Step 2: Preprocess P5e-S2 subset (Novo parity benchmark)
python3 preprocessing/jain/step2_preprocess_p5e_s2.py
Outputs:
Canonical files (original column names):
- data/test/jain/canonical/jain_86_novo_parity.csv - Full data (columns: id, vh_sequence, vl_sequence, ...)
- data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv - VH only (columns: id, vh_sequence, label)
Fragment files (standardized columns):
- data/test/jain/fragments/VH_only_jain.csv - VH fragment (columns: id, sequence, label) ← Use for testing
- Additional fragment files in data/test/jain/fragments/ (H-CDRs, L-CDRs, etc.)
Column Naming:
- Canonical files use vh_sequence/vl_sequence (original source data)
- Fragment files use sequence (standardized for training/testing)
Harvey Dataset (Nanobody Test Set)¶
# Step 1: Combine raw CSVs
python3 preprocessing/harvey/step1_convert_raw_csvs.py
# Step 2: Extract nanobody fragments (VHH, CDRs, FWRs)
python3 preprocessing/harvey/step2_extract_fragments.py
Outputs:
Processed files:
- data/test/harvey/processed/harvey.csv - Combined raw data (intermediate)
Fragment files (standardized columns):
- data/test/harvey/fragments/VHH_only_harvey.csv - Full VHH (columns: id, sequence, label, ...)
- data/test/harvey/fragments/H-CDR1_harvey.csv - Individual CDRs
- data/test/harvey/fragments/H-CDRs_harvey.csv - Concatenated CDRs
- data/test/harvey/fragments/H-FWRs_harvey.csv - Concatenated FWRs
Column Naming:
- All fragment files use standardized sequence column (ready for testing)
Note: Fragment naming pattern: {fragmentName}_harvey.csv (not harvey_{fragmentName}.csv)
Shehata Dataset (PSR Assay Test Set)¶
# Step 1: Convert Excel to CSV
python3 preprocessing/shehata/step1_convert_excel_to_csv.py
# Step 2: Extract antibody fragments (VH, CDRs, FWRs)
python3 preprocessing/shehata/step2_extract_fragments.py
Outputs:
Processed files:
- data/test/shehata/processed/shehata.csv - Combined processed data (intermediate)
Fragment files (standardized columns):
- data/test/shehata/fragments/VH_only_shehata.csv - VH domain (columns: id, sequence, label, ...)
- data/test/shehata/fragments/H-CDRs_shehata.csv - Heavy CDRs
- data/test/shehata/fragments/All-CDRs_shehata.csv - All CDRs
- (16 fragment files total, pattern: {fragmentName}_shehata.csv)
Column Naming:
- All fragment files use standardized sequence column (ready for testing)
Note: No canonical/ directory with CSVs - only fragments/ (processed outputs only)
Canonical CSV Format¶
All datasets must be converted to this format:
Required Columns:
- sequence: Antibody amino acid sequence (single-letter code)
- label: Binary classification (0=specific, 1=non-specific)
Optional Columns:
- id: Unique identifier
- name: Antibody name
- source: Data source
- assay: Assay type (ELISA, PSR)
- Any other metadata
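For illustration, a minimal canonical CSV might look like this (rows are invented, sequences truncated for display):

id,sequence,label
mAb_001,EVQLVESGGGLVQPGGSLRLSCAAS...,0
mAb_002,QVQLQESGPGLVKPSETLSLTCTVS...,1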
Preprocessing Workflows by Dataset¶
Boughter (Training Set)¶
Source: Boughter et al. (2020) - 914 VH sequences, ELISA assay
Preprocessing Steps:
- DNA → Protein Translation - Translate DNA sequences to amino acids
- ANARCI Annotation - Annotate CDRs using IMGT numbering scheme
- Quality Control - Remove sequences with annotation failures
- Label Assignment - Binary labels from polyreactivity scores
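A minimal sketch of the step 1 translation, assuming Biopython (the real logic lives in preprocessing/boughter/stage1_dna_translation.py):

# Illustrative only - not real Boughter data
from Bio.Seq import Seq

dna = "GAGGTGCAGCTGGTGGAG"
protein = str(Seq(dna).translate(to_stop=True))
print(protein)  # EVQLVE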
Documentation: See docs/datasets/boughter/README.md for detailed steps
Output Files:
- data/train/boughter/canonical/VH_only_boughter_training.csv (914 sequences, final training set)
- data/train/boughter/annotated/ (intermediate fragment files)
Jain (Test Set - Novo Parity)¶
Source: Jain et al. (2017) - 86 clinical antibodies, per-antigen ELISA
Preprocessing Steps:
- Excel → CSV Conversion - Extract data from supplementary Excel file
- P5e-S2 Subset Selection - Select 86 antibodies matching Novo Nordisk's test set
- Threshold Application - Binary labels from Table 1 PSR scores
- Sequence Validation - Verify sequences match published data
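Step 3 reduces continuous scores to binary labels. A hypothetical sketch of that thresholding (the psr_score column name and file path are assumptions; step2_preprocess_p5e_s2.py implements the real logic):

import pandas as pd

scores = pd.read_csv("path/to/jain_scores.csv")  # hypothetical intermediate file
scores["label"] = (scores["psr_score"] > 0.5).astype(int)  # PSR > 0.5, per Novo methodology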
Documentation: See docs/datasets/jain/README.md for detailed steps
Output Files and Column Naming: identical to the Jain outputs listed under Quick Preprocessing Commands above.
Critical Note: Threshold selection (PSR > 0.5) follows Novo Nordisk's methodology (we achieve 68.60% - EXACT NOVO PARITY).
Harvey (Nanobody Test Set)¶
Source: Harvey et al. (2022) - 141,021 nanobody sequences, PSR assay
Preprocessing Steps:
- CSV Combination - Merge multiple raw CSV files
- Nanobody Fragment Extraction - Extract VHH, VHH-CDRs, VHH-FWRs
- PSR Label Assignment - Binary labels from PSR binding scores
- Validation - Verify sequence counts match published data
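Step 1 amounts to concatenating the raw CSVs. A minimal sketch, assuming the raw files live under data/test/harvey/raw/ (the actual paths are in step1_convert_raw_csvs.py):

import glob
import pandas as pd

raw_files = sorted(glob.glob("data/test/harvey/raw/*.csv"))  # assumed location
combined = pd.concat((pd.read_csv(f) for f in raw_files), ignore_index=True)
combined.to_csv("data/test/harvey/processed/harvey.csv", index=False)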
Documentation: See docs/datasets/harvey/README.md for detailed steps
Output Files and Column Naming: identical to the Harvey outputs listed under Quick Preprocessing Commands above.
Shehata (PSR Cross-Validation)¶
Source: Shehata et al. (2019) - 398 human antibodies, PSR assay
Preprocessing Steps:
- Excel → CSV Conversion - Extract data from supplementary Excel file
- Antibody Fragment Extraction - Extract VH, VL, CDRs, FWRs
- PSR Label Assignment - Binary labels from PSR scores (threshold: 0.5495)
- Validation - Cross-check with published confusion matrices
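A quick sanity check for step 4 is to confirm the row count and class balance before comparing against the published confusion matrices, as in this sketch:

import pandas as pd

df = pd.read_csv("data/test/shehata/fragments/VH_only_shehata.csv")
assert len(df) == 398, f"expected 398 antibodies, found {len(df)}"
print(df["label"].value_counts())  # compare against published confusion matrices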
Documentation: See docs/datasets/shehata/README.md for detailed steps
Output Files and Column Naming: identical to the Shehata outputs listed under Quick Preprocessing Commands above.
Fragment Extraction¶
Standard Fragments¶
All datasets support extraction of standard antibody fragments:
Variable Chains:
- VH - Variable Heavy chain
- VL - Variable Light chain
- VH_VL - Combined VH + VL
CDRs (Complementarity-Determining Regions):
- H-CDR1, H-CDR2, H-CDR3 - Individual Heavy CDRs
- L-CDR1, L-CDR2, L-CDR3 - Individual Light CDRs
- H-CDRs - All Heavy CDRs concatenated
- L-CDRs - All Light CDRs concatenated
- All-CDRs - All CDRs (Heavy + Light)
FWRs (Framework Regions):
- H-FWR1, H-FWR2, H-FWR3, H-FWR4 - Individual Heavy FWRs
- L-FWR1, L-FWR2, L-FWR3, L-FWR4 - Individual Light FWRs
- H-FWRs - All Heavy FWRs concatenated
- L-FWRs - All Light FWRs concatenated
- All-FWRs - All FWRs (Heavy + Light)
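Concatenated fragments are the individual regions joined in order. A sketch, assuming per-region columns already exist after annotation (the real extraction lives in the dataset classes):

df["H-CDRs"] = df["H-CDR1"] + df["H-CDR2"] + df["H-CDR3"]
df["L-CDRs"] = df["L-CDR1"] + df["L-CDR2"] + df["L-CDR3"]
df["All-CDRs"] = df["H-CDRs"] + df["L-CDRs"]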
Nanobody-Specific Fragments (Harvey)¶
Nanobodies have single-domain VHH sequences:
- VHH_only - Full VHH domain
- VHH-CDR1, VHH-CDR2, VHH-CDR3 - Individual VHH CDRs
- VHH-CDRs - All VHH CDRs concatenated
- VHH-FWRs - All VHH FWRs concatenated
Adding a New Dataset¶
Step 1: Create Preprocessing Directory¶
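For example (directory layout mirrors the existing datasets):

mkdir -p preprocessing/my_dataset
mkdir -p data/test/my_dataset/canonical data/test/my_dataset/fragments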
Step 2: Write Preprocessing Scripts¶
# preprocessing/my_dataset/step1_convert_to_csv.py
import pandas as pd

# Binarization threshold - use the cutoff published for your assay
threshold = 0.5  # illustrative value, not a recommendation

# Load raw data (Excel, CSV, FASTA, etc.)
df = pd.read_excel("path/to/raw_data.xlsx")

# Convert to canonical format
canonical_df = pd.DataFrame({
    "sequence": df["antibody_sequence"],
    "label": (df["polyreactivity_score"] > threshold).astype(int),
})

# Save canonical CSV
canonical_df.to_csv("data/test/my_dataset/canonical/my_dataset.csv", index=False)
Step 3: Extract Fragments¶
# preprocessing/my_dataset/step2_extract_fragments.py
import pandas as pd

from antibody_training_esm.datasets.my_dataset import MyDataset  # see Step 5

# Load canonical dataset
df = pd.read_csv("data/test/my_dataset/canonical/my_dataset.csv")

# Create dataset instance (implements the AntibodyDataset interface)
dataset = MyDataset()

# Extract fragments
fragments = dataset.extract_all_fragments(df)

# Save fragment CSVs - note the {fragmentName}_{dataset}.csv pattern,
# matching the existing datasets
for fragment_name, fragment_df in fragments.items():
    fragment_df.to_csv(
        f"data/test/my_dataset/fragments/{fragment_name}_my_dataset.csv",
        index=False,
    )
Step 4: Create Dataset Documentation¶
Create docs/datasets/my_dataset/README.md documenting:
- Data source + citation
- Preprocessing steps
- File locations
- Known issues
- Example usage
Step 5: Register Dataset¶
Add dataset class to src/antibody_training_esm/datasets/:
# src/antibody_training_esm/datasets/my_dataset.py
from .base import AntibodyDataset

class MyDataset(AntibodyDataset):
    def __init__(self):
        super().__init__(
            name="my_dataset",
            canonical_path="data/test/my_dataset/canonical/my_dataset.csv",
            fragments_dir="data/test/my_dataset/fragments/",
        )

    # Implement required methods...
Quality Control Checks¶
1. Sequence Validation¶
import re

def is_valid_sequence(seq: str) -> bool:
    """Check if sequence contains only valid amino acids."""
    return bool(re.match(r'^[ACDEFGHIKLMNPQRSTVWY]+$', seq))

# Validate all sequences
df["valid"] = df["sequence"].apply(is_valid_sequence)
print(f"Invalid sequences: {(~df['valid']).sum()}")
2. Label Distribution¶
# Check class balance
print(df["label"].value_counts())
print(f"Positive rate: {df['label'].mean():.2%}")
Expected distributions:
- Boughter: ~40% non-specific
- Jain: ~31% non-specific (27/86)
- Harvey: variable (depends on PSR threshold)
3. Sequence Length Distribution¶
import matplotlib.pyplot as plt
df["seq_length"] = df["sequence"].str.len()
df["seq_length"].hist(bins=50)
plt.xlabel("Sequence Length")
plt.ylabel("Count")
plt.title("Sequence Length Distribution")
plt.savefig("seq_length_dist.png")
Expected ranges:
- VH: 110-130 amino acids
- VL: 100-120 amino acids
- CDRs: 5-20 amino acids each
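A simple range filter can flag outliers (bounds here are the VH range from the list above):

outliers = df[~df["seq_length"].between(110, 130)]
print(f"Sequences outside expected VH range: {len(outliers)}")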
Troubleshooting¶
Issue: ANARCI annotation fails¶
Symptoms: Many sequences skipped during annotation
Solution: Install ANARCI correctly:
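For example, via bioconda (assumes a conda environment):

conda install -c bioconda anarci
ANARCI --help  # verify the install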
Issue: Excel file won't open¶
Symptoms: openpyxl or xlrd errors
Solution: Install Excel reading libraries:
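For example:

pip install openpyxl  # for .xlsx files
pip install xlrd      # for legacy .xls files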
Issue: Fragment extraction produces empty sequences¶
Symptoms: Fragment CSVs have NaN or empty strings
Solution: Check ANARCI annotation success:
# Check annotation status
df["has_vh"] = df["VH"].notna() & (df["VH"] != "")
print(f"VH annotation rate: {df['has_vh'].mean():.1%}")
Issue: Label threshold unclear¶
Symptoms: Unclear how to derive binary labels from continuous assay scores
Solution: Check original paper methods section:
- Boughter: Threshold from paper (polyreactivity score)
- Jain: Table 1 PSR > 0.5
- Harvey: PSR binding scores (various thresholds)
- Shehata: PSR > 0.5495 (Novo Nordisk's PSR threshold - achieves 58.29% vs their 58.8%)
See dataset-specific docs for details.
Next Steps¶
After preprocessing:
- Training: See Training Guide to train models on preprocessed data
- Testing: See Testing Guide to evaluate on preprocessed test sets
- Dataset Documentation: See docs/datasets/ for dataset-specific preprocessing details
Last Updated: 2025-11-18
Branch: dev