Shehata Dataset Preprocessing Implementation Plan¶
Issue: #3 - Shehata dataset preprocessing Author: Ray (ray/learning branch) Date: 2025-10-31 (Updated: 2025-11-06) Status: ✅ Complete - Both Phase 1 and Phase 2 fully operational
Executive Summary¶
What is the Shehata Dataset?¶
The Shehata dataset (Shehata et al. 2019, Cell Reports) contains 398 antibodies from human B cell subsets (naïve, IgG memory, IgM memory, and long-lived plasma cells) evaluated for non-specificity using the Polyspecific Reagent (PSR) assay.
Role in the Pipeline¶
CRITICAL CLARIFICATION: The Shehata dataset is an EXTERNAL TEST SET ONLY - it is NOT used for training.
Training vs Testing: - Training data: Boughter dataset (>1000 antibodies, ELISA-based labels) - Test sets: - Jain (137 clinical antibodies, ELISA) - Shehata (398 antibodies, PSR assay) ← We are here - Harvey (140K nanobodies, PSR assay)
Why test on Shehata? 1. External validation of models trained on Boughter 2. Cross-assay generalization: PSR vs ELISA (different non-specificity measurements) 3. Biological diversity: Different B cell subsets vs just antigen-specific antibodies 4. Clinical relevance: Human antibodies from various maturation stages
From the paper (Section 2.1):
"the most balanced dataset (i.e. Boughter one) was selected for training of ML models, while the remaining three (i.e. Jain, Shehata and Harvey, which consists exclusively of VHH sequences) were used for testing."
Paper's Preprocessing Procedure (Methods Section 4.3)¶
Step-by-Step from Sakhnini et al. 2025:¶
-
CDR Annotation (Line 240):
"sequences were annotated in the CDRs using ANARCI following the IMGT numbering scheme"
-
Fragment Assembly:
- Created 16 different antibody fragment sequences
-
Fragments include: VH, VL, H-CDR1, H-CDR2, H-CDR3, L-CDR1, L-CDR2, L-CDR3, H-CDRs (combined), L-CDRs (combined), etc.
-
Embedding (Lines 240-241):
- Used ESM-1v (top performer: 71% CV accuracy)
-
Mean pooling: "average of all token vectors"
-
Label Preparation:
- Binary classification: Specific (class 0) vs Non-specific (class 1)
- For Boughter training: 0 flags = specific, >3 flags = non-specific (excluded 1-3 flags)
- For Shehata testing: PSR score-based labeling
Key Finding for Shehata:¶
From Section 2.1 (Line 51):
"the Shehata dataset is unbalanced, with 7 out of 398 antibodies characterised as non-specific only"
IMPORTANT: The paper used a strict threshold for binary classification. Our analysis shows: - PSR = 0: 251 antibodies (specific) - PSR > 0: 147 antibodies (have some non-specificity) - But only ~7 considered truly "non-specific" by paper's threshold
We need to determine this threshold from the paper or their methods.
Current Repository Architecture¶
Existing Code Analysis:¶
1. Preprocessing Pattern (preprocessing/process_boughter.py):
- Uses riot_na (ANARCI wrapper) for CDR annotation
- Processes FASTA files with nucleotide sequences
- Outputs: Full sequence + individual CDR/FWR regions (amino acids)
- Format: CSV with columns like cdr1_aa_H, fwr1_aa_H, etc.
2. Data Loading (data.py):
- load_local_data(): Reads CSV files
- load_hf_dataset(): Reads from HuggingFace
- preprocess_raw_data(): Embeds sequences using ESM
- Standard format: sequence, label columns
3. Existing Datasets:
- data/train/VH_only_training_ready.csv → Boughter training data
- data/test/jain.csv → Jain test data (already processed)
4. Expected Output Format (from jain.csv):
id,heavy_seq,light_seq,label,source,smp,ova
abituzumab,QVQLQQSGGE...,DIQMTQSPSS...,0,jain2017,0.125938041,0.044483421
Implementation Complexity Assessment¶
Complexity Level: MEDIUM (2-phase implementation recommended)¶
Why Medium? 1. ✓ Simple aspects: - Raw data already available (mmc2.xlsx) - CDR annotations already in Excel file - Clear binary labels from PSR scores - CSV output format well-defined
- ⚠️ Complex aspects:
- Need to determine correct PSR threshold for binary labels
- Must match paper's exact 16 fragment types
- Integration with existing embedding pipeline
- ANARCI re-annotation vs using provided annotations
- Handling unbalanced dataset (7/398 non-specific)
Proposed Implementation: 2-Phase Approach¶
Phase 1: Basic Preprocessing Script (Minimal Viable Product)¶
Goal: Create test-ready CSV matching existing format
Effort: ~2-3 hours
Deliverable: data/test/shehata/processed/shehata.csv
Tasks:
1. ✅ Read mmc2.xlsx
2. ✅ Extract VH and VL protein sequences
3. ✅ Determine PSR threshold for binary labels (investigate paper/supplement)
4. ✅ Create CSV with columns: id, heavy_seq, light_seq, label, psr_score, b_cell_subset, source
5. ✅ Basic validation (sequence quality, label distribution)
6. ✅ Document threshold decision in docstring
Output Example:
id,heavy_seq,light_seq,label,psr_score,b_cell_subset,source
ADI-38502,EVQLLESGGGLVKPGG...,DIVMTQSPSTLSASVG...,0,0.0,IgG memory,shehata2019
ADI-38501,EVQLLESGGGLVQPGG...,DIVMTQSPATLSLSPG...,0,0.023184,IgG memory,shehata2019
Integration Point:
- Works with existing data.load_local_data()
- Can be used by test.py for model evaluation
- Enables reproduction of Figure 3C-D from paper
Phase 2: Full Fragment Extraction (Paper-Complete)¶
Goal: Generate all 16 antibody fragments for comprehensive testing
Effort: ~4-6 hours
Deliverable: preprocessing/shehata/step2_extract_fragments.py + multiple output CSVs
Tasks:
1. ✅ Re-annotate sequences using ANARCI (IMGT scheme) for consistency
2. ✅ Extract all 16 fragments:
- VH, VL (full variable domains)
- H-CDR1, H-CDR2, H-CDR3
- L-CDR1, L-CDR2, L-CDR3
- H-CDRs (concatenated)
- L-CDRs (concatenated)
- H-FWRs, L-FWRs
- VH+VL
- Other combinations tested in paper
3. ✅ Create separate CSV for each fragment type (like training data)
4. ✅ Match exact preprocessing from process_boughter.py style
5. ✅ Comprehensive validation against paper's reported statistics
Output Structure:
data/test/
├── shehata/
│ ├── VH_only_shehata.csv
│ ├── VL_only_shehata.csv
│ ├── H-CDR3_shehata.csv
│ ├── H-CDRs_shehata.csv
│ └── ... (16 total files)
Integration Point: - Enables testing all 16 fragment-specific models - Reproduces Table 4 results from paper - Allows fragment-level performance comparison
Decision Points & Open Questions¶
1. PSR Score Threshold for Binary Labels¶
Question: What PSR score constitutes "non-specific"?
Options: - A. PSR > 0 (147 non-specific) - Any polyreactivity - B. Paper's threshold (~7 non-specific) - Need to find this value - C. Quartile-based (PSR > 75th percentile ≈ 0.017)
Action Required: - Search paper supplement for threshold - Check original Shehata 2019 paper - Email authors if unclear - Document decision rationale
Current Data:
PSR Score Distribution:
Mean: 0.0346
Median: 0.0 (50% have PSR = 0)
75th percentile: 0.0169
Max: 0.711
2. Use Provided CDR Annotations vs Re-annotate with ANARCI?¶
Question: mmc2.xlsx already has CDR annotations. Re-annotate or use as-is?
Options: - A. Use provided annotations - Faster, author-verified - B. Re-annotate with ANARCI - Ensures consistency with Boughter processing
Recommendation: - Phase 1: Use provided annotations (faster MVP) - Phase 2: Re-annotate for consistency, compare results
3. Handle VH-only vs VH+VL sequences?¶
Current state: mmc2.xlsx has both VH and VL sequences
From paper: VH domain is primary predictor (71% accuracy)
Recommendation: - Generate both VH-only and VH+VL versions - Primary testing on VH-only (matches training data format) - VL available for future bispecific/full-length analysis
File Structure Plan¶
antibody_training_pipeline_ESM/
├── preprocessing/
│ ├── process_boughter.py (existing)
│ └── process_shehata.py (NEW - Phase 2)
│
├── scripts/ (NEW directory for utilities)
│ └── prepare_shehata_basic.py (NEW - Phase 1)
│
├── data/test/
│ ├── mmc2.xlsx (raw data - gitignored)
│ ├── mmc3.xlsx (clinical subset - gitignored)
│ ├── mmc4.xlsx (gitignored)
│ ├── mmc5.xlsx (gitignored)
│ ├── jain.csv (existing)
│ ├── shehata.csv (NEW - Phase 1 output)
│ └── shehata/ (NEW - Phase 2 outputs)
│ ├── VH_only_shehata.csv
│ ├── H-CDR3_shehata.csv
│ └── ... (other fragments)
│
└── docs/
├── shehata_preprocessing_implementation_plan.md (THIS FILE)
└── shehata_preprocessing_validation.md (NEW - post-implementation)
Integration with Existing Codebase¶
Compatibility Checklist:¶
✅ Data format: CSV with sequence, label columns
✅ Loading: Works with data.load_local_data(file_path, sequence_column, label_column)
✅ Testing: Compatible with test.py --data data/test/shehata/processed/shehata.csv
✅ Embedding: Sequences ready for ESM-1v processing
✅ Naming: Follows {fragment}_{dataset}.csv pattern
Code Reuse Opportunities:¶
- From
process_boughter.py: annotate()function → ANARCI annotation pattern- CDR/FWR extraction logic
-
Output formatting
-
From
data.py: load_local_data()→ No changes needed-
preprocess_raw_data()→ Works with any sequence list -
From existing test datasets:
jain.csv→ Template for column names and format
Validation Strategy¶
Post-Implementation Validation:¶
- Data Integrity:
- ✓ 398 antibodies (or 402 if including extras)
- ✓ No missing VH sequences
- ✓ Valid amino acid sequences only
-
✓ Label distribution matches paper
-
Paper Reproduction:
- ✓ Can reproduce Figure 3C-D (Shehata PSR predictions)
- ✓ Distribution of predicted probabilities matches paper
-
✓ ~7 non-specific antibodies in binary classification
-
Integration Testing:
- ✓ Loads successfully with
load_local_data() - ✓ Embeddings generate without errors
- ✓ Test script runs end-to-end
-
✓ Metrics calculated correctly
-
Cross-dataset Comparison:
- ✓ Format matches
jain.csvstructure - ✓ Sequence lengths in reasonable range
- ✓ Label distribution documented
Dependencies & Requirements¶
Python Packages:¶
pandas>=1.5.0 # Excel reading, CSV writing
openpyxl>=3.0.0 # Excel file support
numpy>=1.24.0 # Array operations
biopython>=1.79 # Sequence handling (if re-annotating)
anarci or riot_na # CDR annotation (Phase 2 only)
External Tools:¶
- ANARCI (Phase 2): CDR numbering tool
- Installation:
pip install riot-naor compile ANARCI from source - Alternative: Use provided annotations from mmc2.xlsx
Timeline Estimate¶
Phase 1 (Basic CSV):¶
- Planning: 1 hour (reading paper, this doc)
- Coding: 1-2 hours
- Testing/Validation: 1 hour
- Total: ~3-4 hours
Phase 2 (Full fragments):¶
- ANARCI setup: 1 hour
- Coding: 3-4 hours
- Testing/Validation: 2 hours
- Total: ~6-7 hours
Grand Total: 10-11 hours for complete implementation¶
Success Criteria¶
Phase 1 Complete When:¶
-
data/test/shehata/processed/shehata.csvexists - 398 rows (one per antibody)
- Columns:
id, heavy_seq, light_seq, label, psr_score, b_cell_subset, source - Binary labels defined with documented threshold (0.31002 = 98.24%ile)
- No missing VH/VL sequences
- Loads successfully with existing data pipeline (validation passed)
- README/documentation updated
Phase 2 Complete When:¶
-
preprocessing/shehata/step2_extract_fragments.pyscript exists - 16 fragment-specific CSV files generated in data/test/shehata/
- ANARCI re-annotation using IMGT scheme (riot_na v4.0.5)
- All fragments validated for sequence quality (lengths match expected ranges)
- Fragment CSVs have standardized format (id, sequence, label, psr_score, b_cell_subset, source)
- Can reproduce paper's Figure 3 results (requires model training/inference)
- Comprehensive documentation of preprocessing choices
Risks & Mitigation¶
Risk 1: PSR Threshold Uncertainty¶
Impact: Wrong threshold → incorrect labels → invalid test results Mitigation: - Document assumption clearly - Test with multiple thresholds - Compare to paper's reported statistics (7/398)
Risk 2: ANARCI Installation Issues¶
Impact: Can't complete Phase 2 Mitigation: - Phase 1 uses provided annotations - Document ANARCI as optional dependency - Provide containerized environment (Docker)
Risk 3: Format Incompatibility¶
Impact: Preprocessed data doesn't work with existing pipeline
Mitigation:
- Follow jain.csv as exact template
- Integration test early
- Ask maintainer (@ludocomito) for format validation
Implementation Status (2025-11-06)¶
Completed Tasks:¶
- ✅ Phase 1 Complete - Excel → CSV conversion (398 sequences)
- ✅ Phase 2 Complete - CSV → Fragments (16 fragment files, all gap-free)
- ✅ P0 Blocker Resolved - Gap characters eliminated from all fragments
- ✅ PSR Threshold Calibrated - Dataset-specific threshold: 0.5495 (implemented in
src/antibody_training_esm/core/classifier.py) - ✅ Validation Complete - All fragments verified gap-free, label distribution preserved
- ✅ Documentation Updated - See
data/test/shehata/README.mdfor SSOT
Historical Checklists (Archived):¶
For historical context on the implementation process, see:
- docs/datasets/shehata/archive/shehata_conversion_verification_report.md (Phase 1 verification)
- docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md (P0 gap blocker analysis)
- docs/datasets/shehata/archive/shehata_cleanup_plan.md (Dataset reorganization)
Current Status: Pipeline fully operational and ready for model training/testing.
References¶
- Sakhnini et al. 2025 - "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters"
- bioRxiv: 10.1101/2025.04.28.650927
-
Local:
literature/markdown/Sakhnini_2025_Antibody_NonSpecificity_PLM_Biophysical.md -
Shehata et al. 2019 - "Affinity Maturation Enhances Antibody Specificity but Compromises Conformational Stability"
- Cell Reports: 10.1016/j.celrep.2019.08.056
-
Supplementary: mmc2.xlsx (main dataset)
-
Repository Issues:
- Issue #3: Shehata preprocessing (this task)
- Issue #2: Jain preprocessing (reference)
- Issue #1: HuggingFace migration (future)
Appendix: Raw Data Inventory¶
mmc2.xlsx (Main Dataset):¶
- Rows: 402 antibodies (398 reported + 4 extras)
- Columns: 25 total
- Clone name
- B cell subset (Naïve, IgG memory, IgM memory, LLPCs)
- Biophysical: TmApp, HIC retention time
- PSR Score (non-specificity label)
- Sequences: VH/VL Protein + DNA
- CDR/FWR regions: Pre-annotated (H-CDR1-3, L-CDR1-3, FR1-4)
- Germline info
Other Files:¶
- mmc3.xlsx: 42 clinical antibodies (subset)
- mmc4.xlsx: 10 antibodies with mutation data
- mmc5.xlsx: 26 antibodies with HIC focus
Decision: Use mmc2.xlsx as primary source for all preprocessing.
END OF IMPLEMENTATION PLAN
Status: Ready for review and decision on Phase 1 vs Phase 1+2 implementation.