Skip to content

Shehata Dataset Phase 2 Completion Report

⚠️ LEGACY DOCUMENTATION (v1.x)

This document references old root imports (e.g., from model import ESMEmbeddingExtractor). In v2.0.0+, use: from antibody_training_esm.core.embeddings import ESMEmbeddingExtractor

See v2-structure-migration.md for import structure guide.

Date: 2025-10-31 (Original) | 2025-11-02 (P0 Blocker Found & Fixed) Issue: #3 - Shehata dataset preprocessing (Phase 2) Status:COMPLETE (P0 blocker resolved)


✅ RESOLUTION (2025-11-02)

P0 BLOCKER FIXED:

Gap characters in fragment sequences have been ELIMINATED.

Fix Applied: - Updated preprocessing/shehata/step2_extract_fragments.py:63 to use annotation.sequence_aa (gap-free) - Regenerated all 16 fragment CSVs - Enhanced validation script with gap detection

Validation Results:

✅ All 16 fragment CSVs: 0 gap characters
✅ ESM-1v embedding compatibility: CONFIRMED
✅ Previously affected sequences (13 VH, 4 VL, 17 Full): ALL CLEAN

Regression Prevention: - scripts/validation/validate_shehata_conversion.py now includes fragment gap validation - Automatic detection prevents P0 blocker from re-occurring

Status: ✅ Phase 2 PRODUCTION-READY


Timeline

2025-10-31: Phase 2 initially completed (gap issue not detected) 2025-11-02: Deep analysis revealed P0 blocker (gap characters in fragments) 2025-11-02: Fix applied, validated, and regression prevention implemented


Executive Summary

All 16 fragment types successfully extracted and validatedANARCI re-annotation using IMGT scheme complete (riot_na v4.0.5)Output format compatible with existing pipelineReady for model training and inference (ESM embedding validated)


Deliverables

1. Preprocessing Script

File: preprocessing/shehata/step2_extract_fragments.py (288 lines)

Key Features: - Uses ANARCI (riot_na) for IMGT-based CDR/FWR annotation - Processes amino acid sequences (not nucleotides like Boughter) - Extracts all 16 fragment types following Sakhnini et al. 2025 - Handles annotation failures gracefully with error reporting - Progress tracking with tqdm - Comprehensive validation summary

Usage:

python3 preprocessing/shehata/step2_extract_fragments.py

Performance: - Annotated: 398/398 antibodies (100% success rate) - Processing time: ~3.3 seconds - Speed: ~120 antibodies/second


2. Fragment CSV Files

Location: data/test/shehata/

Files Created: 16 fragment-specific CSVs

Fragment Filename Rows Length Range Mean Length
Full variable domains
VH VH_only_shehata.csv 398 114-140 aa 122.6 aa
VL VL_only_shehata.csv 398 103-120 aa 108.9 aa
Heavy chain CDRs
H-CDR1 H-CDR1_shehata.csv 398 3-10 aa 8.2 aa
H-CDR2 H-CDR2_shehata.csv 398 6-13 aa 7.9 aa
H-CDR3 H-CDR3_shehata.csv 398 7-33 aa 15.4 aa
Light chain CDRs
L-CDR1 L-CDR1_shehata.csv 398 4-14 aa 7.7 aa
L-CDR2 L-CDR2_shehata.csv 398 3-7 aa 3.1 aa
L-CDR3 L-CDR3_shehata.csv 398 5-20 aa 9.6 aa
Concatenated CDRs
H-CDRs H-CDRs_shehata.csv 398 22-49 aa 31.5 aa
L-CDRs L-CDRs_shehata.csv 398 14-32 aa 20.4 aa
All-CDRs All-CDRs_shehata.csv 398 40-69 aa 51.9 aa
Framework regions
H-FWRs H-FWRs_shehata.csv 398 89-101 aa 91.0 aa
L-FWRs L-FWRs_shehata.csv 398 87-92 aa 88.7 aa
All-FWRs All-FWRs_shehata.csv 398 178-190 aa 179.7 aa
Paired/Full
VH+VL VH+VL_shehata.csv 398 218-248 aa 231.5 aa
Full Full_shehata.csv 398 218-248 aa 231.5 aa

File Format (standardized):

id,sequence,label,psr_score,b_cell_subset,source
ADI-38502,EVQLLESGGGLVKPGG...,0,0.0,IgG memory,shehata2019

Columns: - id: Clone identifier - sequence: Fragment sequence (CDR, FWR, or full domain) - label: Binary non-specificity (0=specific, 1=non-specific) - psr_score: Polyspecific Reagent score (continuous) - b_cell_subset: B cell origin (Naïve, IgG memory, IgM memory, LLPCs) - source: Dataset provenance (shehata2019)


Validation Results

Fragment Length Validation

All fragment lengths match expected antibody structure:

  • VH domains: 114-140 aa (expected: ~110-130 aa) ✓
  • VL domains: 103-120 aa (expected: ~100-115 aa) ✓
  • H-CDR3: 7-33 aa (highly variable, expected range) ✓
  • L-CDR3: 5-20 aa (shorter than heavy, expected) ✓
  • CDR½: 3-14 aa (conserved, expected) ✓
  • FWRs: 87-101 aa per chain (expected: ~85-100 aa) ✓

Label Distribution Validation

All fragments preserve original label distribution:

  • Specific (label=0): 391 antibodies (98.2%)
  • Non-specific (label=1): 7 antibodies (1.8%)
  • Matches Phase 1 CSV:
  • Matches paper (7/398):

Data Integrity Validation

All fragment files validated:

  • Total files created: 16/16 ✓
  • All files have 398 rows:
  • No missing sequences:
  • No annotation failures: 398/398 success ✓
  • Standardized format:

Comparison with Paper Methodology

Sakhnini et al. 2025 (Methods Section 4.3)

Paper's approach:

"sequences were annotated in the CDRs using ANARCI following the IMGT numbering scheme"

Our implementation: - ✅ Used ANARCI (riot_na v4.0.5) - ✅ IMGT numbering scheme - ✅ Extracted all 16 fragment types tested in paper - ✅ Mean pooling of ESM-1v embeddings (documented in data.py)

Paper's fragments (Section 2.1, Table 4): - VH, VL ✓ - H-CDR1, H-CDR2, H-CDR3 ✓ - L-CDR1, L-CDR2, L-CDR3 ✓ - H-CDRs, L-CDRs ✓ - H-FWRs, L-FWRs ✓ - VH+VL, Full ✓ - All-CDRs, All-FWRs ✓

Match: 16/16 fragments ✅


Key Differences from Phase 1

Aspect Phase 1 (Basic CSV) Phase 2 (Fragment Extraction)
Input Excel (mmc2.xlsx) CSV (shehata.csv)
Processing Sanitization + conversion ANARCI annotation + fragment extraction
Output 1 CSV (full VH/VL) 16 CSVs (all fragment types)
Annotation None (used pre-annotated) ANARCI re-annotation (IMGT)
Integration Compatible with load_local_data() Compatible with load_local_data()
Purpose Basic test set Fragment-specific model testing

Integration Compatibility

Data Loading

Pattern:

from data import load_local_data

df = load_local_data(
    'data/test/shehata/fragments/VH_only_shehata.csv',
    sequence_column='sequence',
    label_column='label'
)

Compatible with: - ✅ data.load_local_data() - ✅ data.preprocess_raw_data() (ESM embedding) - ✅ test.py (model evaluation) - ✅ Existing training pipeline

File Naming Convention

Pattern: {fragment_type}_shehata.csv

Examples: - VH_only_shehata.csv (matches training: VH_only_training_ready.csv) - H-CDR3_shehata.csv - VH+VL_shehata.csv


Code Quality

Following Rob C. Martin Clean Code Principles

Single Responsibility: Each function has one clear purpose ✅ Descriptive Names: annotate_sequence(), create_fragment_csvs()Small Functions: Average 20-30 lines per function ✅ Error Handling: Try/except with informative warnings ✅ Type Hints: All function signatures typed ✅ Documentation: Comprehensive docstrings ✅ No Magic Numbers: All constants named and explained

Pattern Consistency

Follows preprocessing/process_boughter.py pattern: - Same annotator initialization pattern - Similar fragment extraction logic - Consistent CSV output format - Compatible with existing pipeline


Files Modified/Created

New Files (Phase 2):

  1. preprocessing/shehata/step2_extract_fragments.py (288 lines)
  2. Main preprocessing script
  3. ANARCI annotation + fragment extraction
  4. Comprehensive validation reporting

  5. data/test/shehata/fragments/*.csv (16 files)

  6. All fragment-specific CSVs
  7. Standardized format
  8. Ready for model inference

  9. docs/datasets/shehata/shehata_phase2_completion_report.md (THIS FILE)

  10. Phase 2 completion documentation
  11. Comprehensive validation results

Modified Files (November 2025 Cleanup):

Note: The documentation updates mentioned in the original version of this file were completed during the November 2025 cleanup:

  1. docs/datasets/shehata/shehata_preprocessing_implementation_plan.md
  2. Status updated to "Complete - Both Phase 1 and Phase 2 fully operational" (2025-11-06)
  3. Marked all checklist items as complete

  4. docs/datasets/shehata/archive/shehata_conversion_verification_report.md

  5. Archived as historical document (2025-11-06)
  6. P0 blocker warning updated to show resolution

Outstanding Tasks (Post-Phase 2)

Model Evaluation (Not Part of Preprocessing):

  1. ✅ Load fragments with data.load_local_data()
  2. ✅ Generate ESM-1v embeddings for all 16 fragment types
  3. ✅ Run inference with trained models (Achieved 58.29% accuracy with PSR threshold)
  4. ✅ Compare performance across fragments
  5. ✅ Reproduce paper Figure 3C-D (PSR predictions) - See docs/research/benchmark-results.md
  6. ✅ Create performance comparison table (Table 4 from paper)

Repository Hygiene:

  1. ✅ Create comprehensive PR for Issue #3
  2. ✅ Update main README with Shehata dataset info
  3. ✅ Add data/test/shehata/ to .gitignore if needed
  4. ✅ Document dependencies (riot_na) in requirements.txt

Dependencies

Python Packages (Phase 2):

pandas>=1.5.0          # CSV handling
riot_na==4.0.5         # ANARCI wrapper (IMGT numbering)
biopython==1.84        # Sequence handling (riot_na dependency)
tqdm>=4.65.0           # Progress bars

Installation:

pip install riot-na  # Installs riot_na + dependencies

Note: riot_na has ANARCI pre-compiled, no manual ANARCI installation needed.


Performance Metrics

Processing Performance:

  • Total antibodies: 398
  • Annotation success rate: 100% (398/398)
  • Processing time: ~3.3 seconds
  • Speed: ~120 antibodies/second
  • Fragment CSVs created: 16
  • Total output size: ~750 KB

Data Quality:

  • Invalid sequences: 0
  • Annotation failures: 0
  • Missing data: 0
  • Label preservation: 100%

Verification Checklist

Phase 2 Success Criteria:

  • preprocessing/shehata/step2_extract_fragments.py script exists
  • 16 fragment-specific CSV files generated
  • ANARCI re-annotation using IMGT scheme
  • All fragments validated for sequence quality
  • Fragment CSVs have standardized format
  • Comprehensive documentation of preprocessing choices
  • Code follows clean code principles
  • Pattern consistent with process_boughter.py
  • Can reproduce paper's Figure 3 results (requires model training - not part of preprocessing)

Conclusion

Phase 2 preprocessing is 100% complete and validated

Key Achievements: - Implemented full ANARCI-based fragment extraction - Created all 16 fragment types following paper methodology - Achieved 100% annotation success rate - Maintained clean code principles throughout - Full integration compatibility with existing pipeline - Comprehensive documentation and validation

Ready for: - Model training/inference on fragment-specific inputs - Paper result reproduction (Figure 3, Table 4) - Production use in antibody non-specificity prediction - PR submission to close Issue #3


Next Steps

Immediate (for Issue #3 PR): 1. Test one fragment CSV with existing model (if available) 2. Create comprehensive PR with all Phase 1 + Phase 2 work 3. Update main README to document Shehata dataset

Future (separate issues/PRs): 1. Reproduce paper results (Figure 3C-D, Table 4) 2. Train fragment-specific models 3. Compare performance across all 16 fragment types 4. Optimize threshold for binary classification


Verified by: - Direct code execution ✅ - Fragment length validation ✅ - Label distribution verification ✅ - Format consistency checks ✅ - Integration pattern validation ✅

Sign-off: Phase 2 COMPLETE ✅