
Preprocessing Internals

Target Audience: Developers implementing new dataset preprocessing pipelines

Purpose: Document general preprocessing techniques, validation strategies, and Excel-to-CSV conversion methods used across datasets (Jain, Shehata).


When to Use This Guide

You need this guide if you're:

  • Adding a new dataset to the pipeline
  • Implementing preprocessing scripts for Excel/CSV data sources
  • Validating dataset conversions with multi-method cross-checking
  • Deriving binary labels from continuous scores (thresholds)
  • Debugging preprocessing issues (missing data, sequence validation)


Quick Reference: Preprocessing Patterns

All datasets follow similar patterns:

  1. Format Conversion - Raw data (Excel, CSV, FASTA) → canonical CSV
  2. Fragment Extraction - Canonical CSV → fragment CSVs (VH, CDRs, FWRs)
  3. Validation - Multi-method verification, checksum generation
  4. Documentation - Update dataset README with provenance

Key Files:

  • preprocessing/{dataset}/step1_*.py - Format conversion
  • preprocessing/{dataset}/step2_*.py - Fragment extraction
  • scripts/validation/validate_{dataset}_conversion.py - Validation scripts
  • docs/datasets/{dataset}/README.md - Dataset documentation
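For orientation, a minimal sketch of driving these steps end-to-end for a hypothetical new dataset. The step2 filename is a placeholder (the repo convention is step2_*.py); substitute the actual script names:

import subprocess

dataset = "mydataset"  # hypothetical dataset name

scripts = [
    f"preprocessing/{dataset}/step1_convert_excel_to_csv.py",  # Step 1: format conversion
    f"preprocessing/{dataset}/step2_extract_fragments.py",     # Step 2: fragment extraction (placeholder name)
    f"scripts/validation/validate_{dataset}_conversion.py",    # Step 3: validation
]

for script in scripts:
    # check=True stops the pipeline at the first failing step
    subprocess.run(["python3", script], check=True)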


Excel to CSV Conversion Methods & Validation

⚠️ LEGACY DOCUMENTATION NOTE

This document may reference old root imports (e.g., from data import load_data). In v2.0.0+, use: from antibody_training_esm.data.loaders import load_data

See docs/archive/migrations/v2-structure-migration.md for migration details.

Purpose: Document all available methods for converting supplementary Excel files (Shehata & Jain datasets) to CSV with validation.

Last Updated: 2025-11-01
Related Issues: #3 (Shehata), #2 (Jain)


Overview of Methods

Method 1a: Python Script (Shehata)

  • Tool: preprocessing/shehata/step1_convert_excel_to_csv.py
  • Pros: Full control, validation built-in, reproducible, transparent
  • Cons: Requires Python environment
  • Validation: Multi-method cross-checking with scripts/validation/validate_shehata_conversion.py

Method 1b: Python Script (Jain)

  • Tool: preprocessing/jain/step1_convert_excel_to_csv.py
  • Pros: Deterministic flag derivation (Table 1 thresholds), full provenance columns, label handling consistent with Sakhnini et al.
  • Cons: Requires Python environment
  • Validation: scripts/validation/validate_jain_conversion.py (rebuilds pipeline & checks SHA256)

Method 2: CLI Tools

  • Tools: in2csv, ssconvert, xlsx2csv
  • Pros: Simple one-liners, no coding
  • Cons: Less control, limited validation, may lose data fidelity

Method 3: Excel GUI Export

  • Tool: Microsoft Excel, LibreOffice Calc
  • Pros: Visual inspection, familiar interface
  • Cons: Manual process, not reproducible, error-prone

Method 1a: Shehata Python Script (DETAILED)

Installation

# Check whether pandas and openpyxl are already installed
pip show pandas openpyxl

# If needed:
pip install pandas openpyxl

Usage (Shehata)

cd /Users/ray/Desktop/CLARITY-DIGITAL-TWIN/antibody_training_pipeline_ESM

# Run conversion script (interactive)
python3 preprocessing/shehata/step1_convert_excel_to_csv.py

Interactive prompts:

  1. Shows PSR score distribution
  2. Suggests threshold based on paper (7/398 non-specific)
  3. Asks for confirmation or custom threshold
  4. Validates sequences
  5. Saves to data/test/shehata/processed/shehata.csv

Validation (Shehata)

# Run multi-method validation
python3 scripts/validation/validate_shehata_conversion.py

What it checks:

  • ✓ Reads Excel with pandas (openpyxl)
  • ✓ Reads Excel with openpyxl directly
  • ✓ Compares both methods (ensures Excel reading is correct)
  • ✓ Compares CSV with original Excel (ensures conversion is correct)
  • ✓ Validates sequences (VH Protein → heavy_seq, VL Protein → light_seq)
  • ✓ Validates ID mapping (Clone name → id)
  • ✓ Calculates checksums for integrity
  • ✓ Reports summary statistics
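A minimal sketch of the two-reader cross-check; this illustrates the technique, not the validation script's exact internals:

import pandas as pd
from openpyxl import load_workbook

xlsx = "data/test/shehata/raw/shehata-mmc2.xlsx"

# Reader 1: pandas (openpyxl engine)
df_pandas = pd.read_excel(xlsx, engine="openpyxl")

# Reader 2: openpyxl directly, cell values only (no formulas)
ws = load_workbook(xlsx, read_only=True, data_only=True).active
rows = list(ws.iter_rows(values_only=True))
df_direct = pd.DataFrame(rows[1:], columns=rows[0])

# Both readers must agree cell-for-cell before the CSV conversion is trusted
assert df_pandas.shape == df_direct.shape, "row/column count mismatch"
mismatches = (df_pandas.fillna("").astype(str).values
              != df_direct.fillna("").astype(str).values).sum()
print(f"Mismatching cells: {mismatches}")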

Output Example (Shehata)

id,heavy_seq,light_seq,label,psr_score,b_cell_subset,source
ADI-38502,EVQLLESGGGLVKPGGSLRLSCAASGFIFSDYSMNWVRQAPGKGLEWVS...,DIVMTQSPSTLSASVGDRVTITCRASQSISSWLAWYQQKPGKAPKLLIYK...,0,0.0,IgG memory,shehata2019
ADI-38501,EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLEWVS...,DIVMTQSPATLSLSPGERATLSCRASQSISTYLAWYQQKPGQAPRLLIY...,0,0.023184,IgG memory,shehata2019

Method 1b: Jain Python Script (DETAILED)

Usage (Jain)

cd /Users/ray/Desktop/CLARITY-DIGITAL-TWIN/antibody_training_pipeline_ESM

# Generate canonical Jain dataset
python3 preprocessing/jain/step1_convert_excel_to_csv.py --verbose

What it does:

  1. Loads SD01/SD02/SD03 spreadsheets (metadata, VH/VL sequences, biophysical assays)
  2. Sanitizes amino acid sequences (removes gaps/whitespace/non-standard residues)
  3. Applies Jain Table 1 thresholds (four developability flag clusters)
  4. Emits data/test/jain.csv with explicit flags_total, flag_category, nullable label, and supporting assay columns

Validation (Jain)

python3 scripts/validation/validate_jain_conversion.py

Checks performed:

  • Re-runs the conversion pipeline in-memory and asserts equality with the CSV (assert_frame_equal)
  • Reports flag/label distributions (specific 67 / mild 67 / non_specific 3)
  • Confirms VH/VL sequences contain only valid residues (ACDEFGHIKLMNPQRSTVWYX)
  • Prints Table 1 threshold clauses and SHA256 checksum (b1a6d7399260aef1a894743877a726caa248d12d948b8216822cb2a5b9bc96a3)

Output Example (Jain)

id,heavy_seq,light_seq,flags_total,flag_category,label,flag_self_interaction,flag_chromatography,flag_polyreactivity,flag_stability,source,smp,ova,bvp_elisa,...
abituzumab,QVQLQQSGGELAKPGASVKVSCKASGYTFSSFWMHWVRQAPGKGLEWIGYINPRSGYTEYNEIFRDKATMTTDTSTSTAYMELSSLRSEDTAVYYCASFLGRGAMDYWGQGTTVTVSS,DIQMTQSPSSLSASVGDRVTITCRASQDISNYLAWYQQKPGKAPKLLIYYTSKIHSGVPSRFSGSGSGTDYTFTISSLQPEDIATYYCQQGNTFPYTFGQGTKVEIK,1,mild,,False,False,True,False,jain2017,0.166666,1.137375,2.720799,...

label uses pandas nullable integers: 0 for specific, 1 for ≥4 flags (non-specific), blank for mild (1–3 flags).
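A minimal sketch of that derivation (the flag counts here are illustrative; the real script computes flags_total from the Table 1 thresholds):

import pandas as pd

def derive_label(flags_total: int):
    # 0 flags -> specific (0); >=4 flags -> non-specific (1); 1-3 flags -> mild (NA)
    if flags_total == 0:
        return 0
    if flags_total >= 4:
        return 1
    return pd.NA

df = pd.DataFrame({"flags_total": [0, 1, 3, 4, 6]})
# Nullable Int64 keeps mild rows blank instead of coercing the column to float
df["label"] = df["flags_total"].map(derive_label).astype("Int64")
print(df)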


Method 2: CLI Tools

Option A: in2csv (csvkit)

Install:

pip install csvkit

Usage:

# Convert single sheet
in2csv data/test/shehata/raw/shehata-mmc2.xlsx > data/test/mmc2_raw.csv

# Specify sheet
in2csv --sheet "Sheet1" data/test/shehata/raw/shehata-mmc2.xlsx > data/test/mmc2_raw.csv

Pros:

  • Simple one-liner
  • Part of csvkit suite (useful for CSV manipulation)

Cons:

  • No column mapping (needs post-processing)
  • No label conversion
  • No validation

Option B: ssconvert (Gnumeric)

Install (macOS):

brew install gnumeric

Usage:

ssconvert data/test/shehata/raw/shehata-mmc2.xlsx data/test/mmc2_raw.csv

Pros:

  • Fast
  • Reliable for simple conversions

Cons:

  • Requires Gnumeric installation (large dependency)
  • No Python integration
  • No validation

Option C: xlsx2csv

Install:

pip install xlsx2csv

Usage:

xlsx2csv data/test/shehata/raw/shehata-mmc2.xlsx data/test/mmc2_raw.csv

Pros:

  • Lightweight
  • Pure Python

Cons:

  • Basic functionality
  • No post-processing

After CLI Conversion: Post-Processing Needed

import pandas as pd

# Read raw CSV from CLI tool
df_raw = pd.read_csv('data/test/mmc2_raw.csv')

# Choose the PSR threshold explicitly (see "Issue 1: PSR Threshold Uncertainty" below)
threshold = 0.0  # placeholder - set per your chosen strategy

# Map to the canonical shehata.csv format
df_processed = pd.DataFrame({
    'id': df_raw['Clone name'],
    'heavy_seq': df_raw['VH Protein'],
    'light_seq': df_raw['VL Protein'],
    'label': (df_raw['PSR Score'] > threshold).astype(int),
    'psr_score': df_raw['PSR Score'],
    'b_cell_subset': df_raw['B cell subset'],
    'source': 'shehata2019'
})

df_processed.to_csv('data/test/shehata/processed/shehata.csv', index=False)

Method 3: Excel GUI Export

Steps

  1. Open data/test/shehata/raw/shehata-mmc2.xlsx in Excel or LibreOffice
  2. File → Save As → CSV (UTF-8)
  3. Save as mmc2_raw.csv
  4. Run post-processing script (same as CLI method above)

Cons:

  • Manual, error-prone
  • Not reproducible
  • Hard to document
  • NOT RECOMMENDED for scientific work


Validation Strategy: Multi-Layer Approach

Layer 1: Internal Consistency

  • Read Excel with multiple libraries (pandas, openpyxl)
  • Compare results to ensure Excel reading is correct

Layer 2: Conversion Accuracy

  • Compare CSV output with original Excel
  • Validate every sequence (VH, VL)
  • Check ID mapping
  • Verify label conversion logic

Layer 3: Format Compatibility

  • Compare with jain.csv structure
  • Test loading with data.load_local_data()
  • Ensure column names match expected format

Layer 4: Statistical Validation

  • Check row counts (should be 398-402)
  • Verify label distribution (7/398 non-specific per paper)
  • Validate sequence lengths (reasonable range)
  • Check for missing data

Layer 5: File Integrity

  • Calculate checksums (SHA256)
  • Store for future verification
  • Detect any corruption or modification
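A minimal sketch of layer 5 using the standard library (paths follow the Shehata workflow):

import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    # Stream in chunks so large files never load into memory at once
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

for path in ("data/test/shehata/raw/shehata-mmc2.xlsx",
             "data/test/shehata/processed/shehata.csv"):
    print(path, sha256_of(path))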

Comparison of Methods

| Method | Effort | Control | Validation | Reproducible | Recommended |
|---|---|---|---|---|---|
| Python script | Low (run script) | High | Built-in | ✓ Yes | YES |
| in2csv | Low (one-liner) | Low | Manual | ✓ Yes | Only if Python unavailable |
| ssconvert | Low | Low | Manual | ✓ Yes | No (large dependency) |
| xlsx2csv | Low | Low | Manual | ✓ Yes | No (same as in2csv but worse) |
| Excel GUI | Medium | Low | None | ✗ No | NO |

Recommended Workflow

Step 1: Convert

cd /Users/ray/Desktop/CLARITY-DIGITAL-TWIN/antibody_training_pipeline_ESM
python3 preprocessing/shehata/step1_convert_excel_to_csv.py

Step 2: Validate

python3 scripts/validation/validate_shehata_conversion.py

Step 3: Compare with Jain

# Check format matches
head -n 3 data/test/jain.csv
head -n 3 data/test/shehata/processed/shehata.csv

# Check column compatibility
python3 -c "
import pandas as pd
jain = pd.read_csv('data/test/jain.csv')
shehata = pd.read_csv('data/test/shehata/processed/shehata.csv')
print('Jain columns:', list(jain.columns))
print('Shehata columns:', list(shehata.columns))
print('Common columns:', set(jain.columns) & set(shehata.columns))
"

Step 4: Integration Test

# Test with existing data pipeline
python3 -c "
import numpy as np

from data import load_local_data  # legacy root import; see the migration note above for v2.0.0+

# Load Shehata
X, y = load_local_data(
    'data/test/shehata/processed/shehata.csv',
    sequence_column='heavy_seq',
    label_column='label'
)
print(f'Loaded {len(X)} sequences, {len(y)} labels')
print(f'Label distribution: {list(zip(*np.unique(y, return_counts=True)))}')
"

Handling Potential Issues

Issue 1: PSR Threshold Uncertainty

Problem: Paper says "7/398 non-specific" but doesn't specify exact threshold

Solutions:

  1. Percentile-based: Use 98.24th percentile (7/398 = 1.76%) - see the sketch below
  2. Literature search: Check Shehata et al. 2019 original paper
  3. Conservative: Use PSR > 0 (any polyreactivity)
  4. Ask maintainer: Contact @ludocomito or paper authors

Implemented: Script calculates percentile and prompts user to confirm
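A minimal sketch of the percentile-based option (solution 1 above), computed from the raw PSR scores; the column name follows the post-processing snippet earlier:

import numpy as np
import pandas as pd

df = pd.read_excel("data/test/shehata/raw/shehata-mmc2.xlsx")

# 7/398 non-specific => threshold at the (1 - 7/398) ~ 98.24th percentile
threshold = np.percentile(df["PSR Score"].dropna(), 100 * (1 - 7 / 398))
labels = (df["PSR Score"] > threshold).astype(int)
print(f"threshold={threshold:.4f}, non-specific={labels.sum()}/{len(labels)}")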

Issue 2: Missing Data

Problem: mmc2.xlsx has 402 rows but paper reports 398

Solutions:

  1. Check if 4 extras are controls/outliers
  2. Filter based on PSR score availability
  3. Document the discrepancy
  4. Keep all 402 unless paper specifies exclusion criteria

Implemented: Script reports row count and missing data
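A minimal sketch of that report (column name per the post-processing snippet earlier):

import pandas as pd

df = pd.read_excel("data/test/shehata/raw/shehata-mmc2.xlsx")

print(f"Rows: {len(df)} (paper reports 398)")
print("Missing values per column:")
print(df.isna().sum())
# Rows lacking a PSR score are the most likely exclusion candidates
print(f"Rows without PSR Score: {df['PSR Score'].isna().sum()}")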

Issue 3: Sequence Validation Failures

Problem: Invalid amino acids or corrupted sequences

Solutions:

  1. Check against valid AA alphabet (ACDEFGHIKLMNPQRSTVWY) - see the sketch below
  2. Compare multiple Excel reading methods
  3. Manual spot-check suspicious sequences
  4. Report all invalid sequences for review

Implemented: Validation script checks AA validity
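A minimal sketch of the alphabet check (solution 1 above):

import pandas as pd

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

df = pd.read_csv("data/test/shehata/processed/shehata.csv")

for col in ("heavy_seq", "light_seq"):
    # Flag any sequence containing residues outside the 20-letter alphabet
    invalid = df[~df[col].fillna("").map(lambda s: set(s) <= VALID_AA)]
    for _, row in invalid.iterrows():
        print(f"{row['id']} {col}: invalid residues {set(row[col]) - VALID_AA}")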


External Verification (Using Multiple Agents)

Approach 1: Use Task Tool with Multiple Agents

# Launch multiple agents in parallel to verify conversion
# Agent 1: Read Excel and report statistics
# Agent 2: Read CSV and report statistics
# Agent 3: Compare and validate

# Then cross-check their consensus

Approach 2: Use Different Python Environments

# Environment 1: conda with pandas 1.x
conda create -n verify1 pandas=1.5 openpyxl
conda activate verify1
python scripts/validation/validate_shehata_conversion.py > verify1.log

# Environment 2: venv with pandas 2.x
python3 -m venv verify2
source verify2/bin/activate
pip install pandas openpyxl
python scripts/validation/validate_shehata_conversion.py > verify2.log

# Compare logs
diff verify1.log verify2.log

Approach 3: Independent Tool Verification

# Method 1: Python script
python3 preprocessing/shehata/step1_convert_excel_to_csv.py

# Method 2: in2csv + manual processing
in2csv data/test/shehata/raw/shehata-mmc2.xlsx > mmc2_in2csv.csv

# Method 3: R (if available)
Rscript -e "
library(readxl)
library(readr)
df <- read_excel('data/test/shehata/raw/shehata-mmc2.xlsx')
write_csv(df, 'mmc2_R.csv')
"

# Compare all three CSVs
# They should have identical VH/VL sequences
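A minimal sketch of that comparison, assuming all three files preserve row order; the raw exports keep the supplement's original headers, while the processed CSV uses the canonical ones:

import pandas as pd

processed = pd.read_csv("data/test/shehata/processed/shehata.csv")
raw_in2csv = pd.read_csv("mmc2_in2csv.csv")
raw_r = pd.read_csv("mmc2_R.csv")

for name, raw in [("in2csv", raw_in2csv), ("R", raw_r)]:
    vh_ok = processed["heavy_seq"].equals(raw["VH Protein"])
    vl_ok = processed["light_seq"].equals(raw["VL Protein"])
    print(f"{name}: VH match={vh_ok}, VL match={vl_ok}")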

Checksums for Verification

After running conversion, store these checksums:

# Excel source file
shasum -a 256 data/test/shehata/raw/shehata-mmc2.xlsx

# Generated CSV
shasum -a 256 data/test/shehata/processed/shehata.csv

# Store in docs/checksums.txt for future verification

Expected checksums (after first conversion):

# To be filled after first successful conversion
# mmc2.xlsx: <SHA256>
# shehata.csv: <SHA256>


Documentation Requirements

After conversion, document:

  1. Method used: Python script / CLI tool / GUI
  2. PSR threshold: Exact value and rationale
  3. Row count: 398 or 402, and why
  4. Exclusions: Any sequences excluded and reason
  5. Validation results: Pass/fail for all checks
  6. Checksums: SHA256 for source and output
  7. Date: When conversion was performed
  8. Reviewer: Who validated the conversion

Template: See docs/shehata_conversion_log.md (to be created after conversion)


Sequence Handling & ANARCI Gotchas

Important Lesson: Always use the canonical source sequence for full V-domains. Do not reconstruct them from annotated fragments.

The "Data Cleaning" Trap

A common pattern in bioinformatics pipelines is to "clean" sequence data by annotating it (e.g., with ANARCI) and then reconstructing the sequence from the identified regions (FWR1 + CDR1 + ... + FWR4).

DO NOT DO THIS.

Why Reconstruction Fails

Reference numbering schemes like IMGT define strict boundaries for variable regions. However, real-world biological sequences (and canonical dataset records) often contain biologically relevant residues at the termini that fall just outside these strict boundaries.

Case Study: Pertuzumab (Jain Dataset)

  • Canonical Input: ...YWGQGTLVTVSS (Ends in 'SS')
  • ANARCI IMGT FWR4: ...YWGQGTLVTVS (Ends in 'S')
  • The Discrepancy: ANARCI considers the final 'S' (position 119) to be part of the constant region start, excluding it from the FWR4 annotation.

If you reconstruct the sequence from fragments ("".join([fwr1, ..., fwr4])), you lose the final 'S'. This modifies the sequence relative to the ground truth and invalidates benchmarks.

Best Practice

When extracting fragments or processing sequences:

  1. Trust the Source: Use the input sequence (annotation.sequence_aa) as the definitive full V-domain.
  2. Use Fragments for Features: Use the annotated regions (CDRs, FWRs) for feature engineering or specific analysis.
  3. Validate Lengths: Ensure your processed sequences match the length of the canonical input sequences (a minimal check is sketched after the code example below).

Code Example (Correct)

# preprocessing/fragment_utils.py

def process_sequence(annotation):
    # CORRECT: Use the full annotated sequence
    full_seq = annotation.sequence_aa

    # INCORRECT: Reconstruct from fragments (RISK OF DATA LOSS)
    # full_seq = "".join([annotation.fwr1_aa, annotation.cdr1_aa, ...])

    return full_seq