⚠️ **HISTORICAL DOCUMENT - November 2025 Cleanup**

This document describes the cleanup plan from 2025-11-05 that has since been executed.
For current pipeline documentation, see `data/test/shehata/README.md`. The cleanup described below is complete and all changes have been applied.

Note: This document references `leroy-jenkins/full-send`, which was renamed to `main` on 2025-11-28.
# Shehata Dataset Cleanup - Complete Plan (HISTORICAL)

**Date:** 2025-11-05
**Branch:** `leroy-jenkins/shehata-cleanup`
**Status:** ✅ EXECUTED (archived for provenance)
## Executive Summary

Complete cleanup of the Shehata dataset AND script organization, following Robert C. Martin discipline.

Scope:
1. Reorganize Shehata into a 4-tier structure (`raw/` → `processed/` → `canonical/` → `fragments/`)
2. Update 8 Python scripts with new paths
3. Update 35+ documentation references
4. Clean up duplicate scripts (`scripts/` root vs `scripts/*/` subdirectories)

Validation: All claims verified from first principles.
## Decisions Made (Final)

- **Script duplication:** Option B - delete root duplicates, use subdirectory versions
- **`canonical/` directory:** empty with README (Shehata is an external test set)
- **mmc3/4/5 files:** move to `raw/` (complete provenance)
## Part 1: File Reorganization

### Current State

```
data/test/
├── shehata-mmc2.xlsx   (Main data - 402 rows)
├── shehata-mmc3.xlsx   (Unused)
├── shehata-mmc4.xlsx   (Unused)
├── shehata-mmc5.xlsx   (Unused)
├── shehata.csv         (398 antibodies)
└── shehata/            (16 fragment CSVs)
```
### Target State

```
data/test/shehata/
├── README.md
├── raw/
│   ├── README.md
│   ├── shehata-mmc2.xlsx
│   ├── shehata-mmc3.xlsx
│   ├── shehata-mmc4.xlsx
│   └── shehata-mmc5.xlsx
├── processed/
│   ├── README.md
│   └── shehata.csv (398 antibodies)
├── canonical/
│   └── README.md (empty - external test set)
└── fragments/
    ├── README.md
    └── [16 fragment CSVs]
```
### File Move Commands

```bash
# Create structure
mkdir -p data/test/shehata/{raw,processed,canonical,fragments}

# Move Excel files to raw/
mv data/test/shehata-mmc2.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc3.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc4.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc5.xlsx data/test/shehata/raw/

# Move processed CSV
mv data/test/shehata.csv data/test/shehata/processed/

# Move fragments
mv data/test/shehata/*.csv data/test/shehata/fragments/
```
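The same moves can be scripted so they are safe to re-run (a sketch, not the plan's official tooling; `reorganize_shehata` is a hypothetical helper, and `git mv` would be preferable in-repo to preserve history):

```python
import shutil
from pathlib import Path

def reorganize_shehata(base: Path) -> None:
    """Move Shehata files under base/ into the 4-tier layout. Safe to re-run."""
    shehata = base / "shehata"
    for sub in ("raw", "processed", "canonical", "fragments"):
        (shehata / sub).mkdir(parents=True, exist_ok=True)
    # Excel supplements -> raw/
    for xlsx in sorted(base.glob("shehata-mmc*.xlsx")):
        shutil.move(str(xlsx), shehata / "raw" / xlsx.name)
    # Processed CSV -> processed/
    csv = base / "shehata.csv"
    if csv.exists():
        shutil.move(str(csv), shehata / "processed" / csv.name)
    # Fragment CSVs sitting directly under shehata/ -> fragments/
    for frag in sorted(shehata.glob("*.csv")):
        shutil.move(str(frag), shehata / "fragments" / frag.name)
```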
## Part 2: Python Script Updates (8 files)

### Core Scripts (3 files)

**1. `preprocessing/shehata/step1_convert_excel_to_csv.py`**

```python
# Lines 275-276:
# OLD:
excel_path = Path("data/test/mmc2.xlsx")  # ❌ File doesn't exist
output_path = Path("data/test/shehata.csv")

# NEW:
excel_path = Path("data/test/shehata/raw/shehata-mmc2.xlsx")
output_path = Path("data/test/shehata/processed/shehata.csv")
```
**2. `preprocessing/shehata/step2_extract_fragments.py`**

```python
# Lines 220-221:
# OLD:
csv_path = Path("data/test/shehata.csv")
output_dir = Path("data/test/shehata")

# NEW:
csv_path = Path("data/test/shehata/processed/shehata.csv")
output_dir = Path("data/test/shehata/fragments")
```
**3. `scripts/validation/validate_shehata_conversion.py`**

```python
# Lines 194-195, 287:
# OLD:
excel_path = Path("data/test/mmc2.xlsx")  # ❌ File doesn't exist
csv_path = Path("data/test/shehata.csv")
fragments_dir = Path("data/test/shehata")

# NEW:
excel_path = Path("data/test/shehata/raw/shehata-mmc2.xlsx")
csv_path = Path("data/test/shehata/processed/shehata.csv")
fragments_dir = Path("data/test/shehata/fragments")
```
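One caveat on these replacements: the relative paths only resolve when the scripts run from the repository root. A hedged alternative (a sketch; `find_repo_root` is a hypothetical helper, not part of the executed plan) anchors paths to the file's own location:

```python
from pathlib import Path

def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` to the first directory containing `marker`."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no ancestor of {start} contains {marker}/")

# In a script such as step1_convert_excel_to_csv.py:
# ROOT = find_repo_root(Path(__file__).resolve().parent)
# excel_path = ROOT / "data/test/shehata/raw/shehata-mmc2.xlsx"
```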
### Analysis/Testing Scripts (5 files)

**4. `scripts/analysis/analyze_threshold_optimization.py` - DELETED**

```python
# This script was deleted as experimental (Nov 2025 cleanup).
# Purpose fulfilled: PSR threshold (0.549) already discovered and implemented.
# Results documented in docs/research/assay-thresholds.md
```

**5. `scripts/testing/demo_assay_specific_thresholds.py`**

```python
# Line 96:
"data/test/shehata/VH_only_shehata.csv"
# → "data/test/shehata/fragments/VH_only_shehata.csv"
```

**6. `scripts/validate_fragments.py`**

```python
# Line 193:
("shehata", Path("data/test/shehata"), 16)
# → ("shehata", Path("data/test/shehata/fragments"), 16)
```

**7. `scripts/validation/validate_fragments.py`**

```python
# Line 193 (same as #6, duplicate file):
("shehata", Path("data/test/shehata"), 16)
# → ("shehata", Path("data/test/shehata/fragments"), 16)
```

**8. `tests/test_shehata_embedding_compatibility.py`**

```python
# Lines 25, 59, 114, 163, 200:
fragments_dir = Path("data/test/shehata")
# → fragments_dir = Path("data/test/shehata/fragments")
```
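Since this path now appears on five separate lines of the test module, a single module-level constant would keep any future move to a one-line change (a sketch; `fragment_path` is a hypothetical helper, not in the repo):

```python
from pathlib import Path

# Single source of truth for the fragment CSV location.
FRAGMENTS_DIR = Path("data/test/shehata/fragments")

def fragment_path(fragment: str) -> Path:
    """Build the path for one fragment CSV, e.g. 'VH_only' or 'Full'."""
    return FRAGMENTS_DIR / f"{fragment}_shehata.csv"
```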
## Part 3: Script Duplication Cleanup

### Duplicates to Delete (4 files)

Delete these root-level scripts:

```bash
rm scripts/convert_jain_excel_to_csv.py   # Moved to preprocessing/
rm scripts/convert_harvey_csvs.py         # Moved to preprocessing/
rm scripts/validate_jain_conversion.py    # Use scripts/validation/ version
rm scripts/validate_fragments.py          # Use scripts/validation/ version
```

Canonical versions (keep these):

- ✅ `preprocessing/jain/step1_convert_excel_to_csv.py`
- ✅ `preprocessing/harvey/step1_convert_raw_csvs.py`
- ✅ `scripts/validation/validate_jain_conversion.py`
- ✅ `scripts/validation/validate_fragments.py`
## Part 4: Documentation Updates (35+ files)

### Root README.md (2 lines)

File: `README.md`, lines 109-110:

```
# OLD:
- `data/test/shehata.csv` - Full paired VH+VL sequences (398 antibodies)
- `data/test/shehata/*.csv` - 16 fragment-specific files (...)

# NEW:
- `data/test/shehata/processed/shehata.csv` - Full paired VH+VL sequences (398 antibodies)
- `data/test/shehata/fragments/*.csv` - 16 fragment-specific files (...)
```
### Top-Level Docs (3 files)

File: `docs/COMPLETE_VALIDATION_RESULTS.md`, line 128:

```
# OLD:
**Test file**: `data/test/shehata/VH_only_shehata.csv`
# NEW:
**Test file**: `data/test/shehata/fragments/VH_only_shehata.csv`
```

File: `docs/BENCHMARK_TEST_RESULTS.md`, line 75:

```
# OLD:
**Test file:** `data/test/shehata/VH_only_shehata.csv`
# NEW:
**Test file:** `data/test/shehata/fragments/VH_only_shehata.csv`
```

File: `docs/research/assay-thresholds.md`, line 143:

```python
# OLD:
df = pd.read_csv("data/test/shehata/VH_only_shehata.csv")
# NEW:
df = pd.read_csv("data/test/shehata/fragments/VH_only_shehata.csv")
```
### Shehata-Specific Docs (6 files in docs/datasets/shehata/)

Update all references in:

1. `docs/datasets/shehata/shehata_preprocessing_implementation_plan.md`
2. `docs/datasets/shehata/shehata_data_sources.md`
3. `docs/datasets/shehata/shehata_phase2_completion_report.md`
4. `docs/datasets/shehata/archive/shehata_conversion_verification_report.md`
5. `docs/datasets/shehata/archive/shehata_blocker_analysis.md`
6. `docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md`

Pattern (apply to all 6 files):

```
# Find and replace across all docs/datasets/shehata/ files:
data/test/shehata.csv     → data/test/shehata/processed/shehata.csv
data/test/shehata/*.csv   → data/test/shehata/fragments/*.csv
data/test/shehata/VH_only → data/test/shehata/fragments/VH_only
data/test/mmc2.xlsx       → data/test/shehata/raw/shehata-mmc2.xlsx
```
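That find-and-replace pattern can be applied mechanically; the sketch below (hypothetical helper names, plain literal replacement, no regex) rewrites every `.md` file under a directory. None of the old strings appears inside any of the new ones, so re-running it is a no-op:

```python
from pathlib import Path

# Literal old -> new path replacements from the pattern above.
REPLACEMENTS = [
    ("data/test/shehata.csv", "data/test/shehata/processed/shehata.csv"),
    ("data/test/shehata/*.csv", "data/test/shehata/fragments/*.csv"),
    ("data/test/shehata/VH_only", "data/test/shehata/fragments/VH_only"),
    ("data/test/mmc2.xlsx", "data/test/shehata/raw/shehata-mmc2.xlsx"),
]

def rewrite_paths(text: str) -> str:
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

def rewrite_tree(root: Path) -> int:
    """Rewrite all .md files under root in place; return the number changed."""
    changed = 0
    for md in sorted(root.rglob("*.md")):
        original = md.read_text()
        updated = rewrite_paths(original)
        if updated != original:
            md.write_text(updated)
            changed += 1
    return changed
```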
### Harvey Docs (if Option B - script cleanup)

Update references to root-level scripts:

File: `docs/datasets/harvey/archive/harvey_data_cleaning_log.md` (7 references):

```
scripts/convert_harvey_csvs.py → preprocessing/harvey/step1_convert_raw_csvs.py
scripts/validate_fragments.py  → scripts/validation/validate_fragments.py
```

File: `docs/datasets/harvey/harvey_data_sources.md` (3 references)
File: `docs/datasets/harvey/harvey_script_status.md` (3 references)
## Part 5: Create New READMEs (5 files)

### 1. `data/test/shehata/README.md` (Master)

Content:
- Citation (Shehata 2019 + Sakhnini 2025)
- Quick start guide
- Data flow diagram
- Links to subdirectory READMEs
- Verification commands

### 2. `data/test/shehata/raw/README.md`

Content:
- Original Excel files description
- Citation details
- DO NOT MODIFY warning
- Note: mmc3/4/5 unused but archived for provenance
- Conversion instructions

### 3. `data/test/shehata/processed/README.md`

Content:
- shehata.csv description (398 antibodies)
- PSR score thresholding (98.24th percentile)
- Label distribution (391 specific, 7 non-specific)
- Regeneration instructions

### 4. `data/test/shehata/canonical/README.md`

Content:
- Explanation: Shehata is an external test set
- No subsampling needed (unlike Jain)
- Full 398-antibody dataset in processed/ is canonical
- canonical/ kept empty for consistency with the Jain structure

### 5. `data/test/shehata/fragments/README.md`

Content:
- 16 fragment types explained
- ANARCI/IMGT numbering methodology
- Usage examples for each fragment
- Fragment type use cases
- CRITICAL: Note about P0 blocker (gap characters) and fix
## Verification Plan (Complete)

### 1. File Move Verification

```bash
echo "Raw files (4):"       && ls -1 data/test/shehata/raw/*.xlsx | wc -l
echo "Processed files (1):" && ls -1 data/test/shehata/processed/*.csv | wc -l
echo "Fragment files (16):" && ls -1 data/test/shehata/fragments/*.csv | wc -l
echo "Total CSVs (17):"     && find data/test/shehata -name "*.csv" | wc -l
```
### 2. P0 Blocker Regression Check (CRITICAL)

```bash
grep -c '\-' data/test/shehata/fragments/*.csv | grep -v ':0$'
# Should return NOTHING (all files have 0 gaps)
```
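Note that this grep counts every line containing a hyphen anywhere, so an antibody ID like `mAb-42` would false-positive; a stricter sketch (assuming, hypothetically, that sequence data lives in columns whose header contains `sequence` — adjust to the real column names) checks only those columns:

```python
import csv
from pathlib import Path

def files_with_gaps(fragments_dir: Path) -> list[Path]:
    """Return fragment CSVs whose sequence columns contain '-' gap characters."""
    bad = []
    for path in sorted(fragments_dir.glob("*.csv")):
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh)
            # Hypothetical column-naming assumption: headers containing 'sequence'.
            seq_cols = [c for c in (reader.fieldnames or []) if "sequence" in c.lower()]
            if any("-" in (row[c] or "") for row in reader for c in seq_cols):
                bad.append(path)
    return bad
```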
### 3. Row Count Validation

```bash
# 398 antibodies + 1 header line = 399
for f in data/test/shehata/fragments/*.csv; do
  count=$(wc -l < "$f")
  if [ "$count" -ne 399 ]; then
    echo "ERROR: $f has $count lines (expected 399)"
  fi
done
```
### 4. Label Distribution Check

```bash
python3 -c "
import pandas as pd
files = ['processed/shehata.csv', 'fragments/VH_only_shehata.csv', 'fragments/Full_shehata.csv']
for f in files:
    path = f'data/test/shehata/{f}'
    df = pd.read_csv(path)
    dist = df['label'].value_counts().sort_index().to_dict()
    expected = {0: 391, 1: 7}
    status = '✅' if dist == expected else '❌'
    print(f'{status} {f}: {dist}')
"
```
### 5. Script Regeneration Test

```bash
python3 preprocessing/shehata/step1_convert_excel_to_csv.py  # Should work
python3 preprocessing/shehata/step2_extract_fragments.py     # Should work
```

### 6. Comprehensive Validation

```bash
python3 scripts/validation/validate_shehata_conversion.py  # Should pass all checks
python3 tests/test_shehata_embedding_compatibility.py      # Should pass all tests
```

### 7. Model Test

```bash
python3 test.py --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
    --data data/test/shehata/fragments/VH_only_shehata.csv
# Should load and run successfully
```
### 8. Documentation Validation

```bash
# Check no references to old paths remain
grep -rn "data/test/shehata\.csv" docs/ README.md --include="*.md" | grep -v "processed/"
# Should return NOTHING

grep -rn "data/test/mmc2\.xlsx" docs/ --include="*.md" | grep -v "raw/"
# Should return NOTHING

# Check no references to deleted root scripts
grep -rn "scripts/convert_jain_excel" docs/ --include="*.md" | grep -v "conversion/"
grep -rn "scripts/convert_harvey" docs/ --include="*.md" | grep -v "conversion/"
grep -rn "scripts/validate_jain" docs/ --include="*.md" | grep -v "validation/"
grep -rn "scripts/validate_fragments" docs/ --include="*.md" | grep -v "validation/"
# All should return NOTHING
```
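For CI, the same stale-reference checks can run from one helper (a sketch; the pattern list is illustrative, not exhaustive, and the exemptions mirror the `grep -v` filters above):

```python
import re
from pathlib import Path

# (stale pattern, exempt substring): lines containing the exempt string are allowed,
# mirroring the `grep -v` filters. Illustrative subset only.
STALE_PATTERNS = [
    (re.compile(r"data/test/shehata\.csv"), "processed/"),
    (re.compile(r"data/test/mmc2\.xlsx"), "raw/"),
    (re.compile(r"scripts/validate_fragments"), "validation/"),
]

def stale_references(root: Path) -> list[str]:
    """Return 'file:line: text' entries for stale path references in .md files."""
    hits = []
    for md in sorted(root.rglob("*.md")):
        for n, line in enumerate(md.read_text().splitlines(), 1):
            for pattern, exempt in STALE_PATTERNS:
                if pattern.search(line) and exempt not in line:
                    hits.append(f"{md}:{n}: {line.strip()}")
    return hits
```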
## Execution Order (Critical)

Execute in this exact order:

### Phase 1: Prepare
- ✅ Create branch `leroy-jenkins/shehata-cleanup`
- Create directory structure
- Create all 5 READMEs

### Phase 2: Move Files
- Move 4 Excel files → raw/
- Move shehata.csv → processed/
- Move 16 fragments → fragments/

### Phase 3: Update Python Scripts
- Update 3 core scripts (conversion, processing, validation)
- Update 5 analysis/testing scripts

### Phase 4: Clean Duplicate Scripts
- Delete 4 root-level duplicate scripts
- Verify no broken imports

### Phase 5: Update Documentation
- Update README.md (2 lines)
- Update 3 top-level docs
- Update 6 shehata docs
- Update 3 harvey docs (script references)

### Phase 6: Verify
- Run all 8 verification checks
- Confirm all pass

### Phase 7: Commit
- Commit with detailed message
- Push to branch
## Complete File Checklist

### Python Files to Modify (8)

- `preprocessing/shehata/step1_convert_excel_to_csv.py`
- `preprocessing/shehata/step2_extract_fragments.py`
- `scripts/validation/validate_shehata_conversion.py`
- `scripts/analysis/analyze_threshold_optimization.py` (DELETED - experimental script)
- `scripts/testing/demo_assay_specific_thresholds.py`
- `scripts/validate_fragments.py`
- `scripts/validation/validate_fragments.py`
- `tests/test_shehata_embedding_compatibility.py`

### Python Files to Delete (4)

- `scripts/convert_jain_excel_to_csv.py`
- `scripts/convert_harvey_csvs.py`
- `scripts/validate_jain_conversion.py`
- `scripts/validate_fragments.py`

### READMEs to Create (5)

- `data/test/shehata/README.md`
- `data/test/shehata/raw/README.md`
- `data/test/shehata/processed/README.md`
- `data/test/shehata/canonical/README.md`
- `data/test/shehata/fragments/README.md`

### Documentation Files to Update (13)

- `README.md` (2 lines)
- `docs/COMPLETE_VALIDATION_RESULTS.md` (1 line)
- `docs/BENCHMARK_TEST_RESULTS.md` (1 line)
- `docs/research/assay-thresholds.md` (1 line)
- `docs/datasets/shehata/shehata_preprocessing_implementation_plan.md`
- `docs/datasets/shehata/shehata_data_sources.md`
- `docs/datasets/shehata/shehata_phase2_completion_report.md`
- `docs/datasets/shehata/archive/shehata_conversion_verification_report.md`
- `docs/datasets/shehata/archive/shehata_blocker_analysis.md`
- `docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md`
- `docs/datasets/harvey/archive/harvey_data_cleaning_log.md` (7 refs)
- `docs/datasets/harvey/harvey_data_sources.md` (3 refs)
- `docs/datasets/harvey/harvey_script_status.md` (3 refs)
## Time Estimate

Total: 60-75 minutes

- Phase 1 (Prepare): 5 min
- Phase 2 (Move): 5 min
- Phase 3 (Scripts): 15 min
- Phase 4 (Duplicates): 5 min
- Phase 5 (Docs): 20 min
- Phase 6 (Verify): 10 min
- Phase 7 (Commit): 5 min
## Citation (Correct)

**Dataset Source:** Shehata, L. et al. (2019). "Affinity maturation enhances antibody specificity but compromises conformational stability." Cell Reports 28(13):3300-3308.e4. DOI: 10.1016/j.celrep.2019.08.056

**Methodology Source:** Sakhnini, L.I., et al. (2025). "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: 10.1101/2025.04.28.650927

---

**Status: ✅ COMPLETE PLAN - READY TO EXECUTE**

All feedback validated from first principles. Plan now includes:
- ✅ All 8 Python script updates
- ✅ All 4 duplicate script deletions
- ✅ All 35+ documentation updates
- ✅ Complete verification checklist
- ✅ Clear execution order

Awaiting your go-ahead to execute, boss.