⚠️ **HISTORICAL DOCUMENT - November 2025 Cleanup**

This document describes the cleanup plan from 2025-11-05 that has since been executed.
For current pipeline documentation, see `data/test/shehata/README.md`. The cleanup described below is complete and all changes have been applied.

Note: This document references `leroy-jenkins/full-send`, which was renamed to `main` on 2025-11-28.
# Shehata Dataset Cleanup - Complete Plan (HISTORICAL)

**Date:** 2025-11-05
**Branch:** `leroy-jenkins/shehata-cleanup`
**Status:** ✅ EXECUTED (archived for provenance)
## Executive Summary

Complete cleanup of the Shehata dataset AND script organization, following Robert C. Martin discipline.

Scope:
1. Reorganize Shehata into a 4-tier structure (`raw/` → `processed/` → `canonical/` → `fragments/`)
2. Update 8 Python scripts with new paths
3. Update 35+ documentation references
4. Clean up duplicate scripts (`scripts/` root vs `scripts/*/` subdirectories)

Validation: All claims verified from first principles.
## Decisions Made (Final)

- **Script duplication:** Option B - delete root duplicates, use subdirectory versions
- **`canonical/` directory:** empty with README (Shehata is an external test set)
- **mmc3/4/5 files:** move to `raw/` (complete provenance)
## Part 1: File Reorganization

### Current State

```
data/test/
├── shehata-mmc2.xlsx   (Main data - 402 rows)
├── shehata-mmc3.xlsx   (Unused)
├── shehata-mmc4.xlsx   (Unused)
├── shehata-mmc5.xlsx   (Unused)
├── shehata.csv         (398 antibodies)
└── shehata/            (16 fragment CSVs)
```
### Target State

```
data/test/shehata/
├── README.md
├── raw/
│   ├── README.md
│   ├── shehata-mmc2.xlsx
│   ├── shehata-mmc3.xlsx
│   ├── shehata-mmc4.xlsx
│   └── shehata-mmc5.xlsx
├── processed/
│   ├── README.md
│   └── shehata.csv (398 antibodies)
├── canonical/
│   └── README.md (empty - external test set)
└── fragments/
    ├── README.md
    └── [16 fragment CSVs]
```
### File Move Commands

```bash
# Create structure
mkdir -p data/test/shehata/{raw,processed,canonical,fragments}

# Move Excel files to raw/
mv data/test/shehata-mmc2.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc3.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc4.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc5.xlsx data/test/shehata/raw/

# Move processed CSV
mv data/test/shehata.csv data/test/shehata/processed/

# Move fragments
mv data/test/shehata/*.csv data/test/shehata/fragments/
```
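The same moves can be scripted so they are safe to re-run (a sketch, not the plan's official tooling; `reorganize_shehata` is a hypothetical helper, and `git mv` would be preferable in-repo to preserve history):

```python
import shutil
from pathlib import Path

def reorganize_shehata(base: Path) -> None:
    """Move Shehata files under base/ into the 4-tier layout. Safe to re-run."""
    shehata = base / "shehata"
    for sub in ("raw", "processed", "canonical", "fragments"):
        (shehata / sub).mkdir(parents=True, exist_ok=True)
    # Excel supplements -> raw/
    for xlsx in sorted(base.glob("shehata-mmc*.xlsx")):
        shutil.move(str(xlsx), shehata / "raw" / xlsx.name)
    # Processed CSV -> processed/
    csv = base / "shehata.csv"
    if csv.exists():
        shutil.move(str(csv), shehata / "processed" / csv.name)
    # Fragment CSVs sitting directly under shehata/ -> fragments/
    for frag in sorted(shehata.glob("*.csv")):
        shutil.move(str(frag), shehata / "fragments" / frag.name)
```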
## Part 2: Python Script Updates (8 files)

### Core Scripts (3 files)

**1. `preprocessing/shehata/step1_convert_excel_to_csv.py`**

```python
# Lines 275-276:
# OLD:
excel_path = Path("data/test/mmc2.xlsx")  # ❌ File doesn't exist
output_path = Path("data/test/shehata.csv")

# NEW:
excel_path = Path("data/test/shehata/raw/shehata-mmc2.xlsx")
output_path = Path("data/test/shehata/processed/shehata.csv")
```
**2. `preprocessing/shehata/step2_extract_fragments.py`**

```python
# Lines 220-221:
# OLD:
csv_path = Path("data/test/shehata.csv")
output_dir = Path("data/test/shehata")

# NEW:
csv_path = Path("data/test/shehata/processed/shehata.csv")
output_dir = Path("data/test/shehata/fragments")
```
**3. `scripts/validation/validate_shehata_conversion.py`**

```python
# Lines 194-195, 287:
# OLD:
excel_path = Path("data/test/mmc2.xlsx")  # ❌ File doesn't exist
csv_path = Path("data/test/shehata.csv")
fragments_dir = Path("data/test/shehata")

# NEW:
excel_path = Path("data/test/shehata/raw/shehata-mmc2.xlsx")
csv_path = Path("data/test/shehata/processed/shehata.csv")
fragments_dir = Path("data/test/shehata/fragments")
```
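One caveat on these replacements: the relative paths only resolve when the scripts run from the repository root. A hedged alternative (a sketch; `find_repo_root` is a hypothetical helper, not part of the executed plan) anchors paths to the file's own location:

```python
from pathlib import Path

def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` to the first directory containing `marker`."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no ancestor of {start} contains {marker}/")

# In a script such as step1_convert_excel_to_csv.py:
# ROOT = find_repo_root(Path(__file__).resolve().parent)
# excel_path = ROOT / "data/test/shehata/raw/shehata-mmc2.xlsx"
```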
### Analysis/Testing Scripts (5 files)

**4. `scripts/analysis/analyze_threshold_optimization.py` - DELETED**

```python
# This script was deleted as experimental (Nov 2025 cleanup).
# Purpose fulfilled: PSR threshold (0.549) already discovered and implemented.
# Results documented in docs/research/assay-thresholds.md
```

**5. `scripts/testing/demo_assay_specific_thresholds.py`**

```python
# Line 96:
"data/test/shehata/VH_only_shehata.csv"
# → "data/test/shehata/fragments/VH_only_shehata.csv"
```

**6. `scripts/validate_fragments.py`**

```python
# Line 193:
("shehata", Path("data/test/shehata"), 16)
# → ("shehata", Path("data/test/shehata/fragments"), 16)
```

**7. `scripts/validation/validate_fragments.py`**

```python
# Line 193 (same as #6, duplicate file):
("shehata", Path("data/test/shehata"), 16)
# → ("shehata", Path("data/test/shehata/fragments"), 16)
```

**8. `tests/test_shehata_embedding_compatibility.py`**

```python
# Lines 25, 59, 114, 163, 200:
fragments_dir = Path("data/test/shehata")
# → fragments_dir = Path("data/test/shehata/fragments")
```
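Since this path now appears on five separate lines of the test module, a single module-level constant would keep any future move to a one-line change (a sketch; `fragment_path` is a hypothetical helper, not in the repo):

```python
from pathlib import Path

# Single source of truth for the fragment CSV location.
FRAGMENTS_DIR = Path("data/test/shehata/fragments")

def fragment_path(fragment: str) -> Path:
    """Build the path for one fragment CSV, e.g. 'VH_only' or 'Full'."""
    return FRAGMENTS_DIR / f"{fragment}_shehata.csv"
```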
## Part 3: Script Duplication Cleanup

### Duplicates to Delete (4 files)

Delete these root-level scripts:

```bash
rm scripts/convert_jain_excel_to_csv.py   # Moved to preprocessing/
rm scripts/convert_harvey_csvs.py         # Moved to preprocessing/
rm scripts/validate_jain_conversion.py    # Use scripts/validation/ version
rm scripts/validate_fragments.py          # Use scripts/validation/ version
```

Canonical versions (keep these):

- ✅ `preprocessing/jain/step1_convert_excel_to_csv.py`
- ✅ `preprocessing/harvey/step1_convert_raw_csvs.py`
- ✅ `scripts/validation/validate_jain_conversion.py`
- ✅ `scripts/validation/validate_fragments.py`
## Part 4: Documentation Updates (35+ files)

### Root README.md (2 lines)

File: `README.md`, lines 109-110:

```
# OLD:
- `data/test/shehata.csv` - Full paired VH+VL sequences (398 antibodies)
- `data/test/shehata/*.csv` - 16 fragment-specific files (...)

# NEW:
- `data/test/shehata/processed/shehata.csv` - Full paired VH+VL sequences (398 antibodies)
- `data/test/shehata/fragments/*.csv` - 16 fragment-specific files (...)
```
### Top-Level Docs (3 files)

File: `docs/COMPLETE_VALIDATION_RESULTS.md`, line 128:

```
# OLD:
**Test file**: `data/test/shehata/VH_only_shehata.csv`
# NEW:
**Test file**: `data/test/shehata/fragments/VH_only_shehata.csv`
```

File: `docs/BENCHMARK_TEST_RESULTS.md`, line 75:

```
# OLD:
**Test file:** `data/test/shehata/VH_only_shehata.csv`
# NEW:
**Test file:** `data/test/shehata/fragments/VH_only_shehata.csv`
```

File: `docs/research/assay-thresholds.md`, line 143:

```python
# OLD:
df = pd.read_csv("data/test/shehata/VH_only_shehata.csv")
# NEW:
df = pd.read_csv("data/test/shehata/fragments/VH_only_shehata.csv")
```
### Shehata-Specific Docs (6 files in docs/datasets/shehata/)

Update all references in:

1. `docs/datasets/shehata/shehata_preprocessing_implementation_plan.md`
2. `docs/datasets/shehata/shehata_data_sources.md`
3. `docs/datasets/shehata/shehata_phase2_completion_report.md`
4. `docs/datasets/shehata/archive/shehata_conversion_verification_report.md`
5. `docs/datasets/shehata/archive/shehata_blocker_analysis.md`
6. `docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md`

Pattern (apply to all 6 files):

```
# Find and replace across all docs/datasets/shehata/ files:
data/test/shehata.csv     → data/test/shehata/processed/shehata.csv
data/test/shehata/*.csv   → data/test/shehata/fragments/*.csv
data/test/shehata/VH_only → data/test/shehata/fragments/VH_only
data/test/mmc2.xlsx       → data/test/shehata/raw/shehata-mmc2.xlsx
```
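That find-and-replace pattern can be applied mechanically; the sketch below (hypothetical helper names, plain literal replacement, no regex) rewrites every `.md` file under a directory. None of the old strings appears inside any of the new ones, so re-running it is a no-op:

```python
from pathlib import Path

# Literal old -> new path replacements from the pattern above.
REPLACEMENTS = [
    ("data/test/shehata.csv", "data/test/shehata/processed/shehata.csv"),
    ("data/test/shehata/*.csv", "data/test/shehata/fragments/*.csv"),
    ("data/test/shehata/VH_only", "data/test/shehata/fragments/VH_only"),
    ("data/test/mmc2.xlsx", "data/test/shehata/raw/shehata-mmc2.xlsx"),
]

def rewrite_paths(text: str) -> str:
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

def rewrite_tree(root: Path) -> int:
    """Rewrite all .md files under root in place; return the number changed."""
    changed = 0
    for md in sorted(root.rglob("*.md")):
        original = md.read_text()
        updated = rewrite_paths(original)
        if updated != original:
            md.write_text(updated)
            changed += 1
    return changed
```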
### Harvey Docs (if Option B - script cleanup)

Update references to root-level scripts:

File: `docs/datasets/harvey/archive/harvey_data_cleaning_log.md` (7 references):

```
scripts/convert_harvey_csvs.py → preprocessing/harvey/step1_convert_raw_csvs.py
scripts/validate_fragments.py  → scripts/validation/validate_fragments.py
```

File: `docs/datasets/harvey/harvey_data_sources.md` (3 references)
File: `docs/datasets/harvey/harvey_script_status.md` (3 references)
## Part 5: Create New READMEs (5 files)

### 1. `data/test/shehata/README.md` (Master)

Content:
- Citation (Shehata 2019 + Sakhnini 2025)
- Quick start guide
- Data flow diagram
- Links to subdirectory READMEs
- Verification commands

### 2. `data/test/shehata/raw/README.md`

Content:
- Original Excel files description
- Citation details
- DO NOT MODIFY warning
- Note: mmc3/4/5 unused but archived for provenance
- Conversion instructions

### 3. `data/test/shehata/processed/README.md`

Content:
- shehata.csv description (398 antibodies)
- PSR score thresholding (98.24th percentile)
- Label distribution (391 specific, 7 non-specific)
- Regeneration instructions

### 4. `data/test/shehata/canonical/README.md`

Content:
- Explanation: Shehata is an external test set
- No subsampling needed (unlike Jain)
- Full 398-antibody dataset in processed/ is canonical
- canonical/ kept empty for consistency with the Jain structure

### 5. `data/test/shehata/fragments/README.md`

Content:
- 16 fragment types explained
- ANARCI/IMGT numbering methodology
- Usage examples for each fragment
- Fragment type use cases
- CRITICAL: Note about P0 blocker (gap characters) and fix
## Verification Plan (Complete)

### 1. File Move Verification

```bash
echo "Raw files (4):"       && ls -1 data/test/shehata/raw/*.xlsx | wc -l
echo "Processed files (1):" && ls -1 data/test/shehata/processed/*.csv | wc -l
echo "Fragment files (16):" && ls -1 data/test/shehata/fragments/*.csv | wc -l
echo "Total CSVs (17):"     && find data/test/shehata -name "*.csv" | wc -l
```
### 2. P0 Blocker Regression Check (CRITICAL)

```bash
grep -c '\-' data/test/shehata/fragments/*.csv | grep -v ':0$'
# Should return NOTHING (all files have 0 gaps)
```
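Note that this grep counts every line containing a hyphen anywhere, so an antibody ID like `mAb-42` would false-positive; a stricter sketch (assuming, hypothetically, that sequence data lives in columns whose header contains `sequence` — adjust to the real column names) checks only those columns:

```python
import csv
from pathlib import Path

def files_with_gaps(fragments_dir: Path) -> list[Path]:
    """Return fragment CSVs whose sequence columns contain '-' gap characters."""
    bad = []
    for path in sorted(fragments_dir.glob("*.csv")):
        with path.open(newline="") as fh:
            reader = csv.DictReader(fh)
            # Hypothetical column-naming assumption: headers containing 'sequence'.
            seq_cols = [c for c in (reader.fieldnames or []) if "sequence" in c.lower()]
            if any("-" in (row[c] or "") for row in reader for c in seq_cols):
                bad.append(path)
    return bad
```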
### 3. Row Count Validation

```bash
# 398 antibodies + 1 header line = 399
for f in data/test/shehata/fragments/*.csv; do
  count=$(wc -l < "$f")
  if [ "$count" -ne 399 ]; then
    echo "ERROR: $f has $count lines (expected 399)"
  fi
done
```
### 4. Label Distribution Check

```bash
python3 -c "
import pandas as pd
files = ['processed/shehata.csv', 'fragments/VH_only_shehata.csv', 'fragments/Full_shehata.csv']
for f in files:
    path = f'data/test/shehata/{f}'
    df = pd.read_csv(path)
    dist = df['label'].value_counts().sort_index().to_dict()
    expected = {0: 391, 1: 7}
    status = '✅' if dist == expected else '❌'
    print(f'{status} {f}: {dist}')
"
```
### 5. Script Regeneration Test

```bash
python3 preprocessing/shehata/step1_convert_excel_to_csv.py  # Should work
python3 preprocessing/shehata/step2_extract_fragments.py     # Should work
```

### 6. Comprehensive Validation

```bash
python3 scripts/validation/validate_shehata_conversion.py  # Should pass all checks
python3 tests/test_shehata_embedding_compatibility.py      # Should pass all tests
```

### 7. Model Test

```bash
python3 test.py --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
    --data data/test/shehata/fragments/VH_only_shehata.csv
# Should load and run successfully
```
### 8. Documentation Validation

```bash
# Check no references to old paths remain
grep -rn "data/test/shehata\.csv" docs/ README.md --include="*.md" | grep -v "processed/"
# Should return NOTHING

grep -rn "data/test/mmc2\.xlsx" docs/ --include="*.md" | grep -v "raw/"
# Should return NOTHING

# Check no references to deleted root scripts
grep -rn "scripts/convert_jain_excel" docs/ --include="*.md" | grep -v "conversion/"
grep -rn "scripts/convert_harvey" docs/ --include="*.md" | grep -v "conversion/"
grep -rn "scripts/validate_jain" docs/ --include="*.md" | grep -v "validation/"
grep -rn "scripts/validate_fragments" docs/ --include="*.md" | grep -v "validation/"
# All should return NOTHING
```
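For CI, the same stale-reference checks can run from one helper (a sketch; the pattern list is illustrative, not exhaustive, and the exemptions mirror the `grep -v` filters above):

```python
import re
from pathlib import Path

# (stale pattern, exempt substring): lines containing the exempt string are allowed,
# mirroring the `grep -v` filters. Illustrative subset only.
STALE_PATTERNS = [
    (re.compile(r"data/test/shehata\.csv"), "processed/"),
    (re.compile(r"data/test/mmc2\.xlsx"), "raw/"),
    (re.compile(r"scripts/validate_fragments"), "validation/"),
]

def stale_references(root: Path) -> list[str]:
    """Return 'file:line: text' entries for stale path references in .md files."""
    hits = []
    for md in sorted(root.rglob("*.md")):
        for n, line in enumerate(md.read_text().splitlines(), 1):
            for pattern, exempt in STALE_PATTERNS:
                if pattern.search(line) and exempt not in line:
                    hits.append(f"{md}:{n}: {line.strip()}")
    return hits
```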
## Execution Order (Critical)

Execute in this exact order:

### Phase 1: Prepare
- ✅ Create branch `leroy-jenkins/shehata-cleanup`
- Create directory structure
- Create all 5 READMEs

### Phase 2: Move Files
- Move 4 Excel files → raw/
- Move shehata.csv → processed/
- Move 16 fragments → fragments/

### Phase 3: Update Python Scripts
- Update 3 core scripts (conversion, processing, validation)
- Update 5 analysis/testing scripts

### Phase 4: Clean Duplicate Scripts
- Delete 4 root-level duplicate scripts
- Verify no broken imports

### Phase 5: Update Documentation
- Update README.md (2 lines)
- Update 3 top-level docs
- Update 6 shehata docs
- Update 3 harvey docs (script references)

### Phase 6: Verify
- Run all 8 verification checks
- Confirm all pass

### Phase 7: Commit
- Commit with detailed message
- Push to branch
## Complete File Checklist

### Python Files to Modify (8)

- `preprocessing/shehata/step1_convert_excel_to_csv.py`
- `preprocessing/shehata/step2_extract_fragments.py`
- `scripts/validation/validate_shehata_conversion.py`
- `scripts/analysis/analyze_threshold_optimization.py` (DELETED - experimental script)
- `scripts/testing/demo_assay_specific_thresholds.py`
- `scripts/validate_fragments.py`
- `scripts/validation/validate_fragments.py`
- `tests/test_shehata_embedding_compatibility.py`

### Python Files to Delete (4)

- `scripts/convert_jain_excel_to_csv.py`
- `scripts/convert_harvey_csvs.py`
- `scripts/validate_jain_conversion.py`
- `scripts/validate_fragments.py`

### READMEs to Create (5)

- `data/test/shehata/README.md`
- `data/test/shehata/raw/README.md`
- `data/test/shehata/processed/README.md`
- `data/test/shehata/canonical/README.md`
- `data/test/shehata/fragments/README.md`

### Documentation Files to Update (13)

- `README.md` (2 lines)
- `docs/COMPLETE_VALIDATION_RESULTS.md` (1 line)
- `docs/BENCHMARK_TEST_RESULTS.md` (1 line)
- `docs/research/assay-thresholds.md` (1 line)
- `docs/datasets/shehata/shehata_preprocessing_implementation_plan.md`
- `docs/datasets/shehata/shehata_data_sources.md`
- `docs/datasets/shehata/shehata_phase2_completion_report.md`
- `docs/datasets/shehata/archive/shehata_conversion_verification_report.md`
- `docs/datasets/shehata/archive/shehata_blocker_analysis.md`
- `docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md`
- `docs/datasets/harvey/archive/harvey_data_cleaning_log.md` (7 refs)
- `docs/datasets/harvey/harvey_data_sources.md` (3 refs)
- `docs/datasets/harvey/harvey_script_status.md` (3 refs)
## Time Estimate

Total: 60-75 minutes

- Phase 1 (Prepare): 5 min
- Phase 2 (Move): 5 min
- Phase 3 (Scripts): 15 min
- Phase 4 (Duplicates): 5 min
- Phase 5 (Docs): 20 min
- Phase 6 (Verify): 10 min
- Phase 7 (Commit): 5 min
## Citation (Correct)

**Dataset Source:** Shehata, L. et al. (2019). "Affinity maturation enhances antibody specificity but compromises conformational stability." Cell Reports 28(13):3300-3308.e4. DOI: 10.1016/j.celrep.2019.08.056

**Methodology Source:** Sakhnini, L.I., et al. (2025). "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: 10.1101/2025.04.28.650927

---

**Status: ✅ COMPLETE PLAN - READY TO EXECUTE**

All feedback validated from first principles. Plan now includes:
- ✅ All 8 Python script updates
- ✅ All 4 duplicate script deletions
- ✅ All 35+ documentation updates
- ✅ Complete verification checklist
- ✅ Clear execution order

Awaiting your go-ahead to execute, boss.