
⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup

This document describes the cleanup plan from 2025-11-05 that has since been executed.

For current pipeline documentation, see: data/test/shehata/README.md

The cleanup described below is complete and all changes have been applied.

Note: This document references leroy-jenkins/full-send which was renamed to main on 2025-11-28.


Shehata Dataset Cleanup - Complete Plan (HISTORICAL)

Date: 2025-11-05 Branch: leroy-jenkins/shehata-cleanup Status: EXECUTED (archived for provenance)


Executive Summary

Complete cleanup of Shehata dataset AND script organization, following Robert C. Martin discipline.

Scope:

  1. Reorganize Shehata to a 4-tier structure (raw/ → processed/ → canonical/ → fragments/)
  2. Update 8 Python scripts with new paths
  3. Update 35+ documentation references
  4. Clean up duplicate scripts (scripts/ root vs scripts/*/ subdirectories)

Validation: All claims verified from first principles.


Decisions Made (Final)

  1. Script duplication: Option B - Delete root duplicates, use subdirectory versions
  2. canonical/ directory: Empty with README (Shehata is external test set)
  3. mmc3/4/5 files: Move to raw/ (complete provenance)

Part 1: File Reorganization

Current State

data/test/
├── shehata-mmc2.xlsx          (Main data - 402 rows)
├── shehata-mmc3.xlsx          (Unused)
├── shehata-mmc4.xlsx          (Unused)
├── shehata-mmc5.xlsx          (Unused)
├── shehata.csv                (398 antibodies)
└── shehata/                   (16 fragment CSVs)

Target State

data/test/shehata/
├── README.md
├── raw/
│   ├── README.md
│   ├── shehata-mmc2.xlsx
│   ├── shehata-mmc3.xlsx
│   ├── shehata-mmc4.xlsx
│   └── shehata-mmc5.xlsx
├── processed/
│   ├── README.md
│   └── shehata.csv (398 antibodies)
├── canonical/
│   └── README.md (empty - external test set)
└── fragments/
    ├── README.md
    └── [16 fragment CSVs]

File Move Commands

# Create structure
mkdir -p data/test/shehata/{raw,processed,canonical,fragments}

# Move Excel files to raw/
mv data/test/shehata-mmc2.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc3.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc4.xlsx data/test/shehata/raw/
mv data/test/shehata-mmc5.xlsx data/test/shehata/raw/

# Move processed CSV
mv data/test/shehata.csv data/test/shehata/processed/

# Move fragments
mv data/test/shehata/*.csv data/test/shehata/fragments/
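The moves above assume a clean starting state; a hedged, re-runnable variant (each step guarded so a partially completed earlier run does not abort it) might look like the sketch below, demonstrated here against a throwaway copy of the layout in a temp directory rather than the real tree:

```shell
# Re-runnable version of the moves: each mv is guarded by an existence check.
# The scratch setup below only recreates the current layout for the demo.
TMP=$(mktemp -d)
cd "$TMP"

# Recreate the current state with empty placeholder files
mkdir -p data/test/shehata
touch data/test/shehata-mmc2.xlsx data/test/shehata-mmc3.xlsx \
      data/test/shehata-mmc4.xlsx data/test/shehata-mmc5.xlsx \
      data/test/shehata.csv
for i in $(seq 1 16); do touch "data/test/shehata/frag_$i.csv"; done

# Target structure
mkdir -p data/test/shehata/{raw,processed,canonical,fragments}

# Guarded moves
for f in data/test/shehata-mmc*.xlsx; do
  if [ -e "$f" ]; then mv "$f" data/test/shehata/raw/; fi
done
if [ -e data/test/shehata.csv ]; then
  mv data/test/shehata.csv data/test/shehata/processed/
fi
# Only the loose CSVs directly under shehata/, not files already in fragments/
find data/test/shehata -maxdepth 1 -name '*.csv' \
  -exec mv {} data/test/shehata/fragments/ \;

ls data/test/shehata/raw | wc -l        # expect 4
ls data/test/shehata/fragments | wc -l  # expect 16
```

The `find -maxdepth 1` form avoids the pitfall in the plain glob: once `fragments/` exists under `shehata/`, a second run of `mv data/test/shehata/*.csv …` would find nothing and error out.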

Part 2: Python Script Updates (8 files)

Core Scripts (3 files)

1. preprocessing/shehata/step1_convert_excel_to_csv.py

# Line 275-276:
# OLD:
excel_path = Path("data/test/mmc2.xlsx")  # ❌ File doesn't exist
output_path = Path("data/test/shehata.csv")

# NEW:
excel_path = Path("data/test/shehata/raw/shehata-mmc2.xlsx")
output_path = Path("data/test/shehata/processed/shehata.csv")

2. preprocessing/shehata/step2_extract_fragments.py

# Line 220-221:
# OLD:
csv_path = Path("data/test/shehata.csv")
output_dir = Path("data/test/shehata")

# NEW:
csv_path = Path("data/test/shehata/processed/shehata.csv")
output_dir = Path("data/test/shehata/fragments")

3. scripts/validation/validate_shehata_conversion.py

# Line 194-195, 287:
# OLD:
excel_path = Path("data/test/mmc2.xlsx")  # ❌ File doesn't exist
csv_path = Path("data/test/shehata.csv")
fragments_dir = Path("data/test/shehata")

# NEW:
excel_path = Path("data/test/shehata/raw/shehata-mmc2.xlsx")
csv_path = Path("data/test/shehata/processed/shehata.csv")
fragments_dir = Path("data/test/shehata/fragments")

Analysis/Testing Scripts (5 files)

4. scripts/analysis/analyze_threshold_optimization.py DELETED

# This script was deleted as experimental (Nov 2025 cleanup)
# Purpose fulfilled: PSR threshold (0.549) already discovered and implemented
# Results documented in docs/research/assay-thresholds.md

5. scripts/testing/demo_assay_specific_thresholds.py

# Line 96:
"data/test/shehata/VH_only_shehata.csv"
 "data/test/shehata/fragments/VH_only_shehata.csv"

6. scripts/validate_fragments.py

# Line 193:
("shehata", Path("data/test/shehata"), 16)
 ("shehata", Path("data/test/shehata/fragments"), 16)

7. scripts/validation/validate_fragments.py

# Line 193 (same as #6, duplicate file):
("shehata", Path("data/test/shehata"), 16)
 ("shehata", Path("data/test/shehata/fragments"), 16)

8. tests/test_shehata_embedding_compatibility.py

# Lines 25, 59, 114, 163, 200:
# OLD:
fragments_dir = Path("data/test/shehata")
# NEW:
fragments_dir = Path("data/test/shehata/fragments")
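Each of the eight edits above hard-codes the same path literals in a different file. A hedged alternative for future-proofing (not part of the executed plan) is a small shared paths module that scripts import instead; the module name and helper below are hypothetical:

```python
# Hypothetical shehata_paths.py: one place for the 4-tier layout, so the
# next reorganization touches a single file instead of eight.
from pathlib import Path

SHEHATA_ROOT = Path("data/test/shehata")
RAW_DIR = SHEHATA_ROOT / "raw"
PROCESSED_DIR = SHEHATA_ROOT / "processed"
FRAGMENTS_DIR = SHEHATA_ROOT / "fragments"

RAW_EXCEL = RAW_DIR / "shehata-mmc2.xlsx"
PROCESSED_CSV = PROCESSED_DIR / "shehata.csv"

def fragment_csv(fragment: str) -> Path:
    """Path to one of the 16 fragment CSVs, e.g. fragment_csv('VH_only')."""
    return FRAGMENTS_DIR / f"{fragment}_shehata.csv"
```

Scripts would then write `from shehata_paths import PROCESSED_CSV` rather than repeating the literal, and the validation scripts could assert against the same constants.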


Part 3: Script Duplication Cleanup

Duplicates to Delete (4 files)

Delete these root-level scripts:

rm scripts/convert_jain_excel_to_csv.py        # Moved to preprocessing/
rm scripts/convert_harvey_csvs.py              # Moved to preprocessing/
rm scripts/validate_jain_conversion.py         # Use scripts/validation/ version
rm scripts/validate_fragments.py               # Use scripts/validation/ version

Canonical versions (keep these):

  • ✅ preprocessing/jain/step1_convert_excel_to_csv.py
  • ✅ preprocessing/harvey/step1_convert_raw_csvs.py
  • ✅ scripts/validation/validate_jain_conversion.py
  • ✅ scripts/validation/validate_fragments.py


Part 4: Documentation Updates (35+ files)

Root README.md (2 lines)

File: README.md

Lines 109-110:

# OLD:
- `data/test/shehata.csv` - Full paired VH+VL sequences (398 antibodies)
- `data/test/shehata/*.csv` - 16 fragment-specific files (...)

# NEW:
- `data/test/shehata/processed/shehata.csv` - Full paired VH+VL sequences (398 antibodies)
- `data/test/shehata/fragments/*.csv` - 16 fragment-specific files (...)

Top-Level Docs (3 files)

File: docs/COMPLETE_VALIDATION_RESULTS.md

Line 128:

# OLD:
**Test file**: `data/test/shehata/VH_only_shehata.csv`

# NEW:
**Test file**: `data/test/shehata/fragments/VH_only_shehata.csv`

File: docs/BENCHMARK_TEST_RESULTS.md

Line 75:

# OLD:
**Test file:** `data/test/shehata/VH_only_shehata.csv`

# NEW:
**Test file:** `data/test/shehata/fragments/VH_only_shehata.csv`

File: docs/research/assay-thresholds.md

Line 143:

# OLD:
df = pd.read_csv("data/test/shehata/VH_only_shehata.csv")

# NEW:
df = pd.read_csv("data/test/shehata/fragments/VH_only_shehata.csv")

Shehata-Specific Docs (6 files in docs/datasets/shehata/)

Update all references in:

  1. docs/datasets/shehata/shehata_preprocessing_implementation_plan.md
  2. docs/datasets/shehata/shehata_data_sources.md
  3. docs/datasets/shehata/shehata_phase2_completion_report.md
  4. docs/datasets/shehata/archive/shehata_conversion_verification_report.md
  5. docs/datasets/shehata/archive/shehata_blocker_analysis.md
  6. docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md

Pattern (apply to all 6 files):

# Find and replace across all docs/datasets/shehata/ files:
data/test/shehata.csv       → data/test/shehata/processed/shehata.csv
data/test/shehata/*.csv     → data/test/shehata/fragments/*.csv
data/test/shehata/VH_only   → data/test/shehata/fragments/VH_only
data/test/mmc2.xlsx         → data/test/shehata/raw/shehata-mmc2.xlsx

Harvey Docs (if Option B - script cleanup)

Update references to root-level scripts:

File: docs/datasets/harvey/archive/harvey_data_cleaning_log.md (7 references)

scripts/convert_harvey_csvs.py → preprocessing/harvey/step1_convert_raw_csvs.py
scripts/validate_fragments.py  → scripts/validation/validate_fragments.py

File: docs/datasets/harvey/harvey_data_sources.md (3 references)

scripts/convert_harvey_csvs.py → preprocessing/harvey/step1_convert_raw_csvs.py

File: docs/datasets/harvey/harvey_script_status.md (3 references)

scripts/validate_fragments.py → scripts/validation/validate_fragments.py


Part 5: Create New READMEs (5 files)

1. data/test/shehata/README.md (Master)

Content:

  • Citation (Shehata 2019 + Sakhnini 2025)
  • Quick start guide
  • Data flow diagram
  • Link to subdirectory READMEs
  • Verification commands

2. data/test/shehata/raw/README.md

Content:

  • Original Excel files description
  • Citation details
  • DO NOT MODIFY warning
  • Note: mmc3/4/5 unused but archived for provenance
  • Conversion instructions

3. data/test/shehata/processed/README.md

Content:

  • shehata.csv description (398 antibodies)
  • PSR score thresholding (98.24th percentile)
  • Label distribution (391 specific, 7 non-specific)
  • Regeneration instructions
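The labeling rule above (PSR score above the 98.24th percentile → non-specific) can be sketched as follows; the scores here are synthetic stand-ins for the real PSR values, which come from the raw Excel file:

```python
# Sketch of percentile-based labeling. 398 antibodies; scores above the
# 98.24th percentile are flagged non-specific (label 1), matching the
# documented 391/7 split. Scores below are synthetic placeholders.
import numpy as np

scores = np.arange(398, dtype=float)      # placeholder PSR scores
threshold = np.percentile(scores, 98.24)  # cutoff percentile from the plan
labels = (scores > threshold).astype(int) # 1 = non-specific

print(int((labels == 0).sum()), int((labels == 1).sum()))  # 391 7
```

With 398 samples, 1 - 0.9824 of the data is about 7 antibodies, which is why the expected label distribution checked later is {0: 391, 1: 7}.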

4. data/test/shehata/canonical/README.md

Content:

  • Explanation: Shehata is external test set
  • No subsampling needed (unlike Jain)
  • Full 398-antibody dataset in processed/ is canonical
  • canonical/ kept empty for consistency with Jain structure

5. data/test/shehata/fragments/README.md

Content:

  • 16 fragment types explained
  • ANARCI/IMGT numbering methodology
  • Usage examples for each fragment
  • Fragment type use cases
  • CRITICAL: Note about P0 blocker (gap characters) and fix


Verification Plan (Complete)

1. File Move Verification

echo "Raw files (4):" && ls -1 data/test/shehata/raw/*.xlsx | wc -l
echo "Processed files (1):" && ls -1 data/test/shehata/processed/*.csv | wc -l
echo "Fragment files (16):" && ls -1 data/test/shehata/fragments/*.csv | wc -l
echo "Total CSVs (17):" && find data/test/shehata -name "*.csv" | wc -l

2. P0 Blocker Regression Check (CRITICAL)

grep -c '\-' data/test/shehata/fragments/*.csv | grep -v ':0$'
# Should return NOTHING (all files have 0 gaps)

3. Row Count Validation

for f in data/test/shehata/fragments/*.csv; do
  count=$(wc -l < "$f")
  if [ "$count" -ne 399 ]; then
    echo "ERROR: $f has $count lines (expected 399)"
  fi
done

4. Label Distribution Check

python3 -c "
import pandas as pd
files = ['processed/shehata.csv', 'fragments/VH_only_shehata.csv', 'fragments/Full_shehata.csv']
for f in files:
    path = f'data/test/shehata/{f}'
    df = pd.read_csv(path)
    dist = df['label'].value_counts().sort_index().to_dict()
    expected = {0: 391, 1: 7}
    status = '✅' if dist == expected else '❌'
    print(f'{status} {f}: {dist}')
"

5. Script Regeneration Test

python3 preprocessing/shehata/step1_convert_excel_to_csv.py  # Should work
python3 preprocessing/shehata/step2_extract_fragments.py     # Should work

6. Comprehensive Validation

python3 scripts/validation/validate_shehata_conversion.py  # Should pass all checks
python3 tests/test_shehata_embedding_compatibility.py      # Should pass all tests

7. Model Test

python3 test.py --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
  --data data/test/shehata/fragments/VH_only_shehata.csv
# Should load and run successfully

8. Documentation Validation

# Check no references to old paths remain
grep -rn "data/test/shehata\.csv" docs/ README.md --include="*.md" | grep -v "processed/"
# Should return NOTHING

grep -rn "data/test/mmc2\.xlsx" docs/ --include="*.md" | grep -v "raw/"
# Should return NOTHING

# Check no references to deleted root scripts
grep -rn "scripts/convert_jain_excel" docs/ --include="*.md" | grep -v "conversion/"
grep -rn "scripts/convert_harvey" docs/ --include="*.md" | grep -v "conversion/"
grep -rn "scripts/validate_jain" docs/ --include="*.md" | grep -v "validation/"
grep -rn "scripts/validate_fragments" docs/ --include="*.md" | grep -v "validation/"
# All should return NOTHING

Execution Order (Critical)

Execute in this exact order:

Phase 1: Prepare

  1. ✅ Create branch leroy-jenkins/shehata-cleanup
  2. Create directory structure
  3. Create all 5 READMEs

Phase 2: Move Files

  1. Move 4 Excel files → raw/
  2. Move shehata.csv → processed/
  3. Move 16 fragments → fragments/

Phase 3: Update Python Scripts

  1. Update 3 core scripts (conversion, processing, validation)
  2. Update 5 analysis/testing scripts

Phase 4: Clean Duplicate Scripts

  1. Delete 4 root-level duplicate scripts
  2. Verify no broken imports
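"Verify no broken imports" can itself be checked mechanically: search the tree for any remaining reference to the four deleted scripts. A hedged sketch, demonstrated on a scratch directory containing only a valid reference:

```shell
# Any hit for the deleted-script paths is a breakage to fix before committing.
# The scratch tree below stands in for the real docs/ and source tree.
TMPD=$(mktemp -d)
mkdir -p "$TMPD/docs"
echo "run scripts/validation/validate_fragments.py" > "$TMPD/docs/ok.md"

deleted='scripts/convert_jain_excel_to_csv\.py|scripts/convert_harvey_csvs\.py|scripts/validate_jain_conversion\.py|scripts/validate_fragments\.py'
if grep -rqE "$deleted" "$TMPD"; then
  echo "broken references found"
else
  echo "clean"
fi
```

Note that the alternation `scripts/validate_fragments\.py` does not match the kept `scripts/validation/validate_fragments.py` path, so canonical references survive the check.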

Phase 5: Update Documentation

  1. Update README.md (2 lines)
  2. Update 3 top-level docs
  3. Update 6 shehata docs
  4. Update 3 harvey docs (script references)

Phase 6: Verify

  1. Run all 8 verification checks
  2. Confirm all pass

Phase 7: Commit

  1. Commit with detailed message
  2. Push to branch

Complete File Checklist

Python Files to Modify (8)

  • preprocessing/shehata/step1_convert_excel_to_csv.py
  • preprocessing/shehata/step2_extract_fragments.py
  • scripts/validation/validate_shehata_conversion.py
  • scripts/analysis/analyze_threshold_optimization.py (DELETED - experimental script)
  • scripts/testing/demo_assay_specific_thresholds.py
  • scripts/validate_fragments.py
  • scripts/validation/validate_fragments.py
  • tests/test_shehata_embedding_compatibility.py

Python Files to Delete (4)

  • scripts/convert_jain_excel_to_csv.py
  • scripts/convert_harvey_csvs.py
  • scripts/validate_jain_conversion.py
  • scripts/validate_fragments.py

READMEs to Create (5)

  • data/test/shehata/README.md
  • data/test/shehata/raw/README.md
  • data/test/shehata/processed/README.md
  • data/test/shehata/canonical/README.md
  • data/test/shehata/fragments/README.md

Documentation Files to Update (13)

  • README.md (2 lines)
  • docs/COMPLETE_VALIDATION_RESULTS.md (1 line)
  • docs/BENCHMARK_TEST_RESULTS.md (1 line)
  • docs/research/assay-thresholds.md (1 line)
  • docs/datasets/shehata/shehata_preprocessing_implementation_plan.md
  • docs/datasets/shehata/shehata_data_sources.md
  • docs/datasets/shehata/shehata_phase2_completion_report.md
  • docs/datasets/shehata/archive/shehata_conversion_verification_report.md
  • docs/datasets/shehata/archive/shehata_blocker_analysis.md
  • docs/datasets/shehata/archive/p0_blocker_first_principles_validation.md
  • docs/datasets/harvey/archive/harvey_data_cleaning_log.md (7 refs)
  • docs/datasets/harvey/harvey_data_sources.md (3 refs)
  • docs/datasets/harvey/harvey_script_status.md (3 refs)

Time Estimate

Total: 60-75 minutes

  • Phase 1 (Prepare): 5 min
  • Phase 2 (Move): 5 min
  • Phase 3 (Scripts): 15 min
  • Phase 4 (Duplicates): 5 min
  • Phase 5 (Docs): 20 min
  • Phase 6 (Verify): 10 min
  • Phase 7 (Commit): 5 min

Citation (Correct)

Dataset Source: Shehata, L. et al. (2019). "Affinity maturation enhances antibody specificity but compromises conformational stability." Cell Reports 28(13):3300-3308.e4. DOI: 10.1016/j.celrep.2019.08.056

Methodology Source: Sakhnini, L.I., et al. (2025). "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: 10.1101/2025.04.28.650927


Status: COMPLETE PLAN - READY TO EXECUTE

All feedback validated from first principles. Plan now includes:

  • ✅ All 8 Python script updates
  • ✅ All 4 duplicate script deletions
  • ✅ All 35+ documentation updates
  • ✅ Complete verification checklist
  • ✅ Clear execution order

Awaiting your go-ahead to execute, boss.