⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup
This document describes the verification results from the 2025-11-05 Harvey cleanup execution.
For current pipeline documentation, see data/test/harvey/README.md.
This verification confirmed the cleanup was successful.
Note: This document references leroy-jenkins/full-send, which was renamed to main on 2025-11-28.
Harvey Dataset Cleanup - Verification Results
Date: 2025-11-05 (Historical)
Branch: leroy-jenkins/harvey-cleanup
Status: ✅ ALL VERIFICATION CHECKS PASSED
Verification Summary
Check 1: File Move Verification ✅
- Raw files: 3 CSVs
- Processed files: 1 CSV
- Fragment files: 6 CSVs
- Total: 10 CSVs (as expected)
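The file counts above could be recomputed with a short script. This is a minimal sketch, assuming the CSVs sit directly inside each subdirectory of the dataset root; `count_csvs` is a hypothetical helper and not part of the pipeline.

```python
from pathlib import Path


def count_csvs(root: Path) -> dict[str, int]:
    """Count *.csv files directly inside each subdirectory of `root`.

    Subdirectories without any CSV are left out of the result.
    A "total" key holds the sum over all subdirectories.
    """
    counts: dict[str, int] = {}
    for sub in sorted(root.iterdir()):
        if sub.is_dir():
            n = sum(1 for _ in sub.glob("*.csv"))
            if n:
                counts[sub.name] = n
    counts["total"] = sum(counts.values())
    return counts
```

Called on data/test/harvey/, a layout matching this report would yield 3 for raw, 1 for processed, 6 for fragments, and a total of 10.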
Check 2: Row Count Validation ✅
- Processed file: 141,475 lines (141,474 + header) ✅
- All fragment files: 141,022 lines (141,021 + header) ✅
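The line counts above include the header row. As a minimal sketch of how the data-row count could be recomputed, assuming an ordinary CSV with exactly one header line (`count_data_rows` is a hypothetical name):

```python
import csv
from pathlib import Path


def count_data_rows(csv_path: Path) -> int:
    """Return the number of non-empty records after the header row."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header, if any
        # csv.reader yields an empty list for blank lines; ignore those.
        return sum(1 for row in reader if row)
```

Parsing with the `csv` module rather than counting raw lines keeps the result correct if a quoted field ever contains a newline.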
Check 3: Label Distribution Check ✅
- Low polyreactivity (0): 69,702 ✅
- High polyreactivity (1): 71,772 ✅
- Match: Perfect (49.3% / 50.7%)
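A small sketch of how the class balance could be recomputed from a label column. The function name is hypothetical, and the labels are assumed to be the integers 0 (low polyreactivity) and 1 (high polyreactivity) as listed above.

```python
from collections import Counter
from typing import Iterable


def label_distribution(labels: Iterable[int]) -> dict[int, tuple[int, float]]:
    """Map each label to (count, fraction of all labels)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # An empty input produces an empty dict, so no division by zero occurs.
    return {label: (n, n / total) for label, n in sorted(counts.items())}
```

Feeding it the label column of the processed file gives both the absolute counts and the fractions in one pass.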
Check 5: Fragment Validation ✅
- Harvey validation: PASSED
- 6 fragment files validated
- Consistent row counts: ✅
- Label distribution preserved: ✅
Check 6: Embedding Compatibility Test (P0 CRITICAL) ✅
Result: ALL 5 TESTS PASSED
- ✅ Gap Character Detection
- ✅ Amino Acid Validation
- ✅ Previously Affected Sequences
- ✅ ESM Model Validation
- ✅ Data Integrity
P0 blocker successfully resolved - ESM-1v compatible!
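The test implementations themselves are not reproduced in this document. As a rough sketch of what the gap-character and amino-acid checks might look like, the code below assumes that a valid sequence contains only the 20 standard amino-acid letters and that `-` marks a gap. Both are assumptions for illustration, and the pipeline's actual rules may differ.

```python
# Assumption: only the 20 standard amino-acid letters count as valid.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_sequences(sequences: dict[str, str]) -> dict[str, list[str]]:
    """Map each offending sequence ID to its sorted list of invalid characters.

    Gap characters such as '-' are reported like any other
    non-standard character.
    """
    problems: dict[str, list[str]] = {}
    for seq_id, seq in sequences.items():
        bad = sorted(set(seq.upper()) - STANDARD_AMINO_ACIDS)
        if bad:
            problems[seq_id] = bad
    return problems
```

An empty result means every sequence passed. A non-empty result lists which IDs to inspect and which characters triggered the failure.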
Check 8: Failed Sequences Check ✅
- Failed sequences logged: 453 IDs ✅
- Location: data/test/harvey/fragments/failed_sequences.txt
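The figures in this document are consistent with each fragment file omitting exactly the failed sequences: 141,474 processed rows minus 453 failed IDs gives 141,021 fragment rows. The sketch below expresses that bookkeeping check. It assumes the failure log holds one ID per line, which this document does not confirm, and both function names are hypothetical.

```python
from pathlib import Path


def read_failed_ids(log_path: Path) -> set[str]:
    """Return the unique, non-empty lines of the failure log."""
    with log_path.open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def expected_fragment_rows(processed_rows: int, failed_count: int) -> int:
    """Rows expected in each fragment file once failed sequences are dropped."""
    if not 0 <= failed_count <= processed_rows:
        raise ValueError("failed_count must be between 0 and processed_rows")
    return processed_rows - failed_count


# Figures reported in this document.
assert expected_fragment_rows(141_474, 453) == 141_021
```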
Check 9: Documentation Validation ✅
- No old harvey.csv paths in docs (excluding intentional history): 0 ✅
- No reference_repos/harvey_official_repo refs in scripts: 0 ✅
- No harvey_high/low.csv refs in Python scripts: 0 ✅
- All new harvey/fragments/ and harvey/processed/ paths verified ✅
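A check for stale references like this one amounts to a recursive text search. The sketch below is one way to do it in Python. The pattern list is taken from the items above, while the function name, the default suffix filter, and the choice of plain substring matching are assumptions for illustration.

```python
from pathlib import Path

# Old locations named in this report; expand harvey_high/low.csv into both files.
STALE_PATTERNS = (
    "reference_repos/harvey_official_repo",
    "harvey_high.csv",
    "harvey_low.csv",
)


def find_stale_references(
    root: Path,
    suffixes: tuple[str, ...] = (".py",),
    patterns: tuple[str, ...] = STALE_PATTERNS,
) -> list[tuple[Path, int, str]]:
    """Return (file, line number, pattern) for every stale reference found."""
    hits: list[tuple[Path, int, str]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for pattern in patterns:
                if pattern in line:
                    hits.append((path, lineno, pattern))
    return hits
```

An empty list corresponds to the "0" results reported above. Passing `suffixes=(".md",)` would run the same scan over documentation files.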
Cleanup Results
Files Reorganized
- ✅ 3 raw CSVs copied to raw/
- ✅ 1 processed CSV moved to processed/
- ✅ 2 intermediate CSVs deleted (harvey_high/low.csv)
- ✅ 6 fragment CSVs moved to fragments/
- ✅ 1 failure log moved to fragments/
Code Updated
- ✅ 6 Python scripts updated (15 path references)
- ✅ 11 documentation files updated (76 path references)
- ✅ 5 comprehensive READMEs created
Total Changes
- 22 files created/moved/updated
- 91+ path references updated
- 0 errors encountered
- 100% verification success
Final Structure
data/test/harvey/
├── README.md (master guide)
├── raw/
│   ├── README.md
│   ├── high_polyreactivity_high_throughput.csv (71,772)
│   ├── low_polyreactivity_high_throughput.csv (69,702)
│   └── low_throughput_polyspecificity_scores_w_exp.csv
├── processed/
│   ├── README.md
│   └── harvey.csv (141,474)
├── canonical/
│   └── README.md (empty - dataset already balanced)
└── fragments/
    ├── README.md
    ├── VHH_only_harvey.csv (141,021)
    ├── H-CDR1/2/3_harvey.csv (141,021 each)
    ├── H-CDRs_harvey.csv (141,021)
    ├── H-FWRs_harvey.csv (141,021)
    └── failed_sequences.txt (453 failures)
Conclusion
- ✅ Harvey dataset cleanup COMPLETE and VERIFIED
- ✅ All Robert C. Martin principles applied
- ✅ P0 blocker resolved - ESM-1v compatible
- ✅ Ready for merge to full-send
For the singularity! For open science! 🚀
Verification Date: 2025-11-05
Verified By: Claude Code (Leroy Jenkins Mode)
Status: 🟢 READY FOR MERGE