⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup

This document describes the verification results from the 2025-11-05 Harvey cleanup execution.

For current pipeline documentation, see: data/test/harvey/README.md

This verification confirmed the cleanup was successful.

Note: This document references the leroy-jenkins/full-send branch, which was renamed to main on 2025-11-28.


Harvey Dataset Cleanup - Verification Results

Date: 2025-11-05 (Historical)
Branch: leroy-jenkins/harvey-cleanup
Status: ALL VERIFICATION CHECKS PASSED


Verification Summary

Check 1: File Move Verification ✅

  • Raw files: 3 CSVs
  • Processed files: 1 CSV
  • Fragment files: 6 CSVs
  • Total: 10 CSVs (as expected)
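
A check of this kind can be scripted. The following is a minimal sketch, not the script used during the cleanup: it assumes the raw/, processed/ and fragments/ subdirectories listed under Final Structure, and the names `count_csvs` and `verify_file_move` are hypothetical.

```python
from pathlib import Path

# Expected CSV counts per subdirectory, taken from the figures in Check 1.
EXPECTED_CSV_COUNTS = {"raw": 3, "processed": 1, "fragments": 6}


def count_csvs(dataset_root):
    """Count *.csv files directly inside each expected subdirectory."""
    root = Path(dataset_root)
    return {
        name: len(list((root / name).glob("*.csv")))
        for name in EXPECTED_CSV_COUNTS
    }


def verify_file_move(dataset_root):
    """Return {subdir: (expected, actual)} for every subdirectory that is off."""
    actual = count_csvs(dataset_root)
    return {
        name: (expected, actual[name])
        for name, expected in EXPECTED_CSV_COUNTS.items()
        if actual[name] != expected
    }
```

An empty dictionary from `verify_file_move("data/test/harvey")` would correspond to the passing result reported here. Non-CSV files such as failed_sequences.txt are ignored by the `*.csv` pattern.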

Check 2: Row Count Validation ✅

  • Processed file: 141,475 lines (141,474 + header) ✅
  • Each fragment file: 141,022 lines (141,021 + header) ✅
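
The line counts above separate the header from the data rows. As a rough sketch of how such a count might be taken (the helper name `count_data_rows` is hypothetical, and it assumes one record per line with no embedded newlines in any field):

```python
from pathlib import Path


def count_data_rows(csv_path):
    """Return the number of lines in a CSV file, excluding the header line."""
    with Path(csv_path).open(encoding="utf-8") as handle:
        total_lines = sum(1 for _ in handle)
    # An empty file has no header to subtract, so clamp at zero.
    return max(total_lines - 1, 0)
```

Under those assumptions, running this on processed/harvey.csv and on each fragment CSV would be expected to give the 141,474 and 141,021 data rows reported in this check.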

Check 3: Label Distribution Check ✅

  • Low polyreactivity (0): 69,702 ✅
  • High polyreactivity (1): 71,772 ✅
  • Match with raw file counts: exact (49.3% / 50.7%)
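
The class balance follows directly from the two label counts. A small sketch of the calculation, with `label_distribution` as a hypothetical helper name and the labels passed in as a plain iterable rather than read from the CSV:

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction_of_total)} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Rebuild the reported counts: 69,702 low (0) and 71,772 high (1).
distribution = label_distribution([0] * 69_702 + [1] * 71_772)
for label, (count, fraction) in sorted(distribution.items()):
    print(f"label {label}: {count:,} rows ({fraction:.1%})")
```

The two counts sum to 141,474, which agrees with the processed row count in Check 2.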

Check 5: Fragment Validation ✅

  • Harvey validation: PASSED
  • 6 fragment files validated
  • Consistent row counts: ✅
  • Label distribution preserved: ✅

Check 6: Embedding Compatibility Test (P0 CRITICAL) ✅

Result: ALL 5 TESTS PASSED

  • ✅ Gap Character Detection
  • ✅ Amino Acid Validation
  • ✅ Previously Affected Sequences
  • ✅ ESM Model Validation
  • ✅ Data Integrity

P0 blocker successfully resolved - ESM-1v compatible!
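
For illustration, the gap-character and amino-acid checks might look like the sketch below. This is not the project's test suite. It assumes that a clean sequence contains only the 20 standard amino acid letters and that `-` and `.` are the gap characters of interest; the real tests may accept a wider alphabet. The name `validate_sequence` is hypothetical.

```python
# Assumption: only the 20 standard amino acids count as valid residues.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
# Assumption: these are the gap characters left behind by alignment output.
GAP_CHARACTERS = frozenset("-.")


def validate_sequence(sequence):
    """Return a list of problems found in one sequence (empty list = clean)."""
    if not sequence:
        return ["empty sequence"]
    problems = []
    gaps = sorted(set(sequence) & GAP_CHARACTERS)
    if gaps:
        problems.append(f"gap characters: {''.join(gaps)}")
    unknown = sorted(
        set(sequence.upper()) - STANDARD_AMINO_ACIDS - GAP_CHARACTERS
    )
    if unknown:
        problems.append(f"non-standard residues: {''.join(unknown)}")
    return problems
```

A sequence passes when the returned list is empty, so a dataset-level check reduces to confirming that no row in a fragment CSV yields any problems.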

Check 8: Failed Sequences Check ✅

  • Failed sequences logged: 453 IDs
  • Location: data/test/harvey/fragments/failed_sequences.txt
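
The failure count ties the processed and fragment row counts together: 141,474 - 453 = 141,021. A sketch of that reconciliation follows, assuming the log holds one sequence ID per line and that each failed ID corresponds to exactly one dropped row. Neither assumption is confirmed by this report, and `reconcile_row_counts` is a hypothetical name.

```python
def reconcile_row_counts(processed_rows, failed_id_lines, fragment_rows):
    """Check that processed rows minus unique failed IDs equals fragment rows.

    Returns (matches, expected_fragment_rows).
    """
    # Ignore blank lines and duplicate IDs in the failure log.
    unique_failures = {line.strip() for line in failed_id_lines if line.strip()}
    expected_fragment_rows = processed_rows - len(unique_failures)
    return expected_fragment_rows == fragment_rows, expected_fragment_rows
```

In practice `failed_id_lines` could be the open file handle for failed_sequences.txt, since iterating over a text file yields its lines.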

Check 9: Documentation Validation ✅

  • No old harvey.csv paths in docs (excluding intentional history): 0 ✅
  • No reference_repos/harvey_official_repo refs in scripts: 0 ✅
  • No harvey_high/low.csv refs in Python scripts: 0 ✅
  • All new harvey/fragments/ and harvey/processed/ paths verified ✅
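
A scan for stale references can be approximated with a short script. The sketch below is an assumption about how such a check could be written, not the command that produced the counts above. It reads "harvey_high/low.csv" as the two file names harvey_high.csv and harvey_low.csv, and the scanned suffixes and the name `find_stale_references` are hypothetical.

```python
from pathlib import Path

# Patterns named in Check 9; the expansion of "harvey_high/low.csv" into
# two file names is an assumption.
STALE_PATTERNS = (
    "reference_repos/harvey_official_repo",
    "harvey_high.csv",
    "harvey_low.csv",
)


def find_stale_references(root, patterns=STALE_PATTERNS, suffixes=(".py", ".md")):
    """Return (path, line_number, pattern) for every stale reference found."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for pattern in patterns:
                if pattern in line:
                    hits.append((path, line_number, pattern))
    return hits
```

An empty list corresponds to the zero counts reported here. Intentional historical mentions would need to be excluded separately, for example by skipping known archive files.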

Cleanup Results

Files Reorganized

  • ✅ 3 raw CSVs copied to raw/
  • ✅ 1 processed CSV moved to processed/
  • ✅ 2 intermediate CSVs deleted (harvey_high/low.csv)
  • ✅ 6 fragment CSVs moved to fragments/
  • ✅ 1 failure log moved to fragments/

Code Updated

  • ✅ 6 Python scripts updated (15 path references)
  • ✅ 11 documentation files updated (76 path references)
  • ✅ 5 comprehensive READMEs created

Total Changes

  • 22 files created/moved/updated
  • 91+ path references updated
  • 0 errors encountered
  • 100% verification success

Final Structure

data/test/harvey/
├── README.md (master guide)
├── raw/
│   ├── README.md
│   ├── high_polyreactivity_high_throughput.csv (71,772)
│   ├── low_polyreactivity_high_throughput.csv (69,702)
│   └── low_throughput_polyspecificity_scores_w_exp.csv
├── processed/
│   ├── README.md
│   └── harvey.csv (141,474)
├── canonical/
│   └── README.md (empty - dataset already balanced)
└── fragments/
    ├── README.md
    ├── VHH_only_harvey.csv (141,021)
    ├── H-CDR1/2/3_harvey.csv (141,021 each)
    ├── H-CDRs_harvey.csv (141,021)
    ├── H-FWRs_harvey.csv (141,021)
    └── failed_sequences.txt (453 failures)

Conclusion

  • Harvey dataset cleanup COMPLETE and VERIFIED
  • All Rob C. Martin principles applied
  • P0 blocker resolved - ESM-1v compatible
  • Ready for merge to full-send

For the singularity! For open science! 🚀


Verification Date: 2025-11-05
Verified By: Claude Code (Leroy Jenkins Mode)
Status: 🟢 READY FOR MERGE