⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup
This document describes the cleanup investigation from 2025-11-05. The cleanup was subsequently approved and executed successfully.
For current pipeline documentation, see data/test/harvey/README.md. Status warnings below are historical and do not reflect the current state.
Note: This document references leroy-jenkins/full-send, which was renamed to main on 2025-11-28.
Harvey Dataset Cleanup - Senior Investigation¶
Date: 2025-11-05 (Historical)
Branch: leroy-jenkins/full-send
Status: INVESTIGATION - AWAITING SENIOR APPROVAL (Historical - approved and completed)
Executive Summary¶
Harvey dataset structure is MESSY and requires cleanup similar to Shehata/Jain reorganization.
Current Problems:
1. ❌ Raw source files NOT in data/test/ (in reference_repos/)
2. ❌ Processed files scattered (3 CSVs at root, 6 in subdirectory)
3. ❌ No clear data flow documentation
4. ❌ No README files in data/test/harvey/
5. ❌ Inconsistent with Shehata/Jain 4-tier structure
Recommendation: Apply the same 4-tier cleanup (raw → processed → canonical → fragments)
Audit & Validation Summary¶
Date Validated: 2025-11-05 (comprehensive first-principles audit)
Validation Methodology:
- ✅ Every script path reference verified by reading source files
- ✅ Every documentation path reference confirmed via grep search
- ✅ All line numbers validated against actual code
- ✅ Comprehensive search for Harvey references (Python + Markdown; see the sketch below)
- ✅ Comparison with audit findings from external review
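The per-file reference counts below were collected manually during the audit. As a sanity check, the inventory can be re-derived with a short script; this is a minimal sketch, and the regex pattern and directory list are illustrative assumptions rather than the exact audit commands.

# Sketch: re-count Harvey path references across scripts and docs.
# Run from the repository root; pattern and directories are assumptions.
import re
from pathlib import Path

PATTERN = re.compile(r"data/test/harvey[\w./-]*|reference_repos/harvey_official_repo[\w./-]*")

def count_references(roots, suffixes):
    counts = {}
    for root in roots:
        if not Path(root).is_dir():
            continue
        for path in Path(root).rglob("*"):
            if path.is_file() and path.suffix in suffixes:
                hits = PATTERN.findall(path.read_text(errors="ignore"))
                if hits:
                    counts[str(path)] = len(hits)
    return counts

py_refs = count_references(["preprocessing", "scripts", "tests"], {".py"})
md_refs = count_references(["docs"], {".md"})
print(f"Python files with Harvey references: {len(py_refs)} ({sum(py_refs.values())} refs)")
print(f"Markdown files with Harvey references: {len(md_refs)} ({sum(md_refs.values())} refs)")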
Validated Findings:
| Category | Count | Status |
|---|---|---|
| Python Scripts | 6 files | ✅ All 15 path references verified |
| Markdown Docs | 11 files | ✅ All 76 path references verified |
| Total References | 91+ | ✅ Complete inventory |
Key Numbers (Validated):
- 6 Python scripts need path updates (15 total references)
- 11 Markdown files need path updates (76 total references)
- 10 files to move (3 copied to raw/, 1 moved to processed/, 7 moved to fragments/)
- 2 files to delete (harvey_high.csv, harvey_low.csv - duplicates of raw sources)
- 5 READMEs to create (master, raw, processed, canonical, fragments)
- 60-75 minutes estimated execution time (revised upward after audit)
Comparison with Initial Estimate:
- Scripts: 2 → 6 files (comprehensive audit found 4 more)
- Documentation: 8+ → 11 files (audit identified exact count)
- Path references: ~20 → 91+ references (4.5x more than initially estimated)
Confidence Level: HIGH - All claims validated from first principles
Current State (MESSY)¶
File Layout¶
reference_repos/harvey_official_repo/backend/app/experiments/
├── high_polyreactivity_high_throughput.csv (71,772 + header)
├── low_polyreactivity_high_throughput.csv (69,702 + header)
└── low_throughput_polyspecificity_scores_w_exp.csv (48 + header)
data/test/ (ROOT LEVEL - BAD)
├── harvey.csv (141,474 antibodies + header = 141,475 lines)
├── harvey_high.csv (71,772 + header = 71,773 lines)
├── harvey_low.csv (69,702 + header = 69,703 lines)
└── harvey/ (SUBDIRECTORY - MIXED PURPOSE)
    ├── H-CDR1_harvey.csv (141,021 + header)
    ├── H-CDR2_harvey.csv (141,021 + header)
    ├── H-CDR3_harvey.csv (141,021 + header)
    ├── H-CDRs_harvey.csv (141,021 + header)
    ├── H-FWRs_harvey.csv (141,021 + header)
    ├── VHH_only_harvey.csv (141,021 + header)
    └── failed_sequences.txt (453 failed ANARCI annotations)
Problems Identified¶
P1: Raw sources outside data/test/
- Raw data in reference_repos/ not version controlled with dataset
- Should be copied/symlinked to data/test/harvey/raw/
- Breaking principle: "All data sources in data/test/"
P2: Processed files at root level
- harvey.csv, harvey_high.csv, harvey_low.csv at data/test/ root
- Should be in data/test/harvey/processed/
- Breaking principle: "Organized by dataset, not scattered"
P3: No canonical/ directory
- Harvey is a training set (not an external test set like Shehata)
- Should have canonical benchmarks similar to Boughter
- Breaking principle: "Consistent 4-tier structure"
P4: Mixed purpose harvey/ directory
- Currently contains only fragments
- Should be harvey/fragments/ specifically
- Breaking principle: "Single Responsibility - one dir, one purpose"
P5: No README documentation
- No provenance documentation in harvey/ directory
- No data flow explanation
- Breaking principle: "Self-documenting structure"
P6: Inconsistent with Shehata/Jain cleanup
- Shehata/Jain now have clean 4-tier structure
- Harvey still has old messy structure
- Breaking principle: "Consistent patterns across datasets"
Data Flow Analysis¶
Current Flow (Undocumented)¶
reference_repos/harvey_official_repo/backend/app/experiments/
├── high_polyreactivity_high_throughput.csv (71,772)
└── low_polyreactivity_high_throughput.csv (69,702)
        ↓ [preprocessing/harvey/step1_convert_raw_csvs.py]
data/test/harvey.csv (141,474 combined)
        ↓ [preprocessing/harvey/step2_extract_fragments.py + ANARCI]
data/test/harvey/ fragments (141,021 = 141,474 - 453 ANARCI failures)
Missing intermediate files:
- harvey_high.csv and harvey_low.csv appear to be copies from reference_repos
- Purpose unclear (are they needed? duplicates?)
- No documentation explaining their role
ANARCI failures:
- 453 sequences failed annotation (0.32% failure rate)
- Documented in failed_sequences.txt
- Acceptable loss, but should be tracked in README
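For context, step1 is conceptually a labelled concatenation of the two raw CSVs. The sketch below shows that assumed logic using the post-cleanup paths; the real step1_convert_raw_csvs.py remains the source of truth for the exact schema.

# Sketch of the assumed step1 logic: combine the two raw Harvey CSVs into
# one labelled harvey.csv (0 = low polyreactivity, 1 = high polyreactivity).
import pandas as pd
from pathlib import Path

RAW = Path("data/test/harvey/raw")                      # post-cleanup location
OUT = Path("data/test/harvey/processed/harvey.csv")

high = pd.read_csv(RAW / "high_polyreactivity_high_throughput.csv")  # 71,772 rows
low = pd.read_csv(RAW / "low_polyreactivity_high_throughput.csv")    # 69,702 rows
high["label"] = 1
low["label"] = 0

combined = pd.concat([high, low], ignore_index=True)                 # 141,474 rows
OUT.parent.mkdir(parents=True, exist_ok=True)
combined.to_csv(OUT, index=False)
print(f"Wrote {len(combined):,} rows to {OUT}")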
Proposed Structure (CLEAN)¶
Target Layout¶
data/test/harvey/
├── README.md ← Master guide
├── raw/ ← Original sources (DO NOT MODIFY)
│   ├── README.md
│   ├── high_polyreactivity_high_throughput.csv (71,772)
│   ├── low_polyreactivity_high_throughput.csv (69,702)
│   └── low_throughput_polyspecificity_scores_w_exp.csv (48 - optional)
├── processed/ ← Converted datasets
│   ├── README.md
│   └── harvey.csv (141,474 combined - SSOT)
│       [harvey_high/low.csv DELETED per Decision 2 - scripts read from raw/]
├── canonical/ ← Final benchmarks
│   ├── README.md
│   └── [TO BE DETERMINED - training splits? balanced subsets?]
└── fragments/ ← Region-specific extracts
    ├── README.md
    ├── VHH_only_harvey.csv (141,021)
    ├── H-CDR1/2/3_harvey.csv
    ├── H-CDRs_harvey.csv
    ├── H-FWRs_harvey.csv
    └── failed_sequences.txt (453 failures logged)
Comparison with Clean Datasets¶
Shehata (CLEAN) ✅¶
shehata/
├── raw/ (4 Excel files)
├── processed/ (shehata.csv - 398 antibodies)
├── canonical/ (empty - external test set)
└── fragments/ (16 fragments)
Benefits:
- Clear separation of stages
- Complete provenance documentation
- Reproducible pipelines
- Self-documenting with READMEs
Jain (CLEAN) ✅¶
jain/
├── raw/ (3 PNAS Excel + 1 private ELISA)
├── processed/ (jain.csv, jain_ELISA_ONLY_116.csv)
├── canonical/ (jain_86_novo_parity.csv)
└── fragments/ (16 fragments + extras)
Benefits:
- Same 4-tier structure
- Benchmarks in canonical/
- All derived files reproducible
Harvey (MESSY) ❌¶
reference_repos/harvey_official_repo/ (raw - WRONG LOCATION)
data/test/harvey.csv (root - WRONG LOCATION)
data/test/harvey_high.csv (root - WRONG LOCATION)
data/test/harvey_low.csv (root - WRONG LOCATION)
data/test/harvey/ (fragments only - MIXED PURPOSE)
Problems:
- No consistent structure
- Files scattered across locations
- No provenance documentation
- Inconsistent with other datasets
Cleanup Scope¶
Files to Move/Delete (10 files to move, 2 files to delete)¶
From reference_repos → raw/ (COPY 3 files):
- high_polyreactivity_high_throughput.csv
- low_polyreactivity_high_throughput.csv
- low_throughput_polyspecificity_scores_w_exp.csv (optional)
From data/test/ root → processed/ (MOVE 1 file):
- harvey.csv → data/test/harvey/processed/harvey.csv
From data/test/ root (DELETE 2 files per Decision 2):
- ❌ harvey_high.csv (delete - duplicate of raw source, scripts will read from raw/)
- ❌ harvey_low.csv (delete - duplicate of raw source, scripts will read from raw/)
From data/test/harvey/ → fragments/ (MOVE 6 CSVs + 1 log):
- H-CDR1_harvey.csv
- H-CDR2_harvey.csv
- H-CDR3_harvey.csv
- H-CDRs_harvey.csv
- H-FWRs_harvey.csv
- VHH_only_harvey.csv
- failed_sequences.txt
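A pathlib/shutil sketch of these copy/move/delete operations is below. It assumes the Phase 1 directory skeleton already exists and that the plan has been approved; in practice git mv would be preferred for tracked files so history is preserved.

# Sketch of the file operations described above (not yet executed).
import shutil
from pathlib import Path

SRC_RAW = Path("reference_repos/harvey_official_repo/backend/app/experiments")
HARVEY = Path("data/test/harvey")

# Copy the 3 raw sources into raw/ (originals stay in reference_repos/)
for name in [
    "high_polyreactivity_high_throughput.csv",
    "low_polyreactivity_high_throughput.csv",
    "low_throughput_polyspecificity_scores_w_exp.csv",
]:
    shutil.copy2(SRC_RAW / name, HARVEY / "raw" / name)

# Move the combined dataset into processed/
shutil.move("data/test/harvey.csv", str(HARVEY / "processed" / "harvey.csv"))

# Delete the two duplicates (Decision 2)
for name in ["harvey_high.csv", "harvey_low.csv"]:
    Path("data/test", name).unlink()

# Move the 6 fragment CSVs plus the failure log into fragments/
for name in [
    "H-CDR1_harvey.csv", "H-CDR2_harvey.csv", "H-CDR3_harvey.csv",
    "H-CDRs_harvey.csv", "H-FWRs_harvey.csv", "VHH_only_harvey.csv",
    "failed_sequences.txt",
]:
    shutil.move(str(HARVEY / name), str(HARVEY / "fragments" / name))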
Scripts to Update (6 files, 15 total path references)¶
VALIDATED: All path references confirmed from first principles (2025-11-05)
1. preprocessing/harvey/step1_convert_raw_csvs.py (4 path references)
# Lines 119-121: Path variables
# OLD:
high_csv = Path("data/test/harvey_high.csv")
low_csv = Path("data/test/harvey_low.csv")
output_csv = Path("data/test/harvey.csv")
# NEW:
high_csv = Path("data/test/harvey/raw/high_polyreactivity_high_throughput.csv")
low_csv = Path("data/test/harvey/raw/low_polyreactivity_high_throughput.csv")
output_csv = Path("data/test/harvey/processed/harvey.csv")
# Lines 127-135: Error messages (update all path strings)
# Lines 141-143: Print statements (update displayed paths)
2. preprocessing/harvey/step2_extract_fragments.py (3 path references)
# Lines 202-203: Path variables
# OLD:
csv_path = Path("data/test/harvey.csv")
output_dir = Path("data/test/harvey")
# NEW:
csv_path = Path("data/test/harvey/processed/harvey.csv")
output_dir = Path("data/test/harvey/fragments")
# Line 133: Failure log path
# OLD:
failure_log = Path("data/test/harvey/failed_sequences.txt")
# NEW:
failure_log = Path("data/test/harvey/fragments/failed_sequences.txt")
# Lines 207-211: Error messages and docstrings (update all path strings)
3. scripts/validation/validate_fragments.py (1 path reference)
# Line 193: Harvey validation entry
# OLD:
("harvey", Path("data/test/harvey"), 6),
# NEW:
("harvey", Path("data/test/harvey/fragments"), 6),
4. scripts/rethreshold_harvey.py (1 path reference) DELETED
# This script was deleted as experimental (Nov 2025 cleanup)
# Purpose fulfilled: PSR threshold (0.549) already discovered and implemented
# Results documented in docs/research/assay-thresholds.md
5. preprocessing/harvey/test_psr_threshold.py (1 path reference)
# Line 73: Harvey file path
# OLD:
harvey_file = "data/test/harvey/VHH_only_harvey.csv"
# NEW:
harvey_file = "data/test/harvey/fragments/VHH_only_harvey.csv"
6. tests/test_harvey_embedding_compatibility.py (5 path references)
# Lines 45, 96, 242: Harvey directory paths
# OLD:
harvey_dir = Path("data/test/harvey")
# NEW:
harvey_dir = Path("data/test/harvey/fragments")
# Lines 153, 203: VHH file paths
# OLD:
vhh_file = Path("data/test/harvey/VHH_only_harvey.csv")
# NEW:
vhh_file = Path("data/test/harvey/fragments/VHH_only_harvey.csv")
Documentation to Update (11 files, 76 total path references)¶
VALIDATED: All path references confirmed via grep search (2025-11-05)
Harvey-specific docs (7 files in docs/harvey/ - 43 references):
- harvey_data_sources.md (5 references)
- harvey_data_cleaning_log.md (12 references)
- harvey_preprocessing_implementation_plan.md (3 references)
- harvey_script_status.md (7 references)
- harvey_script_audit_request.md (6 references)
- HARVEY_P0_FIX_REPORT.md (8 references)
- HARVEY_TEST_RESULTS.md (2 references)
Root-level Harvey docs (2 files - 31 references):
- docs/harvey_data_sources.md (9 references)
- docs/harvey_data_cleaning_log.md (22 references)
Global benchmark docs (2 files - 2 references):
- docs/COMPLETE_VALIDATION_RESULTS.md (line 176: VHH_only_harvey.csv path)
- docs/BENCHMARK_TEST_RESULTS.md (line 142: VHH_only_harvey.csv path)
Script documentation:
- scripts/testing/README.md (usage section references old harvey paths)
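Because most of the 76 documentation references are mechanical path swaps, a scripted rewrite keeps them consistent. A minimal sketch follows; the old-to-new mapping mirrors this plan, bare directory references (e.g. the old data/test/harvey fragments location) still need manual review, and diffs should be inspected before committing.

# Sketch: bulk-rewrite old Harvey paths in the Markdown docs listed above.
# scripts/testing/README.md sits outside docs/ and would get the same treatment.
from pathlib import Path

# Ordered longest-first so more specific paths are rewritten before shorter ones.
REPLACEMENTS = [
    ("data/test/harvey/VHH_only_harvey.csv", "data/test/harvey/fragments/VHH_only_harvey.csv"),
    ("data/test/harvey_high.csv", "data/test/harvey/raw/high_polyreactivity_high_throughput.csv"),
    ("data/test/harvey_low.csv", "data/test/harvey/raw/low_polyreactivity_high_throughput.csv"),
    ("data/test/harvey.csv", "data/test/harvey/processed/harvey.csv"),
]

for md in Path("docs").rglob("*.md"):
    original = md.read_text()
    text = original
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    if text != original:
        md.write_text(text)
        print(f"Updated {md}")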
READMEs to Create (5 files)¶
1. data/test/harvey/README.md (master guide)
   - Dataset overview
   - Data flow diagram
   - Citation information
   - Quick start guide
   - Verification commands
2. data/test/harvey/raw/README.md
   - Original source files
   - Data provenance
   - Citation (Harvey et al., Mason et al.)
   - DO NOT MODIFY warning
   - Conversion instructions
3. data/test/harvey/processed/README.md
   - CSV conversion details
   - Label assignment (0 = low poly, 1 = high poly)
   - Label distribution (49.1% / 50.9%)
   - harvey_high/low.csv purpose
   - Regeneration instructions
4. data/test/harvey/canonical/README.md
   - Purpose: training benchmarks
   - Decision needed: balanced subsets? cross-validation splits?
   - Comparison with Boughter canonical/
5. data/test/harvey/fragments/README.md
   - 6 fragment types (VHH only - nanobodies)
   - ANARCI annotation details
   - Failed sequences (453 - 0.32%)
   - Fragment use cases
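After the READMEs are written, a quick scripted check can confirm all five exist. A minimal sketch, with paths taken from this plan:

# Sketch: verify the 5 planned Harvey READMEs exist (run from the repo root).
from pathlib import Path

expected = [
    "data/test/harvey/README.md",
    "data/test/harvey/raw/README.md",
    "data/test/harvey/processed/README.md",
    "data/test/harvey/canonical/README.md",
    "data/test/harvey/fragments/README.md",
]
missing = [p for p in expected if not Path(p).is_file()]
print("All 5 READMEs present" if not missing else f"Missing: {missing}")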
Key Decisions Required¶
Decision 1: Raw Data Location¶
Question: Copy or symlink reference_repos files to data/test/harvey/raw/?
Options:
- Option A: Copy files (15MB + 15MB = 30MB)
  - ✅ Self-contained data/test/
  - ✅ No external dependencies
  - ❌ Duplicated data (uses more space)
- Option B: Symlink files
  - ✅ No duplication
  - ✅ Single source of truth
  - ❌ Breaks if reference_repos/ moved
- Option C: Keep in reference_repos, update paths
  - ✅ No duplication
  - ❌ External dependency
  - ❌ Inconsistent with Shehata/Jain
Recommendation: Option A (Copy) - Consistency with Shehata/Jain, self-contained
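For reference, the two mechanisms differ by a single call. A minimal sketch contrasting Option A (copy, recommended) with Option B (symlink), shown for one of the three raw files:

# Decision 1 sketch: copy (Option A, recommended) vs. symlink (Option B).
import os
import shutil
from pathlib import Path

src = Path("reference_repos/harvey_official_repo/backend/app/experiments/high_polyreactivity_high_throughput.csv")
dst = Path("data/test/harvey/raw/high_polyreactivity_high_throughput.csv")
dst.parent.mkdir(parents=True, exist_ok=True)

# Option A: copy - self-contained data/test/, survives moves of reference_repos/
shutil.copy2(src, dst)

# Option B (not chosen): relative symlink - no duplication, but breaks if the target moves
# os.symlink(os.path.relpath(src, dst.parent), dst)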
Decision 2: harvey_high.csv and harvey_low.csv¶
Question: Keep or delete intermediate files?
Current state:
- harvey_high.csv = copy of raw/high_polyreactivity_high_throughput.csv
- harvey_low.csv = copy of raw/low_polyreactivity_high_throughput.csv
- Both used as input to preprocessing/harvey/step1_convert_raw_csvs.py
Options:
- Option A: Keep in processed/
  - ✅ Explicit intermediate files
  - ✅ Can regenerate harvey.csv from these
  - ❌ Duplicated data (3x storage)
- Option B: Delete, use raw/ directly
  - ✅ DRY principle (no duplication)
  - ✅ Scripts read directly from raw/
  - ❌ Loses intermediate checkpoint
Recommendation: Option B (Delete) - Scripts should read from raw/, output to processed/harvey.csv
Decision 3: canonical/ Contents¶
Question: What benchmarks belong in harvey/canonical/?
Harvey characteristics:
- 141,021 nanobodies (training set)
- Balanced classes (49.1% / 50.9%)
- High-throughput dataset (not curated like Jain)
Options:
- Option A: Empty (like Shehata)
  - Use full 141,021 dataset directly
  - No subsampling needed
- Option B: Balanced subset
  - Create 10k balanced subset for quick testing
  - Similar to Boughter canonical/
- Option C: Cross-validation splits
  - Pre-defined train/val splits
  - Ensures consistent benchmarking
Recommendation: Option A (Empty) - Full dataset is already balanced, no need for canonical subsets
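If Option B is ever revisited, a balanced subset is only a few lines of pandas. Illustration only - Option A (empty canonical/) remains the recommendation; the output filename and random seed are assumptions.

# Illustration of Decision 3, Option B: a reproducible 10k balanced subset.
import pandas as pd

df = pd.read_csv("data/test/harvey/processed/harvey.csv")

# 5,000 sequences per class, fixed seed for reproducibility
subset = df.groupby("label").sample(n=5000, random_state=42).reset_index(drop=True)
subset.to_csv("data/test/harvey/canonical/harvey_10k_balanced.csv", index=False)
print(subset["label"].value_counts())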
Verification Plan¶
1. File Move Verification¶
echo "Raw files (3):" && ls -1 data/test/harvey/raw/*.csv | wc -l
echo "Processed files (1):" && ls -1 data/test/harvey/processed/*.csv | wc -l
echo "Fragment files (6):" && ls -1 data/test/harvey/fragments/*.csv | wc -l
echo "Total CSVs (10):" && find data/test/harvey -name "*.csv" | wc -l
2. Row Count Validation¶
# Processed should have 141,474 + header
wc -l data/test/harvey/processed/harvey.csv # Should be 141,475
# All fragments should have 141,021 + header
for f in data/test/harvey/fragments/*.csv; do
count=$(wc -l < "$f")
if [ "$count" -ne 141022 ]; then
echo "ERROR: $f has $count lines (expected 141022)"
fi
done
3. Label Distribution Check¶
python3 -c "
import pandas as pd
df = pd.read_csv('data/test/harvey/processed/harvey.csv')
dist = df['label'].value_counts().sort_index().to_dict()
expected = {0: 69702, 1: 71772} # low (0) and high (1) polyreactivity
print(f'Label distribution: {dist}')
print(f'Expected: {expected}')
print('Match:', dist == expected)
"
4. Script Regeneration Test¶
# Test conversion script
python3 preprocessing/harvey/step1_convert_raw_csvs.py
# Test fragment extraction
python3 preprocessing/harvey/step2_extract_fragments.py
5. Fragment Validation¶
python3 scripts/validation/validate_fragments.py
# Should validate harvey fragments (now points to harvey/fragments/)
6. Embedding Compatibility Test (P0 Regression Check)¶
# CRITICAL: Run embedding compatibility test after cleanup
python3 tests/test_harvey_embedding_compatibility.py
# Ensures no gap characters reintroduced, ESM-1v compatible
# Tests all 6 fragment files, validates no '-' characters
7. Model Test¶
python3 test.py --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/harvey/fragments/VHH_only_harvey.csv
# Should load and run successfully with new paths
8. Failed Sequences Check¶
# Verify failed_sequences.txt has 453 entries and moved to fragments/
wc -l data/test/harvey/fragments/failed_sequences.txt # Should be 453
9. Documentation Validation¶
# Check no references to old paths remain (should find 0 after cleanup)
grep -rn "data/test/harvey\.csv" docs/ README.md --include="*.md" | grep -v "processed/"
# Should return NOTHING
grep -rn "reference_repos/harvey_official_repo" scripts/ --include="*.py"
# Should return NOTHING
grep -rn "data/test/harvey_high\|data/test/harvey_low" . --include="*.py" --include="*.md"
# Should return NOTHING (files deleted)
# Verify all fragments paths use new structure
grep -rn "data/test/harvey/fragments" scripts/ tests/ --include="*.py"
# Should find 15 references (all updated)
Execution Plan (7 Phases)¶
Estimated time: 60-75 minutes (revised upward after comprehensive audit)
Phase 1: Prepare (5 min)¶
- Create directory structure: data/test/harvey/{raw,processed,canonical,fragments}
- Create 5 comprehensive READMEs
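A minimal, idempotent sketch of the Phase 1 directory skeleton (the READMEs themselves are then written by hand):

# Phase 1 sketch: create the 4-tier Harvey directory structure.
from pathlib import Path

for tier in ["raw", "processed", "canonical", "fragments"]:
    Path("data/test/harvey", tier).mkdir(parents=True, exist_ok=True)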
Phase 2: Move Raw Files (5 min)¶
- Copy 3 CSV files from reference_repos/ → raw/
Phase 3: Move Processed Files (2 min)¶
- Move harvey.csv → processed/
- Delete harvey_high.csv and harvey_low.csv (Decision 2)
Phase 4: Move Fragments (2 min)¶
- Move 6 fragment CSVs → fragments/
- Move failed_sequences.txt → fragments/
Phase 5: Update Scripts (15 min)¶
- Update preprocessing/harvey/step1_convert_raw_csvs.py (4 path references, docstrings, error messages)
- Update preprocessing/harvey/step2_extract_fragments.py (3 path references, failure log, docstrings)
- Update scripts/validation/validate_fragments.py (1 path reference)
- Update scripts/rethreshold_harvey.py (1 path reference; script since deleted - see note in Scripts to Update)
- Update preprocessing/harvey/test_psr_threshold.py (1 path reference)
- Update tests/test_harvey_embedding_compatibility.py (5 path references)
- Total: 6 files, 15 path references
Phase 6: Update Documentation (30 min)¶
- Update 7 files in docs/harvey/ (43 references)
- Update 2 root-level Harvey docs (31 references)
- Update 2 global benchmark docs (2 references)
- Update scripts/testing/README.md (usage examples)
- Total: 11 files, 76+ path references
Phase 7: Verify (15 min)¶
- Run all 9 verification checks (including embedding compatibility test)
- Ensure all pass
- Confirm 0 references to old paths remain
Risk Assessment¶
Low Risk ✅¶
- Harvey has good docs (7 docs in docs/harvey/, 2 at root)
- Simple structure (only 2 scripts, 6 fragments)
- No P0 blockers (ANARCI issues already resolved)
- Balanced dataset (no label issues)
- Reference implementation (Shehata cleanup already done)
Medium Risk ⚠️¶
- Raw data dependency (reference_repos/ outside version control)
- Intermediate files (harvey_high/low.csv purpose unclear)
- canonical/ decision (empty vs. subsets?)
Mitigation¶
- Copy raw files to data/test/ (self-contained)
- Delete intermediate files (simplify)
- Start with empty canonical/ (add later if needed)
Comparison with Shehata Cleanup¶
Similarities¶
- Both need 4-tier structure
- Both have fragments in subdirectory
- Both need README documentation
- Both need script path updates
- Both need doc updates
Differences¶
- Harvey is SIMPLER:
  - Only 6 fragments (vs 16 for Shehata)
  - Only 2 scripts (vs 3 for Shehata)
  - Raw files are CSVs (vs Excel for Shehata)
  - No canonical benchmarks needed
  - No duplicate script cleanup needed
Estimated complexity: 60% of Shehata cleanup effort
Robert C. Martin Principles Applied¶
- ✅ Single Responsibility Principle - Each directory serves ONE purpose
- ✅ DRY (Don't Repeat Yourself) - No duplicate files
- ✅ Clean Code - Clear naming, self-documenting structure
- ✅ Traceability - Complete provenance documentation
- ✅ Reproducibility - Scripts regenerate all derived files
- ✅ Consistency - Same 4-tier pattern as Shehata/Jain
Recommendation¶
PROCEED WITH CLEANUP following Shehata pattern.
Rationale:
1. Harvey structure is inconsistent with the cleaned Shehata/Jain
2. Cleanup is SIMPLER than Shehata (fewer files, no duplicates)
3. Low risk (good docs, no P0 blockers)
4. High benefit (consistent dataset organization)
5. Fast execution (60-75 minutes estimated)
Proposed branch: leroy-jenkins/harvey-cleanup
Execution: Same disciplined approach as Shehata:
1. Senior review this document (← current step)
2. Get approval for decisions
3. Create branch
4. Execute 7 phases
5. Verify with 9 checks
6. Merge to leroy-jenkins/full-send
Questions for Senior Approval¶
Q1: Approve Decision 1 (Copy raw files to data/test/harvey/raw/)?
Q2: Approve Decision 2 (Delete harvey_high/low.csv intermediates)?
Q3: Approve Decision 3 (Empty canonical/ directory)?
Q4: Proceed with harvey-cleanup branch creation?
Q5: Any additional concerns or requirements before execution?
Status: AWAITING SENIOR APPROVAL
Next step: Get approval for all 5 questions, then execute cleanup.
Date: 2025-11-05 16:45
Investigator: Claude Code (Senior Review Mode)
Reviewer: [PENDING]