Harvey Preprocessing Script - Status Report¶
Date: 2025-11-01 (Updated: 2025-11-18)
Script: preprocessing/harvey/step2_extract_fragments.py
Status: ✅ COMPLETE - All scripts validated and production-ready
Script Status (2025-11-06)¶
CONFIRMED: Data source verified as official Harvey Lab repository (debbiemarkslab/nanobody-polyreactivity). All preprocessing scripts operational and validated:
- ✅ `preprocessing/harvey/step1_convert_raw_csvs.py` - Converts raw CSVs to processed format
- ✅ `preprocessing/harvey/step2_extract_fragments.py` - ANARCI annotation and fragment extraction
- ✅ `preprocessing/harvey/test_psr_threshold.py` - Standalone validation with PSR threshold (0.5495)
- ✅ All validation tests passing
- ✅ P0 blocker resolved (gap characters removed)
- ✅ Best benchmark parity achieved (61.33% with PSR threshold 0.5495 vs Novo's 61.7%, gap: -0.37pp; threshold application sketched below)
Pipeline fully operational. See data/test/harvey/README.md for current SSOT.
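For reference, here is a minimal sketch of how a fixed PSR threshold of 0.5495 turns model scores into high/low polyreactivity calls and an accuracy figure. The `score` column name and the 0/1 label encoding are illustrative assumptions; this is not the project's actual evaluation code (`test_psr_threshold.py`).

```python
# Minimal sketch: binarize model scores at the PSR threshold and measure accuracy.
# Assumptions (illustrative): a `score` column with model outputs and a `label`
# column encoded as 0 = low / 1 = high polyreactivity.
import pandas as pd

PSR_THRESHOLD = 0.5495  # threshold reported in this document


def psr_accuracy(df: pd.DataFrame) -> float:
    """Predict high polyreactivity when score >= threshold, then compare to labels."""
    predicted = (df["score"] >= PSR_THRESHOLD).astype(int)
    return float((predicted == df["label"]).mean())


# Toy example
toy = pd.DataFrame({"score": [0.20, 0.71, 0.56, 0.40], "label": [0, 1, 1, 0]})
print(f"Accuracy: {psr_accuracy(toy):.2%}")  # -> 100.00% on this toy data
```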
Processing Summary (Validated)¶
The Harvey preprocessing script has been created, audited, and executed. Processing of the HuggingFace download (141,474 nanobodies) completed successfully with a 99.68% annotation rate.
What Was Done¶
1. Script Creation¶
- ✅ Created `preprocessing/harvey/step2_extract_fragments.py` (250 lines)
- ✅ Based on the `process_shehata.py` template
- ✅ Adapted for nanobodies (VHH only, no light chain)
- ✅ Follows `docs/datasets/harvey/harvey_preprocessing_implementation_plan.md` specifications
2. External Audit¶
- ✅ Launched independent Sonnet agent for verification
- ✅ Audit found 2 critical issues + 4 minor issues
- ✅ All issues documented in `docs/datasets/harvey/archive/harvey_script_audit_request.md`
3. Fixes Applied¶
Critical fixes:
1. ✅ Added `sequence_length` column to all fragment CSVs (spec compliance)
2. ✅ Fixed ID generation to use a sequential counter, so IDs have no gaps when annotations fail (see the sketch after this list)
Recommended fixes:
3. ✅ Added failure log file (`data/test/harvey/fragments/failed_sequences.txt`)
4. ✅ Removed emojis from console output (replaced with [OK], [DONE])
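A minimal sketch of the two critical fixes, assuming the annotated records are available as (sequence, label) pairs that already passed ANARCI; the field names and the helper itself are illustrative, not the script's actual code.

```python
# Sketch of the two critical fixes: a sequential ID counter (no gaps when inputs fail
# upstream) and an explicit sequence_length column in every fragment CSV.
import pandas as pd


def build_fragment_frame(records: list[tuple[str, str]], source: str = "harvey2022") -> pd.DataFrame:
    """records: (sequence, label) pairs that already passed annotation."""
    rows = [
        {
            "id": f"harvey_{i:06d}",      # counter only advances for successes -> contiguous IDs
            "sequence": seq,
            "label": label,
            "source": source,
            "sequence_length": len(seq),  # spec-compliance fix
        }
        for i, (seq, label) in enumerate(records, start=1)
    ]
    return pd.DataFrame(rows, columns=["id", "sequence", "label", "source", "sequence_length"])


# Example: two surviving records produce harvey_000001 and harvey_000002
print(build_fragment_frame([("QVQLVESGG", "low"), ("EVQLVESGG", "high")]))
```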
4. Code Quality¶
- ✅ Formatted with `black`
- ✅ Imports sorted with `isort`
- ✅ Type-checked with `mypy`
- ✅ No critical lint errors
Script Specifications¶
Input¶
- File: `data/test/harvey/processed/harvey.csv`
- Rows: 141,474 nanobodies
- Columns: seq, CDR1_nogaps, CDR2_nogaps, CDR3_nogaps, label
Processing¶
- Method: ANARCI (riot_na) with IMGT numbering
- Annotation: Heavy chain only (VHH)
- Error handling: Skip failures, log them to a file, continue processing (pattern sketched below)
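A minimal sketch of the skip/log/continue behaviour described above. `annotate_vhh` is a stand-in for the actual riot_na/ANARCI call (its real API is not shown here); only the control flow mirrors the script.

```python
# Sketch of the error-handling pattern: skip failed annotations, log their IDs,
# keep processing the remaining sequences.
from pathlib import Path


def annotate_vhh(sequence: str) -> dict:
    """Placeholder for IMGT-numbered heavy-chain (VHH) annotation; raises on failure."""
    raise NotImplementedError


def process_all(sequences: dict[str, str], out_dir: Path) -> list[dict]:
    out_dir.mkdir(parents=True, exist_ok=True)
    annotated, failed = [], []
    for seq_id, seq in sequences.items():
        try:
            annotated.append(annotate_vhh(seq))
        except Exception:
            failed.append(seq_id)  # remember which input failed ...
            continue               # ... and move on to the next sequence
    if failed:                     # failure log is only written when there are failures
        (out_dir / "failed_sequences.txt").write_text("\n".join(failed) + "\n")
    return annotated
```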
Output¶
Directory: data/test/harvey/fragments/
6 Fragment CSV files (each 141,021 rows × 5 columns):
1. VHH_only_harvey.csv
2. H-CDR1_harvey.csv
3. H-CDR2_harvey.csv
4. H-CDR3_harvey.csv
5. H-CDRs_harvey.csv
6. H-FWRs_harvey.csv
CSV Columns (all files): id, sequence, label, source, sequence_length (a writer sketch follows this list)
Additional output:
- failed_sequences.txt (if any ANARCI failures)
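A minimal sketch of how the six fragment files could be written with the shared 5-column schema. It assumes each annotated record carries the full VHH plus individual CDR/framework regions, and that H-CDRs / H-FWRs are concatenations of those regions; field names are illustrative.

```python
# Sketch: emit the six fragment CSVs, all sharing the id/sequence/label/source/sequence_length schema.
from pathlib import Path

import pandas as pd

FRAGMENTS = {
    "VHH_only_harvey.csv": lambda r: r["vhh"],
    "H-CDR1_harvey.csv": lambda r: r["cdr1"],
    "H-CDR2_harvey.csv": lambda r: r["cdr2"],
    "H-CDR3_harvey.csv": lambda r: r["cdr3"],
    "H-CDRs_harvey.csv": lambda r: r["cdr1"] + r["cdr2"] + r["cdr3"],
    "H-FWRs_harvey.csv": lambda r: r["fwr1"] + r["fwr2"] + r["fwr3"] + r["fwr4"],
}


def write_fragment_csvs(records: list[dict], out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for filename, extract in FRAGMENTS.items():
        sequences = [extract(r) for r in records]
        frame = pd.DataFrame(
            {
                "id": [r["id"] for r in records],
                "sequence": sequences,
                "label": [r["label"] for r in records],
                "source": "harvey2022",
                "sequence_length": [len(s) for s in sequences],
            }
        )
        frame.to_csv(out_dir / filename, index=False)
```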
Execution Plan¶
Prerequisites¶
✅ data/test/harvey/processed/harvey.csv exists (141,474 rows)
✅ riot_na installed (ANARCI wrapper)
✅ Dependencies: pandas, tqdm
✅ Disk space: ~200 MB for output CSVs (a quick pre-flight check is sketched below)
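A quick pre-flight sketch covering the prerequisites above; the paths and row count come from this document, and the free-space figure is the rough ~200 MB estimate.

```python
# Pre-flight sanity checks before launching step2_extract_fragments.py.
import importlib.util
import shutil
from pathlib import Path

import pandas as pd

INPUT_CSV = Path("data/test/harvey/processed/harvey.csv")

assert INPUT_CSV.exists(), "input CSV missing - run step1_convert_raw_csvs.py first"
assert len(pd.read_csv(INPUT_CSV)) == 141_474, "unexpected row count in harvey.csv"
assert importlib.util.find_spec("riot_na"), "riot_na (ANARCI wrapper) is not installed"
free_mb = shutil.disk_usage(".").free / 1e6
assert free_mb > 200, f"need ~200 MB free for output CSVs, found {free_mb:.0f} MB"
print("[OK] prerequisites satisfied")
```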
Run Command¶
# From project root directory
# Recommended: Run in tmux/screen (may take 10-120 minutes)
tmux new -s harvey_processing
# Execute
python3 preprocessing/harvey/step2_extract_fragments.py
Runtime (observed)¶
- Session: tmux `harvey_processing`
- Wall-clock: ~14 minutes (start 13:45, finish 13:59 on 2025-11-01)
- Average throughput: ~235 sequences/second
- Output log stored in tmux scrollback (first 50 failed IDs; the full list is in `failed_sequences.txt`)
Execution Highlights¶
- Successfully annotated: 141,021 / 141,474 nanobodies (99.68%)
- Failures logged: 453 (0.32%); IDs recorded in `data/test/harvey/fragments/failed_sequences.txt`
- Fragment files generated with consistent row counts (141,021) and the 5-column schema
- Label distribution preserved: 69,262 low (49.1%), 71,759 high (50.9%)
- Validation: `python3 scripts/validation/validate_fragments.py` → PASS
Validation Checklist¶
After execution, verify the following (check sketches are included below):
File Creation¶
- `data/test/harvey/fragments/` directory exists
- All 6 fragment CSVs created
- `failed_sequences.txt` exists (if failures > 0)
Row Counts¶
- All 6 CSVs have the same row count (~140K, i.e. 141,474 minus failures)
- Total annotated rows close to the 141,474 inputs (99%+ success rate expected)
Data Quality¶
- No empty sequences in CSVs
- Label distribution ~50/50 (balanced)
- Sequence lengths in expected ranges:
- VHH: 102-137 aa
- CDR1: 5-14 aa
- CDR2: 6-11 aa
- CDR3: 8-28 aa (longer in nanobodies)
- CDRs: 24-48 aa
- FWRs: 87-101 aa
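A short sketch of the length-range check, using the expected ranges listed above and the fragment file names from the Output section.

```python
# Verify every fragment sequence falls inside the expected amino-acid length range.
import pandas as pd

EXPECTED_RANGES = {  # fragment file -> (min aa, max aa)
    "VHH_only_harvey.csv": (102, 137),
    "H-CDR1_harvey.csv": (5, 14),
    "H-CDR2_harvey.csv": (6, 11),
    "H-CDR3_harvey.csv": (8, 28),
    "H-CDRs_harvey.csv": (24, 48),
    "H-FWRs_harvey.csv": (87, 101),
}

for name, (lo, hi) in EXPECTED_RANGES.items():
    lengths = pd.read_csv(f"data/test/harvey/fragments/{name}")["sequence"].str.len()
    assert lengths.between(lo, hi).all(), f"{name}: lengths outside {lo}-{hi} aa"
print("[OK] all fragment lengths within expected ranges")
```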
CSV Format¶
- Column order: `id, sequence, label, source, sequence_length`
- IDs sequential: harvey_000001, harvey_000002, ...
- Source = "harvey2022" for all rows
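A consolidated sketch of the remaining checklist items (shared row count, column order, non-empty sequences, ID pattern, source value). This is not the actual `scripts/validation/validate_fragments.py`, just the same checks in compact form.

```python
# Check that all six fragment CSVs share one row count and the expected schema.
from pathlib import Path

import pandas as pd

FRAG_DIR = Path("data/test/harvey/fragments")
FILES = [
    "VHH_only_harvey.csv", "H-CDR1_harvey.csv", "H-CDR2_harvey.csv",
    "H-CDR3_harvey.csv", "H-CDRs_harvey.csv", "H-FWRs_harvey.csv",
]
EXPECTED_COLUMNS = ["id", "sequence", "label", "source", "sequence_length"]

row_counts = set()
for name in FILES:
    df = pd.read_csv(FRAG_DIR / name)
    assert list(df.columns) == EXPECTED_COLUMNS, f"{name}: wrong column order"
    assert df["sequence"].str.len().gt(0).all(), f"{name}: empty sequences found"
    assert (df["source"] == "harvey2022").all(), f"{name}: unexpected source value"
    assert df["id"].str.fullmatch(r"harvey_\d{6}").all(), f"{name}: malformed IDs"
    row_counts.add(len(df))
assert len(row_counts) == 1, f"row counts differ across files: {row_counts}"
print(f"[OK] all 6 fragment CSVs consistent ({row_counts.pop()} rows each)")
```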
Known Limitations¶
- ANARCI dependency: Requires riot_na to be installed and working
- Runtime: ~14 minutes observed for the full dataset on this run; may vary with riot_na performance on other hardware
- Memory usage: Loads full dataset into memory (~500 MB estimated)
- No checkpointing: If interrupted, must restart from beginning
Next Steps¶
Completed¶
- ✅ Script created and audited
- ✅ Critical fixes applied
- ✅ Full preprocessing run (tmux session `harvey_processing`)
- ✅ Validation: `python3 scripts/validation/validate_fragments.py`
- ✅ Documentation updated with run outcomes
Completed (All Tasks Done)¶
- ✅ Test loading with `data.load_local_data()`
- ✅ Run model inference (61.33% accuracy with PSR threshold)
- ✅ Compare results with Sakhnini et al. 2025 (-0.37pp gap, best parity)
- ✅ Issue #4 closed and validated
References¶
- Implementation plan: `docs/datasets/harvey/harvey_preprocessing_implementation_plan.md`
- Audit request: `docs/datasets/harvey/archive/harvey_script_audit_request.md`
- Data sources: `docs/datasets/harvey/harvey_data_sources.md`
- Cleaning log: `docs/datasets/harvey/archive/harvey_data_cleaning_log.md`
- Template: `preprocessing/shehata/step2_extract_fragments.py`
- Input data: `data/test/harvey/processed/harvey.csv` (141,474 rows)
✅ FINAL STATUS: COMPLETE AND VALIDATED¶
Processing: Complete (141,021 sequences, 99.68% success rate)
P0 Fix: Resolved (gap characters removed)
Benchmark: 61.33% accuracy (PSR threshold 0.5495, -0.37pp gap vs Novo)
Status: ✅ Production-ready - Best benchmark parity achieved
Last updated: 2025-11-18