Harvey Preprocessing Script - Status Report

Date: 2025-11-01 (Updated: 2025-11-18)
Script: preprocessing/harvey/step2_extract_fragments.py
Status: COMPLETE - All scripts validated and production-ready


Script Status (2025-11-06)

CONFIRMED: Data source verified as official Harvey Lab repository (debbiemarkslab/nanobody-polyreactivity). All preprocessing scripts operational and validated:

  • preprocessing/harvey/step1_convert_raw_csvs.py - Converts raw CSVs to processed format
  • preprocessing/harvey/step2_extract_fragments.py - ANARCI annotation and fragment extraction
  • preprocessing/harvey/test_psr_threshold.py - Standalone validation with PSR threshold (0.5495)
  • ✅ All validation tests passing
  • ✅ P0 blocker resolved (gap characters removed)
  • ✅ Best benchmark parity achieved (61.33% with PSR threshold 0.5495 vs Novo's 61.7%, gap: -0.37pp)

Pipeline fully operational. See data/test/harvey/README.md for the current single source of truth (SSOT).


Processing Summary (Validated)

The Harvey preprocessing script has been created, audited, and executed. Processing of the HuggingFace download (141,474 nanobodies) completed successfully with a 99.68% annotation rate.


What Was Done

1. Script Creation

  • ✅ Created preprocessing/harvey/step2_extract_fragments.py (250 lines)
  • ✅ Based on process_shehata.py template
  • ✅ Adapted for nanobodies (VHH only, no light chain)
  • ✅ Follows docs/datasets/harvey/harvey_preprocessing_implementation_plan.md specifications

2. External Audit

  • ✅ Launched independent Sonnet agent for verification
  • ✅ Audit found 2 critical issues + 4 minor issues
  • ✅ All issues documented in docs/datasets/harvey/archive/harvey_script_audit_request.md

3. Fixes Applied

Critical fixes:

1. ✅ Added sequence_length column to all fragment CSVs (spec compliance)
2. ✅ Fixed ID generation to use a sequential counter, so failures leave no gaps (see the sketch below)
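A minimal sketch of the ID-generation pattern. The names here (sequences, annotate, the record layout) are illustrative, not the script's actual internals:

# Sequential ID assignment: the counter advances only on success, so
# output IDs stay gap-free even when ANARCI rejects a sequence.
records = []
failed = []
next_id = 1  # counts successfully annotated sequences only

for seq in sequences:  # illustrative iterable of input sequences
    try:
        fragments = annotate(seq)  # hypothetical wrapper around riot_na
    except Exception:
        failed.append(seq)  # written to failed_sequences.txt; no ID consumed
        continue
    records.append({"id": f"harvey_{next_id:06d}", **fragments})
    next_id += 1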

Recommended fixes:

3. ✅ Added failure log file (data/test/harvey/fragments/failed_sequences.txt)
4. ✅ Removed emojis from output (replaced with [OK], [DONE])

4. Code Quality

  • ✅ Formatted with black
  • ✅ Imports sorted with isort
  • ✅ Type-checked with mypy
  • ✅ No critical lint errors

Script Specifications

Input

  • File: data/test/harvey/processed/harvey.csv
  • Rows: 141,474 nanobodies
  • Columns: seq, CDR1_nogaps, CDR2_nogaps, CDR3_nogaps, label

Processing

  • Method: ANARCI (riot_na) with IMGT numbering
  • Annotation: Heavy chain only (VHH)
  • Error handling: Skip failures, log to file, continue processing (see the sketch below)
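A rough sketch of the per-sequence extraction follows. The annotate_heavy() helper is a hypothetical stand-in for the riot_na/ANARCI call, assumed to return the IMGT regions of the heavy (VHH) chain or raise on failure:

# Fragment assembly sketch. annotate_heavy() is hypothetical; region
# keys are assumed to follow IMGT naming.
def build_fragments(seq: str) -> dict[str, str]:
    r = annotate_heavy(seq)  # hypothetical: {"fwr1": ..., "cdr1": ..., ..., "fwr4": ...}
    return {
        "VHH_only": seq,  # full nanobody sequence
        "H-CDR1": r["cdr1"],
        "H-CDR2": r["cdr2"],
        "H-CDR3": r["cdr3"],
        "H-CDRs": r["cdr1"] + r["cdr2"] + r["cdr3"],  # concatenated CDRs
        "H-FWRs": r["fwr1"] + r["fwr2"] + r["fwr3"] + r["fwr4"],  # concatenated frameworks
    }

Each key corresponds to one of the six fragment CSVs listed under Output below.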

Output

Directory: data/test/harvey/fragments/

6 fragment CSV files (each 141,021 rows × 5 columns):

1. VHH_only_harvey.csv
2. H-CDR1_harvey.csv
3. H-CDR2_harvey.csv
4. H-CDR3_harvey.csv
5. H-CDRs_harvey.csv
6. H-FWRs_harvey.csv

CSV Columns (all files):

id,sequence,label,source,sequence_length
harvey_000001,QVQLVESGG...,1,harvey2022,127

Additional output:

  • failed_sequences.txt (if any ANARCI failures)


Execution Plan

Prerequisites

  • ✅ data/test/harvey/processed/harvey.csv exists (141,474 rows)
  • ✅ riot_na installed (ANARCI wrapper)
  • ✅ Dependencies: pandas, tqdm
  • ✅ Disk space: ~200 MB for output CSVs
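A minimal preflight sketch for the input-side checks, using the paths and counts from this report:

# Preflight check: confirm the input exists with the expected shape
# before committing to a multi-minute run.
from pathlib import Path

import pandas as pd

INPUT = Path("data/test/harvey/processed/harvey.csv")
assert INPUT.exists(), f"missing input: {INPUT}"

df = pd.read_csv(INPUT)
assert len(df) == 141_474, f"unexpected row count: {len(df)}"
assert {"seq", "CDR1_nogaps", "CDR2_nogaps", "CDR3_nogaps", "label"} <= set(df.columns)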

Run Command

# From project root directory

# Recommended: run in tmux/screen (the full run took ~14 minutes; see Runtime below)
tmux new -s harvey_processing

# Execute
python3 preprocessing/harvey/step2_extract_fragments.py

Runtime (observed)

  • Session: tmux harvey_processing
  • Wall-clock: ~14 minutes (start 13:45, finish 13:59 on 2025-11-01)
  • Average throughput: ~235 sequences/second
  • Output log stored in tmux scrollback (first 50 failed IDs echoed to the log; the complete list is in failed_sequences.txt)

Execution Highlights

  • Successfully annotated: 141,021 / 141,474 nanobodies (99.68%)
  • Failures logged: 453 (0.32%) — IDs recorded in data/test/harvey/fragments/failed_sequences.txt
  • Fragment files generated with consistent row counts (141,021) and 5-column schema
  • Label distribution preserved: 69,262 low (49.1%), 71,759 high (50.9%)
  • Validation: python3 scripts/validation/validate_fragments.py → PASS

Validation Checklist

After execution, verify the following (a scripted version of these checks is sketched after the checklist):

File Creation

  • data/test/harvey/fragments/ directory exists
  • All 6 fragment CSVs created
  • failed_sequences.txt exists (if failures > 0)

Row Counts

  • All 6 CSVs have same row count (~140K, minus failures)
  • Total annotations close to the 141,474 inputs (99%+ success rate expected; 141,021 achieved)

Data Quality

  • No empty sequences in CSVs
  • Label distribution ~50/50 (balanced)
  • Sequence lengths in expected ranges:
      • VHH: 102-137 aa
      • CDR1: 5-14 aa
      • CDR2: 6-11 aa
      • CDR3: 8-28 aa (longer in nanobodies)
      • CDRs: 24-48 aa
      • FWRs: 87-101 aa

CSV Format

  • Column order: id, sequence, label, source, sequence_length
  • IDs sequential: harvey_000001, harvey_000002, ...
  • Source = "harvey2022" for all rows
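These checks can be scripted; a minimal pandas sketch follows (paths and expected values come from this report, and scripts/validation/validate_fragments.py remains the authoritative validator):

# Validation sketch mirroring the checklist above.
from pathlib import Path

import pandas as pd

FRAG_DIR = Path("data/test/harvey/fragments")
NAMES = ["VHH_only", "H-CDR1", "H-CDR2", "H-CDR3", "H-CDRs", "H-FWRs"]

row_counts = set()
for name in NAMES:
    df = pd.read_csv(FRAG_DIR / f"{name}_harvey.csv")
    assert list(df.columns) == ["id", "sequence", "label", "source", "sequence_length"]
    assert df["sequence"].str.len().gt(0).all()  # no empty sequences
    assert (df["source"] == "harvey2022").all()  # constant source tag
    assert 0.45 < df["label"].mean() < 0.55  # ~50/50 label balance
    expected = [f"harvey_{i:06d}" for i in range(1, len(df) + 1)]
    assert df["id"].tolist() == expected  # sequential, gap-free IDs
    row_counts.add(len(df))

assert len(row_counts) == 1, f"row counts differ: {row_counts}"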

Known Limitations

  1. ANARCI dependency: Requires riot_na to be installed and working
  2. Runtime variability: Depends on riot_na performance (~14 minutes observed for the full 141K-sequence run)
  3. Memory usage: Loads full dataset into memory (~500 MB estimated)
  4. No checkpointing: If interrupted, must restart from beginning

Next Steps

Completed (Preprocessing)

  1. ✅ Script created and audited
  2. ✅ Critical fixes applied
  3. ✅ Full preprocessing run (tmux session harvey_processing)
  4. ✅ Validation: python3 scripts/validation/validate_fragments.py
  5. ✅ Documentation updated with run outcomes

Completed (Downstream Validation)

  1. ✅ Test loading with data.load_local_data()
  2. ✅ Run model inference (61.33% accuracy with PSR threshold; sketched after this list)
  3. ✅ Compare results with Sakhnini et al. 2025 (-0.37pp gap, best parity)
  4. ✅ Issue #4 closed and validated
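For reference, the benchmark step reduces to thresholding per-sequence PSR scores at 0.5495 and scoring accuracy against the binary labels. A sketch, where model_psr_scores() is a hypothetical stand-in for the actual inference call:

# Benchmark sketch: binarize PSR scores at the selected threshold and
# compare against labels. model_psr_scores() is hypothetical.
import pandas as pd

PSR_THRESHOLD = 0.5495  # threshold reported in this document

df = pd.read_csv("data/test/harvey/fragments/VHH_only_harvey.csv")
scores = model_psr_scores(df["sequence"])  # hypothetical: one score per row

preds = (scores >= PSR_THRESHOLD).astype(int)
accuracy = (preds == df["label"]).mean()
print(f"accuracy: {accuracy:.2%}")  # 61.33% reported vs Novo's 61.7% (-0.37pp)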

References

  • Implementation plan: docs/datasets/harvey/harvey_preprocessing_implementation_plan.md
  • Audit request: docs/datasets/harvey/archive/harvey_script_audit_request.md
  • Data sources: docs/datasets/harvey/harvey_data_sources.md
  • Cleaning log: docs/datasets/harvey/archive/harvey_data_cleaning_log.md
  • Template: preprocessing/shehata/step2_extract_fragments.py
  • Input data: data/test/harvey/processed/harvey.csv (141,474 rows)


✅ FINAL STATUS: COMPLETE AND VALIDATED

Processing: Complete (141,021 sequences, 99.68% success rate)
P0 Fix: Resolved (gap characters removed)
Benchmark: 61.33% accuracy (PSR threshold 0.5495, -0.37pp gap vs Novo)
Status: Production-ready - Best benchmark parity achieved

Last updated: 2025-11-18