Harvey Preprocessing Script - Status Report

Date: 2025-11-01 (Updated: 2025-11-18)
Script: preprocessing/harvey/step2_extract_fragments.py
Status: COMPLETE - All scripts validated and production-ready


Script Status (2025-11-06)

CONFIRMED: Data source verified as official Harvey Lab repository (debbiemarkslab/nanobody-polyreactivity). All preprocessing scripts operational and validated:

  • preprocessing/harvey/step1_convert_raw_csvs.py - Converts raw CSVs to processed format
  • preprocessing/harvey/step2_extract_fragments.py - ANARCI annotation and fragment extraction
  • preprocessing/harvey/test_psr_threshold.py - Standalone validation with PSR threshold (0.5495)
  • ✅ All validation tests passing
  • ✅ P0 blocker resolved (gap characters removed)
  • ✅ Best benchmark parity achieved (61.33% with PSR threshold 0.5495 vs Novo's 61.7%, gap: -0.37pp)

Pipeline fully operational. See data/test/harvey/README.md for the current single source of truth (SSOT).


Processing Summary (Validated)

The Harvey preprocessing script has been created, audited, and executed. Processing of the HuggingFace download (141,474 nanobodies) completed successfully with a 99.68% annotation rate.


What Was Done

1. Script Creation

  • ✅ Created preprocessing/harvey/step2_extract_fragments.py (250 lines)
  • ✅ Based on process_shehata.py template
  • ✅ Adapted for nanobodies (VHH only, no light chain)
  • ✅ Follows docs/datasets/harvey/harvey_preprocessing_implementation_plan.md specifications

2. External Audit

  • ✅ Launched independent Sonnet agent for verification
  • ✅ Audit found 2 critical issues + 4 minor issues
  • ✅ All issues documented in docs/datasets/harvey/archive/harvey_script_audit_request.md

3. Fixes Applied

Critical fixes:

1. ✅ Added sequence_length column to all fragment CSVs (spec compliance)
2. ✅ Fixed ID generation to use a sequential counter, so failures leave no gaps (see the sketch below)
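A minimal sketch of the ID-generation pattern. The names here (sequences, annotate, the record layout) are illustrative, not the script's actual internals:

# Sequential ID assignment: the counter advances only on success, so
# output IDs stay gap-free even when ANARCI rejects a sequence.
records = []
failed = []
next_id = 1  # counts successfully annotated sequences only

for seq in sequences:  # illustrative iterable of input sequences
    try:
        fragments = annotate(seq)  # hypothetical wrapper around riot_na
    except Exception:
        failed.append(seq)  # written to failed_sequences.txt; no ID consumed
        continue
    records.append({"id": f"harvey_{next_id:06d}", **fragments})
    next_id += 1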

Recommended fixes:

3. ✅ Added failure log file (data/test/harvey/fragments/failed_sequences.txt)
4. ✅ Removed emojis from output (replaced with [OK], [DONE])

4. Code Quality

  • ✅ Formatted with black
  • ✅ Imports sorted with isort
  • ✅ Type-checked with mypy
  • ✅ No critical lint errors

Script Specifications

Input

  • File: data/test/harvey/processed/harvey.csv
  • Rows: 141,474 nanobodies
  • Columns: seq, CDR1_nogaps, CDR2_nogaps, CDR3_nogaps, label

Processing

  • Method: ANARCI (riot_na) with IMGT numbering
  • Annotation: Heavy chain only (VHH)
  • Error handling: Skip failures, log to file, continue processing (see the sketch below)
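A rough sketch of the per-sequence extraction follows. The annotate_heavy() helper is a hypothetical stand-in for the riot_na/ANARCI call, assumed to return the IMGT regions of the heavy (VHH) chain or raise on failure:

# Fragment assembly sketch. annotate_heavy() is hypothetical; region
# keys are assumed to follow IMGT naming.
def build_fragments(seq: str) -> dict[str, str]:
    r = annotate_heavy(seq)  # hypothetical: {"fwr1": ..., "cdr1": ..., ..., "fwr4": ...}
    return {
        "VHH_only": seq,  # full nanobody sequence
        "H-CDR1": r["cdr1"],
        "H-CDR2": r["cdr2"],
        "H-CDR3": r["cdr3"],
        "H-CDRs": r["cdr1"] + r["cdr2"] + r["cdr3"],  # concatenated CDRs
        "H-FWRs": r["fwr1"] + r["fwr2"] + r["fwr3"] + r["fwr4"],  # concatenated frameworks
    }

Each key corresponds to one of the six fragment CSVs listed under Output below.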

Output

Directory: data/test/harvey/fragments/

6 fragment CSV files (each 141,021 rows × 5 columns):

1. VHH_only_harvey.csv
2. H-CDR1_harvey.csv
3. H-CDR2_harvey.csv
4. H-CDR3_harvey.csv
5. H-CDRs_harvey.csv
6. H-FWRs_harvey.csv

CSV Columns (all files):

id,sequence,label,source,sequence_length
harvey_000001,QVQLVESGG...,1,harvey2022,127

Additional output:

  • failed_sequences.txt (if any ANARCI failures)


Execution Plan

Prerequisites

  • ✅ data/test/harvey/processed/harvey.csv exists (141,474 rows)
  • ✅ riot_na installed (ANARCI wrapper)
  • ✅ Dependencies: pandas, tqdm
  • ✅ Disk space: ~200 MB for output CSVs
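A minimal preflight sketch for the input-side checks, using the paths and counts from this report:

# Preflight check: confirm the input exists with the expected shape
# before committing to a multi-minute run.
from pathlib import Path

import pandas as pd

INPUT = Path("data/test/harvey/processed/harvey.csv")
assert INPUT.exists(), f"missing input: {INPUT}"

df = pd.read_csv(INPUT)
assert len(df) == 141_474, f"unexpected row count: {len(df)}"
assert {"seq", "CDR1_nogaps", "CDR2_nogaps", "CDR3_nogaps", "label"} <= set(df.columns)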

Run Command

# From project root directory

# Recommended: run in tmux/screen (the full run took ~14 minutes; see Runtime below)
tmux new -s harvey_processing

# Execute
python3 preprocessing/harvey/step2_extract_fragments.py

Runtime (observed)

  • Session: tmux harvey_processing
  • Wall-clock: ~14 minutes (start 13:45, finish 13:59 on 2025-11-01)
  • Average throughput: ~235 sequences/second
  • Output log stored in tmux scrollback (first 50 failed IDs echoed to the log; the complete list is in failed_sequences.txt)

Execution Highlights

  • Successfully annotated: 141,021 / 141,474 nanobodies (99.68%)
  • Failures logged: 453 (0.32%) — IDs recorded in data/test/harvey/fragments/failed_sequences.txt
  • Fragment files generated with consistent row counts (141,021) and 5-column schema
  • Label distribution preserved: 69,262 low (49.1%), 71,759 high (50.9%)
  • Validation: python3 scripts/validation/validate_fragments.py → PASS

Validation Checklist

After execution, verify the following (a scripted version of these checks is sketched after the checklist):

File Creation

  • data/test/harvey/fragments/ directory exists
  • All 6 fragment CSVs created
  • failed_sequences.txt exists (if failures > 0)

Row Counts

  • All 6 CSVs have same row count (~140K, minus failures)
  • Total annotations close to the 141,474 inputs (99%+ success rate expected; 141,021 achieved)

Data Quality

  • No empty sequences in CSVs
  • Label distribution ~50/50 (balanced)
  • Sequence lengths in expected ranges:
      • VHH: 102-137 aa
      • CDR1: 5-14 aa
      • CDR2: 6-11 aa
      • CDR3: 8-28 aa (longer in nanobodies)
      • CDRs: 24-48 aa
      • FWRs: 87-101 aa

CSV Format

  • Column order: id, sequence, label, source, sequence_length
  • IDs sequential: harvey_000001, harvey_000002, ...
  • Source = "harvey2022" for all rows
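These checks can be scripted; a minimal pandas sketch follows (paths and expected values come from this report, and scripts/validation/validate_fragments.py remains the authoritative validator):

# Validation sketch mirroring the checklist above.
from pathlib import Path

import pandas as pd

FRAG_DIR = Path("data/test/harvey/fragments")
NAMES = ["VHH_only", "H-CDR1", "H-CDR2", "H-CDR3", "H-CDRs", "H-FWRs"]

row_counts = set()
for name in NAMES:
    df = pd.read_csv(FRAG_DIR / f"{name}_harvey.csv")
    assert list(df.columns) == ["id", "sequence", "label", "source", "sequence_length"]
    assert df["sequence"].str.len().gt(0).all()  # no empty sequences
    assert (df["source"] == "harvey2022").all()  # constant source tag
    assert 0.45 < df["label"].mean() < 0.55  # ~50/50 label balance
    expected = [f"harvey_{i:06d}" for i in range(1, len(df) + 1)]
    assert df["id"].tolist() == expected  # sequential, gap-free IDs
    row_counts.add(len(df))

assert len(row_counts) == 1, f"row counts differ: {row_counts}"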

Known Limitations

  1. ANARCI dependency: Requires riot_na to be installed and working
  2. Runtime variability: Depends on riot_na performance (~14 minutes observed for the full 141K-sequence run)
  3. Memory usage: Loads full dataset into memory (~500 MB estimated)
  4. No checkpointing: If interrupted, must restart from beginning

Next Steps

Completed (Preprocessing)

  1. ✅ Script created and audited
  2. ✅ Critical fixes applied
  3. ✅ Full preprocessing run (tmux session harvey_processing)
  4. ✅ Validation: python3 scripts/validation/validate_fragments.py
  5. ✅ Documentation updated with run outcomes

Completed (Downstream Validation)

  1. ✅ Test loading with data.load_local_data()
  2. ✅ Run model inference (61.33% accuracy with PSR threshold; sketched after this list)
  3. ✅ Compare results with Sakhnini et al. 2025 (-0.37pp gap, best parity)
  4. ✅ Issue #4 closed and validated
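For reference, the benchmark step reduces to thresholding per-sequence PSR scores at 0.5495 and scoring accuracy against the binary labels. A sketch, where model_psr_scores() is a hypothetical stand-in for the actual inference call:

# Benchmark sketch: binarize PSR scores at the selected threshold and
# compare against labels. model_psr_scores() is hypothetical.
import pandas as pd

PSR_THRESHOLD = 0.5495  # threshold reported in this document

df = pd.read_csv("data/test/harvey/fragments/VHH_only_harvey.csv")
scores = model_psr_scores(df["sequence"])  # hypothetical: one score per row

preds = (scores >= PSR_THRESHOLD).astype(int)
accuracy = (preds == df["label"]).mean()
print(f"accuracy: {accuracy:.2%}")  # 61.33% reported vs Novo's 61.7% (-0.37pp)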

References

  • Implementation plan: docs/datasets/harvey/harvey_preprocessing_implementation_plan.md
  • Audit request: docs/datasets/harvey/archive/harvey_script_audit_request.md
  • Data sources: docs/datasets/harvey/harvey_data_sources.md
  • Cleaning log: docs/datasets/harvey/archive/harvey_data_cleaning_log.md
  • Template: preprocessing/shehata/step2_extract_fragments.py
  • Input data: data/test/harvey/processed/harvey.csv (141,474 rows)


✅ FINAL STATUS: COMPLETE AND VALIDATED

Processing: Complete (141,021 sequences, 99.68% success rate)
P0 Fix: Resolved (gap characters removed)
Benchmark: 61.33% accuracy (PSR threshold 0.5495, -0.37pp gap vs Novo)
Status: Production-ready - Best benchmark parity achieved

Last updated: 2025-11-18