Harvey Dataset - P0 Blocker Fix Report¶

Date: 2025-11-02 (Last Updated: 2025-11-18) Branch: ray/learning → feat/harvey-preprocessing → dev Issue: #4 – Harvey dataset preprocessing Status: ✅ P0 BLOCKER RESOLVED AND VALIDATED

Executive Summary¶

The Harvey dataset processing script had the EXACT SAME P0 BLOCKER as Shehata: using annotation.sequence_alignment_aa (IMGT-aligned WITH gap characters) instead of annotation.sequence_aa (raw sequence WITHOUT gaps) for the full VHH sequence.

Impact: 12,116 sequences (8.6%) in VHH_only_harvey.csv contained gap characters -, causing ESM-1v embedding validation to crash.

Fix: One-line change in preprocessing/harvey/step2_extract_fragments.py:48 Status: ✅ All 141,021 sequences now gap-free Tests: ✅ 5/5 comprehensive tests passing

P0 Blocker Details¶

The Bug¶

File: preprocessing/harvey/step2_extract_fragments.py Line: 48 Issue: Using wrong attribute from riot_na annotation

# BEFORE (with gaps - WRONG)
fragments = {
    "full_seq_H": annotation.sequence_alignment_aa,  # IMGT-aligned with gaps
    ...
}

# AFTER (gap-free - CORRECT)
fragments = {
    "full_seq_H": annotation.sequence_aa,  # Raw sequence, no gaps (P0 fix)
    ...
}

Root Cause¶

The riot_na library (ANARCI wrapper) provides two sequence attributes: - annotation.sequence_alignment_aa: IMGT-numbered alignment WITH gaps (- characters) - annotation.sequence_aa: Raw amino acid sequence WITHOUT gaps

ESM-1v requirement: Only accepts valid amino acids "ACDEFGHIKLMNPQRSTVWYX" (no - gap character)

Result: The IMGT-aligned sequence with gaps causes ESM-1v to reject the input during validation (model.py:86-90).

Impact Assessment¶

Before Fix (Original Processing)¶

Generated: 2025-11-01 (original run) Source: preprocessing/harvey/step2_extract_fragments.py with sequence_alignment_aa

File	Sequences	Gaps	Gap %
VHH_only_harvey.csv	141,021	12,116	8.6%
H-CDR1_harvey.csv	141,021	0	0%
H-CDR2_harvey.csv	141,021	0	0%
H-CDR3_harvey.csv	141,021	0	0%
H-CDRs_harvey.csv	141,021	0	0%
H-FWRs_harvey.csv	141,021	0	0%

Critical: Only the full VHH sequence was affected because CDR/FWR fragments use .cdr*_aa and .fwr*_aa attributes, which are gap-free by design.

After Fix (Regenerated with P0 Fix)¶

Generated: 2025-11-02 Source: preprocessing/harvey/step2_extract_fragments.py with sequence_aa (gap-free)

File	Sequences	Gap %
VHH_only_harvey.csv	141,021	0% ✅
H-CDR1_harvey.csv	141,021	0%
H-CDR2_harvey.csv	141,021	0%
H-CDR3_harvey.csv	141,021	0%
H-CDRs_harvey.csv	141,021	0%
H-FWRs_harvey.csv	141,021	0%

Result: ✅ All 141,021 sequences are now gap-free and ESM-1v compatible

Data Source Clarification¶

IMPORTANT: Previous documentation incorrectly stated that the Harvey dataset came from HuggingFace ZYMScott/polyreaction (a Harvey + GP-nano combined dataset). This is INCORRECT.

Actual Data Source (Correct)¶

Source: Official Harvey Lab GitHub Repository Repo: debbiemarkslab/nanobody-polyreactivity Location: backend/app/experiments/ Files: - high_polyreactivity_high_throughput.csv (71,772 sequences) - low_polyreactivity_high_throughput.csv (69,702 sequences)

Total: 141,474 sequences → 141,021 after ANARCI annotation (99.68% success rate)

Conversion Script: preprocessing/harvey/step1_convert_raw_csvs.py - Extracts full sequences from IMGT position columns (1-128) - Combines high/low CSVs with binary labels (0=low, 1=high) - Outputs: data/test/harvey/processed/harvey.csv

What ZYMScott/polyreaction Is (NOT Used)¶

Per NbBench paper (Zhang et al. 2025, arxiv:2505.02022): - Harvey [52] + GP-nano [53] COMBINED dataset - Created by NbBench team as curated benchmark - NOT the pure Harvey 2022 dataset

We did NOT use this. Our data comes directly from the official Harvey repository.

Fix Implementation¶

Step 1: Apply P0 Fix¶

File: preprocessing/harvey/step2_extract_fragments.py:48

- "full_seq_H": annotation.sequence_alignment_aa,
+ "full_seq_H": annotation.sequence_aa,  # Gap-free sequence (P0 fix)

Step 2: Regenerate All Harvey Fragments¶

python3 preprocessing/harvey/step2_extract_fragments.py

Runtime: ~10 minutes Output: 6 fragment CSV files (141,021 sequences each)

Step 3: Validate Gap Removal¶

python3 -c "import pandas as pd; \
  vhh = pd.read_csv('data/test/harvey/fragments/VHH_only_harvey.csv'); \
  print(f'Gaps: {vhh[\"sequence\"].str.contains(\"-\", na=False).sum()}')"

Result: Gaps: 0 ✅

Step 4: Run Comprehensive Test Suite¶

python3 tests/test_harvey_embedding_compatibility.py

Result: 5/5 tests passed ✅

Test Suite Results¶

Test 1: Gap Character Detection¶

✅ PASS - All 6 fragment files gap-free (141,021 sequences each)

Test 2: Amino Acid Validation¶

✅ PASS - All sequences contain only valid amino acids (423,063 sequences validated)

Test 3: Previously Affected Sequences¶

✅ PASS - Spot-checked 5 sequences, all gap-free - Before fix: 12,116 sequences with gaps (8.6%) - After fix: 0 sequences with gaps (0%)

Test 4: ESM Model Validation Simulation¶

✅ PASS - All 141,021 sequences passed model.py:86-90 validation logic

Test 5: Data Integrity¶

✅ PASS - All 6 files present with 141,021 rows - Label distribution: 49.1% low, 50.9% high (balanced ✓)

Comparison with Shehata Fix¶

Both datasets had the EXACT SAME P0 BLOCKER - here's the parallel:

Aspect	Shehata	Harvey
Bug Location	`process_shehata.py:63`	`process_harvey.py:48`
Bug Type	`sequence_alignment_aa` (gaps)	`sequence_alignment_aa` (gaps)
Fix	`→ sequence_aa` (gap-free)	`→ sequence_aa` (gap-free)
Affected File	VH_only_shehata.csv	VHH_only_harvey.csv
Impact	100% of 398 sequences	8.6% of 141,021 sequences
Test Suite	test_shehata_embedding_compatibility.py	test_harvey_embedding_compatibility.py
Test Results	5/5 passed ✅	5/5 passed ✅

Key Difference: Shehata's IMGT alignment had gaps in ALL sequences (VH and VL), while Harvey only had gaps in ~8.6% of VHH sequences due to ANARCI's specific insertion handling for nanobodies.

Files Modified¶

Code Changes¶

✅ preprocessing/harvey/step2_extract_fragments.py:48 - P0 fix applied
✅ tests/test_harvey_embedding_compatibility.py - New test suite created

Data Regenerated¶

✅ data/test/harvey/fragments/VHH_only_harvey.csv - 12,116 gaps removed
✅ data/test/harvey/fragments/H-CDR1_harvey.csv - Already gap-free
✅ data/test/harvey/fragments/H-CDR2_harvey.csv - Already gap-free
✅ data/test/harvey/fragments/H-CDR3_harvey.csv - Already gap-free
✅ data/test/harvey/fragments/H-CDRs_harvey.csv - Already gap-free
✅ data/test/harvey/fragments/H-FWRs_harvey.csv - Already gap-free

Documentation¶

✅ docs/datasets/harvey/harvey_p0_fix_report.md - This report
⬜ docs/datasets/harvey/harvey_data_sources.md - Needs update to correct ZYMScott misinformation

Next Steps¶

On ray/learning Branch (Current)¶

✅ P0 fix applied
✅ All fragments regenerated (gap-free)
✅ Test suite created and passing
✅ Documentation created

Completed (All Branches Merged)¶

✅ Cherry-picked P0 fix commit from ray/learning
✅ Regenerated Harvey fragments (141,021 sequences, gap-free)
✅ Test suite validated (all tests passing)
✅ Documentation updated with correct data source
✅ PR merged and deployed to production

Lessons Learned¶

Why This Happened Twice¶

Non-obvious API: riot_na library provides both sequence_aa and sequence_alignment_aa without clear documentation about which to use
Inconsistent behavior: CDR/FWR fragments use gap-free attributes (.cdr*_aa, .fwr*_aa), but full sequence requires explicit .sequence_aa selection
Silent failure: ANARCI doesn't warn about gaps; validation only fails at ESM-1v embedding time

Prevention¶

First-principles validation: Always check for gap characters after ANARCI annotation
Comprehensive test suites: Include gap detection tests for ALL fragment types
Code review: Double-check riot_na attribute selection in all preprocessing scripts
Documentation: Clearly document the .sequence_aa (gap-free) vs .sequence_alignment_aa (with gaps) distinction

References¶

Harvey Paper: Harvey et al. 2022, Nature Communications 13, 7554
Official Repo: https://github.com/debbiemarkslab/nanobody-polyreactivity
Novo Nordisk Paper: Sakhnini et al. 2025, bioRxiv 2025.04.28.650927
ANARCI: Dunbar & Deane 2016, Bioinformatics
riot_na Library: v4.0.5 (ANARCI Python wrapper)
ESM-1v Model: facebook/esm1v_t33_650M_UR90S_1

✅ FINAL STATUS: RESOLVED AND PRODUCTION-READY¶

Fix Applied: 2025-11-02 All Data Regenerated: 2025-11-02 Tests Passing: 5/5 ✅ Benchmark Validation: 61.33% accuracy (PSR threshold 0.5495, -0.37pp gap vs Novo) Production Model: experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl Status: ✅ RESOLVED - All 141,021 sequences gap-free and production-validated

Last Updated: 2025-11-18