
Harvey Dataset – Preprocessing Implementation Plan

Date: 2025-11-01 (Updated: 2025-11-18)
Issue: #4 – Harvey dataset preprocessing
Status: COMPLETE - Implementation validated, pipeline operational, best benchmark parity achieved


Implementation Status (2025-11-06)

UPDATE: Implementation is complete and fully validated. Data source confirmed as the official Harvey Lab repository (debbiemarkslab/nanobody-polyreactivity), NOT the HuggingFace ZYMScott/polyreaction dataset.

Pipeline status: All preprocessing scripts operational and verified. See data/test/harvey/README.md for current SSOT.


Objective

Extract nanobody (VHH) fragments from the Harvey dataset following Sakhnini et al. 2025 methodology to enable testing of ESM-1v based polyreactivity prediction models.


Methodology Reference: Sakhnini et al. 2025

Model Architecture (Section 2.3, line 71)

From literature/markdown/Sakhnini_2025_Antibody_NonSpecificity_PLM_Biophysical/Sakhnini_2025_Antibody_NonSpecificity_PLM_Biophysical.md:

"To show which antibody fragment contributed most to non-specificity, we annotated the CDRs using ANARCI in the IMGT numbering scheme and trained 12 different antibody fragment-specific binary classification models (see Table 4). Overall, all of the protein language models (PLMs) performed well with 66-71% 10-fold CV accuracy."

Table 4 Fragment Types (line 380-388):

1. VH (full heavy variable domain)
2. VL (full light variable domain)
3. H-CDRs (concatenated H-CDR1+2+3)
4. L-CDRs (concatenated L-CDR1+2+3)
5. H-FWRs (concatenated H-FWR1+2+3+4)
6. L-FWRs (concatenated L-FWR1+2+3+4)
7. VH+VL (paired variable domains)
8. All-CDRs (H-CDRs + L-CDRs)
9. All-FWRs (H-FWRs + L-FWRs)
10. Full (VH + VL)
11. VH H-CDR3
12. VL L-CDR3

Top Performing Model

From Section 2.3 (line 73):

"The highest PLM-based predictability is achieved by encoding the VH domain. Across different validation procedures, the VH-based classifier demonstrated the best performance with a mean accuracy of 71% in 10-fold CV."

Critical Finding for Harvey: The Harvey dataset consists of nanobodies (VHH), which are single heavy-chain variable domains with NO light chain.

Therefore, for Harvey we extract VHH-specific fragments only:

1. VHH (full nanobody = equivalent to VH)
2. H-CDR1
3. H-CDR2
4. H-CDR3
5. H-CDRs (concatenated CDR1+2+3)
6. H-FWRs (concatenated FWR1+2+3+4)

NO light chain fragments (L-CDR1/2/3, L-FWRs, VL, etc.)


Harvey Dataset Testing in Sakhnini

Usage Context (Section 2.7, line 131)

"To find out whether our ESM 1v mean-mode VH-based LogisticReg model can extend its applicability further to the non-specificity scored by the PSR assay, the Shehata dataset and the VH-based Nb dataset by Harvey and co-authors [[45]], here referred to as the Harvey dataset, were tested."

Key Points:
- Harvey used as test set only (NOT for training)
- Model was trained on Boughter dataset (~1000 antibodies)
- Testing evaluates generalization to nanobodies

Performance Results (Section 2.7, line 131-132)

"A similar forecast was observed for the Harvey dataset; all the specific PSR-scored Nbs resulted in a broad probability distribution, while the non-specific PSR-scored ones resulted in a narrower probability distribution towards higher non-specificity (Figure 3E,F)."

Interpretation:
- Model predicts high polyreactivity nanobodies better than low
- PSR assay spectrum differs from ELISA (used for Boughter training)
- Still provides useful signal for nanobody polyreactivity prediction


Input Data

Source File

Path: data/test/harvey/processed/harvey.csv
Rows: 141,474 nanobodies (141,475 with header)
Downloaded: 2025-11-01 from HuggingFace ZYMScott/polyreaction

Current Column Structure

seq                 : Full nanobody VHH sequence (52-137 aa)
CDR1_nogaps         : H-CDR1 sequence (pre-extracted, no gaps)
CDR2_nogaps         : H-CDR2 sequence (pre-extracted, no gaps)
CDR3_nogaps         : H-CDR3 sequence (pre-extracted, no gaps)
label               : Binary polyreactivity (0=low, 1=high)

⚠️ CRITICAL: NO CDR LENGTH FILTERING

Harvey's original filter (from Harvey et al. 2022 line 142):
- CDR1 length == 8
- CDR2 length == 8 OR 9
- CDR3 length between 6-22
- Result: 134,302 sequences (65,147 low + 69,155 high)

Novo Nordisk's approach (from Sakhnini et al. 2025):
- NO CDR length filtering
- Used all 141,474 sequences from HuggingFace
- Cited as ">140,000 naïve nanobodies" in Table 4

Our Implementation:
- ✅ Process ALL 141,474 sequences with NO filtering
- ✅ Match Novo Nordisk methodology exactly
- ✅ Broader coverage than Harvey's original training set
- ⚠️ Some sequences may have CDR lengths outside Harvey's constraints
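For reference, the length filter that this pipeline deliberately does NOT apply can be sketched as below. This is a minimal illustration, assuming the "6-22" bound on CDR3 is inclusive; the function names are hypothetical and the column names are taken from the input CSV described above. It could be used to count how many of the 141,474 sequences fall outside Harvey's constraints.

```python
def passes_harvey_length_filter(cdr1: str, cdr2: str, cdr3: str) -> bool:
    """True if the CDR lengths satisfy the Harvey et al. 2022 constraints.

    Assumption: the CDR3 bound "6-22" is inclusive at both ends.
    """
    return len(cdr1) == 8 and len(cdr2) in (8, 9) and 6 <= len(cdr3) <= 22


def count_outside_filter(rows) -> int:
    """Count rows (dicts keyed by the harvey.csv column names) outside the filter."""
    return sum(
        not passes_harvey_length_filter(
            row["CDR1_nogaps"], row["CDR2_nogaps"], row["CDR3_nogaps"]
        )
        for row in rows
    )
```

Running the count is purely informational here; no rows are dropped as a result.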

Data Quality Notes

From HuggingFace dataset inspection:
- Label distribution: 69,702 low (49.3%), 71,772 high (50.7%) - balanced
- Sequence length: 52-137 aa (typical nanobody VHH range: 110-130 aa)
- No missing values observed
- CDRs already extracted but numbering scheme unknown

Question: Are HuggingFace CDRs IMGT-numbered?
Answer: Unknown - safer to re-extract using ANARCI for consistency


Processing Pipeline

Step 1: ANARCI Annotation (IMGT Scheme)

Tool: riot_na.create_riot_aa()
Numbering: IMGT (consistent with Jain/Shehata preprocessing)

For each nanobody sequence in harvey.csv:

1. Run annotator.run_on_sequence(seq_id, sequence)
2. Extract VHH fragments:
   - fwr1_aa_H: Framework 1
   - cdr1_aa_H: CDR1
   - fwr2_aa_H: Framework 2
   - cdr2_aa_H: CDR2
   - fwr3_aa_H: Framework 3
   - cdr3_aa_H: CDR3
   - fwr4_aa_H: Framework 4
   - sequence_alignment_aa: Full aligned VHH
3. Create concatenated fragments:
   - H-CDRs: CDR1 + CDR2 + CDR3 (no separators)
   - H-FWRs: FWR1 + FWR2 + FWR3 + FWR4 (no separators)
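The concatenation in step 3 can be sketched as follows. This is an illustration only: it assumes the annotation result has already been converted to a plain dict keyed by the field names listed above (how the riot_na result object exposes these fields is not confirmed here), and `build_vhh_fragments` is a hypothetical helper name.

```python
CDR_KEYS = ("cdr1_aa_H", "cdr2_aa_H", "cdr3_aa_H")
FWR_KEYS = ("fwr1_aa_H", "fwr2_aa_H", "fwr3_aa_H", "fwr4_aa_H")


def build_vhh_fragments(annotation: dict) -> dict:
    """Map one annotation (assumed to be a dict) to the six Harvey fragments."""
    cdrs = [annotation[key] for key in CDR_KEYS]
    fwrs = [annotation[key] for key in FWR_KEYS]
    return {
        "VHH_only": annotation["sequence_alignment_aa"],
        "H-CDR1": cdrs[0],
        "H-CDR2": cdrs[1],
        "H-CDR3": cdrs[2],
        "H-CDRs": "".join(cdrs),  # CDR1 + CDR2 + CDR3, no separators
        "H-FWRs": "".join(fwrs),  # FWR1 + FWR2 + FWR3 + FWR4, no separators
    }
```

Keeping the key tuples in one place makes it harder for the CDR and FWR orderings to drift between the extraction and validation code.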

Step 2: Fragment CSV Generation

Create directory: data/test/harvey/fragments/

Generate 6 fragment CSV files:

1. VHH_only_harvey.csv
   - Columns: id, sequence, label, source, sequence_length
   - Sequence: Full VHH from ANARCI alignment
   - 141,474 rows

2. H-CDR1_harvey.csv
   - Columns: id, sequence, label, source, sequence_length
   - Sequence: H-CDR1 only
   - 141,474 rows

3. H-CDR2_harvey.csv
   - Columns: id, sequence, label, source, sequence_length
   - Sequence: H-CDR2 only
   - 141,474 rows

4. H-CDR3_harvey.csv
   - Columns: id, sequence, label, source, sequence_length
   - Sequence: H-CDR3 only
   - 141,474 rows

5. H-CDRs_harvey.csv
   - Columns: id, sequence, label, source, sequence_length
   - Sequence: CDR1+CDR2+CDR3 concatenated
   - 141,474 rows

6. H-FWRs_harvey.csv
   - Columns: id, sequence, label, source, sequence_length
   - Sequence: FWR1+FWR2+FWR3+FWR4 concatenated
   - 141,474 rows

Column Definitions:
- id: Unique identifier (row index or nanobody name if available)
- sequence: Extracted fragment sequence
- label: Binary polyreactivity (0=low, 1=high) from input
- source: "harvey2022" (dataset provenance)
- sequence_length: Length of fragment in amino acids
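A minimal sketch of writing one fragment file with these five columns is shown below. It uses the standard-library csv module rather than pandas so that it stays self-contained; `fragment_rows` and `write_fragment_csv` are hypothetical names, and each input record is assumed to be a dict holding an `id`, a `label`, and the fragment sequences.

```python
import csv

COLUMNS = ["id", "sequence", "label", "source", "sequence_length"]


def fragment_rows(records, fragment_key, source="harvey2022"):
    """Yield one output row per record for the chosen fragment."""
    for rec in records:
        seq = rec[fragment_key]
        yield {
            "id": rec["id"],
            "sequence": seq,
            "label": rec["label"],  # preserved unchanged from the input
            "source": source,
            "sequence_length": len(seq),
        }


def write_fragment_csv(records, fragment_key, handle) -> int:
    """Write header plus rows to an open text handle; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for row in fragment_rows(records, fragment_key):
        writer.writerow(row)
        count += 1
    return count
```

When writing to a real file, open it with `newline=""` as the csv module documentation recommends, for example `open(path, "w", newline="")`.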

Step 3: Validation

Validation Script: scripts/validate_harvey_processing.py

Checks:

1. Row count consistency: All 6 fragment files have 141,474 rows
2. No missing sequences: No empty/null sequences
3. Sequence composition: Only valid amino acids (ACDEFGHIKLMNPQRSTVWY)
4. Label preservation: Binary labels (0/1) match input harvey.csv
5. Fragment relationships:
   - CDR1+CDR2+CDR3 concatenated = H-CDRs
   - FWR1+FWR2+FWR3+FWR4 concatenated = H-FWRs
   - All fragments extracted from same VHH sequence
6. Length distributions:
   - VHH: ~110-130 aa (nanobody typical range)
   - CDR1: ~8-12 aa (IMGT H-CDR1 range)
   - CDR2: ~7-10 aa (IMGT H-CDR2 range)
   - CDR3: ~10-20 aa (IMGT H-CDR3 range, longer in nanobodies)
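Checks 2 to 5 can be illustrated per record with a sketch like the one below. It is not the contents of scripts/validate_harvey_processing.py; `validate_record` is a hypothetical helper that assumes one dict per nanobody holding all six fragments and the label. The H-FWRs relationship is omitted because the individual framework regions are not among the six output fragments.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
FRAGMENTS = ("VHH_only", "H-CDR1", "H-CDR2", "H-CDR3", "H-CDRs", "H-FWRs")


def validate_record(rec: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seqs = {name: rec.get(name) or "" for name in FRAGMENTS}
    for name, seq in seqs.items():
        if not seq:
            problems.append(f"{name}: empty sequence")
        elif not set(seq) <= VALID_AA:
            problems.append(f"{name}: non-standard residues")
    if rec.get("label") not in (0, 1):
        problems.append("label: not binary")
    # Fragment relationship: the concatenated CDRs must match the parts.
    if seqs["H-CDRs"] != seqs["H-CDR1"] + seqs["H-CDR2"] + seqs["H-CDR3"]:
        problems.append("H-CDRs: does not equal CDR1+CDR2+CDR3")
    return problems
```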


Implementation Files

Primary Script

File: preprocessing/harvey/step2_extract_fragments.py
Purpose: Extract VHH fragments using ANARCI (IMGT numbering)
Dependencies:
- pandas: DataFrame manipulation
- riot_na: ANARCI wrapper for antibody annotation
- tqdm: Progress bar for 141K sequences

Usage:

python3 preprocessing/harvey/step2_extract_fragments.py

Expected Runtime: ~10-30 minutes (141K sequences × ANARCI annotation)

Validation Script

File: scripts/validate_harvey_processing.py
Purpose: Verify fragment extraction correctness
Checks: Row counts, sequence composition, label preservation, fragment relationships

Usage:

python3 scripts/validate_harvey_processing.py


Output Structure

data/test/harvey/fragments/
├── VHH_only_harvey.csv       (141,474 rows)
├── H-CDR1_harvey.csv          (141,474 rows)
├── H-CDR2_harvey.csv          (141,474 rows)
├── H-CDR3_harvey.csv          (141,474 rows)
├── H-CDRs_harvey.csv          (141,474 rows)
└── H-FWRs_harvey.csv          (141,474 rows)

Total: 6 fragment files (nanobody-specific, no light chain)


Comparison with Jain/Shehata Preprocessing

Similarities

  1. ANARCI annotation with IMGT numbering scheme
  2. Fragment extraction for ESM-1v embedding
  3. CSV output with standardized columns
  4. Validation of row counts and sequence composition

Differences

| Aspect | Jain/Shehata | Harvey |
|---|---|---|
| Antibody Type | Full IgG (VH+VL) | Nanobody (VHH only) |
| Chain Types | Heavy + Light (2 chains) | Heavy only (1 chain) |
| Fragment Count | 16 files (H+L combinations) | 6 files (H only) |
| Dataset Size | 137-398 sequences | 141,474 sequences |
| Label Type | Multi-flag ELISA (0-6) | Binary PSR (0/1) |
| Source | Excel (SD01-SD03) | HuggingFace CSV |
| Use Case | Test set | Test set |

File Mapping

Jain/Shehata fragments NOT applicable to Harvey:
- ❌ VL_only (no light chain)
- ❌ L-CDR1/2/3 (no light chain)
- ❌ L-CDRs, L-FWRs (no light chain)
- ❌ VH+VL (nanobody = VHH only)
- ❌ All-CDRs, All-FWRs (would just be H-CDRs, H-FWRs)
- ❌ Full (same as VHH)

Harvey-specific fragments (VHH analogs):
- ✅ VHH_only ≈ VH_only (full variable domain)
- ✅ H-CDR1/2/3 (same as Jain/Shehata)
- ✅ H-CDRs (same as Jain/Shehata)
- ✅ H-FWRs (same as Jain/Shehata)


Edge Cases & Error Handling

ANARCI Annotation Failures

Possible causes:
- Invalid sequence (non-amino acid characters)
- Sequence too short/long for nanobody
- ANARCI cannot assign numbering

Handling:

1. Log failed sequences with IDs
2. Skip failed sequences (continue processing)
3. Report failure count and IDs in summary
4. Do NOT halt entire pipeline for individual failures

Expected failure rate: <1% (based on Jain/Shehata experience; actual run on 2025-11-01 saw 453 failures = 0.32%)
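The skip-and-log policy can be sketched as below. The annotator is passed in as a plain callable so the example stays independent of any particular library; `annotate_all` and `failure_rate` are hypothetical names, and treating a `None` return value as a failure is an assumption about how the annotator signals that no numbering could be assigned.

```python
import logging

logger = logging.getLogger("harvey_preprocessing")


def annotate_all(sequences, annotate):
    """Annotate (seq_id, sequence) pairs, skipping and logging failures.

    `annotate` is any callable taking (seq_id, sequence); an exception or a
    None result is treated as a failure (assumption).
    """
    results, failed_ids = {}, []
    for seq_id, seq in sequences:
        try:
            annotation = annotate(seq_id, seq)
        except Exception as exc:  # one bad sequence must not halt the run
            logger.warning("Annotation failed for %s: %s", seq_id, exc)
            annotation = None
        if annotation is None:
            failed_ids.append(seq_id)
            continue
        results[seq_id] = annotation
    return results, failed_ids


def failure_rate(n_failed: int, n_total: int) -> float:
    """Fraction of sequences that failed annotation."""
    return n_failed / n_total if n_total else 0.0
```

With the figures quoted above, `failure_rate(453, 141474)` is about 0.0032, i.e. 0.32%.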

Sequence Length Outliers

Nanobody expected range: 110-130 aa
Dataset actual range: 52-137 aa (from HuggingFace inspection)

Short sequences (< 100 aa):
- Possibly truncated or incomplete nanobodies
- ANARCI may fail or produce incomplete annotation
- Action: Annotate anyway, log if ANARCI fails

Long sequences (> 140 aa):
- None expected in this dataset, since the observed maximum is 137 aa
- If present, possibly includes linkers or tags
- ANARCI should handle (will annotate VHH domain)
- Action: Annotate normally

Label Verification

Input labels: 0 (low polyreactivity), 1 (high polyreactivity)
Expected distribution: ~50/50 (balanced dataset)

Validation:
- Ensure labels are binary (0 or 1 only)
- Check distribution remains balanced after processing
- Preserve original labels (no transformation)
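The label checks can be sketched with a small helper. `label_summary` is a hypothetical name; it rejects any value other than 0 or 1 and returns the class fractions so the balance can be compared before and after processing.

```python
from collections import Counter


def label_summary(labels) -> dict:
    """Return {0: fraction_low, 1: fraction_high}; raise on non-binary labels."""
    counts = Counter(labels)
    unexpected = set(counts) - {0, 1}
    if unexpected:
        raise ValueError(f"Non-binary labels found: {sorted(unexpected)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No labels provided")
    return {label: counts[label] / total for label in (0, 1)}
```

Applied to the input counts reported earlier (69,702 low and 71,772 high), this gives roughly 0.493 and 0.507.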


Documentation Deliverables

After implementation:

1. harvey_data_cleaning_log.md
   - ANARCI failures and resolutions
   - Sequence outliers and handling decisions
   - Label distribution verification

2. harvey_preprocessing_verification_report.md
   - Validation results (row counts, composition, etc.)
   - Fragment length distributions
   - Comparison with HuggingFace CDRs (if available)
   - SHA256 hashes for reproducibility

3. Update README.md
   - Add Harvey dataset to preprocessing section
   - Document fragment file structure
   - Link to Sakhnini et al. 2025 methodology
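The SHA256 hashes for the verification report could be produced with a sketch like the following. It uses only the standard library; `sha256_of_file` and `hash_fragment_dir` are hypothetical helper names, and the directory argument would be data/test/harvey/fragments/ in this project.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large CSVs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_fragment_dir(directory) -> dict:
    """Map each CSV file name in `directory` to its SHA256 hex digest."""
    return {
        path.name: sha256_of_file(path)
        for path in sorted(Path(directory).glob("*.csv"))
    }
```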

Testing Strategy

Test file: tests/test_harvey_processing.py

1. Test ANARCI annotation:
   - Known nanobody sequence → expected CDR/FWR split
   - Invalid sequence → graceful error handling

2. Test fragment concatenation:
   - CDR1+CDR2+CDR3 = H-CDRs
   - FWR1+FWR2+FWR3+FWR4 = H-FWRs

3. Test CSV generation:
   - Column names match specification
   - No missing values
   - Labels preserved

Integration Test

Process sample: First 1000 sequences from harvey.csv
Verify:
- All 6 fragment files created
- 1000 rows in each file
- No ANARCI failures (or <1%)
- Labels match input

Full Pipeline Test

Process: All 141,474 sequences
Verify: All validation checks pass (validate_harvey_processing.py)
Benchmark: Runtime, memory usage, failure rate


Approval Required Before Implementation

Questions to confirm:

1. ✅ Use combined harvey.csv (all 141K) or splits (train/val/test)?
   - Recommendation: Combined (matches Novo Nordisk test set usage)

2. ✅ Re-extract CDRs with ANARCI or trust HuggingFace CDRs?
   - Recommendation: Re-extract (ensures IMGT consistency with Jain/Shehata)

3. ✅ Generate 6 fragment files (VHH-specific) or more?
   - Recommendation: 6 files (matches nanobody structure)

4. ✅ Handle ANARCI failures by skipping or halting?
   - Recommendation: Skip and log (avoid pipeline failure)

5. ✅ Expected runtime: 10-30 minutes acceptable?
   - Recommendation: Yes (141K sequences is large but manageable)

Implementation Timeline

Phase 1: Script Development (30-60 min)
- Write preprocessing/harvey/step2_extract_fragments.py
- Adapt from preprocess_jain_p5e_s2.py (remove light chain logic)

Phase 2: Test Run (10-30 min)
- Process first 1000 sequences
- Verify output format
- Check ANARCI failure rate

Phase 3: Full Processing (10-30 min)
- Process all 141,474 sequences
- Generate 6 fragment CSV files

Phase 4: Validation (5-10 min)
- Run validate_harvey_processing.py
- Verify all checks pass

Phase 5: Documentation (30-60 min)
- Write cleaning log
- Write verification report
- Update README

Total estimated time: 1.5-3 hours


References

  • [Sakhnini et al. 2025] Sakhnini LI, et al. Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters. bioRxiv (2025). https://doi.org/10.1101/2025.04.28.650927

  • [Harvey et al. 2022] Harvey EP, et al. An in silico method to assess antibody fragment polyreactivity. Nat Commun 13, 7554 (2022). https://doi.org/10.1038/s41467-022-35276-4

  • [ANARCI] Dunbar J, Deane CM. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics (2016). https://doi.org/10.1093/bioinformatics/btv552

  • [IMGT] Lefranc MP, et al. IMGT unique numbering for immunoglobulin and T cell receptor variable domains. Dev Comp Immunol (2003).