Harvey Dataset Documentation

Date: 2025-11-06 (Updated: 2025-11-18)
Status: COMPLETE - Production model validated with PSR threshold
Pipeline: data/test/harvey/preprocessing/harvey/


✅ PRODUCTION STATUS

Test Data: 141,021 nanobody sequences (VHH only)
Accuracy: 61.33% (PSR threshold 0.5495)
Novo Benchmark: 61.7%
Gap: -0.37pp ⭐ (best parity across all test datasets)
Status: Production-ready and externally validated


Quick Start

For the complete Harvey dataset pipeline, see the authoritative documentation:

👉 data/test/harvey/README.md (SSOT)

That README contains:

  • Complete 2-step pipeline (convert raw CSVs → extract fragments)
  • Dataset statistics (141,474 → 141,021 sequences)
  • Data source information (official Harvey Lab repo)
  • Execution instructions
  • Validation procedures


Current Documentation Structure

Active Reference Docs

Technical Reports:

  • harvey_p0_fix_report.md - Gap character fix (ANARCI sequence_aa vs sequence_alignment_aa)
  • harvey_test_results.md - Benchmark validation (61.33% with PSR threshold 0.5495 vs Novo's 61.7%)

Methodology & Status:

  • harvey_data_sources.md - Data provenance and source verification
  • harvey_preprocessing_implementation_plan.md - Implementation methodology
  • harvey_script_status.md - Current script status

Historical Archive

Development logs (November 2025):

  • archive/harvey_data_cleaning_log.md - Initial data discovery
  • archive/harvey_script_audit_request.md - External code audit
  • archive/harvey_cleanup_investigation.md - File reorganization plan
  • archive/harvey_cleanup_verification.md - Cleanup verification results

⚠️ Note: Archived docs contain historical status warnings that do not reflect the current state.


Harvey Dataset Summary

Source: Harvey et al. (2022) - Nanobody polyreactivity from FACS + deep sequencing
Size: 141,474 nanobodies (VHH only)
Processed: 141,021 sequences (453 ANARCI failures, 0.32%)
Labels: Binary (0=low polyreactivity, 1=high polyreactivity)
Assay: PSR (Poly-Specificity Reagent)
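
The processed count and failure rate quoted above follow directly from the raw and processed totals. A minimal sketch of that arithmetic is shown here; the function name `retention_stats` is hypothetical and is not part of the pipeline scripts.

```python
# Counts taken from the dataset summary above.
RAW_SEQUENCES = 141_474
PROCESSED_SEQUENCES = 141_021


def retention_stats(raw: int, processed: int) -> dict:
    """Return the number of dropped sequences and the drop/retention rates in percent."""
    failures = raw - processed
    return {
        "failures": failures,
        "failure_rate_pct": round(100 * failures / raw, 2),
        "retained_pct": round(100 * processed / raw, 2),
    }


stats = retention_stats(RAW_SEQUENCES, PROCESSED_SEQUENCES)
print(stats)
```

The same check can be rerun against the actual row counts of the raw and processed CSVs to confirm that no sequences were lost outside the annotation step.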

Official Repository: debbiemarkslab/nanobody-polyreactivity


Key Results

P0 Fix (2025-11-02)

  • Issue: Gap characters in VHH sequences (8.6% affected)
  • Fix: Changed annotation.sequence_alignment_aa → annotation.sequence_aa
  • Result: All 141,021 sequences now ESM-1v compatible
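
To illustrate why the field choice matters, the sketch below checks a sequence for gap characters. It assumes the embedding step accepts only the 20 standard amino acid letters and that gaps appear as `-` or `.`; the record and helper names are hypothetical and do not come from the pipeline scripts.

```python
# Assumption: only the 20 standard amino acid letters are valid model input.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence: str) -> bool:
    """True if the sequence is non-empty and contains only standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


def strip_alignment_gaps(aligned: str, gap_chars: str = "-.") -> str:
    """Remove alignment gap characters (assumed here to be '-' and '.')."""
    return "".join(ch for ch in aligned if ch not in gap_chars)


# Hypothetical toy record mimicking the two annotation fields named above.
record = {
    "sequence_alignment_aa": "QVQLVESGGG-LVQAGGSLRLSC",  # aligned, contains a gap
    "sequence_aa": "QVQLVESGGGLVQAGGSLRLSC",             # ungapped
}

for field, value in record.items():
    print(field, has_only_standard_residues(value))
```

Note that the fix described above switches to the ungapped field instead of stripping gaps after the fact; `strip_alignment_gaps` is included only to show how the two fields relate in this toy record.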

Benchmark Validation (2025-11-18, PSR Threshold)

  • Our result: 61.33% accuracy (PSR threshold 0.5495)
  • Novo Nordisk: 61.7% accuracy
  • Difference: Only -0.37 percentage points
  • Sensitivity: 95.5% (better than Novo's 94.2%)

Best benchmark parity achieved (smallest gap across all datasets)

Note: Harvey uses the PSR assay, which requires a PSR-specific decision threshold (0.5495) instead of the ELISA threshold (0.5). This matches Novo's methodology.
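
The effect of the assay-specific threshold can be sketched on toy data. The probabilities and labels below are invented for illustration and are not Harvey results; the `>=` comparison is also an assumption, since the exact comparison used in test_psr_threshold.py is not stated here.

```python
PSR_THRESHOLD = 0.5495   # PSR-specific threshold quoted above
ELISA_THRESHOLD = 0.5    # ELISA threshold quoted above


def classify(probabilities, threshold):
    """Label a sequence 1 (high polyreactivity) if its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def sensitivity(predictions, labels):
    """Fraction of true positives among all positive labels."""
    positives = [p for p, y in zip(predictions, labels) if y == 1]
    return sum(positives) / len(positives)


# Invented toy data, not taken from the Harvey dataset.
probs = [0.10, 0.52, 0.54, 0.56, 0.90]
labels = [0, 0, 1, 1, 1]

for name, threshold in [("ELISA", ELISA_THRESHOLD), ("PSR", PSR_THRESHOLD)]:
    preds = classify(probs, threshold)
    print(name, preds, accuracy(preds, labels), sensitivity(preds, labels))

# Gap to the Novo benchmark in percentage points, using the figures above.
gap_pp = round(61.33 - 61.7, 2)
print(gap_pp)
```

Predictions that fall between 0.5 and 0.5495 are the only ones whose label changes when switching thresholds, which is why the choice matters most for borderline sequences.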


Data:

  • Raw CSVs: data/test/harvey/raw/
  • Processed: data/test/harvey/processed/harvey.csv
  • Fragments: data/test/harvey/fragments/*.csv (6 fragment types)

Scripts:

  • Step 1: preprocessing/harvey/step1_convert_raw_csvs.py
  • Step 2: preprocessing/harvey/step2_extract_fragments.py
  • Testing: preprocessing/harvey/test_psr_threshold.py

Tests:

  • Embedding compatibility: tests/test_harvey_embedding_compatibility.py
  • Fragment validation: scripts/validation/validate_fragments.py


Last Updated: 2025-11-18
Status: Production ready - Best benchmark parity across all test datasets