Harvey Dataset Documentation¶
Date: 2025-11-06 (Updated: 2025-11-18)
Status: ✅ COMPLETE - Production model validated with PSR threshold
Pipeline: data/test/harvey/ → preprocessing/harvey/
✅ PRODUCTION STATUS¶
Test Data: 141,021 nanobody sequences (VHH only) Accuracy: 61.33% (PSR threshold 0.5495) Novo Benchmark: 61.7% Gap: -0.37pp ⭐ (best parity across all test datasets) Status: ✅ Production-ready and externally validated
Quick Start¶
For the complete Harvey dataset pipeline, see the authoritative documentation:
👉 data/test/harvey/README.md ← SSOT
That README contains: - Complete 2-step pipeline (convert raw CSVs → extract fragments) - Dataset statistics (141,474 → 141,021 sequences) - Data source information (official Harvey Lab repo) - Execution instructions - Validation procedures
Current Documentation Structure¶
Active Reference Docs¶
Technical Reports:
- harvey_p0_fix_report.md - Gap character fix (ANARCI sequence_aa vs sequence_alignment_aa)
- harvey_test_results.md - Benchmark validation (61.33% with PSR threshold 0.5495 vs Novo's 61.7%)
Methodology & Status:
- harvey_data_sources.md - Data provenance and source verification
- harvey_preprocessing_implementation_plan.md - Implementation methodology
- harvey_script_status.md - Current script status
Historical Archive¶
Development logs (November 2025):
- archive/harvey_data_cleaning_log.md - Initial data discovery
- archive/harvey_script_audit_request.md - External code audit
- archive/harvey_cleanup_investigation.md - File reorganization plan
- archive/harvey_cleanup_verification.md - Cleanup verification results
⚠️ Note: Archived docs contain historical status warnings that do not reflect current state.
Harvey Dataset Summary¶
Source: Harvey et al. (2022) - Nanobody polyreactivity from FACS + deep sequencing Size: 141,474 nanobodies (VHH only) Processed: 141,021 sequences (453 ANARCI failures, 0.32%) Labels: Binary (0=low polyreactivity, 1=high polyreactivity) Assay: PSR (Poly-Specificity Reagent)
Official Repository: debbiemarkslab/nanobody-polyreactivity
Key Results¶
P0 Fix (2025-11-02)¶
- Issue: Gap characters in VHH sequences (8.6% affected)
- Fix: Changed
annotation.sequence_alignment_aa→annotation.sequence_aa - Result: All 141,021 sequences now ESM-1v compatible
Benchmark Validation (2025-11-18, PSR Threshold)¶
- Our result: 61.33% accuracy (PSR threshold 0.5495)
- Novo Nordisk: 61.7% accuracy
- Difference: Only -0.37 percentage points ⭐
- Sensitivity: 95.5% (better than Novo's 94.2%)
✅ Best benchmark parity achieved (smallest gap across all datasets)
Note: Harvey uses PSR assay, requiring PSR-specific threshold (0.5495) instead of ELISA threshold (0.5). This matches Novo's methodology.
Quick Links¶
Data:
- Raw CSVs: data/test/harvey/raw/
- Processed: data/test/harvey/processed/harvey.csv
- Fragments: data/test/harvey/fragments/*.csv (6 fragment types)
Scripts:
- Step 1: preprocessing/harvey/step1_convert_raw_csvs.py
- Step 2: preprocessing/harvey/step2_extract_fragments.py
- Testing: preprocessing/harvey/test_psr_threshold.py
Tests:
- Embedding compatibility: tests/test_harvey_embedding_compatibility.py
- Fragment validation: scripts/validation/validate_fragments.py
References¶
- Harvey et al. (2022): An in silico method to assess antibody fragment polyreactivity. Nat Commun 13, 7554. DOI: 10.1038/s41467-022-35276-4
- Sakhnini et al. (2025): Prediction of Antibody Non-Specificity using Protein Language Models. bioRxiv DOI: 10.1101/2025.04.28.650927
Last Updated: 2025-11-18 Status: ✅ Production ready - Best benchmark parity across all test datasets