
⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup

This document describes an external audit request from 2025-11-01 during the initial Harvey script development.

For current pipeline documentation, see: data/test/harvey/README.md

Status warnings below are historical and do not reflect the current state.


Harvey Preprocessing Script - Audit Request

Date: 2025-11-01 (Historical)
Script: preprocessing/harvey/step2_extract_fragments.py
Purpose: Extract VHH (nanobody) fragments from the Harvey dataset for ESM-1v testing


⚠️ Dataset Source Verification Issue (2025-11-01 17:30)

NOTE: The audit findings below remain valid for the processing methodology. The dataset source was subsequently flagged as requiring verification: HuggingFace ZYMScott/polyreaction is the combined Harvey+GP-nano dataset, not pure Harvey 2022.

Script validity: The processing logic is correct and ready to run once the correct dataset source is verified.


Request

Please audit preprocessing/harvey/step2_extract_fragments.py against the following specifications:

  1. Does it follow the methodology in docs/datasets/harvey/harvey_preprocessing_implementation_plan.md?
  2. Does it match the codebase style of preprocessing/shehata/step2_extract_fragments.py?
  3. Are there any bugs, edge cases, or issues?
  4. Will it correctly process 141,474 nanobody sequences?

Specifications (from harvey_preprocessing_implementation_plan.md)

Input File

  • Path: data/test/harvey/processed/harvey.csv
  • Rows: 141,474 nanobodies (plus header)
  • Columns: seq, CDR1_nogaps, CDR2_nogaps, CDR3_nogaps, label
  • Labels: Binary (0=low polyreactivity, 1=high polyreactivity)

Processing Method

  • Tool: ANARCI (riot_na) with IMGT numbering scheme
  • Sequence type: VHH (nanobody) - heavy chain only, NO light chain
  • Expected failures: <1% (based on Shehata experience)
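
For reference, the core annotation call is expected to look roughly like the sketch below. Only create_riot_aa() is confirmed by the Shehata template; the run_on_sequence signature is an assumption to verify against that script.

```python
from riot_na import create_riot_aa

# ANARCI-style amino-acid annotator (IMGT numbering), as used in the Shehata script.
annotator = create_riot_aa()

# Assumed call pattern (verify against preprocessing/shehata/step2_extract_fragments.py).
# Heavy-chain-only annotation: VHH sequences have no light chain to process.
result = annotator.run_on_sequence("harvey_000001", "QVQLVESGG...")
```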

Output

Directory: data/test/harvey/fragments/

6 fragment CSV files:

  1. VHH_only_harvey.csv - Full nanobody variable domain
  2. H-CDR1_harvey.csv - Heavy CDR1
  3. H-CDR2_harvey.csv - Heavy CDR2
  4. H-CDR3_harvey.csv - Heavy CDR3
  5. H-CDRs_harvey.csv - Concatenated H-CDR1+2+3
  6. H-FWRs_harvey.csv - Concatenated H-FWR1+2+3+4

CSV Format (all files):

id,sequence,label,source,sequence_length
harvey_000001,QVQLVESGG...,1,harvey2022,127

Columns:

  • id: Generated ID (harvey_000001, harvey_000002, etc.)
  • sequence: Fragment sequence (CDR, FWR, or full VHH)
  • label: Binary polyreactivity (0=low, 1=high)
  • source: Dataset provenance ("harvey2022")
  • sequence_length: Fragment length in amino acids
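
For illustration, one output row as the script would hold it in memory (values taken from the example above):

```python
row = {
    "id": "harvey_000001",       # generated ID, zero-padded to 6 digits
    "sequence": "QVQLVESGG...",  # fragment sequence (CDR, FWR, or full VHH)
    "label": 1,                  # binary polyreactivity (0=low, 1=high)
    "source": "harvey2022",      # dataset provenance
    "sequence_length": 127,      # fragment length in amino acids
}
```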

Expected Performance

  • Processing time: 10-30 minutes (141K sequences)
  • Success rate: >99% (ANARCI annotation)
  • Label distribution: ~50/50 (balanced dataset)

Comparison with Shehata Script

The Harvey script should be similar to preprocessing/shehata/step2_extract_fragments.py, BUT:

Same as Shehata:

  • ✅ Uses riot_na.create_riot_aa() for ANARCI
  • ✅ Same annotate_sequence() pattern
  • ✅ Same CSV output format (id, sequence, label, source)
  • ✅ Same error handling (skip failures, log warnings)
  • ✅ Same progress bar (tqdm)

Different from Shehata:

  • ❌ NO light chain processing (VHH only)
  • ❌ 6 fragments (not 16)
  • ❌ Different input format (harvey.csv has "seq" column, not "heavy_seq"/"light_seq")
  • ❌ Different metadata (no psr_score, no b_cell_subset)
  • ❌ IDs generated (harvey_XXXXXX) instead of loaded from CSV
  • ❌ Much larger dataset (141K vs 398 sequences)

Key Verification Points

1. Input Handling

  • Reads data/test/harvey/processed/harvey.csv correctly
  • Accesses row["seq"] (not row["heavy_seq"])
  • Handles 141,474 rows without memory issues
  • Generates IDs as harvey_000001, harvey_000002, etc.
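
A minimal sketch of the expected read-and-ID pattern, assuming pandas as in the Shehata template:

```python
import pandas as pd

df = pd.read_csv("data/test/harvey/processed/harvey.csv")
assert len(df) == 141_474, f"expected 141,474 data rows, got {len(df)}"

for idx, row in df.iterrows():
    seq_id = f"harvey_{idx + 1:06d}"  # idx is 0-based, so idx+1 gives harvey_000001, ...
    sequence = row["seq"]             # Harvey column is "seq", not "heavy_seq"
```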

2. ANARCI Annotation

  • Uses heavy chain annotation only (no light chain calls)
  • Extracts 8 raw fragments: full_seq_H, fwr1-4_aa_H, cdr1-3_aa_H
  • Creates 2 concatenated fragments: cdrs_H, fwrs_H
  • Handles annotation failures gracefully (skip, log, continue)
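
A sketch of the per-sequence extraction step. The attribute names on the annotation result (cdr1_aa, fwr1_aa, etc.) are assumed AIRR-style fields and should be verified against the Shehata script:

```python
def extract_fragments(result) -> dict:
    """Collect 8 raw + 2 concatenated heavy-chain fragments from one annotation result.

    Attribute names below are assumed AIRR-style fields; verify against
    preprocessing/shehata/step2_extract_fragments.py before relying on them.
    """
    frags = {
        "full_seq_H": result.sequence_alignment_aa,
        "fwr1_aa_H": result.fwr1_aa,
        "fwr2_aa_H": result.fwr2_aa,
        "fwr3_aa_H": result.fwr3_aa,
        "fwr4_aa_H": result.fwr4_aa,
        "cdr1_aa_H": result.cdr1_aa,
        "cdr2_aa_H": result.cdr2_aa,
        "cdr3_aa_H": result.cdr3_aa,
    }
    # Two concatenated fragments derived from the raw ones
    frags["cdrs_H"] = frags["cdr1_aa_H"] + frags["cdr2_aa_H"] + frags["cdr3_aa_H"]
    frags["fwrs_H"] = (
        frags["fwr1_aa_H"] + frags["fwr2_aa_H"] + frags["fwr3_aa_H"] + frags["fwr4_aa_H"]
    )
    return frags
```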

3. Fragment CSV Generation

  • Creates exactly 6 CSV files (not 16)
  • Correct fragment mapping:
      • VHH_only → full_seq_H
      • H-CDR1 → cdr1_aa_H
      • H-CDR2 → cdr2_aa_H
      • H-CDR3 → cdr3_aa_H
      • H-CDRs → cdrs_H
      • H-FWRs → fwrs_H
  • All files have 5 columns: id, sequence, label, source, sequence_length
  • Source is "harvey2022" (not "harvey2025" or other)
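
The mapping and writing step, sketched with pandas (the helper name and record layout are illustrative, not the script's actual code):

```python
import os
import pandas as pd

FRAGMENT_MAP = {
    "VHH_only": "full_seq_H",
    "H-CDR1": "cdr1_aa_H",
    "H-CDR2": "cdr2_aa_H",
    "H-CDR3": "cdr3_aa_H",
    "H-CDRs": "cdrs_H",
    "H-FWRs": "fwrs_H",
}

def write_fragment_csvs(records: list[dict], out_dir: str = "data/test/harvey/fragments") -> None:
    """records: one dict per annotated sequence with id, label, and fragment keys."""
    os.makedirs(out_dir, exist_ok=True)
    for frag_name, frag_key in FRAGMENT_MAP.items():
        rows = [
            {
                "id": r["id"],
                "sequence": r[frag_key],
                "label": r["label"],
                "source": "harvey2022",
                "sequence_length": len(r[frag_key]),
            }
            for r in records
        ]
        pd.DataFrame(rows).to_csv(f"{out_dir}/{frag_name}_harvey.csv", index=False)
```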

4. Error Handling

  • Checks if data/test/harvey/processed/harvey.csv exists
  • Try/except around ANARCI calls
  • Logs failed sequences to stderr
  • Continues processing after failures (doesn't halt)
  • Reports failure count in summary
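
The loop-level behavior being asked for, sketched using the annotator and extract_fragments() from the earlier snippets (the run_on_sequence signature remains an assumption):

```python
import sys
from tqdm import tqdm

records, failures = [], 0
for idx, row in tqdm(df.iterrows(), total=len(df)):
    seq_id = f"harvey_{idx + 1:06d}"
    try:
        result = annotator.run_on_sequence(seq_id, row["seq"])  # assumed signature
        records.append({"id": seq_id, "label": row["label"], **extract_fragments(result)})
    except Exception as exc:
        failures += 1
        print(f"WARNING: annotation failed for {seq_id}: {exc}", file=sys.stderr)
        continue  # skip and keep going; never halt the 141K run on one bad sequence

print(f"Annotated {len(records)} sequences, {failures} failures ({failures / len(df):.2%})")
```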

5. Output Validation

  • All 6 fragment files created
  • Row count: ~141K (minus ANARCI failures)
  • Label distribution preserved from input
  • No missing values in output CSVs
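
A quick post-run validation pass could look like this (sketch, reusing FRAGMENT_MAP from above):

```python
import pandas as pd

EXPECTED_COLUMNS = ["id", "sequence", "label", "source", "sequence_length"]

for frag_name in FRAGMENT_MAP:
    path = f"data/test/harvey/fragments/{frag_name}_harvey.csv"
    out = pd.read_csv(path)
    assert list(out.columns) == EXPECTED_COLUMNS, f"bad columns in {path}"
    assert not out.isna().any().any(), f"missing values in {path}"
    # Row count should be ~141K minus ANARCI failures; label split should stay ~50/50
    print(frag_name, len(out), out["label"].value_counts(normalize=True).round(3).to_dict())
```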

Specific Code Review Questions

  1. Line 92: Does seq_id = f"harvey_{idx+1:06d}" correctly generate harvey_000001, harvey_000002, etc.?
     • Note: idx is 0-indexed, so idx+1 is needed

  2. Line 94: Is row["seq"] the correct column name?
     • Verify against the data/test/harvey/processed/harvey.csv header

  3. Line 99: Is the metadata complete?
     • Should it include anything from HuggingFace besides label?

  4. Lines 130-141: Are the fragment names and mappings correct?
     • VHH_only → full_seq_H ✓
     • H-CDR1 → cdr1_aa_H ✓
     • H-CDRs → cdrs_H ✓
     • H-FWRs → fwrs_H ✓

  5. Line 198: Is the output directory path correct?
     • Should be data/test/harvey/fragments/ (new structure after cleanup)

  6. Memory usage: Will loading 141K sequences into a DataFrame cause issues?
     • Estimate: ~20-30 MB for the CSV, 100-200 MB for processed data (acceptable)
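
The memory estimate can be checked empirically once the CSV is loaded:

```python
import pandas as pd

df = pd.read_csv("data/test/harvey/processed/harvey.csv")
mem_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"In-memory size: {mem_mb:.0f} MB")  # should land comfortably in the estimated range
```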

Expected Runtime

Conservative estimates:

  • ANARCI annotation: ~0.1-0.2 s per sequence
  • Total: 141,474 × 0.15 s ≈ 6 hours (worst case)

Optimistic estimates:

  • ANARCI annotation: ~0.01-0.05 s per sequence (if riot_na is fast)
  • Total: 141,474 × 0.03 s ≈ 70 minutes
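
Before committing to the full run, the per-sequence cost can be measured on a small sample and extrapolated (sketch, using the assumed annotator call from above):

```python
import time

sample = df["seq"].head(100)
start = time.perf_counter()
for seq in sample:
    annotator.run_on_sequence("timing_probe", seq)  # assumed signature, as above
per_seq = (time.perf_counter() - start) / len(sample)
print(f"{per_seq:.3f} s/sequence -> full run ≈ {per_seq * len(df) / 60:.0f} minutes")
```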

Recommendation: Run in tmux/screen for safety


Audit Checklist

Code Quality

  • Follows PEP 8 (black formatted)
  • Type hints on all functions
  • Docstrings complete
  • Error messages informative
  • Progress tracking with tqdm

Correctness

  • Matches harvey_preprocessing_implementation_plan.md
  • Consistent with the preprocessing/shehata/step2_extract_fragments.py pattern
  • No light chain code (common bug)
  • Correct column names for harvey.csv

Performance

  • Will handle 141K sequences
  • No obvious bottlenecks
  • Memory usage reasonable

Edge Cases

  • Handles ANARCI failures
  • Handles missing input file
  • Handles malformed sequences
  • Reports errors clearly

References

  • Implementation plan: docs/datasets/harvey/harvey_preprocessing_implementation_plan.md
  • Data sources: docs/datasets/harvey/harvey_data_sources.md
  • Cleaning log: docs/datasets/harvey/archive/harvey_data_cleaning_log.md
  • Template script: preprocessing/shehata/step2_extract_fragments.py
  • Sakhnini et al. 2025: Table 4 (Harvey dataset specs)

Auditor: Please verify the script is correct before running on 141K sequences.