Boughter Dataset Documentation

Purpose: Reference documentation for the Boughter 2020 antibody polyreactivity dataset processing and methodology.

Dataset: Boughter et al. (2020) - 1,171 antibodies (6 subsets) with ELISA polyreactivity measurements

Quick Start

For Implementation

→ See preprocessing/boughter/README.md - SINGLE SOURCE OF TRUTH for pipeline implementation, scripts, and usage

For Understanding the Dataset

Start here with this directory's documentation ↓

Documentation Structure

Core Reference Documents

1. complete_history.md

Purpose: Master historical reference and QC level comparison

- Complete dataset evolution (1,171 → 1,117 → 1,110 → 1,065 → 914 training)
- Two QC levels explained (Boughter QC vs Strict QC)
- Position 118 resolution
- Training file selection guidance
- Why strict QC was archived (no performance improvement)

Read this if: You want comprehensive context on dataset processing decisions

2. novo_methodology_clarification.md

Purpose: Critical methodological insight resolving an apparent contradiction in Novo's paper

- "Boughter methodology" = QC filtering + flagging (NOT CDR boundaries)
- Evidence from Boughter's actual seq_loader.py code
- Novo's pipeline: ANARCI/IMGT → Boughter QC → Boughter flagging
- Resolves: How can you use "Boughter methodology" AND "ANARCI/IMGT"?

Read this if: You're replicating Novo Nordisk methodology or confused about CDR boundaries

3. p0_fix_report.md

Purpose: Essential bug fix documentation (P0 blocker)

- Gap character contamination (13 sequences, 1.2%)
- Stop codon contamination (241 sequences, 22.6%)
- Fix: V-domain reconstruction from fragments
- Test suite: 5/5 tests passing
- Why the bug only affected Boughter (DNA input) and not Harvey/Shehata (protein input)

Read this if: You want to understand why V-domain reconstruction was necessary
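The two contamination types named above can be screened for with a simple character check. The following is a minimal sketch, not the project's actual test suite. It assumes gaps appear as "-" and stop codons as "*" in the translated protein strings (common conventions, not confirmed by this document), and the function names are hypothetical.

```python
# Hypothetical helper names; "-" for gaps and "*" for stop codons are assumed conventions.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def find_contamination(seq: str) -> dict:
    """Report gap characters, stop codons, and any other non-standard symbols."""
    return {
        "has_gap": "-" in seq,
        "has_stop": "*" in seq,
        "other_invalid": sorted(set(seq) - VALID_AA - {"-", "*"}),
    }


def is_clean(seq: str) -> bool:
    """True when the sequence contains only standard amino acid letters."""
    report = find_contamination(seq)
    return not (report["has_gap"] or report["has_stop"] or report["other_invalid"])


if __name__ == "__main__":
    for candidate in ["EVQLVESGG", "EVQ-LVESGG", "EVQLV*SGG"]:
        print(candidate, is_clean(candidate))
```

A check like this is only useful after translation; protein-input datasets would normally pass it without changes, which is consistent with the bug being specific to DNA input.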

Technical Analysis Documents

4. cdr_boundary_investigation.md

Purpose: CDR boundary technical analysis

- Position 118 discrepancy (Boughter's IgBLAST vs IMGT standard)
- Biological rationale (position 118 is the Framework 4 anchor, W or F, 99% conserved)
- Harvey et al. 2022 validation (CDR2 variable lengths are normal)
- Resolution: Use strict IMGT (CDR-H3 = positions 105-117)

Read this if: You need technical justification for CDR boundary decisions
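The strict IMGT resolution (CDR-H3 = positions 105-117, with position 118 left in Framework 4) can be illustrated with a short sketch. The input format here is an assumption: a list of hypothetical (position, insertion_code, residue) tuples, with "-" marking unoccupied positions. It is not the pipeline's actual data structure.

```python
# Sketch only: input is assumed to be (imgt_position, insertion_code, residue) tuples.
CDR_H3_START = 105  # first CDR-H3 position under strict IMGT
CDR_H3_END = 117    # last CDR-H3 position; 118 (W/F anchor) belongs to Framework 4


def extract_cdr_h3(numbered):
    """Join residues at IMGT positions 105-117, skipping gap placeholders."""
    return "".join(
        residue
        for position, _insertion, residue in numbered
        if CDR_H3_START <= position <= CDR_H3_END and residue != "-"
    )


if __name__ == "__main__":
    toy = [
        (104, "", "C"), (105, "", "A"), (106, "", "R"), (107, "", "D"),
        (111, "", "Y"), (111, "A", "G"), (112, "A", "S"), (112, "", "Y"),
        (117, "", "Y"), (118, "", "W"), (119, "", "G"),
    ]
    print(extract_cdr_h3(toy))
```

In the toy input, the cysteine at 104 and the tryptophan at 118 are both excluded, so the conserved anchor never enters the CDR-H3 fragment.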

5. data_sources.md

Purpose: Novo Nordisk methodology requirements specification

- Complete requirements from Sakhnini et al. 2025 paper
- ELISA polyreactivity panel (7 antigens)
- Flagging strategy (0, 1-3 exclude, 4+)
- What IS and is NOT specified in Novo's paper
- Updates through 2025-11-04 clarification

Read this if: You're implementing Novo's methodology from scratch

6. cdr_boundary_first_principles_audit.md

Purpose: Gold standard first-principles analysis of CDR boundaries

- Rigorous analysis from IMGT.org official documentation
- Multi-source validation (IMGT, Boughter code, Harvey paper)
- Biological + ML rationale for excluding position 118
- 2025 best practices (post-annotation QC)

Read this if: You need the most authoritative technical reference

Document Hierarchy

preprocessing/boughter/README.md ← SINGLE SOURCE OF TRUTH (implementation)
         ↑
         └── References these docs for methodology and context

docs/datasets/boughter/
├── README.md (THIS FILE) ← Documentation index
│
├── complete_history.md ← Master reference
├── novo_methodology_clarification.md ← Key insight
├── p0_fix_report.md ← Critical bug fix
│
├── cdr_boundary_investigation.md ← CDR boundaries
├── data_sources.md ← Novo requirements
├── cdr_boundary_first_principles_audit.md ← Gold standard reference
│
└── archive/ ← Historical investigation and status reports
    ├── README.md ← Archive index
    ├── BOUGHTER_NOVO_REPLICATION_ANALYSIS.md ← Investigation process
    ├── accuracy_verification_report.md ← Pre-P0 fix report
    └── boughter_processing_status.md ← Status report (2025-11-02)

Key Facts (from current implementation)

Pipeline Statistics

Raw input: 1,171 DNA sequences (6 subsets)
Stage 1 (DNA translation): 1,117 protein sequences (95.4% success)
Stage 2 (ANARCI annotation): 1,110 annotated (99.4% success)
Stage 3 (QC filtering): 1,065 clean sequences (95.9% retention)
Training subset: 914 sequences (Novo flagging: 0 and 4+ flags only)
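The stage percentages above are each computed relative to the previous stage, not to the raw input. A small sketch that reproduces them from the counts listed (the function and variable names are illustrative only):

```python
# Counts are taken from the statistics above; names are illustrative.
STAGES = [
    ("Raw input", 1171),
    ("Stage 1 (DNA translation)", 1117),
    ("Stage 2 (ANARCI annotation)", 1110),
    ("Stage 3 (QC filtering)", 1065),
]


def stage_retention(stages):
    """Percentage of sequences kept at each stage, relative to the previous stage."""
    return [
        (name, round(100 * count / previous_count, 1))
        for (_, previous_count), (name, count) in zip(stages, stages[1:])
    ]


if __name__ == "__main__":
    for name, percent in stage_retention(STAGES):
        print(f"{name}: {percent}%")
```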

Novo Flagging Strategy

0 flags → Specific (label=0, include in training)
1-3 flags → Mildly polyreactive (EXCLUDE from training)
4+ flags → Non-specific (label=1, include in training)
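The flagging rule maps a flag count to a binary label or to exclusion. A minimal sketch of that mapping follows; the function name is hypothetical, and returning None for excluded sequences is one possible convention, not necessarily what the pipeline does.

```python
from typing import Optional


def novo_label(n_flags: int) -> Optional[int]:
    """Map a flag count to a training label.

    0 flags  -> 0 (specific)
    1-3      -> None (mildly polyreactive, excluded from training)
    4+ flags -> 1 (non-specific)
    """
    if n_flags < 0:
        raise ValueError("flag count cannot be negative")
    if n_flags == 0:
        return 0
    if n_flags >= 4:
        return 1
    return None


if __name__ == "__main__":
    counts = [0, 1, 3, 4, 6]
    # Keep only sequences that receive a label.
    training = [(n, novo_label(n)) for n in counts if novo_label(n) is not None]
    print(training)
```

Dropping the 1-3 flag band is what reduces the 1,065 clean sequences to the 914-sequence training subset.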

16 Fragment Types

VH_only, VL_only (full variable domains)
H-CDR1, H-CDR2, H-CDR3 (heavy chain CDRs)
L-CDR1, L-CDR2, L-CDR3 (light chain CDRs)
H-CDRs, L-CDRs (concatenated CDRs)
H-FWRs, L-FWRs (concatenated frameworks)
VH+VL (paired variable domains)
All-CDRs, All-FWRs (all concatenated)
Full (alias for VH+VL)
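To make the relationship between the 16 fragment types concrete, here is a sketch that derives all of them from per-chain region strings. The region keys (FWR1 to FWR4, CDR1 to CDR3), the heavy-then-light concatenation order, and the function names are assumptions for illustration. The authoritative definitions are in preprocessing/boughter/README.md.

```python
# Sketch under assumptions: region key names and concatenation order are hypothetical.
REGION_ORDER = ["FWR1", "CDR1", "FWR2", "CDR2", "FWR3", "CDR3", "FWR4"]
CDR_NAMES = [name for name in REGION_ORDER if name.startswith("CDR")]
FWR_NAMES = [name for name in REGION_ORDER if name.startswith("FWR")]


def join_regions(chain: dict, names: list) -> str:
    """Concatenate the named regions of one chain in the given order."""
    return "".join(chain[name] for name in names)


def build_fragments(heavy: dict, light: dict) -> dict:
    """Build the 16 fragment strings from heavy and light chain regions."""
    fragments = {
        "VH_only": join_regions(heavy, REGION_ORDER),
        "VL_only": join_regions(light, REGION_ORDER),
    }
    for prefix, chain in (("H", heavy), ("L", light)):
        for name in CDR_NAMES:
            fragments[f"{prefix}-{name}"] = chain[name]
        fragments[f"{prefix}-CDRs"] = join_regions(chain, CDR_NAMES)
        fragments[f"{prefix}-FWRs"] = join_regions(chain, FWR_NAMES)
    fragments["VH+VL"] = fragments["VH_only"] + fragments["VL_only"]
    fragments["All-CDRs"] = fragments["H-CDRs"] + fragments["L-CDRs"]
    fragments["All-FWRs"] = fragments["H-FWRs"] + fragments["L-FWRs"]
    fragments["Full"] = fragments["VH+VL"]  # alias, as listed above
    return fragments


if __name__ == "__main__":
    heavy = dict(FWR1="EVQ", CDR1="GFT", FWR2="WVR", CDR2="ISG",
                 FWR3="RFT", CDR3="ARD", FWR4="WGQ")
    light = dict(FWR1="DIQ", CDR1="QSI", FWR2="WYQ", CDR2="AAS",
                 FWR3="GVP", CDR3="QQS", FWR4="FGG")
    for name, sequence in build_fragments(heavy, light).items():
        print(name, sequence)
```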

Training Files

Production: data/train/boughter/canonical/VH_only_boughter_training.csv (914 sequences)
Fragment files: data/train/boughter/annotated/*_boughter.csv (1,065 sequences each, 16 files)

References

Primary Papers

Boughter et al. (2020) - "Biochemical patterns of antibody polyreactivity revealed through a bioinformatics-based analysis of CDR loops." eLife 9:e61393. DOI: 10.7554/eLife.61393
Sakhnini et al. (2025) - "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: 10.1101/2025.04.28.650927

Supporting Papers

Harvey et al. (2022) - Validation of CDR2 variable lengths
IMGT documentation - CDR/Framework boundary definitions

For Contributors

When to Update Documentation

Implementation changes → Update preprocessing/boughter/README.md (SSOT)
Methodology insights → Update relevant docs/datasets/boughter/*.md files
Bug fixes → Create new report (follow p0_fix_report.md pattern)
Historical/debugging docs → Move to archive/ with explanation

Documentation Principles

preprocessing/boughter/README.md is the SINGLE SOURCE OF TRUTH for implementation
docs/datasets/boughter/*.md provide context, rationale, and technical justification
docs/datasets/boughter/archive/ contains historical investigation and status reports
Keep docs DRY (Don't Repeat Yourself) - reference other docs instead of duplicating

✅ Current Status

Production Model: experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
Training Data: 914 sequences (VH_only_boughter_training.csv)
Validation: Jain 68.60% (exact parity) | Shehata 58.29% | Harvey 61.33%
Status: ✅ Production-ready and externally validated

Last Updated: 2025-11-18
Documentation Version: 3.0 (production status added)
Status: ✅ Active and maintained