Jain Dataset Documentation

Purpose: Reference documentation for the Jain 2017 clinical antibody dataset processing and methodology.

Dataset: Jain et al. (2017) PNAS - 137 clinical-stage antibodies with biophysical measurements and polyreactivity data

Quick Start

For Implementation

→ See preprocessing/jain/README.md - SINGLE SOURCE OF TRUTH for pipeline implementation, scripts, and usage

For Understanding the Dataset

Start here with this directory's documentation ↓

Documentation Structure

Core Reference Documents

1. label_discrepancy_findings.md ✅ ACCURATE

Purpose: Critical investigation report documenting the 38.7% label error bug and its fix
- Discovered a mismatch between the ELISA-based SSOT and the paper-based flags_total
- Root cause: fragments were derived from the wrong labeling system
- Fix: regenerate fragments with step3_extract_fragments.py using ELISA-based labels
- Label counts (specific/non-specific/mild) before: 67/27/43 (wrong) → after: 94/22/21 (correct)

Read this if: You want to understand why labels changed or the quality control process

2. complete_guide.md ✅ ACCURATE

Purpose: Complete guide to the current preprocessing and canonical benchmark artifacts
- Canonical 86-antibody benchmark (exact Novo parity)
- File inventory and how to run verification

Read this if: You want comprehensive context (but verify against preprocessing code)

3. complete_history.md ⚠️ HISTORICAL

Purpose: Historical reference showing how the methodologies evolved
- Contains valuable historical context and retired approaches

Read this if: You need historical context on why we changed methodologies

4. reorganization_complete.md ⚠️ HISTORICAL

Purpose: Record of the file reorganization effort from 2025-11-05
- Directory structure (raw/processed/canonical/fragments) is accurate
- Verification sections may reflect older baselines

Read this if: You want to understand the file organization

5. jain_data_sources.md ✅ ACCURATE

Purpose: Source file descriptions and data provenance
- Describes the public SD files and the private ELISA source used by the pipeline

Read this if: You need to know where the raw data comes from

Current Pipeline (SSOT)

Implemented in: preprocessing/jain/ scripts

Step 1: Excel → CSV Conversion (ELISA-only)
  Raw: data/test/jain/raw/*.xlsx
  ↓ preprocessing/jain/step1_convert_excel_to_csv.py
  Output: data/test/jain/processed/jain_with_private_elisa_FULL.csv (137 antibodies)
          data/test/jain/processed/jain_ELISA_ONLY_116.csv (116 antibodies)

Step 2: P5e-S2 Novo Parity Method
  Input: jain_ELISA_ONLY_116.csv (116)
  ↓ preprocessing/jain/step2_preprocess_p5e_s2.py
    - Reclassify 5 specific→non-specific (PSR/Tm/clinical)
    - Remove 30 specific by PSR/AC-SINS sorting
    - Tier D: flip 2 labels (lebrikizumab, galiximab)
  Output: data/test/jain/canonical/jain_86_novo_parity.csv (86 antibodies)

Step 3: Fragment Extraction (ANARCI/IMGT)
  Input: jain_with_private_elisa_FULL.csv (137)
  ↓ preprocessing/jain/step3_extract_fragments.py
  Output: data/test/jain/fragments/*.csv (16 fragment types)
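
The counts in Step 2 can be checked with simple bookkeeping. The sketch below is not part of the pipeline; it only replays the three operations from the diagram on the 94 specific / 22 non-specific split of the 116-antibody set. The function name `p5e_s2_counts` is hypothetical, and the direction of the two Tier D flips (specific to non-specific) is an assumption inferred from the 57/29 split reported under Key Facts.

```python
# Count bookkeeping for Step 2, using only numbers stated in this document.
# Assumption: both Tier D flips go from specific to non-specific, the only
# direction consistent with the reported 57 specific / 29 non-specific split.

def p5e_s2_counts(specific: int, non_specific: int) -> tuple[int, int]:
    # Reclassify 5 specific antibodies as non-specific (PSR/Tm/clinical).
    specific, non_specific = specific - 5, non_specific + 5
    # Remove 30 specific antibodies by PSR/AC-SINS sorting.
    specific -= 30
    # Tier D: flip 2 labels (lebrikizumab, galiximab).
    specific, non_specific = specific - 2, non_specific + 2
    return specific, non_specific


specific, non_specific = p5e_s2_counts(specific=94, non_specific=22)
print(specific, non_specific, specific + non_specific)  # 57 29 86
```

If any of the three operation sizes changes in preprocessing/jain/step2_preprocess_p5e_s2.py, this arithmetic should be rechecked against the canonical CSV rather than trusted.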

Key Facts (Current Implementation)

Pipeline Statistics

Raw input: 137 antibodies with private ELISA data
ELISA filtering: 116 antibodies (excludes ELISA 1-3 flags)
P5e-S2 parity: 86 antibodies (57 specific / 29 non-specific)
Our result: 68.60% accuracy, CM [[40,17],[10,19]] - EXACT NOVO PARITY
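
The reported accuracy follows directly from the confusion matrix. As a minimal check, assuming rows are true labels and columns are predictions in the order specific (0), non-specific (1), the row totals should match the 57/29 class split and the diagonal should give 68.60%. The helper `accuracy` is illustrative only.

```python
# Confusion matrix as reported above.
# Assumption: rows = true labels, columns = predictions,
# both ordered [specific (0), non-specific (1)].
cm = [[40, 17], [10, 19]]


def accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


row_totals = [sum(row) for row in cm]
print(row_totals)             # [57, 29]
print(f"{accuracy(cm):.2%}")  # 68.60%
```

The row totals matching the 57 specific / 29 non-specific split supports the row/column orientation assumed here.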

Label Distribution (ELISA-based SSOT)

Specific (0): 94 antibodies
Non-specific (1): 22 antibodies
Mild (NaN): 21 antibodies (excluded from training)
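
The three label classes can be expressed as a mapping from the ELISA flag count. The sketch below is an assumption-laden illustration, not the pipeline code: this document only states that antibodies with 1-3 ELISA flags are excluded as mild, so the rules "0 flags is specific" and "4 or more flags is non-specific" are inferred by elimination, and the function name `elisa_label` is hypothetical.

```python
import math


def elisa_label(elisa_flags: int) -> float:
    """Map an ELISA flag count to a label.

    Assumed thresholds (only the 1-3 exclusion is stated in this document):
      0 flags     -> 0.0 (specific)
      1-3 flags   -> NaN (mild, excluded from training)
      4 or more   -> 1.0 (non-specific)
    """
    if elisa_flags < 0:
        raise ValueError("flag count cannot be negative")
    if elisa_flags == 0:
        return 0.0
    if elisa_flags <= 3:
        return math.nan
    return 1.0


labels = [elisa_label(n) for n in (0, 2, 5)]
# Drop mild antibodies (NaN) before training or evaluation.
kept = [x for x in labels if not math.isnan(x)]
print(kept)  # [0.0, 1.0]
```

Using NaN for the mild class means a plain "drop missing labels" step yields the 116-antibody set (94 + 22) from the 137 inputs.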

16 Fragment Types

VH_only, VL_only (full variable domains)
H-CDR1, H-CDR2, H-CDR3 (heavy chain CDRs)
L-CDR1, L-CDR2, L-CDR3 (light chain CDRs)
H-CDRs, L-CDRs (concatenated CDRs)
H-FWRs, L-FWRs (concatenated frameworks)
VH+VL (paired variable domains)
All-CDRs, All-FWRs (all concatenated)
Full (alias for VH+VL)
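
To make the relationship between the 16 fragment types concrete, the following sketch assembles them from per-region sequences. It is not taken from step3_extract_fragments.py: the region keys (`FWR1` ... `FWR4`), the function `build_fragments`, and the choice of plain N-to-C string concatenation with no separator are all assumptions for illustration, and the single-letter inputs are placeholders rather than real sequences.

```python
# Hypothetical region keys; the real pipeline's column names may differ.
REGIONS = ("FWR1", "CDR1", "FWR2", "CDR2", "FWR3", "CDR3", "FWR4")


def build_fragments(heavy: dict, light: dict) -> dict:
    """Assemble the 16 fragment types from per-region sequences.

    Assumption: concatenation is plain string joining in N-to-C order.
    """
    def join(chain, prefix):
        return "".join(chain[r] for r in REGIONS if r.startswith(prefix))

    vh = "".join(heavy[r] for r in REGIONS)
    vl = "".join(light[r] for r in REGIONS)
    # "Full" is listed as an alias for VH+VL.
    frags = {"VH_only": vh, "VL_only": vl, "VH+VL": vh + vl, "Full": vh + vl}
    for tag, chain in (("H", heavy), ("L", light)):
        for i in (1, 2, 3):
            frags[f"{tag}-CDR{i}"] = chain[f"CDR{i}"]
        frags[f"{tag}-CDRs"] = join(chain, "CDR")
        frags[f"{tag}-FWRs"] = join(chain, "FWR")
    frags["All-CDRs"] = frags["H-CDRs"] + frags["L-CDRs"]
    frags["All-FWRs"] = frags["H-FWRs"] + frags["L-FWRs"]
    return frags


# Placeholder one-letter "sequences" so the composition is easy to read.
heavy = dict(zip(REGIONS, "ABCDEFG"))
light = dict(zip(REGIONS, "HIJKLMN"))
frags = build_fragments(heavy, light)
print(len(frags))         # 16
print(frags["All-CDRs"])  # BDFIKM
```

The actual ordering and any separators used in the fragment CSVs should be confirmed against preprocessing/jain/README.md.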

References

Primary Papers

Jain et al. (2017) - "Biophysical properties of the clinical-stage antibody landscape." PNAS 114(5):944-949. DOI: https://doi.org/10.1073/pnas.1616408114
Sakhnini et al. (2025) - "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: https://doi.org/10.1101/2025.04.28.650927

For Contributors

When to Update Documentation

Implementation changes → Update preprocessing/jain/README.md (SSOT)
Methodology insights → Update relevant docs/datasets/jain/*.md files
Bug fixes → Update label_discrepancy_findings.md or create new report
Historical/debugging docs → Move to docs/datasets/jain/archive/ with explanation

Documentation Principles

preprocessing/jain/README.md is the SINGLE SOURCE OF TRUTH for implementation
docs/datasets/jain/*.md provide context, rationale, and technical justification
docs/datasets/jain/archive/ contains retired methodologies and historical analysis
Keep docs DRY - reference other docs instead of duplicating
Mark outdated sections with clear ⚠️ WARNING banners

Known Documentation Issues

Medium Priority (Historical cleanup):
- complete_history.md and reorganization_complete.md contain historical baselines and retired approaches

Low Priority (Keep As-Is):
- label_discrepancy_findings.md - Accurate historical bug report ✅

Last Updated: 2025-11-17
Documentation Version: 2.0 (post-code-drift-cleanup)
Status: ⚠️ Partially outdated - major rewrites needed for complete_guide and data_sources