Jain Dataset Documentation

Purpose: Reference documentation for the Jain 2017 clinical antibody dataset processing and methodology.

Dataset: Jain et al. (2017) PNAS - 137 clinical-stage antibodies with biophysical measurements and polyreactivity data


Quick Start

For Implementation

→ See preprocessing/jain/README.md, the SINGLE SOURCE OF TRUTH (SSOT) for pipeline implementation, scripts, and usage

For Understanding the Dataset

Start here with this directory's documentation ↓


Documentation Structure

Core Reference Documents

1. label_discrepancy_findings.md ✅ ACCURATE

Purpose: Critical investigation report documenting the 38.7% label error bug and its fix.

  • Discovered a mismatch between the ELISA-based SSOT labels and the paper-based flags_total
  • Root cause: fragments were derived from the wrong labeling system
  • Fix: regenerated the fragments with step3_extract_fragments.py using ELISA-based labels
  • Label counts (specific/non-specific/mild): before 67/27/43 (wrong) → after 94/22/21 (correct)

Read this if: You want to understand why labels changed or the quality control process


2. complete_guide.md ✅ ACCURATE

Purpose: Complete guide to the current preprocessing + canonical benchmark artifacts

  • Canonical 86-antibody benchmark (exact Novo parity)
  • File inventory and how to run verification

Read this if: You want comprehensive context (but verify against preprocessing code)


3. complete_history.md ⚠️ HISTORICAL

Purpose: Historical reference showing evolution of methodologies

  • Contains valuable historical context and retired approaches

Read this if: You need historical context on why we changed methodologies


4. reorganization_complete.md ⚠️ HISTORICAL

Purpose: File reorganization effort from 2025-11-05

  • Directory structure (raw/processed/canonical/fragments) is accurate
  • Verification sections may reflect older baselines

Read this if: You want to understand the file organization


5. jain_data_sources.md ✅ ACCURATE

Purpose: Source file descriptions and data provenance

  • Describes the public SD files + private ELISA source used by the pipeline

Read this if: You need to know where the raw data comes from


Current Pipeline (SSOT)

Implemented in: preprocessing/jain/ scripts

Step 1: Excel → CSV Conversion (ELISA-only)
  Raw: data/test/jain/raw/*.xlsx
  ↓ preprocessing/jain/step1_convert_excel_to_csv.py
  Output: data/test/jain/processed/jain_with_private_elisa_FULL.csv (137 antibodies)
          data/test/jain/processed/jain_ELISA_ONLY_116.csv (116 antibodies)

Step 2: P5e-S2 Novo Parity Method
  Input: jain_ELISA_ONLY_116.csv (116)
  ↓ preprocessing/jain/step2_preprocess_p5e_s2.py
    - Reclassify 5 specific→non-specific (PSR/Tm/clinical)
    - Remove 30 specific by PSR/AC-SINS sorting
    - Tier D: flip 2 labels (lebrikizumab, galiximab)
  Output: data/test/jain/canonical/jain_86_novo_parity.csv (86 antibodies)

Step 3: Fragment Extraction (ANARCI/IMGT)
  Input: jain_with_private_elisa_FULL.csv (137)
  ↓ preprocessing/jain/step3_extract_fragments.py
  Output: data/test/jain/fragments/*.csv (16 fragment types)
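
The Step 2 counts can be checked by hand against the label distribution of the 116-antibody set (94 specific, 22 non-specific). The sketch below is bookkeeping only, not the pipeline code. The function name is hypothetical, and it assumes both Tier D flips go from specific to non-specific, which is the direction consistent with the final 57/29 split.

```python
# Bookkeeping sketch for the P5e-S2 step. Not the actual pipeline code;
# p5e_s2_counts is a hypothetical helper used only to check the arithmetic.
def p5e_s2_counts(specific=94, non_specific=22):
    # Reclassify 5 specific -> non-specific (PSR/Tm/clinical)
    specific, non_specific = specific - 5, non_specific + 5
    # Remove 30 specific by PSR/AC-SINS sorting
    specific -= 30
    # Tier D: flip 2 labels (assumed direction: specific -> non-specific)
    specific, non_specific = specific - 2, non_specific + 2
    return specific, non_specific


specific, non_specific = p5e_s2_counts()
print(specific, non_specific, specific + non_specific)  # 57 29 86
```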

Key Facts (Current Implementation)

Pipeline Statistics

  • Raw input: 137 antibodies with private ELISA data
  • ELISA filtering: 116 antibodies (excludes the 21 mild antibodies with 1-3 ELISA flags)
  • P5e-S2 parity: 86 antibodies (57 specific / 29 non-specific)
  • Our result: 68.60% accuracy, confusion matrix (CM) [[40,17],[10,19]] - EXACT NOVO PARITY
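
The reported accuracy follows directly from the confusion matrix. A minimal check, assuming rows are true labels (specific, then non-specific) and columns are predictions; that layout is an assumption, but it matches the 57/29 class split because the rows sum to 57 and 29:

```python
# Confusion matrix from the pipeline statistics.
# Assumed layout: rows = true class, columns = predicted class.
cm = [[40, 17], [10, 19]]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


row_totals = [sum(row) for row in cm]
print(row_totals)              # [57, 29]
print(f"{accuracy(cm):.2%}")   # 68.60%
```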

Label Distribution (ELISA-based SSOT)

  • Specific (0): 94 antibodies
  • Non-specific (1): 22 antibodies
  • Mild (NaN): 21 antibodies (excluded from training)
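
The three-way labeling can be sketched as a mapping from ELISA flag count to label. The 1-3 range for mild comes from the pipeline statistics above. The other two thresholds (0 flags for specific, 4 or more for non-specific) are assumptions here, as is the function name; check preprocessing/jain/ for the actual rule.

```python
import math


def elisa_label(flags: int) -> float:
    """Map an ELISA flag count to a label (hypothetical helper).

    0 flags   -> 0.0 (specific)       assumed threshold
    1-3 flags -> NaN (mild, excluded) from the documented ELISA filter
    >=4 flags -> 1.0 (non-specific)   assumed threshold
    """
    if flags < 0:
        raise ValueError("flag count cannot be negative")
    if flags == 0:
        return 0.0
    if flags <= 3:
        return math.nan
    return 1.0


# Toy flag counts, not real data: keep only rows with a usable label.
toy_flags = [0, 2, 5, 0, 1]
labels = [elisa_label(f) for f in toy_flags]
usable = [lab for lab in labels if not math.isnan(lab)]
print(usable)  # [0.0, 1.0, 0.0]
```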

16 Fragment Types

  1. VH_only, VL_only (full variable domains)
  2. H-CDR1, H-CDR2, H-CDR3 (heavy chain CDRs)
  3. L-CDR1, L-CDR2, L-CDR3 (light chain CDRs)
  4. H-CDRs, L-CDRs (concatenated CDRs)
  5. H-FWRs, L-FWRs (concatenated frameworks)
  6. VH+VL (paired variable domains)
  7. All-CDRs, All-FWRs (all concatenated)
  8. Full (alias for VH+VL)
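
To make the 16 names concrete, the sketch below assembles them from per-chain region strings. It does not call ANARCI. The region keys, the heavy-then-light concatenation order, the absence of separators, and the toy sequences are all assumptions for illustration; step3_extract_fragments.py is the authority on the real behavior.

```python
# Hypothetical region keys in N-to-C order; the real script may name them differently.
REGION_ORDER = ["FWR1", "CDR1", "FWR2", "CDR2", "FWR3", "CDR3", "FWR4"]
CDRS = ["CDR1", "CDR2", "CDR3"]
FWRS = ["FWR1", "FWR2", "FWR3", "FWR4"]


def join(regions, keys):
    """Concatenate the selected regions of one chain in the given order."""
    return "".join(regions[k] for k in keys)


def build_fragments(heavy, light):
    """Build the 16 fragment types from two dicts of region -> sequence."""
    vh, vl = join(heavy, REGION_ORDER), join(light, REGION_ORDER)
    frags = {"VH_only": vh, "VL_only": vl}
    for prefix, chain in (("H", heavy), ("L", light)):
        for cdr in CDRS:
            frags[f"{prefix}-{cdr}"] = chain[cdr]        # e.g. H-CDR1
        frags[f"{prefix}-CDRs"] = join(chain, CDRS)
        frags[f"{prefix}-FWRs"] = join(chain, FWRS)
    frags["VH+VL"] = vh + vl
    frags["All-CDRs"] = frags["H-CDRs"] + frags["L-CDRs"]
    frags["All-FWRs"] = frags["H-FWRs"] + frags["L-FWRs"]
    frags["Full"] = frags["VH+VL"]                       # alias for VH+VL
    return frags


# Toy region strings, not real antibody sequences.
toy_heavy = {"FWR1": "AA", "CDR1": "C", "FWR2": "DD", "CDR2": "E",
             "FWR3": "FF", "CDR3": "G", "FWR4": "HH"}
toy_light = {"FWR1": "KK", "CDR1": "L", "FWR2": "MM", "CDR2": "N",
             "FWR3": "PP", "CDR3": "Q", "FWR4": "RR"}

fragments = build_fragments(toy_heavy, toy_light)
print(len(fragments))          # 16
print(fragments["All-CDRs"])   # CEGLNQ
```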

References

Primary Papers

  • Jain et al. (2017) - "Biophysical properties of the clinical-stage antibody landscape." PNAS 114(5):944-949. DOI: https://doi.org/10.1073/pnas.1616408114

  • Sakhnini et al. (2025) - "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: https://doi.org/10.1101/2025.04.28.650927


For Contributors

When to Update Documentation

  1. Implementation changes → Update preprocessing/jain/README.md (SSOT)
  2. Methodology insights → Update relevant docs/datasets/jain/*.md files
  3. Bug fixes → Update label_discrepancy_findings.md or create new report
  4. Historical/debugging docs → Move to docs/datasets/jain/archive/ with explanation

Documentation Principles

  • preprocessing/jain/README.md is the SINGLE SOURCE OF TRUTH for implementation
  • docs/datasets/jain/*.md provide context, rationale, and technical justification
  • docs/datasets/jain/archive/ contains retired methodologies and historical analysis
  • Keep docs DRY - reference other docs instead of duplicating
  • Mark outdated sections with clear ⚠️ WARNING banners

Known Documentation Issues

Medium Priority (Historical cleanup):

  • complete_history.md and reorganization_complete.md contain historical baselines and retired approaches

Low Priority (Keep As-Is):

  • label_discrepancy_findings.md - Accurate historical bug report ✅


Last Updated: 2025-11-17
Documentation Version: 2.0 (post-code-drift-cleanup)
Status: ⚠️ Partially outdated - major rewrites needed for complete_guide and data_sources