Jain Dataset Documentation
Purpose: Reference documentation for the Jain 2017 clinical antibody dataset processing and methodology.
Dataset: Jain et al. (2017) PNAS - 137 clinical-stage antibodies with biophysical measurements and polyreactivity data
Quick Start
For Implementation
→ See preprocessing/jain/README.md - SINGLE SOURCE OF TRUTH for pipeline implementation, scripts, and usage
For Understanding the Dataset
Start here with this directory's documentation ↓
Documentation Structure
Core Reference Documents
1. label_discrepancy_findings.md ✅ ACCURATE
Purpose: Critical investigation report documenting the 38.7% label error bug and fix
- Discovered mismatch between ELISA-based SSOT and paper-based flags_total
- Root cause: fragments derived from wrong labeling system
- Fix: step3_extract_fragments.py regeneration with ELISA-based labels
- Before: 67/27/43 (wrong) → After: 94/22/21 (correct)
Read this if: You want to understand why labels changed or the quality control process
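A label mismatch of this kind can be measured by comparing two label columns row by row. The following is a minimal sketch, not the project's actual check: the function names `label_key` and `mismatch_rate` are hypothetical, and it assumes mild antibodies are stored as NaN and should count as their own category.

```python
import math


def label_key(label):
    """Return a comparable key; NaN or None (assumed to mean 'mild') maps to 'mild'."""
    if label is None or (isinstance(label, float) and math.isnan(label)):
        return "mild"
    return int(label)


def mismatch_rate(ssot_labels, fragment_labels):
    """Fraction of rows whose labels differ between the two sources."""
    if len(ssot_labels) != len(fragment_labels):
        raise ValueError("label lists must have the same length")
    if not ssot_labels:
        raise ValueError("label lists must not be empty")
    mismatches = sum(
        label_key(a) != label_key(b)
        for a, b in zip(ssot_labels, fragment_labels)
    )
    return mismatches / len(ssot_labels)
```

NaN never equals NaN in Python, so the comparison goes through `label_key` rather than using `!=` on the raw values; otherwise every mild row would be counted as a mismatch.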
2. complete_guide.md ✅ ACCURATE
Purpose: Complete guide to the current preprocessing + canonical benchmark artifacts
- Canonical 86-antibody benchmark (exact Novo parity)
- File inventory and how to run verification
Read this if: You want comprehensive context (but verify against preprocessing code)
3. complete_history.md ⚠️ HISTORICAL
Purpose: Historical reference showing evolution of methodologies
- Contains valuable historical context and retired approaches
Read this if: You need historical context on why we changed methodologies
4. reorganization_complete.md ⚠️ HISTORICAL
Purpose: File reorganization effort from 2025-11-05
- Directory structure (raw/processed/canonical/fragments) is accurate
- Verification sections may reflect older baselines
Read this if: You want to understand the file organization
5. jain_data_sources.md ✅ ACCURATE
Purpose: Source file descriptions and data provenance
- Describes the public SD files + private ELISA source used by the pipeline
Read this if: You need to know where the raw data comes from
Current Pipeline (SSOT)
Implemented in: preprocessing/jain/ scripts
Step 1: Excel → CSV Conversion (ELISA-only)
Raw: data/test/jain/raw/*.xlsx
↓ preprocessing/jain/step1_convert_excel_to_csv.py
Output: data/test/jain/processed/jain_with_private_elisa_FULL.csv (137 antibodies)
data/test/jain/processed/jain_ELISA_ONLY_116.csv (116 antibodies)
Step 2: P5e-S2 Novo Parity Method
Input: jain_ELISA_ONLY_116.csv (116)
↓ preprocessing/jain/step2_preprocess_p5e_s2.py
- Reclassify 5 specific→non-specific (PSR/Tm/clinical)
- Remove 30 specific by PSR/AC-SINS sorting
- Tier D: flip 2 labels (lebrikizumab, galiximab)
Output: data/test/jain/canonical/jain_86_novo_parity.csv (86 antibodies)
Step 3: Fragment Extraction (ANARCI/IMGT)
Input: jain_with_private_elisa_FULL.csv (137)
↓ preprocessing/jain/step3_extract_fragments.py
Output: data/test/jain/fragments/*.csv (16 fragment types)
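The Step 2 bookkeeping above can be checked with simple arithmetic. This is a sketch of the counts only, not of the selection logic in step2_preprocess_p5e_s2.py: the function name `p5e_s2_counts` is hypothetical, the starting split of 94 specific / 22 non-specific is taken from the ELISA-based label distribution, and both Tier D flips are assumed to go from specific to non-specific.

```python
def p5e_s2_counts(specific, non_specific,
                  reclassified=5, removed=30, tier_d_flips=2):
    """Track class counts through the three P5e-S2 operations."""
    # Reclassify specific -> non-specific (PSR/Tm/clinical criteria).
    specific -= reclassified
    non_specific += reclassified

    # Remove specific antibodies chosen by PSR/AC-SINS sorting.
    specific -= removed

    # Tier D label flips. Assumption: both flips are specific -> non-specific.
    specific -= tier_d_flips
    non_specific += tier_d_flips

    return specific, non_specific


specific, non_specific = p5e_s2_counts(94, 22)
print(specific, non_specific, specific + non_specific)
```

Starting from 94/22 (116 antibodies), the counts end at 57/29 (86 antibodies), which is only possible if both Tier D flips move an antibody into the non-specific class.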
Key Facts (Current Implementation)
Pipeline Statistics
- Raw input: 137 antibodies with private ELISA data
- ELISA filtering: 116 antibodies (excludes the 21 mild antibodies with 1-3 ELISA flags)
- P5e-S2 parity: 86 antibodies (57 specific / 29 non-specific)
- Our result: 68.60% accuracy, confusion matrix [[40,17],[10,19]] - EXACT NOVO PARITY
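The reported accuracy follows directly from the confusion matrix. The sketch below recomputes it, assuming rows are true labels and columns are predictions (the row sums 57 and 29 match the class counts listed above); the function name is illustrative only.

```python
def accuracy_from_confusion_matrix(cm):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Assumed layout: rows = true class (specific, non-specific), columns = predicted.
cm = [[40, 17],
      [10, 19]]

accuracy = accuracy_from_confusion_matrix(cm)
row_totals = [sum(row) for row in cm]

print(f"{accuracy:.2%}")  # (40 + 19) / 86
print(row_totals)
```

The diagonal holds 40 + 19 = 59 correct predictions out of 86, which is the 68.60% figure.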
Label Distribution (ELISA-based SSOT)
- Specific (0): 94 antibodies
- Non-specific (1): 22 antibodies
- Mild (NaN): 21 antibodies (excluded from training)
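The three label values can be illustrated with a small mapping from an antibody's ELISA flag count. This is a sketch under stated assumptions rather than the pipeline's code: the document only says that antibodies with 1-3 ELISA flags are excluded as mild, so the cutoffs of 0 flags for specific and 4 or more for non-specific are assumptions here, and `elisa_flags_to_label` is a hypothetical name.

```python
import math


def elisa_flags_to_label(flag_count):
    """Map an ELISA flag count to 0.0 (specific), 1.0 (non-specific), or NaN (mild)."""
    if flag_count < 0:
        raise ValueError("flag count cannot be negative")
    if flag_count == 0:
        return 0.0        # assumption: no flags means specific
    if flag_count <= 3:
        return math.nan   # mild: excluded from training
    return 1.0            # assumption: 4 or more flags means non-specific


labels = [elisa_flags_to_label(n) for n in (0, 2, 5)]
training_labels = [x for x in labels if not math.isnan(x)]
print(training_labels)
```

Filtering with `math.isnan` drops the mild rows, which mirrors how the 137-antibody set shrinks to the 116 used for training.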
16 Fragment Types
- VH_only, VL_only (full variable domains)
- H-CDR1, H-CDR2, H-CDR3 (heavy chain CDRs)
- L-CDR1, L-CDR2, L-CDR3 (light chain CDRs)
- H-CDRs, L-CDRs (concatenated CDRs)
- H-FWRs, L-FWRs (concatenated frameworks)
- VH+VL (paired variable domains)
- All-CDRs, All-FWRs (all concatenated)
- Full (alias for VH+VL)
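The 16 fragment types listed above can be assembled from the seven regions of each chain. The following is a minimal sketch, assuming the regions have already been annotated (for example by ANARCI with IMGT numbering) and are passed in as plain dictionaries. The `build_fragments` helper, the region key names, and the choice of plain concatenation with the heavy chain first are assumptions for illustration; step3_extract_fragments.py may differ.

```python
REGION_ORDER = ["FWR1", "CDR1", "FWR2", "CDR2", "FWR3", "CDR3", "FWR4"]
CDRS = ["CDR1", "CDR2", "CDR3"]
FWRS = ["FWR1", "FWR2", "FWR3", "FWR4"]


def join_regions(regions, names):
    """Concatenate the named regions in the given order, with no separator."""
    return "".join(regions[name] for name in names)


def build_fragments(heavy, light):
    """Build all 16 fragment sequences from per-chain region dictionaries."""
    vh = join_regions(heavy, REGION_ORDER)
    vl = join_regions(light, REGION_ORDER)
    h_cdrs, l_cdrs = join_regions(heavy, CDRS), join_regions(light, CDRS)
    h_fwrs, l_fwrs = join_regions(heavy, FWRS), join_regions(light, FWRS)
    return {
        "VH_only": vh,
        "VL_only": vl,
        "H-CDR1": heavy["CDR1"],
        "H-CDR2": heavy["CDR2"],
        "H-CDR3": heavy["CDR3"],
        "L-CDR1": light["CDR1"],
        "L-CDR2": light["CDR2"],
        "L-CDR3": light["CDR3"],
        "H-CDRs": h_cdrs,
        "L-CDRs": l_cdrs,
        "H-FWRs": h_fwrs,
        "L-FWRs": l_fwrs,
        "VH+VL": vh + vl,                # assumption: heavy chain first
        "All-CDRs": h_cdrs + l_cdrs,
        "All-FWRs": h_fwrs + l_fwrs,
        "Full": vh + vl,                 # alias for VH+VL
    }
```

Calling `build_fragments(heavy, light)` with two dictionaries keyed by the names in `REGION_ORDER` returns one sequence per fragment type, which could then be written out as one CSV per key.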
References
Primary Papers
- Jain et al. (2017) - "Biophysical properties of the clinical-stage antibody landscape." PNAS 114(5):944-949. DOI: https://doi.org/10.1073/pnas.1616408114
- Sakhnini et al. (2025) - "Prediction of Antibody Non-Specificity using Protein Language Models and Biophysical Parameters." bioRxiv. DOI: https://doi.org/10.1101/2025.04.28.650927
For Contributors
When to Update Documentation
- Implementation changes → Update preprocessing/jain/README.md (SSOT)
- Methodology insights → Update relevant docs/datasets/jain/*.md files
- Bug fixes → Update label_discrepancy_findings.md or create new report
- Historical/debugging docs → Move to docs/datasets/jain/archive/ with explanation
Documentation Principles
- preprocessing/jain/README.md is the SINGLE SOURCE OF TRUTH for implementation
- docs/datasets/jain/*.md provide context, rationale, and technical justification
- docs/datasets/jain/archive/ contains retired methodologies and historical analysis
- Keep docs DRY - reference other docs instead of duplicating
- Mark outdated sections with clear ⚠️ WARNING banners
Known Documentation Issues
Medium Priority (Historical cleanup):
- complete_history.md and reorganization_complete.md contain historical baselines and retired approaches
Low Priority (Keep As-Is):
- label_discrepancy_findings.md - Accurate historical bug report ✅
Last Updated: 2025-11-17
Documentation Version: 2.0 (post-code-drift-cleanup)
Status: ⚠️ Partially outdated - major rewrites needed for complete_guide and data_sources