⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup
This document describes the P0 blocker validation from 2025-11-02 during initial Shehata Phase 2 development.
For current pipeline documentation, see:
data/test/shehata/README.mdThe gap character issue described below has been resolved. All fragment CSVs are now gap-free.
P0 Blocker: First Principles Validation (HISTORICAL)¶
In v2.0.0+, use:
from antibody_training_esm.core.embeddings import ESMEmbeddingExtractorSee IMPORT_AND_STRUCTURE_GUIDE.md for current usage.
Date: 2025-11-02 Question: Is the gap character bug real, or is it a hallucination? Analyst: Claude Code (verified by user request)
Executive Summary¶
✅ CONFIRMED: This is a REAL P0 bug, not a hallucination.
The bug was validated from first principles through: 1. Code inspection of actual validation logic 2. Runtime behavior testing 3. Actual data verification 4. Paper methodology cross-reference
1. Where Does the Validation Requirement Come From?¶
Source 1: Codebase (model.py)¶
Location: model.py:33-37 and model.py:86-90
# ESMEmbeddingExtractor.embed_sequence()
valid_aas = set("ACDEFGHIKLMNPQRSTVWYX")
sequence = sequence.upper().strip()
if not all(aa in valid_aas for aa in sequence):
raise ValueError("Invalid amino acid characters in sequence")
Validation Logic:
- ESM tokenizer REQUIRES all characters in sequence to be in valid_aas set
- The set does NOT include - (hyphen/gap character)
- Any sequence with - will raise ValueError
Is this assumption correct? ✅ YES - This is standard ESM-1v preprocessing - ESM models expect raw amino acid sequences - Gap characters are IMGT alignment artifacts, not real amino acids - The valid AA set matches standard 20 amino acids + X (unknown)
Source 2: Paper Methodology¶
Location: literature/markdown/Sakhnini_2025_Antibody_NonSpecificity_PLM_Biophysical.md:240
"sequences were annotated in the CDRs using ANARCI following the IMGT numbering scheme. Following this, 16 different antibody fragment sequences were assembled and embedded by three state-of-the-art protein language models (PLMs), ESM 1v"
Paper Does NOT Specify:
- Whether they used sequence_alignment_aa (with gaps) or sequence_aa (without gaps)
- Whether they strip gaps before embedding
- Which riot_na field they used
Paper DOES Specify: - They use ANARCI with IMGT scheme ✓ - They embed sequences with ESM-1v ✓ - Embeddings use mean pooling ✓
Inference: ✅ They MUST have gap-free sequences for ESM embedding - ESM-1v cannot tokenize gap characters - Paper shows successful embeddings - Therefore, they either used gap-free field OR stripped gaps
Source 3: Domain Knowledge¶
IMGT Numbering:
- Assigns fixed position numbers to antibody residues
- Uses gaps (-) for structural alignment where deletions occur
- Example: Position 60 might be missing → represented as - in alignment
ESM-1v Requirements: - Protein language model trained on UniProt sequences - Expects continuous amino acid sequences - No alignment artifacts (gaps, insertion codes, etc.)
Standard Practice: ✅ Always strip IMGT gaps before embedding - Alignment is for structural comparison - Embedding requires biological sequence - This is universal in antibody ML pipelines
2. Testing the Claim: Runtime Validation¶
Test 1: Does - Actually Fail Validation?¶
sequence_with_gaps = "EVQLVESGGGLVQPGGSLRLSCTTSGITFSRYAMTWVRQSATKGLEWISSI--GSGSTFYADSVKGRFTISRDNSKDTLY"
sequence_without_gaps = "EVQLVESGGGLVQPGGSLRLSCTTSGITFSRYAMTWVRQSATKGLEWISSIGSGSTFYADSVKGRFTISRDNSKDTLY"
valid_aas = set("ACDEFGHIKLMNPQRSTVWYX")
# Test WITH gaps
all(aa in valid_aas for aa in sequence_with_gaps)
>>> False # ❌ FAILS validation
# Test WITHOUT gaps
all(aa in valid_aas for aa in sequence_without_gaps)
>>> True # ✅ PASSES validation
Result: ✅ CONFIRMED - Gap character causes validation failure
Test 2: Do Actual Shehata Files Have Gaps?¶
import pandas as pd
df = pd.read_csv('data/test/shehata/fragments/VH_only_shehata.csv')
gap_count = df['sequence'].str.contains('-').sum()
>>> 13 sequences with "-" gaps
df = pd.read_csv('data/test/shehata/VL_only_shehata.csv')
gap_count = df['sequence'].str.contains('-').sum()
>>> 4 sequences with "-" gaps
df = pd.read_csv('data/test/shehata/Full_shehata.csv')
gap_count = df['sequence'].str.contains('-').sum()
>>> 17 sequences with "-" gaps
Result: ✅ CONFIRMED - Actual files have gap characters
Example (ADI-47173):
Test 3: When Would This Be Caught?¶
At Runtime - During Embedding:
# In data.py:preprocess_raw_data()
X_embedded = embedding_extractor.extract_batch_embeddings(X) # Calls model.py
# In model.py:extract_batch_embeddings()
if not all(aa in valid_aas for aa in seq) or len(seq) < 1:
logger.warning(f"Invalid sequence at index {start_idx + len(cleaned_sequences)}, using zeros")
cleaned_sequences.append("M") # Placeholder for invalid sequences
Result: ✅ CONFIRMED - Would be caught at runtime - 13 VH sequences would trigger validation error - 4 VL sequences would trigger validation error - 17 Full sequences would trigger validation error - Logged as warnings, replaced with placeholder "M" - INCORRECT EMBEDDINGS returned
Impact: - Silent failure (logged as warning) - Garbage embeddings for 17 antibodies - Model trained on incorrect representations - Results CANNOT reproduce paper
3. Root Cause Analysis¶
Where Do Gaps Come From?¶
Phase 1 (Excel → CSV): ✅ CORRECT
- preprocessing/shehata/step1_convert_excel_to_csv.py:21-46 has sanitize_sequence()
- Removes all - characters from input sequences
- data/test/shehata/processed/shehata.csv is gap-free ✓
Phase 2 (CSV → Fragments): ❌ BROKEN
- preprocessing/shehata/step2_extract_fragments.py:63 uses annotation.sequence_alignment_aa
- This field includes IMGT gaps for structural alignment
- Should use annotation.sequence_aa (no gaps)
Verification with riot_na¶
import riot_na
annotator = riot_na.create_riot_aa()
# Test with sequence that has structural deletion
result = annotator.run_on_sequence('test', clean_sequence)
# Two different fields:
result.sequence_alignment_aa # IMGT-aligned WITH gaps ("-")
result.sequence_aa # Raw sequence WITHOUT gaps
# CDR/FWR fragments:
result.cdr1_aa # Individual regions don't have gaps ✓
result.cdr2_aa # Individual regions don't have gaps ✓
result.fwr1_aa # Individual regions don't have gaps ✓
Result: ✅ Root cause confirmed - ANARCI provides TWO versions of full sequence - We're using the WRONG one (alignment version) - CDR/FWR fragments are correct (no gaps) - Only full-length sequences affected
4. Is This Documented Anywhere in the Repo?¶
Evidence of Prior Awareness¶
File: docs/shehata_data_cleaning_log.md:13-18
### Issue 1: Gap Characters in Sequences (CRITICAL)
- **13 VH sequences** contain gap characters (`-`)
- **11 VL sequences** contain gap characters (`-`)
- Gap characters are IMGT numbering artifacts from alignment
File: docs/shehata_conversion_verification_report.md:20-24
### 1. ✅ CRITICAL: Gap Character Sanitization
- 13 VH + 11 VL sequences contained gap characters (`-`) from IMGT numbering
- Original code validated but never sanitized sequences
Interpretation: ⚠️ Gap issue was KNOWN for Phase 1 (Excel → CSV) ✅ Phase 1 was FIXED with sanitization ❌ Phase 2 (fragment extraction) was NEVER checked for gaps
5. Validation Plan: How to Prove It's Fixed¶
After Fix Applied:¶
# Test 1: Check all fragment files for gaps
import pandas as pd
from pathlib import Path
fragment_files = Path('data/test/shehata/').glob('*.csv')
for file in fragment_files:
df = pd.read_csv(file)
gap_count = df['sequence'].str.contains('-').sum()
assert gap_count == 0, f"{file.name} has {gap_count} sequences with gaps"
Test 2: Try actual embedding¶
# Load ESM model
from model import ESMEmbeddingExtractor
model = ESMEmbeddingExtractor('facebook/esm1v_t33_650M_UR90S_1', 'cpu')
# Load Shehata VH sequences
df = pd.read_csv('data/test/shehata/fragments/VH_only_shehata.csv')
sequences = df['sequence'].tolist()
# Try embedding - should NOT raise errors or warnings
embeddings = model.extract_batch_embeddings(sequences)
# Verify no zero/placeholder embeddings
assert not any(np.allclose(emb, 0) for emb in embeddings)
6. Conclusion¶
Is This Bug Real?¶
✅ YES - Validated from multiple angles:
1. Code logic: model.py explicitly rejects - characters
2. Actual data: 13 VH, 4 VL, 17 Full sequences have gaps
3. Runtime behavior: Would cause validation errors during embedding
4. Paper methodology: ESM-1v requires gap-free sequences
5. Domain knowledge: Standard practice is to strip IMGT gaps
6. Root cause: Using wrong riot_na field (sequence_alignment_aa vs sequence_aa)
Is This a Hallucination?¶
❌ NO - Every claim validated: - Gap characters exist in actual files ✓ - They will break ESM embedding ✓ - Root cause identified ✓ - Fix is straightforward ✓
What Are the Assumptions?¶
- ESM-1v cannot tokenize gaps: ✅ Standard fact
- riot_na provides gap-free field: ✅ Verified by inspection
- Paper used gap-free sequences: ✅ Inferred (ESM worked for them)
- model.py validation is correct: ✅ Matches ESM best practices
When Would We Discover This Without Analysis?¶
During embedding pipeline execution: - Logged as 17 warnings - Sequences replaced with placeholder "M" - Model would train but results would be wrong - Might not be obvious until comparing results with paper
During Discord review: - "Did you check for IMGT gaps in fragments?" - "Why are your embeddings different from the paper?" - This is the criticism we want to avoid
Action Items¶
URGENT (P0):¶
-
Update
preprocessing/shehata/step2_extract_fragments.py:63 -
Re-run fragment extraction:
-
Validate all gaps removed:
-
Test ESM embedding on all fragments
Documentation:¶
- Update
docs/SHEHATA_BLOCKER_ANALYSIS.mdwith validation results - Add validation script to prevent regression
- Document in
docs/shehata_phase2_completion_report.md
Appendix: Exact Line References¶
| File | Line | Issue | Fix |
|---|---|---|---|
preprocessing/shehata/step2_extract_fragments.py |
63 | Uses sequence_alignment_aa (has gaps) |
Use sequence_aa (no gaps) |
data/test/shehata/fragments/VH_only_shehata.csv |
Multiple | 13 sequences with - |
Re-extract after fix |
data/test/shehata/VL_only_shehata.csv |
Multiple | 4 sequences with - |
Re-extract after fix |
data/test/shehata/Full_shehata.csv |
Multiple | 17 sequences with - |
Re-extract after fix |
Sign-Off¶
Validation Complete: ✅ Bug Confirmed: ✅ Fix Identified: ✅ Ready for Implementation: ✅ (pending senior approval)
This is NOT a hallucination. This is a real, reproducible bug with clear evidence and a straightforward fix.