⚠️ HISTORICAL DOCUMENT - November 2025 Cleanup

This document describes the P0 blocker validation from 2025-11-02 during initial Shehata Phase 2 development.

For current pipeline documentation, see: data/test/shehata/README.md

The gap character issue described below has been resolved. All fragment CSVs are now gap-free.

P0 Blocker: First Principles Validation (HISTORICAL)¶

In v2.0.0+, use: from antibody_training_esm.core.embeddings import ESMEmbeddingExtractor

See IMPORT_AND_STRUCTURE_GUIDE.md for current usage.

Date: 2025-11-02 Question: Is the gap character bug real, or is it a hallucination? Analyst: Claude Code (verified by user request)

Executive Summary¶

✅ CONFIRMED: This is a REAL P0 bug, not a hallucination.

The bug was validated from first principles through: 1. Code inspection of actual validation logic 2. Runtime behavior testing 3. Actual data verification 4. Paper methodology cross-reference

1. Where Does the Validation Requirement Come From?¶

Source 1: Codebase (model.py)¶

Location: model.py:33-37 and model.py:86-90

# ESMEmbeddingExtractor.embed_sequence()
valid_aas = set("ACDEFGHIKLMNPQRSTVWYX")
sequence = sequence.upper().strip()

if not all(aa in valid_aas for aa in sequence):
    raise ValueError("Invalid amino acid characters in sequence")

Validation Logic: - ESM tokenizer REQUIRES all characters in sequence to be in valid_aas set - The set does NOT include - (hyphen/gap character) - Any sequence with - will raise ValueError

Is this assumption correct? ✅ YES - This is standard ESM-1v preprocessing - ESM models expect raw amino acid sequences - Gap characters are IMGT alignment artifacts, not real amino acids - The valid AA set matches standard 20 amino acids + X (unknown)

Source 2: Paper Methodology¶

Location: literature/markdown/Sakhnini_2025_Antibody_NonSpecificity_PLM_Biophysical.md:240

"sequences were annotated in the CDRs using ANARCI following the IMGT numbering scheme. Following this, 16 different antibody fragment sequences were assembled and embedded by three state-of-the-art protein language models (PLMs), ESM 1v"

Paper Does NOT Specify: - Whether they used sequence_alignment_aa (with gaps) or sequence_aa (without gaps) - Whether they strip gaps before embedding - Which riot_na field they used

Paper DOES Specify: - They use ANARCI with IMGT scheme ✓ - They embed sequences with ESM-1v ✓ - Embeddings use mean pooling ✓

Inference: ✅ They MUST have gap-free sequences for ESM embedding - ESM-1v cannot tokenize gap characters - Paper shows successful embeddings - Therefore, they either used gap-free field OR stripped gaps

Source 3: Domain Knowledge¶

IMGT Numbering: - Assigns fixed position numbers to antibody residues - Uses gaps (-) for structural alignment where deletions occur - Example: Position 60 might be missing → represented as - in alignment

ESM-1v Requirements: - Protein language model trained on UniProt sequences - Expects continuous amino acid sequences - No alignment artifacts (gaps, insertion codes, etc.)

Standard Practice: ✅ Always strip IMGT gaps before embedding - Alignment is for structural comparison - Embedding requires biological sequence - This is universal in antibody ML pipelines

2. Testing the Claim: Runtime Validation¶

Test 1: Does `-` Actually Fail Validation?¶

sequence_with_gaps = "EVQLVESGGGLVQPGGSLRLSCTTSGITFSRYAMTWVRQSATKGLEWISSI--GSGSTFYADSVKGRFTISRDNSKDTLY"
sequence_without_gaps = "EVQLVESGGGLVQPGGSLRLSCTTSGITFSRYAMTWVRQSATKGLEWISSIGSGSTFYADSVKGRFTISRDNSKDTLY"

valid_aas = set("ACDEFGHIKLMNPQRSTVWYX")

# Test WITH gaps
all(aa in valid_aas for aa in sequence_with_gaps)
>>> False  # ❌ FAILS validation

# Test WITHOUT gaps
all(aa in valid_aas for aa in sequence_without_gaps)
>>> True   # ✅ PASSES validation

Result: ✅ CONFIRMED - Gap character causes validation failure

Test 2: Do Actual Shehata Files Have Gaps?¶

import pandas as pd

df = pd.read_csv('data/test/shehata/fragments/VH_only_shehata.csv')
gap_count = df['sequence'].str.contains('-').sum()
>>> 13 sequences with "-" gaps

df = pd.read_csv('data/test/shehata/VL_only_shehata.csv')
gap_count = df['sequence'].str.contains('-').sum()
>>> 4 sequences with "-" gaps

df = pd.read_csv('data/test/shehata/Full_shehata.csv')
gap_count = df['sequence'].str.contains('-').sum()
>>> 17 sequences with "-" gaps

Result: ✅ CONFIRMED - Actual files have gap characters

Example (ADI-47173):

EVQLVESGGGVVQPGRSLRLSCAASGFTFDRYGMHWIRQAPGKGLECVALISFDGSHK-YADSVKGRFTISRDNSRNTLY...
                                                           ^
                                                        GAP CHAR

Test 3: When Would This Be Caught?¶

At Runtime - During Embedding:

# In data.py:preprocess_raw_data()
X_embedded = embedding_extractor.extract_batch_embeddings(X)  # Calls model.py

# In model.py:extract_batch_embeddings()
if not all(aa in valid_aas for aa in seq) or len(seq) < 1:
    logger.warning(f"Invalid sequence at index {start_idx + len(cleaned_sequences)}, using zeros")
    cleaned_sequences.append("M")  # Placeholder for invalid sequences

Result: ✅ CONFIRMED - Would be caught at runtime - 13 VH sequences would trigger validation error - 4 VL sequences would trigger validation error - 17 Full sequences would trigger validation error - Logged as warnings, replaced with placeholder "M" - INCORRECT EMBEDDINGS returned

Impact: - Silent failure (logged as warning) - Garbage embeddings for 17 antibodies - Model trained on incorrect representations - Results CANNOT reproduce paper

3. Root Cause Analysis¶

Where Do Gaps Come From?¶

Phase 1 (Excel → CSV): ✅ CORRECT - preprocessing/shehata/step1_convert_excel_to_csv.py:21-46 has sanitize_sequence() - Removes all - characters from input sequences - data/test/shehata/processed/shehata.csv is gap-free ✓

Phase 2 (CSV → Fragments): ❌ BROKEN - preprocessing/shehata/step2_extract_fragments.py:63 uses annotation.sequence_alignment_aa - This field includes IMGT gaps for structural alignment - Should use annotation.sequence_aa (no gaps)

Verification with riot_na¶

import riot_na
annotator = riot_na.create_riot_aa()

# Test with sequence that has structural deletion
result = annotator.run_on_sequence('test', clean_sequence)

# Two different fields:
result.sequence_alignment_aa  # IMGT-aligned WITH gaps ("-")
result.sequence_aa            # Raw sequence WITHOUT gaps

# CDR/FWR fragments:
result.cdr1_aa  # Individual regions don't have gaps ✓
result.cdr2_aa  # Individual regions don't have gaps ✓
result.fwr1_aa  # Individual regions don't have gaps ✓

Result: ✅ Root cause confirmed - ANARCI provides TWO versions of full sequence - We're using the WRONG one (alignment version) - CDR/FWR fragments are correct (no gaps) - Only full-length sequences affected

4. Is This Documented Anywhere in the Repo?¶

Evidence of Prior Awareness¶

File: docs/shehata_data_cleaning_log.md:13-18

### Issue 1: Gap Characters in Sequences (CRITICAL)
- **13 VH sequences** contain gap characters (`-`)
- **11 VL sequences** contain gap characters (`-`)
- Gap characters are IMGT numbering artifacts from alignment

File: docs/shehata_conversion_verification_report.md:20-24

### 1. ✅ CRITICAL: Gap Character Sanitization
- 13 VH + 11 VL sequences contained gap characters (`-`) from IMGT numbering
- Original code validated but never sanitized sequences

Interpretation: ⚠️ Gap issue was KNOWN for Phase 1 (Excel → CSV) ✅ Phase 1 was FIXED with sanitization ❌ Phase 2 (fragment extraction) was NEVER checked for gaps

5. Validation Plan: How to Prove It's Fixed¶

After Fix Applied:¶

# Test 1: Check all fragment files for gaps
import pandas as pd
from pathlib import Path

fragment_files = Path('data/test/shehata/').glob('*.csv')
for file in fragment_files:
    df = pd.read_csv(file)
    gap_count = df['sequence'].str.contains('-').sum()
    assert gap_count == 0, f"{file.name} has {gap_count} sequences with gaps"

Test 2: Try actual embedding¶

# Load ESM model
from model import ESMEmbeddingExtractor

model = ESMEmbeddingExtractor('facebook/esm1v_t33_650M_UR90S_1', 'cpu')

# Load Shehata VH sequences
df = pd.read_csv('data/test/shehata/fragments/VH_only_shehata.csv')
sequences = df['sequence'].tolist()

# Try embedding - should NOT raise errors or warnings
embeddings = model.extract_batch_embeddings(sequences)

# Verify no zero/placeholder embeddings
assert not any(np.allclose(emb, 0) for emb in embeddings)

6. Conclusion¶

Is This Bug Real?¶

✅ YES - Validated from multiple angles: 1. Code logic: model.py explicitly rejects - characters 2. Actual data: 13 VH, 4 VL, 17 Full sequences have gaps 3. Runtime behavior: Would cause validation errors during embedding 4. Paper methodology: ESM-1v requires gap-free sequences 5. Domain knowledge: Standard practice is to strip IMGT gaps 6. Root cause: Using wrong riot_na field (sequence_alignment_aa vs sequence_aa)

Is This a Hallucination?¶

❌ NO - Every claim validated: - Gap characters exist in actual files ✓ - They will break ESM embedding ✓ - Root cause identified ✓ - Fix is straightforward ✓

What Are the Assumptions?¶

ESM-1v cannot tokenize gaps: ✅ Standard fact
riot_na provides gap-free field: ✅ Verified by inspection
Paper used gap-free sequences: ✅ Inferred (ESM worked for them)
model.py validation is correct: ✅ Matches ESM best practices

When Would We Discover This Without Analysis?¶

During embedding pipeline execution: - Logged as 17 warnings - Sequences replaced with placeholder "M" - Model would train but results would be wrong - Might not be obvious until comparing results with paper

During Discord review: - "Did you check for IMGT gaps in fragments?" - "Why are your embeddings different from the paper?" - This is the criticism we want to avoid

Action Items¶

URGENT (P0):¶

Update preprocessing/shehata/step2_extract_fragments.py:63

# Change FROM:
f"full_seq_{chain}": annotation.sequence_alignment_aa

# Change TO:
f"full_seq_{chain}": annotation.sequence_aa

Re-run fragment extraction:

python3 preprocessing/shehata/step2_extract_fragments.py

Validate all gaps removed:

grep -r '\-' data/test/shehata/fragments/*.csv | wc -l
# Should be 0

Test ESM embedding on all fragments

Documentation:¶

Update docs/SHEHATA_BLOCKER_ANALYSIS.md with validation results
Add validation script to prevent regression
Document in docs/shehata_phase2_completion_report.md

Appendix: Exact Line References¶

File	Line	Issue	Fix
`preprocessing/shehata/step2_extract_fragments.py`	63	Uses `sequence_alignment_aa` (has gaps)	Use `sequence_aa` (no gaps)
`data/test/shehata/fragments/VH_only_shehata.csv`	Multiple	13 sequences with `-`	Re-extract after fix
`data/test/shehata/VL_only_shehata.csv`	Multiple	4 sequences with `-`	Re-extract after fix
`data/test/shehata/Full_shehata.csv`	Multiple	17 sequences with `-`	Re-extract after fix

Sign-Off¶

Validation Complete: ✅ Bug Confirmed: ✅ Fix Identified: ✅ Ready for Implementation: ✅ (pending senior approval)

This is NOT a hallucination. This is a real, reproducible bug with clear evidence and a straightforward fix.