Troubleshooting Guide¶
This guide provides solutions to common issues encountered when using the antibody training pipeline.
Installation Issues¶
uv command not found after installation¶
Symptoms:
Solution:
Restart your terminal or manually add uv to your PATH:
# Add to ~/.bashrc or ~/.zshrc
export PATH="$HOME/.cargo/bin:$PATH"
# Reload shell config
source ~/.bashrc # or source ~/.zshrc
Python version mismatch¶
Symptoms:
Solution:
Install Python 3.12:
# Ubuntu/Debian
sudo apt update
sudo apt install python3.12
# macOS (Homebrew)
brew install python@3.12
# Or use pyenv
pyenv install 3.12
pyenv local 3.12
Permission denied on macOS/Linux¶
Symptoms:
Solution:
Never use sudo with uv. Fix ownership instead:
# Fix ~/.local ownership
sudo chown -R $USER:$USER ~/.local
# Fix .venv ownership (if needed)
sudo chown -R $USER:$USER .venv
HuggingFace Cache Permission Denied (Linux/WSL2)¶
Symptoms:
OSError: PermissionError at /home/user/.cache/huggingface/hub when downloading facebook/esm1v_t33_650M_UR90S_1
Or test failures with:
Root Cause:
The HuggingFace cache directory was created by a different user (often root) or has incorrect permissions.
Solution:
# Fix cache ownership ($USER expands to your username)
sudo chown -R $USER:$USER ~/.cache/huggingface
# OR - Delete and recreate the cache directory
rm -rf ~/.cache/huggingface
mkdir -p ~/.cache/huggingface
Verify:
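For example, confirm the directory is now owned by your user (exact owner/group names will differ per system):
ls -ld ~/.cache/huggingface
# The owner shown should be your username, not root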
Prevention:
Never run model downloads with sudo. Use uv run commands as your regular user.
GPU / Hardware Issues¶
MPS Memory Issues (Apple Silicon)¶
Symptoms:
Solution 1: Reduce Batch Size
Solution 2: Clear MPS Cache
Solution 3: Use CPU Instead
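Illustrative sketches of the three options above. The config keys follow the training.batch_size and model.device layout used elsewhere in this guide; the torch.mps call is standard in recent PyTorch releases and only matters if you are editing the training code yourself.
# Solutions 1 & 3: src/antibody_training_esm/conf/config.yaml
training:
  batch_size: 4 # Solution 1: lower until MPS stops running out of memory
model:
  device: "cpu" # Solution 3: fall back to CPU if MPS keeps failing
# Solution 2: release cached MPS memory periodically (only if editing the training loop)
import torch
if torch.backends.mps.is_available():
    torch.mps.empty_cache()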
Permanent Fix:
The MPS memory leak was fixed in commit 9c8e5f2. If still encountering issues, see docs/archive/investigations/2025-11-03-mps-memory-leak.md for historical context.
CUDA Out of Memory¶
Symptoms:
Solution 1: Reduce Batch Size
# src/antibody_training_esm/conf/config.yaml
training:
  batch_size: 8 # Default; lower further if needed
Solution 2: Clear CUDA Cache
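A minimal sketch, relevant only if you are modifying the training code yourself (torch.cuda.empty_cache() releases memory held by PyTorch's caching allocator back to the driver):
import torch
# Release cached (unused) GPU memory between folds or epochs
if torch.cuda.is_available():
    torch.cuda.empty_cache()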
Solution 3: Use Smaller Model
# src/antibody_training_esm/conf/config.yaml
model:
  name: "facebook/esm1v_t33_650M_UR90S_1" # 650M parameters
  # Instead of:
  # name: "facebook/esm2_t36_3B_UR50D" # 3B parameters
Solution 4: Use CPU
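A hedged sketch, reusing the model.device key listed under the required config keys later in this guide:
# src/antibody_training_esm/conf/config.yaml
model:
  device: "cpu" # Much slower for embedding extraction, but avoids GPU memory limits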
GPU Not Detected¶
Symptoms:
Solution (CUDA):
# Check GPU is visible
nvidia-smi
# Verify PyTorch CUDA installation
python -c "import torch; print(torch.version.cuda)"
# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Solution (MPS):
# Verify macOS version (≥12.3 required)
sw_vers
# Check MPS is available
python -c "import torch; print(torch.backends.mps.is_available())"
Training Issues¶
ESM-1v Download Fails¶
Symptoms:
Solution:
# Set the HuggingFace cache directory to a location with enough free disk space
export HF_HOME=/path/with/enough/space
# Use HuggingFace mirror (if in region with restrictions)
export HF_ENDPOINT=https://hf-mirror.com
# Retry download
uv run antibody-train
"Label column not found" Error¶
Symptoms:
Solution:
Ensure training CSV has label column with 0/1 values:
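For example, a minimal valid training CSV might look like this (illustrative, truncated sequences):
sequence,label
EVQLVESGGGLVQPGG...,1
QVQLQESGPGLVKPSE...,0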
If using different column name, update CSV:
import pandas as pd
df = pd.read_csv("data.csv")
df = df.rename(columns={"polyreactivity": "label"})
df.to_csv("data_fixed.csv", index=False)
Embedding Cache Out of Sync¶
Symptoms:
- Embeddings don't match expected shape
- Predictions are random/nonsensical
- Cache from old model version
Solution:
Clear embeddings cache and retrain:
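The same commands shown under Cache & Persistence Errors below apply here:
# Delete the embeddings cache
rm -rf experiments/cache/
# Retrain from scratch
uv run antibody-train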
The cache key is a SHA-256 hash of:
- Model name
- Dataset path
- Model revision
Any change to these invalidates the cache automatically, but manual clearing ensures a fresh start.
Poor Cross-Validation Performance¶
Symptoms:
Possible Causes & Solutions:
1. Wrong Column Names
# Ensure column names match CSV file
data:
  sequence_column: "sequence" # Default column name for sequences
  label_column: "label" # Default column name for labels
Check your CSV has these columns:
import pandas as pd
df = pd.read_csv("train.csv")
print(df.columns) # Should include 'sequence' and 'label'
2. Label Encoding Error
# Check label distribution
import pandas as pd
df = pd.read_csv("train.csv")
print(df["label"].value_counts())
# Should show: 0: XXX, 1: YYY (binary labels)
3. Sequence Quality Issues
# Check for invalid sequences
df["valid"] = df["VH"].str.match(r'^[ACDEFGHIKLMNPQRSTVWY]+$')
print(f"Invalid: {(~df['valid']).sum()}")
4. Model Not Loaded Correctly
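A quick sanity check, sketched with the transformers API (the 1280-dimensional hidden size applies to the 650M ESM-1v checkpoint; adjust the name if you trained with a different model):
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm1v_t33_650M_UR90S_1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
print(model.config.hidden_size)  # ESM-1v (650M) embeddings are 1280-dimensional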
Training Takes Too Long¶
Symptoms:
- Training runs for hours on small dataset
- Embedding extraction stuck
Solution 1: Use GPU
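For example (the availability checks are standard PyTorch; setting the device assumes the model.device config key):
# Confirm a GPU backend is visible to PyTorch
python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'MPS:', torch.backends.mps.is_available())"
# Then set model.device to "cuda" (or "mps") in src/antibody_training_esm/conf/config.yaml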
Solution 2: Check Dataset Size
import pandas as pd
df = pd.read_csv("train.csv")
print(f"Dataset size: {len(df)}")
# Boughter: 914 sequences
# Jain: 86 sequences
# If >10k, expect longer training
Solution 3: Verify Embeddings Are Cached
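A quick check (experiments/cache/ is the default embeddings_cache_dir listed later in this guide):
ls -lh experiments/cache/
# If cache files for your model/dataset are present, re-runs should skip embedding extraction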
Data Validation Errors¶
Invalid Sequence Characters¶
Error:
ValueError: Found 3 invalid sequence(s) in batch 5:
Index 42: 'ACDEF-GHIKL...' (invalid characters: {'-'})
Index 43: 'ACDEFXGHIKL...' (invalid characters: {'X'})
Cause: Sequences contain invalid amino acids, gaps, or special characters.
Solution:
1. Check your preprocessing output - sequences should only contain valid amino acids
2. Remove gaps (-) and special characters from sequences
3. If using "X" for ambiguous residues: This is supported in v0.3.0+ (21 amino acids)
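For step 2, a minimal pandas sketch (assumes the column is named sequence; adjust to VH or your actual column name):
import pandas as pd

df = pd.read_csv("data.csv")
# Drop alignment gaps and surrounding whitespace from each sequence
df["sequence"] = df["sequence"].str.replace("-", "", regex=False).str.strip()
df.to_csv("data_clean.csv", index=False)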
Note: Prior to v0.3.0, invalid sequences were silently replaced with "M" (methionine), causing silent data corruption. Now validation prevents this.
Corrupted Embedding Cache¶
Error:
Cause: Embedding cache corrupted from:
- Training interrupted during cache write
- Old cache from pre-v0.3.0 (had bugs that created zero embeddings)
- Disk corruption
Solution: Delete the embeddings cache (rm -rf experiments/cache/) and retrain.
Note: Cache validation added in v0.3.0. Corruption now detected immediately instead of silently training on garbage data.
Wrong Embedding Shape¶
Error:
Cause: Cache was created for different dataset or is corrupted.
Solution: Clear the embeddings cache and retrain, as described under Corrupted Embedding Cache above.
Configuration Validation Errors¶
Missing Config Sections or Keys¶
Error:
ValueError: Config validation failed:
- Missing config sections: experiment
- Missing config keys: data.test_file, training.n_splits
Cause: Config YAML is incomplete or using old format.
Solution:
1. Check your config against the default Hydra config (src/antibody_training_esm/conf/config.yaml)
2. Add missing sections/keys
3. Required keys as of v0.3.0:
- data: train_file, test_file, embeddings_cache_dir (default: experiments/cache/)
- model: name, device
- training: log_level, metrics, n_splits
- experiment: name
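A minimal sketch of a config that satisfies these checks. All values here are illustrative; compare against the default Hydra config for the authoritative layout and defaults:
# Illustrative only — verify against src/antibody_training_esm/conf/config.yaml
experiment:
  name: "boughter_vh_esm1v_logreg"
data:
  train_file: "data/train/boughter/VH_only.csv"
  test_file: "data/test/jain/fragments/VH_only_jain.csv"
  embeddings_cache_dir: "experiments/cache/"
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"
  device: "cpu"
training:
  log_level: "INFO"
  metrics: ["accuracy"]
  n_splits: 5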
Note: Config validation added in v0.3.0. Validates BEFORE GPU allocation to prevent expensive failures.
Invalid Log Level¶
Error:
ValueError: Invalid log_level 'DEBG' in config. Must be one of: {'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'}
Cause: Typo in log level or invalid value.
Solution: Use one of the valid log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL (case-insensitive)
Missing CSV Columns¶
Error:
ValueError: Sequence column 'sequences' not found in data/train/boughter/VH_only.csv.
Available columns: ['id', 'sequence', 'label', 'VH_sequence']
Cause: Config specifies wrong column name or CSV is malformed.
Solution:
1. Check the "Available columns" in error message
2. Update data.sequence_column in config to match actual column name
3. Or regenerate CSV with correct column names
Note: Error message now shows available columns (v0.3.0+) to help debugging.
Cache & Persistence Errors¶
Cache Preserved on Training Failure¶
Note: Prior to v0.3.0, embedding cache was deleted even if training failed. This meant hours of GPU compute were lost on any failure.
As of v0.3.0:
- Cache only deleted after SUCCESSFUL training completion
- If training fails, cache is preserved for next attempt
- Saves expensive re-computation on config errors or data issues
Corrupted Cache from Old Version¶
Symptoms:
- NaN embeddings
- All-zero embeddings
- Wrong embedding shapes
- Silent training failures (pre-v0.3.0)
Cause: Cache created with pre-v0.3.0 code that had bugs:
- P0-6: Batch failures filled zero vectors
- P0-5: Invalid sequences replaced with "M"
- P1-1/P1-2: Division by zero created NaN embeddings
Solution:
# Delete old cache
rm -rf experiments/cache/
# Retrain with v0.3.0+ (has validation)
uv run antibody-train
Prevention: v0.3.0+ validates cache integrity on load. Corrupted cache now detected immediately.
Empty Dataset Loaded¶
Error:
ValueError: Loaded dataset is empty: data/test/jain/P5e_S2.csv
The CSV file may be corrupted or truncated. Please check the file or re-run preprocessing.
Cause: CSV file is empty, truncated, or preprocessing failed.
Solution:
1. Check file exists and has content: wc -l data/test/jain/P5e_S2.csv
2. Re-run preprocessing for that dataset
3. Check preprocessing logs for errors
Note: Dataset validation added in v0.3.0. Empty datasets now fail immediately instead of causing mysterious crashes later.
Testing Issues¶
Model Fails to Load¶
Symptoms:
Solution:
# Check model exists
ls -lh experiments/checkpoints/esm1v/logreg/
# Verify model path in command (using fragment file for compatibility)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
"Sequence column not found" in Test CSV¶
Symptoms:
ValueError: Sequence column 'sequence' not found in dataset. Available columns: ['id', 'vh_sequence', 'vl_sequence', ...]
Root Cause:
You're trying to test with a canonical file using default config:
# THIS FAILS (canonical file has vh_sequence, not sequence)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv
Solution 1: Use Fragment Files (Recommended)
Fragment files have standardized sequence column:
# THIS WORKS (fragment file has sequence column)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
Solution 2: Create Config for Canonical Files
If you need to use canonical files (for metadata access):
# test_config_canonical.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv"
sequence_column: "vh_sequence" # Override for canonical file
label_column: "label"
Then run:
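Presumably with the same --config flag that antibody-train accepts (shown under Config File Not Found below); check uv run antibody-test --help to confirm the exact option name:
uv run antibody-test --config test_config_canonical.yaml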
Understanding File Types:
| File Type | Location | Columns | Use Case |
|---|---|---|---|
| Canonical | data/test/{dataset}/canonical/ | vh_sequence, vl_sequence | Full metadata, requires config |
| Fragment | data/test/{dataset}/fragments/ | sequence, label | Standardized, works with defaults |
Check CSV columns:
head -n 1 data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv
# Output: id,vh_sequence,label (needs sequence_column: "vh_sequence")
head -n 1 data/test/jain/fragments/VH_only_jain.csv
# Output: id,sequence,label (works with defaults)
Poor Test Performance (Cross-Dataset)¶
Symptoms:
- Train CV: 71% accuracy
- Test (different dataset): 55% accuracy
Expected Behavior:
Cross-dataset generalization is inherently challenging:
- Cross-assay: ELISA → PSR (different binding measurements)
- Cross-species: Human antibodies → Nanobodies (different structure)
- Cross-source: Different labs, protocols, quality control
Solutions:
1. Use Correct Dataset Files
# Train ELISA, test ELISA (Boughter → Jain)
# Use fragment file for compatibility with default config
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
2. Tune Assay-Specific Thresholds (PSR Assays)
For ELISA → PSR prediction, adjust threshold in test config:
# test_config_psr.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/shehata/fragments/VH_only_shehata.csv"
threshold: 0.5495 # Novo Nordisk PSR threshold (default ELISA: 0.5)
Or manually in Python:
import pickle
# Load model
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
classifier = pickle.load(f)
# Get prediction probabilities
probs = classifier.predict_proba(test_embeddings)[:, 1]
# Apply PSR threshold
psr_predictions = (probs > 0.5495).astype(int)
3. Match Fragment Types
Train and test on same fragment type:
# If trained on VH, test on VH (not CDRs or FWRs)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/VH_only_shehata.csv # VH only
4. Accept Lower Performance
Cross-assay accuracy typically drops 5-10%:
- Same-assay (ELISA→ELISA): 66-71%
- Cross-assay (ELISA→PSR): 60-65%
See Research Notes - Benchmark Results for expected performance ranges.
Test Takes Too Long (Large Datasets)¶
Symptoms:
Testing Harvey (141k sequences) takes >30 minutes
Solution:
Use GPU acceleration:
# Verify GPU available
python -c "import torch; print(torch.cuda.is_available())"
# Force GPU usage
export CUDA_VISIBLE_DEVICES=0
uv run antibody-test --model experiments/checkpoints/esm1v/logreg/my_model.pkl --dataset harvey
Expected Times:
| Dataset | Size | CPU | GPU (CUDA/MPS) |
|---|---|---|---|
| Jain | 86 | 30s | 10s |
| Shehata | 398 | 2m | 30s |
| Harvey | 141k | 20m | 5-8m |
Preprocessing Issues¶
ANARCI Annotation Fails¶
Symptoms:
Solution:
Install ANARCI (requires Conda/Mamba):
# Create conda environment with ANARCI
conda create -n anarci python=3.12
conda activate anarci
conda install -c bioconda anarci
# Verify installation
anarci -h
Note: ANARCI is required for CDR/FWR extraction but not for training on pre-extracted VH sequences.
Excel File Won't Open¶
Symptoms:
Solution:
Install Excel reading libraries:
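Assuming pandas is used for the Excel parsing, openpyxl covers .xlsx files and xlrd is only needed for legacy .xls files:
uv add openpyxl
# For legacy .xls files:
uv add xlrd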
Fragment Extraction Produces Empty Sequences¶
Symptoms:
Fragment CSVs have NaN or empty strings
Solution:
Check ANARCI annotation success:
import pandas as pd
df = pd.read_csv("canonical/my_dataset.csv")
# Check annotation rate
df["has_vh"] = df["VH"].notna() & (df["VH"] != "")
print(f"VH annotation rate: {df['has_vh'].mean():.1%}")
# Expected: >90% for high-quality data
If annotation rate is low (<80%):
- Check sequence quality (valid amino acids only)
- Verify ANARCI is installed correctly
- Inspect failed sequences manually
Configuration Issues¶
YAML Syntax Error¶
Symptoms:
Solution:
Check YAML syntax:
# INCORRECT (missing space after colon)
data:
  train_file:"path/to/file.csv"

# CORRECT (space after colon)
data:
  train_file: "path/to/file.csv"
Validate YAML:
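One quick syntax check, assuming PyYAML is present in the environment (it is a transitive dependency of Hydra/OmegaConf):
uv run python -c "import yaml; yaml.safe_load(open('src/antibody_training_esm/conf/config.yaml')); print('YAML OK')"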
Config File Not Found¶
Symptoms:
Solution:
Use absolute path or ensure working directory is correct:
# From repository root
uv run antibody-train
# Or use absolute path
uv run antibody-train --config /full/path/to/config.yaml
Config Group Override Not Applied¶
Symptoms:
# Using ESM2 config group, but still trains with ESM-1v
uv run antibody-train model=esm2_650m
# Output shows: Loading model facebook/esm1v_t33_650M_UR90S_1 ❌ WRONG!
Cause: This was a critical bug (2025-11-11) where ConfigStore registrations conflicted with YAML config groups, causing Hydra to ignore config group overrides.
Status: ✅ FIXED in production (ConfigStore registrations commented out)
Solution (if you encounter this):
1. Verify config group override syntax (see the commands below)
2. Check Hydra sees the override (see the commands below)
3. If the issue persists, check that ConfigStore registrations are commented out
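For the first two checks, a sketch assuming a standard Hydra CLI (--cfg job is a built-in Hydra flag that prints the composed config without running the job):
# 1. Override syntax: key=value with no leading dashes
uv run antibody-train model=esm2_650m
# 2. Print the composed config and confirm model.name changed
uv run antibody-train model=esm2_650m --cfg job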
Historical Context: See docs/archive/investigations/2025-11-11-cli-override-bug.md for full root cause analysis.
Development / CI Issues¶
Pre-commit Hook Blocks Commit¶
Symptoms:
Solution:
This is intentional - hooks enforce code quality. Fix the errors:
# Auto-fix formatting
make format
# Check remaining issues
make lint
# Run all quality checks
make all
# Try commit again
git commit -m "Your message"
To bypass hooks (NOT RECOMMENDED):
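The standard git escape hatch, if you truly need it:
git commit --no-verify -m "Your message"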
Type Checking Fails¶
Symptoms:
Solution:
Add type annotations:
# INCORRECT (no return type)
def my_function(x):
    return x * 2

# CORRECT (with return type)
def my_function(x: int) -> int:
    return x * 2
This repository enforces 100% type safety. See Developer Guide - Type Checking for details (pending Phase 4).
Tests Fail in CI but Pass Locally¶
Symptoms:
Possible Causes:
- Environment differences - CI uses fresh environment
- Cached data - Local has cached embeddings, CI doesn't
- Random seeds - Non-deterministic test behavior
Solution:
# Test in fresh environment
rm -rf .venv experiments/cache/
uv venv
source .venv/bin/activate
uv sync
pytest
Common Error Messages¶
RuntimeError: Cannot re-initialize CUDA in forked subprocess¶
Solution:
Set multiprocessing start method:
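A minimal sketch, relevant if you are editing code that spawns worker processes (CUDA cannot be re-initialized in a forked child, so use the spawn start method):
import torch.multiprocessing as mp

if __name__ == "__main__":
    # Force the spawn start method before any CUDA work happens in workers
    mp.set_start_method("spawn", force=True)
    # Alternatively, setting num_workers=0 on a DataLoader avoids forking entirely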
pickle.UnpicklingError: invalid load key¶
Symptoms:
Model file is corrupted
Solution:
Retrain model:
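For example, removing the corrupted checkpoint and retraining (the path follows the checkpoint layout used elsewhere in this guide; substitute your own model file):
rm experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
uv run antibody-train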
torch.cuda.OutOfMemoryError: CUDA out of memory¶
See CUDA Out of Memory section above.
Getting Help¶
If you encounter an issue not covered here:
- Check existing documentation:
    - System Overview
    - Installation Guide
    - Training Guide
    - Testing Guide
- Check CI/CD logs:
    - See Developer Guide - CI/CD (pending Phase 4)
- Review historical issues:
    - See Archive for past debugging sessions
- File a GitHub issue:
    - Include: OS, Python version, GPU type, error message, minimal reproducible example
Quick Diagnostic Commands¶
Run these commands to diagnose common issues:
# Check Python version
python --version # Should be 3.12+
# Check uv installation
uv --version
# Check GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"
# Check installed packages
uv pip list
# Check repository structure
ls -lh experiments/ data/train/ data/test/
# Check embeddings cache
ls -lh experiments/cache/
# Run full quality pipeline
make all
Last Updated: 2025-11-18
Branch: dev