Troubleshooting Guide¶
This guide provides solutions to common issues encountered when using the antibody training pipeline.
Installation Issues¶
uv command not found after installation¶
Symptoms:
Solution:
Restart your terminal or manually add uv to your PATH:
# Add to ~/.bashrc or ~/.zshrc
export PATH="$HOME/.cargo/bin:$PATH"
# Reload shell config
source ~/.bashrc # or source ~/.zshrc
Python version mismatch¶
Symptoms:
Solution:
Install Python 3.12:
# Ubuntu/Debian
sudo apt update
sudo apt install python3.12
# macOS (Homebrew)
brew install python@3.12
# Or use pyenv
pyenv install 3.12
pyenv local 3.12
Permission denied on macOS/Linux¶
Symptoms:
Solution:
Never use sudo with uv. Fix ownership instead:
# Fix ~/.local ownership
sudo chown -R $USER:$USER ~/.local
# Fix .venv ownership (if needed)
sudo chown -R $USER:$USER .venv
HuggingFace Cache Permission Denied (Linux/WSL2)¶
Symptoms:
OSError: PermissionError at /home/user/.cache/huggingface/hub when downloading facebook/esm1v_t33_650M_UR90S_1
Or test failures with:
Root Cause:
The HuggingFace cache directory was created by a different user (often root) or has incorrect permissions.
Solution:
# Fix cache ownership ($USER expands to your username)
sudo chown -R $USER:$USER ~/.cache/huggingface
# OR - Delete and recreate the cache directory
rm -rf ~/.cache/huggingface
mkdir -p ~/.cache/huggingface
Verify:
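For example, confirm the directory is now owned by your user (exact owner/group names will differ per system):
ls -ld ~/.cache/huggingface
# The owner shown should be your username, not root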
Prevention:
Never run model downloads with sudo. Use uv run commands as your regular user.
GPU / Hardware Issues¶
MPS Memory Issues (Apple Silicon)¶
Symptoms:
Solution 1: Reduce Batch Size
Solution 2: Clear MPS Cache
Solution 3: Use CPU Instead
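Illustrative sketches of the three options above. The config keys follow the training.batch_size and model.device layout used elsewhere in this guide; the torch.mps call is standard in recent PyTorch releases and only matters if you are editing the training code yourself.
# Solutions 1 & 3: src/antibody_training_esm/conf/config.yaml
training:
  batch_size: 4 # Solution 1: lower until MPS stops running out of memory
model:
  device: "cpu" # Solution 3: fall back to CPU if MPS keeps failing
# Solution 2: release cached MPS memory periodically (only if editing the training loop)
import torch
if torch.backends.mps.is_available():
    torch.mps.empty_cache()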
Permanent Fix:
The MPS memory leak was fixed in commit 9c8e5f2. If still encountering issues, see docs/archive/investigations/2025-11-03-mps-memory-leak.md for historical context.
CUDA Out of Memory¶
Symptoms:
Solution 1: Reduce Batch Size
# src/antibody_training_esm/conf/config.yaml
training:
  batch_size: 8 # Default; lower further if needed
Solution 2: Clear CUDA Cache
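A minimal sketch, relevant only if you are modifying the training code yourself (torch.cuda.empty_cache() releases memory held by PyTorch's caching allocator back to the driver):
import torch
# Release cached (unused) GPU memory between folds or epochs
if torch.cuda.is_available():
    torch.cuda.empty_cache()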
Solution 3: Use Smaller Model
# src/antibody_training_esm/conf/config.yaml
model:
  name: "facebook/esm1v_t33_650M_UR90S_1" # 650M parameters
  # Instead of:
  # name: "facebook/esm2_t36_3B_UR50D" # 3B parameters
Solution 4: Use CPU
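A hedged sketch, reusing the model.device key listed under the required config keys later in this guide:
# src/antibody_training_esm/conf/config.yaml
model:
  device: "cpu" # Much slower for embedding extraction, but avoids GPU memory limits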
GPU Not Detected¶
Symptoms:
Solution (CUDA):
# Check GPU is visible
nvidia-smi
# Verify PyTorch CUDA installation
python -c "import torch; print(torch.version.cuda)"
# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Solution (MPS):
# Verify macOS version (≥12.3 required)
sw_vers
# Check MPS is available
python -c "import torch; print(torch.backends.mps.is_available())"
Training Issues¶
ESM-1v Download Fails¶
Symptoms:
Solution:
# Set the HuggingFace cache directory to a location with enough free disk space
export HF_HOME=/path/with/enough/space
# Use HuggingFace mirror (if in region with restrictions)
export HF_ENDPOINT=https://hf-mirror.com
# Retry download
uv run antibody-train
"Label column not found" Error¶
Symptoms:
Solution:
Ensure training CSV has label column with 0/1 values:
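For example, a minimal valid training CSV might look like this (illustrative, truncated sequences):
sequence,label
EVQLVESGGGLVQPGG...,1
QVQLQESGPGLVKPSE...,0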
If using different column name, update CSV:
import pandas as pd
df = pd.read_csv("data.csv")
df = df.rename(columns={"polyreactivity": "label"})
df.to_csv("data_fixed.csv", index=False)
Embedding Cache Out of Sync¶
Symptoms:
- Embeddings don't match expected shape
- Predictions are random/nonsensical
- Cache from old model version
Solution:
Clear embeddings cache and retrain:
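The same commands shown under Cache & Persistence Errors below apply here:
# Delete the embeddings cache
rm -rf experiments/cache/
# Retrain from scratch
uv run antibody-train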
The cache key is a SHA-256 hash of:
- Model name
- Dataset path
- Model revision
Any change to these invalidates the cache automatically, but manual clearing ensures a fresh start.
Poor Cross-Validation Performance¶
Symptoms:
Possible Causes & Solutions:
1. Wrong Column Names
# Ensure column names match CSV file
data:
  sequence_column: "sequence" # Default column name for sequences
  label_column: "label" # Default column name for labels
Check your CSV has these columns:
import pandas as pd
df = pd.read_csv("train.csv")
print(df.columns) # Should include 'sequence' and 'label'
2. Label Encoding Error
# Check label distribution
import pandas as pd
df = pd.read_csv("train.csv")
print(df["label"].value_counts())
# Should show: 0: XXX, 1: YYY (binary labels)
3. Sequence Quality Issues
# Check for invalid sequences
df["valid"] = df["VH"].str.match(r'^[ACDEFGHIKLMNPQRSTVWY]+$')
print(f"Invalid: {(~df['valid']).sum()}")
4. Model Not Loaded Correctly
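A quick sanity check, sketched with the transformers API (the 1280-dimensional hidden size applies to the 650M ESM-1v checkpoint; adjust the name if you trained with a different model):
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm1v_t33_650M_UR90S_1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
print(model.config.hidden_size)  # ESM-1v (650M) embeddings are 1280-dimensional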
Training Takes Too Long¶
Symptoms:
- Training runs for hours on small dataset
- Embedding extraction stuck
Solution 1: Use GPU
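For example (the availability checks are standard PyTorch; setting the device assumes the model.device config key):
# Confirm a GPU backend is visible to PyTorch
python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'MPS:', torch.backends.mps.is_available())"
# Then set model.device to "cuda" (or "mps") in src/antibody_training_esm/conf/config.yaml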
Solution 2: Check Dataset Size
import pandas as pd
df = pd.read_csv("train.csv")
print(f"Dataset size: {len(df)}")
# Boughter: 914 sequences
# Jain: 86 sequences
# If >10k, expect longer training
Solution 3: Verify Embeddings Are Cached
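A quick check (experiments/cache/ is the default embeddings_cache_dir listed later in this guide):
ls -lh experiments/cache/
# If cache files for your model/dataset are present, re-runs should skip embedding extraction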
Data Validation Errors¶
Invalid Sequence Characters¶
Error:
ValueError: Found 3 invalid sequence(s) in batch 5:
Index 42: 'ACDEF-GHIKL...' (invalid characters: {'-'})
Index 43: 'ACDEFXGHIKL...' (invalid characters: {'X'})
Cause: Sequences contain invalid amino acids, gaps, or special characters.
Solution:
1. Check your preprocessing output - sequences should only contain valid amino acids
2. Remove gaps (-) and special characters from sequences
3. If using "X" for ambiguous residues: This is supported in v0.3.0+ (21 amino acids)
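For step 2, a minimal pandas sketch (assumes the column is named sequence; adjust to VH or your actual column name):
import pandas as pd

df = pd.read_csv("data.csv")
# Drop alignment gaps and surrounding whitespace from each sequence
df["sequence"] = df["sequence"].str.replace("-", "", regex=False).str.strip()
df.to_csv("data_clean.csv", index=False)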
Note: Prior to v0.3.0, invalid sequences were silently replaced with "M" (methionine), causing silent data corruption. Now validation prevents this.
Corrupted Embedding Cache¶
Error:
Cause: Embedding cache corrupted from:
- Training interrupted during cache write
- Old cache from pre-v0.3.0 (had bugs that created zero embeddings)
- Disk corruption
Solution: Delete the embeddings cache (rm -rf experiments/cache/) and retrain.
Note: Cache validation added in v0.3.0. Corruption now detected immediately instead of silently training on garbage data.
Wrong Embedding Shape¶
Error:
Cause: Cache was created for different dataset or is corrupted.
Solution: Clear the embeddings cache and retrain, as described under Corrupted Embedding Cache above.
Configuration Validation Errors¶
Missing Config Sections or Keys¶
Error:
ValueError: Config validation failed:
- Missing config sections: experiment
- Missing config keys: data.test_file, training.n_splits
Cause: Config YAML is incomplete or using old format.
Solution:
1. Check your config against the default Hydra config (src/antibody_training_esm/conf/config.yaml)
2. Add missing sections/keys
3. Required keys as of v0.3.0:
- data: train_file, test_file, embeddings_cache_dir (default: experiments/cache/)
- model: name, device
- training: log_level, metrics, n_splits
- experiment: name
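A minimal sketch of a config that satisfies these checks. All values here are illustrative; compare against the default Hydra config for the authoritative layout and defaults:
# Illustrative only — verify against src/antibody_training_esm/conf/config.yaml
experiment:
  name: "boughter_vh_esm1v_logreg"
data:
  train_file: "data/train/boughter/VH_only.csv"
  test_file: "data/test/jain/fragments/VH_only_jain.csv"
  embeddings_cache_dir: "experiments/cache/"
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"
  device: "cpu"
training:
  log_level: "INFO"
  metrics: ["accuracy"]
  n_splits: 5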
Note: Config validation added in v0.3.0. Validates BEFORE GPU allocation to prevent expensive failures.
Invalid Log Level¶
Error:
ValueError: Invalid log_level 'DEBG' in config. Must be one of: {'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'}
Cause: Typo in log level or invalid value.
Solution: Use one of the valid log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL (case-insensitive)
Missing CSV Columns¶
Error:
ValueError: Sequence column 'sequences' not found in data/train/boughter/VH_only.csv.
Available columns: ['id', 'sequence', 'label', 'VH_sequence']
Cause: Config specifies wrong column name or CSV is malformed.
Solution:
1. Check the "Available columns" in error message
2. Update data.sequence_column in config to match actual column name
3. Or regenerate CSV with correct column names
Note: Error message now shows available columns (v0.3.0+) to help debugging.
Cache & Persistence Errors¶
Cache Preserved on Training Failure¶
Note: Prior to v0.3.0, embedding cache was deleted even if training failed. This meant hours of GPU compute were lost on any failure.
As of v0.3.0:
- Cache only deleted after SUCCESSFUL training completion
- If training fails, cache is preserved for next attempt
- Saves expensive re-computation on config errors or data issues
Corrupted Cache from Old Version¶
Symptoms:
- NaN embeddings
- All-zero embeddings
- Wrong embedding shapes
- Silent training failures (pre-v0.3.0)
Cause: Cache created with pre-v0.3.0 code that had bugs:
- P0-6: Batch failures filled zero vectors
- P0-5: Invalid sequences replaced with "M"
- P1-1/P1-2: Division by zero created NaN embeddings
Solution:
# Delete old cache
rm -rf experiments/cache/
# Retrain with v0.3.0+ (has validation)
uv run antibody-train
Prevention: v0.3.0+ validates cache integrity on load. Corrupted cache now detected immediately.
Empty Dataset Loaded¶
Error:
ValueError: Loaded dataset is empty: data/test/jain/P5e_S2.csv
The CSV file may be corrupted or truncated. Please check the file or re-run preprocessing.
Cause: CSV file is empty, truncated, or preprocessing failed.
Solution:
1. Check file exists and has content: wc -l data/test/jain/P5e_S2.csv
2. Re-run preprocessing for that dataset
3. Check preprocessing logs for errors
Note: Dataset validation added in v0.3.0. Empty datasets now fail immediately instead of causing mysterious crashes later.
Testing Issues¶
Model Fails to Load¶
Symptoms:
Solution:
# Check model exists
ls -lh experiments/checkpoints/esm1v/logreg/
# Verify model path in command (using fragment file for compatibility)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
"Sequence column not found" in Test CSV¶
Symptoms:
ValueError: Sequence column 'sequence' not found in dataset. Available columns: ['id', 'vh_sequence', 'vl_sequence', ...]
Root Cause:
You're trying to test with a canonical file using default config:
# THIS FAILS (canonical file has vh_sequence, not sequence)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv
Solution 1: Use Fragment Files (Recommended)
Fragment files have standardized sequence column:
# THIS WORKS (fragment file has sequence column)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
Solution 2: Create Config for Canonical Files
If you need to use canonical files (for metadata access):
# test_config_canonical.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv"
sequence_column: "vh_sequence" # Override for canonical file
label_column: "label"
Then run:
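Presumably with the same --config flag that antibody-train accepts (shown under Config File Not Found below); check uv run antibody-test --help to confirm the exact option name:
uv run antibody-test --config test_config_canonical.yaml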
Understanding File Types:
| File Type | Location | Columns | Use Case |
|---|---|---|---|
| Canonical | data/test/{dataset}/canonical/ | vh_sequence, vl_sequence | Full metadata, requires config |
| Fragment | data/test/{dataset}/fragments/ | sequence, label | Standardized, works with defaults |
Check CSV columns:
head -n 1 data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv
# Output: id,vh_sequence,label (needs sequence_column: "vh_sequence")
head -n 1 data/test/jain/fragments/VH_only_jain.csv
# Output: id,sequence,label (works with defaults)
Poor Test Performance (Cross-Dataset)¶
Symptoms:
- Train CV: 71% accuracy
- Test (different dataset): 55% accuracy
Expected Behavior:
Cross-dataset generalization is inherently challenging:
- Cross-assay: ELISA → PSR (different binding measurements)
- Cross-species: Human antibodies → Nanobodies (different structure)
- Cross-source: Different labs, protocols, quality control
Solutions:
1. Use Correct Dataset Files
# Train ELISA, test ELISA (Boughter → Jain)
# Use fragment file for compatibility with default config
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/jain/fragments/VH_only_jain.csv
2. Tune Assay-Specific Thresholds (PSR Assays)
For ELISA → PSR prediction, adjust threshold in test config:
# test_config_psr.yaml
model_paths:
- "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
- "data/test/shehata/fragments/VH_only_shehata.csv"
threshold: 0.5495 # Novo Nordisk PSR threshold (default ELISA: 0.5)
Or manually in Python:
import pickle
# Load model
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
classifier = pickle.load(f)
# Get prediction probabilities
probs = classifier.predict_proba(test_embeddings)[:, 1]
# Apply PSR threshold
psr_predictions = (probs > 0.5495).astype(int)
3. Match Fragment Types
Train and test on same fragment type:
# If trained on VH, test on VH (not CDRs or FWRs)
uv run antibody-test \
--model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
--data data/test/shehata/fragments/VH_only_shehata.csv # VH only
4. Accept Lower Performance
Cross-assay accuracy typically drops 5-10%:
- Same-assay (ELISA→ELISA): 66-71%
- Cross-assay (ELISA→PSR): 60-65%
See Research Notes - Benchmark Results for expected performance ranges.
Test Takes Too Long (Large Datasets)¶
Symptoms:
Testing Harvey (141k sequences) takes >30 minutes
Solution:
Use GPU acceleration:
# Verify GPU available
python -c "import torch; print(torch.cuda.is_available())"
# Force GPU usage
export CUDA_VISIBLE_DEVICES=0
uv run antibody-test --model experiments/checkpoints/esm1v/logreg/my_model.pkl --dataset harvey
Expected Times:
| Dataset | Size | CPU | GPU (CUDA/MPS) |
|---|---|---|---|
| Jain | 86 | 30s | 10s |
| Shehata | 398 | 2m | 30s |
| Harvey | 141k | 20m | 5-8m |
Preprocessing Issues¶
ANARCI Annotation Fails¶
Symptoms:
Solution:
Install ANARCI (requires Conda/Mamba):
# Create conda environment with ANARCI
conda create -n anarci python=3.12
conda activate anarci
conda install -c bioconda anarci
# Verify installation
anarci -h
Note: ANARCI is required for CDR/FWR extraction but not for training on pre-extracted VH sequences.
Excel File Won't Open¶
Symptoms:
Solution:
Install Excel reading libraries:
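Assuming pandas is used for the Excel parsing, openpyxl covers .xlsx files and xlrd is only needed for legacy .xls files:
uv add openpyxl
# For legacy .xls files:
uv add xlrd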
Fragment Extraction Produces Empty Sequences¶
Symptoms:
Fragment CSVs have NaN or empty strings
Solution:
Check ANARCI annotation success:
import pandas as pd
df = pd.read_csv("canonical/my_dataset.csv")
# Check annotation rate
df["has_vh"] = df["VH"].notna() & (df["VH"] != "")
print(f"VH annotation rate: {df['has_vh'].mean():.1%}")
# Expected: >90% for high-quality data
If annotation rate is low (<80%):
- Check sequence quality (valid amino acids only)
- Verify ANARCI is installed correctly
- Inspect failed sequences manually
Configuration Issues¶
YAML Syntax Error¶
Symptoms:
Solution:
Check YAML syntax:
# INCORRECT (missing space after colon)
data:
  train_file:"path/to/file.csv"

# CORRECT (space after colon)
data:
  train_file: "path/to/file.csv"
Validate YAML:
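One quick syntax check, assuming PyYAML is present in the environment (it is a transitive dependency of Hydra/OmegaConf):
uv run python -c "import yaml; yaml.safe_load(open('src/antibody_training_esm/conf/config.yaml')); print('YAML OK')"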
Config File Not Found¶
Symptoms:
Solution:
Use absolute path or ensure working directory is correct:
# From repository root
uv run antibody-train
# Or use absolute path
uv run antibody-train --config /full/path/to/config.yaml
Config Group Override Not Applied¶
Symptoms:
# Using ESM2 config group, but still trains with ESM-1v
uv run antibody-train model=esm2_650m
# Output shows: Loading model facebook/esm1v_t33_650M_UR90S_1 ❌ WRONG!
Cause: This was a critical bug (2025-11-11) where ConfigStore registrations conflicted with YAML config groups, causing Hydra to ignore config group overrides.
Status: ✅ FIXED in production (ConfigStore registrations commented out)
Solution (if you encounter this):
1. Verify config group override syntax (see the commands below)
2. Check Hydra sees the override (see the commands below)
3. If the issue persists, check that ConfigStore registrations are commented out
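For the first two checks, a sketch assuming a standard Hydra CLI (--cfg job is a built-in Hydra flag that prints the composed config without running the job):
# 1. Override syntax: key=value with no leading dashes
uv run antibody-train model=esm2_650m
# 2. Print the composed config and confirm model.name changed
uv run antibody-train model=esm2_650m --cfg job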
Historical Context: See docs/archive/investigations/2025-11-11-cli-override-bug.md for full root cause analysis.
Development / CI Issues¶
Pre-commit Hook Blocks Commit¶
Symptoms:
Solution:
This is intentional - hooks enforce code quality. Fix the errors:
# Auto-fix formatting
make format
# Check remaining issues
make lint
# Run all quality checks
make all
# Try commit again
git commit -m "Your message"
To bypass hooks (NOT RECOMMENDED):
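The standard git escape hatch, if you truly need it:
git commit --no-verify -m "Your message"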
Type Checking Fails¶
Symptoms:
Solution:
Add type annotations:
# INCORRECT (no return type)
def my_function(x):
    return x * 2

# CORRECT (with return type)
def my_function(x: int) -> int:
    return x * 2
This repository enforces 100% type safety. See Developer Guide - Type Checking for details (pending Phase 4).
Tests Fail in CI but Pass Locally¶
Symptoms:
Possible Causes:
- Environment differences - CI uses fresh environment
- Cached data - Local has cached embeddings, CI doesn't
- Random seeds - Non-deterministic test behavior
Solution:
# Test in fresh environment
rm -rf .venv experiments/cache/
uv venv
source .venv/bin/activate
uv sync
pytest
Common Error Messages¶
RuntimeError: Cannot re-initialize CUDA in forked subprocess¶
Solution:
Set multiprocessing start method:
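A minimal sketch, relevant if you are editing code that spawns worker processes (CUDA cannot be re-initialized in a forked child, so use the spawn start method):
import torch.multiprocessing as mp

if __name__ == "__main__":
    # Force the spawn start method before any CUDA work happens in workers
    mp.set_start_method("spawn", force=True)
    # Alternatively, setting num_workers=0 on a DataLoader avoids forking entirely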
pickle.UnpicklingError: invalid load key¶
Symptoms:
Model file is corrupted
Solution:
Retrain model:
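For example, removing the corrupted checkpoint and retraining (the path follows the checkpoint layout used elsewhere in this guide; substitute your own model file):
rm experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl
uv run antibody-train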
torch.cuda.OutOfMemoryError: CUDA out of memory¶
See CUDA Out of Memory section above.
Getting Help¶
If you encounter an issue not covered here:
- Check existing documentation:
    - System Overview
    - Installation Guide
    - Training Guide
    - Testing Guide
- Check CI/CD logs:
    - See Developer Guide - CI/CD (pending Phase 4)
- Review historical issues:
    - See Archive for past debugging sessions
- File a GitHub issue:
    - Include: OS, Python version, GPU type, error message, minimal reproducible example
Quick Diagnostic Commands¶
Run these commands to diagnose common issues:
# Check Python version
python --version # Should be 3.12+
# Check uv installation
uv --version
# Check GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"
# Check installed packages
uv pip list
# Check repository structure
ls -lh experiments/ data/train/ data/test/
# Check embeddings cache
ls -lh experiments/cache/
# Run full quality pipeline
make all
Last Updated: 2025-11-18
Branch: dev