
Troubleshooting Guide

This guide provides solutions to common issues encountered when using the antibody training pipeline.


Installation Issues

uv command not found after installation

Symptoms:

$ uv --version
bash: uv: command not found

Solution:

Restart your terminal or manually add uv to your PATH:

# Add to ~/.bashrc or ~/.zshrc
export PATH="$HOME/.cargo/bin:$PATH"

# Reload shell config
source ~/.bashrc  # or source ~/.zshrc

Python version mismatch

Symptoms:

error: Python 3.12 is required but Python 3.10 is installed

Solution:

Install Python 3.12:

# Ubuntu/Debian
sudo apt update
sudo apt install python3.12

# macOS (Homebrew)
brew install python@3.12

# Or use pyenv
pyenv install 3.12
pyenv local 3.12

Permission denied on macOS/Linux

Symptoms:

PermissionError: [Errno 13] Permission denied

Solution:

Never use sudo with uv. Fix ownership instead:

# Fix ~/.local ownership ($USER expands to your username; omitting the
# group keeps the command portable across macOS and Linux)
sudo chown -R "$USER" ~/.local

# Fix .venv ownership (if needed)
sudo chown -R "$USER" .venv

HuggingFace Cache Permission Denied (Linux/WSL2)

Symptoms:

OSError: PermissionError at /home/user/.cache/huggingface/hub when downloading facebook/esm1v_t33_650M_UR90S_1

Or test failures with:

E   PermissionError: [Errno 13] Permission denied: '/home/user/.cache/huggingface/hub'

Root Cause:

The HuggingFace cache directory was created by a different user (often root) or has incorrect permissions.

Solution:

# Fix cache ownership ($USER expands to your username)
sudo chown -R "$USER" ~/.cache/huggingface

# OR - Delete and recreate the cache directory
rm -rf ~/.cache/huggingface
mkdir -p ~/.cache/huggingface

Verify:

ls -la ~/.cache/huggingface
# Should show your username, not 'root'

Prevention:

Never run model downloads with sudo. Use uv run commands as your regular user.


GPU / Hardware Issues

MPS Memory Issues (Apple Silicon)

Symptoms:

RuntimeError: MPS backend out of memory

Solution 1: Reduce Batch Size

# src/antibody_training_esm/conf/config.yaml
training:
  batch_size: 4  # Reduce from default (8)

Solution 2: Clear MPS Cache

import torch
torch.mps.empty_cache()
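
If memory pressure builds up over many batches, clearing the cache periodically inside the embedding loop can help. A minimal sketch with placeholder batching and embedding (substitute your own ESM forward pass for embed):

import torch

def embed(batch):  # placeholder: substitute your ESM forward pass
    return torch.randn(len(batch), 1280, device="mps")

sequence_batches = [["EVQLVESGG"] * 8 for _ in range(200)]  # placeholder batches

results = []
for i, batch in enumerate(sequence_batches):
    with torch.no_grad():
        embeddings = embed(batch)
    results.append(embeddings.cpu())  # move tensors off the GPU before clearing
    if i % 50 == 0:
        torch.mps.empty_cache()       # release cached MPS allocations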

Solution 3: Use CPU Instead

# src/antibody_training_esm/conf/config.yaml
hardware:
  device: "cpu"

Permanent Fix:

The MPS memory leak was fixed in commit 9c8e5f2. If still encountering issues, see docs/archive/investigations/2025-11-03-mps-memory-leak.md for historical context.


CUDA Out of Memory

Symptoms:

RuntimeError: CUDA out of memory. Tried to allocate XX.XX MiB

Solution 1: Reduce Batch Size

# src/antibody_training_esm/conf/config.yaml
training:
  batch_size: 4  # Reduce from default (8); lower further if needed

Solution 2: Clear CUDA Cache

import torch
torch.cuda.empty_cache()

Solution 3: Use Smaller Model

# src/antibody_training_esm/conf/config.yaml
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"  # 650M parameters
  # Instead of:
  # name: "facebook/esm2_t36_3B_UR50D"  # 3B parameters

Solution 4: Use CPU

# src/antibody_training_esm/conf/config.yaml
hardware:
  device: "cpu"

GPU Not Detected

Symptoms:

CUDA available: False
MPS available: False

Solution (CUDA):

# Check GPU is visible
nvidia-smi

# Verify PyTorch CUDA installation
python -c "import torch; print(torch.version.cuda)"

# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Solution (MPS):

# Verify macOS version (≥12.3 required)
sw_vers

# Check MPS is available
python -c "import torch; print(torch.backends.mps.is_available())"

Training Issues

ESM-1v Download Fails

Symptoms:

ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443)

Solution:

# Set HuggingFace cache directory (pick a disk with enough free space)
export HF_HOME=/path/with/free/space

# Use HuggingFace mirror (if in region with restrictions)
export HF_ENDPOINT=https://hf-mirror.com

# Retry download
uv run antibody-train
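
If the connection is flaky rather than fully blocked, pre-fetching the model with a small retry loop can also help. A minimal sketch using huggingface_hub (installed as a transformers dependency):

import time
from huggingface_hub import snapshot_download

for attempt in range(3):
    try:
        path = snapshot_download(repo_id="facebook/esm1v_t33_650M_UR90S_1")
        print(f"Model cached at: {path}")
        break
    except Exception as err:  # network errors vary by environment
        print(f"Attempt {attempt + 1} failed: {err}")
        time.sleep(10)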

"Label column not found" Error

Symptoms:

KeyError: 'label'

Solution:

Ensure training CSV has label column with 0/1 values:

sequence,label
EVQLVESGGGLV...,0
QVQLQESGPGLV...,1

If your CSV uses a different column name, rename it:

import pandas as pd
df = pd.read_csv("data.csv")
df = df.rename(columns={"polyreactivity": "label"})
df.to_csv("data_fixed.csv", index=False)

Embedding Cache Out of Sync

Symptoms:

  • Embeddings don't match expected shape
  • Predictions are random/nonsensical
  • Cache from old model version

Solution:

Clear embeddings cache and retrain:

rm -rf experiments/cache/
uv run antibody-train

Cache keys are SHA-256 hashes of:

  • Model name
  • Dataset path
  • Model revision

Any change to these inputs invalidates the cache automatically, but manual clearing guarantees a fresh start.
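
The exact key derivation lives in the pipeline code, but conceptually it is equivalent to this sketch (function and argument values here are illustrative):

import hashlib

def cache_key(model_name: str, dataset_path: str, revision: str) -> str:
    # Changing any input changes the hash, which invalidates the cache entry
    payload = f"{model_name}|{dataset_path}|{revision}".encode()
    return hashlib.sha256(payload).hexdigest()

print(cache_key("facebook/esm1v_t33_650M_UR90S_1", "data/train/boughter.csv", "main"))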


Poor Cross-Validation Performance

Symptoms:

10-Fold CV Accuracy: 52% ± 8%  # Near random

Possible Causes & Solutions:

1. Wrong Column Names

# Ensure column names match CSV file
data:
  sequence_column: "sequence"  # Default column name for sequences
  label_column: "label"        # Default column name for labels

Check your CSV has these columns:

import pandas as pd
df = pd.read_csv("train.csv")
print(df.columns)  # Should include 'sequence' and 'label'

2. Label Encoding Error

# Check label distribution
import pandas as pd
df = pd.read_csv("train.csv")
print(df["label"].value_counts())
# Should show: 0: XXX, 1: YYY (binary labels)

3. Sequence Quality Issues

# Check for invalid sequences (adjust "VH" to match your sequence column)
df["valid"] = df["VH"].str.match(r'^[ACDEFGHIKLMNPQRSTVWY]+$').fillna(False)
print(f"Invalid: {(~df['valid']).sum()}")

4. Model Not Loaded Correctly

# Verify ESM-1v downloaded
ls ~/.cache/huggingface/hub/models--facebook--esm1v_t33_650M_UR90S_1/

Training Takes Too Long

Symptoms:

  • Training runs for hours on small dataset
  • Embedding extraction stuck

Solution 1: Use GPU

# src/antibody_training_esm/conf/config.yaml
hardware:
  device: "cuda"  # or "mps" for Apple Silicon

Solution 2: Check Dataset Size

import pandas as pd
df = pd.read_csv("train.csv")
print(f"Dataset size: {len(df)}")
# Boughter: 914 sequences
# Jain: 86 sequences
# If >10k, expect longer training

Solution 3: Verify Embeddings Are Cached

# Check cache directory
ls -lh experiments/cache/
# Should see .npy files after first run

Data Validation Errors

Invalid Sequence Characters

Error:

ValueError: Found 3 invalid sequence(s) in batch 5:
  Index 42: 'ACDEF-GHIKL...' (invalid characters: {'-'})
  Index 43: 'ACDEFXGHIKL...' (invalid characters: {'X'})

Cause: Sequences contain invalid amino acids, gaps, or special characters.

Solution:

  1. Check your preprocessing output - sequences should only contain valid amino acids
  2. Remove gaps (-) and special characters from sequences
  3. If using "X" for ambiguous residues: this is supported in v0.3.0+ (21 amino acids)

Note: Prior to v0.3.0, invalid sequences were silently replaced with "M" (methionine), causing silent data corruption. Now validation prevents this.
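
To find offending sequences before training, a pre-flight check against the 21-letter alphabet (20 standard residues plus X) is straightforward. A sketch, assuming a train.csv with a sequence column:

import pandas as pd

VALID_AA = set("ACDEFGHIKLMNPQRSTVWYX")  # 20 standard amino acids + X (v0.3.0+)

df = pd.read_csv("train.csv")
bad = df[~df["sequence"].apply(lambda s: set(str(s)) <= VALID_AA)]
print(f"{len(bad)} sequences contain invalid characters")

# Example cleanup: strip gap characters introduced by alignment tools
df["sequence"] = df["sequence"].str.replace("-", "", regex=False)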


Corrupted Embedding Cache

Error:

ValueError: Embeddings from cache contain 15 NaN values
or
ValueError: Embeddings from cache contain 3 all-zero rows

Cause: Embedding cache corrupted by one of:

  • Training interrupted during cache write
  • Old cache from pre-v0.3.0 (had bugs that created zero embeddings)
  • Disk corruption

Solution:

rm -rf experiments/cache/
uv run antibody-train

Note: Cache validation added in v0.3.0. Corruption now detected immediately instead of silently training on garbage data.
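
To inspect a suspect cache before deleting it, the cached .npy files can be checked directly for the failure modes above. A quick sketch, assuming the default cache location and 2D (n_sequences, dim) arrays:

import numpy as np
from pathlib import Path

for npy in Path("experiments/cache").glob("*.npy"):
    emb = np.load(npy)
    n_nan = int(np.isnan(emb).sum())
    n_zero_rows = int((~emb.any(axis=1)).sum())
    print(f"{npy.name}: shape={emb.shape}, NaNs={n_nan}, all-zero rows={n_zero_rows}")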


Wrong Embedding Shape

Error:

ValueError: Embeddings from cache have wrong shape: expected 914 sequences, got 900

Cause: Cache was created for different dataset or is corrupted.

Solution:

rm -rf experiments/cache/
uv run antibody-train


Configuration Validation Errors

Missing Config Sections or Keys

Error:

ValueError: Config validation failed:
  - Missing config sections: experiment
  - Missing config keys: data.test_file, training.n_splits

Cause: Config YAML is incomplete or using old format.

Solution:

  1. Check your config against the default Hydra config (src/antibody_training_esm/conf/config.yaml)
  2. Add the missing sections/keys
  3. Required keys as of v0.3.0:
     • data: train_file, test_file, embeddings_cache_dir (default: experiments/cache/)
     • model: name, device
     • training: log_level, metrics, n_splits
     • experiment: name

Note: Config validation added in v0.3.0. Validates BEFORE GPU allocation to prevent expensive failures.
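
Conceptually the validator does something like the following (a simplified sketch, not the actual implementation):

REQUIRED = {
    "data": ["train_file", "test_file", "embeddings_cache_dir"],
    "model": ["name", "device"],
    "training": ["log_level", "metrics", "n_splits"],
    "experiment": ["name"],
}

def validate(cfg: dict) -> None:
    missing_sections = [s for s in REQUIRED if s not in cfg]
    missing_keys = [f"{s}.{k}" for s, keys in REQUIRED.items()
                    for k in keys if s in cfg and k not in cfg[s]]
    if missing_sections or missing_keys:
        raise ValueError("Config validation failed:\n"
                         f"  - Missing config sections: {missing_sections}\n"
                         f"  - Missing config keys: {missing_keys}")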


Invalid Log Level

Error:

ValueError: Invalid log_level 'DEBG' in config. Must be one of: {'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'}

Cause: Typo in log level or invalid value.

Solution: Use one of the valid log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL (case-insensitive)
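
A normalization sketch matching that contract (illustrative, not the pipeline's actual code):

VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}

def normalize_log_level(level: str) -> str:
    upper = level.upper()  # case-insensitive, per the config contract
    if upper not in VALID_LEVELS:
        raise ValueError(f"Invalid log_level '{level}'. Must be one of: {VALID_LEVELS}")
    return upper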


Missing CSV Columns

Error:

ValueError: Sequence column 'sequences' not found in data/train/boughter/VH_only.csv.
Available columns: ['id', 'sequence', 'label', 'VH_sequence']

Cause: Config specifies wrong column name or CSV is malformed.

Solution:

  1. Check the "Available columns" list in the error message
  2. Update data.sequence_column in the config to match the actual column name
  3. Or regenerate the CSV with correct column names

Note: Error message now shows available columns (v0.3.0+) to help debugging.


Cache & Persistence Errors

Cache Preserved on Training Failure

Note: Prior to v0.3.0, embedding cache was deleted even if training failed. This meant hours of GPU compute were lost on any failure.

As of v0.3.0:

  • Cache is only deleted after SUCCESSFUL training completion
  • If training fails, the cache is preserved for the next attempt
  • This saves expensive re-computation on config errors or data issues
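
The pattern is the usual "clean up only on success". A conceptual sketch (the helper names are hypothetical, not the pipeline's actual API):

def train_with_cache(cfg):
    embeddings = load_or_compute_embeddings(cfg)  # hypothetical helper
    results = run_training(cfg, embeddings)       # raises on failure, leaving cache intact
    clear_embedding_cache(cfg)                    # only reached after success
    return results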


Corrupted Cache from Old Version

Symptoms:

  • NaN embeddings
  • All-zero embeddings
  • Wrong embedding shapes
  • Silent training failures (pre-v0.3.0)

Cause: Cache created with pre-v0.3.0 code that had bugs:

  • P0-6: Batch failures filled zero vectors
  • P0-5: Invalid sequences replaced with "M"
  • P1-1/P1-2: Division by zero created NaN embeddings

Solution:

# Delete old cache
rm -rf experiments/cache/

# Retrain with v0.3.0+ (has validation)
uv run antibody-train

Prevention: v0.3.0+ validates cache integrity on load. Corrupted cache now detected immediately.


Empty Dataset Loaded

Error:

ValueError: Loaded dataset is empty: data/test/jain/P5e_S2.csv
The CSV file may be corrupted or truncated. Please check the file or re-run preprocessing.

Cause: CSV file is empty, truncated, or preprocessing failed.

Solution:

  1. Check the file exists and has content: wc -l data/test/jain/P5e_S2.csv
  2. Re-run preprocessing for that dataset
  3. Check the preprocessing logs for errors

Note: Dataset validation added in v0.3.0. Empty datasets now fail immediately instead of causing mysterious crashes later.


Testing Issues

Model Fails to Load

Symptoms:

FileNotFoundError: experiments/checkpoints/esm1v/logreg/my_model.pkl not found

Solution:

# Check model exists
ls -lh experiments/checkpoints/esm1v/logreg/

# Verify the model path in the command (use the fragment file for compatibility)
uv run antibody-test \
  --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
  --data data/test/jain/fragments/VH_only_jain.csv

"Sequence column not found" in Test CSV

Symptoms:

ValueError: Sequence column 'sequence' not found in dataset. Available columns: ['id', 'vh_sequence', 'vl_sequence', ...]

Root Cause:

You're trying to test a canonical file with the default config:

# THIS FAILS (canonical file has vh_sequence, not sequence)
uv run antibody-test \
  --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
  --data data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv

Solution 1: Use Fragment Files (Recommended)

Fragment files have a standardized sequence column:

# THIS WORKS (fragment file has sequence column)
uv run antibody-test \
  --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
  --data data/test/jain/fragments/VH_only_jain.csv

Solution 2: Create Config for Canonical Files

If you need to use canonical files (for metadata access):

# test_config_canonical.yaml
model_paths:
  - "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"
data_paths:
  - "data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv"
sequence_column: "vh_sequence"  # Override for canonical file
label_column: "label"

Then run:

uv run antibody-test --config test_config_canonical.yaml

Understanding File Types:

File Type   Location                         Columns                    Use Case
Canonical   data/test/{dataset}/canonical/   vh_sequence, vl_sequence   Full metadata, requires config
Fragment    data/test/{dataset}/fragments/   sequence, label            Standardized, works with defaults

Check CSV columns:

head -n 1 data/test/jain/canonical/VH_only_jain_86_p5e_s2.csv
# Output: id,vh_sequence,label (needs sequence_column: "vh_sequence")

head -n 1 data/test/jain/fragments/VH_only_jain.csv
# Output: id,sequence,label (works with defaults)


Poor Test Performance (Cross-Dataset)

Symptoms:

  • Train CV: 71% accuracy
  • Test (different dataset): 55% accuracy

Expected Behavior:

Cross-dataset generalization is inherently challenging:

  • Cross-assay: ELISA → PSR (different binding measurements)
  • Cross-species: Human antibodies → Nanobodies (different structure)
  • Cross-source: Different labs, protocols, quality control

Solutions:

1. Use Correct Dataset Files

# Train ELISA, test ELISA (Boughter → Jain)
# Use fragment file for compatibility with default config
uv run antibody-test \
  --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
  --data data/test/jain/fragments/VH_only_jain.csv

2. Tune Assay-Specific Thresholds (PSR Assays)

For ELISA → PSR prediction, adjust threshold in test config:

# test_config_psr.yaml
model_paths:
  - "experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl"

data_paths:
  - "data/test/shehata/fragments/VH_only_shehata.csv"

threshold: 0.5495  # Novo Nordisk PSR threshold (default ELISA: 0.5)

Or manually in Python:

import pickle

# Load model
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    classifier = pickle.load(f)

# Get prediction probabilities (test_embeddings: your precomputed ESM embeddings)
probs = classifier.predict_proba(test_embeddings)[:, 1]

# Apply PSR threshold
psr_predictions = (probs > 0.5495).astype(int)

3. Match Fragment Types

Train and test on same fragment type:

# If trained on VH, test on VH (not CDRs or FWRs)
uv run antibody-test \
  --model experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl \
  --data data/test/shehata/fragments/VH_only_shehata.csv  # VH only

4. Accept Lower Performance

Cross-assay accuracy typically drops 5-10%:

  • Same-assay (ELISA→ELISA): 66-71%
  • Cross-assay (ELISA→PSR): 60-65%

See Research Notes - Benchmark Results for expected performance ranges.


Test Takes Too Long (Large Datasets)

Symptoms:

Testing Harvey (141k sequences) takes >30 minutes

Solution:

Use GPU acceleration:

# Verify GPU available
python -c "import torch; print(torch.cuda.is_available())"

# Force GPU usage
export CUDA_VISIBLE_DEVICES=0
uv run antibody-test --model experiments/checkpoints/esm1v/logreg/my_model.pkl --dataset harvey

Expected Times:

Dataset   Size   CPU   GPU (CUDA/MPS)
Jain      86     30s   10s
Shehata   398    2m    30s
Harvey    141k   20m   5-8m

Preprocessing Issues

ANARCI Annotation Fails

Symptoms:

Command 'anarci' not found

Solution:

Install ANARCI (requires Conda/Mamba):

# Create conda environment with ANARCI
conda create -n anarci python=3.12
conda activate anarci
conda install -c bioconda anarci

# Verify installation
anarci -h

Note: ANARCI is required for CDR/FWR extraction but not for training on pre-extracted VH sequences.


Excel File Won't Open

Symptoms:

ImportError: Missing optional dependency 'openpyxl'

Solution:

Install Excel reading libraries:

uv pip install openpyxl xlrd

Fragment Extraction Produces Empty Sequences

Symptoms:

Fragment CSVs have NaN or empty strings

Solution:

Check ANARCI annotation success:

import pandas as pd
df = pd.read_csv("canonical/my_dataset.csv")

# Check annotation rate
df["has_vh"] = df["VH"].notna() & (df["VH"] != "")
print(f"VH annotation rate: {df['has_vh'].mean():.1%}")

# Expected: >90% for high-quality data

If annotation rate is low (<80%):

  1. Check sequence quality (valid amino acids only)
  2. Verify ANARCI is installed correctly
  3. Inspect failed sequences manually

Configuration Issues

YAML Syntax Error

Symptoms:

yaml.scanner.ScannerError: while scanning a simple key

Solution:

Check YAML syntax:

# INCORRECT (missing space after colon)
data:
  train_file:"path/to/file.csv"

# CORRECT (space after colon)
data:
  train_file: "path/to/file.csv"

Validate YAML:

python -c "import yaml; yaml.safe_load(open('src/antibody_training_esm/conf/config.yaml'))"

Config File Not Found

Symptoms:

FileNotFoundError: configs/my_config.yaml not found

Solution:

Use absolute path or ensure working directory is correct:

# From repository root
uv run antibody-train

# Or use absolute path
uv run antibody-train --config /full/path/to/config.yaml

Config Group Override Not Applied

Symptoms:

# Using ESM2 config group, but still trains with ESM-1v
uv run antibody-train model=esm2_650m
# Output shows: Loading model facebook/esm1v_t33_650M_UR90S_1  ❌ WRONG!

Cause: This was a critical bug (2025-11-11) where ConfigStore registrations conflicted with YAML config groups, causing Hydra to ignore config group overrides.

Status: FIXED in production (ConfigStore registrations commented out)

Solution (if you encounter this):

  1. Verify config group override syntax:

    # CORRECT - config group override
    antibody-train model=esm2_650m
    
    # ALSO CORRECT - field override
    antibody-train model.name=facebook/esm2_t33_650M_UR50D
    

  2. Check Hydra sees the override:

    antibody-train --cfg job model=esm2_650m | grep "model:"
    # Should show: model.name: facebook/esm2_t33_650M_UR50D
    

  3. If issue persists, check ConfigStore registrations are commented out:

    grep -n "cs.store" src/antibody_training_esm/conf/config_schema.py
    # All lines should start with "#" (commented)
    

Historical Context: See docs/archive/investigations/2025-11-11-cli-override-bug.md for full root cause analysis.


Development / CI Issues

Pre-commit Hook Blocks Commit

Symptoms:

ruff....................................Failed
- hook id: ruff
- exit code: 1

Solution:

This is intentional - hooks enforce code quality. Fix the errors:

# Auto-fix formatting
make format

# Check remaining issues
make lint

# Run all quality checks
make all

# Try commit again
git commit -m "Your message"

To bypass hooks (NOT RECOMMENDED):

git commit --no-verify -m "Your message"

Type Checking Fails

Symptoms:

error: Function is missing a return type annotation

Solution:

Add type annotations:

# INCORRECT (no return type)
def my_function(x):
    return x * 2

# CORRECT (with return type)
def my_function(x: int) -> int:
    return x * 2

This repository enforces 100% type safety. See Developer Guide - Type Checking for details (pending Phase 4).


Tests Fail in CI but Pass Locally

Symptoms:

# Local
pytest  # All pass

# CI
pytest  # Some fail

Possible Causes:

  1. Environment differences - CI uses fresh environment
  2. Cached data - Local has cached embeddings, CI doesn't
  3. Random seeds - Non-deterministic test behavior (see the seeding sketch below)

Solution:

# Test in fresh environment
rm -rf .venv experiments/cache/
uv venv
source .venv/bin/activate
uv sync
pytest
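
If the remaining failures come from random seeds, pinning all RNGs in a shared fixture usually makes them reproducible in both environments. A conftest.py sketch (assuming pytest with torch and numpy available in the test environment):

# conftest.py
import random

import numpy as np
import pytest
import torch

@pytest.fixture(autouse=True)
def fixed_seeds():
    # Pin every RNG so tests behave identically locally and in CI
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)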

Common Error Messages

RuntimeError: Cannot re-initialize CUDA in forked subprocess

Solution:

Set multiprocessing start method:

import multiprocessing

# Set the start method before any CUDA work (e.g., at the top of your entry point)
multiprocessing.set_start_method("spawn", force=True)

pickle.UnpicklingError: invalid load key

Symptoms:

Model file is corrupted

Solution:

Retrain model:

rm experiments/checkpoints/esm1v/logreg/corrupted_model.pkl
uv run antibody-train

torch.cuda.OutOfMemoryError: CUDA out of memory

See CUDA Out of Memory section above.


Getting Help

If you encounter an issue not covered here:

  1. Check existing documentation:
     • System Overview
     • Installation Guide
     • Training Guide
     • Testing Guide
     • Preprocessing Guide

  2. Check CI/CD logs: see Developer Guide - CI/CD (pending Phase 4)

  3. Review historical issues: see Archive for past debugging sessions

  4. File a GitHub issue: include OS, Python version, GPU type, error message, and a minimal reproducible example

Quick Diagnostic Commands

Run these commands to diagnose common issues:

# Check Python version
python --version  # Should be 3.12+

# Check uv installation
uv --version

# Check GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"

# Check installed packages
uv pip list

# Check repository structure
ls -lh experiments/ data/train/ data/test/

# Check embeddings cache
ls -lh experiments/cache/

# Run full quality pipeline
make all
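
The same checks can be bundled into a single Python snippet for bug reports (uses only standard torch and platform APIs):

import platform

import torch

print(f"Python:  {platform.python_version()}")  # should be 3.12+
print(f"PyTorch: {torch.__version__}")
print(f"CUDA:    {torch.cuda.is_available()}")
print(f"MPS:     {torch.backends.mps.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:     {torch.cuda.get_device_name(0)}")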

Last Updated: 2025-11-18
Branch: dev