
Security

Target Audience: Developers handling security concerns and dependency management

Purpose: Understand the security model, approved practices, and how to maintain security posture


When to Use This Guide

Use this guide if you're:

  • ✅ Adding dependencies (need to audit for CVEs)
  • ✅ Using pickle (understand approved use cases)
  • ✅ Fixing security findings (Bandit, CodeQL, pip-audit)
  • ✅ Running security scans (local and CI)
  • ✅ Understanding the threat model (what's in scope, what's not)



Security Model

Threat Model

Context: Research codebase for antibody classification (NOT production API deployment)

In Scope:

  • ✅ Code-level vulnerabilities (SQL injection, XSS, command injection)
  • ✅ Dependency vulnerabilities (CVEs in packages)
  • ✅ Scientific reproducibility (model version pinning)
  • ✅ Code quality (type safety, linting, testing)

Out of Scope:

  • ❌ Internet-facing attack surface (no web server)
  • ❌ Untrusted user input (local trusted data only)
  • ❌ Cryptographic operations (not a crypto application)
  • ❌ Multi-tenant isolation (single-user research environment)

Current Security Posture

  • Code-level security: Bandit clean (0 issues, documented suppressions) ✅
  • Dependencies: pip-audit clean (0 CVEs in locked environment) ✅
  • Type safety: 100% type coverage on production code (mypy --strict) ✅
  • CI enforcement: Security gates block PR merges


Pickle Usage Policy

Why We Use Pickle

Approved use cases:

  1. Trained models - BinaryClassifier model persistence (.pkl files)
  2. Embedding caches - ESM-1v embeddings for performance
  3. Preprocessed datasets - Cached data transformations

Security justification:

  • All pickle files are locally generated by our own code
  • NOT loading pickles from the internet or untrusted sources
  • NOT exposed to external attackers (local research pipeline)
  • Cache files have integrity validation (hash checks)

Where Pickle is Used

Production code (src/):

  • cli/test.py - Loading trained models and cached embeddings
  • core/trainer.py - Loading cached embeddings (with hash validation)
  • data/loaders.py - Loading preprocessed datasets

All uses are marked with # nosec B301 comments with justification.
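
For illustration, a minimal sketch of the hash-validated load pattern described above. The file layout, hash-file naming, and helper name are hypothetical, not the exact code in core/trainer.py:

import hashlib
import pickle
from pathlib import Path


def load_cache_with_hash_check(cache_path: Path) -> object:
    """Load a locally generated pickle cache only if its SHA-256 matches the recorded hash."""
    recorded = Path(str(cache_path) + ".sha256").read_text().strip()
    actual = hashlib.sha256(cache_path.read_bytes()).hexdigest()
    if actual != recorded:
        raise ValueError(f"Cache integrity check failed for {cache_path}: hash mismatch")
    with open(cache_path, "rb") as f:
        return pickle.load(f)  # nosec B301 - locally generated cache, hash-validated above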

When NOT to Use Pickle

Avoid pickle for:

  • ❌ Receiving data from external sources
  • ❌ Storing configuration (use JSON/YAML instead - see the brief example below)
  • ❌ Production API deployments (use JSON + NPZ format)
  • ❌ Cross-language data exchange (use standard formats)
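
As a small illustration of the configuration point, a plain JSON round-trip covers the same need without pickle's code-execution risk; the file name and keys here are illustrative only:

import json

config = {"model": {"name": "facebook/esm1v_t33_650M_UR90S_1"}, "training": {"n_splits": 5}}

# Write and re-read configuration as JSON - loading JSON cannot execute arbitrary code
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

with open("config.json") as f:
    config = json.load(f)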

Production Deployment (Implemented)

✅ Dual-format serialization is now live:

All trained models are automatically saved in both formats:

import pickle

# Research path (pickle) - still works
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    model = pickle.load(f)

# Production path (NPZ+JSON) - NEW
from antibody_training_esm.core import load_model_from_npz

model = load_model_from_npz(
    npz_path="experiments/checkpoints/esm1v/logreg/model.npz",
    json_path="experiments/checkpoints/esm1v/logreg/model_config.json"
)

Implementation details:

  • save_model() in core/trainer.py writes all three files automatically
  • load_model_from_npz() reconstructs BinaryClassifier from NPZ+JSON
  • No breaking changes - pickle still supported for research workflows
  • See Training Guide for usage examples
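
To make the format concrete, this is roughly what inspecting the NPZ+JSON pair looks like with plain numpy and json. The array and metadata key names are assumptions for illustration, not the exact schema written by save_model():

import json
import numpy as np

# The NPZ file holds only numeric arrays (e.g. classifier weights); it cannot carry executable code
arrays = np.load("experiments/checkpoints/esm1v/logreg/model.npz")
print(arrays.files)  # hypothetical keys, e.g. ["coef", "intercept"]

# The JSON file holds plain metadata (hyperparameters, model name, scaler statistics, ...)
with open("experiments/checkpoints/esm1v/logreg/model_config.json") as f:
    metadata = json.load(f)
print(sorted(metadata.keys()))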

Security improvements:

  • ✅ NPZ+JSON cannot execute code (unlike pickle)
  • ✅ Cross-platform compatible (any language can load)
  • ✅ HuggingFace deployment ready
  • ✅ Production API safe
  • ✅ Data validation: Input validation at all pipeline entry points (sequences, embeddings, configs)
  • ✅ Cache integrity: Embedding cache validation prevents training on corrupted data


Data Validation & Corruption Prevention

Overview

As of v0.3.0, the pipeline enforces strict validation at all data entry points to prevent silent data corruption. The principle is: fail fast with clear error messages instead of silent corruption.

Validation Layers

1. Sequence Validation (src/antibody_training_esm/core/embeddings.py)

  • Invalid sequences (non-amino-acid characters, gaps, empty strings) now raise ValueError
  • Error message shows the specific invalid characters and a sequence preview
  • Before v0.3.0: invalid sequences were replaced with a single "M" (methionine) → silent corruption
  • After v0.3.0: training halts immediately with an actionable error message (a sketch follows after this list)

2. Embedding Validation (src/antibody_training_esm/core/trainer.py)

  • Cached and computed embeddings validated for:
    • Correct shape (must match the number of sequences)
    • NaN values (indicates failed computation)
    • All-zero rows (indicates failed batch processing)
  • Before v0.3.0: corrupted embeddings loaded silently → model trains on garbage
  • After v0.3.0: corruption detected immediately, cache deleted, user instructed to recompute

3. Config Validation (src/antibody_training_esm/core/trainer.py)

  • Required config keys validated before any expensive operations (GPU allocation, model downloads)
  • Missing sections or keys reported with a full list of what's missing
  • Before v0.3.0: cryptic KeyError after the GPU was already allocated
  • After v0.3.0: clear error message showing exactly what's missing, fails before resource allocation

4. Dataset Validation (src/antibody_training_esm/datasets/*.py)

  • CSV/Excel files validated for:
    • Non-empty (at least 1 row)
    • Required columns present (with a helpful error showing available columns)
  • Before v0.3.0: empty datasets or missing columns caused mysterious crashes later
  • After v0.3.0: immediate failure with clear guidance

5. Column Validation (src/antibody_training_esm/data/loaders.py)

  • CSV column existence checked before access
  • Error message shows available columns when the expected column is missing
  • Prevents cryptic KeyError exceptions
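
A minimal sketch of the sequence-level check (layer 1). The function name and exact error wording are illustrative, not a copy of the code in embeddings.py:

# Hypothetical helper illustrating the fail-fast sequence check
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequences(sequences: list[str]) -> None:
    """Raise ValueError on empty sequences or non-amino-acid characters instead of silently patching them."""
    for i, seq in enumerate(sequences):
        if not seq:
            raise ValueError(f"Sequence {i} is empty")
        invalid = set(seq.upper()) - VALID_AMINO_ACIDS
        if invalid:
            preview = seq[:50] + "..." if len(seq) > 50 else seq
            raise ValueError(
                f"Sequence {i} contains invalid characters {sorted(invalid)}: {preview}"
            )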

Validation Function Pattern

Example: Config Validation (v0.3.0+)

def validate_config(config: dict[str, Any]) -> None:
    """Validate that config dictionary contains all required keys."""
    required_keys = {
        "data": ["train_file", "test_file", "embeddings_cache_dir"],
        "model": ["name", "device"],
        "classifier": [],
        "training": ["log_level", "metrics", "n_splits"],
        "experiment": ["name"],
    }

    missing_sections = []
    missing_keys = []

    for section in required_keys:
        if section not in config:
            missing_sections.append(section)
            continue

        for key in required_keys[section]:
            if key not in config[section]:
                missing_keys.append(f"{section}.{key}")

    if missing_sections or missing_keys:
        error_parts = []
        if missing_sections:
            error_parts.append(f"Missing config sections: {', '.join(missing_sections)}")
        if missing_keys:
            error_parts.append(f"Missing config keys: {', '.join(missing_keys)}")
        raise ValueError("Config validation failed:\n  - " + "\n  - ".join(error_parts))
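
The dataset and column checks (layers 4 and 5) follow the same pattern. A minimal pandas-based sketch, assuming a hypothetical helper name rather than the actual loader code:

import pandas as pd


def load_and_validate_csv(path: str, required_columns: list[str]) -> pd.DataFrame:
    """Fail fast on empty files or missing columns, listing what is actually available."""
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Dataset {path} is empty (0 rows)")
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(
            f"Dataset {path} is missing columns {missing}; available columns: {list(df.columns)}"
        )
    return df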

Key Principles:

  1. Validate early: Check inputs before expensive operations
  2. Fail fast: Raise errors immediately, don't continue with invalid data
  3. Clear messages: Show what was expected, what was found, how to fix
  4. Actionable errors: Include sequence previews, available columns, missing keys

Impact

Research Integrity:

  • No more training on corrupted embeddings
  • No more silent replacement of invalid sequences
  • Invalid test sets rejected (prevents reporting wrong metrics)

Developer Experience:

  • Error messages show exactly what's wrong and where
  • No more cryptic KeyErrors or AttributeErrors
  • Validation failures include context (sequence preview, available columns)

Resource Efficiency:

  • Config validation before GPU allocation saves expensive failures
  • Cache validation prevents wasting hours training on garbage data


HuggingFace Model Pinning

Why We Pin Versions

Scientific reproducibility: Unpinned models can change, breaking reproducibility:

  • If Facebook updates ESM-1v → embeddings change → results change
  • Paper methods must specify exact model versions
  • Cached embeddings become invalid if the model updates

How Models are Pinned

Configuration:

# src/antibody_training_esm/conf/config.yaml
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"
  revision: "main"  # Pinned revision for reproducibility

Code:

# src/antibody_training_esm/core/embeddings.py
from transformers import AutoModel

AutoModel.from_pretrained(
    model_name,
    revision=revision  # nosec B615 - Pinned for reproducibility
)

When to Update Revisions

Update model revisions when:

  • Starting a new research project (pin to latest, then freeze)
  • Publishing results (document the exact revision in methods)
  • A security fix is released for the HuggingFace transformers library

DON'T update mid-project - breaks reproducibility of existing results


Dependency Management

Checking for Vulnerabilities

Local audit:

# Export uv lock to requirements format
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt

# Run pip-audit on exported requirements
uv run pip-audit -r pip-audit-reqs.txt

Expected output:

No known vulnerabilities found

Upgrading Dependencies

Low-risk upgrades (safe to do anytime):

# Update dev tools, utilities (no ML dependencies)
# Examples: ruff, mypy, pytest, bandit, pre-commit

uv add "ruff@latest"
uv lock
uv run pytest  # Verify tests still pass

High-risk upgrades (require testing):

# ML dependencies: torch, transformers, scikit-learn, numpy

# ⚠️ Requires:
# - Full test suite pass
# - Embedding cache regeneration
# - External benchmark validation (Jain, Harvey, Shehata)
# - MPS backend compatibility check (Apple Silicon)

# Only upgrade when:
# - CVE fix required
# - After research milestone (can freeze and compare)
# - Dedicated validation sprint scheduled

Upgrade checklist for ML dependencies:

  1. Create a feature branch: git checkout -b security/ml-deps-upgrade
  2. Upgrade ONE package at a time (clean git blame)
  3. Run the full test suite: uv run pytest
  4. Regenerate embeddings on sample data
  5. Compare embeddings (hash + cosine similarity - see the sketch after this list)
  6. Re-run external benchmarks (should be within ±1% accuracy)
  7. Verify the MPS backend still works (Apple Silicon)
  8. Merge only if all checks pass
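
A minimal sketch of step 5, comparing cached embeddings before and after an upgrade. The file paths are placeholders and the tolerance judgment is left to the reviewer; this is not project code:

import hashlib
import numpy as np

old = np.load("embeddings_before_upgrade.npy")
new = np.load("embeddings_after_upgrade.npy")

# Exact byte-level check: identical hashes mean nothing changed at all
print(hashlib.sha256(old.tobytes()).hexdigest() == hashlib.sha256(new.tobytes()).hexdigest())

# Row-wise cosine similarity: values very close to 1.0 indicate numerically equivalent embeddings
cos = np.sum(old * new, axis=1) / (np.linalg.norm(old, axis=1) * np.linalg.norm(new, axis=1))
print(f"min cosine similarity: {cos.min():.6f}")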

Dependency Watchlist

Monitor for updates:

  • torch (>2.9.0) - MPS backend changes can break inference (see the check below)
  • transformers (>4.57.1) - Tokenizer/model changes can invalidate caches
  • scikit-learn (>1.5.0) - Model serialization format changes
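
For step 7 of the checklist (and the torch item above), a quick MPS sanity check can be run after any torch upgrade; this is a generic availability probe, not project-specific code:

import torch

# Confirm the MPS backend is still available and usable after a torch upgrade
if torch.backends.mps.is_available():
    x = torch.randn(4, 8, device="mps")
    print("MPS OK:", (x @ x.T).shape)
else:
    print("MPS backend not available - inference will fall back to CPU")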

Currently locked versions (all CVE-free):

torch==2.9.0
transformers==4.57.1
scikit-learn==1.7.2
numpy==2.3.4


Security Scanning

Bandit (Code Security)

What it checks:

  • Insecure function usage (pickle, eval, exec)
  • Weak cryptography (MD5, DES)
  • SQL injection patterns
  • Shell injection patterns
  • Hard-coded secrets

Run locally:

uv run bandit -r src/

Expected output:

No issues identified.

Documented suppressions:

  • Pickle imports/loads: # nosec B301, B403 (trusted local data)
  • HuggingFace downloads: # nosec B615 (pinned versions)

When to use # nosec: Only for false positives with clear justification:

import pickle  # nosec B403 - Used only for local trusted data

model = pickle.load(f)  # nosec B301 - Loading our own trained model

pip-audit (Dependency Vulnerabilities)

What it checks:

  • Known CVEs in installed packages
  • Security advisories from PyPI

Run locally:

# Export lock file
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt

# Audit dependencies
uv run pip-audit -r pip-audit-reqs.txt

Expected output:

No known vulnerabilities found

If vulnerabilities are found:

  1. Check severity (LOW vs MEDIUM vs HIGH)
  2. Check whether the dependency is direct or transitive
  3. For direct deps: uv add "package>=fixed-version"
  4. For transitive deps: wait for an upstream fix or constrain the version
  5. Re-export and re-audit:

uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
uv run pip-audit -r pip-audit-reqs.txt
6. Verify tests pass: uv run pytest

CodeQL (Static Analysis)

What it checks:

  • Uninitialized variables
  • Null pointer dereferences
  • SQL injection
  • XSS vulnerabilities
  • Resource leaks

Runs automatically:

  • On every PR (GitHub Actions)
  • On a weekly schedule on the main branch

View results: GitHub Security tab at https://github.com/{owner}/{repo}/security/code-scanning

If findings appear:

  1. Read the finding description and severity
  2. Review the suggested fix
  3. Implement the fix
  4. Verify with tests
  5. Push to the PR (CodeQL re-scans automatically)


CI Security Enforcement

Security Gates

What's enforced in CI:

# .github/workflows/ci.yml

- name: Security scan with bandit
  run: uv run bandit -r src/
  continue-on-error: false  # ✅ BLOCKS MERGE

- name: Dependency audit with pip-audit
  run: |
    uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
    uv run pip-audit -r pip-audit-reqs.txt
  continue-on-error: false  # ✅ BLOCKS MERGE

What's NOT enforced:

  • CodeQL findings (informational only)
  • Safety scan (advisory only)

Bypassing Security Gates

NEVER bypass security gates except for documented false positives.

Legitimate bypass:

# Clear justification
result = pickle.load(f)  # nosec B301 - Loading local trusted cache with hash validation

Illegitimate bypass:

# No justification - REJECTED in PR review
result = pickle.load(f)  # nosec B301


Best Practices

1. Never Load Untrusted Pickle Files

# ❌ NEVER DO THIS
import pickle
import requests

response = requests.get("https://example.com/model.pkl")
model = pickle.loads(response.content)  # RCE vulnerability!

# ✅ ONLY LOCAL FILES
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    model = pickle.load(f)  # Safe - we generated this file

2. Pin HuggingFace Model Versions

# ❌ AVOID - Results not reproducible
model = AutoModel.from_pretrained("facebook/esm1v_t33_650M_UR90S_1")

# ✅ ALWAYS PIN REVISION
model = AutoModel.from_pretrained(
    "facebook/esm1v_t33_650M_UR90S_1",
    revision="main"  # or specific commit SHA
)

3. Use Strong Hashes for Caching

# ❌ AVOID - MD5 triggers security scanners
cache_key = hashlib.md5(data.encode()).hexdigest()

# ✅ USE SHA-256
cache_key = hashlib.sha256(data.encode()).hexdigest()[:12]

4. Document All Security Suppressions

# ❌ NO CONTEXT
import pickle  # nosec B403

# ✅ CLEAR JUSTIFICATION
import pickle  # nosec B403 - Used only for local trusted data (models, caches)

5. Keep Dependencies Updated

# ❌ NEVER RUN
pip install package

# ✅ ALWAYS USE UV
uv add "package>=1.2.3"
uv lock  # Updates lock file

6. Run Security Scans Locally

# Before every commit
make all  # Includes bandit scan

# Before adding dependencies
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
uv run pip-audit -r pip-audit-reqs.txt

Troubleshooting

"Bandit: [B301] pickle.load"

Solution: Add # nosec B301 with justification:

model = pickle.load(f)  # nosec B301 - Loading local trusted model

"pip-audit: Package X has known vulnerabilities"

Solution:

  1. Check whether it's a direct dependency: uv tree | grep package
  2. If direct: uv add "package>=fixed-version"
  3. If transitive: upgrade the parent or constrain the version
  4. Re-audit: uv run pip-audit -r pip-audit-reqs.txt

"CodeQL: Potentially uninitialized variable"

Solution: Initialize variable before conditional:

# ❌ BAD
if condition:
    value = calculate()
print(value)  # ERROR: value may not exist

# ✅ GOOD
value = None  # or appropriate default
if condition:
    value = calculate()
if value is not None:
    print(value)

"Security gate failing in CI"

Checklist:

  1. Run the scan locally: uv run bandit -r src/
  2. Check whether the issue is new or pre-existing
  3. Fix the issue or add a documented suppression
  4. Verify locally: make all
  5. Push the fix: git push


Resources

Internal

  • CI workflow: .github/workflows/ci.yml (security job)
  • Bandit config: pyproject.toml (bandit section)
  • Dependencies: uv.lock (locked versions)

External

  • Bandit docs: https://bandit.readthedocs.io/
  • pip-audit docs: https://pypi.org/project/pip-audit/
  • CodeQL docs: https://codeql.github.com/docs/
  • HuggingFace security: https://huggingface.co/docs/hub/security

Error Handling Best Practices

Overview

As of v0.3.0, the codebase follows consistent error handling patterns to prevent silent failures and provide actionable error messages.

Pattern 1: Validate Config Before Expensive Operations

Why: GPU allocation and model downloads are expensive. Validate config structure before starting.

Example:

def train_model(config_path: str = "src/antibody_training_esm/conf/config.yaml") -> dict[str, Any]:
    config = load_config(config_path)
    validate_config(config)  # ← Validate BEFORE GPU allocation
    logger = setup_logging(config)

    # Now safe to do expensive operations
    X_train, y_train = load_data(config)
    classifier = BinaryClassifier(...)

Impact: Catches config errors immediately instead of after minutes of setup.

Pattern 2: Validate Data Structures After Loading

Why: Pickle and CSV can return corrupted or unexpected data.

Example:

# Loading pickle cache
with open(cache_file, "rb") as f:
    cached_data_raw = pickle.load(f)

# Validate type before accessing
if not isinstance(cached_data_raw, dict):
    logger.warning(f"Invalid cache format (got {type(cached_data_raw).__name__}), recomputing...")
    # Fall back to recomputation
elif "embeddings" not in cached_data_raw:
    logger.warning("Corrupt cache (missing keys), recomputing...")
else:
    # Safe to use
    cached_data = cached_data_raw
    validate_embeddings(cached_data["embeddings"], ...)

Impact: Graceful fallback instead of cryptic errors.
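
The validate_embeddings call referenced above could look roughly like this: a minimal sketch of the shape/NaN/zero-row checks described earlier, not the exact implementation in core/trainer.py:

import numpy as np


def validate_embeddings(embeddings: np.ndarray, n_sequences: int) -> None:
    """Reject cached embeddings that are mis-shaped, contain NaNs, or have all-zero rows."""
    if embeddings.shape[0] != n_sequences:
        raise ValueError(
            f"Embedding count {embeddings.shape[0]} does not match sequence count {n_sequences}"
        )
    if np.isnan(embeddings).any():
        raise ValueError("Embeddings contain NaN values (failed computation)")
    zero_rows = np.where(~embeddings.any(axis=1))[0]
    if zero_rows.size > 0:
        raise ValueError(f"Embeddings contain all-zero rows at indices {zero_rows[:10].tolist()}")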

Pattern 3: Provide Context in Error Messages

Why: When processing thousands of sequences, knowing WHICH one failed is critical.

Example:

try:
    embedding = model(sequence)
except Exception as e:
    seq_preview = sequence[:50] + "..." if len(sequence) > 50 else sequence
    raise RuntimeError(
        f"Failed to extract embedding for sequence of length {len(sequence)}: {seq_preview}"
    ) from e

Impact: Developers can immediately identify and fix the problematic sequence.

Pattern 4: Use Type Validation on Untrusted Loads

Why: Pickle returns Any, CSV returns mixed types. Must validate before use.

Example:

# BAD: Trust the type hint
cached_data: dict[str, Any] = pickle.load(f)  # Could be anything!

# GOOD: Validate runtime type
cached_data_raw = pickle.load(f)
if not isinstance(cached_data_raw, dict):
    raise ValueError(f"Expected dict, got {type(cached_data_raw)}")
cached_data: dict[str, Any] = cached_data_raw  # Now safe

Impact: Prevents type confusion attacks and corrupt data propagation.

When to Add Validation

Add validation when:

  1. Loading external data: CSV, pickle, user input
  2. Before expensive operations: GPU allocation, model downloads, training loops
  3. Accessing dict keys: config dicts, cached data
  4. Processing batches: show which batch/sequence failed

Don't add validation for:

  1. Internal function calls (trust your own typed code)
  2. Steps where validation was already done (don't double-validate)
  3. Performance-critical loops (validate before the loop, not inside it)

Testing Error Paths

Every validation function should have tests for:

import pytest

def test_validate_config_missing_section():
    config = {"data": {}}  # Missing "model" section
    with pytest.raises(ValueError, match="Missing config sections: model"):
        validate_config(config)

def test_validate_config_missing_keys():
    config = {"data": {}, "model": {}}  # Missing "data.train_file"
    with pytest.raises(ValueError, match="Missing config keys: data.train_file"):
        validate_config(config)

See also: Testing Strategy - Error Handling


Last Updated: 2025-11-28 Branch: main