
Security

Target Audience: Developers handling security concerns and dependency management

Purpose: Understand the security model, approved practices, and how to maintain security posture


When to Use This Guide

Use this guide if you're:

  • ✅ Adding dependencies (need to audit for CVEs)
  • ✅ Using pickle (understand approved use cases)
  • ✅ Fixing security findings (Bandit, CodeQL, pip-audit)
  • ✅ Running security scans (local and CI)
  • ✅ Understanding the threat model (what's in scope, what's not)



Security Model

Threat Model

Context: Research codebase for antibody classification (NOT production API deployment)

In Scope:

  • ✅ Code-level vulnerabilities (SQL injection, XSS, command injection)
  • ✅ Dependency vulnerabilities (CVEs in packages)
  • ✅ Scientific reproducibility (model version pinning)
  • ✅ Code quality (type safety, linting, testing)

Out of Scope:

  • ❌ Internet-facing attack surface (no web server)
  • ❌ Untrusted user input (local trusted data only)
  • ❌ Cryptographic operations (not a crypto application)
  • ❌ Multi-tenant isolation (single-user research environment)

Current Security Posture

  • Code-level security: Bandit clean (0 issues, documented suppressions) ✅
  • Dependencies: pip-audit clean (0 CVEs in locked environment) ✅
  • Type safety: 100% type coverage on production code (mypy --strict) ✅
  • CI enforcement: Security gates block PR merges


Pickle Usage Policy

Why We Use Pickle

Approved use cases:

  1. Trained models - BinaryClassifier model persistence (.pkl files)
  2. Embedding caches - ESM-1v embeddings for performance
  3. Preprocessed datasets - Cached data transformations

Security justification:

  • All pickle files are locally generated by our own code
  • NOT loading pickles from the internet or untrusted sources
  • NOT exposed to external attackers (local research pipeline)
  • Cache files have integrity validation (hash checks)

Where Pickle is Used

Production code (src/):

  • cli/test.py - Loading trained models and cached embeddings
  • core/trainer.py - Loading cached embeddings (with hash validation)
  • data/loaders.py - Loading preprocessed datasets

All uses are marked with # nosec B301 comments with justification.
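
For illustration, a minimal sketch of the hash-validated load pattern described above. The file layout, hash-file naming, and helper name are hypothetical, not the exact code in core/trainer.py:

import hashlib
import pickle
from pathlib import Path


def load_cache_with_hash_check(cache_path: Path) -> object:
    """Load a locally generated pickle cache only if its SHA-256 matches the recorded hash."""
    recorded = Path(str(cache_path) + ".sha256").read_text().strip()
    actual = hashlib.sha256(cache_path.read_bytes()).hexdigest()
    if actual != recorded:
        raise ValueError(f"Cache integrity check failed for {cache_path}: hash mismatch")
    with open(cache_path, "rb") as f:
        return pickle.load(f)  # nosec B301 - locally generated cache, hash-validated above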

When NOT to Use Pickle

Avoid pickle for:

  • ❌ Receiving data from external sources
  • ❌ Storing configuration (use JSON/YAML instead - see the brief example below)
  • ❌ Production API deployments (use JSON + NPZ format)
  • ❌ Cross-language data exchange (use standard formats)
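
As a small illustration of the configuration point, a plain JSON round-trip covers the same need without pickle's code-execution risk; the file name and keys here are illustrative only:

import json

config = {"model": {"name": "facebook/esm1v_t33_650M_UR90S_1"}, "training": {"n_splits": 5}}

# Write and re-read configuration as JSON - loading JSON cannot execute arbitrary code
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

with open("config.json") as f:
    config = json.load(f)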

Production Deployment (Implemented)

✅ Dual-format serialization is now live:

All trained models are automatically saved in both formats:

import pickle

# Research path (pickle) - still works
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    model = pickle.load(f)

# Production path (NPZ+JSON) - NEW
from antibody_training_esm.core import load_model_from_npz

model = load_model_from_npz(
    npz_path="experiments/checkpoints/esm1v/logreg/model.npz",
    json_path="experiments/checkpoints/esm1v/logreg/model_config.json"
)

Implementation details:

  • save_model() in core/trainer.py writes all three files automatically
  • load_model_from_npz() reconstructs BinaryClassifier from NPZ+JSON
  • No breaking changes - pickle still supported for research workflows
  • See Training Guide for usage examples
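
To make the format concrete, this is roughly what inspecting the NPZ+JSON pair looks like with plain numpy and json. The array and metadata key names are assumptions for illustration, not the exact schema written by save_model():

import json
import numpy as np

# The NPZ file holds only numeric arrays (e.g. classifier weights); it cannot carry executable code
arrays = np.load("experiments/checkpoints/esm1v/logreg/model.npz")
print(arrays.files)  # hypothetical keys, e.g. ["coef", "intercept"]

# The JSON file holds plain metadata (hyperparameters, model name, scaler statistics, ...)
with open("experiments/checkpoints/esm1v/logreg/model_config.json") as f:
    metadata = json.load(f)
print(sorted(metadata.keys()))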

Security improvements:

  • ✅ NPZ+JSON cannot execute code (unlike pickle)
  • ✅ Cross-platform compatible (any language can load)
  • ✅ HuggingFace deployment ready
  • ✅ Production API safe
  • ✅ Data validation: Input validation at all pipeline entry points (sequences, embeddings, configs)
  • ✅ Cache integrity: Embedding cache validation prevents training on corrupted data


Data Validation & Corruption Prevention

Overview

As of v0.3.0, the pipeline enforces strict validation at all data entry points to prevent silent data corruption. The principle is: fail fast with clear error messages instead of silent corruption.

Validation Layers

1. Sequence Validation (src/antibody_training_esm/core/embeddings.py)

  • Invalid sequences (non-amino-acid characters, gaps, empty strings) now raise ValueError
  • Error message shows the specific invalid characters and a sequence preview
  • Before v0.3.0: invalid sequences were replaced with a single "M" (methionine) → silent corruption
  • After v0.3.0: training halts immediately with an actionable error message (a sketch follows after this list)

2. Embedding Validation (src/antibody_training_esm/core/trainer.py)

  • Cached and computed embeddings validated for:
    • Correct shape (must match the number of sequences)
    • NaN values (indicates failed computation)
    • All-zero rows (indicates failed batch processing)
  • Before v0.3.0: corrupted embeddings loaded silently → model trains on garbage
  • After v0.3.0: corruption detected immediately, cache deleted, user instructed to recompute

3. Config Validation (src/antibody_training_esm/core/trainer.py)

  • Required config keys validated before any expensive operations (GPU allocation, model downloads)
  • Missing sections or keys reported with a full list of what's missing
  • Before v0.3.0: cryptic KeyError after the GPU was already allocated
  • After v0.3.0: clear error message showing exactly what's missing, fails before resource allocation

4. Dataset Validation (src/antibody_training_esm/datasets/*.py)

  • CSV/Excel files validated for:
    • Non-empty (at least 1 row)
    • Required columns present (with a helpful error showing available columns)
  • Before v0.3.0: empty datasets or missing columns caused mysterious crashes later
  • After v0.3.0: immediate failure with clear guidance

5. Column Validation (src/antibody_training_esm/data/loaders.py)

  • CSV column existence checked before access
  • Error message shows available columns when the expected column is missing
  • Prevents cryptic KeyError exceptions
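
A minimal sketch of the sequence-level check (layer 1). The function name and exact error wording are illustrative, not a copy of the code in embeddings.py:

# Hypothetical helper illustrating the fail-fast sequence check
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequences(sequences: list[str]) -> None:
    """Raise ValueError on empty sequences or non-amino-acid characters instead of silently patching them."""
    for i, seq in enumerate(sequences):
        if not seq:
            raise ValueError(f"Sequence {i} is empty")
        invalid = set(seq.upper()) - VALID_AMINO_ACIDS
        if invalid:
            preview = seq[:50] + "..." if len(seq) > 50 else seq
            raise ValueError(
                f"Sequence {i} contains invalid characters {sorted(invalid)}: {preview}"
            )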

Validation Function Pattern

Example: Config Validation (v0.3.0+)

def validate_config(config: dict[str, Any]) -> None:
    """Validate that config dictionary contains all required keys."""
    required_keys = {
        "data": ["train_file", "test_file", "embeddings_cache_dir"],
        "model": ["name", "device"],
        "classifier": [],
        "training": ["log_level", "metrics", "n_splits"],
        "experiment": ["name"],
    }

    missing_sections = []
    missing_keys = []

    for section in required_keys:
        if section not in config:
            missing_sections.append(section)
            continue

        for key in required_keys[section]:
            if key not in config[section]:
                missing_keys.append(f"{section}.{key}")

    if missing_sections or missing_keys:
        error_parts = []
        if missing_sections:
            error_parts.append(f"Missing config sections: {', '.join(missing_sections)}")
        if missing_keys:
            error_parts.append(f"Missing config keys: {', '.join(missing_keys)}")
        raise ValueError("Config validation failed:\n  - " + "\n  - ".join(error_parts))
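
The dataset and column checks (layers 4 and 5) follow the same pattern. A minimal pandas-based sketch, assuming a hypothetical helper name rather than the actual loader code:

import pandas as pd


def load_and_validate_csv(path: str, required_columns: list[str]) -> pd.DataFrame:
    """Fail fast on empty files or missing columns, listing what is actually available."""
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Dataset {path} is empty (0 rows)")
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(
            f"Dataset {path} is missing columns {missing}; available columns: {list(df.columns)}"
        )
    return df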

Key Principles:

  1. Validate early: Check inputs before expensive operations
  2. Fail fast: Raise errors immediately, don't continue with invalid data
  3. Clear messages: Show what was expected, what was found, how to fix
  4. Actionable errors: Include sequence previews, available columns, missing keys

Impact

Research Integrity:

  • No more training on corrupted embeddings
  • No more silent replacement of invalid sequences
  • Invalid test sets rejected (prevents reporting wrong metrics)

Developer Experience:

  • Error messages show exactly what's wrong and where
  • No more cryptic KeyErrors or AttributeErrors
  • Validation failures include context (sequence preview, available columns)

Resource Efficiency:

  • Config validation before GPU allocation saves expensive failures
  • Cache validation prevents wasting hours training on garbage data


HuggingFace Model Pinning

Why We Pin Versions

Scientific reproducibility: Unpinned models can change, breaking reproducibility:

  • If Facebook updates ESM-1v → embeddings change → results change
  • Paper methods must specify exact model versions
  • Cached embeddings become invalid if the model updates

How Models are Pinned

Configuration:

# src/antibody_training_esm/conf/config.yaml
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"
  revision: "main"  # Pinned revision for reproducibility

Code:

# src/antibody_training_esm/core/embeddings.py
from transformers import AutoModel

AutoModel.from_pretrained(
    model_name,
    revision=revision  # nosec B615 - Pinned for reproducibility
)

When to Update Revisions

Update model revisions when:

  • Starting a new research project (pin to latest, then freeze)
  • Publishing results (document the exact revision in methods)
  • A security fix is released for the HuggingFace transformers library

DON'T update mid-project - breaks reproducibility of existing results


Dependency Management

Checking for Vulnerabilities

Local audit:

# Export uv lock to requirements format
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt

# Run pip-audit on exported requirements
uv run pip-audit -r pip-audit-reqs.txt

Expected output:

No known vulnerabilities found

Upgrading Dependencies

Low-risk upgrades (safe to do anytime):

# Update dev tools, utilities (no ML dependencies)
# Examples: ruff, mypy, pytest, bandit, pre-commit

uv add "ruff@latest"
uv lock
uv run pytest  # Verify tests still pass

High-risk upgrades (require testing):

# ML dependencies: torch, transformers, scikit-learn, numpy

# ⚠️ Requires:
# - Full test suite pass
# - Embedding cache regeneration
# - External benchmark validation (Jain, Harvey, Shehata)
# - MPS backend compatibility check (Apple Silicon)

# Only upgrade when:
# - CVE fix required
# - After research milestone (can freeze and compare)
# - Dedicated validation sprint scheduled

Upgrade checklist for ML dependencies:

  1. Create a feature branch: git checkout -b security/ml-deps-upgrade
  2. Upgrade ONE package at a time (clean git blame)
  3. Run the full test suite: uv run pytest
  4. Regenerate embeddings on sample data
  5. Compare embeddings (hash + cosine similarity - see the sketch after this list)
  6. Re-run external benchmarks (should be within ±1% accuracy)
  7. Verify the MPS backend still works (Apple Silicon)
  8. Merge only if all checks pass
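
A minimal sketch of step 5, comparing cached embeddings before and after an upgrade. The file paths are placeholders and the tolerance judgment is left to the reviewer; this is not project code:

import hashlib
import numpy as np

old = np.load("embeddings_before_upgrade.npy")
new = np.load("embeddings_after_upgrade.npy")

# Exact byte-level check: identical hashes mean nothing changed at all
print(hashlib.sha256(old.tobytes()).hexdigest() == hashlib.sha256(new.tobytes()).hexdigest())

# Row-wise cosine similarity: values very close to 1.0 indicate numerically equivalent embeddings
cos = np.sum(old * new, axis=1) / (np.linalg.norm(old, axis=1) * np.linalg.norm(new, axis=1))
print(f"min cosine similarity: {cos.min():.6f}")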

Dependency Watchlist

Monitor for updates:

  • torch (>2.9.0) - MPS backend changes can break inference (see the check below)
  • transformers (>4.57.1) - Tokenizer/model changes can invalidate caches
  • scikit-learn (>1.5.0) - Model serialization format changes
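
For step 7 of the checklist (and the torch item above), a quick MPS sanity check can be run after any torch upgrade; this is a generic availability probe, not project-specific code:

import torch

# Confirm the MPS backend is still available and usable after a torch upgrade
if torch.backends.mps.is_available():
    x = torch.randn(4, 8, device="mps")
    print("MPS OK:", (x @ x.T).shape)
else:
    print("MPS backend not available - inference will fall back to CPU")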

Currently locked versions (all CVE-free):

torch==2.9.0
transformers==4.57.1
scikit-learn==1.7.2
numpy==2.3.4


Security Scanning

Bandit (Code Security)

What it checks:

  • Insecure function usage (pickle, eval, exec)
  • Weak cryptography (MD5, DES)
  • SQL injection patterns
  • Shell injection patterns
  • Hard-coded secrets

Run locally:

uv run bandit -r src/

Expected output:

No issues identified.

Documented suppressions:

  • Pickle imports/loads: # nosec B301, B403 (trusted local data)
  • HuggingFace downloads: # nosec B615 (pinned versions)

When to use # nosec: Only for false positives with clear justification:

import pickle  # nosec B403 - Used only for local trusted data

model = pickle.load(f)  # nosec B301 - Loading our own trained model

pip-audit (Dependency Vulnerabilities)

What it checks:

  • Known CVEs in installed packages
  • Security advisories from PyPI

Run locally:

# Export lock file
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt

# Audit dependencies
uv run pip-audit -r pip-audit-reqs.txt

Expected output:

No known vulnerabilities found

If vulnerabilities are found:

  1. Check severity (LOW vs MEDIUM vs HIGH)
  2. Check whether the dependency is direct or transitive
  3. For direct deps: uv add "package>=fixed-version"
  4. For transitive deps: wait for an upstream fix or constrain the version
  5. Re-export and re-audit:

uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
uv run pip-audit -r pip-audit-reqs.txt
6. Verify tests pass: uv run pytest

CodeQL (Static Analysis)

What it checks:

  • Uninitialized variables
  • Null pointer dereferences
  • SQL injection
  • XSS vulnerabilities
  • Resource leaks

Runs automatically:

  • On every PR (GitHub Actions)
  • On a weekly schedule on the main branch

View results: GitHub Security tab at https://github.com/{owner}/{repo}/security/code-scanning

If findings appear:

  1. Read the finding description and severity
  2. Review the suggested fix
  3. Implement the fix
  4. Verify with tests
  5. Push to the PR (CodeQL re-scans automatically)


CI Security Enforcement

Security Gates

What's enforced in CI:

# .github/workflows/ci.yml

- name: Security scan with bandit
  run: uv run bandit -r src/
  continue-on-error: false  # ✅ BLOCKS MERGE

- name: Dependency audit with pip-audit
  run: |
    uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
    uv run pip-audit -r pip-audit-reqs.txt
  continue-on-error: false  # ✅ BLOCKS MERGE

What's NOT enforced:

  • CodeQL findings (informational only)
  • Safety scan (advisory only)

Bypassing Security Gates

NEVER bypass security gates except for documented false positives.

Legitimate bypass:

# Clear justification
result = pickle.load(f)  # nosec B301 - Loading local trusted cache with hash validation

Illegitimate bypass:

# No justification - REJECTED in PR review
result = pickle.load(f)  # nosec B301


Best Practices

1. Never Load Untrusted Pickle Files

# ❌ NEVER DO THIS
import pickle
import requests

response = requests.get("https://example.com/model.pkl")
model = pickle.loads(response.content)  # RCE vulnerability!

# ✅ ONLY LOCAL FILES
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    model = pickle.load(f)  # Safe - we generated this file

2. Pin HuggingFace Model Versions

# ❌ AVOID - Results not reproducible
model = AutoModel.from_pretrained("facebook/esm1v_t33_650M_UR90S_1")

# ✅ ALWAYS PIN REVISION
model = AutoModel.from_pretrained(
    "facebook/esm1v_t33_650M_UR90S_1",
    revision="main"  # or specific commit SHA
)

3. Use Strong Hashes for Caching

# ❌ AVOID - MD5 triggers security scanners
cache_key = hashlib.md5(data.encode()).hexdigest()

# ✅ USE SHA-256
cache_key = hashlib.sha256(data.encode()).hexdigest()[:12]

4. Document All Security Suppressions

# ❌ NO CONTEXT
import pickle  # nosec B403

# ✅ CLEAR JUSTIFICATION
import pickle  # nosec B403 - Used only for local trusted data (models, caches)

5. Keep Dependencies Updated

# ❌ NEVER RUN
pip install package

# ✅ ALWAYS USE UV
uv add "package>=1.2.3"
uv lock  # Updates lock file

6. Run Security Scans Locally

# Before every commit
make all  # Includes bandit scan

# Before adding dependencies
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
uv run pip-audit -r pip-audit-reqs.txt

Troubleshooting

"Bandit: [B301] pickle.load"

Solution: Add # nosec B301 with justification:

model = pickle.load(f)  # nosec B301 - Loading local trusted model

"pip-audit: Package X has known vulnerabilities"

Solution:

  1. Check whether it's a direct dependency: uv tree | grep package
  2. If direct: uv add "package>=fixed-version"
  3. If transitive: upgrade the parent or constrain the version
  4. Re-audit: uv run pip-audit -r pip-audit-reqs.txt

"CodeQL: Potentially uninitialized variable"

Solution: Initialize variable before conditional:

# ❌ BAD
if condition:
    value = calculate()
print(value)  # ERROR: value may not exist

# ✅ GOOD
value = None  # or appropriate default
if condition:
    value = calculate()
if value is not None:
    print(value)

"Security gate failing in CI"

Checklist:

  1. Run the scan locally: uv run bandit -r src/
  2. Check whether the issue is new or pre-existing
  3. Fix the issue or add a documented suppression
  4. Verify locally: make all
  5. Push the fix: git push


Resources

Internal

  • CI workflow: .github/workflows/ci.yml (security job)
  • Bandit config: pyproject.toml (bandit section)
  • Dependencies: uv.lock (locked versions)

External

  • Bandit docs: https://bandit.readthedocs.io/
  • pip-audit docs: https://pypi.org/project/pip-audit/
  • CodeQL docs: https://codeql.github.com/docs/
  • HuggingFace security: https://huggingface.co/docs/hub/security

Error Handling Best Practices

Overview

As of v0.3.0, the codebase follows consistent error handling patterns to prevent silent failures and provide actionable error messages.

Pattern 1: Validate Config Before Expensive Operations

Why: GPU allocation and model downloads are expensive. Validate config structure before starting.

Example:

def train_model(config_path: str = "src/antibody_training_esm/conf/config.yaml") -> dict[str, Any]:
    config = load_config(config_path)
    validate_config(config)  # ← Validate BEFORE GPU allocation
    logger = setup_logging(config)

    # Now safe to do expensive operations
    X_train, y_train = load_data(config)
    classifier = BinaryClassifier(...)

Impact: Catches config errors immediately instead of after minutes of setup.

Pattern 2: Validate Data Structures After Loading

Why: Pickle and CSV can return corrupted or unexpected data.

Example:

# Loading pickle cache
with open(cache_file, "rb") as f:
    cached_data_raw = pickle.load(f)

# Validate type before accessing
if not isinstance(cached_data_raw, dict):
    logger.warning(f"Invalid cache format (got {type(cached_data_raw).__name__}), recomputing...")
    # Fall back to recomputation
elif "embeddings" not in cached_data_raw:
    logger.warning("Corrupt cache (missing keys), recomputing...")
else:
    # Safe to use
    cached_data = cached_data_raw
    validate_embeddings(cached_data["embeddings"], ...)

Impact: Graceful fallback instead of cryptic errors.
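
The validate_embeddings call referenced above could look roughly like this: a minimal sketch of the shape/NaN/zero-row checks described earlier, not the exact implementation in core/trainer.py:

import numpy as np


def validate_embeddings(embeddings: np.ndarray, n_sequences: int) -> None:
    """Reject cached embeddings that are mis-shaped, contain NaNs, or have all-zero rows."""
    if embeddings.shape[0] != n_sequences:
        raise ValueError(
            f"Embedding count {embeddings.shape[0]} does not match sequence count {n_sequences}"
        )
    if np.isnan(embeddings).any():
        raise ValueError("Embeddings contain NaN values (failed computation)")
    zero_rows = np.where(~embeddings.any(axis=1))[0]
    if zero_rows.size > 0:
        raise ValueError(f"Embeddings contain all-zero rows at indices {zero_rows[:10].tolist()}")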

Pattern 3: Provide Context in Error Messages

Why: When processing thousands of sequences, knowing WHICH one failed is critical.

Example:

try:
    embedding = model(sequence)
except Exception as e:
    seq_preview = sequence[:50] + "..." if len(sequence) > 50 else sequence
    raise RuntimeError(
        f"Failed to extract embedding for sequence of length {len(sequence)}: {seq_preview}"
    ) from e

Impact: Developers can immediately identify and fix the problematic sequence.

Pattern 4: Use Type Validation on Untrusted Loads

Why: Pickle returns Any, CSV returns mixed types. Must validate before use.

Example:

# BAD: Trust the type hint
cached_data: dict[str, Any] = pickle.load(f)  # Could be anything!

# GOOD: Validate runtime type
cached_data_raw = pickle.load(f)
if not isinstance(cached_data_raw, dict):
    raise ValueError(f"Expected dict, got {type(cached_data_raw)}")
cached_data: dict[str, Any] = cached_data_raw  # Now safe

Impact: Prevents type confusion attacks and corrupt data propagation.

When to Add Validation

Add validation when:

  1. Loading external data: CSV, pickle, user input
  2. Before expensive operations: GPU allocation, model downloads, training loops
  3. Accessing dict keys: config dicts, cached data
  4. Processing batches: show which batch/sequence failed

Don't add validation for:

  1. Internal function calls (trust your own typed code)
  2. Steps where validation was already done (don't double-validate)
  3. Performance-critical loops (validate before the loop, not inside it)

Testing Error Paths

Every validation function should have tests for:

import pytest

def test_validate_config_missing_section():
    config = {"data": {}}  # Missing "model" section
    with pytest.raises(ValueError, match="Missing config sections: model"):
        validate_config(config)

def test_validate_config_missing_keys():
    config = {"data": {}, "model": {}}  # Missing "data.train_file"
    with pytest.raises(ValueError, match="Missing config keys: data.train_file"):
        validate_config(config)

See also: Testing Strategy - Error Handling


Last Updated: 2025-11-28 Branch: main