Security¶
Target Audience: Developers handling security concerns and dependency management
Purpose: Understand the security model, approved practices, and how to maintain security posture
When to Use This Guide¶
Use this guide if you're:

- ✅ Adding dependencies (need to audit for CVEs)
- ✅ Using pickle (understand approved use cases)
- ✅ Fixing security findings (Bandit, CodeQL, pip-audit)
- ✅ Running security scans (local and CI)
- ✅ Understanding the threat model (what's in scope, what's not)
Related Documentation¶
- Workflow: Development Workflow - Code quality commands
- Architecture: Architecture - System design
- Type Checking: Type Checking Guide - Type safety requirements
Security Model¶
Threat Model¶
Context: Research codebase for antibody classification (NOT production API deployment)
In Scope:

- ✅ Code-level vulnerabilities (SQL injection, XSS, command injection)
- ✅ Dependency vulnerabilities (CVEs in packages)
- ✅ Scientific reproducibility (model version pinning)
- ✅ Code quality (type safety, linting, testing)
Out of Scope:

- ❌ Internet-facing attack surface (no web server)
- ❌ Untrusted user input (local trusted data only)
- ❌ Cryptographic operations (not a crypto application)
- ❌ Multi-tenant isolation (single-user research environment)
Current Security Posture¶
- ✅ Code-level security: Bandit clean (0 issues, documented suppressions)
- ✅ Dependencies: pip-audit clean (0 CVEs in locked environment)
- ✅ Type safety: 100% type coverage on production code (mypy --strict)
- ✅ CI enforcement: Security gates block PR merges
Pickle Usage Policy¶
Why We Use Pickle¶
Approved use cases:
1. Trained models - BinaryClassifier model persistence (.pkl files)
2. Embedding caches - ESM-1v embeddings for performance
3. Preprocessed datasets - Cached data transformations
Security justification:

- All pickle files are locally generated by our own code
- NOT loading pickles from the internet or untrusted sources
- NOT exposed to external attackers (local research pipeline)
- Cache files have integrity validation (hash checks)
Where Pickle is Used¶
Production code (src/):
- cli/test.py - Loading trained models and cached embeddings
- core/trainer.py - Loading cached embeddings (with hash validation)
- data/loaders.py - Loading preprocessed datasets
All uses are marked with `# nosec B301` comments with justification.
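The hash checks mentioned above can be implemented with a small wrapper. A minimal sketch, assuming a sidecar `.sha256` file convention; the helper name `load_pickle_verified` is illustrative, not the project's exact API:

```python
import hashlib
import pickle
from pathlib import Path


def load_pickle_verified(path: str) -> object:
    """Load a locally generated pickle only if its recorded SHA-256 digest matches."""
    data = Path(path).read_bytes()
    expected = Path(path + ".sha256").read_text().strip()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise ValueError(f"Cache integrity check failed for {path}: {actual} != {expected}")
    return pickle.loads(data)  # nosec B301 - local, hash-verified cache file
```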
When NOT to Use Pickle¶
Avoid pickle for:

- ❌ Receiving data from external sources
- ❌ Storing configuration (use JSON/YAML instead)
- ❌ Production API deployments (use JSON + NPZ format)
- ❌ Cross-language data exchange (use standard formats)
Production Deployment (Implemented)¶
✅ Dual-format serialization is now live:
All trained models are automatically saved in both formats:
```python
import pickle

# Research path (pickle) - still works
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    model = pickle.load(f)

# Production path (NPZ+JSON) - NEW
from antibody_training_esm.core import load_model_from_npz

model = load_model_from_npz(
    npz_path="experiments/checkpoints/esm1v/logreg/model.npz",
    json_path="experiments/checkpoints/esm1v/logreg/model_config.json",
)
```
Implementation details:
- `save_model()` in `core/trainer.py` writes all three files (pickle, NPZ, JSON) automatically
- `load_model_from_npz()` reconstructs `BinaryClassifier` from NPZ+JSON
- No breaking changes - pickle still supported for research workflows
- See Training Guide for usage examples
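Conceptually, the NPZ+JSON path deserializes only raw arrays and plain JSON, so nothing executable is loaded. A hedged sketch of the idea, assuming a scikit-learn `LogisticRegression` backbone and illustrative array/key names (the real `load_model_from_npz` may differ):

```python
import json

import numpy as np
from sklearn.linear_model import LogisticRegression


def load_logreg_from_npz(npz_path: str, json_path: str) -> LogisticRegression:
    """Rebuild a linear classifier from plain arrays + JSON config (no code execution)."""
    with open(json_path) as f:
        config = json.load(f)
    arrays = np.load(npz_path)  # NPZ stores raw numpy arrays only

    model = LogisticRegression(**config.get("hyperparameters", {}))
    model.coef_ = arrays["coef"]
    model.intercept_ = arrays["intercept"]
    model.classes_ = arrays["classes"]
    return model
```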
Security improvements:

- ✅ NPZ+JSON cannot execute code (unlike pickle)
- ✅ Cross-platform compatible (any language can load)
- ✅ HuggingFace deployment ready
- ✅ Production API safe
- ✅ Data validation: Input validation at all pipeline entry points (sequences, embeddings, configs)
- ✅ Cache integrity: Embedding cache validation prevents training on corrupted data
Data Validation & Corruption Prevention¶
Overview¶
As of v0.3.0, the pipeline enforces strict validation at every data entry point. The guiding principle: fail fast with clear error messages rather than corrupt data silently.
Validation Layers¶
1. Sequence Validation (`src/antibody_training_esm/core/embeddings.py`)

   - Invalid sequences (non-amino-acid characters, gaps, empty strings) now raise `ValueError`
   - Error message shows the specific invalid characters and a sequence preview
   - Before v0.3.0: Invalid sequences replaced with single "M" (methionine) → silent corruption
   - After v0.3.0: Training halts immediately with an actionable error message

2. Embedding Validation (`src/antibody_training_esm/core/trainer.py`)

   - Cached and computed embeddings validated for:
     - Correct shape (must match number of sequences)
     - NaN values (indicates failed computation)
     - All-zero rows (indicates failed batch processing)
   - Before v0.3.0: Corrupted embeddings loaded silently → model trains on garbage
   - After v0.3.0: Corruption detected immediately, cache deleted, user instructed to recompute

3. Config Validation (`src/antibody_training_esm/core/trainer.py`)

   - Required config keys validated before any expensive operations (GPU allocation, model downloads)
   - Missing sections or keys reported with a full list of what's missing
   - Before v0.3.0: Cryptic `KeyError` after GPU already allocated
   - After v0.3.0: Clear error message showing exactly what's missing; fails before resource allocation

4. Dataset Validation (`src/antibody_training_esm/datasets/*.py`)

   - CSV/Excel files validated for:
     - Non-empty (at least 1 row)
     - Required columns present (with helpful error showing available columns)
   - Before v0.3.0: Empty datasets or missing columns caused mysterious crashes later
   - After v0.3.0: Immediate failure with clear guidance

5. Column Validation (`src/antibody_training_esm/data/loaders.py`)

   - CSV column existence checked before access
   - Error message shows available columns when the expected column is missing
   - Prevents cryptic `KeyError` exceptions
Validation Function Pattern¶
Example: Config Validation (v0.3.0+)
```python
from typing import Any


def validate_config(config: dict[str, Any]) -> None:
    """Validate that config dictionary contains all required keys."""
    required_keys = {
        "data": ["train_file", "test_file", "embeddings_cache_dir"],
        "model": ["name", "device"],
        "classifier": [],
        "training": ["log_level", "metrics", "n_splits"],
        "experiment": ["name"],
    }

    missing_sections = []
    missing_keys = []

    for section in required_keys:
        if section not in config:
            missing_sections.append(section)
            continue
        for key in required_keys[section]:
            if key not in config[section]:
                missing_keys.append(f"{section}.{key}")

    if missing_sections or missing_keys:
        error_parts = []
        if missing_sections:
            error_parts.append(f"Missing config sections: {', '.join(missing_sections)}")
        if missing_keys:
            error_parts.append(f"Missing config keys: {', '.join(missing_keys)}")
        raise ValueError("Config validation failed:\n  - " + "\n  - ".join(error_parts))
```
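Layers 1 and 2 follow the same pattern. A minimal sketch, assuming illustrative function names and checks (the production implementations in `embeddings.py` and `trainer.py` may differ):

```python
import numpy as np

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def validate_sequences(sequences: list[str]) -> None:
    """Fail fast on invalid sequences instead of silently substituting residues."""
    for i, seq in enumerate(sequences):
        if not seq:
            raise ValueError(f"Sequence {i} is empty")
        invalid = set(seq.upper()) - VALID_AA
        if invalid:
            preview = seq[:50] + "..." if len(seq) > 50 else seq
            raise ValueError(
                f"Sequence {i} contains invalid characters {sorted(invalid)}: {preview!r}"
            )


def validate_embeddings(embeddings: np.ndarray, n_sequences: int) -> None:
    """Detect corrupted caches: wrong shape, NaNs, or all-zero rows."""
    if embeddings.shape[0] != n_sequences:
        raise ValueError(f"Expected {n_sequences} embeddings, got {embeddings.shape[0]}")
    if np.isnan(embeddings).any():
        raise ValueError("Embeddings contain NaN values (failed computation)")
    if (embeddings == 0).all(axis=1).any():
        raise ValueError("Embeddings contain all-zero rows (failed batch processing)")
```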
Key Principles:

1. Validate early: Check inputs before expensive operations
2. Fail fast: Raise errors immediately, don't continue with invalid data
3. Clear messages: Show what was expected, what was found, how to fix
4. Actionable errors: Include sequence previews, available columns, missing keys
Impact¶
Research Integrity:

- No more training on corrupted embeddings
- No more silent replacement of invalid sequences
- Invalid test sets rejected (prevents reporting wrong metrics)

Developer Experience:

- Error messages show exactly what's wrong and where
- No more cryptic KeyErrors or AttributeErrors
- Validation failures include context (sequence preview, available columns)

Resource Efficiency:

- Config validation before GPU allocation saves expensive failures
- Cache validation prevents wasting hours training on garbage data
HuggingFace Model Pinning¶
Why We Pin Versions¶
Scientific reproducibility: Unpinned models can change, breaking reproducibility:

- If Facebook updates ESM-1v → embeddings change → results change
- Paper methods must specify exact model versions
- Cached embeddings become invalid if the model updates
How Models are Pinned¶
Configuration:
```yaml
# src/antibody_training_esm/conf/config.yaml
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"
  revision: "main"  # Pinned revision for reproducibility
```
Code:
```python
# src/antibody_training_esm/core/embeddings.py
from transformers import AutoModel

model = AutoModel.from_pretrained(
    model_name,
    revision=revision,  # nosec B615 - Pinned for reproducibility
)
```
When to Update Revisions¶
Update model revisions when:

- Starting a new research project (pin to latest, then freeze)
- Publishing results (document the exact revision in methods)
- A security fix is released for the HuggingFace transformers library
DON'T update mid-project - breaks reproducibility of existing results
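Note that a branch name like `"main"` can still move upstream; for a truly frozen pin, record a specific commit SHA from the model repo (the value below is a placeholder, not a real revision):

```yaml
# Hypothetical: freezing to an exact model commit for publication
model:
  name: "facebook/esm1v_t33_650M_UR90S_1"
  revision: "<commit-sha-from-model-repo>"  # placeholder - record the real SHA in your methods
```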
Dependency Management¶
Checking for Vulnerabilities¶
Local audit:
```bash
# Export uv lock to requirements format
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt

# Run pip-audit on exported requirements
uv run pip-audit -r pip-audit-reqs.txt
```
Expected output: `No known vulnerabilities found`
Upgrading Dependencies¶
Low-risk upgrades (safe to do anytime):
```bash
# Update dev tools, utilities (no ML dependencies)
# Examples: ruff, mypy, pytest, bandit, pre-commit
uv add "ruff@latest"
uv lock
uv run pytest  # Verify tests still pass
```
High-risk upgrades (require testing):
```bash
# ML dependencies: torch, transformers, scikit-learn, numpy
# ⚠️ Requires:
#   - Full test suite pass
#   - Embedding cache regeneration
#   - External benchmark validation (Jain, Harvey, Shehata)
#   - MPS backend compatibility check (Apple Silicon)

# Only upgrade when:
#   - CVE fix required
#   - After research milestone (can freeze and compare)
#   - Dedicated validation sprint scheduled
```
Upgrade checklist for ML dependencies:
1. Create feature branch: `git checkout -b security/ml-deps-upgrade`
2. Upgrade ONE package at a time (clean git blame)
3. Run full test suite: `uv run pytest`
4. Regenerate embeddings on sample data
5. Compare embeddings (hash + cosine similarity, see the sketch after this list)
6. Re-run external benchmarks (should be within ±1% accuracy)
7. Verify MPS backend still works (Apple Silicon)
8. Merge only if all checks pass
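A hedged sketch of step 5; the helper name and tolerance are illustrative, not project API:

```python
import hashlib

import numpy as np


def compare_embeddings(old: np.ndarray, new: np.ndarray, min_cosine: float = 0.9999) -> None:
    """Compare pre- and post-upgrade embeddings by hash, then by cosine similarity."""
    # Fast path: byte-identical arrays hash the same
    if hashlib.sha256(old.tobytes()).hexdigest() == hashlib.sha256(new.tobytes()).hexdigest():
        print("Embeddings are byte-identical")
        return
    # Otherwise check row-wise cosine similarity
    cos = (old * new).sum(axis=1) / (
        np.linalg.norm(old, axis=1) * np.linalg.norm(new, axis=1)
    )
    worst = cos.min()
    if worst < min_cosine:
        raise ValueError(f"Embeddings drifted after upgrade: min cosine similarity {worst:.6f}")
    print(f"Embeddings equivalent within tolerance (min cosine {worst:.6f})")
```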
Dependency Watchlist¶
Monitor for updates:
- `torch` (>2.9.0) - MPS backend changes can break inference
- `transformers` (>4.57.1) - Tokenizer/model changes can invalidate caches
- `scikit-learn` (>1.5.0) - Model serialization format changes
All currently locked versions are CVE-free; see `uv.lock` for the exact pins.
Security Scanning¶
Bandit (Code Security)¶
What it checks:

- Insecure function usage (pickle, eval, exec)
- Weak cryptography (MD5, DES)
- SQL injection patterns
- Shell injection patterns
- Hard-coded secrets
Run locally: `uv run bandit -r src/`

Expected output: `No issues identified.`
Documented suppressions:
- Pickle imports/loads: # nosec B301, B403 (trusted local data)
- HuggingFace downloads: # nosec B615 (pinned versions)
When to use # nosec:
Only for false positives with clear justification:
```python
import pickle  # nosec B403 - Used only for local trusted data

model = pickle.load(f)  # nosec B301 - Loading our own trained model
```
pip-audit (Dependency Vulnerabilities)¶
What it checks:

- Known CVEs in installed packages
- Security advisories from PyPI
Run locally:
```bash
# Export lock file
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt

# Audit dependencies
uv run pip-audit -r pip-audit-reqs.txt
```
Expected output: `No known vulnerabilities found`
If vulnerabilities found:
1. Check severity (LOW vs MEDIUM vs HIGH)
2. Check if the dependency is direct or transitive
3. For direct deps: `uv add "package>=fixed-version"`
4. For transitive deps: Wait for upstream fix or constrain the version (see the sketch below)
5. Re-export and re-audit:

```bash
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
uv run pip-audit -r pip-audit-reqs.txt
uv run pytest
```
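A hedged sketch of constraining a transitive dependency with uv, assuming the `constraint-dependencies` setting; the package name and version are placeholders:

```toml
# pyproject.toml
[tool.uv]
# Force the resolver to pick a patched version of a transitive dependency
constraint-dependencies = ["vulnerable-package>=1.2.4"]
```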
CodeQL (Static Analysis)¶
What it checks:

- Uninitialized variables
- Null pointer dereferences
- SQL injection
- XSS vulnerabilities
- Resource leaks

Runs automatically:

- On every PR (GitHub Actions)
- Weekly schedule on main branch
View results:
- GitHub Security tab: https://github.com/{owner}/{repo}/security/code-scanning
If findings appear:

1. Read finding description and severity
2. Review suggested fix
3. Implement fix
4. Verify with tests
5. Push to PR (CodeQL re-scans automatically)
CI Security Enforcement¶
Security Gates¶
What's enforced in CI:
```yaml
# .github/workflows/ci.yml
- name: Security scan with bandit
  run: uv run bandit -r src/
  continue-on-error: false  # ✅ BLOCKS MERGE

- name: Dependency audit with pip-audit
  run: |
    uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
    uv run pip-audit -r pip-audit-reqs.txt
  continue-on-error: false  # ✅ BLOCKS MERGE
```
What's NOT enforced:

- CodeQL findings (informational only)
- Safety scan (advisory only)
Bypassing Security Gates¶
NEVER bypass security gates except for documented false positives.
Legitimate bypass:
```python
# Clear justification
result = pickle.load(f)  # nosec B301 - Loading local trusted cache with hash validation
```
Illegitimate bypass:
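For example, a bare suppression with no reason given:

```python
# ❌ NO JUSTIFICATION - a reviewer cannot tell why this is safe
result = pickle.load(f)  # nosec
```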
Best Practices¶
1. Never Load Untrusted Pickle Files¶
```python
import pickle

import requests

# ❌ NEVER DO THIS
response = requests.get("https://example.com/model.pkl")
model = pickle.loads(response.content)  # RCE vulnerability!

# ✅ ONLY LOCAL FILES
with open("experiments/checkpoints/esm1v/logreg/boughter_vh_esm1v_logreg.pkl", "rb") as f:
    model = pickle.load(f)  # Safe - we generated this file
```
2. Pin HuggingFace Model Versions¶
```python
from transformers import AutoModel

# ❌ AVOID - Results not reproducible
model = AutoModel.from_pretrained("facebook/esm1v_t33_650M_UR90S_1")

# ✅ ALWAYS PIN REVISION
model = AutoModel.from_pretrained(
    "facebook/esm1v_t33_650M_UR90S_1",
    revision="main",  # or specific commit SHA
)
```
3. Use Strong Hashes for Caching¶
```python
import hashlib

# ❌ AVOID - MD5 triggers security scanners
cache_key = hashlib.md5(data.encode()).hexdigest()

# ✅ USE SHA-256
cache_key = hashlib.sha256(data.encode()).hexdigest()[:12]
```
4. Document All Security Suppressions¶
```python
# ❌ NO CONTEXT
import pickle  # nosec B403

# ✅ CLEAR JUSTIFICATION
import pickle  # nosec B403 - Used only for local trusted data (models, caches)
```
5. Keep Dependencies Updated¶
```bash
# ❌ NEVER RUN
pip install package

# ✅ ALWAYS USE UV
uv add "package>=1.2.3"
uv lock  # Updates lock file
```
6. Run Security Scans Locally¶
```bash
# Before every commit
make all  # Includes bandit scan

# Before adding dependencies
uv export --format=requirements-txt --no-hashes --output-file pip-audit-reqs.txt
uv run pip-audit -r pip-audit-reqs.txt
```
Troubleshooting¶
"Bandit: [B301] pickle.load"¶
Solution: Add `# nosec B301` with justification:
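For example, mirroring the approved pattern shown earlier:

```python
model = pickle.load(f)  # nosec B301 - Loading our own trained model from local disk
```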
"pip-audit: Package X has known vulnerabilities"¶
Solution:
1. Check if direct dependency: `uv tree | grep package`
2. If direct: `uv add "package>=fixed-version"`
3. If transitive: Upgrade parent or constrain the version
4. Re-audit: `uv run pip-audit -r pip-audit-reqs.txt`
"CodeQL: Potentially uninitialized variable"¶
Solution: Initialize variable before conditional:
```python
# ❌ BAD
if condition:
    value = calculate()
print(value)  # ERROR: value may not exist

# ✅ GOOD
value = None  # or appropriate default
if condition:
    value = calculate()
if value is not None:
    print(value)
```
"Security gate failing in CI"¶
Checklist:
1. Run scan locally: `uv run bandit -r src/`
2. Check if issue is new or pre-existing
3. Fix issue or add documented suppression
4. Verify locally: `make all`
5. Push fix: `git push`
Resources¶
Internal¶
- CI workflow: `.github/workflows/ci.yml` (security job)
- Bandit config: `pyproject.toml` (bandit section)
- Dependencies: `uv.lock` (locked versions)
External¶
- Bandit docs: https://bandit.readthedocs.io/
- pip-audit docs: https://pypi.org/project/pip-audit/
- CodeQL docs: https://codeql.github.com/docs/
- HuggingFace security: https://huggingface.co/docs/hub/security
Error Handling Best Practices¶
Overview¶
As of v0.3.0, the codebase follows consistent error handling patterns to prevent silent failures and provide actionable error messages.
Pattern 1: Validate Config Before Expensive Operations¶
Why: GPU allocation and model downloads are expensive. Validate config structure before starting.
Example:
```python
def train_model(config_path: str = "src/antibody_training_esm/conf/config.yaml") -> dict[str, Any]:
    config = load_config(config_path)
    validate_config(config)  # ← Validate BEFORE GPU allocation

    logger = setup_logging(config)
    # Now safe to do expensive operations
    X_train, y_train = load_data(config)
    classifier = BinaryClassifier(...)
```
Impact: Catches config errors immediately instead of after minutes of setup.
Pattern 2: Validate Data Structures After Loading¶
Why: Pickle and CSV can return corrupted or unexpected data.
Example:
```python
# Loading pickle cache
with open(cache_file, "rb") as f:
    cached_data_raw = pickle.load(f)

# Validate type before accessing
if not isinstance(cached_data_raw, dict):
    logger.warning(f"Invalid cache format (got {type(cached_data_raw).__name__}), recomputing...")
    # Fall back to recomputation
elif "embeddings" not in cached_data_raw:
    logger.warning("Corrupt cache (missing keys), recomputing...")
else:
    # Safe to use
    cached_data = cached_data_raw
    validate_embeddings(cached_data["embeddings"], ...)
```
Impact: Graceful fallback instead of cryptic errors.
Pattern 3: Provide Context in Error Messages¶
Why: When processing thousands of sequences, knowing WHICH one failed is critical.
Example:
```python
try:
    embedding = model(sequence)
except Exception as e:
    seq_preview = sequence[:50] + "..." if len(sequence) > 50 else sequence
    raise RuntimeError(
        f"Failed to extract embedding for sequence of length {len(sequence)}: {seq_preview}"
    ) from e
```
Impact: Developers can immediately identify and fix the problematic sequence.
Pattern 4: Use Type Validation on Untrusted Loads¶
Why: Pickle returns Any, CSV returns mixed types. Must validate before use.
Example:
```python
# BAD: Trust the type hint
cached_data: dict[str, Any] = pickle.load(f)  # Could be anything!

# GOOD: Validate runtime type
cached_data_raw = pickle.load(f)
if not isinstance(cached_data_raw, dict):
    raise ValueError(f"Expected dict, got {type(cached_data_raw)}")
cached_data: dict[str, Any] = cached_data_raw  # Now safe
```
Impact: Prevents type confusion attacks and corrupt data propagation.
When to Add Validation¶
Add validation when:

1. Loading external data: CSV, pickle, user input
2. Before expensive operations: GPU allocation, model downloads, training loops
3. Accessing dict keys: Config dicts, cached data
4. Processing batches: Show which batch/sequence failed

Don't add validation for:

1. Internal function calls (trust your own typed code)
2. After validation already done (don't double-validate)
3. Performance-critical loops (validate before the loop, not inside)
Testing Error Paths¶
Every validation function should have tests for:
```python
import pytest


def test_validate_config_missing_section():
    config = {"data": {}}  # Missing "model" section
    with pytest.raises(ValueError, match="Missing config sections: model"):
        validate_config(config)


def test_validate_config_missing_keys():
    config = {"data": {}, "model": {}}  # Missing "data.train_file"
    with pytest.raises(ValueError, match="Missing config keys: data.train_file"):
        validate_config(config)
```
See also: Testing Strategy - Error Handling
Last Updated: 2025-11-28
Branch: main