
Boughter Dataset Loader

Loads preprocessed Boughter mouse antibody dataset.

IMPORTANT: This module is for LOADING preprocessed data, not for running the preprocessing pipeline. The preprocessing scripts that CREATE the data are in: preprocessing/boughter/stage2_stage3_annotation_qc.py

Dataset characteristics:

- Full antibodies (VH + VL)
- Mouse antibodies from 6 subsets (flu, hiv_nat, hiv_cntrl, hiv_plos, gut_hiv, mouse_iga)
- DNA sequences requiring translation to protein
- Novo flagging strategy (0 / 1-3 / 4+ flags)
- 3-stage quality control pipeline
- 16 fragment types (full antibody)

Processing Pipeline

Stage 1: DNA translation (FASTA → protein sequences)
Stage 2: ANARCI annotation (riot_na)
Stage 3: Post-annotation QC (filter X in CDRs, empty CDRs)

Source:

- data/train/boughter/raw/ (multiple subsets)
- Sequences in DNA format requiring translation

Reference:

- Boughter et al., "Biochemical patterns of antibody polyreactivity revealed through a bioinformatics-based analysis of CDR loops"

Classes

BoughterDataset

Bases: AntibodyDataset

Loader for Boughter mouse antibody dataset.

This class provides an interface to LOAD preprocessed Boughter dataset files. It does NOT run the preprocessing pipeline - use preprocessing/boughter/stage2_stage3_annotation_qc.py for that.

The Boughter dataset originally requires DNA translation before standard preprocessing. Sequences are provided as DNA in FASTA format and must be translated to protein sequences using a hybrid translation strategy (done by preprocessing scripts).

Source code in src/antibody_training_esm/datasets/boughter.py
class BoughterDataset(AntibodyDataset):
    """
    Loader for Boughter mouse antibody dataset.

    This class provides an interface to LOAD preprocessed Boughter dataset files.
    It does NOT run the preprocessing pipeline - use preprocessing/boughter/stage2_stage3_annotation_qc.py for that.

    The Boughter dataset originally requires DNA translation before standard preprocessing.
    Sequences are provided as DNA in FASTA format and must be translated
    to protein sequences using a hybrid translation strategy (done by preprocessing scripts).
    """

    # Novo flagging strategy
    FLAG_SPECIFIC = 0  # 0 flags = specific (include in training)
    FLAG_MILD = [1, 2, 3]  # 1-3 flags = mild (EXCLUDE from training)
    FLAG_NONSPECIFIC = [4, 5, 6, 7]  # 4+ flags = non-specific (include in training)

    # Dataset subsets
    SUBSETS = ["flu", "hiv_nat", "hiv_cntrl", "hiv_plos", "gut_hiv", "mouse_iga"]

    @classmethod
    def get_schema(cls) -> pa.DataFrameSchema:
        return get_boughter_schema()

    def __init__(
        self, output_dir: Path | None = None, logger: logging.Logger | None = None
    ):
        """
        Initialize Boughter dataset loader.

        Args:
            output_dir: Directory containing preprocessed fragment files
            logger: Logger instance
        """
        super().__init__(
            dataset_name="boughter",
            output_dir=output_dir or BOUGHTER_ANNOTATED_DIR,
            logger=logger,
        )

    def get_fragment_types(self) -> list[str]:
        """
        Return full antibody fragment types.

        Boughter contains VH + VL sequences, so we generate all 16 fragment types.

        Returns:
            List of 16 full antibody fragment types
        """
        return self.FULL_ANTIBODY_FRAGMENTS

    def load_data(
        self,
        processed_csv: str | Path | None = None,
        subset: str | None = None,
        include_mild: bool = False,
        **_: Any,
    ) -> pd.DataFrame:
        """
        Load Boughter dataset from processed CSV.

        Note: This assumes DNA translation has already been performed.
        For DNA translation from FASTA files, use the preprocessing scripts
        in preprocessing/boughter/

        Args:
            processed_csv: Path to processed CSV with protein sequences
            subset: Specific subset to load (flu, hiv_nat, etc.) or None for all
            include_mild: If True, include mild (1-3 flags). Default False.

        Returns:
            DataFrame with columns: id, VH_sequence, VL_sequence, label, flags, include_in_training

        Raises:
            FileNotFoundError: If processed CSV not found
        """
        # Default path
        if processed_csv is None:
            processed_csv = BOUGHTER_PROCESSED_CSV

        csv_file = Path(processed_csv)
        if not csv_file.exists():
            raise FileNotFoundError(
                f"Boughter processed CSV not found: {csv_file}\n"
                f"Please run DNA translation preprocessing first:\n"
                f"  python preprocessing/boughter/stage1_dna_translation.py"
            )

        # Load data
        self.logger.info(f"Reading Boughter dataset from {csv_file}...")
        df = pd.read_csv(csv_file)
        self.logger.info(f"  Loaded {len(df)} sequences")

        # Filter by subset if specified
        if subset is not None:
            if subset not in self.SUBSETS:
                raise ValueError(f"Unknown subset: {subset}. Valid: {self.SUBSETS}")
            df = df[df["subset"] == subset].copy()
            self.logger.info(f"  Filtered to subset '{subset}': {len(df)} sequences")

        # Apply Novo flagging strategy (only if flags column exists)
        # Pre-filtered training files (e.g., *_training.csv) don't have flags column
        if not include_mild:
            # Check if flags column exists (may be 'num_flags' or 'flags')
            has_flags = "num_flags" in df.columns or "flags" in df.columns

            if has_flags:
                # Exclude mild (1-3 flags) per Novo Nordisk methodology
                flag_col = "num_flags" if "num_flags" in df.columns else "flags"
                df["include_in_training"] = ~df[flag_col].isin(self.FLAG_MILD)
                df_training = df[df["include_in_training"]].copy()

                excluded = len(df) - len(df_training)
                self.logger.info("\nNovo flagging strategy:")
                self.logger.info(
                    f"  Excluded {excluded} sequences with mild flags (1-3)"
                )
                self.logger.info(f"  Training set: {len(df_training)} sequences")

                df = df_training
            else:
                # File is pre-filtered (training subset) - no flags column
                self.logger.info(
                    "  No flags column found - assuming pre-filtered training data"
                )

        # Standardize column names
        column_mapping = {
            "heavy_seq": "VH_sequence",
            "light_seq": "VL_sequence",
        }
        if "heavy_seq" in df.columns:
            df = df.rename(columns=column_mapping)

        # Create binary labels from flags
        # 0 flags → specific (label=0)
        # 4+ flags → non-specific (label=1)
        flag_col = "num_flags" if "num_flags" in df.columns else "flags"
        if flag_col in df.columns:
            df["label"] = (df[flag_col] >= 4).astype(int)

        # Create 'sequence' column for schema validation (use VH)
        if "sequence" not in df.columns and "VH_sequence" in df.columns:
            df["sequence"] = df["VH_sequence"]

        # Filter out sequences with stop codons (*) which violate the schema (uppercase letters only)
        if "sequence" in df.columns:
            initial_len = len(df)
            # Use raw string for regex to match literal *
            df = df[~df["sequence"].str.contains(r"\*", regex=True, na=False)].copy()
            dropped = initial_len - len(df)
            if dropped > 0:
                self.logger.info(
                    f"  Removed {dropped} sequences containing stop codons (*)"
                )

        # Guard against empty dataset after filtering
        if len(df) == 0:
            raise ValueError(
                "No valid sequences remaining after filtering. "
                "All sequences were removed due to stop codons (*), empty sequences, "
                "or other quality issues. Check upstream preprocessing pipeline."
            )

        # Validate with Pandera
        df = self.validate_dataframe(df)

        self.logger.info("\nLabel distribution:")
        label_counts = df["label"].value_counts().sort_index()
        for label, count in label_counts.items():
            label_name = "Specific" if label == 0 else "Non-specific"
            percentage = (count / len(df)) * 100
            self.logger.info(
                f"  {label_name} (label={label}): {count} ({percentage:.1f}%)"
            )

        return df

    def translate_dna_to_protein(self, dna_sequence: str) -> NoReturn:  # noqa: ARG002
        """
        This method is NOT IMPLEMENTED and will always raise an error.

        DNA translation logic belongs in the preprocessing scripts, not in
        dataset loader classes. Loaders are for LOADING preprocessed data,
        not for creating it.

        For DNA translation, use:
            preprocessing/boughter/stage1_dna_translation.py

        Args:
            dna_sequence: DNA sequence string (unused - always raises)

        Raises:
            NotImplementedError: Always - this method intentionally does nothing
        """
        raise NotImplementedError(
            "DNA translation is not implemented in dataset loader classes.\n"
            "Dataset loaders are for LOADING preprocessed data, not creating it.\n"
            "\n"
            "For DNA translation, use the preprocessing script:\n"
            "  python preprocessing/boughter/stage1_dna_translation.py\n"
            "\n"
            "This script implements the full hybrid translation strategy:\n"
            "  1. Direct V-domain translation (pre-trimmed sequences)\n"
            "  2. ATG-based translation (full-length with signal peptide)\n"
            "  3. V-domain motif detection (EVQL, QVQL, etc.)"
        )

    def filter_quality_issues(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Stage 3 QC: Filter sequences with quality issues.

        Removes:
        - Sequences with X in CDRs (ambiguous amino acids)
        - Sequences with empty CDRs
        - Invalid annotations

        Args:
            df: Annotated DataFrame

        Returns:
            Filtered DataFrame
        """
        initial_count = len(df)

        # Filter X in CDRs
        cdr_cols = [
            col for col in df.columns if "CDR" in col and ("VH_" in col or "VL_" in col)
        ]

        if cdr_cols:
            for col in cdr_cols:
                if col in df.columns:
                    df = df[~df[col].str.contains("X", na=False)].copy()

        # Filter empty CDRs
        for col in cdr_cols:
            if col in df.columns:
                df = df[df[col].str.len() > 0].copy()

        filtered_count = initial_count - len(df)

        if filtered_count > 0:
            self.logger.info(f"\nStage 3 QC filtered {filtered_count} sequences:")
            self.logger.info(f"  Remaining: {len(df)} sequences")

        return df
Functions
get_fragment_types()

Return full antibody fragment types.

Boughter contains VH + VL sequences, so we generate all 16 fragment types.

Returns:

    list[str]: List of 16 full antibody fragment types

Source code in src/antibody_training_esm/datasets/boughter.py
def get_fragment_types(self) -> list[str]:
    """
    Return full antibody fragment types.

    Boughter contains VH + VL sequences, so we generate all 16 fragment types.

    Returns:
        List of 16 full antibody fragment types
    """
    return self.FULL_ANTIBODY_FRAGMENTS
load_data(processed_csv=None, subset=None, include_mild=False, **_)

Load Boughter dataset from processed CSV.

Note: This assumes DNA translation has already been performed. For DNA translation from FASTA files, use the preprocessing scripts in preprocessing/boughter/

Parameters:

    processed_csv (str | Path | None): Path to processed CSV with protein sequences. Default: None
    subset (str | None): Specific subset to load (flu, hiv_nat, etc.) or None for all. Default: None
    include_mild (bool): If True, include mild (1-3 flags). Default: False

Returns:

    DataFrame: DataFrame with columns: id, VH_sequence, VL_sequence, label, flags, include_in_training

Raises:

    FileNotFoundError: If processed CSV not found
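The mild-flag exclusion and label creation that load_data applies reduce to two pandas operations. A sketch on a toy frame (the column values here are illustrative only; real data comes from the processed CSV):

```python
import pandas as pd

# Toy frame mimicking the processed CSV
df = pd.DataFrame({"id": ["a", "b", "c"], "num_flags": [0, 2, 5]})

FLAG_MILD = [1, 2, 3]
# Exclude mild (1-3 flags) per the Novo methodology
df["include_in_training"] = ~df["num_flags"].isin(FLAG_MILD)
training = df[df["include_in_training"]].copy()

# 0 flags -> specific (label=0), 4+ flags -> non-specific (label=1)
training["label"] = (training["num_flags"] >= 4).astype(int)
```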

Source code in src/antibody_training_esm/datasets/boughter.py
def load_data(
    self,
    processed_csv: str | Path | None = None,
    subset: str | None = None,
    include_mild: bool = False,
    **_: Any,
) -> pd.DataFrame:
    """
    Load Boughter dataset from processed CSV.

    Note: This assumes DNA translation has already been performed.
    For DNA translation from FASTA files, use the preprocessing scripts
    in preprocessing/boughter/

    Args:
        processed_csv: Path to processed CSV with protein sequences
        subset: Specific subset to load (flu, hiv_nat, etc.) or None for all
        include_mild: If True, include mild (1-3 flags). Default False.

    Returns:
        DataFrame with columns: id, VH_sequence, VL_sequence, label, flags, include_in_training

    Raises:
        FileNotFoundError: If processed CSV not found
    """
    # Default path
    if processed_csv is None:
        processed_csv = BOUGHTER_PROCESSED_CSV

    csv_file = Path(processed_csv)
    if not csv_file.exists():
        raise FileNotFoundError(
            f"Boughter processed CSV not found: {csv_file}\n"
            f"Please run DNA translation preprocessing first:\n"
            f"  python preprocessing/boughter/stage1_dna_translation.py"
        )

    # Load data
    self.logger.info(f"Reading Boughter dataset from {csv_file}...")
    df = pd.read_csv(csv_file)
    self.logger.info(f"  Loaded {len(df)} sequences")

    # Filter by subset if specified
    if subset is not None:
        if subset not in self.SUBSETS:
            raise ValueError(f"Unknown subset: {subset}. Valid: {self.SUBSETS}")
        df = df[df["subset"] == subset].copy()
        self.logger.info(f"  Filtered to subset '{subset}': {len(df)} sequences")

    # Apply Novo flagging strategy (only if flags column exists)
    # Pre-filtered training files (e.g., *_training.csv) don't have flags column
    if not include_mild:
        # Check if flags column exists (may be 'num_flags' or 'flags')
        has_flags = "num_flags" in df.columns or "flags" in df.columns

        if has_flags:
            # Exclude mild (1-3 flags) per Novo Nordisk methodology
            flag_col = "num_flags" if "num_flags" in df.columns else "flags"
            df["include_in_training"] = ~df[flag_col].isin(self.FLAG_MILD)
            df_training = df[df["include_in_training"]].copy()

            excluded = len(df) - len(df_training)
            self.logger.info("\nNovo flagging strategy:")
            self.logger.info(
                f"  Excluded {excluded} sequences with mild flags (1-3)"
            )
            self.logger.info(f"  Training set: {len(df_training)} sequences")

            df = df_training
        else:
            # File is pre-filtered (training subset) - no flags column
            self.logger.info(
                "  No flags column found - assuming pre-filtered training data"
            )

    # Standardize column names
    column_mapping = {
        "heavy_seq": "VH_sequence",
        "light_seq": "VL_sequence",
    }
    if "heavy_seq" in df.columns:
        df = df.rename(columns=column_mapping)

    # Create binary labels from flags
    # 0 flags → specific (label=0)
    # 4+ flags → non-specific (label=1)
    flag_col = "num_flags" if "num_flags" in df.columns else "flags"
    if flag_col in df.columns:
        df["label"] = (df[flag_col] >= 4).astype(int)

    # Create 'sequence' column for schema validation (use VH)
    if "sequence" not in df.columns and "VH_sequence" in df.columns:
        df["sequence"] = df["VH_sequence"]

    # Filter out sequences with stop codons (*) which violate the schema (uppercase letters only)
    if "sequence" in df.columns:
        initial_len = len(df)
        # Use raw string for regex to match literal *
        df = df[~df["sequence"].str.contains(r"\*", regex=True, na=False)].copy()
        dropped = initial_len - len(df)
        if dropped > 0:
            self.logger.info(
                f"  Removed {dropped} sequences containing stop codons (*)"
            )

    # Guard against empty dataset after filtering
    if len(df) == 0:
        raise ValueError(
            "No valid sequences remaining after filtering. "
            "All sequences were removed due to stop codons (*), empty sequences, "
            "or other quality issues. Check upstream preprocessing pipeline."
        )

    # Validate with Pandera
    df = self.validate_dataframe(df)

    self.logger.info("\nLabel distribution:")
    label_counts = df["label"].value_counts().sort_index()
    for label, count in label_counts.items():
        label_name = "Specific" if label == 0 else "Non-specific"
        percentage = (count / len(df)) * 100
        self.logger.info(
            f"  {label_name} (label={label}): {count} ({percentage:.1f}%)"
        )

    return df
translate_dna_to_protein(dna_sequence)

This method is NOT IMPLEMENTED and will always raise an error.

DNA translation logic belongs in the preprocessing scripts, not in dataset loader classes. Loaders are for LOADING preprocessed data, not for creating it.

For DNA translation, use: preprocessing/boughter/stage1_dna_translation.py

Parameters:

    dna_sequence (str): DNA sequence string (unused - always raises). Required.

Raises:

    NotImplementedError: Always raised - this method intentionally does nothing

Source code in src/antibody_training_esm/datasets/boughter.py
def translate_dna_to_protein(self, dna_sequence: str) -> NoReturn:  # noqa: ARG002
    """
    This method is NOT IMPLEMENTED and will always raise an error.

    DNA translation logic belongs in the preprocessing scripts, not in
    dataset loader classes. Loaders are for LOADING preprocessed data,
    not for creating it.

    For DNA translation, use:
        preprocessing/boughter/stage1_dna_translation.py

    Args:
        dna_sequence: DNA sequence string (unused - always raises)

    Raises:
        NotImplementedError: Always - this method intentionally does nothing
    """
    raise NotImplementedError(
        "DNA translation is not implemented in dataset loader classes.\n"
        "Dataset loaders are for LOADING preprocessed data, not creating it.\n"
        "\n"
        "For DNA translation, use the preprocessing script:\n"
        "  python preprocessing/boughter/stage1_dna_translation.py\n"
        "\n"
        "This script implements the full hybrid translation strategy:\n"
        "  1. Direct V-domain translation (pre-trimmed sequences)\n"
        "  2. ATG-based translation (full-length with signal peptide)\n"
        "  3. V-domain motif detection (EVQL, QVQL, etc.)"
    )
filter_quality_issues(df)

Stage 3 QC: Filter sequences with quality issues.

Removes:

- Sequences with X in CDRs (ambiguous amino acids)
- Sequences with empty CDRs
- Invalid annotations

Parameters:

    df (DataFrame): Annotated DataFrame. Required.

Returns:

    DataFrame: Filtered DataFrame
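The X-in-CDR and empty-CDR filters reduce to two pandas masks. A minimal sketch on a toy column (the column name and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"VH_CDR3": ["ARDYW", "ARXDW", "", None]})

# Drop CDRs containing ambiguous residues (X); na=False keeps NaN rows here
df = df[~df["VH_CDR3"].str.contains("X", na=False)].copy()
# Drop empty CDRs (NaN lengths compare False, so NaN rows drop out too)
df = df[df["VH_CDR3"].str.len() > 0].copy()
```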

Source code in src/antibody_training_esm/datasets/boughter.py
def filter_quality_issues(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Stage 3 QC: Filter sequences with quality issues.

    Removes:
    - Sequences with X in CDRs (ambiguous amino acids)
    - Sequences with empty CDRs
    - Invalid annotations

    Args:
        df: Annotated DataFrame

    Returns:
        Filtered DataFrame
    """
    initial_count = len(df)

    # Filter X in CDRs
    cdr_cols = [
        col for col in df.columns if "CDR" in col and ("VH_" in col or "VL_" in col)
    ]

    if cdr_cols:
        for col in cdr_cols:
            if col in df.columns:
                df = df[~df[col].str.contains("X", na=False)].copy()

    # Filter empty CDRs
    for col in cdr_cols:
        if col in df.columns:
            df = df[df[col].str.len() > 0].copy()

    filtered_count = initial_count - len(df)

    if filtered_count > 0:
        self.logger.info(f"\nStage 3 QC filtered {filtered_count} sequences:")
        self.logger.info(f"  Remaining: {len(df)} sequences")

    return df

Functions

load_boughter_data(processed_csv=None, subset=None, include_mild=False)

Convenience function to load preprocessed Boughter dataset.

IMPORTANT: This loads PREPROCESSED data. To preprocess raw data, use: preprocessing/boughter/stage2_stage3_annotation_qc.py

Parameters:

    processed_csv (str | None): Path to processed CSV with protein sequences. Default: None
    subset (str | None): Specific subset to load or None for all. Default: None
    include_mild (bool): If True, include mild (1-3 flags). Default: False

Returns:

    DataFrame: DataFrame with preprocessed data

Example:

    >>> from antibody_training_esm.datasets.boughter import load_boughter_data
    >>> df = load_boughter_data(include_mild=False)  # Novo flagging
    >>> print(f"Loaded {len(df)} sequences")

Source code in src/antibody_training_esm/datasets/boughter.py
def load_boughter_data(
    processed_csv: str | None = None,
    subset: str | None = None,
    include_mild: bool = False,
) -> pd.DataFrame:
    """
    Convenience function to load preprocessed Boughter dataset.

    IMPORTANT: This loads PREPROCESSED data. To preprocess raw data, use:
    preprocessing/boughter/stage2_stage3_annotation_qc.py

    Args:
        processed_csv: Path to processed CSV with protein sequences
        subset: Specific subset to load or None for all
        include_mild: If True, include mild (1-3 flags)

    Returns:
        DataFrame with preprocessed data

    Example:
        >>> from antibody_training_esm.datasets.boughter import load_boughter_data
        >>> df = load_boughter_data(include_mild=False)  # Novo flagging
        >>> print(f"Loaded {len(df)} sequences")
    """
    dataset = BoughterDataset()
    return dataset.load_data(
        processed_csv=processed_csv, subset=subset, include_mild=include_mild
    )