FLaTNNBio/EntropyFusionDNA
Fusion Detection Tool

A command-line toolkit for DNA/RNA fusion sequence analysis based on engineered sequence features and two complementary machine-learning pipelines.

This repository lets you:

  • extract numerical features from CSV or FASTA sequence data,
  • optionally balance labeled datasets by random undersampling,
  • train a binary fusion classifier (Pipeline A),
  • train a fusion partner + breakpoint predictor (Pipeline B within the A+B workflow),
  • run inference directly from FASTA files,
  • run inference from precomputed feature tables,
  • download pretrained artifacts from a manifest file.

Overview

The tool is organized around two prediction stages.

Pipeline A — fusion detection

Pipeline A is a KNN-based binary classifier trained on extracted sequence features. It predicts whether a sequence is likely to be a fusion event.

Typical output columns include:

  • pipeline_a_proba
  • pipeline_a_pred_default
  • pipeline_a_pred_high
  • pipeline_a_pred_low

Pipeline B — fusion characterization

Pipeline B is a PyTorch neural network used in the A+B workflow to predict:

  • the most likely gene 1 partner,
  • the most likely gene 2 partner,
  • the relative breakpoint,
  • the absolute breakpoint when sequence length is available.

Typical output columns include:

  • pipeline_b_gene1_topK
  • pipeline_b_gene2_topK
  • pipeline_b_rel_bp_pred
  • pipeline_b_abs_bp_pred

Combined A+B workflow

In combined mode, Pipeline A acts as a gate before Pipeline B. Only rows whose Pipeline A score passes a chosen threshold are forwarded to Pipeline B.

The output includes:

  • pipeline_ab_called

When pipeline_ab_called = 0, Pipeline B columns are typically left as NaN.
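The gating behavior can be sketched as follows. This is a minimal illustration of the A-before-B logic, not the tool's actual implementation; `gate_rows` and `predict_b` are hypothetical names, and only two Pipeline B columns are shown:

```python
import numpy as np
import pandas as pd

# Illustrative subset of the Pipeline B output columns described above.
B_COLS = ["pipeline_b_rel_bp_pred", "pipeline_b_abs_bp_pred"]

def gate_rows(df: pd.DataFrame, threshold: float, predict_b) -> pd.DataFrame:
    """Forward only rows whose Pipeline A score passes the threshold."""
    out = df.copy()
    out["pipeline_ab_called"] = (out["pipeline_a_proba"] >= threshold).astype(int)
    # Rows not called by Pipeline A keep NaN in the Pipeline B columns.
    for col in B_COLS:
        out[col] = np.nan
    called = out["pipeline_ab_called"] == 1
    if called.any():
        out.loc[called, B_COLS] = predict_b(out.loc[called])
    return out
```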


Repository structure

fusion_detection_tool/
├── .gitignore
├── balance_dataset.py
├── features_selection.py
├── main_download_pretrained.py
├── main_predict_fasta.py
├── main_predict_features.py
├── main_train.py
├── pipeline_fusion_classification.py
└── train_pipeline_classification_detection.py

Script summary

  • features_selection.py
    Extracts engineered sequence features from CSV or FASTA inputs.

  • balance_dataset.py
    Randomly undersamples labeled datasets.

  • pipeline_fusion_classification.py
    Lower-level training script for Pipeline A only.

  • train_pipeline_classification_detection.py
    Lower-level training script for the full A+B workflow from a features CSV.

  • main_train.py
    Main training orchestrator. It can start from a raw CSV, a FASTA file plus metadata CSV, or a precomputed features CSV. It can also optionally balance the dataset before training.

  • main_predict_fasta.py
    Extracts features from a FASTA file and runs prediction with Pipeline A, Pipeline B, or combined A+B.

  • main_predict_features.py
    Runs prediction starting from an already computed features CSV.

  • main_download_pretrained.py
    Downloads pretrained model bundles from a local or remote manifest JSON.


Requirements

Python

This codebase uses modern Python syntax such as str | Path, so Python 3.10+ is recommended.

Core dependencies

Install the main dependencies with:

pip install -r requirements.txt

Optional dependencies

For extra functionality or better performance:

pip install xxhash

Notes:

  • tqdm is used for progress bars.
  • joblib is used for serialization and some parallel workloads.
  • xxhash improves sketch/hash feature performance when sketch-based features are enabled.

Installation

Clone the repository and move into it:

git clone <your-repo-url>
cd <your-repo-folder>

Install the package in editable mode:

pip install -e .

Then check the available options:

python main_train.py --help
python main_predict_fasta.py --help
python main_predict_features.py --help

You can also use the installed CLI entry points:

fusion-train --help
fusion-predict-fasta --help
fusion-predict-features --help

To run the automatic test suite:

python -m unittest discover -s tests -v

Release validation

Before publishing, the repository was smoke-tested locally with:

  • python -m unittest discover -s tests -v
  • python main_train.py --help
  • python main_predict_fasta.py --help
  • python main_predict_features.py --help
  • python balance_dataset.py --help
  • python features_selection.py --help
  • python pipeline_fusion_classification.py --help
  • python train_pipeline_classification_detection.py --help
  • python main_download_pretrained.py --help
  • end-to-end smoke runs for feature extraction, balancing, Pipeline A training, A-only prediction, A+B prediction, and minimal lower-level A+B training
  • editable install verification with the packaged CLI entry points such as fusion-train and fusion-predict-fasta

Sample data included in the repository

To keep the GitHub repository lightweight, only one sample dataset is included:

  • dataset/smoke_5k_stratified.csv

The larger raw datasets, derived feature tables, and generated artifacts are intentionally excluded from version control.


Data formats

1. Raw training CSV

When using main_train.py --input-type csv, the required columns depend on the selected pipeline.

For --pipeline a

Required columns:

  • sequence
  • label

For --pipeline ab

Required columns:

  • sequence
  • label
  • gene1
  • gene2
  • junction_point

Example for full A+B training:

sequence,label,gene1,gene2,junction_point
ACGTACGTACGTACGT,1,GENE_A,GENE_B,8
TGCATGCATGCATGCA,0,GENE_X,GENE_Y,7

2. FASTA + metadata for training

main_train.py also supports --input-type fasta.

In this mode:

  • the FASTA file provides the sequences,
  • a separate metadata CSV provides labels and any extra training columns,
  • FASTA seq_id values are matched against a metadata column specified by --metadata-id-col.

The metadata CSV must contain at least:

For --pipeline a

  • the ID column used for matching, for example seq_id
  • label

For --pipeline ab

  • the ID column used for matching, for example seq_id
  • label
  • gene1
  • gene2
  • junction_point

Example FASTA:

>sample_001
ACGTACGTACGTACGT
>sample_002
TGCATGCATGCATGCA

Example metadata CSV for A+B:

seq_id,label,gene1,gene2,junction_point
sample_001,1,GENE_A,GENE_B,8
sample_002,0,GENE_X,GENE_Y,7
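The ID matching step amounts to a join between the parsed FASTA records and the metadata table. A minimal sketch with pandas follows; `read_fasta` and `join_fasta_metadata` are illustrative helpers, not functions from this repository:

```python
import pandas as pd

def read_fasta(path: str) -> dict:
    """Parse a FASTA file into {seq_id: sequence} (illustrative helper)."""
    records, seq_id, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    records[seq_id] = "".join(chunks)
                # The seq_id is the first whitespace-separated token of the header.
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records

def join_fasta_metadata(fasta_path: str, metadata_csv: str, id_col: str = "seq_id") -> pd.DataFrame:
    """Match FASTA records against the metadata ID column."""
    seqs = read_fasta(fasta_path)
    meta = pd.read_csv(metadata_csv)
    df = pd.DataFrame({id_col: list(seqs), "sequence": list(seqs.values())})
    # Inner join: records without matching metadata (and vice versa) are dropped.
    return df.merge(meta, on=id_col, how="inner")
```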

3. Features CSV for training

A features CSV is a table that already contains engineered feature columns.

Required columns depend on the selected pipeline:

For --pipeline a

  • label

For --pipeline ab

  • label
  • gene1
  • gene2
  • junction_point

Extra metadata columns are allowed.

4. FASTA input for inference

main_predict_fasta.py takes a FASTA file, extracts the features internally, and runs prediction.

Typical FASTA-derived metadata columns preserved in the feature table may include:

  • seq_id
  • description
  • fasta_header

5. Features CSV for inference

main_predict_features.py takes a precomputed features CSV and skips feature extraction.

This is useful when:

  • features were already generated offline,
  • you want faster repeated inference,
  • you want strict control over the exact feature table being scored.

The input features CSV must contain the feature columns expected by the trained artifact. Extra metadata columns are usually preserved in the output.
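One way to verify column compatibility before scoring is a simple presence check. This is a hedged sketch; how the trained artifact actually stores its expected feature list is not specified here:

```python
import pandas as pd

def check_feature_columns(df: pd.DataFrame, expected: list) -> None:
    """Raise if the features CSV lacks columns the artifact expects."""
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise ValueError(f"features CSV is missing expected columns: {missing}")
```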


Feature extraction

features_selection.py can be used as a standalone script to generate feature tables from CSV or FASTA inputs.

Supported feature families include:

  • Shannon and Rényi entropy
  • mutual information across different lags (tau)
  • resolved mutual information
  • GC content
  • canonical k-mer frequencies
  • optional compression-derived features
  • sequence length
  • optional sketch/hash-based features
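As an illustration, two of the simplest feature families (Shannon entropy over single bases and GC content) can be computed as below. The tool's own definitions may differ in normalization, alphabet handling, and treatment of ambiguous bases:

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the single-base distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)
```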

Example: extract features from a CSV

python features_selection.py \
  --input data/train.csv \
  --output data/train_features.csv \
  --input-format csv \
  --sequence-col sequence \
  --label-col label \
  --tau-list 1,2,3,4,5 \
  --k-list 3,4,5

Example: extract features from a FASTA file

python features_selection.py \
  --input data/query.fasta \
  --output data/query_features.csv \
  --input-format fasta \
  --tau-list 1,2,3,4,5 \
  --k-list 3,4,5

Common extraction options

  • --keep-N retains ambiguous N bases during sequence cleaning.
  • --include-compression adds compression-based features.
  • --include-sketch adds sketch/hash-based features.
  • --k-sketch, --n-hash, --sketch-stride, --sketch-window control sketch features.
  • --min-len discards sequences shorter than a threshold.
  • --n-jobs enables parallel processing where supported.

Dataset balancing

balance_dataset.py performs random undersampling while preserving the original columns.

Supported strategies:

  • minority: every class is reduced to the minority class size
  • fixed: every class is reduced to --target-count
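The minority strategy amounts to the following. This is a simplified sketch of random undersampling with pandas, not the script itself:

```python
import pandas as pd

def undersample_minority(df: pd.DataFrame, label_col: str = "label", seed: int = 0) -> pd.DataFrame:
    """Randomly reduce every class to the size of the smallest class."""
    target = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .sample(n=target, random_state=seed)
          .reset_index(drop=True)
    )
```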

Example: balance to the minority class

python balance_dataset.py \
  --input data/train_features.csv \
  --output data/train_features_balanced.csv \
  --label-col label \
  --strategy minority

Example: balance to a fixed class size

python balance_dataset.py \
  --input data/train_features.csv \
  --output data/train_features_balanced.csv \
  --label-col label \
  --strategy fixed \
  --target-count 5000

Training with main_train.py

This is the recommended training entry point for most users.

It can:

  1. read a raw CSV, a FASTA+metadata pair, or a features CSV,
  2. extract features when needed,
  3. optionally balance the dataset,
  4. train either Pipeline A or A+B,
  5. save artifacts under a single output directory.

Training examples

Train Pipeline A from a raw CSV

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_a \
  --pipeline a

Train A+B from a raw CSV

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab \
  --pipeline ab \
  --device auto

Train Pipeline A from FASTA + metadata

python main_train.py \
  --input data/train.fasta \
  --input-type fasta \
  --metadata-csv data/train_metadata.csv \
  --metadata-id-col seq_id \
  --artifact-root artifacts/run_a_fasta \
  --pipeline a

Train A+B from FASTA + metadata

python main_train.py \
  --input data/train.fasta \
  --input-type fasta \
  --metadata-csv data/train_metadata.csv \
  --metadata-id-col seq_id \
  --artifact-root artifacts/run_ab_fasta \
  --pipeline ab \
  --device auto

Train from an existing features table

python main_train.py \
  --input data/train_features.csv \
  --input-type features \
  --artifact-root artifacts/run_ab_features \
  --pipeline ab

Train with balancing enabled

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab_balanced \
  --pipeline ab \
  --balance \
  --balance-strategy minority

Important main_train.py options

Input handling

  • --input
  • --input-type {csv,features,fasta}
  • --artifact-root
  • --sequence-col
  • --label-col
  • --metadata-csv
  • --metadata-id-col

Feature extraction

  • --tau-list
  • --k-list
  • --keep-N
  • --include-compression
  • --include-sketch
  • --k-sketch
  • --n-hash
  • --sketch-stride
  • --sketch-window
  • --min-len
  • --n-jobs
  • --no-progress

Balancing

  • --balance
  • --balance-strategy {minority,fixed}
  • --balance-target-count

Common training parameters

  • --seed
  • --test-size
  • --val-size
  • --device {auto,cpu,cuda}

Pipeline A parameters

  • --a-n-neighbors
  • --a-p
  • --a-weights
  • --a-n-jobs
  • --target-recall-low
  • --a-n-boot
  • --a-boot-seed
  • --a-no-bootstrap

Pipeline B / A+B parameters

  • --epochs-b
  • --patience-b
  • --batch-size
  • --w-gene
  • --w-bp
  • --hidden-grid
  • --dropout-grid
  • --lr-grid
  • --topk-list
  • --b-n-boot
  • --b-boot-seed

What main_train.py writes

Typical outputs include:

  • intermediates/train_raw_from_fasta.csv when training starts from FASTA
  • intermediates/train_features.csv
  • intermediates/train_features_balanced.csv when balancing is enabled
  • preprocessing_summary.json
  • pipeline_a/ and optionally pipeline_b/ artifact directories

Advanced training entry points

If you want more control, you can run the lower-level training scripts directly.

Train Pipeline A only from a features CSV

python pipeline_fusion_classification.py \
  --features-csv data/train_features.csv \
  --artifact-dir artifacts/pipeline_a

Train the full A+B workflow from a features CSV

python train_pipeline_classification_detection.py \
  --features-csv data/train_features.csv \
  --artifact-root artifacts/pipeline_ab \
  --device auto

Prediction from FASTA with main_predict_fasta.py

Use this script when you have raw sequences in FASTA format and want the tool to extract features internally.

Supported modes:

  • --pipeline a
  • --pipeline b
  • --pipeline ab

Predict with Pipeline A only

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline a \
  --pipeline-a-artifact-dir artifacts/run_a/pipeline_a \
  --output results/query_pipeline_a.csv

Predict with Pipeline B only

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline b \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --topk 5 \
  --device auto \
  --output results/query_pipeline_b.csv

Predict with combined A+B

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --a-threshold low \
  --topk 5 \
  --device auto \
  --output results/query_pipeline_ab.csv

Save extracted features and metadata

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --features-out results/query_features.csv \
  --metadata-out results/query_metadata.json \
  --output results/query_predictions.csv

Important main_predict_fasta.py options

  • --input-fasta
  • --output
  • --features-out
  • --metadata-out
  • --pipeline {a,b,ab}
  • --pipeline-a-artifact-dir
  • --pipeline-b-artifact-dir
  • --a-threshold {default,high,low}
  • --a-threshold-value
  • --topk
  • --device {cpu,cuda,auto}
  • --feature-config-root to explicitly point to a training artifact root or config JSON
  • all feature extraction options such as --tau-list, --k-list, --include-sketch, and --include-compression

When available, main_predict_fasta.py automatically reuses the feature extraction configuration saved by main_train.py, reading preprocessing_summary.json from the training artifact root. Explicit CLI flags still override the saved configuration.


Prediction from precomputed features with main_predict_features.py

Use this script when the feature table is already available and you want to skip feature extraction.

Supported modes:

  • --pipeline a
  • --pipeline b
  • --pipeline ab

Predict with Pipeline A only

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline a \
  --pipeline-a-artifact-dir artifacts/run_a/pipeline_a \
  --output results/query_features_pipeline_a.csv

Predict with Pipeline B only

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline b \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --topk 5 \
  --device auto \
  --output results/query_features_pipeline_b.csv

Predict with combined A+B

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --a-threshold low \
  --topk 5 \
  --device auto \
  --output results/query_features_pipeline_ab.csv \
  --metadata-out results/query_features_pipeline_ab_meta.json

Important main_predict_features.py options

  • --input-features
  • --output
  • --metadata-out
  • --pipeline {a,b,ab}
  • --pipeline-a-artifact-dir
  • --pipeline-b-artifact-dir
  • --a-threshold {default,high,low}
  • --a-threshold-value
  • --topk
  • --device {cpu,cuda,auto}

Output formats

Both prediction entry points support:

  • .csv
  • .tsv
  • .json

If --output is omitted, a default filename is inferred from the input name.
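A plausible sketch of extension-based dispatch for the three formats (illustrative only; the tool's actual writer and defaulting rules are not documented here):

```python
from pathlib import Path
import pandas as pd

def write_predictions(df: pd.DataFrame, output: str) -> None:
    """Write a prediction table as CSV, TSV, or JSON based on the file extension."""
    suffix = Path(output).suffix.lower()
    if suffix == ".csv":
        df.to_csv(output, index=False)
    elif suffix == ".tsv":
        df.to_csv(output, sep="\t", index=False)
    elif suffix == ".json":
        df.to_json(output, orient="records")
    else:
        raise ValueError(f"unsupported output format: {suffix}")
```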


Threshold control in A+B mode

In combined mode, Pipeline A decides which rows are sent to Pipeline B.

You can choose one of the stored thresholds:

  • --a-threshold default
  • --a-threshold high
  • --a-threshold low

Or provide a custom numeric threshold:

--a-threshold-value 0.72

When --a-threshold-value is provided, it overrides --a-threshold.
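The override rule can be expressed as a small resolver. This is a sketch; the assumption that the stored thresholds are keyed by the CLI names (default, high, low) is mine:

```python
def resolve_threshold(stored: dict, name: str = "default", value=None) -> float:
    """Pick the Pipeline A gating threshold: an explicit numeric value wins."""
    if value is not None:
        return value          # --a-threshold-value overrides --a-threshold
    return stored[name]       # otherwise use the stored named threshold
```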


Pretrained models

main_download_pretrained.py downloads pretrained bundles from a manifest JSON.

The manifest can be:

  • a local JSON file passed with --manifest,
  • a remote URL,
  • or a file named pretrained_models_manifest.json placed next to the script.
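As a purely hypothetical illustration, a manifest might look like the following. The actual schema is defined by main_download_pretrained.py; every field name here is an assumption, informed only by the --bundle and sha256-verification options documented below:

```json
{
  "bundles": {
    "pipeline_ab": {
      "url": "https://example.org/pretrained/pipeline_ab.tar.gz",
      "sha256": "…"
    }
  }
}
```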

List available bundles

python main_download_pretrained.py \
  --manifest pretrained_models_manifest.json \
  --list

Download a bundle

python main_download_pretrained.py \
  --manifest pretrained_models_manifest.json \
  --bundle pipeline_ab \
  --output-dir artifacts/pretrained

Common options

  • --force
  • --no-verify-sha256
  • --metadata-out

Typical artifact layout

The exact contents depend on the workflow, but a typical A+B training run may look like this:

artifacts/run_ab/
├── intermediates/
│   ├── train_raw_from_fasta.csv
│   ├── train_features.csv
│   └── train_features_balanced.csv
├── pipeline_a/
│   ├── ...
├── pipeline_b/
│   ├── ...
└── preprocessing_summary.json

The internal contents of pipeline_a/ and pipeline_b/ depend on the lower-level training scripts and may include model weights, scalers, metadata JSON files, tuning summaries, and prediction tables.


Practical notes

  • main_train.py launches sibling scripts such as pipeline_fusion_classification.py and train_pipeline_classification_detection.py, so keep those files in the same repository folder.
  • For inference, the feature columns used at prediction time should match the feature set expected by the trained artifact.
  • For A+B training, labels plus partner and breakpoint annotations are required.
  • For FASTA-based training, make sure FASTA sequence identifiers match the metadata ID column exactly.

Quick start

1. Train A+B from a raw CSV

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab \
  --pipeline ab \
  --device auto

2. Predict from FASTA

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --output results/query_predictions.csv

3. Predict from precomputed features

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --output results/query_predictions_from_features.csv
