FLaTNNBio/EntropyFusionDNA
Fusion Detection Tool

A command-line toolkit for DNA/RNA fusion sequence analysis based on engineered sequence features and two complementary machine-learning pipelines.

This repository lets you:

  • extract numerical features from CSV or FASTA sequence data,
  • optionally balance labeled datasets by random undersampling,
  • train a binary fusion classifier (Pipeline A),
  • train a fusion partner + breakpoint predictor (Pipeline B within the A+B workflow),
  • run inference directly from FASTA files,
  • run inference from precomputed feature tables,
  • download pretrained artifacts from a manifest file.

Overview

The tool is organized around two prediction stages.

Pipeline A — fusion detection

Pipeline A is a KNN-based binary classifier trained on extracted sequence features. It predicts whether a sequence is likely to be a fusion event.

Typical output columns include:

  • pipeline_a_proba
  • pipeline_a_pred_default
  • pipeline_a_pred_high
  • pipeline_a_pred_low

Pipeline B — fusion characterization

Pipeline B is a PyTorch neural network used in the A+B workflow to predict:

  • the most likely gene 1 partner,
  • the most likely gene 2 partner,
  • the relative breakpoint,
  • the absolute breakpoint when sequence length is available.

Typical output columns include:

  • pipeline_b_gene1_topK
  • pipeline_b_gene2_topK
  • pipeline_b_rel_bp_pred
  • pipeline_b_abs_bp_pred

Combined A+B workflow

In combined mode, Pipeline A acts as a gate before Pipeline B. Only rows whose Pipeline A score passes a chosen threshold are forwarded to Pipeline B.

The output includes:

  • pipeline_ab_called

When pipeline_ab_called = 0, Pipeline B columns are typically left as NaN.
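The gating behavior can be sketched as follows. This is a minimal illustration of the A-before-B logic, not the tool's actual implementation; `gate_rows` and `predict_b` are hypothetical names, and only two Pipeline B columns are shown:

```python
import numpy as np
import pandas as pd

# Illustrative subset of the Pipeline B output columns described above.
B_COLS = ["pipeline_b_rel_bp_pred", "pipeline_b_abs_bp_pred"]

def gate_rows(df: pd.DataFrame, threshold: float, predict_b) -> pd.DataFrame:
    """Forward only rows whose Pipeline A score passes the threshold."""
    out = df.copy()
    out["pipeline_ab_called"] = (out["pipeline_a_proba"] >= threshold).astype(int)
    # Rows not called by Pipeline A keep NaN in the Pipeline B columns.
    for col in B_COLS:
        out[col] = np.nan
    called = out["pipeline_ab_called"] == 1
    if called.any():
        out.loc[called, B_COLS] = predict_b(out.loc[called])
    return out
```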


Repository structure

fusion_detection_tool/
├── .gitignore
├── balance_dataset.py
├── features_selection.py
├── main_download_pretrained.py
├── main_predict_fasta.py
├── main_predict_features.py
├── main_train.py
├── pipeline_fusion_classification.py
└── train_pipeline_classification_detection.py

Script summary

  • features_selection.py
    Extracts engineered sequence features from CSV or FASTA inputs.

  • balance_dataset.py
    Randomly undersamples labeled datasets.

  • pipeline_fusion_classification.py
    Lower-level training script for Pipeline A only.

  • train_pipeline_classification_detection.py
    Lower-level training script for the full A+B workflow from a features CSV.

  • main_train.py
    Main training orchestrator. It can start from a raw CSV, a FASTA file plus metadata CSV, or a precomputed features CSV. It can also optionally balance the dataset before training.

  • main_predict_fasta.py
    Extracts features from a FASTA file and runs prediction with Pipeline A, Pipeline B, or combined A+B.

  • main_predict_features.py
    Runs prediction starting from an already computed features CSV.

  • main_download_pretrained.py
    Downloads pretrained model bundles from a local or remote manifest JSON.


Requirements

Python

This codebase uses modern Python syntax such as str | Path, so Python 3.10+ is recommended.

Core dependencies

Install the main dependencies with:

pip install -r requirements.txt

Optional dependencies

For extra functionality or better performance:

pip install xxhash

Notes:

  • tqdm is used for progress bars.
  • joblib is used for serialization and some parallel workloads.
  • xxhash improves sketch/hash feature performance when sketch-based features are enabled.

Installation

Clone the repository and move into it:

git clone <your-repo-url>
cd <your-repo-folder>

Install the package in editable mode:

pip install -e .

Then check the available options:

python main_train.py --help
python main_predict_fasta.py --help
python main_predict_features.py --help

You can also use the installed CLI entry points:

fusion-train --help
fusion-predict-fasta --help
fusion-predict-features --help

To run the automatic test suite:

python -m unittest discover -s tests -v

Release validation

Before publishing, the repository was smoke-tested locally with:

  • python -m unittest discover -s tests -v
  • python main_train.py --help
  • python main_predict_fasta.py --help
  • python main_predict_features.py --help
  • python balance_dataset.py --help
  • python features_selection.py --help
  • python pipeline_fusion_classification.py --help
  • python train_pipeline_classification_detection.py --help
  • python main_download_pretrained.py --help
  • end-to-end smoke runs for feature extraction, balancing, Pipeline A training, A-only prediction, A+B prediction, and minimal lower-level A+B training
  • editable install verification with the packaged CLI entry points such as fusion-train and fusion-predict-fasta

Sample data included in the repository

To keep the GitHub repository lightweight, only one sample dataset is included:

  • dataset/smoke_5k_stratified.csv

The larger raw datasets, derived feature tables, and generated artifacts are intentionally excluded from version control.


Data formats

1. Raw training CSV

When using main_train.py --input-type csv, the required columns depend on the selected pipeline.

For --pipeline a

Required columns:

  • sequence
  • label

For --pipeline ab

Required columns:

  • sequence
  • label
  • gene1
  • gene2
  • junction_point

Example for full A+B training:

sequence,label,gene1,gene2,junction_point
ACGTACGTACGTACGT,1,GENE_A,GENE_B,8
TGCATGCATGCATGCA,0,GENE_X,GENE_Y,7

2. FASTA + metadata for training

main_train.py also supports --input-type fasta.

In this mode:

  • the FASTA file provides the sequences,
  • a separate metadata CSV provides labels and any extra training columns,
  • FASTA seq_id values are matched against a metadata column specified by --metadata-id-col.

The metadata CSV must contain at least:

For --pipeline a

  • the ID column used for matching, for example seq_id
  • label

For --pipeline ab

  • the ID column used for matching, for example seq_id
  • label
  • gene1
  • gene2
  • junction_point

Example FASTA:

>sample_001
ACGTACGTACGTACGT
>sample_002
TGCATGCATGCATGCA

Example metadata CSV for A+B:

seq_id,label,gene1,gene2,junction_point
sample_001,1,GENE_A,GENE_B,8
sample_002,0,GENE_X,GENE_Y,7
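The ID matching step amounts to a join between the parsed FASTA records and the metadata table. A minimal sketch with pandas follows; `read_fasta` and `join_fasta_metadata` are illustrative helpers, not functions from this repository:

```python
import pandas as pd

def read_fasta(path: str) -> dict:
    """Parse a FASTA file into {seq_id: sequence} (illustrative helper)."""
    records, seq_id, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    records[seq_id] = "".join(chunks)
                # The seq_id is the first whitespace-separated token of the header.
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if seq_id is not None:
        records[seq_id] = "".join(chunks)
    return records

def join_fasta_metadata(fasta_path: str, metadata_csv: str, id_col: str = "seq_id") -> pd.DataFrame:
    """Match FASTA records against the metadata ID column."""
    seqs = read_fasta(fasta_path)
    meta = pd.read_csv(metadata_csv)
    df = pd.DataFrame({id_col: list(seqs), "sequence": list(seqs.values())})
    # Inner join: records without matching metadata (and vice versa) are dropped.
    return df.merge(meta, on=id_col, how="inner")
```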

3. Features CSV for training

A features CSV is a table that already contains engineered feature columns.

Required columns depend on the selected pipeline:

For --pipeline a

  • label

For --pipeline ab

  • label
  • gene1
  • gene2
  • junction_point

Extra metadata columns are allowed.

4. FASTA input for inference

main_predict_fasta.py takes a FASTA file, extracts the features internally, and runs prediction.

Typical FASTA-derived metadata columns preserved in the feature table may include:

  • seq_id
  • description
  • fasta_header

5. Features CSV for inference

main_predict_features.py takes a precomputed features CSV and skips feature extraction.

This is useful when:

  • features were already generated offline,
  • you want faster repeated inference,
  • you want strict control over the exact feature table being scored.

The input features CSV must contain the feature columns expected by the trained artifact. Extra metadata columns are usually preserved in the output.
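One way to verify column compatibility before scoring is a simple presence check. This is a hedged sketch; how the trained artifact actually stores its expected feature list is not specified here:

```python
import pandas as pd

def check_feature_columns(df: pd.DataFrame, expected: list) -> None:
    """Raise if the features CSV lacks columns the artifact expects."""
    missing = [c for c in expected if c not in df.columns]
    if missing:
        raise ValueError(f"features CSV is missing expected columns: {missing}")
```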


Feature extraction

features_selection.py can be used as a standalone script to generate feature tables from CSV or FASTA inputs.

Supported feature families include:

  • Shannon and Rényi entropy
  • mutual information across different lags (tau)
  • resolved mutual information
  • GC content
  • canonical k-mer frequencies
  • optional compression-derived features
  • sequence length
  • optional sketch/hash-based features
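As an illustration, two of the simplest feature families (Shannon entropy over single bases and GC content) can be computed as below. The tool's own definitions may differ in normalization, alphabet handling, and treatment of ambiguous bases:

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the single-base distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)
```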

Example: extract features from a CSV

python features_selection.py \
  --input data/train.csv \
  --output data/train_features.csv \
  --input-format csv \
  --sequence-col sequence \
  --label-col label \
  --tau-list 1,2,3,4,5 \
  --k-list 3,4,5

Example: extract features from a FASTA file

python features_selection.py \
  --input data/query.fasta \
  --output data/query_features.csv \
  --input-format fasta \
  --tau-list 1,2,3,4,5 \
  --k-list 3,4,5

Common extraction options

  • --keep-N retains ambiguous N bases during sequence cleaning.
  • --include-compression adds compression-based features.
  • --include-sketch adds sketch/hash-based features.
  • --k-sketch, --n-hash, --sketch-stride, --sketch-window control sketch features.
  • --min-len discards sequences shorter than a threshold.
  • --n-jobs enables parallel processing where supported.

Dataset balancing

balance_dataset.py performs random undersampling while preserving the original columns.

Supported strategies:

  • minority: every class is reduced to the minority class size
  • fixed: every class is reduced to --target-count
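The minority strategy amounts to the following. This is a simplified sketch of random undersampling with pandas, not the script itself:

```python
import pandas as pd

def undersample_minority(df: pd.DataFrame, label_col: str = "label", seed: int = 0) -> pd.DataFrame:
    """Randomly reduce every class to the size of the smallest class."""
    target = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .sample(n=target, random_state=seed)
          .reset_index(drop=True)
    )
```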

Example: balance to the minority class

python balance_dataset.py \
  --input data/train_features.csv \
  --output data/train_features_balanced.csv \
  --label-col label \
  --strategy minority

Example: balance to a fixed class size

python balance_dataset.py \
  --input data/train_features.csv \
  --output data/train_features_balanced.csv \
  --label-col label \
  --strategy fixed \
  --target-count 5000

Training with main_train.py

This is the recommended training entry point for most users.

It can:

  1. read a raw CSV, a FASTA+metadata pair, or a features CSV,
  2. extract features when needed,
  3. optionally balance the dataset,
  4. train either Pipeline A or A+B,
  5. save artifacts under a single output directory.

Training examples

Train Pipeline A from a raw CSV

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_a \
  --pipeline a

Train A+B from a raw CSV

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab \
  --pipeline ab \
  --device auto

Train Pipeline A from FASTA + metadata

python main_train.py \
  --input data/train.fasta \
  --input-type fasta \
  --metadata-csv data/train_metadata.csv \
  --metadata-id-col seq_id \
  --artifact-root artifacts/run_a_fasta \
  --pipeline a

Train A+B from FASTA + metadata

python main_train.py \
  --input data/train.fasta \
  --input-type fasta \
  --metadata-csv data/train_metadata.csv \
  --metadata-id-col seq_id \
  --artifact-root artifacts/run_ab_fasta \
  --pipeline ab \
  --device auto

Train from an existing features table

python main_train.py \
  --input data/train_features.csv \
  --input-type features \
  --artifact-root artifacts/run_ab_features \
  --pipeline ab

Train with balancing enabled

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab_balanced \
  --pipeline ab \
  --balance \
  --balance-strategy minority

Important main_train.py options

Input handling

  • --input
  • --input-type {csv,features,fasta}
  • --artifact-root
  • --sequence-col
  • --label-col
  • --metadata-csv
  • --metadata-id-col

Feature extraction

  • --tau-list
  • --k-list
  • --keep-N
  • --include-compression
  • --include-sketch
  • --k-sketch
  • --n-hash
  • --sketch-stride
  • --sketch-window
  • --min-len
  • --n-jobs
  • --no-progress

Balancing

  • --balance
  • --balance-strategy {minority,fixed}
  • --balance-target-count

Common training parameters

  • --seed
  • --test-size
  • --val-size
  • --device {auto,cpu,cuda}

Pipeline A parameters

  • --a-n-neighbors
  • --a-p
  • --a-weights
  • --a-n-jobs
  • --target-recall-low
  • --a-n-boot
  • --a-boot-seed
  • --a-no-bootstrap

Pipeline B / A+B parameters

  • --epochs-b
  • --patience-b
  • --batch-size
  • --w-gene
  • --w-bp
  • --hidden-grid
  • --dropout-grid
  • --lr-grid
  • --topk-list
  • --b-n-boot
  • --b-boot-seed

What main_train.py writes

Typical outputs include:

  • intermediates/train_raw_from_fasta.csv when training starts from FASTA
  • intermediates/train_features.csv
  • intermediates/train_features_balanced.csv when balancing is enabled
  • preprocessing_summary.json
  • pipeline_a/ and optionally pipeline_b/ artifact directories

Advanced training entry points

If you want more control, you can run the lower-level training scripts directly.

Train Pipeline A only from a features CSV

python pipeline_fusion_classification.py \
  --features-csv data/train_features.csv \
  --artifact-dir artifacts/pipeline_a

Train the full A+B workflow from a features CSV

python train_pipeline_classification_detection.py \
  --features-csv data/train_features.csv \
  --artifact-root artifacts/pipeline_ab \
  --device auto

Prediction from FASTA with main_predict_fasta.py

Use this script when you have raw sequences in FASTA format and want the tool to extract features internally.

Supported modes:

  • --pipeline a
  • --pipeline b
  • --pipeline ab

Predict with Pipeline A only

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline a \
  --pipeline-a-artifact-dir artifacts/run_a/pipeline_a \
  --output results/query_pipeline_a.csv

Predict with Pipeline B only

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline b \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --topk 5 \
  --device auto \
  --output results/query_pipeline_b.csv

Predict with combined A+B

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --a-threshold low \
  --topk 5 \
  --device auto \
  --output results/query_pipeline_ab.csv

Save extracted features and metadata

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --features-out results/query_features.csv \
  --metadata-out results/query_metadata.json \
  --output results/query_predictions.csv

Important main_predict_fasta.py options

  • --input-fasta
  • --output
  • --features-out
  • --metadata-out
  • --pipeline {a,b,ab}
  • --pipeline-a-artifact-dir
  • --pipeline-b-artifact-dir
  • --a-threshold {default,high,low}
  • --a-threshold-value
  • --topk
  • --device {cpu,cuda,auto}
  • --feature-config-root to explicitly point to a training artifact root or config JSON
  • all feature extraction options such as --tau-list, --k-list, --include-sketch, and --include-compression

When available, main_predict_fasta.py automatically reuses the feature extraction configuration saved by main_train.py, reading preprocessing_summary.json from the training artifact root. Explicit CLI flags still override the saved configuration.


Prediction from precomputed features with main_predict_features.py

Use this script when the feature table is already available and you want to skip feature extraction.

Supported modes:

  • --pipeline a
  • --pipeline b
  • --pipeline ab

Predict with Pipeline A only

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline a \
  --pipeline-a-artifact-dir artifacts/run_a/pipeline_a \
  --output results/query_features_pipeline_a.csv

Predict with Pipeline B only

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline b \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --topk 5 \
  --device auto \
  --output results/query_features_pipeline_b.csv

Predict with combined A+B

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --a-threshold low \
  --topk 5 \
  --device auto \
  --output results/query_features_pipeline_ab.csv \
  --metadata-out results/query_features_pipeline_ab_meta.json

Important main_predict_features.py options

  • --input-features
  • --output
  • --metadata-out
  • --pipeline {a,b,ab}
  • --pipeline-a-artifact-dir
  • --pipeline-b-artifact-dir
  • --a-threshold {default,high,low}
  • --a-threshold-value
  • --topk
  • --device {cpu,cuda,auto}

Output formats

Both prediction entry points support:

  • .csv
  • .tsv
  • .json

If --output is omitted, a default filename is inferred from the input name.
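A plausible sketch of extension-based dispatch for the three formats (illustrative only; the tool's actual writer and defaulting rules are not documented here):

```python
from pathlib import Path
import pandas as pd

def write_predictions(df: pd.DataFrame, output: str) -> None:
    """Write a prediction table as CSV, TSV, or JSON based on the file extension."""
    suffix = Path(output).suffix.lower()
    if suffix == ".csv":
        df.to_csv(output, index=False)
    elif suffix == ".tsv":
        df.to_csv(output, sep="\t", index=False)
    elif suffix == ".json":
        df.to_json(output, orient="records")
    else:
        raise ValueError(f"unsupported output format: {suffix}")
```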


Threshold control in A+B mode

In combined mode, Pipeline A decides which rows are sent to Pipeline B.

You can choose one of the stored thresholds:

  • --a-threshold default
  • --a-threshold high
  • --a-threshold low

Or provide a custom numeric threshold:

--a-threshold-value 0.72

When --a-threshold-value is provided, it overrides --a-threshold.
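The override rule can be expressed as a small resolver. This is a sketch; the assumption that the stored thresholds are keyed by the CLI names (default, high, low) is mine:

```python
def resolve_threshold(stored: dict, name: str = "default", value=None) -> float:
    """Pick the Pipeline A gating threshold: an explicit numeric value wins."""
    if value is not None:
        return value          # --a-threshold-value overrides --a-threshold
    return stored[name]       # otherwise use the stored named threshold
```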


Pretrained models

main_download_pretrained.py downloads pretrained bundles from a manifest JSON.

The manifest can be:

  • a local JSON file passed with --manifest,
  • a remote URL,
  • or a file named pretrained_models_manifest.json placed next to the script.
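As a purely hypothetical illustration, a manifest might look like the following. The actual schema is defined by main_download_pretrained.py; every field name here is an assumption, informed only by the --bundle and sha256-verification options documented below:

```json
{
  "bundles": {
    "pipeline_ab": {
      "url": "https://example.org/pretrained/pipeline_ab.tar.gz",
      "sha256": "…"
    }
  }
}
```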

List available bundles

python main_download_pretrained.py \
  --manifest pretrained_models_manifest.json \
  --list

Download a bundle

python main_download_pretrained.py \
  --manifest pretrained_models_manifest.json \
  --bundle pipeline_ab \
  --output-dir artifacts/pretrained

Common options

  • --force
  • --no-verify-sha256
  • --metadata-out

Typical artifact layout

The exact contents depend on the workflow, but a typical A+B training run may look like this:

artifacts/run_ab/
├── intermediates/
│   ├── train_raw_from_fasta.csv
│   ├── train_features.csv
│   └── train_features_balanced.csv
├── pipeline_a/
│   ├── ...
├── pipeline_b/
│   ├── ...
└── preprocessing_summary.json

The internal contents of pipeline_a/ and pipeline_b/ depend on the lower-level training scripts and may include model weights, scalers, metadata JSON files, tuning summaries, and prediction tables.


Practical notes

  • main_train.py launches sibling scripts such as pipeline_fusion_classification.py and train_pipeline_classification_detection.py, so keep those files in the same repository folder.
  • For inference, the feature columns used at prediction time should match the feature set expected by the trained artifact.
  • For A+B training, labels plus partner and breakpoint annotations are required.
  • For FASTA-based training, make sure FASTA sequence identifiers match the metadata ID column exactly.

Quick start

1. Train A+B from a raw CSV

python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab \
  --pipeline ab \
  --device auto

2. Predict from FASTA

python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --output results/query_predictions.csv

3. Predict from precomputed features

python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --output results/query_predictions_from_features.csv
