A command-line toolkit for DNA/RNA fusion sequence analysis based on engineered sequence features and two complementary machine-learning pipelines.
This repository lets you:
- extract numerical features from CSV or FASTA sequence data,
- optionally balance labeled datasets by random undersampling,
- train a binary fusion classifier (Pipeline A),
- train a fusion partner + breakpoint predictor (Pipeline B within the A+B workflow),
- run inference directly from FASTA files,
- run inference from precomputed feature tables,
- download pretrained artifacts from a manifest file.
The tool is organized around two prediction stages.
Pipeline A is a KNN-based binary classifier trained on extracted sequence features. It predicts whether a sequence is likely to be a fusion event.
Typical output columns include:
- `pipeline_a_proba`
- `pipeline_a_pred_default`
- `pipeline_a_pred_high`
- `pipeline_a_pred_low`
Pipeline B is a PyTorch neural network used in the A+B workflow to predict:
- the most likely gene 1 partner,
- the most likely gene 2 partner,
- the relative breakpoint,
- the absolute breakpoint when sequence length is available.
Typical output columns include:
- `pipeline_b_gene1_topK`
- `pipeline_b_gene2_topK`
- `pipeline_b_rel_bp_pred`
- `pipeline_b_abs_bp_pred`
In combined mode, Pipeline A acts as a gate before Pipeline B. Only rows whose Pipeline A score passes a chosen threshold are forwarded to Pipeline B.
The output includes a `pipeline_ab_called` column. When `pipeline_ab_called = 0`, Pipeline B columns are typically left as NaN.
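The gating logic can be illustrated with a minimal pure-Python sketch. The column names follow this README, but the function below is illustrative, not the tool's actual code:

```python
import math

def gate_rows(rows, threshold=0.5):
    """Illustrative A+B gating: forward only rows whose Pipeline A
    probability passes the threshold; ungated rows keep Pipeline B
    columns as NaN."""
    gated = []
    for row in rows:
        called = 1 if row["pipeline_a_proba"] >= threshold else 0
        row = dict(row, pipeline_ab_called=called)
        if not called:
            # Not forwarded to Pipeline B: leave its columns as NaN.
            row["pipeline_b_rel_bp_pred"] = math.nan
        gated.append(row)
    return gated
```

In the real tool the threshold comes from the Pipeline A artifact or the CLI flags described below.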
```
fusion_detection_tool/
├── .gitignore
├── balance_dataset.py
├── features_selection.py
├── main_download_pretrained.py
├── main_predict_fasta.py
├── main_predict_features.py
├── main_train.py
├── pipeline_fusion_classification.py
└── train_pipeline_classification_detection.py
```
- `features_selection.py`: extracts engineered sequence features from CSV or FASTA inputs.
- `balance_dataset.py`: randomly undersamples labeled datasets.
- `pipeline_fusion_classification.py`: lower-level training script for Pipeline A only.
- `train_pipeline_classification_detection.py`: lower-level training script for the full A+B workflow from a features CSV.
- `main_train.py`: main training orchestrator. It can start from a raw CSV, a FASTA file plus metadata CSV, or a precomputed features CSV, and can optionally balance the dataset before training.
- `main_predict_fasta.py`: extracts features from a FASTA file and runs prediction with Pipeline A, Pipeline B, or combined A+B.
- `main_predict_features.py`: runs prediction starting from an already computed features CSV.
- `main_download_pretrained.py`: downloads pretrained model bundles from a local or remote manifest JSON.
This codebase uses modern Python syntax such as `str | Path`, so Python 3.10+ is recommended.
Install the main dependencies with:

```
pip install -r requirements.txt
```

For extra functionality or better performance:

```
pip install xxhash
```

Notes:

- `tqdm` is used for progress bars.
- `joblib` is used for serialization and some parallel workloads.
- `xxhash` improves sketch/hash feature performance when sketch-based features are enabled.
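Optional dependencies like `xxhash` are usually guarded by an import fallback. As a sketch of that pattern (assumed here, not taken from the tool's source), with a stdlib fallback:

```python
import hashlib

# xxhash is optional: prefer it when installed, otherwise fall back
# to a slower stdlib hash with the same 64-bit output range.
try:
    import xxhash

    def hash64(data: bytes) -> int:
        return xxhash.xxh64(data).intdigest()
except ImportError:
    def hash64(data: bytes) -> int:
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
```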
Clone the repository and move into it:

```
git clone <your-repo-url>
cd <your-repo-folder>
```

Install the package in editable mode:

```
pip install -e .
```

Then check the available options:

```
python main_train.py --help
python main_predict_fasta.py --help
python main_predict_features.py --help
```

You can also use the installed CLI entry points:

```
fusion-train --help
fusion-predict-fasta --help
fusion-predict-features --help
```

To run the automatic test suite:

```
python -m unittest discover -s tests -v
```

Before publishing, the repository was smoke-tested locally with:

```
python -m unittest discover -s tests -v
python main_train.py --help
python main_predict_fasta.py --help
python main_predict_features.py --help
python balance_dataset.py --help
python features_selection.py --help
python pipeline_fusion_classification.py --help
python train_pipeline_classification_detection.py --help
python main_download_pretrained.py --help
```

- end-to-end smoke runs for feature extraction, balancing, Pipeline A training, A-only prediction, A+B prediction, and minimal lower-level A+B training
- editable install verification with the packaged CLI entry points such as `fusion-train` and `fusion-predict-fasta`
To keep the GitHub repository lightweight, it includes only one sample dataset:

`dataset/smoke_5k_stratified.csv`
The larger raw datasets, derived feature tables, and generated artifacts are intentionally excluded from version control.
When using `main_train.py --input-type csv`, the required columns depend on the selected pipeline.
For Pipeline A, required columns:

- `sequence`
- `label`

For the A+B workflow, required columns:

- `sequence`
- `label`
- `gene1`
- `gene2`
- `junction_point`
Example for full A+B training:

```
sequence,label,gene1,gene2,junction_point
ACGTACGTACGTACGT,1,GENE_A,GENE_B,8
TGCATGCATGCATGCA,0,GENE_X,GENE_Y,7
```

`main_train.py` also supports `--input-type fasta`.
In this mode:

- the FASTA file provides the sequences,
- a separate metadata CSV provides labels and any extra training columns,
- FASTA `seq_id` values are matched against a metadata column specified by `--metadata-id-col`.
The metadata CSV must contain at least:

For Pipeline A:

- the ID column used for matching, for example `seq_id`
- `label`

For the A+B workflow:

- the ID column used for matching, for example `seq_id`
- `label`
- `gene1`
- `gene2`
- `junction_point`
Example FASTA:

```
>sample_001
ACGTACGTACGTACGT
>sample_002
TGCATGCATGCATGCA
```

Example metadata CSV for A+B:

```
seq_id,label,gene1,gene2,junction_point
sample_001,1,GENE_A,GENE_B,8
sample_002,0,GENE_X,GENE_Y,7
```

A features CSV is a table that already contains engineered feature columns.
Required columns depend on the selected pipeline:

For Pipeline A:

- `label`

For the A+B workflow:

- `label`
- `gene1`
- `gene2`
- `junction_point`
Extra metadata columns are allowed.
`main_predict_fasta.py` takes a FASTA file, extracts the features internally, and runs prediction.

Typical FASTA-derived metadata columns preserved in the feature table may include:

- `seq_id`
- `description`
- `fasta_header`
`main_predict_features.py` takes a precomputed features CSV and skips feature extraction.
This is useful when:
- features were already generated offline,
- you want faster repeated inference,
- you want strict control over the exact feature table being scored.
The input features CSV must contain the feature columns expected by the trained artifact. Extra metadata columns are usually preserved in the output.
`features_selection.py` can be used as a standalone script to generate feature tables from CSV or FASTA inputs.
Supported feature families include:

- Shannon and Rényi entropy
- mutual information across different lags (`tau`)
- resolved mutual information
- GC content
- canonical k-mer frequencies
- optional compression-derived features
- sequence length
- optional sketch/hash-based features
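To make a few of these families concrete, here is a minimal sketch of Shannon entropy, GC content, and canonical k-mer normalization. These are the standard textbook definitions; the tool's own feature implementations may differ in detail:

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits over the base composition."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gc_content(seq: str) -> float:
    """Fraction of G and C bases."""
    return sum(seq.count(b) for b in "GC") / len(seq)

COMP = str.maketrans("ACGT", "TGCA")

def canonical_kmer(kmer: str) -> str:
    """A k-mer and its reverse complement map to the same canonical form."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)
```

For example, `"ACG"` and its reverse complement `"CGT"` both canonicalize to `"ACG"`, so their counts are pooled in a canonical k-mer frequency table.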
Example with CSV input:

```
python features_selection.py \
  --input data/train.csv \
  --output data/train_features.csv \
  --input-format csv \
  --sequence-col sequence \
  --label-col label \
  --tau-list 1,2,3,4,5 \
  --k-list 3,4,5
```

Example with FASTA input:

```
python features_selection.py \
  --input data/query.fasta \
  --output data/query_features.csv \
  --input-format fasta \
  --tau-list 1,2,3,4,5 \
  --k-list 3,4,5
```

Useful options:

- `--keep-N` keeps `N` bases during cleaning.
- `--include-compression` adds compression-based features.
- `--include-sketch` adds sketch/hash-based features.
- `--k-sketch`, `--n-hash`, `--sketch-stride`, `--sketch-window` control sketch features.
- `--min-len` discards sequences shorter than a threshold.
- `--n-jobs` enables parallel processing where supported.
`balance_dataset.py` performs random undersampling while preserving the original columns.
Supported strategies:

- `minority`: every class is reduced to the minority class size
- `fixed`: every class is reduced to `--target-count`
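The two strategies amount to the following pure-Python sketch (illustrative only; the actual script works on CSV files and preserves all columns):

```python
import random
from collections import defaultdict

def undersample(rows, label_key="label", strategy="minority",
                target_count=None, seed=0):
    """Randomly undersample rows per class, mirroring the two strategies."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    if strategy == "minority":
        # Every class is reduced to the size of the smallest class.
        target = min(len(g) for g in groups.values())
    else:
        # "fixed": every class is reduced to the requested count.
        target = target_count
    rng = random.Random(seed)
    sampled = []
    for g in groups.values():
        sampled.extend(rng.sample(g, min(target, len(g))))
    return sampled
```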
```
python balance_dataset.py \
  --input data/train_features.csv \
  --output data/train_features_balanced.csv \
  --label-col label \
  --strategy minority
```

```
python balance_dataset.py \
  --input data/train_features.csv \
  --output data/train_features_balanced.csv \
  --label-col label \
  --strategy fixed \
  --target-count 5000
```

`main_train.py` is the recommended training entry point for most users.
It can:
- read a raw CSV, a FASTA+metadata pair, or a features CSV,
- extract features when needed,
- optionally balance the dataset,
- train either Pipeline A or A+B,
- save artifacts under a single output directory.
Pipeline A only, from a raw CSV:

```
python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_a \
  --pipeline a
```

Full A+B, from a raw CSV:

```
python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab \
  --pipeline ab \
  --device auto
```

Pipeline A only, from FASTA plus metadata:

```
python main_train.py \
  --input data/train.fasta \
  --input-type fasta \
  --metadata-csv data/train_metadata.csv \
  --metadata-id-col seq_id \
  --artifact-root artifacts/run_a_fasta \
  --pipeline a
```

Full A+B, from FASTA plus metadata:

```
python main_train.py \
  --input data/train.fasta \
  --input-type fasta \
  --metadata-csv data/train_metadata.csv \
  --metadata-id-col seq_id \
  --artifact-root artifacts/run_ab_fasta \
  --pipeline ab \
  --device auto
```

Full A+B, from a precomputed features CSV:

```
python main_train.py \
  --input data/train_features.csv \
  --input-type features \
  --artifact-root artifacts/run_ab_features \
  --pipeline ab
```

Full A+B with dataset balancing:

```
python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab_balanced \
  --pipeline ab \
  --balance \
  --balance-strategy minority
```

Input and data options:

- `--input`
- `--input-type {csv,features,fasta}`
- `--artifact-root`
- `--sequence-col`
- `--label-col`
- `--metadata-csv`
- `--metadata-id-col`
Feature extraction options:

- `--tau-list`
- `--k-list`
- `--keep-N`
- `--include-compression`
- `--include-sketch`
- `--k-sketch`
- `--n-hash`
- `--sketch-stride`
- `--sketch-window`
- `--min-len`
- `--n-jobs`
- `--no-progress`

Balancing options:

- `--balance`
- `--balance-strategy {minority,fixed}`
- `--balance-target-count`

General training options:

- `--seed`
- `--test-size`
- `--val-size`
- `--device {auto,cpu,cuda}`

Pipeline A options:

- `--a-n-neighbors`
- `--a-p`
- `--a-weights`
- `--a-n-jobs`
- `--target-recall-low`
- `--a-n-boot`
- `--a-boot-seed`
- `--a-no-bootstrap`

Pipeline B options:

- `--epochs-b`
- `--patience-b`
- `--batch-size`
- `--w-gene`
- `--w-bp`
- `--hidden-grid`
- `--dropout-grid`
- `--lr-grid`
- `--topk-list`
- `--b-n-boot`
- `--b-boot-seed`
Typical outputs include:
- `intermediates/train_raw_from_fasta.csv` when training starts from FASTA
- `intermediates/train_features.csv`
- `intermediates/train_features_balanced.csv` when balancing is enabled
- `preprocessing_summary.json`
- `pipeline_a/` and optionally `pipeline_b/` artifact directories
If you want more control, you can run the lower-level training scripts directly.
```
python pipeline_fusion_classification.py \
  --features-csv data/train_features.csv \
  --artifact-dir artifacts/pipeline_a
```

```
python train_pipeline_classification_detection.py \
  --features-csv data/train_features.csv \
  --artifact-root artifacts/pipeline_ab \
  --device auto
```

Use `main_predict_fasta.py` when you have raw sequences in FASTA format and want the tool to extract features internally.
Supported modes:
- `--pipeline a`
- `--pipeline b`
- `--pipeline ab`
Pipeline A only:

```
python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline a \
  --pipeline-a-artifact-dir artifacts/run_a/pipeline_a \
  --output results/query_pipeline_a.csv
```

Pipeline B only:

```
python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline b \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --topk 5 \
  --device auto \
  --output results/query_pipeline_b.csv
```

Combined A+B:

```
python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --a-threshold low \
  --topk 5 \
  --device auto \
  --output results/query_pipeline_ab.csv
```

Combined A+B, also saving the intermediate feature table and run metadata:

```
python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --features-out results/query_features.csv \
  --metadata-out results/query_metadata.json \
  --output results/query_predictions.csv
```

Key options:

- `--input-fasta`
- `--output`
- `--features-out`
- `--metadata-out`
- `--pipeline {a,b,ab}`
- `--pipeline-a-artifact-dir`
- `--pipeline-b-artifact-dir`
- `--a-threshold {default,high,low}`
- `--a-threshold-value`
- `--topk`
- `--device {cpu,cuda,auto}`
- `--feature-config-root` to explicitly point to a training artifact root or config JSON
- all feature extraction options such as `--tau-list`, `--k-list`, `--include-sketch`, and `--include-compression`
When available, `main_predict_fasta.py` now automatically reuses the feature extraction configuration saved by `main_train.py`, by reading `preprocessing_summary.json` from the training artifacts. Explicit CLI flags still override the saved configuration.
Use this script when the feature table is already available and you want to skip feature extraction.
Supported modes:
- `--pipeline a`
- `--pipeline b`
- `--pipeline ab`
Pipeline A only:

```
python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline a \
  --pipeline-a-artifact-dir artifacts/run_a/pipeline_a \
  --output results/query_features_pipeline_a.csv
```

Pipeline B only:

```
python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline b \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --topk 5 \
  --device auto \
  --output results/query_features_pipeline_b.csv
```

Combined A+B:

```
python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --a-threshold low \
  --topk 5 \
  --device auto \
  --output results/query_features_pipeline_ab.csv \
  --metadata-out results/query_features_pipeline_ab_meta.json
```

Key options:

- `--input-features`
- `--output`
- `--metadata-out`
- `--pipeline {a,b,ab}`
- `--pipeline-a-artifact-dir`
- `--pipeline-b-artifact-dir`
- `--a-threshold {default,high,low}`
- `--a-threshold-value`
- `--topk`
- `--device {cpu,cuda,auto}`
Both prediction entry points support these output formats:

- `.csv`
- `.tsv`
- `.json`
If `--output` is omitted, a default filename is inferred from the input name.
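Dispatching on the output extension typically looks like the following sketch (`write_table` is a hypothetical helper, not the tool's actual function):

```python
import csv
import json
from pathlib import Path

def write_table(rows, path):
    """Write a list of row dicts as .csv, .tsv, or .json by extension."""
    path = Path(path)
    if path.suffix == ".json":
        path.write_text(json.dumps(rows, indent=2))
    else:
        # Default to comma-separated; use tabs for .tsv.
        delim = "\t" if path.suffix == ".tsv" else ","
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=rows[0].keys(), delimiter=delim)
            writer.writeheader()
            writer.writerows(rows)
```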
In combined mode, Pipeline A decides which rows are sent to Pipeline B.
You can choose one of the stored thresholds:
- `--a-threshold default`
- `--a-threshold high`
- `--a-threshold low`
Or provide a custom numeric threshold:

```
--a-threshold-value 0.72
```

When `--a-threshold-value` is provided, it overrides `--a-threshold`.
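The override semantics can be summarized in a tiny sketch (`resolve_threshold` and the `stored` mapping are hypothetical names; the stored values stand in for the thresholds calibrated during training):

```python
def resolve_threshold(stored, name="default", value=None):
    """A custom numeric value, when given, overrides the named threshold."""
    return value if value is not None else stored[name]
```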
`main_download_pretrained.py` downloads pretrained bundles from a manifest JSON.
The manifest can be:
- a local JSON file passed with `--manifest`,
- a remote URL,
- or a file named `pretrained_models_manifest.json` placed next to the script.
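The manifest schema is not documented in this README; purely as an illustration, a manifest describing a bundle named `pipeline_ab` might look like the following (the field names and URL here are hypothetical):

```json
{
  "bundles": {
    "pipeline_ab": {
      "url": "https://example.org/models/pipeline_ab.tar.gz",
      "sha256": "..."
    }
  }
}
```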
List available bundles:

```
python main_download_pretrained.py \
  --manifest pretrained_models_manifest.json \
  --list
```

Download a bundle:

```
python main_download_pretrained.py \
  --manifest pretrained_models_manifest.json \
  --bundle pipeline_ab \
  --output-dir artifacts/pretrained
```

Other options: `--force`, `--no-verify-sha256`, `--metadata-out`.
The exact contents depend on the workflow, but a typical A+B training run may look like this:
```
artifacts/run_ab/
├── intermediates/
│   ├── train_raw_from_fasta.csv
│   ├── train_features.csv
│   └── train_features_balanced.csv
├── pipeline_a/
│   ├── ...
├── pipeline_b/
│   ├── ...
└── preprocessing_summary.json
```
The internal contents of `pipeline_a/` and `pipeline_b/` depend on the lower-level training scripts and may include model weights, scalers, metadata JSON files, tuning summaries, and prediction tables.
- `main_train.py` launches sibling scripts such as `pipeline_fusion_classification.py` and `train_pipeline_classification_detection.py`, so keep those files in the same repository folder.
- For inference, the feature columns used at prediction time should match the feature set expected by the trained artifact.
- For A+B training, labels plus partner and breakpoint annotations are required.
- For FASTA-based training, make sure FASTA sequence identifiers match the metadata ID column exactly.
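The ID-matching requirement can be verified up front with a short standalone check (a hypothetical helper, not part of the toolkit; it assumes the FASTA ID is the first whitespace-delimited token of each header line):

```python
import csv

def check_fasta_ids_match(fasta_path, metadata_csv, id_col="seq_id"):
    """Return FASTA record IDs that have no matching row in the metadata CSV."""
    with open(fasta_path) as fh:
        # Take the first token of each '>' header line as the sequence ID.
        fasta_ids = {line[1:].split()[0] for line in fh if line.startswith(">")}
    with open(metadata_csv, newline="") as fh:
        meta_ids = {row[id_col] for row in csv.DictReader(fh)}
    return sorted(fasta_ids - meta_ids)
```

An empty result means every FASTA record can be matched during `--input-type fasta` training.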
Train:

```
python main_train.py \
  --input data/train.csv \
  --input-type csv \
  --artifact-root artifacts/run_ab \
  --pipeline ab \
  --device auto
```

Predict from FASTA:

```
python main_predict_fasta.py \
  --input-fasta data/query.fasta \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --output results/query_predictions.csv
```

Predict from a features CSV:

```
python main_predict_features.py \
  --input-features data/query_features.csv \
  --pipeline ab \
  --pipeline-a-artifact-dir artifacts/run_ab/pipeline_a \
  --pipeline-b-artifact-dir artifacts/run_ab/pipeline_b \
  --output results/query_predictions_from_features.csv
```