otk (ecDNA Analysis Toolkit) is a machine learning toolkit for predicting extrachromosomal DNA (ecDNA) cargo genes. At the gene level it classifies each gene as ecDNA cargo or non-ecDNA; at the sample level it assigns a focal amplification type (nofocal, noncircular, or circular/ecDNA).
Based on the paper: Wang, S., et al. (2024). Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. Nature Communications.
- Gene-level ecDNA cargo gene prediction with machine learning and deep learning models
- Sample-level focal amplification type classification (nofocal/noncircular/circular)
- Multiple pre-trained models (XGBoost, Neural Networks, TabPFN)
- Efficient command-line interface for training and prediction
- GPU acceleration support
- Pre-trained models ready to use after pip install
- RESTful API for web service deployment
- Chinese mirror support for large model downloads
pip install otk-ecdna
This installs the otk CLI command and all pre-trained models except TabPFN, which is ~275MB and must be downloaded separately.
The TabPFN model (~275MB) is hosted on GitHub Releases:
# List available large models
otk download --list
# Download TabPFN model
otk download --model tabpfn
To install from source:
git clone https://github.com/WangLabCSU/otk.git
cd otk/otk
pip install -e .
# Check installation
otk --version
# List available models
otk models
# Run prediction (example)
otk predict --input data.csv --output predictions.csv --model xgb_new
# Start API server
otk api --port 8000
# List all available models with performance metrics
otk models
# Analyze a specific model
otk analyze --model xgb_new
# Generate model configuration
otk config generate --model xgb_new
# Train a single model
otk train --model xgb_new --gpu 0
# Train neural network model
otk train --model transformer --gpu 0
# Train all models sequentially
otk train --all --gpu 0
# Train all models in parallel on multiple GPUs
otk train --all --parallel --gpus 0,1,2,3
# CPU-only training
otk train --model xgb_new --gpu -1
# Basic prediction
otk predict --input data.csv --output predictions.csv --model xgb_new
# With GPU acceleration
otk predict -i data.csv -o results/ -m transformer --gpu 0
# With custom threshold
otk predict -i data.csv -o predictions.csv -m xgb_new --threshold 0.5
# Start API with default settings (base path /otk)
otk api
# Custom port
otk api --port 8080
# Serve at root (no base path)
otk api --base-path ""
# Development mode with auto-reload
otk api --reload
# Multiple workers
otk api --workers 4
# List large models requiring download
otk download --list
# Download TabPFN model
otk download --model tabpfn
# Force re-download
otk download --model tabpfn --force
Input data should be in CSV format.
Minimal required columns:
| Column | Description |
|---|---|
| sample | Tumor sample ID |
| gene_id | Gene ID (e.g., ENSG00000284662) |
| segVal | Total gene copy number |
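A minimal input file can be produced with Python's standard `csv` module. The sketch below uses the three column names from the table above; the sample ID, the second gene ID, and the copy number values are made-up illustration data, and `write_minimal_input` is a hypothetical helper, not part of otk.

```python
import csv
import io

# Column names come from the table above; the rows are made-up values.
REQUIRED_COLUMNS = ["sample", "gene_id", "segVal"]

rows = [
    {"sample": "S1", "gene_id": "ENSG00000284662", "segVal": 12},
    {"sample": "S1", "gene_id": "ENSG00000000003", "segVal": 2},
]


def write_minimal_input(rows, handle):
    """Write rows containing exactly the required columns, in a fixed order."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row is missing required columns: {missing}")
        writer.writerow({c: row[c] for c in REQUIRED_COLUMNS})


# An in-memory buffer keeps the example side-effect free;
# pass an open file (newline="") instead to create data.csv.
buffer = io.StringIO()
write_minimal_input(rows, buffer)
print(buffer.getvalue())
```

Writing to a real file with `open("data.csv", "w", newline="")` should yield something usable as `--input`, assuming otk accepts a plain comma-separated file with a header row.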
Auto-filled columns (defaults applied if missing):
| Column | Default | Description |
|---|---|---|
| minor_cn | 0 | Minor copy number |
| intersect_ratio | 1.0 | Segment-gene overlap ratio |
| purity | 0.8 | Tumor purity |
| ploidy | 2.0 | Genome ploidy |
| AScore | 10.0 | Aneuploidy score |
| pLOH | 0.1 | LOH proportion |
| cna_burden | 0.2 | CNA burden |
| CN1-CN19 | 0.05 each | Copy number signatures |
| type | - | Cancer type → auto-converts to type_* columns |
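To make the effect of these defaults concrete, here is a rough sketch of how missing columns could be filled for one input row. The default values are copied from the table; the `fill_defaults` helper and its treatment of empty cells as missing are assumptions for illustration and may differ from what otk does internally.

```python
# Defaults copied from the table above. The fill logic is an
# illustration only, not otk's actual implementation.
DEFAULTS = {
    "minor_cn": 0,
    "intersect_ratio": 1.0,
    "purity": 0.8,
    "ploidy": 2.0,
    "AScore": 10.0,
    "pLOH": 0.1,
    "cna_burden": 0.2,
}
# CN1 through CN19 each default to 0.05.
DEFAULTS.update({f"CN{i}": 0.05 for i in range(1, 20)})


def fill_defaults(row):
    """Return a copy of row with absent or empty default columns filled in."""
    filled = dict(row)
    for column, default in DEFAULTS.items():
        # Assumption: an empty cell counts as missing, like an absent column.
        if filled.get(column) in (None, ""):
            filled[column] = default
    return filled


example = fill_defaults(
    {"sample": "S1", "gene_id": "ENSG00000284662", "segVal": 12, "purity": 0.65}
)
print(example["purity"], example["ploidy"], example["CN19"])  # 0.65 2.0 0.05
```

Values supplied by the user (here `purity`) are left untouched; only absent columns receive the tabulated default.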
Automatically generated features (from gene_id matching):
| Column | Description |
|---|---|
| freq_Linear | Prior frequency in linear amplifications |
| freq_BFB | Prior frequency in BFB events |
| freq_Circular | Prior frequency in ecDNA |
| freq_HR | Prior frequency in HR events |
Training data requires:
| Column | Description |
|---|---|
| y | Binary label (1=ecDNA cargo gene, 0=not) |
Supported cancer types (24): BLCA, BRCA, CESC, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PRAD, READ, SARC, SKCM, STAD, THCA, UCEC, UVM
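The conversion of `type` into `type_*` columns is presumably a one-hot encoding over these 24 codes. The sketch below illustrates that idea under two assumptions that the text does not confirm: that the columns are named `type_<CODE>` with 0/1 values, and that an unsupported code is rejected. `one_hot_type` is a hypothetical helper.

```python
# The 24 supported cancer type codes listed above.
CANCER_TYPES = [
    "BLCA", "BRCA", "CESC", "COAD", "DLBC", "ESCA", "GBM", "HNSC",
    "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "OV",
    "PRAD", "READ", "SARC", "SKCM", "STAD", "THCA", "UCEC", "UVM",
]


def one_hot_type(cancer_type):
    """Map a cancer type code to a dict of type_<CODE> indicator columns.

    Assumption: column naming and 0/1 encoding are guesses based on the
    "type_*" description; otk may handle unknown codes differently.
    """
    if cancer_type not in CANCER_TYPES:
        raise ValueError(f"unsupported cancer type: {cancer_type!r}")
    return {f"type_{code}": int(code == cancer_type) for code in CANCER_TYPES}


encoded = one_hot_type("GBM")
print(encoded["type_GBM"], encoded["type_BRCA"], sum(encoded.values()))  # 1 0 1
```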
Prediction output columns:
| Column | Description |
|---|---|
| sample | Sample ID |
| gene_id | Gene ID |
| prediction_prob | Probability of ecDNA (0-1) |
| prediction | Binary classification (0/1) |
| sample_level_prediction_label | Sample type: nofocal/noncircular/circular |
| sample_level_prediction | Sample type code (0/1/2) |
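As a sketch of how this output might be consumed, the snippet below groups predicted cargo genes by sample using only the standard library. The column names follow the table above; the rows are made-up, and it is an assumption that the output file is a comma-separated CSV with a header row.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the column layout described above.
EXAMPLE_OUTPUT = """\
sample,gene_id,prediction_prob,prediction,sample_level_prediction_label,sample_level_prediction
S1,ENSG00000284662,0.91,1,circular,2
S1,ENSG00000000003,0.12,0,circular,2
S2,ENSG00000284662,0.03,0,nofocal,0
"""


def cargo_genes_by_sample(handle):
    """Return {sample: [gene_id, ...]} for genes with prediction == 1."""
    cargo = defaultdict(list)
    for row in csv.DictReader(handle):
        genes = cargo[row["sample"]]  # registers samples with no cargo genes too
        # float() first so that both "1" and "1.0" are accepted.
        if int(float(row["prediction"])) == 1:
            genes.append(row["gene_id"])
    return dict(cargo)


result = cargo_genes_by_sample(io.StringIO(EXAMPLE_OUTPUT))
print(result)  # {'S1': ['ENSG00000284662'], 'S2': []}
```

To read a real result, replace the in-memory buffer with `open("predictions.csv", newline="")`.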
Sample classification rules:
- circular(2): Any gene predicted as ecDNA cargo
- noncircular(1): No ecDNA but segVal > ploidy + 2
- nofocal(0): Otherwise
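The rules above can be sketched as a small function. This is one reading of them, not otk's code: it assumes the `segVal > ploidy + 2` test is applied per gene and that a single qualifying gene is enough, and it uses the documented default ploidy of 2.0. `classify_sample` is a hypothetical name.

```python
def classify_sample(genes, ploidy=2.0):
    """Classify one sample from its per-gene results.

    genes: iterable of (prediction, segVal) pairs, prediction being 0 or 1.
    Returns (code, label) following the rules listed above.
    Assumption: one gene with segVal > ploidy + 2 is enough for noncircular.
    """
    genes = list(genes)
    if any(prediction == 1 for prediction, _ in genes):
        return 2, "circular"
    if any(seg_val > ploidy + 2 for _, seg_val in genes):
        return 1, "noncircular"
    return 0, "nofocal"


print(classify_sample([(0, 2.0), (1, 9.0)]))  # (2, 'circular')
print(classify_sample([(0, 6.5), (0, 2.0)]))  # (1, 'noncircular')
print(classify_sample([(0, 4.0)]))            # (0, 'nofocal')
```

The last case shows that the comparison is strict: with ploidy 2.0, a gene at exactly 4.0 copies does not count as a focal amplification.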
| Model | Type | Test auPRC | Description |
|---|---|---|---|
| xgb_new | XGBoost | 0.8339 | Optimized with feature engineering |
| tabpfn | TabPFN | 0.8323 | TabPFN ensemble (~275MB, needs download) |
| deep_residual | Neural | 0.8132 | Deep residual network |
| xgb_tuned | XGBoost | 0.8065 | Hyperparameter tuned |
| optimized_residual | Neural | 0.7906 | Optimized residual network |
| baseline_mlp | Neural | 0.7663 | Simple MLP baseline |
| dgit_super | Neural | 0.7662 | Deep gated interaction transformer |
| xgb_paper | XGBoost | 0.7138 | Paper reproduction (11 features) |
| transformer | Neural | 0.6875 | Transformer architecture |
All models use the same 80/10/10 data split with seed=2026 for reproducibility.
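For readers unfamiliar with the Test auPRC column in the table above, the following sketch computes the area under the precision-recall curve as average precision, in plain Python. Which estimator otk actually uses (step-wise average precision or trapezoidal integration) is not stated here, so treat this as an illustration of the metric only. The labels and scores are made-up, and tied scores are assumed not to occur.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 0/1 ground truth (1 = ecDNA cargo gene).
    scores: predicted probabilities, higher means more likely positive.
    Assumption: no tied scores; ties would need grouped handling.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank items from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += true_positives / rank
    return precision_sum / total_positives


ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))  # 0.8333
```

A perfect ranking scores 1.0, while a random ranking scores roughly the fraction of positives, which is why auPRC is informative when the positive class is rare.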
Start a RESTful API for web-based prediction:
# Start API (default base path /otk)
otk api
# Access points:
# - API docs: http://localhost:8000/otk/docs
# - Health: http://localhost:8000/otk/health
# - Web UI: http://localhost:8000/otk/
See otk_api/README.md for full API documentation.
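The access points depend on the port and base path chosen at startup. The sketch below derives them and offers an optional reachability check using only the standard library. `endpoint_urls` and `is_healthy` are hypothetical helpers; that `/docs` and `/health` keep the same relative paths when the base path is empty is an assumption, and the format of the health response is not inspected because it is not documented here.

```python
from urllib.request import urlopen


def endpoint_urls(host="localhost", port=8000, base_path="/otk"):
    """Build the docs, health, and web UI URLs for a running otk API."""
    base = base_path.strip("/")
    root = f"http://{host}:{port}" + (f"/{base}" if base else "")
    return {
        "docs": f"{root}/docs",
        "health": f"{root}/health",
        "web_ui": f"{root}/",
    }


def is_healthy(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and connection failures
        return False


urls = endpoint_urls()
print(urls["health"])  # http://localhost:8000/otk/health
print(endpoint_urls(port=8080, base_path="")["docs"])  # http://localhost:8080/docs
```

With a server started via `otk api`, calling `is_healthy(urls["health"])` gives a quick readiness check before sending prediction requests.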
otk/
├── src/otk/ # Core library
│ ├── data/ # Data handling
│ ├── models/ # Model implementations
│ ├── predict/ # Prediction utilities
│ └── cli.py # Command-line interface
├── otk_api/ # FastAPI web service
│ ├── api/ # API implementation
│ ├── models/ # Pre-trained models
│ └── static/ # Performance charts
├── configs/ # Model configurations
└── tests/ # Unit tests
If you use otk in your research, please cite:
Wang, S., et al. (2024). Machine learning-based extrachromosomal DNA
identification in large-scale cohorts reveals its clinical implications
in cancer. Nature Communications.
MIT License. See the LICENSE file for details.
- Homepage: https://github.com/WangLabCSU/otk
- PyPI: https://pypi.org/project/otk-ecdna/
- Email: wangshx@csu.edu.cn