`pip install -i https://test.pypi.org/simple/ SCellBOW0328`

`SCellBOW_pretrain(adata_source, save_dir, vec_size=300, n_worker=1, iter=20)`

Create the pre-trained model from the source dataset.
The arguments are as follows:
- adata_source: the preprocessed scanpy AnnData object for the source dataset
- save_dir: name of the directory in which to save the source model
- vec_size: dimensionality of the embedding vectors. Defaults to 300 for SCellBOW.
- n_worker: number of worker threads used to train the model. For a fully deterministic, reproducible run, limit the model to one worker thread. Defaults to 1 for SCellBOW.
- iter: number of iterations (epochs) over the corpus. Defaults to 20 for SCellBOW.

`SCellBOW_cluster(adata_target, save_dir, resolution=1.0, neighbors=15, iter=20).run()`

Transfer the weights of the pre-trained model to obtain single-cell embeddings for the target dataset.
The arguments are as follows:
- adata_target: the preprocessed scanpy AnnData object for the target dataset
- save_dir: name of the directory where the source model is saved
- resolution: granularity of the Leiden clustering. Defaults to 1.0 for SCellBOW.
- neighbors: number of neighboring data points. Defaults to 15 for SCellBOW.
- iter: number of iterations (epochs) over the corpus. Defaults to 20 for SCellBOW.
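
The tutorial in this README colours the UMAP by a column named `'clusters_' + str(resolution)`. The sketch below only illustrates that string construction. That the cluster column is always named this way is an assumption drawn from the tutorial's plotting call, and `cluster_key` is a hypothetical helper, not part of SCellBOW.

```python
def cluster_key(resolution):
    """Build the column name used in the tutorial's plotting call.

    Hypothetical helper: mirrors 'clusters_' + str(resolution).
    """
    return "clusters_" + str(resolution)


print(cluster_key(1.0))  # clusters_1.0
print(cluster_key(1))    # clusters_1
```

Note that a float and an integer resolution produce different strings, so it is probably safest to build the key from the same value that was passed as `resolution`.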

`SCellBOW_algebra(adata_target, adata_surv, save_dir, type, bootstrap_samples=50, split=0.2, unit="UMI", n_top_features=1000, iter=20).run()`

Rank the single-cell clusters or subtypes based on their relative aggressiveness.
The arguments are as follows:
- adata_target: the scanpy AnnData object for the single-cell data, with the annotation (subtype, cluster) in adata_target.obs
- adata_surv: the AnnData object for the bulk RNA-seq gene expression matrix, with survival data in adata_surv.obs
- save_dir: name of the directory where the source model is saved
- type: column of adata_target.obs on which we want to classify (subtype/clusters).
- algebra: values from the type column of adata_target.obs which we want to combine (optional).
- bootstrap_samples: number of bootstrap iterations. Defaults to 50 for SCellBOW.
- split: split ratio applied to the single-cell dataset. Defaults to 0.2 (an 80:20 split) for SCellBOW.
- unit: type of dataset: UMI, TPM, FPKM, etc. Defaults to UMI for SCellBOW.
- n_top_features: number of top highly variable genes common to the bulk RNA-seq and single-cell RNA-seq datasets. Defaults to 1000 for SCellBOW.
- iter: number of iterations (epochs) over the corpus. Defaults to 20 for SCellBOW.
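
To make `bootstrap_samples` and `split` more concrete, here is a generic sketch of repeated random 80:20 splitting. It is not SCellBOW's internal code: how the library actually draws and uses its splits is not described in this README, and `bootstrap_splits` and its `seed` argument are hypothetical names introduced only for illustration.

```python
import random


def bootstrap_splits(n_cells, bootstrap_samples=50, split=0.2, seed=0):
    """Yield (larger_part, smaller_part) index lists for repeated random splits.

    Illustrative only; SCellBOW's own procedure may differ.
    """
    rng = random.Random(seed)
    n_small = int(round(n_cells * split))
    for _ in range(bootstrap_samples):
        indices = list(range(n_cells))
        rng.shuffle(indices)
        # The first n_small shuffled indices form the smaller part.
        yield indices[n_small:], indices[:n_small]


splits = list(bootstrap_splits(n_cells=100))
larger, smaller = splits[0]
print(len(splits), len(larger), len(smaller))
```

With the defaults, each of the 50 iterations divides 100 cells into disjoint groups of 80 and 20.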
The following is a step-by-step guide to how SCellBOW performs clustering and phenotype algebra on a full single-cell dataset.

    import SCellBOW as sb
    import scanpy as sc
    import matplotlib.pyplot as plt

    # Load and preprocess the source dataset
    adata_source = sc.read("/path/to/directory/adata_source.h5ad")
    adata_source.var_names_make_unique()
    sc.pp.filter_cells(adata_source, min_genes=200)
    sc.pp.filter_genes(adata_source, min_cells=20)
    sc.pp.normalize_total(adata_source, target_sum=1e4)
    sc.pp.log1p(adata_source)
    sc.pp.highly_variable_genes(adata_source, n_top_genes=1000)
    adata_source = adata_source[:, adata_source.var.highly_variable]
    sc.pp.scale(adata_source, max_value=10)

    # Pre-train the source model and save it in the 'dummy' directory
    sb.SCellBOW_pretrain(adata_source, 'dummy', vec_size=300, n_worker=1, iter=20)

    # Load and preprocess the target dataset
    adata_target = sc.read("/path/to/directory/adata_target.h5ad")
    adata_target.var_names_make_unique()
    sc.pp.filter_cells(adata_target, min_genes=200)
    sc.pp.filter_genes(adata_target, min_cells=20)
    sc.pp.normalize_total(adata_target, target_sum=1e4)
    sc.pp.log1p(adata_target)
    sc.pp.highly_variable_genes(adata_target, n_top_genes=1000)
    adata_target = adata_target[:, adata_target.var.highly_variable]
    sc.pp.scale(adata_target, max_value=10)

    # Transfer the pre-trained weights and cluster the target dataset
    adata = sb.SCellBOW_cluster(adata_target, 'dummy').run()

    # Plot the UMAP coloured by cluster
    resolution = 1.0
    with plt.rc_context({'figure.figsize': (5, 5)}):
        sc.pl.umap(adata,
                   color='clusters_' + str(resolution),
                   add_outline=True,
                   legend_fontsize=14,
                   legend_fontoutline=2,
                   title='UMAP visualisation',
                   size=50,
                   palette=plt.rcParams["axes.prop_cycle"],
                   )

For phenotype algebra, load the source, target, and bulk survival datasets:

    import SCellBOW as sb
    import scanpy as sc
    import matplotlib.pyplot as plt

    adata_source = sc.read("/path/to/directory/adata_source.h5ad")
    adata_target = sc.read("/path/to/directory/adata_target.h5ad")
    adata_surv = sc.read("/path/to/directory/adata_survival.h5ad")

Note that for phenotype algebra, the bulk expression dataset has to be in the AnnData object format. Ensure that the column names in the .obs attribute follow the naming convention below:

| | time | status |
|---|---|---|
| 0_bulk | 8.442838 | 1 |
| 1_bulk | 29.829172 | 1 |
| 2_bulk | 4.533509 | 1 |
| 3_bulk | 22.996058 | 1 |
| 4_bulk | 28.318003 | 0 |
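
Before running phenotype algebra, it may help to confirm that the survival annotations carry the two column names shown in the table. The check below is a minimal sketch: `missing_survival_columns` is a hypothetical helper, not a SCellBOW function, and it only inspects column names, not the values.

```python
REQUIRED_SURVIVAL_COLUMNS = ("time", "status")


def missing_survival_columns(columns):
    """Return the required survival column names absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_SURVIVAL_COLUMNS if name not in present]


print(missing_survival_columns(["time", "status"]))       # []
print(missing_survival_columns(["os_months", "status"]))  # ['time']
```

With an AnnData object, the same check could be called as `missing_survival_columns(adata_surv.obs.columns)`.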

Use the following code snippet to generate the survival data:

    import scanpy as sc
    import pandas as pd

    # Load bulk expression matrix into a pandas DataFrame
    df = pd.read_csv('data.csv')
    # Load the corresponding survival information into a pandas DataFrame
    label = pd.read_csv('label.csv')
    # Create an AnnData object from the DataFrame; obs_names and var_names
    # are taken from the DataFrame's index and columns
    adata = sc.AnnData(df)
    # Add sample annotations to the AnnData object
    adata.obs = label
    adata.obs = adata.obs.rename(columns={'old_colname_time': 'time', 'old_colname_status': 'status'})

Run phenotype algebra on the target dataset:

    median_score, scores = sb.SCellBOW_algebra(adata_target,
                                               adata_surv,
                                               save_dir='dummy',
                                               type='subtype',
                                               algebra=[],
                                               bootstrap_samples=50,
                                               split=0.2,
                                               unit="UMI",
                                               n_top_features=1000,
                                               iter=20).run()

Plot the scores, ordered by median:

    median_score.sort_values(ascending=True, inplace=True)
    scores = scores[median_score.index]
    plt.figure(figsize=(4, 4))
    scores.boxplot(patch_artist=True, notch=True)
    plt.xticks(rotation=90, size=10)
    plt.show()

Please see our Jupyter notebook examples.

| Notebooks | Description |
|---|---|
| Clusters | Run SCellBOW cluster on pancreas dataset |
| Phenotype algebra | Run phenotype algebra on molecular subtypes of prostate cancer on a simulated prostate cancer dataset |
| Phenotype algebra on clusters | Run phenotype algebra and SCellBOW clusters on a simulated prostate cancer dataset |
Note: Create a folder with the same name as the "save_dir" before running the script.
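
One way to satisfy this requirement from Python is to create the directory just before calling SCellBOW. This is a small sketch using only the standard library; `ensure_save_dir` is a hypothetical helper name, and `'dummy'` is simply the directory name used in the examples above.

```python
from pathlib import Path


def ensure_save_dir(save_dir):
    """Create `save_dir` (and any missing parents) if it does not exist yet."""
    path = Path(save_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `ensure_save_dir('dummy')` could be called once before `SCellBOW_pretrain(adata_source, 'dummy', ...)`. Calling it again on an existing directory does nothing.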
Source code: GitHub
For further information, contact debarka@iiitd.ac.in or namrata.bhattacharya@hdr.qut.edu.au.