This folder contains the following tools and workflows. A tool performs a single task and a workflow runs multiple tools. Some scripts are implemented separately for bulk (.bk.) and single-cell (.sc.) modes, while others are common (.cm.) to both.
- scafe.workflow.sc.subsample ---> workflow, single-cell mode, subsample ctss
- scafe.workflow.sc.solo ---> workflow, single-cell mode, process a single sample
- scafe.workflow.sc.pool ---> workflow, single-cell mode, pool ctss of multiple samples
- scafe.workflow.bk.subsample ---> workflow, bulk mode, subsample ctss
- scafe.workflow.bk.solo ---> workflow, bulk mode, process a single sample
- scafe.workflow.bk.pool ---> workflow, bulk mode, pool ctss of multiple samples
- scafe.tool.sc.subsample_ctss ---> tool, single-cell mode, subsample ctss
- scafe.tool.sc.pool ---> tool, single-cell mode, pool ctss of multiple samples
- scafe.tool.sc.link ---> tool, single-cell mode, linking tCRE by coactivity
- scafe.tool.sc.count ---> tool, single-cell mode, count of UMI within tCRE
- scafe.tool.sc.bam_to_ctss ---> tool, single-cell mode, convert bam to ctss
- scafe.tool.cm.remove_strand_invader ---> tool, common mode, remove strand invader artefact
- scafe.tool.cm.prep_genome ---> tool, common mode, prepare custom reference genome
- scafe.tool.cm.filter ---> tool, common mode, filter for genuine TSS clusters
- scafe.tool.cm.ctss_to_bigwig ---> tool, common mode, convert ctss to bigwig
- scafe.tool.cm.cluster ---> tool, common mode, cluster ctss
- scafe.tool.cm.annotate ---> tool, common mode, define and annotate tCRE
- scafe.tool.bk.subsample_ctss ---> tool, bulk mode, subsample ctss
- scafe.tool.bk.pool ---> tool, bulk mode, pool ctss of multiple samples
- scafe.tool.bk.count ---> tool, bulk mode, count ctss within tCREs
- scafe.tool.bk.bam_to_ctss ---> tool, bulk mode, convert bam to ctss bed
- scafe.download.resources.genome ---> download, reference genome to resources dir
- scafe.download.demo.input ---> download, demo input data for testing
- scafe.demo.test.run ---> demo, run demo data for testing
- scafe.check.dependencies ---> check dependencies
scafe.workflow.sc.subsample
This workflow subsamples a ctss file, defines tCRE and generates a tCRE UMI/cellbarcode count matrix. Subsampling is useful for investigating the effect of sequencing depth on tCRE definition.
Usage:
scafe.workflow.sc.subsample [options] --UMI_CB_ctss_bed_path --run_cellbarcode_path --subsample_num --genome --run_tag --run_outDir
--UMI_CB_ctss_bed_path <required> [string] ctss file for subsampling, one line one cellbarcode-UMI combination,
*UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl,
4th column cellbarcode-UMI and 5th column is number of unencoded-G
--run_cellbarcode_path <required> [string] tsv file contains a list of cell barcodes,
barcodes.tsv.gz from cellranger
--subsample_num <required> [integer] number of UMI to be subsampled
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--run_tag <required> [string] prefix for the output files
--run_outDir <required> [string] directory for the output files
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.workflow.sc.subsample \
--overwrite=yes \
--UMI_CB_ctss_bed_path=./demo/input/sc.subsample/demo.UMI_CB.ctss.bed.gz \
--run_cellbarcode_path=./demo/input/sc.subsample/demo.barcodes.tsv.gz \
--subsample_num=100000 \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/sc.subsample/
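As a rough illustration of what subsampling means here (not the workflow's actual implementation), the sketch below draws a fixed number of lines at random from a gzipped ctss bed file in which each line is one cellbarcode-UMI combination. The toy records, the column layout beyond the 4th and 5th columns, the file names, and the use of GNU `shuf` are all assumptions made for illustration.

```shell
# Toy input: a small gzipped ctss bed, one line per cellbarcode-UMI combination.
# Assumed columns: chrom, start, end, cellbarcode-UMI, unencoded-G count, strand.
tmp_dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8 9 10; do
  printf 'chr1\t%d\t%d\tCB%d_UMI%d\t1\t+\n' "$((100 + i))" "$((101 + i))" "$i" "$i"
done | gzip -c > "$tmp_dir/toy.UMI_CB.ctss.bed.gz"

# Draw subsample_num lines at random (a stand-in for --subsample_num).
subsample_num=4
gzip -dc "$tmp_dir/toy.UMI_CB.ctss.bed.gz" \
  | shuf -n "$subsample_num" \
  | gzip -c > "$tmp_dir/toy.subsample.UMI_CB.ctss.bed.gz"

# Count the lines that survived subsampling.
subsampled_lines=$(gzip -dc "$tmp_dir/toy.subsample.UMI_CB.ctss.bed.gz" | wc -l | tr -d ' ')
echo "subsampled lines: $subsampled_lines"
```

Repeating the draw with several values of `subsample_num` gives a series of inputs of decreasing depth, which is the kind of comparison the workflow is meant to support.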
scafe.workflow.sc.solo
This workflow processes a single sample, from a cellranger bam file to a tCRE UMI/cellbarcode count matrix
Usage:
scafe.workflow.sc.solo [options] --run_bam_path --run_cellbarcode_path --genome --run_tag --run_outDir
--run_bam_path <required> [string] bam file from cellranger, can be read 1 only or paired-end
--run_cellbarcode_path <required> [string] tsv file contains a list of cell barcodes,
barcodes.tsv.gz from cellranger
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--run_tag <required> [string] prefix for the output files
--run_outDir <required> [string] directory for the output files
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--usr_glm_model_path (optional) [string] pre-built logistic regression model from the caret package in R. Used only if
training_signal_path is not supplied. Models were pre-built for each genome
and are used by default.
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.workflow.sc.solo \
--overwrite=yes \
--run_bam_path=./demo/input/sc.solo/demo.cellranger.bam \
--run_cellbarcode_path=./demo/input/sc.solo/demo.barcodes.tsv.gz \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/sc.solo/
scafe.workflow.sc.pool
This workflow pools multiple samples for defining tCRE, starting from ctss files to a tCRE UMI/cellbarcode count matrix
Usage:
scafe.workflow.sc.pool [options] --lib_list_path --genome --run_tag --run_outDir
--lib_list_path <required> [string] a list of libraries, in the format of
<lib_ID><\t><suffix><\t><UMI_CB_ctss_bed><\t><cellbarcode><\t><CB_ctss_bed>
lib_ID = unique ID of the library
suffix = a unique integer to be used as the suffix of the cellbarcode
UMI_CB_ctss_bed = *UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl
CB_ctss_bed = *CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--run_tag <required> [string] prefix for the output files
--run_outDir <required> [string] directory for the output files
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--usr_glm_model_path (optional) [string] pre-built logistic regression model from the caret package in R. Used only if
training_signal_path is not supplied. Models were pre-built for each genome
and are used by default.
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.workflow.sc.pool \
--overwrite=yes \
--lib_list_path=./demo/input/sc.pool/lib_list_path.txt \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/sc.pool/
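The library list is a plain tab-delimited text file with one library per line. A minimal sketch of writing and checking such a file follows. The library IDs and all file paths are hypothetical placeholders, and pointing the fourth column at each library's cellranger barcodes file is an assumption, since that column is not described in the option text.

```shell
# Write a two-library list with five tab-delimited columns:
# <lib_ID> <suffix> <UMI_CB_ctss_bed> <cellbarcode> <CB_ctss_bed>
# All IDs and paths below are hypothetical placeholders.
lib_list_path=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  libA 1 ./libA/libA.UMI_CB.ctss.bed.gz ./libA/barcodes.tsv.gz ./libA/libA.CB.ctss.bed.gz \
  libB 2 ./libB/libB.UMI_CB.ctss.bed.gz ./libB/barcodes.tsv.gz ./libB/libB.CB.ctss.bed.gz \
  > "$lib_list_path"

# Sanity checks: every line has five columns and the suffix column is unique.
bad_lines=$(awk -F '\t' 'NF != 5' "$lib_list_path" | wc -l | tr -d ' ')
dup_suffix=$(cut -f 2 "$lib_list_path" | sort | uniq -d | wc -l | tr -d ' ')
echo "malformed lines: $bad_lines, duplicated suffixes: $dup_suffix"
```

Checking the suffix column matters because the suffix is what keeps cellbarcodes from different libraries distinct after pooling.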
scafe.workflow.bk.subsample
This workflow subsamples a ctss file, defines tCRE and generates tCRE read counts. Subsampling is useful for investigating the effect of sequencing depth on tCRE definition.
Usage:
scafe.workflow.bk.subsample [options] --long_ctss_bed_path --subsample_num --genome --run_tag --run_outDir
--long_ctss_bed_path <required> [string] ctss file for subsampling, one line one read
*long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl,
--subsample_num <required> [integer] number of reads to be subsampled
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--run_tag <required> [string] prefix for the output files
--run_outDir <required> [string] directory for the output files
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.workflow.bk.subsample \
--overwrite=yes \
--long_ctss_bed_path=./demo/input/bk.subsample/demo.long.ctss.bed.gz \
--subsample_num=100000 \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/bk.subsample/
scafe.workflow.bk.solo
This workflow processes a single sample, from a bulk CAGE bam file to read count per tCRE
Usage:
scafe.workflow.bk.solo [options] --run_bam_path --genome --run_tag --run_outDir
--run_bam_path <required> [string] bam file (of CAGE reads), can be read 1 only or paired-end
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--run_tag <required> [string] prefix for the output files
--run_outDir <required> [string] directory for the output files
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.workflow.bk.solo \
--overwrite=yes \
--run_bam_path=./demo/input/bk.solo/demo.CAGE.bam \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/bk.solo/
scafe.workflow.bk.pool
This workflow pools multiple samples for defining tCRE, starting from ctss files to read count per tCRE per sample
Usage:
scafe.workflow.bk.pool [options] --lib_list_path --genome --run_tag --run_outDir
--lib_list_path <required> [string] a list of libraries, in the format of
<lib_ID><\t><long_ctss_bed><\t><collapse_ctss_bed>
lib_ID = unique ID of the library
long_ctss_bed = *long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl
collapse_ctss_bed = *collapse.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--run_tag <required> [string] prefix for the output files
--run_outDir <required> [string] directory for the output files
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.workflow.bk.pool \
--overwrite=yes \
--lib_list_path=./demo/input/bk.pool/lib_list_path.txt \
--genome=hg19.gencode_v32lift37 \
--run_tag=demo \
--run_outDir=./demo/output/bk.pool/
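When many bulk libraries are pooled, the three-column library list can be generated from a directory layout rather than typed by hand. The sketch below assumes a hypothetical layout with one directory per library, named after the library ID and holding both ctss bed files. The empty placeholder files stand in for real gzipped bed files and are not valid input for the workflow.

```shell
# Toy layout (hypothetical): one directory per library holding both ctss bed files.
ctss_root=$(mktemp -d)
for lib in libA libB; do
  mkdir -p "$ctss_root/$lib"
  : > "$ctss_root/$lib/$lib.long.ctss.bed.gz"       # empty placeholder file
  : > "$ctss_root/$lib/$lib.collapse.ctss.bed.gz"   # empty placeholder file
done

# Emit one line per library: <lib_ID> <long_ctss_bed> <collapse_ctss_bed>, tab-delimited.
lib_list_path="$ctss_root/lib_list_path.txt"
for lib_dir in "$ctss_root"/*/; do
  lib_ID=$(basename "$lib_dir")
  printf '%s\t%s\t%s\n' \
    "$lib_ID" \
    "${lib_dir}${lib_ID}.long.ctss.bed.gz" \
    "${lib_dir}${lib_ID}.collapse.ctss.bed.gz"
done > "$lib_list_path"

cat "$lib_list_path"
```

Deriving the library ID from the directory name keeps the first column unique as long as the directory names are unique.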
scafe.tool.sc.subsample_ctss
This tool subsamples a ctss bed file and maintains the cellbarcode and UMI information
Usage:
scafe.tool.sc.subsample_ctss [options] --UMI_CB_ctss_bed_path --subsample_num --outputPrefix --outDir
--UMI_CB_ctss_bed_path <required> [string] ctss file for subsampling, one line one cellbarcode-UMI combination,
*UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl,
4th column cellbarcode-UMI and 5th column is number of unencoded-G
--subsample_num <required> [integer] number of UMI to be subsampled
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.sc.subsample_ctss \
--overwrite=yes \
--UMI_CB_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.UMI_CB.ctss.bed.gz \
--subsample_num=100000 \
--outputPrefix=demo \
--outDir=./demo/output/sc.subsample/subsample_ctss/
scafe.tool.sc.pool
This tool pools multiple ctss bed files and maintains the unique (suffixed) cellbarcode and UMI information
Usage:
scafe.tool.sc.pool [options] --lib_list_path --genome --outputPrefix --outDir
--lib_list_path <required> [string] a list of libraries, in the format of
<lib_ID><\t><suffix><\t><UMI_CB_ctss_bed><\t><cellbarcode><\t><CB_ctss_bed>
lib_ID = unique ID of the library
suffix = a unique integer to be used as the suffix of the cellbarcode
UMI_CB_ctss_bed = *UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl
CB_ctss_bed = *CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.sc.pool \
--overwrite=yes \
--lib_list_path=./demo/input/sc.pool/lib_list_path.txt \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.pool/pool/
scafe.tool.sc.link
This tool links tCREs by their coactivity among single cells using cicero
Usage:
scafe.tool.sc.link [options] --CRE_bed_path --CRE_info_path --count_dir --genome --outputPrefix --outDir
--CRE_bed_path <required> [string] bed file containing the regions of CREs,
*.CRE.coord.bed.gz from scafe.tool.cm.annotate.pl
--CRE_info_path <required> [string] tsv file containing the annotations of CREs,
*.CRE.info.tsv.gz from scafe.tool.cm.annotate.pl
--count_dir <required> [string] a dir containing the UMI counts of the CREs
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--network_cutoff (optional) [0-1] minimum coactivity to define cis-coactivity network (default = 0.05)
--link_cutoff (optional) [0-1] minimum coactivity to output as link (default = 0.2)
--binarize_CRE_exp (optional) [yes/no] binarize the CRE expression signal or not (default = no)
--min_cell (optional) [integer] minimum number of cells in which a CRE must be expressed (default = 5)
--Rscript_bin (optional) [string] path to the Rscript bin, to allow users to supply an R version other than the
system-wide R version. Package caret must be installed. (default = Rscript)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--run_chr (optional) [string] comma-delimited list of chromosome names to run,
use 'all' to run all chromosomes (default=all)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
R packages: 'docopt','monocle3', 'cicero', 'Matrix', 'data.table', 'scales'
To demo run, cd to SCAFE dir and run:
scafe.tool.sc.link \
--overwrite=yes \
--max_thread=10 \
--CRE_bed_path=./demo/input/sc.link/demo.CRE.coord.bed.gz \
--CRE_info_path=./demo/input/sc.link/demo.CRE.info.tsv.gz \
--count_dir=./demo/input/sc.link/matrix/ \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.link/
scafe.tool.sc.count
This tool counts the UMI within a set of user-defined regions, e.g. tCRE, and returns a UMI/cellbarcode matrix
Usage:
scafe.tool.sc.count [options] --countRegion_bed_path --cellBarcode_list_path --ctss_bed_path --outputPrefix --outDir
--countRegion_bed_path <required> [string] bed file contains the regions for counting CTSS, e.g. tCRE ranges,
*.CRE.coord.bed.gz from scafe.tool.cm.annotate.pl
--cellBarcode_list_path <required> [string] tsv file contains a list of cell barcodes,
barcodes.tsv.gz from cellranger
--ctss_bed_path <required> [string] ctss file for counting,
*CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl,
4th column cellbarcode and 5th column is number of UMI
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.sc.count \
--overwrite=yes \
--countRegion_bed_path=./demo/output/sc.solo/annotate/demo/bed/demo.CRE.annot.bed.gz \
--cellBarcode_list_path=./demo/input/sc.solo/demo.barcodes.tsv.gz \
--ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.CB.ctss.bed.gz \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/count/
scafe.tool.sc.bam_to_ctss
This tool converts a bam file to a ctss bed file, identifies read 5'end (capped TSS, i.e. ctss), extracts the unencoded G information, piles up ctss, and deduplicates the UMI
Usage:
scafe.tool.sc.bam_to_ctss [options] --bamPath --genome --outputPrefix --outDir
--bamPath <required> [string] bam file from cellranger, can be read 1 only or paired-end
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--include_flag (optional) [string] samflag to be included, comma delimited
e.g. '64' to include read1, (default=null)
--exclude_flag (optional) [string] samflag to be excluded, comma delimited,
e.g. '128,256,4' to exclude read2, secondary alignment
and unaligned reads (default=128,256,4)
--min_MAPQ (optional) [integer] minimum MAPQ to include (default=0)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--TS_oligo_seq (optional) [string] Template switching oligo sequence for identification of
5'end (default=TTTCTTATATGGG)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
samtools
To demo run, cd to SCAFE dir and run:
scafe.tool.sc.bam_to_ctss \
--overwrite=yes \
--bamPath=./demo/input/sc.solo/demo.cellranger.bam \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/bam_to_ctss/
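The flag options use standard SAM flag bits: 64 marks read 1, 128 marks read 2, 256 marks a secondary alignment, and 4 marks an unmapped read. The sketch below uses plain shell arithmetic to show how a comma-delimited flag list can be tested against a read's flag. The helper name and the bitwise test are illustrative assumptions, not the tool's actual code.

```shell
# Return 0 (true) if any bit from the comma-delimited list is set in the read's flag.
# flag_hits_list is a hypothetical helper name used only for this illustration.
flag_hits_list() {
  local read_flag=$1 flag_list=$2 bit
  local IFS=','
  for bit in $flag_list; do
    if [ $(( read_flag & bit )) -ne 0 ]; then
      return 0
    fi
  done
  return 1
}

exclude_flag='128,256,4'   # default value of --exclude_flag

# Flag 99 = paired (1) + proper pair (2) + mate reverse (32) + read 1 (64).
if flag_hits_list 99 "$exclude_flag"; then read1_status=excluded; else read1_status=kept; fi
# Flag 147 = paired (1) + proper pair (2) + read reverse (16) + read 2 (128).
if flag_hits_list 147 "$exclude_flag"; then read2_status=excluded; else read2_status=kept; fi

echo "flag 99: $read1_status, flag 147: $read2_status"
```

With the default exclude list, a read 1 alignment such as flag 99 passes and a read 2 alignment such as flag 147 is dropped, which matches the option text describing '128,256,4' as excluding read2, secondary alignments and unaligned reads.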
scafe.tool.cm.remove_strand_invader
This tool identifies and removes strand invader artefacts from a ctss bed file, by aligning the sequence immediately upstream of a ctss to the TS oligo sequence
Usage:
scafe.tool.cm.remove_strand_invader [options] --ctss_bed_path --genome --outputPrefix --outDir
--ctss_bed_path <required> [string] "collapse" ctss file from scafe.tool.sc.bam_to_ctss.pl,
4th column is number of cells and 5th column is number of UMI
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--min_edit_distance (optional) [integer] edit distance threshold to define strand invader,
the smaller the value, the more stringent the definition of strand invader
(default=5)
--min_end_non_G_num (optional) [integer] immediate upstream non-G number threshold to define strand invader,
the smaller the value, the more stringent the definition of strand invader
(default=2)
--max_thread (optional) [integer] maximum number of parallel threads, capped at
10 to avoid memory overflow (default=5)
--TS_oligo_seq (optional) [string] Template switching oligo sequence for identification
of 5'end (default=TTTCTTATATGGG)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.cm.remove_strand_invader \
--overwrite=yes \
--ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/remove_strand_invader/
scafe.tool.cm.prep_genome
This tool prepares a reference genome assembly and its gene models for other tools in scafe.
Usage:
scafe.tool.cm.prep_genome [options] --gtf_path --fasta_path --chrom_list_path --mask_bed_path --outputPrefix --outDir
--gtf_path <required> [string] gtf of the gene models
--fasta_path <required> [string] fasta of the genome assembly
--chrom_list_path <required> [string] list of <chromosome name><\t><alternative chromosome name>
e.g. <chr1><\t><1>
chromosome name and alternative chromosome name could be the same
alternative chromosome name is necessary if the cellranger bam
file uses alternative chromosome name that is different from those
in $fasta_path
--mask_bed_path <required> [string] a bed file specifying the CRE regions. For human or mouse, consider
using ENCODE CREs. For other species, consider using merged ATAC-seq
from multiple tissues. If ATAC is not available, use the +/- 500nt of
gene model 5'end.
--outputPrefix <required> [string] prefix for the output files (should be name of the genome reference)
--outDir <required> [string] directory for the output files (should be resource dir in scafe dir)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
samtools
To demo run, cd to SCAFE dir and run:
scafe.tool.cm.prep_genome \
--overwrite=yes \
--gtf_path=./demo/input/genome/TAIR10.AtRTDv2.gtf.gz \
--fasta_path=./demo/input/genome/TAIR10.genome.fa.gz \
--chrom_list_path=./demo/input/genome/TAIR10.chrom_list.txt \
--mask_bed_path=./demo/input/genome/TAIR10.ATAC.bed.gz \
--outputPrefix=TAIR10.AtRTDv2 \
--outDir=./demo/output/genome/
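The chromosome list is a two-column, tab-delimited file, and one way to draft it is from the FASTA headers. The sketch below assumes the genome FASTA uses 'chr'-prefixed names and that the bam file uses the same names without the prefix, which may not hold for your data. The toy FASTA and file names are hypothetical; a gzipped FASTA would first need to be decompressed (e.g. with `gzip -dc`).

```shell
# Toy genome FASTA (hypothetical) with 'chr'-prefixed sequence names.
work_dir=$(mktemp -d)
printf '>chr1 toy sequence\nACGT\n>chr2\nGGCC\n>chrM\nTTAA\n' > "$work_dir/toy.genome.fa"

# For each header, print <chromosome name><TAB><alternative chromosome name>.
# The alternative name here is the same name with a leading 'chr' removed.
awk '/^>/ {
       name = substr($1, 2)   # drop the leading ">" and any header description
       alt = name
       sub(/^chr/, "", alt)
       print name "\t" alt
     }' "$work_dir/toy.genome.fa" > "$work_dir/toy.chrom_list.txt"

cat "$work_dir/toy.chrom_list.txt"
```

If the bam file uses the same chromosome names as the FASTA, both columns can simply carry the same name.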
scafe.tool.cm.filter
Usage:
scafe.tool.cm.filter [options] --ctss_bed_path --ung_ctss_bed_path --tssCluster_bed_path --genome --outputPrefix --outDir
--ctss_bed_path <required> [string] ctss file contains all ctss,
*.collapse.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl,
5th column is number of reads/UMI
--ung_ctss_bed_path <required> [string] ctss file contains only ctss with unencoded G,
*.unencoded_G.collapse.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl,
5th column is number of reads/UMI
--tssCluster_bed_path <required> [string] bed file contains all TSS clusters,
*.tssCluster.bed.gz from scafe.tool.cm.cluster.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--tssCluster_flank_size (optional) [integer] size of regions (each side) flanking a TSS cluster summit for
counting UMI/reads for expression levels calculation (default = 75)
--local_bkgd_extend_size (optional) [integer] size of regions (each side) flanking a TSS cluster summit for
defining the scope for calculating local background (default = 500)
--min_gold_num (optional) [integer] minimum number of gold standard regions for training and testing the
logistic regression model (default = 100)
--training_pct (optional) [float] top and bottom percentage of the TSS clusters, ranked by signal in
$training_signal_path, used for training of the logistic regression model
(default = 5)
--training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for training of the logistic
regression model. If null, $usr_glm_model_path must be supplied for a
pre-built logistic regression model. It overrides usr_glm_model_path
(default=null)
--testing_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic
regions (e.g. annotated CRE, in bed format) used for testing the performance
of the logistic regression model. If null, annotated TSS from $genome will be
used as binary genomic regions. (default=null)
--usr_glm_model_path (optional) [string] pre-built logistic regression model from the caret package in R. Used only if
training_signal_path is not supplied. Models were pre-built for each genome
and are used by default.
--Rscript_bin (optional) [string] path to the Rscript bin, to allow users to supply an R version other than the
system-wide R version. Package caret must be installed. (default = Rscript)
--default_cutoff (optional) [float] logistic probability cutoff for the "default" stringency (default = 0.5)
--exclude_chrom_list (optional) [string] a comma-delimited list of chromosomes to be excluded from the training and
testing of the logistic regression model (default = chrM)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bedtools
bigWigAverageOverBed
To demo run, cd to SCAFE dir and run:
scafe.tool.cm.filter \
--overwrite=yes \
--ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--ung_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.unencoded_G.collapse.ctss.bed.gz \
--tssCluster_bed_path=./demo/output/sc.solo/cluster/demo/bed/demo.tssCluster.bed.gz \
--training_signal_path=./demo/input/atac/demo.atac.bw \
--testing_signal_path=./demo/input/atac/demo.atac.bw \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/filter/
scafe.tool.cm.ctss_to_bigwig
This tool converts a ctss bed file into two bigwig files, one for each strand, for visualization purposes
Usage:
scafe.tool.cm.ctss_to_bigwig [options] --ctss_bed_path --genome --outputPrefix --outDir
--ctss_bed_path <required> [string] "collapse" ctss file from scafe.tool.sc.bam_to_ctss.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedGraphToBigWig
To demo run, cd to SCAFE dir and run:
scafe.tool.cm.ctss_to_bigwig \
--ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/ctss_to_bigwig/
scafe.tool.cm.cluster
This tool generates TSS clusters from a ctss bed file, using an external tool paraclu with user-defined cutoffs
Usage:
scafe.tool.cm.cluster [options] --cluster_ctss_bed_path --outputPrefix --outDir
--cluster_ctss_bed_path <required> [string] ctss file used for clustering,
"collapse" ctss file from scafe.tool.sc.bam_to_ctss.pl,
4th column is number of cells and 5th column is number of UMI
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--count_ctss_bed_path_list (optional) [string] comma-delimited list of ctss bed files,
used for filtering of clusters based on signal
(default=$cluster_ctss_bed_path)
--count_scope_bed_path (optional) [string] a bed file specifying the scope for counting in $count_ctss_bed_path_list,
used for filtering of clusters based on signal
(default=$cluster_ctss_bed_path)
--min_pos_count (optional) [integer] minimum counts per position, used for filtering the raw signal
in $cluster_ctss_bed_path before clustering (default = 1)
--min_cluster_cpm (optional) [float] minimum counts per million (cpm) for a cluster (default = 1e-5)
--min_summit_count (optional) [integer] minimum counts at the summit of a cluster (default = 3)
--min_cluster_count (optional) [integer] minimum counts within a cluster (default = 5)
--min_num_sample_expr_cluster (optional) [integer] minimum number of samples (or cells) detected at the
summit of a cluster (default = 3)
--min_num_sample_expr_summit (optional) [integer] minimum number of samples (or cells) detected within
a cluster (default = 5)
--merge_dist (optional) [integer] maximum distance for merging closely located clusters,
-1 to turn off merging (default = -1)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
paraclu
paraclu-cut.sh
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.cm.cluster \
--overwrite=yes \
--cluster_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/cluster/
scafe.tool.cm.annotate
This tool defines tCRE from TSS clusters and annotates them based on their overlap with gene models.
Usage:
scafe.tool.cm.annotate [options] --tssCluster_bed_path --tssCluster_info_path --genome --outputPrefix --outDir
--tssCluster_bed_path <required> [string] bed file contains the ranges of filtered TSS clusters,
*.tssCluster.*.filtered.bed.gz from scafe.tool.cm.filter.pl
--tssCluster_info_path <required> [string] tsv file contains the information of all TSS clusters,
*.tssCluster.log.tsv from scafe.tool.cm.filter.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--up_end5Rng (optional) [integer] TSS clusters will be classified as gene TSS, exonic, intronic
and intergenic. $up_end5Rng determines the range upstream of
annotated gene TSS to be used for gene TSS assignment
(default = 500)
--dn_end5Rng (optional) [integer] TSS clusters will be classified as gene TSS, exonic, intronic
and intergenic. $dn_end5Rng determines the range downstream of
annotated gene TSS to be used for gene TSS assignment
(default = 500)
--exon_slop_rng (optional) [integer] TSS clusters will be classified as gene TSS, exonic, intronic
and intergenic. $exon_slop_rng determines the range to be extended
(i.e. slopped) from exon for assignment of exonic class.
Use -1 to NOT extend (default = -1)
--merge_dist (optional) [integer] TSS clusters outside annotated gene promoters are grouped
as "dummy genes" (for operational uniformity) by merging closely
located TSS clusters. $merge_dist determines the maximum distances
between TSS clusters to be merged (default = 500)
--addon_length (optional) [integer] see $merge_dist. add-on "dummy transcripts" will be assigned to TSS clusters of
"dummy genes" (for operational uniformity). $addon_length determines
the length of these add-on "dummy transcripts" (default = 500).
--proximity_slop_rng (optional) [integer] TSS clusters assigned to annotated gene TSS are "proximal"
TSS clusters. $proximity_slop_rng determines the range to be extended
(i.e. slopped) from gene TSS for assignment of proximal TSS clusters.
(default = 500)
--merge_strandness (optional) [string] see $merge_dist. $merge_strandness decides whether the merge is
strand-aware ("stranded") or strand-agnostic ("strandless").
(default = strandless)
--proximal_strandness (optional) [string] closely located proximal TSS clusters are merged into
tCREs. $proximal_strandness decides whether the merge is
strand-aware ("stranded") or strand-agnostic ("strandless").
(default = stranded)
--CRE_extend_size (optional) [integer] tCREs are defined by merging the extended ranges of TSS clusters.
$CRE_extend_size determines the size of this range (both sides of
summit) (default = 500)
--CRE_extend_upstrm_ratio (optional) [float] see $CRE_extend_size. $CRE_extend_upstrm_ratio determines the ratio
(X:1) of flanking sizes on the upstream and downstream of summit.
e.g. with $CRE_extend_upstrm_ratio=4, upstream and downstream sizes will be
taken at a 4:1 ratio; with $CRE_extend_size=500 and $CRE_extend_upstrm_ratio=4,
upstream and downstream will be 400 and 100 respectively
(default = 4)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.cm.annotate \
--overwrite=yes \
--tssCluster_bed_path=./demo/output/sc.solo/filter/demo/bed/demo.tssCluster.default.filtered.bed.gz \
--tssCluster_info_path=./demo/output/sc.solo/filter/demo/log/demo.tssCluster.log.tsv \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/sc.solo/annotate/
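The arithmetic behind $CRE_extend_size and $CRE_extend_upstrm_ratio can be sketched as follows. This is an illustration only: the function names, the rounding, the clamping at coordinate 0 and the handling of the minus strand (upstream taken as higher coordinates) are assumptions, not behaviour confirmed for scafe.tool.cm.annotate.

```python
def cre_flanks(extend_size: int = 500, upstrm_ratio: float = 4.0):
    """Split extend_size into (upstream, downstream) at a ratio of upstrm_ratio:1."""
    upstream = round(extend_size * upstrm_ratio / (upstrm_ratio + 1))
    downstream = extend_size - upstream
    return upstream, downstream


def cre_range(summit: int, strand: str,
              extend_size: int = 500, upstrm_ratio: float = 4.0):
    """Extended range around a summit; upstream depends on the strand."""
    upstream, downstream = cre_flanks(extend_size, upstrm_ratio)
    if strand == "+":
        return max(0, summit - upstream), summit + downstream
    if strand == "-":
        # Assumption: on the minus strand, upstream lies at higher coordinates.
        return max(0, summit - downstream), summit + upstream
    raise ValueError("strand must be '+' or '-'")
```

With the defaults, `cre_flanks()` gives 400 upstream and 100 downstream, matching the example in the option description.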
scafe.tool.bk.subsample_ctss [top]
This tool subsamples a ctss bed file from bulk CAGE ctss
Usage:
scafe.tool.bk.subsample_ctss [options] --long_ctss_bed_path --subsample_num --outputPrefix --outDir
--long_ctss_bed_path <required> [string] ctss file for subsampling, one line per read in "long" format,
*long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl
--subsample_num <required> [integer] number of UMI to be subsampled
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.bk.subsample_ctss \
--overwrite=yes \
--long_ctss_bed_path=./demo/output/bk.solo/bam_to_ctss/demo/bed/demo.long.ctss.bed.gz \
--subsample_num=100000 \
--outputPrefix=demo \
--outDir=./demo/output/bk.subsample/subsample_ctss/
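Since the "long" format holds one line per read, subsampling to $subsample_num can be pictured as drawing that many lines at random. The sketch below assumes uniform sampling without replacement and a fixed seed for reproducibility; how the tool itself draws the sample is not stated here, and `subsample_records` is a hypothetical name.

```python
import random


def subsample_records(records, subsample_num, seed=0):
    """Draw subsample_num records uniformly without replacement, keeping file order."""
    records = list(records)
    if subsample_num >= len(records):
        # Nothing to subsample: fewer records than requested.
        return records
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), subsample_num))
    return [records[i] for i in keep]
```

Sorting the drawn indices keeps the output in the same order as the input, which matters if downstream steps expect a position-sorted bed file.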
scafe.tool.bk.pool [top]
This tool pools multiple bulk CAGE ctss bed files
Usage:
scafe.tool.bk.pool [options] --lib_list_path --genome --outputPrefix --outDir
--lib_list_path <required> [string] a list of libraries, in the format of
<lib_ID><\t><long_ctss_bed><\t><collapse_ctss_bed>
lib_ID = Unique ID of the library
long_ctss_bed = *long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl
collapse_ctss_bed = *collapse.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.bk.pool \
--overwrite=yes \
--lib_list_path=./demo/input/bk.pool/lib_list_path.txt \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/bk.pool/pool/
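The file passed to --lib_list_path is a three-column, tab-delimited list. A minimal sketch for writing one is shown below; `write_lib_list` is a hypothetical helper, and the duplicate-ID check simply reflects the requirement that lib_ID be unique.

```python
def write_lib_list(path, libs):
    """Write <lib_ID><tab><long_ctss_bed><tab><collapse_ctss_bed>, one library per line.

    libs is an iterable of (lib_ID, long_ctss_bed, collapse_ctss_bed) tuples.
    """
    seen = set()
    with open(path, "w") as handle:
        for lib_id, long_bed, collapse_bed in libs:
            if lib_id in seen:
                raise ValueError("duplicate lib_ID: %s" % lib_id)
            seen.add(lib_id)
            handle.write("\t".join([lib_id, long_bed, collapse_bed]) + "\n")
```

The resulting file can then be supplied as --lib_list_path to scafe.tool.bk.pool.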
scafe.tool.bk.count [top]
This tool counts the CAGE reads within a set of user-defined regions, e.g. tCRE, and returns the reads per region
Usage:
scafe.tool.bk.count [options] --countRegion_bed_path --ctss_bed_path --outputPrefix --outDir
--countRegion_bed_path <required> [string] bed file contains the regions for counting CTSS, e.g. tCRE ranges,
*.CRE.coord.bed.gz from scafe.tool.cm.annotate.pl
--ctss_bed_path <required> [string] ctss file for counting,
*collapse.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl,
4th column cellbarcode and 5th column is number UMI
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
To demo run, cd to SCAFE dir and run:
scafe.tool.bk.count \
--overwrite=yes \
--countRegion_bed_path=./demo/output/bk.solo/annotate/demo/bed/demo.CRE.annot.bed.gz \
--ctss_bed_path=./demo/output/bk.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
--outputPrefix=demo \
--outDir=./demo/output/bk.solo/count/
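Conceptually, counting sums the ctss signal that falls inside each region. The sketch below shows that idea on in-memory tuples; it is not how the tool works internally (the tool depends on bedtools), and the tuple layouts, the half-open BED-style coordinates and the strand-matching rule are assumptions made for illustration.

```python
def count_ctss_in_regions(regions, ctss):
    """Sum CTSS counts falling inside each region.

    regions: list of (chrom, start, end, strand, name), half-open [start, end).
    ctss:    list of (chrom, pos, strand, count), pos = 0-based start of a 1-bp CTSS.
    """
    totals = {name: 0 for _, _, _, _, name in regions}
    for chrom, pos, strand, count in ctss:
        for r_chrom, start, end, r_strand, name in regions:
            # Assumption: a CTSS is counted only on the matching strand.
            if chrom == r_chrom and strand == r_strand and start <= pos < end:
                totals[name] += count
    return totals
```

The nested loop is quadratic and only suitable for small examples; an interval-based tool such as bedtools is the practical choice on real data.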
scafe.tool.bk.bam_to_ctss [top]
This tool converts a bulk CAGE bam file to a ctss bed file: it identifies the read 5' end (capped TSS, i.e. ctss), extracts the unencoded G information, piles up ctss, and deduplicates the UMI
Usage:
scafe.tool.bk.bam_to_ctss [options] --bamPath --genome --outputPrefix --outDir
--bamPath <required> [string] bam file (of CAGE reads), can be read 1 only or paired-end
--genome <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
--outputPrefix <required> [string] prefix for the output files
--outDir <required> [string] directory for the output files
--include_flag (optional) [string] samflag to be included, comma delimited
e.g. '64' to include read1, (default=null)
--exclude_flag (optional) [string] samflag to be excluded, comma delimited,
e.g. '128,256,4' to exclude read2, secondary alignment
and unaligned reads (default=128,256,4)
--min_MAPQ (optional) [integer] minimum MAPQ to include (default=0)
--max_thread (optional) [integer] maximum number of parallel threads, capped at 10 to
avoid memory overflow (default=5)
--overwrite (optional) [yes/no] erase outDir/outputPrefix before running (default=no)
Dependencies:
bedtools
samtools
To demo run, cd to SCAFE dir and run:
scafe.tool.bk.bam_to_ctss \
--overwrite=yes \
--bamPath=./demo/input/bk.solo/demo.CAGE.bam \
--genome=hg19.gencode_v32lift37 \
--outputPrefix=demo \
--outDir=./demo/output/bk.solo/bam_to_ctss/
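The effect of --include_flag and --exclude_flag can be illustrated with plain bit tests on the SAM FLAG field. This sketch assumes that every listed include flag must be set and that any listed exclude flag rejects the read; `parse_flags` and `keep_read` are hypothetical names, and the tool itself delegates filtering to samtools.

```python
def parse_flags(text):
    """Turn a comma delimited string such as '128,256,4' into a list of integers."""
    if not text:
        return []
    return [int(item) for item in text.split(",") if item.strip()]


def keep_read(flag, include_flag="", exclude_flag="128,256,4"):
    """Return True if the read's SAM flag passes the include/exclude filters."""
    include = parse_flags(include_flag)
    exclude = parse_flags(exclude_flag)
    has_all_included = all((flag & bit) == bit for bit in include)
    has_no_excluded = all((flag & bit) == 0 for bit in exclude)
    return has_all_included and has_no_excluded
```

For instance, a properly paired read 1 (flag 99) passes the default exclude list, whereas its mate (flag 147, which carries the read 2 bit 128) does not.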
scafe.download.resources.genome [top]
This script downloads reference genome data and saves it in ./resources/genome.
Usage:
download.resources.genome --genome
--genome <required> [string] name of genome reference, currently available genomes:
hg19.gencode_v32lift37
hg38.gencode_v32
mm10.gencode_vM25
TAIR10.AtRTDv2
Dependencies:
wget
tar
To demo run, cd to SCAFE dir and run:
scafe.download.resources.genome \
--genome=hg19.gencode_v32lift37
scafe.download.demo.input [top]
This script downloads demo data and saves it in the ./demo/input dir.
Usage:
download.demo.input
Dependencies:
wget
tar
To demo run, cd to SCAFE dir and run:
scafe.download.demo.input
scafe.demo.test.run [top]
This script test-runs the demo data in the ./demo/input dir. It runs user-selected workflows. Demo input data must be downloaded using ./scripts/download.demo.input. Genome reference hg19.gencode_v32lift37 must be downloaded using ./scripts/download.resources.genome.
Usage:
demo.test.run [options] --run_outDir
--run_outDir <required> [string] directory for the output test runs
--workflow (optional) [string] comma delimited list of workflows,
or use 'all' to run all workflows.
Available workflows include:
scafe.workflow.sc.subsample ---> workflow, single-cell mode, subsample ctss
scafe.workflow.sc.solo ---> workflow, single-cell mode, process a single sample
scafe.workflow.sc.pool ---> workflow, single-cell mode, pool ctss of multiple samples
scafe.workflow.bk.subsample ---> workflow, bulk mode, subsample ctss
scafe.workflow.bk.solo ---> workflow, bulk mode, process a single sample
scafe.workflow.bk.pool ---> workflow, bulk mode, pool ctss of multiple samples
(default=all)
--overwrite (optional) [yes/no] erase run_outDir before running (default=no)
Dependencies:
R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
bigWigAverageOverBed
bedGraphToBigWig
bedtools
samtools
paraclu
paraclu-cut.sh
To demo run, cd to SCAFE dir and run:
scafe.demo.test.run \
--overwrite=yes \
--run_outDir=./demo/output/
scafe.check.dependencies [top]
This script checks the integrity of tools and workflow scripts, 3rd-party executable dependencies and R packages.
Usage:
check.dependencies
Dependencies:
wget
tar
Rscript
To demo run, cd to SCAFE dir and run:
scafe.check.dependencies
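Before running the checker, the presence of the listed executables on PATH can be verified independently. The sketch below covers only that part; it does not check script integrity or R packages as scafe.check.dependencies does, and `missing_executables` is a hypothetical helper.

```python
import shutil


def missing_executables(names):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Executables named in the Dependencies lists of this document.
REQUIRED = ["wget", "tar", "Rscript", "bedtools", "samtools",
            "paraclu", "paraclu-cut.sh", "bigWigAverageOverBed", "bedGraphToBigWig"]
```

Calling `missing_executables(REQUIRED)` returns an empty list when everything is installed, or the names that still need to be installed otherwise.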