Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

SCAFE Tools and Workflows

This folder contains the following tools and workflows. A tool perform a single task and a workflow runs multiple tools. Some scripts are seperately implemented as bulk (.bk.) and single-cell (.sc.) mode, while others are common (.cm.) for both.

scafe.workflow.sc.subsample [top]

This workflow subsamples a ctss file, defines tCRE and generate a tCRE UMI/cellbarcode count matrix Subsampling is useful to investigate the effect of sequencing depth to tCRE definition

 Usage:
   scafe.workflow.sc.subsample [options] --UMI_CB_ctss_bed_path --run_cellbarcode_path --subsample_num --genome --run_tag --run_outDir
   
   --UMI_CB_ctss_bed_path <required> [string]  ctss file for subsampling, one line one cellbarcode-UMI combination,
                                               *UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                               4th column cellbarcode-UMI and 5th column is number of unencoded-G
   --run_cellbarcode_path <required> [string]  tsv file contains a list of cell barcodes,
                                               barcodes.tsv.gz from cellranger
   --subsample_num        <required> [integer] number of UMI to be subsampled
   --genome               <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --run_tag              <required> [string]  prefix for the output files
   --run_outDir           <required> [string]  directory for the output files
   --training_signal_path (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for training of logical 
                                               regression model If null, $usr_glm_model_path must be supplied for 
                                               pre-built logical regression model. It overrides usr_glm_model_path 
                                               (default=null)
   --testing_signal_path  (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                               of the logical regression model. If null, annotated TSS from $genome will be 
                                               used as binary genomic regions. (default=null)
   --max_thread           (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                               avoid memory overflow (default=5)
   --overwrite            (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.workflow.sc.subsample \
   --overwrite=yes \
   --UMI_CB_ctss_bed_path=./demo/input/sc.subsample/demo.UMI_CB.ctss.bed.gz \
   --run_cellbarcode_path=./demo/input/sc.subsample/demo.barcodes.tsv.gz \
   --subsample_num=100000 \
   --genome=hg19.gencode_v32lift37 \
   --run_tag=demo \
   --run_outDir=./demo/output/sc.subsample/

scafe.workflow.sc.solo [top]

This workflow process a single sample, from a cellranger bam file to tCRE UMI/cellbarcode count matrix

 Usage:
   scafe.workflow.sc.solo [options] --run_bam_path --run_cellbarcode_path --genome --run_tag --run_outDir
   
   --run_bam_path         <required> [string]  bam file from cellranger, can be read 1 only or pair-end
   --run_cellbarcode_path <required> [string]  tsv file contains a list of cell barcodes,
                                               barcodes.tsv.gz from cellranger
   --genome               <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --run_tag              <required> [string]  prefix for the output files
   --run_outDir           <required> [string]  directory for the output files
   --training_signal_path (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for training of logical 
                                               regression model If null, $usr_glm_model_path must be supplied for 
                                               pre-built logical regression model. It overrides usr_glm_model_path 
                                               (default=null)
   --testing_signal_path  (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                               of the logical regression model. If null, annotated TSS from $genome will be 
                                               used as binary genomic regions. (default=null)
   --usr_glm_model_path   (optional) [string]  pre-built logical regression model from the Caret package in R. Used only if 
                                               training_signal_path is not supplied. Models were pre-built for each genome
                                               and used as default.
   --max_thread           (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                               avoid memory overflow (default=5)
   --overwrite            (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.workflow.sc.solo \
   --overwrite=yes \
   --run_bam_path=./demo/input/sc.solo/demo.cellranger.bam \
   --run_cellbarcode_path=./demo/input/sc.solo/demo.barcodes.tsv.gz \
   --genome=hg19.gencode_v32lift37 \
   --run_tag=demo \
   --run_outDir=./demo/output/sc.solo/

scafe.workflow.sc.pool [top]

This workflow pool multiple samples for defining tCRE, starting from ctss files to tCRE UMI/cellbarcode count matrix

 Usage:
   scafe.workflow.sc.pool [options] --lib_list_path --genome --run_tag --run_outDir
   
   --lib_list_path        <required> [string] a list of libraries, in formation of 
                                              <lib_ID><\t><suffix><\t><UMI_CB_ctss_bed><\t><cellbarcode><\t><CB_ctss_bed>
                                              lib_ID = Unique ID of the cellbarcode
                                              suffix = an unique integer to be used as for suffix of cellbarcode
                                              UMI_CB_ctss_bed = *UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                              CB_ctss_bed = *CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
   --genome               <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
   --run_tag              <required> [string] prefix for the output files
   --run_outDir           <required> [string] directory for the output files
   --training_signal_path (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                              regions (e.g. annotated CRE, in bed format) used for training of logical 
                                              regression model If null, $usr_glm_model_path must be supplied for 
                                              pre-built logical regression model. It overrides usr_glm_model_path 
                                              (default=null)
   --testing_signal_path (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                              regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                              of the logical regression model. If null, annotated TSS from $genome will be 
                                              used as binary genomic regions. (default=null)
   --usr_glm_model_path  (optional) [string]  pre-built logical regression model from the Caret package in R. Used only if 
                                              training_signal_path is not supplied. Models were pre-built for each genome
                                              and used as default.
   --max_thread          (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                              avoid memory overflow (default=5)
   --overwrite           (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.workflow.sc.pool \
   --overwrite=yes \
   --lib_list_path=./demo/input/sc.pool/lib_list_path.txt \
   --genome=hg19.gencode_v32lift37 \
   --run_tag=demo \
   --run_outDir=./demo/output/sc.pool/

scafe.workflow.bk.subsample [top]

This workflow subsamples a ctss file, defines tCRE and generate tCRE read count Subsampling is useful to investigate the effect of sequencing depth to tCRE definition

 Usage:
   scafe.workflow.bk.subsample [options] --long_ctss_bed_path --subsample_num --genome --run_tag --run_outDir
   
   --long_ctss_bed_path    <required> [string] ctss file for subsampling, one line one read
                                               *long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl, 
   --subsample_num        <required> [integer] number of UMI to be subsampled
   --genome               <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --run_tag              <required> [string]  prefix for the output files
   --run_outDir           <required> [string]  directory for the output files
   --training_signal_path (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for training of logical 
                                               regression model If null, $usr_glm_model_path must be supplied for 
                                               pre-built logical regression model. It overrides usr_glm_model_path 
                                               (default=null)
   --testing_signal_path  (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                               of the logical regression model. If null, annotated TSS from $genome will be 
                                               used as binary genomic regions. (default=null)
   --max_thread           (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                               avoid memory overflow (default=5)
   --overwrite            (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.workflow.bk.subsample \
   --overwrite=yes \
   --long_ctss_bed_path=./demo/input/bk.subsample/demo.long.ctss.bed.gz \
   --subsample_num=100000 \
   --genome=hg19.gencode_v32lift37 \
   --run_tag=demo \
   --run_outDir=./demo/output/bk.subsample/

scafe.workflow.bk.solo [top]

This workflow process a single sample, from a bulk CAGE bam file to read count per tCRE

 Usage:
   scafe.workflow.bk.solo [options] --run_bam_path --genome --run_tag --run_outDir
   
   --run_bam_path         <required> [string]  bam file (of CAGE reads), can be read 1 only or pair-end
   --genome               <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --run_tag              <required> [string]  prefix for the output files
   --run_outDir           <required> [string]  directory for the output files
   --training_signal_path (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for training of logical 
                                               regression model If null, $usr_glm_model_path must be supplied for 
                                               pre-built logical regression model. It overrides usr_glm_model_path 
                                               (default=null)
   --testing_signal_path  (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                               of the logical regression model. If null, annotated TSS from $genome will be 
                                               used as binary genomic regions. (default=null)
   --max_thread           (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                               avoid memory overflow (default=5)
   --overwrite            (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.workflow.bk.solo \
   --overwrite=yes \
   --run_bam_path=./demo/input/bk.solo/demo.CAGE.bam \
   --genome=hg19.gencode_v32lift37 \
   --run_tag=demo \
   --run_outDir=./demo/output/bk.solo/

scafe.workflow.bk.pool [top]

This workflow pool multiple samples for defining tCRE, starting from ctss files to read count per tCRE per sample

 Usage:
   scafe.workflow.bk.pool [options] --lib_list_path --genome --run_tag --run_outDir
   
   --lib_list_path         <required> [string] a list of libraries, in formation of 
                                               <lib_ID><\t><long_ctss_bed><\t><collapse_ctss_bed>
                                               lib_ID = Unique ID of the cellbarcode
                                               long_ctss_bed = *long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl, 
                                               collapse_ctss_bed = *collapse.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl, 
   --genome                <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
   --run_tag               <required> [string] prefix for the output files
   --run_outDir            <required> [string] directory for the output files
   --training_signal_path  (optional) [string] quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for training of logical 
                                               regression model If null, $usr_glm_model_path must be supplied for 
                                               pre-built logical regression model. It overrides usr_glm_model_path 
                                               (default=null)
   --testing_signal_path  (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                               regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                               of the logical regression model. If null, annotated TSS from $genome will be 
                                               used as binary genomic regions. (default=null)
   --max_thread           (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                               avoid memory overflow (default=5)
   --overwrite            (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.workflow.bk.pool \
   --overwrite=yes \
   --lib_list_path=./demo/input/bk.pool/lib_list_path.txt \
   --genome=hg19.gencode_v32lift37 \
   --run_tag=demo \
   --run_outDir=./demo/output/bk.pool/

scafe.tool.sc.subsample_ctss [top]

This tool subsample a ctss bed file and maintains the cellbarcode and UMI information

 Usage:
   scafe.tool.sc.subsample_ctss [options] --UMI_CB_ctss_bed_path --subsample_num --outputPrefix --outDir
   
   --UMI_CB_ctss_bed_path <required> [string]  ctss file for subsampling, one line one cellbarcode-UMI combination,
                                               *UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                               4th column cellbarcode-UMI and 5th column is number of unencoded-G
   --subsample_num        <required> [integer] number of UMI to be subsampled
   --outputPrefix         <required> [string]  prefix for the output files
   --outDir               <required> [string]  directory for the output files
   --overwrite            (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.sc.subsample_ctss \
   --overwrite=yes \
   --UMI_CB_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.UMI_CB.ctss.bed.gz \
   --subsample_num=100000 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.subsample/subsample_ctss/

scafe.tool.sc.pool [top]

This tool pool multiple ctss bed file and maintains the unique (suffixed) cellbarcode and UMI information

 Usage:
   scafe.tool.sc.pool [options] --lib_list_path --genome --outputPrefix --outDir
   
   --lib_list_path <required> [string]  a list of libraries, in formation of 
                                        <lib_ID><\t><suffix><\t><UMI_CB_ctss_bed><\t><cellbarcode><\t><CB_ctss_bed>
                                        lib_ID = Unique ID of the cellbarcode
                                        suffix = an unique integer to be used as for suffix of cellbarcode
                                        UMI_CB_ctss_bed = *UMI_CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                        CB_ctss_bed = *CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
   --genome        <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix  <required> [string]  prefix for the output files
   --outDir        <required> [string]  directory for the output files
   --max_thread    (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                        avoid memory overflow (default=5)
   --overwrite     (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.sc.pool \
   --overwrite=yes \
   --lib_list_path=./demo/input/sc.pool/lib_list_path.txt \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.pool/pool/

scafe.tool.sc.link [top]

This tool links tCREs by their coactivity among single cells using cicero

 Usage:
   scafe.tool.sc.link [options] --count --run_chr --genome --CRE_bed_path --CRE_info_path --outputPrefix --outDir
   
   --CRE_bed_path     <required> [string]  bed file contains the regions of CRE,
                                      *.CRE.coord.bed.gz from scafe.tool.cm.annotate.pl
   --CRE_info_path    <required> [string]  tsv file contains the annoations of CREs, 
                                      *..CRE.info.tsv.gz from scafe.tool.cm.annotate.pl
   --count_dir        <required> [string]  a dir contains the UMI count of the CRE
   --genome           <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix     <required> [string]  prefix for the output files
   --outDir           <required> [string]  directory for the output files
   --network_cutoff   (optional) [0-1]     minimum coactivity to define cis-coactivity network (default = 0.05)
   --link_cutoff   (optional) [integer] minimum coactivity to output as link(default = 0.2)
   --binarize_CRE_exp (optional) [yes/no]  binarize_CRE_exp CRE expression signal or not (default = no)
   --min_cell         (optional) [integer] minimum number of cells the CRE to be expressed (default = 5)
   --Rscript_bin      (optional) [string]  path to the Rscript bin, aim to allow users to supply an R version other the 
                                           system wide R version. Package Caret must be installed. (default = Rscript)
   --max_thread       (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                           avoid memory overflow (default=5)
   --run_chr          (optional) [string]  comma delimited list of chromosome name to run,
                                           use 'all' to run all chromosome (default=all)
   --overwrite        (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   R packages: 'docopt','monocle3', 'cicero', 'Matrix', 'data.table', 'scales'

 To demo run, cd to SCAFE dir and run:
   scafe.tool.sc.link \
   --overwrite=yes \
   --max_thread=10 \
   --CRE_bed_path=./demo/input/sc.link/demo.CRE.coord.bed.gz \
   --CRE_info_path=./demo/input/sc.link/demo.CRE.info.tsv.gz \
   --count_dir=./demo/input/sc.link/matrix/ \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.link/

scafe.tool.sc.count [top]

This tool counts the UMI within a set of user-defined regions, e.g. tCRE, and returns a UMI/cellbarcode matrix

 Usage:
   scafe.tool.sc.count [options] --countRegion_bed_path --cellBarcode_list_path --ctss_bed_path --outputPrefix --outDir
   
   --countRegion_bed_path   <required> [string] bed file contains the regions for counting CTSS, e.g. tCRE ranges, 
                                                *.CRE.coord.bed.gz from scafe.tool.cm.annotate.pl
   --cellBarcode_list_path  <required> [string] tsv file contains a list of cell barcodes,
                                                barcodes.tsv.gz from cellranger
   --ctss_bed_path          <required> [string] ctss file for counting,
                                                *CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                                4th column cellbarcode and 5th column is number UMI
   --outputPrefix           <required> [string] prefix for the output files
   --outDir                 <required> [string] directory for the output files
   --overwrite              (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.sc.count \
   --overwrite=yes \
   --countRegion_bed_path=./demo/output/sc.solo/annotate/demo/bed/demo.CRE.annot.bed.gz \
   --cellBarcode_list_path=./demo/input/sc.solo/demo.barcodes.tsv.gz \
   --ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.CB.ctss.bed.gz \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/count/

scafe.tool.sc.bam_to_ctss [top]

This tool converts a bam file to a ctss bed file, identifies read 5'end (capped TSS, i.e. ctss), extracts the unencoded G information, pileup ctss, and deduplicate the UMI

 Usage:
   scafe.tool.sc.bam_to_ctss [options] --bamPath --genome --outputPrefix --outDir
   
   --bamPath      <required> [string]  bam file from cellranger, can be read 1 only or pair-end
   --genome       <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix <required> [string]  prefix for the output files
   --outDir       <required> [string]  directory for the output files
   --include_flag (optional) [string]  samflag to be included, comma delimited 
                                       e.g. '64' to include read1, (default=null)
   --exclude_flag (optional) [string]  samflag to be excluded, comma delimited, 
                                       e.g. '128,256,4' to exclude read2, secondary alignment 
                                       and unaligned reads (default=128,256,4)
   --min_MAPQ     (optional) [integer] minimum MAPQ to include (default=0)
   --max_thread   (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                       avoid memory overflow (default=5)
   --TS_oligo_seq (optional) [string]  Template switching oligo sequence for identification of 
                                       5'end (default=TTTCTTATATGGG) 
   --overwrite    (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools
   samtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.sc.bam_to_ctss \
   --overwrite=yes \
   --bamPath=./demo/input/sc.solo/demo.cellranger.bam \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/bam_to_ctss/

scafe.tool.cm.remove_strand_invader [top]

This tool identify and remove strand invader artefact from a ctss bed file, by aligning the sequence immediate upstream of a ctss to TS oligo sequence

 Usage:
   scafe.tool.cm.remove_strand_invader [options] --ctss_bed_path --genome --outputPrefix --outDir
   
   --ctss_bed_path      <required> [string]  "collapse" ctss file from scafe.tool.sc.bam_to_ctss.pl, 
                                             4th column is number of cells and 5th column is number UMI
   --genome             <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix       <required> [string]  prefix for the output files
   --outDir             <required> [string]  directory for the output files
   --min_edit_distance  (optional) [integer] edit distance threshold to define strand invader 
                                             the smaller value, the more stringent defintion of strand invader
                                             (default=5)
   --min_end_non_G_num  (optional) [integer] immediate upstream non-G number threshold to define strand invader
                                             the smaller value, the more stringent defintion of strand invader
                                             (default=2)
   --max_thread         (optional) [integer] maximum number of parallel threads, capped at 
                                             10 to avoid memory overflow (default=5)
   --TS_oligo_seq       (optional) [string]  Template switching oligo sequence for identification 
                                             of 5'end (default=TTTCTTATATGGG) 
   --overwrite          (optional) [yes/no]  [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.cm.remove_strand_invader \
   --overwrite=yes \
   --ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/remove_strand_invader/

scafe.tool.cm.prep_genome [top]

This tool prepares a reference genome assembly and its gene models for others tools in scafe.

 Usage:
   scafe.tool.cm.prep_genome [options] --gtf_path --fasta_path --chrom_list_path --mask_bed_path --outputPrefix --outDir
   
   --gtf_path         <required> [string] gtf of the gene models
   --fasta_path       <required> [string] fasta of the genome assembly
   --chrom_list_path  <required> [string] list of <chromosome name><\t><alternative chromosome name> 
                                          e.g. <chr1><\t><1>
                                          chromosome name and alternative chromosome name could be the same
                                          alternative chromosome name is necessary if the cellranger bam
                                          file uses alternative chromosome name that is different from those
                                          in $fasta_path
   --mask_bed_path    <required> [string] a bed file specific the CRE regions. For human or mouse, consider 
                                          using ENCODE CREs. for other species, consider using merged ATAC-seq
                                          from multiple tissues. If ATAC is not available, use the +/- 500nt of 
                                          gene model 5'end.
   --outputPrefix     <required> [string] prefix for the output files (should be name of the genome reference)
   --outDir           <required> [string] directory for the output files (should be resource dir in scafe dir)
   --overwrite        (optional) [yes/no] erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools
   samtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.cm.prep_genome \
   --overwrite=yes \
   --gtf_path=./demo/input/genome/TAIR10.AtRTDv2.gtf.gz \
   --fasta_path=./demo/input/genome/TAIR10.genome.fa.gz \
   --chrom_list_path=./demo/input/genome/TAIR10.chrom_list.txt \
   --mask_bed_path=./demo/input/genome/TAIR10.ATAC.bed.gz \
   --outputPrefix=TAIR10.AtRTDv2 \
   --outDir=./demo/output/genome/

scafe.tool.cm.filter [top]

 Usage:
   scafe.tool.cm.filter [options] --ctss_bed_path --ung_ctss_bed_path --tssCluster_bed_path --genome --outputPrefix --outDir
   
   --ctss_bed_path           <required> [string]  ctss file contains all ctss,
                                                  *.collapse.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                                  5th column is number reads/UMI
   --ung_ctss_bed_path       <required> [string]  ctss file contains only ctss with unencoded G,
                                                  *.unencoded_G.collapse.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                                  5th column is number reads/UMI
   --tssCluster_bed_path     <required> [string]  bed file contains all TSS clusters,
                                                  *.tssCluster.bed.gz from scafe.tool.cm.cluster.pl
   --genome                  <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix            <required> [string]  prefix for the output files
   --outDir                  <required> [string]  directory for the output files
   --tssCluster_flank_size   (optional) [integer] size of regions (each side) flanking a TSS cluster summit for
                                                  counting UMI/reads for expression levels calculation (default = 75)
   --local_bkgd_extend_size  (optional) [integer] size of regions (each side) flanking a TSS cluster summit for 
                                                  defining the scope for calculating local background (default = 500)
   --min_gold_num            (optional) [integer] minimum number of gold standard regions for training and testing the
                                                  logical regression model (default = 100)
   --training_pct            (optional) [float]   top and bottom percentage of the TSS clusters, ranked by signal in 
                                                  $training_signal_path, used for training of logical regression model
                                                  (default = 5)
   --training_signal_path    (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                                  regions (e.g. annotated CRE, in bed format) used for training of logical 
                                                  regression model If null, $usr_glm_model_path must be supplied for 
                                                  pre-built logical regression model. It overrides usr_glm_model_path 
                                                  (default=null)
   --testing_signal_path     (optional) [string]  quantitative signal (e.g. ATAC -logP, in bigwig format), or binary genomic 
                                                  regions (e.g. annotated CRE, in bed format) used for testing the performance 
                                                  of the logical regression model. If null, annotated TSS from $genome will be 
                                                  used as binary genomic regions. (default=null)
   --usr_glm_model_path      (optional) [string]  pre-built logical regression model from the Caret package in R. Used only if 
                                                  training_signal_path is not supplied. Models were pre-built for each genome
                                                  and used as default.
   --Rscript_bin             (optional) [string]  path to the Rscript bin, aim to allow users to supply an R version other the 
                                                  system wide R version. Package Caret must be installed. (Defaul = Rscript)
   --default_cutoff          (optional) [integer] logistic probablity cutoffs for the "default" stringency (Default = 0.5)
   --exclude_chrom_list      (optional) [string]  a list of comma delimited chromosome to be excluded in the training and 
                                                  testing of the logical regression model (Default = chrM)
   --overwrite               (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bedtools
   bigWigAverageOverBed

 To demo run, cd to SCAFE dir and run:
   scafe.tool.cm.filter \
   --overwrite=yes \
   --ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
   --ung_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.unencoded_G.collapse.ctss.bed.gz \
   --tssCluster_bed_path=./demo/output/sc.solo/cluster/demo/bed/demo.tssCluster.bed.gz \
   --training_signal_path=./demo/input/atac/demo.atac.bw \
   --testing_signal_path=./demo/input/atac/demo.atac.bw \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/filter/

scafe.tool.cm.ctss_to_bigwig [top]

This tool converts a ctss bed file into two bigwig file, one for each strand, for visualization purpose

 Usage:
   scafe.tool.cm.ctss_to_bigwig [options] --ctss_bed_path --genome --outputPrefix --outDir
   
   --ctss_bed_path  <required> [string] "collapse" ctss file from scafe.tool.sc.bam_to_ctss.pl
   --genome         <required> [string] name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix   <required> [string] prefix for the output files
   --outDir         <required> [string] directory for the output files
   --overwrite      (optional) [yes/no] erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedGraphToBigWig

 To demo run, cd to SCAFE dir and run:
   scafe.tool.cm.ctss_to_bigwig \
   --ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/ctss_to_bigwig/

scafe.tool.cm.cluster [top]

This tool generate TSS cluster from a ctss bed file, using an external tool paraclu with user-defined cutoffs

 Usage:
   scafe.tool.cm.cluster [options] --cluster_ctss_bed_path --outputPrefix --outDir
   
   --cluster_ctss_bed_path       <required> [string]  ctss file used for clustering,
                                                      "collapse" ctss file from scafe.tool.sc.bam_to_ctss.pl, 
                                                      4th column is number of cells and 5th column is number UMI
   --outputPrefix                <required> [string]  prefix for the output files
   --outDir                      <required> [string]  directory for the output files
   --count_ctss_bed_path_list    (optional) [string]  comma delimited list of ctss bed file, 
                                                      using for filtering of clusters based signal 
                                                      (default=$cluster_ctss_bed_path) 
   --count_scope_bed_path        (optional) [string]  a bed file specify the scope for counting in $count_ctss_bed_path_list, 
                                                      using for filtering of clusters based signal
                                                      (default=$cluster_ctss_bed_path) 
   --min_pos_count               (optional) [integer] minimum counts per position, used for filtering the raw signal 
                                                      in $cluster_ctss_bed_path before clustering (default = 1)
   --min_cluster_cpm             (optional) [float]   minimum counts per million (cpm) for a cluster (default = 1e-5)
   --min_summit_count            (optional) [integer] minimum counts at the summit of a cluster (default = 3)
   --min_cluster_count           (optional) [integer] minimum counts within a cluster (default = 5)
   --min_num_sample_expr_cluster (optional) [integer] minimum number of samples (or cells) detected at the 
                                                      summit of a cluster (default = 3)
   --min_num_sample_expr_summit  (optional) [integer] minimum number of samples (or cells) detected within 
                                                      of a cluster (default = 5)
   --merge_dist                  (optional) [integer] maximum distance for merging closely located clusters, 
                                                      -1 to turn off merging (default = -1)
   --overwrite                   (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   paraclu
   paraclu-cut.sh
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.cm.cluster \
   --overwrite=yes \
   --cluster_ctss_bed_path=./demo/output/sc.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/cluster/

scafe.tool.cm.annotate [top]

This tool defines tCRE from TSS clusters and annotates them based their overlap with gene models.

 Usage:
   scafe.tool.cm.annotate [options] --tssCluster_bed_path --tssCluster_info_path --genome --outputPrefix --outDir
   
   --tssCluster_bed_path     <required> [string]  bed file contains the ranges of filtered TSS clusters,
                                                  *.tssCluster.*.filtered.bed.gz from scafe.tool.cm.filter.pl
   --tssCluster_info_path    <required> [string]  tsv file contains the information of all TSS clusters,
                                                  *.tssCluster.log.tsv from scafe.tool.cm.filter.pl
   --genome                  <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix            <required> [string]  prefix for the output files
   --outDir                  <required> [string]  directory for the output files
   --up_end5Rng              (optional) [integer] TSS clusters will be classified as gene TSS, exonic, intron 
                                                  and intergenic. $up_end5Rng determines the range upstream of 
                                                  annotated gene TSS to be used for gene TSS assignment 
                                                  (default = 500)
   --dn_end5Rng              (optional) [integer] TSS clusters will be classified as gene TSS, exonic, intron 
                                                  and intergenic. $dn_end5Rng determines the range downstream of 
                                                  annotated gene TSS to be used for gene TSS assignment 
                                                  (default = 500)
   --exon_slop_rng           (optional) [integer] TSS clusters will be classified as gene TSS, exonic, intron 
                                                  and intergenic. $exon_slop_rng determines the range to be extended
                                                  (i.e. slopped) from exon for assignment of exonic class. 
                                                  Used -1 to NOT to extend (default = -1)
   --merge_dist              (optional) [integer] TSS clusters outside annotated gene promoters are grouped
                                                  as "dummy genes" (for operational uniformity) by merging closely 
                                                  located TSS clusters.  $merge_dist determines the maximum distances 
                                                  between TSS clusters to be merged (default = 500)
   --addon_length            (optional) [integer] see $merge_dist. add-on "dummy transcrips" will assigned to TSS cluster of 
                                                  "dummy genes" (for operational uniformity).$addon_length determines 
                                                  the length of these add-on "dummy transcrips" (default = 500).
   --proximity_slop_rng      (optional) [integer] TSS clusters will be assigned to annotated gene TSS are "proximal"
                                                  TSS clusters. $proximity_slop_rng determines the range to be extended
                                                  (i.e. slopped) from gene TSS for assignment of proximal TSS clusters. 
                                                  (default = 500)
   --merge_strandness        (optional) [string]  see $merge_dist. $merge_strandness decides the merge to be 
                                                  strand-aware ("stranded") or strand-agnostic "strandless".
                                                  (default = strandless)
   --proximal_strandness     (optional) [string]  closely located proximal TSS clusters are merged  
                                                  tCREs. $proximal_strandness decides the merge to be 
                                                  strand-aware ("stranded") or strand-agnostic "strandless".
                                                  (default = stranded)
   --CRE_extend_size         (optional) [integer] tCREs were defined by merging the extended ranges of TSS clusters.
                                                  $CRE_extend_size determine the size of this range (both sides of 
                                                  summit) (default = 500)
   --CRE_extend_upstrm_ratio (optional) [float]   see $CRE_extend_size. $CRE_extend_upstrm_ratio determines the ratio 
                                                  (X:1) of flanking sizes on the upstream and downstream of summit. 
                                                  e.g. $CRE_extend_upstrm_ratio=4, upstream and downstream size will be 
                                                  taken as 4:1 ratio. $CRE_extend_size=500 and $CRE_extend_upstrm_ratio=4,
                                                  upstream and downstream will be 400 and 100 respectively 
                                                  (default = 4)
   --overwrite               (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.cm.annotate \
   --overwrite=yes \
   --tssCluster_bed_path=./demo/output/sc.solo/filter/demo/bed/demo.tssCluster.default.filtered.bed.gz \
   --tssCluster_info_path=./demo/output/sc.solo/filter/demo/log/demo.tssCluster.log.tsv \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/sc.solo/annotate/

scafe.tool.bk.subsample_ctss [top]

This tool subsample a ctss bed file from bulk CAGE ctss

 Usage:
   scafe.tool.bk.subsample_ctss [options] --UMI_CB_ctss_bed_path --subsample_num --outputPrefix --outDir
   
   --long_ctss_bed_path <required> [string]  ctss file for subsampling, one line read in "long" format,
                                             *long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl, 
   --subsample_num      <required> [integer] number of UMI to be subsampled
   --outputPrefix       <required> [string]  prefix for the output files
   --outDir             <required> [string]  directory for the output files
   --overwrite          (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.bk.subsample_ctss \
   --overwrite=yes \
   --long_ctss_bed_path=./demo/output/bk.solo/bam_to_ctss/demo/bed/demo.long.ctss.bed.gz \
   --subsample_num=100000 \
   --outputPrefix=demo \
   --outDir=./demo/output/bk.subsample/subsample_ctss/

scafe.tool.bk.pool [top]

This tool pools multiple bulk CAGE ctss bed file

 Usage:
   scafe.tool.bk.pool [options] --lib_list_path --genome --outputPrefix --outDir
   
   --lib_list_path <required> [string]  a list of libraries, in formation of 
                                        <lib_ID><\t><long_ctss_bed><\t><collapse_ctss_bed>
                                        lib_ID = Unique ID of the cellbarcode
                                        long_ctss_bed = *long.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl, 
                                        collapse_ctss_bed = *collapse.ctss.bed.gz from scafe.tool.bk.bam_to_ctss.pl, 
   --genome        <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix  <required> [string]  prefix for the output files
   --outDir        <required> [string]  directory for the output files
   --max_thread    (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                        avoid memory overflow (default=5)
   --overwrite     (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.bk.pool \
   --overwrite=yes \
   --lib_list_path=./demo/input/bk.pool/lib_list_path.txt \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/bk.pool/pool/

scafe.tool.bk.count [top]

This tool counts the CAGE reads within a set of user-defined regions, e.g. tCRE, and returns the reads per regions

 Usage:
   scafe.tool.bk.count [options] --countRegion_bed_path --ctss_bed_path --outputPrefix --outDir
   
   --countRegion_bed_path   <required> [string] bed file contains the regions for counting CTSS, e.g. tCRE ranges, 
                                                *.CRE.coord.bed.gz from scafe.tool.cm.annotate.pl
   --ctss_bed_path          <required> [string] ctss file for counting,
                                                *CB.ctss.bed.gz from scafe.tool.sc.bam_to_ctss.pl, 
                                                4th column cellbarcode and 5th column is number UMI
   --outputPrefix           <required> [string] prefix for the output files
   --outDir                 <required> [string] directory for the output files
   --overwrite              (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.bk.count \
   --overwrite=yes \
   --countRegion_bed_path=./demo/output/bk.solo/annotate/demo/bed/demo.CRE.annot.bed.gz \
   --ctss_bed_path=./demo/output/bk.solo/bam_to_ctss/demo/bed/demo.collapse.ctss.bed.gz \
   --outputPrefix=demo \
   --outDir=./demo/output/bk.solo/count/

scafe.tool.bk.bam_to_ctss [top]

This tool converts a bulk CAGE bam file to a ctss bed file, identifies read 5'end (capped TSS, i.e. ctss), extracts the unencoded G information, pileup ctss, and deduplicate the UMI

 Usage:
   scafe.tool.bk.bam_to_ctss [options] --bamPath --genome --outputPrefix --outDir
   
   --bamPath      <required> [string]  bam file (of CAGE reads), can be read 1 only or pair-end
   --genome       <required> [string]  name of genome reference, e.g. hg19.gencode_v32lift37
   --outputPrefix <required> [string]  prefix for the output files
   --outDir       <required> [string]  directory for the output files
   --include_flag (optional) [string]  samflag to be included, comma delimited 
                                       e.g. '64' to include read1, (default=null)
   --exclude_flag (optional) [string]  samflag to be excluded, comma delimited, 
                                       e.g. '128,256,4' to exclude read2, secondary alignment 
                                       and unaligned reads (default=128,256,4)
   --min_MAPQ     (optional) [integer] minimum MAPQ to include (default=0)
   --max_thread   (optional) [integer] maximum number of parallel threads, capped at 10 to 
                                       avoid memory overflow (default=5)
   --overwrite    (optional) [yes/no]  erase outDir/outputPrefix before running (default=no)

 Dependencies:
   bedtools
   samtools

 To demo run, cd to SCAFE dir and run:
   scafe.tool.bk.bam_to_ctss \
   --overwrite=yes \
   --bamPath=./demo/input/bk.solo/demo.CAGE.bam \
   --genome=hg19.gencode_v32lift37 \
   --outputPrefix=demo \
   --outDir=./demo/output/bk.solo/bam_to_ctss/

scafe.download.resources.genome [top]

This script download reference genome data and save in ./resources/genome.

 Usage:
   download.resources.genome --genome
   
   --genome <required> [string] name of genome reference, currently available genomes:
                                hg19.gencode_v32lift37
                                hg38.gencode_v32
                                mm10.gencode_vM25
                                TAIR10.AtRTDv2

 Dependencies:
   wget
   tar

 To demo run, cd to SCAFE dir and run:
   scafe.download.resources.genome \
   --genome=hg19.gencode_v32lift37

scafe.download.demo.input [top]

This scripts download demo data and save in ./demo/input dir.

 Usage:
   download.demo.input

 Dependencies:
   wget
   tar

 To demo run, cd to SCAFE dir and run:
   scafe.download.demo.input

scafe.demo.test.run [top]

This scripts test run for demo data in the ./demo/input dir. It runs user-selected workflows. Demo input data must be downloaded from using ./script/download.demo.input Genome reference hg19.gencode_v32lift37 must be downloaded using ./scripts/download.resources.genome

 Usage:
   demo.test.run [options] --run_outDir
   
   --run_outDir           <required> [string]  directory for the output test runs
   --workflow             (optional) [string]  comma delimited list of workflows, 
                                               or use 'all' to run all workflows.
                                               Available workflows includes,
                                               scafe.workflow.sc.subsample ---> workflow, single-cell mode, subsample ctss
                                               scafe.workflow.sc.solo ---> workflow, single-cell mode, process a single sample
                                               scafe.workflow.sc.pool ---> workflow, single-cell mode, pool ctss of multiple samples
                                               scafe.workflow.bk.subsample ---> workflow, bulk mode, subsample ctss
                                               scafe.workflow.bk.solo ---> workflow, bulk mode, process a single sample
                                               scafe.workflow.bk.pool ---> workflow, bulk mode, process a single sample
                                               (default=all)
   --overwrite            (optional) [yes/no]  erase run_outDir before running (default=no)

 Dependencies:
   R packages: 'ROCR','PRROC', 'caret', 'e1071', 'ggplot2', 'scales', 'reshape2'
   bigWigAverageOverBed
   bedGraphToBigWig
   bedtools
   samtools
   paraclu
   paraclu-cut.sh

 To demo run, cd to SCAFE dir and run:
   scafe.demo.test.run \
   --overwrite=yes \
   --run_outDir=./demo/output/

scafe.check.dependencies [top]

This scripts check the integrity of tools and workflow scripts, 3rd executable dependencies and R packages.

 Usage:
   check.dependencies

 Dependencies:
   wget
   tar
   Rscript

 To demo run, cd to SCAFE dir and run:
   scafe.check.dependencies