Pipeline for preprocessing, alignment and reporting of SILVER-seq datasets. This pipeline consists of three modules:
- a preprocessing module where a Python script is applied to deduplicate reads in fastq files
- an alignment module where the exceRpt software (Rozowsky et al., 2019) is applied to map the reads to different RNA features
- a report module where another Python script is applied to parse the alignment results from exceRpt and generate a table of mapping statistics and a table of gene counts. The script can report results from one or multiple SILVER-seq datasets in the form of one mapping statistics table and one gene counts table.
Three files are needed for the preprocessing module:
- SILVER-seq read fastq file: sample1.fastq, where sample1 is the sample ID for this SILVER-seq dataset
- SILVER-seq read index file: sample1.index.fastq
- deduplication Python script: runDedup.py
Create a working directory, for example /path/silverSeq/, to store the input and output files for this pipeline.
To deduplicate the SILVER-seq read fastq file, open a terminal and run
python /path/silverSeq/runDedup.py /path/silverSeq/sample1.fastq /path/silverSeq/sample1.index.fastq /path/silverSeq/sample1.fastq.dedup
where sample1.fastq.dedup contains the deduplicated SILVER-seq reads that will be used in the downstream pipeline.
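The internal logic of runDedup.py is not described in this README, so the following is only a minimal sketch of what deduplication with an index file could look like. It assumes (this is an assumption, not a statement about runDedup.py) that the read file and the index file list their records in the same order, and that two reads are duplicates when both the read sequence and the index sequence are identical. The function names `read_fastq` and `dedup_reads` are hypothetical.

```python
from typing import Iterator, TextIO, Tuple

# One FASTQ record: header line, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records until the end of the file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def dedup_reads(reads: TextIO, index: TextIO, out: TextIO) -> int:
    """Write the first read seen for each (read sequence, index sequence) pair.

    ASSUMPTION: duplicates are defined by identical read and index sequences;
    the real runDedup.py may use a different rule.
    Returns the number of reads written.
    """
    seen = set()
    kept = 0
    for read_rec, index_rec in zip(read_fastq(reads), read_fastq(index)):
        key = (read_rec[1], index_rec[1])
        if key in seen:
            continue  # same read sequence and same index: treat as duplicate
        seen.add(key)
        out.write("\n".join(read_rec) + "\n")
        kept += 1
    return kept
```

The function takes open text handles, so it can be tried on small in-memory examples before being pointed at real files such as sample1.fastq and sample1.index.fastq.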
The alignment module follows the exceRpt small RNA-seq pipeline. Here are the installation instructions for exceRpt. exceRpt also needs the pre-compiled endogenous genome and transcriptome indices for human hg38, which can be downloaded Here
Create a sub-directory under the working directory, for example /path/silverSeq/excerptOutput/, to hold the output from exceRpt. The directory holding the downloaded genome and transcriptome indices is referred to as /path/silverSeq/excerptHg38/
To map the SILVER-seq reads, open a terminal and run
docker run -v /path/silverSeq/:/exceRptInput \
-v /path/silverSeq/excerptOutput/sample1/:/exceRptOutput \
-v /path/silverSeq/excerptHg38:/exceRpt_DB/hg38 \
-t rkitchen/excerpt \
TRIM_N_BASES_3p=25 \
INPUT_FILE_PATH=/exceRptInput/sample1.fastq.dedup \
ADAPTER_SEQ=none \
N_THREADS=10 \
STAR_outFilterMatchNminOverLread=0.66 \
STAR_outFilterMismatchNmax=10
The value of the N_THREADS parameter in the above command can be adjusted as needed, but we do not recommend changing the values of the other parameters. After the mapping process finishes, the output is stored under the directory /path/silverSeq/excerptOutput/sample1/
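When several samples need to be mapped, it can help to generate the docker command per sample rather than editing it by hand. The sketch below rebuilds the command shown above as an argument list, with only the sample ID and N_THREADS left adjustable. The helper name `build_excerpt_command` is hypothetical, and the directory layout (excerptOutput/ and excerptHg38/ under the working directory) is the one assumed in this README.

```python
from pathlib import PurePosixPath
from typing import List


def build_excerpt_command(work_dir: str, sample_id: str, n_threads: int = 10) -> List[str]:
    """Return the exceRpt docker command for one sample as an argument list.

    All exceRpt parameters except N_THREADS are fixed to the values given
    in this README.
    """
    work = PurePosixPath(work_dir)
    output_dir = work / "excerptOutput" / sample_id
    index_dir = work / "excerptHg38"
    return [
        "docker", "run",
        "-v", f"{work}:/exceRptInput",
        "-v", f"{output_dir}:/exceRptOutput",
        "-v", f"{index_dir}:/exceRpt_DB/hg38",
        "-t", "rkitchen/excerpt",
        "TRIM_N_BASES_3p=25",
        f"INPUT_FILE_PATH=/exceRptInput/{sample_id}.fastq.dedup",
        "ADAPTER_SEQ=none",
        f"N_THREADS={n_threads}",
        "STAR_outFilterMatchNminOverLread=0.66",
        "STAR_outFilterMismatchNmax=10",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once per sample; building a list instead of a single string avoids shell quoting issues.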
reportSilverSeq.py is needed for the report module.
If you have mapping results from multiple SILVER-seq libraries, put all exceRpt output subdirectories under the same directory, for example /path/silverSeq/excerptOutput/, so that the directory structure looks like
/path/silverSeq/excerptOutput/
--sample1/
--sample2/
--sample3/
...
...
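Before running the report script, the layout above can be checked quickly. This small sketch lists the sample subdirectories found under the exceRpt output directory; the helper name `list_sample_dirs` is hypothetical, and it only checks that subdirectories exist, not that each one contains a complete exceRpt run.

```python
from pathlib import Path
from typing import List


def list_sample_dirs(excerpt_output: str) -> List[str]:
    """Return the sorted names of all subdirectories (one per sample)."""
    root = Path(excerpt_output)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For the layout shown above, `list_sample_dirs("/path/silverSeq/excerptOutput")` would be expected to include sample1, sample2 and sample3.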
To get gene counts of these SILVER-seq libraries, run the following command
python /path/silverSeq/reportSilverSeq.py /path/silverSeq/excerptHg38 /path/silverSeq/excerptOutput prefix
where prefix is the prefix you want to add to your output files.
The mapping statistics table derived from the SILVER-seq libraries is written to a file named mappingStats.csv, and the gene counts table is written to a file named geneCounts.csv.
In this repository, we provide example outputs of this pipeline using the SILVER-seq data published in the paper of Zhou et al., 2019.
The mapping statistics table is available as silverSeq_pnas_mappingStats.csv
The compressed gene counts table is available as silverSeq_pnas_geneCounts.csv.xz
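The gene counts example is xz-compressed, so it has to be decompressed before or while reading. A minimal sketch using only the Python standard library is shown below; the helper name `read_xz_csv` is hypothetical, and because the column layout of the table is not described here, the function simply returns the header row and the remaining rows as lists of strings.

```python
import csv
import lzma
from typing import List, Tuple


def read_xz_csv(path: str) -> Tuple[List[str], List[List[str]]]:
    """Read an xz-compressed CSV file and return (header, rows)."""
    # "rt" opens the compressed file in text mode; newline="" is the
    # setting recommended by the csv module for its input files.
    with lzma.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)
    return header, rows
```

For example, `header, rows = read_xz_csv("silverSeq_pnas_geneCounts.csv.xz")` loads the whole table into memory; for very large tables, iterating over the reader row by row would be the lighter option.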