In this tutorial you're gonna learn how to map the RNAseq data to a drosophila reference genome and quantify each gene's expression level.
http://chagall.med.cornell.edu/RNASEQcourse/Slides_July2019_Day2.pdf
More info here: Alignment-free sequence comparison: benefits, applications, and tools
We will use RNAseq data from FlyAtlas2 database, which collects hundreds of RNAseq data of drosophila melanogaster. You can search by gene, category or tissue.
The raw RNAseq data can be found on the resource page for the homework tutorials.
The next step is to trim and clean the raw data using the techniques you learned from Tutorial 2 or Discussion session 3 (See Canvas Files and Media Gallery for more details).
You could download reference genome / transcriptome / gtf files of your familiar species from ENSEMBLE. If you are analyzing Human or Mouse, you could also try Genecode.
# Download drosophila genome reference sequence
wget ftp://ftp.ensembl.org/pub/release-97/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz
gzip -d Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz # decompress .gz file
# Download drosophila transctiptome (cDNA) sequence
wget ftp://ftp.ensembl.org/pub/release-97/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz
gzip -d Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz
# Download drosophila annotation gtf file
wget ftp://ftp.ensembl.org/pub/release-97/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.22.97.chr.gtf.gz
gzip -d Drosophila_melanogaster.BDGP6.22.97.chr.gtf.gzWe will use STAR to align reads to the whole genome.
conda install star
# check if STAR has been successfully installed.
STAR -h
# build index
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir genome_STARidx/ --genomeFastaFiles Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa --sjdbGTFfile Drosophila_melanogaster.BDGP6.22.97.chr.gtf --sjdbOverhang 100(Change the file paths to reflect where the respective files are located on your computer.)
--runThreadN Number of threads you use to run on your computer.
--genomeDir The place you want to put your reference. (Make sure the location already exists; the command does not make a new folder.)
--genomeFastaFiles The genome fasta file we've just downloaded and uncompressed.
--sjdbGTFfile Gtf file.
--sjdbOverhang Usually equals read length minus 1.
It might take some time to finish the alignment, and the total Memory usage peak at 10g during this process. If this memory requirement is beyond your computer, you could download the pre-computed index from our resource page.
After we build the index, we're gonna map our reads towards the genome.
STAR --runThreadN 16 --genomeDir genome_STARidx --readFilesIn female_midgut1_R1_clean.fastq female_midgut1_R2_clean.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ./female_midgut1_R1_STAR_genomeThis will give you two important results:
- A sorted bam file based on the coordinates.
- A final *Log.final.out file that includes mapping results information: total mapping ratio / unique mapping ratio / number of mapped reads etc.
Once you get the bam file (which records each reads align to which specific locations of the genome), you may want to summarize reads abundance for each gene.
featureCounts -p -a Drosophila_melanogaster.BDGP6.22.97.chr.gtf -T 16 -o female_midgut1_count.txt bam/female_midgut1_R1_STAR_genomeAligned.sortedByCoord.out.bam
-p Check validity of paired-end distance.
-a Name of an annotation file. GTF/GFF format by default.