Read_Mapping

Skylar Wyant edited this page Jun 2, 2016 · 21 revisions

Basic Usage

The main function of the read_mapping_start.sh script is to submit a series of QSub jobs to the Portable Batch System job scheduler for read mapping using the Burrows-Wheeler Aligner (BWA). Although it is a shell script, read_mapping_start.sh cannot map reads on its own: it requires a list of samples to work with. It can also index a Fasta file using BWA. To run read_mapping_start.sh without any arguments, you would type:

./read_mapping_start.sh

This will display a usage message describing the arguments for read_mapping_start.sh.

Arguments

There are two subroutines for read_mapping_start.sh: map and index. As such, the argument lists have been broken up by subroutine.

Mapping Argument     Function
map                  Start the read mapping process using BWA
Scratch              A directory to put the finished SAM files from BWA
Reference Genome     The genome to base the read mapping off of. The genome must be indexed before read mapping can happen
Sample Info          A list of samples for read_mapping_start.sh to work with
Project              The name of the project or capture facility, used for the Read Group header
Platform             The platform used for sequencing, used for the Read Group header
Email                An email address for the QSub scheduler to notify you of starts, ends, and abortions for each read mapping

Indexing Argument    Function
index                Start the indexing process using BWA
Reference Genome     The genome to be indexed, must be in Fasta format
Email                An email address for the QSub scheduler to notify you of starts, ends, and abortions for indexing

All arguments must be passed in the correct order (top to bottom for each list) for read_mapping_start.sh to work. For example, say this script is in the directory ~/sequence_handling; to index a genome called 'reference_genome.fasta' in the directory ~/genomes and have the QSub scheduler notify user@github.com, we would type:

./read_mapping_start.sh index ~/genomes/reference_genome.fasta user@github.com

To map the Illumina-sequenced samples listed in the file 'trimmed_samples.txt' (stored in the directory ~/trimmed_samples) for our 'Genetics' project, have the SAM files go to the directory ~/mapped_SAM, use the reference genome 'reference_genome.fasta' stored in the directory ~/genomes, and have the QSub scheduler email any notifications to user@github.com, we would type:

./read_mapping_start.sh map ~/mapped_SAM ~/genomes/reference_genome.fasta ~/trimmed_samples/trimmed_samples.txt Genetics Illumina user@github.com
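
Because the reference genome must be indexed before mapping, it can be worth checking for the index files before submitting jobs. The following is a minimal sketch, not part of read_mapping_start.sh. It assumes the classic BWA index layout, in which 'bwa index' writes files ending in .amb, .ann, .bwt, .pac, and .sa next to the Fasta file; the helper name has_bwa_index is hypothetical.

```shell
# has_bwa_index: hypothetical helper, not part of read_mapping_start.sh.
# Succeeds only if every index file that classic 'bwa index' writes
# is present next to the reference Fasta file.
has_bwa_index() {
    reference="$1"
    for ext in amb ann bwt pac sa; do
        [ -f "${reference}.${ext}" ] || return 1
    done
    return 0
}

# Example (paths taken from the examples on this page):
#   has_bwa_index ~/genomes/reference_genome.fasta \
#       || echo "Reference is not indexed; run the index subroutine first"
```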

Please note: the script is set up to read forward reads as having the extension '_R1_trimmed.fq.gz' and reverse reads as having the extension '_R2_trimmed.fq.gz'. If your files do NOT have these extensions, please edit the forward and reverse naming extensions on lines 82 and 83 of the script using your favorite text editor.
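
To find out whether a sample list follows this naming before submitting anything, a quick check along these lines may help. This is only a sketch: it assumes the sample list holds one file path per line, and check_extensions is a hypothetical helper rather than something the script provides.

```shell
# Extensions the script expects, as described on this page
FORWARD_EXT='_R1_trimmed.fq.gz'
REVERSE_EXT='_R2_trimmed.fq.gz'

# check_extensions: hypothetical helper, not part of read_mapping_start.sh.
# Assumes one file path per line in the sample list.
# Prints every entry with an unexpected extension and returns 1 if any exist.
check_extensions() {
    sample_list="$1"
    bad=0
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue              # skip blank lines
        case "$sample" in
            *"$FORWARD_EXT"|*"$REVERSE_EXT") ;;   # expected naming
            *) echo "Unexpected extension: $sample"; bad=1 ;;
        esac
    done < "$sample_list"
    return "$bad"
}

# Example:
#   check_extensions ~/trimmed_samples/trimmed_samples.txt
```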

Output

The index subroutine for read_mapping_start.sh generates an index file for a reference genome in the same directory as the reference genome. Please make sure you have write permissions for said directory.
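
One way to confirm write access ahead of time is sketched below. The helper name can_write_index is hypothetical, and the check only tests the directory that holds the reference genome.

```shell
# can_write_index: hypothetical helper, not part of read_mapping_start.sh.
# Succeeds if the directory containing the reference genome exists
# and is writable by the current user.
can_write_index() {
    ref_dir=$(dirname "$1")
    [ -d "$ref_dir" ] && [ -w "$ref_dir" ]
}

# Example:
#   can_write_index ~/genomes/reference_genome.fasta \
#       || echo "Cannot write index files to the reference directory"
```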

The map subroutine generates aligned SAM files for each sample. These SAM files have the '@SQ', '@RG', and '@PG' headers included in them. The '@HD' header is not generated from this process.
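
To see which header records a finished SAM file actually contains, the header lines can be counted with grep. This is a sketch rather than part of the pipeline; count_sam_headers is a hypothetical helper, and it relies on SAM header lines starting with '@' followed by a two-letter record type.

```shell
# count_sam_headers: hypothetical helper, not part of read_mapping_start.sh.
# Prints one line per header record type with the number of matching lines.
count_sam_headers() {
    sam_file="$1"
    for tag in '@HD' '@SQ' '@RG' '@PG'; do
        # grep -c exits non-zero when the count is 0, so ignore its status
        count=$(grep -c "^${tag}" "$sam_file" || true)
        echo "$tag $count"
    done
}

# Example (sample.sam is a placeholder file name):
#   count_sam_headers ~/mapped_SAM/sample.sam
```

For a file produced as described above, we would expect the '@HD' count to be 0 and the other counts to be at least 1.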

read_mapping_start.sh does not generate a list of the SAM files it produces. To create one, please use sample_list_generator.sh. A list of SAM files is required for the SAM_Processing scripts.
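
If sample_list_generator.sh is not at hand, a list can also be built with standard tools. The sketch below is an alternative, not how sample_list_generator.sh works. It assumes the SAM_Processing scripts accept one file path per line, which is not confirmed here, and list_sam_files is a hypothetical helper.

```shell
# list_sam_files: hypothetical helper, NOT sample_list_generator.sh.
# Writes a sorted list of the .sam files found directly inside a directory,
# one path per line (assumed list format).
list_sam_files() {
    sam_dir="$1"
    out_list="$2"
    find "$sam_dir" -maxdepth 1 -type f -name '*.sam' | sort > "$out_list"
}

# Example:
#   list_sam_files ~/mapped_SAM ~/mapped_SAM/sam_list.txt
```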

Dependencies

read_mapping_start.sh depends on BWA and the Portable Batch System to run. If you want to use a different job scheduler or read mapper, you will need to modify this script extensively.
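
A quick way to confirm that both dependencies are on your PATH is sketched below. It assumes the executables are named 'bwa' and 'qsub'; missing_tools is a hypothetical helper.

```shell
# missing_tools: hypothetical helper, not part of read_mapping_start.sh.
# Prints the name of every tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Example: prints nothing when both dependencies are installed
#   missing_tools bwa qsub
```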
