This repository contains a number of scripts and programs useful for finding mobile element insertion (MEI) sites in single-end datasets. In short, these programs align the soft-clipped portions of alignments against repeat sequences and process the results to find the insertion locations.
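For intuition, an alignment's soft-clipped bases sit at the ends of the SAM SEQ field, with their length recorded in the CIGAR string. A toy awk sketch on a synthetic SAM record shows the idea (the real extraction is handled by the `extractSoftclipped` program, which works on BAM files and handles both ends of the read):

```shell
# Pull a leading soft-clip out of a toy SAM record (illustration only).
printf 'read1\t0\tchr1\t100\t255\t5S10M\t*\t0\t0\tACGTTGGGGGGGGGG\tIIIIIIIIIIIIIII\n' | \
awk 'BEGIN{FS="\t"}
    match($6, /^[0-9]+S/) {                 # leading soft-clip in the CIGAR
        n = substr($6, 1, RLENGTH - 1) + 0  # clip length (e.g. 5 from "5S")
        print substr($10, 1, n)             # the clipped bases from SEQ
    }'
# → ACGTT
```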
Requirements:

- The pysam python module
- bowtie1 or another short-read aligner
- A local aligner, like STAR
- htslib 1.1, which is included as a submodule
The C files in this repository can be compiled with the supplied Makefile by simply typing `make` in the source directory.
The general processing steps are as follows:

1. Map with `STAR` or another local aligner against the genome.
2. Use `extractSoftclipped` on the resulting BAM file to create fastq files of the soft-clipped regions.
3. Download and filter the output from RepeatMasker from UCSC or another source. Filtering could be done as follows (piping to `sed` is only needed if you aligned to an Ensembl reference):

    ```
    head -n 2 1/chr1.fa.out
    for chrom in `ls -d */`
    do
        awk 'BEGIN{OFS="\t"}{if(NR>3) print $5,$6,$7,$11}' $chrom/*.out | sed 's/chr//g' >> rmask.bed
    done
    ```

4. Extract the repeat sequences with `bed2fasta.py`, using the BED file just created in step 3. Note that this is a VERY slow program and should probably just be rewritten. One could alternatively just use `bedtools getfasta -name`, which is probably faster.
5. Align the soft-clipped sequences from step 2 against the repeat sequences with bowtie1 or another short-read aligner.
6. Run `compactRepeats` on the resulting BAM file to create a quality-controlled list, in BED format, of putative insertion sites.
7. Use `bedtools cluster` to group insertion sites into clusters. Ensure that the input is sorted (bedtools can also be used for this).
8. Use `filter.py` to produce a BED file containing only sites with at least two supporting alignments. A repeat type must be specified, since the output contains only a single repeat type.
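The RepeatMasker filtering in step 3 can be tried on a tiny synthetic `.out` file (the coordinates and repeat names below are made up):

```shell
# A toy RepeatMasker .out file: two header lines, a blank line, then data.
cat > toy.out <<'EOF'
   SW  perc perc perc  query      position in query
score  div. del. ins.  sequence   begin  end    (left)  strand  repeat  class/family

  463  1.3  0.6  1.7   chr1       10001  10468  (1000)  +       L1MC    LINE/L1
  320  2.0  0.1  0.9   chr1       20000  20300  (900)   -       AluY    SINE/Alu
EOF
# Skip the three header lines, keep chrom/start/end/class, strip "chr".
awk 'BEGIN{OFS="\t"}{if(NR>3) print $5,$6,$7,$11}' toy.out | sed 's/chr//g'
```

This prints one BED line per repeat, e.g. `1  10001  10468  LINE/L1`.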
Below are the exact commands used to process a single example sample. The fastq file isn't provided, as this is meant simply to be an illustrative example.
1. Align with `STAR` (note that multiple samples would be processed in this manner, so the index, which we assume has already been created, isn't unloaded until completion):

    ```
    STAR --genomeDir indexes --genomeLoad LoadAndKeep --outSAMstrandField intronMotif --outFilterMultimapNmax 2 --outSAMattributes Standard --readFilesCommand zcat --readFilesIn foo.fq.gz --outFileNamePrefix foo --outStd SAM | samtools view -Sbo foo.bam -
    ```

2. Extract the soft-clipped sequences in fastq format:

    ```
    extractSoftclipped foo.bam > foo.fq.gz
    ```

3. Download, filter and merge RepeatMasker output from UCSC (this example uses a mouse dataset and extracts only LINE/L1 and SINE/Alu regions):

    ```
    wget --quiet http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromOut.tar.gz
    tar xf chromOut.tar.gz
    for d in `ls -d */`
    do
        awk '{if(NR>3 && ($11=="LINE/L1" || $11=="SINE/Alu")) printf("%s\t%s\t%s\t%s\n",$5,$6,$7,$11)}' $d/*.out | sed -e 's/chr//g' >> rmask.bed #Convert UCSC to Ensembl chromosome names
    done
    ```

4. Extract the genomic sequence of the repeat regions:

    ```
    bed2fasta.py rmask.bed reference.fa > rmask.fa
    ```

    Alternatively:

    ```
    bedtools getfasta -name -fi reference.fa -bed rmask.bed -fo /dev/stdout | fold -w 60 > rmask.fa #This is faster than bed2fasta.py
    ```

5. Index and align to the repeat regions (with bowtie1 in this example):

    ```
    bowtie-build rmask.fa rmask
    bowtie -a --best --strata -p 6 -S rmask <(zcat foo.fq.gz) | samtools view -Sbo foo.bam - #You can ignore the "duplicated sequence" warnings
    ```

6. Perform a bit of QC and sort the sites:

    ```
    compactRepeats foo.bam | bedtools sort > foo.bed #This could be merged with step 5 by specifying "-" as the input file.
    ```

7. Group reads into clusters (with multiple samples, concatenate the output from step 6 of each sample and then use `bedtools sort` before clustering):

    ```
    bedtools cluster -d 50 foo.bed > foo.clustered.bed
    ```

8. Produce unique insertions of a single type:

    ```
    filter.py foo.clustered.bed "LINE/L1" foo.LINE.bed
    ```
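For intuition, the kind of selection `filter.py` performs can be sketched in awk, assuming a column layout of chromosome, start, end, repeat type, and cluster ID (this layout is an assumption for illustration; the real script's input and output formats may differ):

```shell
# Toy clustered BED: cluster 1 has two LINE/L1 alignments, the others one each.
cat > toy.clustered.bed <<'EOF'
1	100	150	LINE/L1	1
1	110	160	LINE/L1	1
1	5000	5050	SINE/Alu	2
2	300	350	LINE/L1	3
EOF
# Keep one representative line per cluster with >=2 alignments of the given type
# (approximation of filter.py's selection, not its actual code).
awk -v type="LINE/L1" 'BEGIN{OFS="\t"}
    $4 == type { count[$5]++; line[$5] = $0 }      # tally support per cluster
    END { for (c in count) if (count[c] >= 2) print line[c] }
' toy.clustered.bed
```

Only cluster 1 survives here, since it alone has two supporting LINE/L1 alignments.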
A common goal is to find insertion sites unique to a given group of samples. This can be done as follows:

```
compareGroups Sample1.bed,Sample2.bed Sample3.bed,Sample4.bed All.bed > unique.bed
```

Note that `compareGroups` requires that the BED files be sorted in lexicographic order. Comparisons between individual samples can be done in a similar way, by specifying only a single file in one or both groups.
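"Lexicographic order" here means that chromosome names are compared as strings, so chromosome 10 sorts before chromosome 2. A coreutils `sort` invocation that produces this ordering:

```shell
# A deliberately unsorted toy BED file.
cat > toy.bed <<'EOF'
2	500	600	SINE/Alu
10	100	200	LINE/L1
1	300	400	LINE/L1
EOF
# Lexicographic by chromosome, numeric by start position.
sort -k1,1 -k2,2n toy.bed
```

The output order is chromosomes 1, 10, 2 — string order, not numeric order.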