mtzingarella/VariantCalling-DenovoAssembly-StudyReplication

VariantCalling-DenovoAssembly-StudyReplication

Workflow flowchart

Replication of the analysis from:

Paterson JR, Wadsworth JM, Hu P, Sharples GJ. A critical role for iron and zinc homeostatic systems in the evolutionary adaptation of Escherichia coli to metal restriction. Microbial Genomics 9:001153 (2023).

Final project for UConn ISG5312 (Genomic Data Analysis in Practice II). Raw WGS reads for 8 E. coli BW25113 isolates (1 ancestor, 6 evolved, 1 control; BioProject PRJNA989548) were reprocessed through a GATK hard-filtering pipeline and through two independent variant calling methods applied to de novo assemblies (minimap2 and MUMmer). The results were then compared against the mutations reported in the paper.


Directory structure

VariantCalling-DenovoAssembly-StudyReplication/
├── data/
│   ├── accessionlist.txt
│   ├── genome
│   ├── raw
│   └── trimmed
├── scripts/
│   ├── 01_getdata/
│   ├── 02_qc_trimming/
│   ├── 03_variantcalling/
│   │   └── gatk_snp_indel/
│   └── 04_denovoassemblies/
├── notebooks/
├── results/          # HPC-side outputs (large files not committed)
│   ├── 02_qc_trimming/
│   ├── 03_variantcalling/
│   └── 04_denovoassemblies/
└── local_results/    # Transferred subsets for notebook use
    ├── 03_post_align_QC/
    ├── 03_variantcalling/
    └── 04_denovoassemblies/

NOTE:

local_results/ is a manually created directory used for running the Jupyter notebooks locally; it is not pushed to the repo.

The data/ directory is created when the data download scripts run.
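
Since local_results/ is created by hand, a small helper can reproduce the layout shown in the directory tree above. This is a minimal sketch in Python (used here only for illustration); the function name `make_local_results` is hypothetical and not part of the repository, and the subdirectory names are taken from the tree.

```python
from pathlib import Path

# Subdirectories of local_results/ as listed in the directory tree.
LOCAL_RESULT_SUBDIRS = (
    "03_post_align_QC",
    "03_variantcalling",
    "04_denovoassemblies",
)


def make_local_results(root="."):
    """Create local_results/ and its subdirectories under `root`.

    Hypothetical helper: existing directories are left untouched.
    """
    base = Path(root) / "local_results"
    for name in LOCAL_RESULT_SUBDIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

Calling `make_local_results()` from the repository root would create the empty directories; the HPC outputs still have to be copied in separately.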

scripts/

SLURM batch scripts run on the UConn HPC cluster, numbered in execution order within each phase.

| Directory | Contents |
| --- | --- |
| 01_getdata/ | Download SRA reads (SRA Toolkit v3.0.5 fasterq-dump) and BW25113 reference genome (GCF_000750555.1) |
| 02_qc_trimming/ | FastQC v0.12.1 + MultiQC v1.15 on raw reads; Trimmomatic v0.39 adapter/quality trimming (SLIDINGWINDOW:4:15, MINLEN:50); post-trim QC |
| 03_variantcalling/ | bwa-mem2 v2.2.1 alignment with samblaster v0.1.24 inline duplicate marking; post-alignment QC (samtools stats v1.20, bedtools v2.29.0, bamtools v2.5.1). gatk_snp_indel/: GATK v4.3.0.0 MarkDuplicates → HaplotypeCaller (haploid, GVCF mode) → GenomicsDBImport → GenotypeGVCFs → hard filtering → SnpEff v4.3q annotation → TSV export |
| 04_denovoassemblies/ | SPAdes v3.15.2 (--isolate) per-sample assembly; QUAST v5.2.0 QC; Prokka v1.13 annotation; minimap2 v2.28 (-cx asm5) + paftools.js variant calling → SnpEff v4.3q annotation → TSV export; MUMmer v4.0.2 (nucmer, delta-filter -r -q -i 95, show-snps -C) variant calling → SnpEff v4.3q annotation → TSV export |
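
As a rough illustration of what the SLIDINGWINDOW:4:15 and MINLEN:50 settings do, the Python sketch below scans a read's Phred scores from the 5' end, cuts at the first 4-base window whose mean quality falls below 15, and drops the read if fewer than 50 bases remain. It is a simplified model, not Trimmomatic's actual implementation (which may handle edge cases differently), and the function names are made up for this example.

```python
def sliding_window_keep_length(quals, window=4, min_avg=15):
    """Return how many leading bases survive a sliding-window quality scan.

    Simplified model: the read is cut at the start of the first window
    whose mean Phred quality is below `min_avg`. Reads shorter than the
    window are returned unchanged (an assumption of this sketch).
    """
    for start in range(0, len(quals) - window + 1):
        window_mean = sum(quals[start:start + window]) / window
        if window_mean < min_avg:
            return start
    return len(quals)


def trim_read(seq, quals, window=4, min_avg=15, min_len=50):
    """Apply the window trim, then a minimum-length filter.

    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read
    is shorter than `min_len`.
    """
    keep = sliding_window_keep_length(quals, window, min_avg)
    if keep < min_len:
        return None
    return seq[:keep], quals[:keep]


# Toy read: 60 high-quality bases followed by 10 low-quality bases.
example_quals = [30] * 60 + [2] * 10
example = trim_read("A" * 70, example_quals)
```

In this toy read the low-quality tail is removed and the remaining bases pass the length filter, whereas a uniformly good 40-base read would be discarded by the MINLEN step alone.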

notebooks/

R notebooks for QC and downstream analysis. Output files from the HPC must be transferred to local_results/ before the notebooks are run.

| Notebook | Contents |
| --- | --- |
| Post_Alignment_QC.ipynb | samtools stats per-sample summaries; per-window coverage (100 bp and 1 kb) relative to origin of replication |
| Assembly_QC.ipynb | QUAST assembly statistics: N50, genome fraction, misassemblies, mismatches/indels per 100 kb |
| GATK_VCF_QC.ipynb | Pre/post-filter annotation distributions (QD, QUAL, FS, SOR, MQ, ReadPosRankSum); depth distributions; filter yield; haploid genotype check |
| GATK_Variant_Analysis.ipynb | Ancestor background subtraction; GATK PASS variants in paper genes; recovery comparison vs Paterson et al. 2023 |
| Minimap_Variant_Analysis.ipynb | Minimap2 variant parsing (QUAL ≥ 20, ALT ≤ 50 bp); ancestor background subtraction; paper mutation recovery |
| MUMmer_Variant_Analysis.ipynb | MUMmer SNP/coords parsing; reference coverage; ancestor background subtraction; paper mutation recovery |
| Paper_Comparison.ipynb | Cross-method overlap heatmap; 14-gene × 8-sample detection matrix across GATK, minimap2, and MUMmer; sample-swap candidate aggregation |
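
The filtering and ancestor background subtraction steps listed for the variant analysis notebooks can be sketched as follows. The notebooks themselves are written in R; this Python version is only an illustration under stated assumptions: variants are keyed on (CHROM, POS, REF, ALT), the thresholds mirror the QUAL ≥ 20 and ALT ≤ 50 bp cutoffs named for the minimap2 notebook, and the contig name `ref`, the toy records, and the function names are all hypothetical.

```python
def load_variants(vcf_lines, min_qual=20.0, max_alt_len=50):
    """Parse VCF body lines into a set of (chrom, pos, ref, alt) keys.

    Records with a missing QUAL ('.'), QUAL below `min_qual`, or an ALT
    allele longer than `max_alt_len` bases are skipped.
    """
    variants = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, qual = (
            fields[0], int(fields[1]), fields[3], fields[4], fields[5]
        )
        if qual == "." or float(qual) < min_qual:
            continue
        if len(alt) > max_alt_len:
            continue
        variants.add((chrom, pos, ref, alt))
    return variants


def subtract_ancestor(sample_variants, ancestor_variants):
    """Keep only variants that are absent from the ancestor."""
    return sample_variants - ancestor_variants


# Toy records (hypothetical): CHROM, POS, ID, REF, ALT, QUAL
ancestor = load_variants(["ref\t100\t.\tA\tG\t60"])
evolved = load_variants([
    "##fileformat=VCFv4.2",
    "ref\t100\t.\tA\tG\t60",               # shared with ancestor
    "ref\t2500\t.\tC\tT\t60",              # novel, passes filters
    "ref\t3000\t.\tG\tA\t10",              # fails QUAL >= 20
    "ref\t4000\t.\tT\t" + "A" * 51 + "\t60",  # fails ALT <= 50 bp
])
novel = subtract_ancestor(evolved, ancestor)
# novel == {("ref", 2500, "C", "T")}
```

Keying on the full (CHROM, POS, REF, ALT) tuple means a different allele at an ancestral position still counts as novel; whether the notebooks match on position alone or on the full allele is not stated in this README.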

Workflow summary

  1. Data acquisition: SRA reads (BioProject PRJNA989548) + BW25113 reference genome (GCF_000750555.1)
  2. QC and trimming: FastQC/MultiQC on raw and trimmed reads; Trimmomatic adapter/quality trimming
  3. Reference-based variant calling: bwa-mem2 → GATK HaplotypeCaller haploid GVCF → joint genotyping → hard filtering → SnpEff annotation
  4. De novo assembly: SPAdes --isolate → QUAST QC → Prokka annotation → two independent variant calling methods:
    • minimap2 (-cx asm5) + paftools.js assembly-to-reference variant calling → SnpEff annotation
    • MUMmer (nucmer, bidirectional best-hit filter, unique-region SNPs) → SnpEff annotation
  5. Downstream analysis: ancestor background subtraction per method; comparison of all 14 paper genes across all 8 samples; cross-method detection heatmap; identification of a potential SRA sample swap
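
The gene × sample detection matrix from step 5 can be pictured as a lookup of which methods report a variant in each gene for each sample. The sketch below is an assumption about the data shape, not the notebook's actual R code: each method contributes a set of (gene, sample) hits, and the gene and sample names are placeholders rather than the 14 paper genes or the real SRA sample IDs.

```python
def detection_matrix(calls, genes, samples):
    """Build a gene -> sample -> [methods] table.

    `calls` maps a method name to the set of (gene, sample) pairs in
    which that method detected at least one variant.
    """
    return {
        gene: {
            sample: sorted(
                method for method, hits in calls.items()
                if (gene, sample) in hits
            )
            for sample in samples
        }
        for gene in genes
    }


# Placeholder inputs (hypothetical gene and sample names).
calls = {
    "GATK": {("geneA", "evolved_1"), ("geneB", "evolved_2")},
    "minimap2": {("geneA", "evolved_1")},
    "MUMmer": {("geneA", "evolved_1"), ("geneB", "evolved_2")},
}
matrix = detection_matrix(
    calls,
    genes=["geneA", "geneB"],
    samples=["evolved_1", "evolved_2"],
)
# matrix["geneB"]["evolved_2"] == ["GATK", "MUMmer"]
```

Cells where the three methods agree on a gene but the sample differs from the one reported in the paper are the kind of pattern that would point toward a sample swap, though the exact criteria used in Paper_Comparison.ipynb are not described here.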
