Replication of the analysis from:
Paterson JR, Wadsworth JM, Hu P, Sharples GJ. A critical role for iron and zinc homeostatic systems in the evolutionary adaptation of Escherichia coli to metal restriction. Microbial Genomics 9:001153 (2023).
Final project for UConn ISG5312: Genomic Data Analysis in Practice II. Raw WGS reads for 8 E. coli BW25113 isolates (ancestor, 6 evolved, 1 control; BioProject PRJNA989548) were reprocessed through a GATK hard-filtering pipeline and through two independent assembly-based variant calling methods (minimap2 and MUMmer, each run on de novo assemblies). The resulting calls were compared against the mutations reported in the paper.
VariantCalling-DenovoAssembly-StudyReplication/
├── data/
│   ├── accessionlist.txt
│   ├── genome
│   ├── raw
│   └── trimmed
├── scripts/
│   ├── 01_getdata/
│   ├── 02_qc_trimming/
│   ├── 03_variantcalling/
│   │   └── gatk_snp_indel/
│   └── 04_denovoassemblies/
├── notebooks/
├── results/                 # HPC-side outputs (large files not committed)
│   ├── 02_qc_trimming/
│   ├── 03_variantcalling/
│   └── 04_denovoassemblies/
└── local_results/           # Transferred subsets for notebook use
    ├── 03_post_align_QC/
    ├── 03_variantcalling/
    └── 04_denovoassemblies/
NOTE:
local_results/ is a manually created directory used to run the Jupyter notebooks locally; it is not pushed to the repo.
The data/ directory is created when the data download scripts run.
The SLURM batch scripts run on the UConn HPC cluster and are numbered in execution order within each phase.
| Directory | Contents |
|---|---|
| 01_getdata/ | Download SRA reads (SRA Toolkit v3.0.5 fasterq-dump) and BW25113 reference genome (GCF_000750555.1) |
| 02_qc_trimming/ | FastQC v0.12.1 + MultiQC v1.15 on raw reads; Trimmomatic v0.39 adapter/quality trimming (SLIDINGWINDOW:4:15, MINLEN:50); post-trim QC |
| 03_variantcalling/ | bwa-mem2 v2.2.1 alignment with samblaster v0.1.24 inline duplicate marking; post-alignment QC (samtools stats v1.20, bedtools v2.29.0, bamtools v2.5.1). gatk_snp_indel/: GATK v4.3.0.0 MarkDuplicates → HaplotypeCaller (haploid, GVCF mode) → GenomicsDBImport → GenotypeGVCFs → hard filtering → SnpEff v4.3q annotation → TSV export |
| 04_denovoassemblies/ | SPAdes v3.15.2 (--isolate) per-sample assembly; QUAST v5.2.0 QC; Prokka v1.13 annotation; minimap2 v2.28 (-cx asm5) + paftools.js variant calling → SnpEff v4.3q annotation → TSV export; MUMmer v4.0.2 (nucmer, delta-filter -r -q -i 95, show-snps -C) variant calling → SnpEff v4.3q annotation → TSV export |
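
To make the trimming settings concrete, here is a minimal Python sketch of what a `SLIDINGWINDOW:4:15` plus `MINLEN:50` rule does to a single read. It is a simplified illustration only, not Trimmomatic's implementation (which may handle window boundaries differently), and the function name `sliding_window_trim` is hypothetical.

```python
from statistics import mean


def sliding_window_trim(quals, window=4, min_avg=15, min_len=50):
    """Return the number of leading bases to keep, or 0 if the read is dropped.

    quals   : per-base Phred quality scores, 5' to 3'
    window  : window width in bases (4 in SLIDINGWINDOW:4:15)
    min_avg : minimum mean quality per window (15 in SLIDINGWINDOW:4:15)
    min_len : minimum surviving read length (50 in MINLEN:50)

    Simplified assumption: the read is cut at the start of the first
    window whose mean quality falls below min_avg.
    """
    keep = len(quals)
    for start in range(0, len(quals) - window + 1):
        if mean(quals[start:start + window]) < min_avg:
            keep = start
            break
    # Reads shorter than min_len after trimming are discarded entirely.
    return keep if keep >= min_len else 0
```

A read whose quality collapses near the 3' end is shortened to the point where the window average first drops below 15. If fewer than 50 bases remain, the read is discarded.
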
R notebooks for QC and downstream analysis. Output files from HPC are transferred to local_results/ before running.
| Notebook | Contents |
|---|---|
| Post_Alignment_QC.ipynb | samtools stats per-sample summaries; per-window coverage (100 bp and 1 kb) relative to origin of replication |
| Assembly_QC.ipynb | QUAST assembly statistics: N50, genome fraction, misassemblies, mismatches/indels per 100 kb |
| GATK_VCF_QC.ipynb | Pre/post-filter annotation distributions (QD, QUAL, FS, SOR, MQ, ReadPosRankSum); depth distributions; filter yield; haploid genotype check |
| GATK_Variant_Analysis.ipynb | Ancestor background subtraction; GATK PASS variants in paper genes; recovery comparison vs Paterson et al. 2023 |
| Minimap_Variant_Analysis.ipynb | Minimap2 variant parsing (QUAL ≥ 20, ALT ≤ 50 bp); ancestor background subtraction; paper mutation recovery |
| MUMmer_Variant_Analysis.ipynb | MUMmer SNP/coords parsing; reference coverage; ancestor background subtraction; paper mutation recovery |
| Paper_Comparison.ipynb | Cross-method overlap heatmap; 14-gene × 8-sample detection matrix across GATK, minimap2, and MUMmer; sample-swap candidate aggregation |
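
The ancestor background subtraction and the minimap2 thresholds listed above (QUAL ≥ 20, ALT ≤ 50 bp) can be sketched roughly as follows. The notebooks themselves are written in R, so this Python version is only an illustration of the idea. The `Variant` type, the function names, and the choice to subtract unfiltered ancestor calls are all assumptions, not the project's code.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    # Hypothetical minimal record; real VCF/TSV rows carry more fields.
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def passes_filter(v, min_qual=20.0, max_alt_len=50):
    """Keep calls with QUAL >= 20 and an ALT allele of at most 50 bp."""
    return v.qual >= min_qual and len(v.alt) <= max_alt_len


def evolved_specific(sample, ancestor):
    """Filter a sample's calls, then drop any call also seen in the ancestor.

    Assumption: ancestor calls are subtracted without quality filtering,
    so that low-confidence ancestral variants still count as background.
    """
    background = {(v.chrom, v.pos, v.ref, v.alt) for v in ancestor}
    return [
        v for v in sample
        if passes_filter(v) and (v.chrom, v.pos, v.ref, v.alt) not in background
    ]
```

Matching on position plus both alleles, rather than on position alone, keeps a new allele at an ancestrally variant site from being discarded as background.
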
- Data acquisition: SRA reads (BioProject PRJNA989548) + BW25113 reference genome (GCF_000750555.1)
- QC and trimming: FastQC/MultiQC on raw and trimmed reads; Trimmomatic adapter/quality trimming
- Reference-based variant calling: bwa-mem2 → GATK HaplotypeCaller haploid GVCF → joint genotyping → hard filtering → SnpEff annotation
- De novo assembly: SPAdes `--isolate` → QUAST QC → Prokka annotation → two independent variant calling methods:
  - minimap2 (`-cx asm5`) + paftools.js assembly-to-reference variant calling → SnpEff annotation
  - MUMmer (`nucmer`, bidirectional best-hit filter, unique-region SNPs) → SnpEff annotation
- Downstream analysis: ancestor background subtraction per method; comparison of all 14 paper genes across all 8 samples; cross-method detection heatmap; identification of potential SRA sample swaps
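
As a rough illustration of the final comparison step, the sketch below builds a gene × sample matrix recording which methods detected a variant in each cell. It is a Python stand-in for the R notebook logic, not the project's code. The function name and the gene and sample labels in the usage example are hypothetical placeholders.

```python
def detection_matrix(calls, genes, samples):
    """Map each (gene, sample) cell to the list of methods that detected it.

    calls   : dict of method name -> set of (gene, sample) pairs with a call
    genes   : genes of interest (14 in the paper comparison)
    samples : sample labels (8 in this project)
    """
    matrix = {g: {s: [] for s in samples} for g in genes}
    for method in sorted(calls):  # sorted for a deterministic method order
        for gene, sample in calls[method]:
            # Ignore calls outside the genes and samples being compared.
            if gene in matrix and sample in matrix[gene]:
                matrix[gene][sample].append(method)
    return matrix


# Usage with placeholder labels (not real genes or sample names):
example = detection_matrix(
    {"GATK": {("geneA", "evolved_1")}, "minimap2": {("geneA", "evolved_1")}},
    genes=["geneA"],
    samples=["evolved_1", "control"],
)
```

Cells supported by several methods are the more confident recoveries. A paper mutation that consistently appears in a different sample than the one reported would be a candidate for the sample-swap check.
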
