The following repository contains code and analysis scripts for INSERT DOI AND LINK(S) HERE.
Many of the alignments generated over the course of this project are too large to host on GitHub; please reach out to us if you'd like to obtain them (see Contact).
- fasterq-dump 3.2.1
- fastp 0.23.2
- minimap2 2.29-r1283
- samtools 1.22.1
- htslib 1.22.1
- python 3.10+
- spades 4.2.0
- GNU parallel 20250322
- pigz 2.8
- gzcat 448.80.1
- kraken2 2.17.1 (the database used in this analysis is too large to host here; please reach out to us if you need it)
- BBTools 39.26
- access to CZID metagenomics pipelines
- at least 500GB of disk space
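Before running anything, it can help to confirm that the command-line tools listed above are on your PATH. The following is a minimal sketch, not part of the repository: the executable names (in particular `spades.py` for SPAdes and `bbduk.sh` as a stand-in for BBTools) are assumptions and may differ on your installation, and the check does not verify versions.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
# htslib is a library and CZID is a web service, so neither is checked here.
REQUIRED_TOOLS = [
    "fasterq-dump", "fastp", "minimap2", "samtools", "spades.py",
    "parallel", "pigz", "gzcat", "kraken2", "bbduk.sh",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_tools():
        print(f"not found on PATH: {name}")
```

An empty result only means that each name resolves to some executable; version numbers still need to be checked by hand.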
There are three main directories:
Code used to automate the download and preprocessing of data from NCBI's Sequence Read Archive and the European Nucleotide Archive.
These directories include dedicated collections of scripts for the following datasets (separated by directory):
- ChIP-Seq (fastqs/chipseq)
- RNA-Seq (fastqs/rnaseq)
Code used to map short reads (not PacBio) from the fastqs folder to a variety of reference genomes, along with analysis scripts. Follows the same one-directory-per-dataset scheme as fastqs.
Has two additional directories:
- denovo_assembly contains just the 4 RNA-Seq SRAs used to generate assemblies of the anellovirus genome.
- extra contains alignments/assemblies of SRAs from other cell lines.
Code used to explore our long-read dataset, derived from sequencing the SKNO-1 cell line from our cultures.
Reference genomes used in the analysis that are small enough to upload, plus instructions for generating the human genome references.
Scripts are generally written to process all .bam or all .fastq.gz files in a directory. Common scripts include assemble.sh (de novo assembly and/or kraken2 classification), map_to_<REF_GENOME>.sh, and dl.sh, which downloads datasets from the SRA or ENA given a text file containing accessions. Different datasets have their own versions of some or all of these scripts, modified for their features (e.g. ChIP-Seq vs. RNA-Seq).
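To illustrate the two conventions described above (an accession list as input, and batch processing of every .bam or .fastq.gz file in a directory), here is a small Python sketch. It is not taken from the repository's scripts: the function names are hypothetical, and the accession file format (one accession per line, blank lines and `#` comments ignored) is an assumption.

```python
from pathlib import Path


def read_accessions(path):
    """Return accessions from a text file.

    Assumed format: one accession per line; blank lines and lines
    starting with '#' are skipped.
    """
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]


def find_inputs(directory, suffixes=(".bam", ".fastq.gz")):
    """Return sorted files directly inside `directory` ending in one of `suffixes`."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.name.endswith(tuple(suffixes)))
```

Listing the inputs this way before launching a script is a cheap way to confirm that it will pick up the files you expect.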
Short- and long-read data must first be downloaded using a script, usually called dl.sh. These reads may then need to be preprocessed, depending on the experiment. Expect to use 200GB of disk space for the short-read data and 100GB for the PacBio data. Depending on the dataset, the reads may then be mapped to a given reference or assembled de novo.
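Given the space estimates above, you may want to check free disk space before starting a download. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and GB is taken to mean 10**9 bytes.

```python
import shutil


def free_gb(path="."):
    """Return free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room(required_gb, path="."):
    """Return True if at least `required_gb` GB are free at `path`."""
    return free_gb(path) >= required_gb
```

For example, `has_room(200 + 100)` checks the current directory against the combined short-read and PacBio estimates; intermediate files will need additional room on top of that.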
Outputs include preprocessed .fastq.gz files, BAM alignments, depth/coverage tables, de novo assemblies in FASTA format, and figures in either .SVG or .PDF format generated from the .Rmd scripts.
All figures are generated by .Rmd files, which we generally run in RStudio. You will need to modify the output paths in these scripts so that figures are saved to locations that exist on your computer.
If you have questions: contact Eli Piliper: epil02 ...[(@)]... uw.edu