Analysis and code for "An Anellovirus integrated into SKNO-1 cell line"

The following repository contains code and analysis scripts for INSERT DOI AND LINK(S) HERE.

Note:

Many of the alignments generated over the course of this project are too large to host on GitHub. Please reach out to us if you'd like to obtain them (see Contact).

Requirements

  • fasterq-dump 3.2.1
  • fastp 0.23.2
  • minimap2 2.29-r1283
  • samtools 1.22.1
  • htslib 1.22.1
  • Python 3.10+
  • SPAdes 4.2.0
  • GNU parallel 20250322
  • pigz 2.8
  • gzcat 448.80.1
  • kraken2 2.17.1 (please reach out to us for the database used in this analysis; we can't host it here due to file size restrictions)
  • BBTools 39.26
  • access to CZID metagenomics pipelines
  • at least 500GB of disk space
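
Before running anything, it can help to confirm that the tools above are on your PATH and that enough disk space is free. The following is a minimal sketch, not part of the repository; `check_tools` and `free_gb` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which command-line tools are missing from PATH.
# Prints one "missing: <tool>" line per absent tool; returns 1 if any are missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical helper: print the free space, in whole GB, on the filesystem
# that holds the given path (defaults to the current directory).
free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}
```

For example, `check_tools fasterq-dump fastp minimap2 samtools spades.py parallel pigz kraken2` lists anything that is not installed (the executable names are assumed from each tool's usual install), and `free_gb .` can be compared against the 500GB requirement.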

Contents

There are four main directories:

fastqs

code used to automate the download and preprocessing of data from NCBI's Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA).

This directory includes dedicated collections of scripts for the following datasets (one directory per dataset):

  • ChIP-Seq (fastqs/chipseq)
  • RNA-Seq (fastqs/rnaseq)

mapping

code used to map short reads (not PacBio) from the fastqs folder to a variety of reference genomes for analysis. Also contains analysis scripts. Follows the same one-directory-per-dataset scheme as fastqs.

Has two additional directories:

  • denovo_assembly contains just the 4 RNA-Seq SRAs used to generate assemblies of the anellovirus genome.
  • extra contains alignments/assemblies of SRAs from other cell lines

pacbio

code used to explore our long-read (PacBio) dataset, generated by sequencing the SKNO-1 cell line from our cultures.

ref_genomes

reference genomes used in analysis that are small enough to be uploaded, plus instructions on how to generate the human genome references.

Scripts

Scripts are generally written to process all .bam or all .fastq.gz files in a directory. Common scripts include assemble.sh (de novo assembly and/or kraken2 classification), map_to_<REF_GENOME>.sh, and dl.sh, which downloads datasets from the SRA or ENA given a text file containing accessions. Each dataset generally has its own versions of some or all of these scripts, modified for that data type (e.g. ChIP-Seq vs RNA-Seq).
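
To illustrate the accession-list pattern, here is a minimal sketch of an SRA download loop. It is an assumption about how such a script could look, not a copy of the repository's dl.sh; `read_accessions` and `download_all` are hypothetical names, and the comment/blank-line handling is our own convention.

```shell
#!/usr/bin/env bash
# Sketch only: the repository's dl.sh may work differently.

# Print accessions from a text file, one per line, stripping surrounding
# whitespace and skipping blank lines and lines that start with '#'.
read_accessions() {
    sed -e 's/[[:space:]]*$//' -e 's/^[[:space:]]*//' "$1" | grep -v -e '^$' -e '^#'
}

# Download each accession with fasterq-dump, then compress the FASTQs with pigz.
# Arguments: accession list file, output directory (default: current directory).
download_all() {
    local list="$1" outdir="${2:-.}" acc
    mkdir -p "$outdir"
    read_accessions "$list" | while IFS= read -r acc; do
        # </dev/null keeps the tool from consuming the accession list on stdin
        fasterq-dump --split-files --outdir "$outdir" "$acc" </dev/null
        pigz "$outdir/$acc"*.fastq
    done
}
```

With a list file named, say, accessions.txt (a hypothetical name), `download_all accessions.txt fastqs/rnaseq` would leave files such as `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz` in the output directory for paired-end runs.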

How to run

Short- and long-read data must first be downloaded using a script usually called dl.sh. These reads may then need to be pre-processed, depending on the experiment. Expect to use 200GB of disk space for the short-read data and 100GB for the PacBio data. Depending on the dataset, the reads may then be mapped to a given reference or assembled de novo.
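
The mapping step for a single paired-end short-read sample can be sketched as follows. This is an illustration under assumptions, not the repository's map_to_<REF_GENOME>.sh: the function names, the `_1.fastq.gz` / `_2.fastq.gz` file naming, and the choice of the minimap2 `sr` preset are ours.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file naming are assumptions.

# Derive a sample name from an R1 file path (assumes *_1.fastq.gz naming).
sample_name() {
    basename "$1" _1.fastq.gz
}

# Map one paired-end sample to a reference, then write a sorted, indexed BAM
# and a per-position depth table next to it.
# Arguments: reference FASTA, R1 FASTQ, R2 FASTQ, output BAM path.
map_pair() {
    local ref="$1" r1="$2" r2="$3" bam="$4"
    minimap2 -ax sr "$ref" "$r1" "$r2" | samtools sort -o "$bam" -
    samtools index "$bam"
    samtools depth -a "$bam" > "${bam%.bam}.depth.tsv"
}
```

Looping over every `*_1.fastq.gz` file in a dataset directory and calling `map_pair` with the matching `_2` file reproduces the "process every file in a directory" pattern the scripts follow.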

Outputs to expect

Preprocessed .fastq.gz files, BAM alignments, depth/coverage tables, de novo assemblies in FASTA format, and figures in either .svg or .pdf format generated from the .Rmd scripts.

Figures

All figures are generated by .Rmd files, which we generally run in RStudio. You will need to change the output paths in these scripts so that figures are saved to locations that exist on your computer.

Contact:

If you have questions, contact Eli Piliper: epil02 ...[(@)]... uw.edu