Analysis and code for "An Anellovirus integrated into SKNO-1 cell line"

The following repository contains code and analysis scripts for INSERT DOI AND LINK(S) HERE.

Note:

Many of the alignments generated over the course of this project are too large to host on GitHub. Please reach out to us if you'd like to obtain them (see Contact).

Requirements

fasterq-dump 3.2.1
fastp 0.23.2
minimap2 2.29-r1283
samtools 1.22.1
htslib 1.22.1
python 3.10+
SPAdes 4.2.0
GNU parallel 20250322
pigz 2.8
gzcat 448.80.1
kraken2 2.17.1 (for a particular database used in this analysis, please reach out to us; we can't host any here due to file size restrictions.)
BBTools 39.26
access to CZID metagenomics pipelines
at least 500GB of disk space
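
A quick way to confirm that the command-line tools listed above are on your PATH is sketched below. The function name `missing_tools` is a hypothetical helper and is not part of this repository. The executable names in the usage line (for example `spades.py` for SPAdes and `parallel` for GNU parallel) are the usual ones, but they are assumptions and may differ with your installation.

```shell
# Hypothetical helper (not part of this repository): print each named
# tool that cannot be found on PATH; return non-zero if any is missing.
missing_tools() {
    local tool rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example (executable names are assumptions):
# missing_tools fasterq-dump fastp minimap2 samtools spades.py parallel pigz kraken2
```

This only checks that each executable can be found; it does not compare versions against the list above.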

Contents

There are four main directories:

`fastqs`

Code used to automate downloading and preprocessing data from NCBI's Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA).

This directory contains a dedicated collection of scripts for each of the following datasets, one subdirectory per dataset:

ChIP-Seq (fastqs/chipseq)
RNA-Seq (fastqs/rnaseq)

`mapping`

Code used to map short reads (not PacBio) from the fastqs folder to a variety of reference genomes, along with analysis scripts. Follows the same one-directory-per-dataset scheme as fastqs.

It has two additional directories:

denovo_assembly contains just the four RNA-Seq SRA runs used to generate assemblies of the anellovirus genome.
extra contains alignments and assemblies of SRA runs from other cell lines.

`pacbio`

Code used to explore our PacBio long-read dataset, generated by sequencing SKNO-1 cells from our own cultures.

`ref_genomes`

Reference genomes used in the analysis that are small enough to be uploaded, plus instructions for generating the human genome references.

Scripts

Scripts are generally written to process every .bam or .fastq.gz file in a directory. Common scripts include assemble.sh (de novo assembly and/or kraken2 classification), map_to_<REF_GENOME>.sh, and dl.sh, which downloads datasets from the SRA or ENA given a text file of accessions. Different dataset types (e.g. ChIP-Seq vs. RNA-Seq) should have their own versions of some or all of these scripts, modified for their features.
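
As an illustration of the accession-list pattern, here is a minimal sketch of an SRA download loop. It is not the repository's dl.sh. The function names, the `DRY_RUN` switch, the thread count, and the file name `accessions.txt` are all assumptions, and ENA downloads are not covered.

```shell
# Hypothetical helper: print one accession per line from a text file,
# dropping carriage returns, trailing whitespace, blank lines, and
# lines that start with '#'.
read_accessions() {
    tr -d '\r' < "$1" | sed 's/[[:space:]]*$//' | grep -v -e '^$' -e '^#'
}

# Hypothetical sketch: download every accession in the list with fasterq-dump.
# Set DRY_RUN=1 to print the commands instead of running them.
download_all() {
    local list="$1" outdir="${2:-.}" acc
    mkdir -p "$outdir"
    read_accessions "$list" | while read -r acc; do
        ${DRY_RUN:+echo} fasterq-dump "$acc" --split-files --outdir "$outdir" --threads 4
    done
}

# Example (accessions.txt is an assumed file name):
# DRY_RUN=1 download_all accessions.txt fastqs/rnaseq
```

The dry-run switch makes it easy to check the accession list before committing to a long download.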

How to run

Short- and long-read data must first be downloaded with a script usually called dl.sh. Depending on the experiment, the reads may then need to be preprocessed. Expect the short-read data to take about 200 GB of disk space and the PacBio data about 100 GB. Depending on the dataset, the reads are then either mapped to a given reference or assembled de novo.
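
The mapping step for paired-end short reads could look roughly like the sketch below. This is not one of the repository's map_to_<REF_GENOME>.sh scripts, which may differ. It assumes mates are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, uses minimap2's short-read preset (`-ax sr`), and picks an arbitrary thread count. The function names and output file names are hypothetical.

```shell
# Hypothetical helper: derive the sample name from an R1 file name,
# assuming mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
sample_name() {
    basename "$1" _1.fastq.gz
}

# Hypothetical sketch: map every read pair in a directory to one reference,
# then write a sorted, indexed BAM and a per-base depth table per sample.
map_dir() {
    local ref="$1" indir="$2" outdir="$3" r1 r2 sample
    mkdir -p "$outdir"
    for r1 in "$indir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue          # no matching files in the directory
        sample=$(sample_name "$r1")
        r2="$indir/${sample}_2.fastq.gz"
        [ -e "$r2" ] || { echo "no mate for $r1, skipping" >&2; continue; }
        minimap2 -ax sr -t 4 "$ref" "$r1" "$r2" \
            | samtools sort -@ 4 -o "$outdir/$sample.bam" -
        samtools index "$outdir/$sample.bam"
        samtools depth -a "$outdir/$sample.bam" > "$outdir/$sample.depth.tsv"
    done
}

# Example (paths are assumptions):
# map_dir ref_genomes/reference.fasta fastqs/rnaseq mapping/rnaseq
```

Single-end datasets would need a variant that passes only one FASTQ file to minimap2.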

Outputs to expect

Preprocessed .fastq.gz files, BAM alignments, depth/coverage tables, de novo assemblies in FASTA format, and figures in either .SVG or .PDF format generated from the .Rmd scripts.

Figures

All figures are generated by .Rmd files, which we generally run in RStudio. You will need to change the output paths in these scripts so that figures are saved to locations that exist on your computer.

Contact:

If you have questions, contact Eli Piliper: epil02 ...[(@)]... uw.edu
