A Nextflow pipeline for running VADR v-scan to annotate nucleotide sequences using NCBI developed VADR model libraries for the following viral species:
- Influenza virus
- Sars-CoV-2 virus
- Respiratory syncytial virus
Nextflow - installation instructions
Docker - installation instructions
The tool compares a FASTA formatted nucleotide sequence to curated reference models to automatically annotate genomic features (.vadr.tbl) and report potential sequence anomalies (error_alert.tsv). If a Submission Template (.sbt) file and a Source Modifiers Table (.src) file are included, ASN.1 Format (.sqn) files will be generated for NCBI GenBank submission of your sequences by email to gb-sub@ncbi.nlm.nih.gov
This pipeline is a containerized application that can automatically scale to utilize the computing resources available (desktop, cloud, or cluster). To run efficiently, it needs a samplesheet in CSV format. This file helps the pipeline process multiple sequences in parallel, depending on the available resources. The samplesheet must indicate the FASTA header and the file path for each query sequence.
Note: the current version of the pipeline requires each query sequence to be stored in a separate FASTA file. Multi-FASTA files are not supported.
You are welcome to split multifasta and create the samplesheet manually, but we also provide scripts to generate it automatically for your convenience.
Download utils.zip and unzip.
Copy and paste the following command into your terminal, replacing <multi_fasta_file.fasta> with the name of your multi-FASTA file, and <output_directory> with the path to the folder where you want to save your individual FASTA files:
python3 utils/split_multi_fasta.py <multi_fasta_file.fasta> <output_directory>
Copy and paste the following command into your terminal, replacing <input_fasta_directory> with the path to the folder containing your FASTA files, and <samplesheet_output.csv> with your desired samplesheet filename:
python3 utils/generate_sample_fasta_csv.py <input_fasta_directory> <samplesheet_output.csv>
sample,fasta
SAMPLE1,/PATH/TO/SAMPLE1.fasta
SAMPLE2,/PATH/TO/SAMPLE2.fasta
The input sequences must be nucleotide sequences from one of the currently supported virus species listed above. In the default version vadr-vscan-ncbi-resp will generate a 5-column annotation table and an annotated sequence in .gb format (with "test" metadata). Copy and paste the following command into your terminal, replacing <samplesheet_output.csv> with actual name of your samplesheet:
nextflow run greninger-lab/vadr-vscan-ncbi-resp -r main -latest --input <samplesheet_output.csv> --outdir ./out -profile docker
However, if you also want to create the genbank submission .sqn files (and .gb files with complete metadata), you should indicate a Submission Template (.sbt) file and a Source Modifiers Table (.src) file. Copy and paste the following command into your terminal, replacing <samplesheet_output.csv>, <submission_template.sbt> and <source_modifiers.src> with your actual filenames:
nextflow run greninger-lab/vadr-vscan-ncbi-resp -r main -latest --input <samplesheet_output.csv> --sbt <submission_template.sbt> --src <source_modifiers.src> --outdir ./out -profile docker
| option | description |
|---|---|
--input /path/to/your/sample_fastas.csv |
(required) path to a csv sample,fasta input file |
--outdir /path/to/output |
(required) output directory |
--vadr_keep |
(optional) keeps all VADR output in the output/vadr directory (SAMPLE_out) |
--sbt <file> |
(optional) path to a GenBank Submission Template (.sbt) file |
--src <file> |
(optional) path to a Source Modifiers Table (.src) file |
-profile docker |
(required) |
--vadr_mem XXGB |
(optional) Override the memory requested for the VADR container (default is 36GB) |
--vadr_cpus N |
(optional) Override the number of CPUs (default is 6) available for VADR |
-c /path/to/your/custom.config |
(optional) used specify a custom configuration file (see Nextflow docs |
Download example.zip
unzip example.zip
cd example
nextflow run greninger-lab/vadr-vscan-ncbi-resp -r main -latest --input example.csv --outdir ./out -profile docker
out
├── pipeline_info
├── summary
| ├── batch_classify_pass_fail.tsv ← ⚠️ Check this file for VADR pass/fail reports on each sequence
| └── batch_error_alert.tsv ← ⚠️ Check here for error alerts for any VADR failed sequences
└── vadr
├── MZ054879.fsa
├── MZ054879.gbf
├── MZ054879_out.vadr.tbl
out
├── pipeline_info
├── summary
| ├── batch_classify_pass_fail.tsv ← ⚠️ Check this file for VADR pass/fail reports on each sequence
| └── batch_error_alert.tsv ← ⚠️ Check here for error alerts for any VADR failed sequences
└── vadr
├── MZ054879.fsa
├── MZ054879.gbf
├── MZ054879.sqn ← Created when --sbt <file> and --src <file> are provided
├── MZ054879_out ← Created when --vadr_keep is provided
│ ├── MZ054879_out.muv.vadr.alc
│ ├── MZ054879_out.muv.vadr.alt ← VADR alert file (see below)
│ ├── MZ054879_out.muv.vadr.alt.list ← VADR alert file (used for generating batch_error_alert.tsv)
│ ├── <additional VADR output files>
├── MZ054879_out.vadr.tbl
VADR v-annotate.pl detects and reports alerts for more than 70 types of unexpected sequence characteristics. Documentation for v-annotate.pl can be found here, and extensive documentation for v-annotate.pl alerts is available here.