DIANA: An integrated pipeline for analysis of long-read whole-genome sequencing data for molecular neuropathology

Overview

Diagnostic Integrated Analytics for Neoplastic Alterations (DIANA) is a bioinformatics pipeline for analyzing neoplastic alterations in long-read whole-genome sequencing data. It integrates CNV detection, methylation profiling, structural variant calling, and MGMT promoter status determination.

Pipeline Schematic

The DIANA pipeline follows a modular architecture with three main Nextflow modules that can be run independently or sequentially:

Pipeline workflow showing the flow from BAM files through Mergebam, Epi2me, and Annotation modules to final PDF reports.

Quick Start

Prerequisites

Docker (Desktop/Local) or Singularity/Apptainer (HPC)
Java 11–21 (auto-installed by setup script if missing; required by Nextflow)
Nextflow (auto-installed by setup script)
Internet connection for downloading reference files from Zenodo
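
The Java requirement can also be checked by hand before running the setup script. The sketch below is only an illustration and is not shipped with DIANA: `java_major_version` and `java_is_supported` are hypothetical helper names, and the parsing assumes the usual `java -version` banner, where the version is the first double-quoted token (for example "17.0.9", or "1.8.0_392" for Java 8).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with DIANA) to check the Java 11-21 requirement.

# Extract the major version from a `java -version` banner,
# e.g. 'openjdk version "17.0.9" 2023-10-17' -> 17, 'java version "1.8.0_392"' -> 8.
java_major_version() {
    local banner="$1" version major
    # The first double-quoted token on the banner holds the version string.
    version=$(printf '%s\n' "$banner" | sed -n 's/^[^"]*"\([^"]*\)".*$/\1/p' | head -n 1)
    [ -n "$version" ] || return 1
    major=${version%%.*}
    # Legacy numbering: 1.8.x means Java 8.
    if [ "$major" = "1" ]; then
        version=${version#1.}
        major=${version%%.*}
    fi
    # Strip any non-numeric suffix such as "-ea".
    major=${major%%[!0-9]*}
    [ -n "$major" ] || return 1
    printf '%s\n' "$major"
}

# Succeed only when the major version lies in the supported range (11-21).
java_is_supported() {
    local major
    major=$(java_major_version "$1") || return 1
    [ "$major" -ge 11 ] && [ "$major" -le 21 ]
}
```

A typical call would be `java_is_supported "$(java -version 2>&1)" && echo "Java OK"`; note that `java -version` writes its banner to standard error, hence the redirect.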

Automated Setup & Run

The pipeline now features a unified setup script that automatically downloads all reference files from Zenodo:

For Docker (Desktop/Local):

git clone https://github.com/VilhelmMagnusLab/Diana.git
cd Diana
./setup_pipeline.sh docker
./run_pipeline_docker.sh --run_mode_order --sample_id YOUR_SAMPLE_ID

For Singularity/Apptainer (HPC):

git clone https://github.com/VilhelmMagnusLab/Diana.git
cd Diana
./setup_pipeline.sh singularity
./run_pipeline_singularity.sh --run_mode_order --sample_id YOUR_SAMPLE_ID

What the setup script does:

Checks for compatible Java (11–21) and installs it if missing
Installs Nextflow and adds it to PATH via .diana_env
Downloads all reference files from Zenodo (DOI: 10.5281/zenodo.19232427)
Extracts and organizes files into the correct directory structure
Downloads and sets up Docker containers or Singularity images

Note: First-time setup downloads ~14 GB of reference data and may take 10-30 minutes depending on your internet connection.

Test with Demo Data

A minimal test dataset (diana_dummy) is automatically downloaded and extracted into data/diana_dummy/ by setup_pipeline.sh. It contains a single sample (diana-001) with a small BAM file and the required final_summary trigger file, mirroring the expected input structure.

After setup completes, run the pipeline on the demo data:

# Docker
bash smart_sample_monitor_v2.sh -d data/diana_dummy

# Singularity/Apptainer
bash smart_sample_monitor_v2.sh --singularity -d data/diana_dummy

The monitor will detect the final_summary file in diana_dummy/diana-001/, trigger the full pipeline, and write results to ~/routine_diana/routine_results/diana-001/.

The sample ID files created by setup_pipeline.sh already contain diana-001 / PBE00000 — no manual configuration needed.

Pipeline Modules

The pipeline consists of three main modules that can be run independently or sequentially:

1. Mergebam Pipeline (`--run_mode_mergebam`)

Merges multiple BAM files per sample
Extracts protein-coding regions of interest using roi.protein_coding.bed

2. Epi2me Pipeline (`--run_mode_epi2me`)

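Three core analysis types (QC statistics and SNV calling are also available through the `stat` and `snv` options; see Usage Examples):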

Analysis	Tool	Purpose	Output
Modified Base Calling	Modkit	DNA modifications (5mC, 5hmC)	`*.wf_mods.bedmethyl.gz`
Structural Variants	Sniffles2	Structural variant detection	`*.sniffles.vcf.gz`
Copy Number Variation	QDNAseq	CNV detection	`*_segs.bed`, `*_bins.bed`, `*_segs.vcf`

PacBio HiFi Data: BAM Alignment Pre-processing

PacBio HiFi BAM files come off the sequencer unaligned. Before running the pipeline (specifically, before modkit modified base calling), the BAM must be aligned to the reference genome. Use the following command:

samtools fastq -T MM,ML /path/to/input.hifi_reads.bam \
    | minimap2 -y -ax map-hifi -t 4 /path/to/GRCh38_reference.fa - \
    | samtools sort -@ 4 -o /path/to/output.hifi_reads.aligned.bam

# Then index the aligned BAM
samtools index /path/to/output.hifi_reads.aligned.bam

Important: The -T MM,ML flag in samtools fastq is required to preserve the base modification tags (MM and ML) that encode 5mC methylation information. Without these tags, modkit will not be able to extract methylation calls.
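
One way to confirm that the tags survived alignment is to count records that still carry them. The helper below is a minimal sketch, not part of DIANA: `count_mod_tagged_reads` is a hypothetical name, and it assumes SAM text on standard input with the tags spelled `MM:Z:` and `ML:B:`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DIANA): count SAM records on stdin that carry
# both an MM and an ML base-modification tag.
count_mod_tagged_reads() {
    awk -F'\t' '
        /^@/ { next }                      # skip SAM header lines
        {
            has_mm = 0; has_ml = 0
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^MM:Z:/) has_mm = 1
                if ($i ~ /^ML:B:/) has_ml = 1
            }
            if (has_mm && has_ml) tagged++
        }
        END { printf "%d\n", tagged + 0 }
    '
}
```

For example, `samtools view /path/to/output.hifi_reads.aligned.bam | head -n 1000 | count_mod_tagged_reads` inspects the first 1000 records; a result of 0 suggests the tags were lost before or during alignment.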

3. Annotation Pipeline (`--run_mode_annotation`)

MGMT methylation analysis using EPIC array sites
NanoDx neural network classification with dual classifier support:
- Capper et al. classifier (default) - Optimized for brain tumors
- Pan-cancer classifier v5i - Broader tumor type coverage (use --pancan flag)
Structural variant annotation with Svanna
SNV annotation with Clair3 (germline) and ClairS-TO (somatic), filtered by configurable Depth and GQ thresholds
CNV analysis with ACE tumor content determination
Comprehensive reporting (HTML, IGV snapshots, Circos plots, Markdown)

Methylation Classifier Selection

The pipeline supports two NanoDx methylation classifiers:

Classifier	Flag	Recommended For	Description
Capper et al.	(default)	Brain tumors	Default classifier optimized for CNS tumor classification
Pan-cancer v5i	`--pancan`	Broader tumor types	Extended classifier covering wider range of tumor types

Example usage:

# Default - Capper et al. classifier
./run_pipeline_singularity.sh --run_mode_order --sample_id SAMPLE_001

# Pan-cancer classifier
./run_pipeline_singularity.sh --run_mode_order --sample_id SAMPLE_001 --pancan

The --pancan flag works with all run modes and can be combined with any pipeline configuration.

Pipeline Run Modes

The pipeline can be executed in different modes:

Mode	Flag	Description	Use Case
Complete Pipeline	`--run_mode_order`	Runs all three modules sequentially (Mergebam → Epi2me → Annotation)	Starting from raw BAM files
Epi2me + Annotation	`--run_mode_epiannotation`	Runs Epi2me and Annotation sequentially (assumes merged BAM files exist)	When BAM files are already merged
Mergebam Only	`--run_mode_mergebam`	Merges BAM files and extracts regions of interest	BAM preparation only
Epi2me Only	`--run_mode_epi2me [all|stat|modkit|cnv|sv|snv]`	Runs specific Epi2me analyses	QC statistics, methylation, CNV, SV, or SNV calling
Annotation Only	`--run_mode_annotation [all|mgmt|cnv|svannasv|terp|snv|rmd]`	Runs specific downstream analyses	Report generation or specific analyses

Container Systems

Feature	Docker	Singularity/Apptainer
Best for	Desktop/Local	HPC/Shared systems
Setup Script	`setup_docker.sh`	`setup_singularity.sh`
Run Script	`run_pipeline_docker.sh`	`run_pipeline_singularity.sh`

All containers are automatically downloaded from vilhelmmagnuslab Docker Hub.

Usage Examples

Complete Pipeline (Recommended)

# Docker - Full pipeline starting from raw BAM files
./run_pipeline_docker.sh --run_mode_order --sample_id T001

# Singularity/Apptainer - Full pipeline starting from raw BAM files
./run_pipeline_singularity.sh --run_mode_order --sample_id T001

Epi2me + Annotation (When BAM files are already merged)

# Docker - Skip mergebam, run Epi2me and Annotation
./run_pipeline_docker.sh --run_mode_epiannotation --sample_id T001

# Singularity/Apptainer - Skip mergebam, run Epi2me and Annotation
./run_pipeline_singularity.sh --run_mode_epiannotation --sample_id T001

Individual Modules

Docker Commands:

# Mergebam only
./run_pipeline_docker.sh --run_mode_mergebam

# Epi2me analyses
./run_pipeline_docker.sh --run_mode_epi2me all          # All Epi2me analyses
./run_pipeline_docker.sh --run_mode_epi2me stat         # QC statistics (cramino) only
./run_pipeline_docker.sh --run_mode_epi2me modkit       # Modified base calling only
./run_pipeline_docker.sh --run_mode_epi2me cnv          # CNV analysis only
./run_pipeline_docker.sh --run_mode_epi2me sv           # Structural variants only
./run_pipeline_docker.sh --run_mode_epi2me snv          # SNV calling (Clair3 + ClairS-TO) only

# Annotation modules
./run_pipeline_docker.sh --run_mode_annotation all        # All analyses
./run_pipeline_docker.sh --run_mode_annotation mgmt       # MGMT analysis only
./run_pipeline_docker.sh --run_mode_annotation cnv        # CNV analysis only
./run_pipeline_docker.sh --run_mode_annotation svannasv   # Svanna SV annotation only
./run_pipeline_docker.sh --run_mode_annotation terp       # TERT promoter analysis only
./run_pipeline_docker.sh --run_mode_annotation snv        # SNV annotation (Clair3 + ClairS-TO) only
./run_pipeline_docker.sh --run_mode_annotation rmd        # Markdown report only

Singularity/Apptainer Commands:

# Mergebam only
./run_pipeline_singularity.sh --run_mode_mergebam

# Epi2me analyses
./run_pipeline_singularity.sh --run_mode_epi2me all          # All Epi2me analyses
./run_pipeline_singularity.sh --run_mode_epi2me stat         # QC statistics (cramino) only
./run_pipeline_singularity.sh --run_mode_epi2me modkit       # Modified base calling only
./run_pipeline_singularity.sh --run_mode_epi2me cnv          # CNV analysis only
./run_pipeline_singularity.sh --run_mode_epi2me sv           # Structural variants only
./run_pipeline_singularity.sh --run_mode_epi2me snv          # SNV calling (Clair3 + ClairS-TO) only

# Annotation modules
./run_pipeline_singularity.sh --run_mode_annotation all        # All analyses
./run_pipeline_singularity.sh --run_mode_annotation mgmt       # MGMT analysis only
./run_pipeline_singularity.sh --run_mode_annotation cnv        # CNV analysis only
./run_pipeline_singularity.sh --run_mode_annotation svannasv   # Svanna SV annotation only
./run_pipeline_singularity.sh --run_mode_annotation terp       # TERT promoter analysis only
./run_pipeline_singularity.sh --run_mode_annotation snv        # SNV annotation (Clair3 + ClairS-TO) only
./run_pipeline_singularity.sh --run_mode_annotation rmd        # Markdown report only

Input Requirements

Sample ID File Format

# For annotation pipeline (with tumor content)
sample_id1   0.75    # 75% tumor content
sample_id2          # Auto-calculate with ACE

# For mergebam pipeline (with flowcell)
sample_id1   flowcell_id1
sample_id2   flowcell_id2
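
To see how the annotation-style file would be interpreted, the sketch below lists each sample with either its given tumor content or `auto`. It is not DIANA's own parser: `list_tumor_content` is a hypothetical name, and it assumes whitespace-separated columns and that everything after `#` is a comment, which may not hold for the real sample ID files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DIANA): list sample IDs and tumor content
# from an annotation-style sample ID file. Assumes whitespace-separated columns
# and that anything after a hash sign is a comment.
list_tumor_content() {
    awk '
        { sub(/#.*/, "") }                 # drop trailing comments
        NF == 0 { next }                   # skip blank or comment-only lines
        {
            # A missing second column means ACE estimates the tumor content.
            content = (NF >= 2) ? $2 : "auto"
            printf "%s\t%s\n", $1, content
        }
    ' "$1"
}
```

Running `list_tumor_content sample_ids.txt` (file name chosen for illustration) prints one tab-separated line per sample.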

Directory Structure

The pipeline uses a standardized directory structure with separate input and output paths:

Pipeline directory:
/data/routine_diana/Diana/
├── conf/                         # Configuration files
│   ├── mergebam.config          # Mergebam module config
│   ├── epi2me.config            # Epi2me module config
│   └── annotation.config        # Annotation module config
├── modules/                      # Nextflow modules
├── containers/                   # Singularity container images
├── bin/                         # Helper scripts
├── docs/                        # Documentation
└── smart_sample_monitor_v2.sh  # Automated monitoring script

Pipeline data directory (configured via params.path):
/data/
├── reference/                    # Reference files (GRCh38, BED files, etc.)
└── humandb/                      # Annotation databases

Input data directory (configured via params.input_dir in mergebam.config):
/data/WGS_[DATE]/                # Oxford Nanopore sequencing output
├── SAMPLE_01/                    # Sample directory
│   └── [subdirectory]/          # Any subdirectory structure
│       ├── *.bam                # BAM files from ONT sequencing
│       ├── *.bam.bai            # BAM index files
│       └── final_summary_*_*_*.txt  # Completion marker file
├── SAMPLE_02/
│   └── [subdirectory]/
│       ├── *.bam
│       ├── *.bam.bai
│       └── final_summary_*_*_*.txt
└── ...

Output directory (configured via params.path_output):
routine_diana/
├── sample_ids_bam.txt           # Sample IDs for BAM merging
│
├── routine_bams/                # Processed BAM files (Mergebam module)
│   ├── merge_bams/              # Merged BAM files per sample
│   └── roi_bams/                # Region of interest extracted BAMs
│
├── routine_epi2me/              # Epi2me module results
│   └── [sample_id]/
│       ├── *.wf_mods.bedmethyl.gz     # Methylation calls (modkit)
│       ├── *.sniffles.vcf.gz          # Structural variants (Sniffles2)
│       ├── *_segs.bed                 # CNV segments (QDNAseq)
│       ├── *_bins.bed                 # CNV bins
│       ├── *_copyNumbersCalled.rds    # CNV RDS file for ACE
│       ├── clair3/                    # Germline SNV calling (Clair3)
│       │   └── *.vcf.gz
│       └── clairs-to/                 # Somatic SNV calling (ClairS-TO)
│           └── *.vcf.gz
│
├── routine_annotation/            # Analysis module results (detailed outputs)
│   └── [sample_id]/
│       ├── classifier/          # Tumor classification
│       │   ├── nanodx/         # NanoDx neural network results
│       │   └── sturgeon/       # Sturgeon methylation classifier
│       ├── cnv/                 # CNV analysis
│       │   ├── ace/            # ACE tumor content estimation
│       │   ├── annotatedcnv/   # Annotated CNV calls
│       │   └── *.pdf           # CNV plots (chr7, chr9, full genome)
│       ├── coverage/            # IGV coverage snapshots
│       │   ├── *_egfr_coverage.pdf
│       │   ├── *_idh1_coverage.pdf
│       │   ├── *_idh2_coverage.pdf
│       │   └── *_tertp_coverage.pdf
│       ├── cramino/             # BAM statistics
│       │   └── *_cramino_statistics.txt
│       ├── merge_annot_clair3andclairsto/  # Variant annotation
│       │   └── *_merge_annotation_filter_snvs_allcall.csv
│       ├── methylation/         # MGMT methylation analysis
│       │   └── *_MGMT_results.csv
│       └── structure_variant/   # SV annotation
│           ├── *_circos.pdf    # Circos plot
│           ├── *_fusion_events.tsv  # Fusion events
│           └── *_svanna_annotation.html  # Svanna SV annotation
│
└── routine_results/             # Final published reports (per sample)
    └── [sample_id]/
        ├── [sample_id]_bedmethyl_sturgeon_general.pdf  # Sturgeon classification
        ├── [sample_id]_markdown_pipeline_report.pdf    # Main comprehensive report
        ├── [sample_id]_mnpflex_input.bed               # MNP-Flex input format
        ├── [sample_id]_occ_svanna_annotation.html      # SV annotation HTML
        └── [sample_id]_tsne_plot.html                  # t-SNE visualization

Required Reference Data

Automated Download (Recommended)

The setup_pipeline.sh script automatically downloads and sets up all required reference files from Zenodo.

Simply run:

./setup_pipeline.sh docker    # For Docker users
# or
./setup_pipeline.sh singularity    # For Singularity users

The script will:

Download reference data from Zenodo (DOI: 10.5281/zenodo.19232427)
Extract and organize all files into the correct directory structure
Set up NanoDx classifier models
Configure all required paths

Manual Setup (Advanced Users Only)

If you prefer manual setup or need to customize the reference files:

Core reference files (automatically placed in data/reference/):

reference_core.tar.gz - Contains GRCh38 reference genome, BED files, and annotations including:
- GRCh38.fa and GRCh38.fa.fai - Human reference genome
- EPIC_sites_NEW.bed - Methylation sites
- MGMT_CpG_Island.hg38.bed - MGMT CpG islands
- roi.protein_coding.bed - Region of interest BED file (protein-coding genes for SNV screening and BAM extraction)
- TERTp_variants.bed - TERT promoter variants
- human_GRCh38_trf.bed - Tandem repeat regions
- CNV_genes_tuned.csv - CNV gene annotations
- occ_fusions_genes.txt - User-defined region of interest gene list for SV/fusion filtering and SNV annotation (one gene per line; can be replaced with any custom gene list)
- nanoDx/ - NanoDx neural network classifier (with models from Zenodo)

Annotation databases (automatically placed in data/humandb/):

humandb.tar.gz - Contains ANNOVAR annotation databases:
- hg38_refGene.txt - RefGene annotation
- hg38_refGeneMrna.fa - RefGene mRNA sequences
- hg38_clinvar_20240611.txt - ClinVar annotations
- hg38_cosmic100coding2024.txt - Cosmic annotations

Additional reference files (automatically extracted to data/reference/):

general.zip - Sturgeon classifier model (kept as zip, not extracted)
Assembly.zip - Assembly folder for vcfcircos visualization (automatically extracted)
r1041_e82_400bps_sup_v420.zip - ONT basecalling model for ClairS-TO (automatically extracted)
svanna-data.zip - Svanna structural variant annotation database (optional, automatically extracted)

Note on roi.protein_coding.bed: This ROI BED file uses OCC (Onco-Comprehensive-Coverage) genes but can be substituted with any custom ROI BED file. It is used for:

Extracting regions of interest during BAM merging (mergebam module)
SNV screening regions for variant calling (ClairS-TO analysis)

A custom file must be in BED format with exactly 10 tab-separated fields per line.
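
The 10-field requirement is easy to verify before swapping in a custom file. The following is a minimal sketch, not part of DIANA: `check_roi_bed_fields` is a hypothetical name, and it treats every line (including blank, `track`, or comment lines) as a record, so such lines will be reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DIANA): report lines of a BED file that do
# not have exactly 10 tab-separated fields. The exit status is non-zero if any
# such line is found.
check_roi_bed_fields() {
    awk -F'\t' -v expected=10 '
        NF != expected {
            printf "line %d: %d fields (expected %d)\n", NR, NF, expected
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A call such as `check_roi_bed_fields data/reference/roi.protein_coding.bed && echo "BED OK"` prints nothing but the confirmation when every line has 10 fields.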

Note on occ_fusions_genes.txt: Plain-text gene list (one gene symbol per line) used for SV/fusion event filtering and SNV annotation. This file can be replaced with any custom gene list of interest — for example, a laboratory-specific panel of oncology-relevant genes. The default list contains 204 genes covering common fusion partners and oncogenes.

Manual download: If needed, all reference files are available at Zenodo (DOI: 10.5281/zenodo.19232427)

Directory Structure Setup

After downloading the reference files, your directory structure should look like this:

data/
├── reference/                    # Reference files
│   ├── GRCh38.fa
│   ├── GRCh38.fa.fai
│   ├── gencode.v48.annotation.gff3
│   ├── Assembly/                # Assembly folder for vcfcircos (from Zenodo)
│   ├── EPIC_sites_NEW.bed
│   ├── MGMT_CpG_Island.hg38.bed
│   ├── roi.protein_coding.bed
│   ├── TERTp_variants.bed
│   ├── human_GRCh38_trf.bed
│   ├── CNV_genes_tuned.csv
│   ├── occ_fusions_genes.txt
│   └── etc
│
└── humandb/                     # Annotation databases
    ├── hg38_refGene.txt
    ├── hg38_refGeneMrna.fa
    ├── hg38_clinvar_20240611.txt
    └── hg38_cosmic100coding2024.txt

ACE Tumor Content Calculation

The pipeline handles tumor content as follows:

Provided value: Use directly if specified in sample ID file
Auto-calculation: ACE analyzes copy number profiles to estimate tumor cellularity
Multiple estimates: ACE provides several estimates and selects the best fit
Results: Saved in ${sample_id}_ace_results/threshold_value.txt

Report Generation

Standard Report Generation

PDF reports are automatically generated when running the pipeline with the following modes:

--run_mode_annotation rmd - Generate reports only
--run_mode_order - Run complete pipeline sequentially and generate reports
--run_mode_epiannotation - Run Epi2me and annotation modules and generate reports

The reports are automatically created in the routine_results/{sample_id}/ directory with the name {sample_id}_markdown_pipeline_report.pdf.
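
Given that naming scheme, a finished run can be checked from the shell. This is only a sketch under stated assumptions, not DIANA code: `report_exists` is a hypothetical name, and its first argument is assumed to be the directory configured as params.path_output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DIANA): check that the main PDF report exists
# and is non-empty for a sample. OUTPUT_DIR corresponds to params.path_output.
# Usage: report_exists OUTPUT_DIR SAMPLE_ID
report_exists() {
    local output_dir="$1" sample_id="$2"
    local report="$output_dir/routine_results/$sample_id/${sample_id}_markdown_pipeline_report.pdf"
    # -s is true only for an existing file with size greater than zero.
    [ -s "$report" ]
}
```

For instance, `report_exists /data/routine_diana T001 || echo "report missing"` flags a sample whose report was not produced.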

Additional Report Generation

The generate_report.sh script is provided for additional report generation in cases where:

You want to regenerate reports after re-running specific processes
You need to create reports for samples that were processed separately
You need to generate reports after the pipeline has already completed

Configuration

Path Configuration

The pipeline uses three main path parameters that must be configured:

1. Pipeline Data Path (params.path) - Reference files and databases

// conf/annotation.config, conf/epi2me.config, conf/mergebam.config
params {
    path = "/data/routine_diana/Diana/data"
    // Contains: reference/, humandb/ directories
}

2. Input Data Path (params.input_dir) - ONT sequencing output

// conf/mergebam.config
params {
    input_dir = "/data/WGS_27102025"
    // Contains: Sample directories with BAM files
    // Can be overridden via CLI: --input_dir or smart_sample_monitor -d
}

3. Output Path (params.path_output) - Pipeline results

// conf/mergebam.config, conf/epi2me.config, conf/annotation.config
params {
    path_output = "/data/routine_diana"
    // Contains: sample_ids_bam.txt, routine_bams/, routine_epi2me/, routine_annotation/, routine_results/
}

Key Points:

params.path: Reference data (rarely changes)
params.input_dir: ONT sequencing input (changes per run)
params.path_output: Where all results are stored (consistent location)
The input_dir can be overridden using --input_dir flag or smart_sample_monitor_v2.sh -d
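
To double-check which values a config file currently holds, a small shell helper can read them back. This is a minimal sketch, not part of DIANA: `get_config_param` is a hypothetical name, and it assumes each parameter sits on its own line in the form `name = "value"` with double quotes, as in the snippets above. Values built from other variables or written with single quotes are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DIANA): print the double-quoted value assigned
# to a parameter in a Nextflow config file.
# Usage: get_config_param NAME CONFIG_FILE
get_config_param() {
    local name="$1" config="$2"
    # Match lines of the form:  NAME = "value"   (commented-out lines do not match).
    sed -n "s/^[[:space:]]*${name}[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*$/\1/p" "$config" | head -n 1
}
```

As an example, `get_config_param path_output conf/mergebam.config` would print the configured output directory, which can then be tested with `[ -d ... ]`.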

SNV Filtering Configuration

The pipeline includes configurable quality thresholds for SNV filtering in the final reports:

// conf/annotation.config
params {
    snv_depth_threshold = 10    // Minimum sequencing depth (default: 10)
    snv_gq_threshold = 10       // Minimum Genotype Quality (default: 10)
}

How Filtering Works:

Depth threshold: Filters out variants with sequencing depth below the threshold
GQ threshold: For variants with multiple GQ values from different callers (e.g., "20,26,41"), keeps the variant if ANY value meets the threshold
Both filters must pass for a variant to appear in the final report

Examples:

# Stricter filtering (higher quality variants only)
snv_depth_threshold = 15
snv_gq_threshold = 20

# More permissive filtering (include more variants)
snv_depth_threshold = 5
snv_gq_threshold = 5

Note: These thresholds only affect the variants shown in the Markdown PDF reports. The raw VCF files contain all called variants regardless of these filters.
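
The filter rule described under "How Filtering Works" can be written out as a short function. This is a hypothetical sketch, not DIANA's actual implementation: `snv_passes` is an invented name, depth and GQ values are assumed to be integers, and "meets the threshold" is read as greater than or equal.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not DIANA's actual code) of the SNV filter rule.
# Usage: snv_passes DEPTH GQ_LIST [DEPTH_THRESHOLD] [GQ_THRESHOLD]
# GQ_LIST is a single value or a comma-separated list such as "20,26,41".
snv_passes() {
    local depth="$1" gq_list="$2"
    local depth_threshold="${3:-10}" gq_threshold="${4:-10}"
    local gq
    local -a gq_values

    # Depth filter: reject variants below the minimum depth.
    [ "$depth" -ge "$depth_threshold" ] || return 1

    # GQ filter: keep the variant if ANY caller's GQ meets the threshold.
    IFS=',' read -r -a gq_values <<< "$gq_list"
    for gq in "${gq_values[@]}"; do
        [ "$gq" -ge "$gq_threshold" ] && return 0
    done
    return 1
}
```

With a GQ threshold of 30, a variant at depth 12 with GQ values "20,26,41" passes because one caller reports 41, whereas the same variant with "20,26" is dropped.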

Container Configuration

Choose your preferred container engine and run the unified setup script:

# For Docker
./setup_pipeline.sh docker

# For Singularity/Apptainer
./setup_pipeline.sh singularity

The setup script handles Java, Nextflow, reference files, and container images in one step.

Work Directory Customization

You can specify a custom temporary work directory using the -w flag. This is useful for:

Managing disk space on different storage locations
Avoiding permission issues
Organizing temporary files

Example:

# Docker
./run_pipeline_docker.sh --run_mode_annotation terp -w /path/to/your/work/dir

# Singularity/Apptainer
./run_pipeline_singularity.sh --run_mode_annotation terp -w /path/to/your/work/dir

Note: The -w flag sets Nextflow's work directory, where temporary files and intermediate results are stored during pipeline execution. By default, Nextflow creates a folder named work in the directory from which it is launched.

Automated Sample Monitoring

The pipeline includes smart_sample_monitor_v2.sh for automated monitoring and processing of Oxford Nanopore sequencing runs. This intelligent script continuously monitors sample directories and automatically triggers the pipeline when sequencing completes.

Key Features:

Monitoring & Execution:

Real-time Monitoring: Watches for final_summary_*_*_*.txt files indicating completed sequencing
Automatic Pipeline Triggering: Starts processing immediately when samples are ready
Sequential Processing: Processes one sample at a time, queuing others
Markdown Report Validation: Verifies successful completion before marking as done

Version 2 Enhancements:

CLI Data Directory Override: --data-dir takes precedence over mergebam.config
Resume Control: Disabled by default for fresh runs; use -r to enable caching
Symlink Resolution: Works correctly when installed as global command
Portable Execution: Automatically finds pipeline directory from any location
Sample IDs File: Hardcoded to /data/routine_diana/sample_ids_bam.txt

Basic Usage:

# Run from pipeline directory with default config (auto-detects Singularity or Docker)
./smart_sample_monitor_v2.sh

# Monitor specific data directory (overrides config)
./smart_sample_monitor_v2.sh -d /data/WGS_27102025

# Enable resume for cached results
./smart_sample_monitor_v2.sh -d /data/WGS_27102025 -r

# Verbose logging
./smart_sample_monitor_v2.sh -d /data/WGS_27102025 -v

# Combination: resume + verbose
./smart_sample_monitor_v2.sh -d /data/WGS_27102025 -r -v

# Force Docker (useful when both Docker and Singularity are available)
./smart_sample_monitor_v2.sh --docker -d /data/WGS_27102025

# Force Singularity/Apptainer
./smart_sample_monitor_v2.sh --singularity -d /data/WGS_27102025

# Explicit engine flag (equivalent to --docker / --singularity)
./smart_sample_monitor_v2.sh -e docker -d /data/WGS_27102025 -r -v
./smart_sample_monitor_v2.sh -e singularity -d /data/WGS_27102025 -r -v

Global Command Installation:

Install the monitor as a global command accessible from any directory:

User-level installation (Recommended - No sudo required):

# Create user bin directory and symbolic link
mkdir -p ~/bin
ln -sf /data/routine_diana/Diana/smart_sample_monitor_v2.sh ~/bin/smart_sample_monitor

# Add ~/bin to PATH (run once)
cat >> ~/.bashrc << 'EOF'

# Add user's bin directory to PATH
if [ -d "$HOME/bin" ]; then
    export PATH="$HOME/bin:$PATH"
fi
EOF

# Activate changes
source ~/.bashrc

# Verify installation
which smart_sample_monitor

System-wide installation (Requires sudo):

sudo ln -sf /data/routine_diana/Diana/smart_sample_monitor_v2.sh /usr/local/bin/smart_sample_monitor

Then use from anywhere:

# Run from any directory
cd /tmp
smart_sample_monitor -d /data/WGS_27102025 -v

# Monitor with custom work directory
smart_sample_monitor -d /data/WGS_27102025 -w /data/trash -r

# Force Docker from anywhere
smart_sample_monitor --docker -d /data/WGS_27102025 -v

Command-Line Options:

Option	Long Form	Description	Default
`-d`	`--data-dir`	Base data directory (overrides config)	Auto-detect from config
`-p`	`--pipeline`	Pipeline base directory	Auto-detected
`-w`	`--workdir`	Nextflow work directory	`/data/trash`
`-c`	`--config`	Config file to parse	`conf/mergebam.config`
`-i`	`--interval`	Check interval in seconds	300 (5 min)
`-t`	`--timeout`	Maximum wait time in seconds	432000 (5 days)
`-e`	`--engine`	Container engine: `singularity`, `apptainer`, or `docker`	Auto-detect
	`--docker`	Shorthand for `--engine docker`	-
	`--singularity`	Shorthand for `--engine singularity`	-
`-r`	`--resume`	Enable Nextflow resume	Disabled
`-v`	`--verbose`	Enable verbose logging	Disabled
`-h`	`--help`	Show help message	-

Workflow:

Initialize: Load sample IDs from /data/routine_diana/sample_ids_bam.txt
Monitor: Check each sample directory for final_summary_*_*_*.txt
Queue: Mark ready samples for processing
Execute: Run --run_mode_order for each sample sequentially
Validate: Check for markdown report generation
Report: Display final status summary
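
The Monitor and Queue steps can be pictured with the following sketch. It is not the code of smart_sample_monitor_v2.sh: `list_ready_samples` is a hypothetical function, and it assumes the sample ID is the first whitespace-separated column of the sample IDs file and that the completion marker may sit at any depth below the sample directory.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the real smart_sample_monitor_v2.sh): print the IDs
# from a sample list whose directory under DATA_DIR already contains a
# final_summary_*_*_*.txt completion marker.
# Usage: list_ready_samples DATA_DIR SAMPLE_IDS_FILE
list_ready_samples() {
    local data_dir="$1" ids_file="$2"
    local sample_id _rest
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while read -r sample_id _rest || [ -n "$sample_id" ]; do
        # Skip blank lines and comment lines.
        case "$sample_id" in ''|'#'*) continue ;; esac
        [ -d "$data_dir/$sample_id" ] || continue
        # -print -quit stops at the first marker found.
        if [ -n "$(find "$data_dir/$sample_id" -type f -name 'final_summary_*_*_*.txt' -print -quit)" ]; then
            printf '%s\n' "$sample_id"
        fi
    done < "$ids_file"
}
```

Calling this function in a loop with a 300-second sleep between passes would mirror the default check interval listed in the options table.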

Use Case:

This script is essential for routine ONT sequencing workflows where:

Multiple samples complete sequencing at different times
Immediate processing is desired upon completion
Manual monitoring would be time-consuming and error-prone
Consistent processing workflow is required

Instead of manually checking and starting the pipeline for each sample, the monitor automatically detects completion and starts processing immediately, maximizing throughput and reducing manual intervention.

Important: Ensure all paths are correctly configured in conf/mergebam.config:

params.path: Reference data directory
params.input_dir: Default input directory (can be overridden with -d)
params.path_output: Output results directory

See docs/GLOBAL_COMMAND_SETUP.md for detailed installation, troubleshooting, and advanced usage.

Troubleshooting

Common Issues

Container engine conflict: Ensure only one container system is enabled
Missing reference files: Re-run setup_pipeline.sh, or download the files manually from Zenodo (DOI: 10.5281/zenodo.19232427)
Permission issues: Check container and file permissions

Verification Commands

# Check containers
docker images | grep vilhelmmagnuslab          # Docker
ls -la containers/*.sif                        # Singularity

# Test pipeline
./test_pipeline_docker.sh                            # Docker
./test_pipeline_singularity.sh                # Singularity

Support

Documentation:
- DOCKER_SETUP.md - Docker installation and setup
- SINGULARITY_SETUP.md - Singularity/Apptainer setup
- docs/GLOBAL_COMMAND_SETUP.md - Global command installation
Issues: GitHub Issues
Contact:
- Christian Domilongo Bope (chbope@ous-hf.no / christianbope@gmail.com)
- Skarphedinn Halldorsson (skahal@ous-hf.no / skabbi@gmail.com)
- Richard Nagymihaly (ricnag@ous-hf.no)

Citation

If you use this pipeline in your research, please cite:

Bope CD, Nagymihaly R, Halldorsson S, et al. DIANA: Diagnostic Integrated Analytics for Neoplastic Alterations a long-read whole genome sequencing pipeline for molecular neuropathology. 2026. https://doi.org/10.64898/2026.03.25.714119

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

The Diagnostic Integrated Analytics for Neoplastic Alterations (DIANA) pipeline is an investigational research tool that has not undergone full clinical validation. Any clinical use or interpretation of its results is entirely at the discretion and responsibility of the treating physician.
