Adding initial version of ww-fastp WILDS WDL module (#278)

tefirman · emjbishop · web-flow · commit 4c0538c8ee3b · 2026-03-23T11:45:11.000-07:00
* Adding initial version of ww-fastp WILDS WDL module

* Removing unused linting rules from Sprocket config

* Adding fastp_single to ww-fastp test run WDL

* Updating README accordingly

* Apply suggestions from code review

Co-authored-by: Emma Bishop <46635347+emjbishop@users.noreply.github.com>

* Switching imports to URLs

* Switching imports to URLs

---------

Co-authored-by: Emma Bishop <46635347+emjbishop@users.noreply.github.com>
diff --git a/modules/ww-fastp/README.md b/modules/ww-fastp/README.md
@@ -0,0 +1,183 @@
+# ww-fastp Module
+
+[![Project Status: Prototype – Useable, some support, open to feedback, unstable API.](https://getwilds.org/badges/badges/prototype.svg)](https://getwilds.org/badges/#prototype)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+A WILDS WDL module for [fastp](https://github.com/OpenGene/fastp), an ultra-fast all-in-one FASTQ preprocessor. fastp performs quality filtering, adapter trimming, and comprehensive QC reporting in a single pass over the data.
+
+## Overview
+
+This module provides WDL tasks for running fastp on both paired-end and single-end sequencing data. The fastp authors report it to be 2-5 times faster than comparable tools such as Trimmomatic and Cutadapt while performing several preprocessing operations in one pass, including adapter auto-detection, quality filtering, length filtering, and HTML/JSON report generation.
+
+## Module Structure
+
+This module is part of the [WILDS WDL Library](https://github.com/getwilds/wilds-wdl-library) and follows the standard WILDS module structure:
+
+- **Main WDL file**: `ww-fastp.wdl` - Contains task definitions for the module
+- **Test workflow**: `testrun.wdl` - Demonstration workflow for testing and examples
+- **Documentation**: This README with usage examples and parameter descriptions
+
+## Available Tasks
+
+### `fastp_paired`
+
+Run fastp on paired-end FASTQ files for quality filtering, adapter trimming, and QC reporting.
+
+**Inputs:**
+- `sample_name` (String): Name identifier for the sample
+- `r1_fastq` (File): Read 1 input FASTQ file (gzipped or uncompressed)
+- `r2_fastq` (File): Read 2 input FASTQ file (gzipped or uncompressed)
+- `qualified_quality_phred` (Int, default=15): Minimum base quality score for a base to be qualified
+- `length_required` (Int, default=15): Minimum read length after trimming
+- `detect_adapter_for_pe` (Boolean, default=true): Enable auto-detection of adapters for paired-end data
+- `adapter_fasta` (File?, optional): FASTA file with custom adapter sequences
+- `cpu_cores` (Int, default=4): Number of CPU cores allocated for the task
+- `memory_gb` (Int, default=8): Memory allocated for the task in GB
+
+**Outputs:**
+- `r1_trimmed` (File): Trimmed and filtered R1 FASTQ file
+- `r2_trimmed` (File): Trimmed and filtered R2 FASTQ file
+- `html_report` (File): HTML quality control report
+- `json_report` (File): JSON quality control report
+
+### `fastp_single`
+
+Run fastp on single-end FASTQ files for quality filtering, adapter trimming, and QC reporting.
+
+**Inputs:**
+- `sample_name` (String): Name identifier for the sample
+- `fastq` (File): Input FASTQ file (gzipped or uncompressed)
+- `qualified_quality_phred` (Int, default=15): Minimum base quality score for a base to be qualified
+- `length_required` (Int, default=15): Minimum read length after trimming
+- `adapter_fasta` (File?, optional): FASTA file with custom adapter sequences
+- `cpu_cores` (Int, default=4): Number of CPU cores allocated for the task
+- `memory_gb` (Int, default=8): Memory allocated for the task in GB
+
+**Outputs:**
+- `trimmed_fastq` (File): Trimmed and filtered FASTQ file
+- `html_report` (File): HTML quality control report
+- `json_report` (File): JSON quality control report
+
+## Usage as a Module
+
+### Importing into Your Workflow
+
+```wdl
+version 1.0
+
+import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl" as fastp_tasks
+
+struct FastpSample {
+    String name
+    File r1_fastq
+    File r2_fastq
+}
+
+workflow my_preprocessing_pipeline {
+  input {
+    Array[FastpSample] samples
+  }
+
+  scatter (sample in samples) {
+    call fastp_tasks.fastp_paired {
+      input:
+        sample_name = sample.name,
+        r1_fastq = sample.r1_fastq,
+        r2_fastq = sample.r2_fastq
+    }
+  }
+
+  output {
+    Array[File] trimmed_r1 = fastp_paired.r1_trimmed
+    Array[File] trimmed_r2 = fastp_paired.r2_trimmed
+    Array[File] reports = fastp_paired.html_report
+  }
+}
+```
+
+### Advanced Usage Examples
+
+**Custom quality filtering:**
+```wdl
+call fastp_tasks.fastp_paired {
+  input:
+    sample_name = "stringent_sample",
+    r1_fastq = r1_file,
+    r2_fastq = r2_file,
+    qualified_quality_phred = 20,
+    length_required = 50,
+    cpu_cores = 8,
+    memory_gb = 16
+}
+```
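+
+**Single-end reads with custom adapters:**
+
+The examples above only call `fastp_paired`. The following is a minimal sketch of the single-end equivalent, assuming the module is imported from the `main` branch URL shown earlier. The workflow name `single_end_trimming` and its input names are hypothetical; only the `fastp_single` inputs and outputs come from this module. When `adapter_fasta` is left unset, the task omits the `--adapter_fasta` flag.
+
+```wdl
+version 1.0
+
+import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl" as fastp_tasks
+
+# Hypothetical wrapper workflow, for illustration only
+workflow single_end_trimming {
+  input {
+    String sample_name
+    File fastq
+    # Optional: leave unset to run without a custom adapter file
+    File? adapter_fasta
+  }
+
+  call fastp_tasks.fastp_single {
+    input:
+      sample_name = sample_name,
+      fastq = fastq,
+      adapter_fasta = adapter_fasta
+  }
+
+  output {
+    File trimmed_fastq = fastp_single.trimmed_fastq
+    File html_report = fastp_single.html_report
+    File json_report = fastp_single.json_report
+  }
+}
+```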
+
+### Integration Examples
+
+This module integrates seamlessly with other WILDS components:
+- **ww-fastqc**: Can be used alongside fastp for additional QC checks
+- **ww-bwa / ww-star**: Trimmed reads can be passed directly to alignment modules
+
+## Testing the Module
+
+The module includes a test workflow (`testrun.wdl`) that can be run independently:
+
+```bash
+# Using miniWDL
+miniwdl run testrun.wdl
+
+# Using Sprocket
+sprocket run testrun.wdl --entrypoint fastp_example
+
+# Using Cromwell
+java -jar cromwell.jar run testrun.wdl
+```
+
+### Automatic Demo Mode
+
+The test workflow automatically:
+1. Downloads paired-end test FASTQ data using `ww-testdata`
+2. Runs fastp paired-end trimming and quality filtering
+3. Runs fastp single-end trimming and quality filtering (using R1 as input)
+4. Produces trimmed FASTQ files and HTML/JSON QC reports for both modes
+
+## Docker Container
+
+This module uses the `getwilds/fastp:1.1.0` container image, which includes:
+- fastp v1.1.0
+- All necessary system dependencies for FASTQ preprocessing
+
+## Citation
+
+> Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor.
+> Bioinformatics. 2018;34(17):i884-i890.
+> DOI: https://doi.org/10.1093/bioinformatics/bty560
+
+## Parameters and Resource Requirements
+
+### Default Resources
+- **CPU**: 4 cores
+- **Memory**: 8 GB
+- **Runtime**: Scales with input size and thread count; small datasets such as the demo data typically finish within a few minutes
+
+### Resource Scaling
+- `cpu_cores`: fastp supports multi-threading; 4 cores is a good default
+- `memory_gb`: 8 GB is sufficient for most use cases; increase for very large files
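+
+One way to apply this scaling is to derive `memory_gb` from the input size inside the calling workflow. The sketch below is illustrative only: the workflow name `scaled_fastp` and the heuristic (an 8 GB baseline plus 1 GB per GB of combined input) are assumptions, not a documented fastp requirement, so adjust them to your own data.
+
+```wdl
+version 1.0
+
+import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl" as fastp_tasks
+
+# Hypothetical wrapper workflow, for illustration only
+workflow scaled_fastp {
+  input {
+    String sample_name
+    File r1_fastq
+    File r2_fastq
+  }
+
+  # Assumed heuristic: 8 GB baseline plus 1 GB per GB of combined input, rounded up
+  Int scaled_memory_gb = 8 + ceil(size(r1_fastq, "GB") + size(r2_fastq, "GB"))
+
+  call fastp_tasks.fastp_paired {
+    input:
+      sample_name = sample_name,
+      r1_fastq = r1_fastq,
+      r2_fastq = r2_fastq,
+      cpu_cores = 8,
+      memory_gb = scaled_memory_gb
+  }
+
+  output {
+    File r1_trimmed = fastp_paired.r1_trimmed
+    File r2_trimmed = fastp_paired.r2_trimmed
+  }
+}
+```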
+
+## Contributing
+
+To improve this module or report issues:
+1. Fork the [WILDS WDL Library repository](https://github.com/getwilds/wilds-wdl-library)
+2. Make your changes following WILDS conventions
+3. Test thoroughly with the demonstration workflow
+4. Submit a pull request with detailed documentation
+
+## Support and Feedback
+
+For questions about this module or to report issues:
+- Open an issue in the [WILDS WDL Library repository](https://github.com/getwilds/wilds-wdl-library/issues)
+- Contact us via the Fred Hutch Office of the Chief Data Officer (OCDO) at wilds@fredhutch.org
+- See the library's [Contributor Guide](https://github.com/getwilds/wilds-wdl-library/blob/main/.github/CONTRIBUTING.md) for detailed guidelines
+
+## Related Resources
+
+- **[WILDS Docker Library](https://github.com/getwilds/wilds-docker-library)**: Container images used by WDL workflows
+- **[WILDS Documentation](https://getwilds.org/)**: Comprehensive guides and best practices
+- **[WDL Specification](https://openwdl.org/)**: Official WDL language documentation
diff --git a/modules/ww-fastp/testrun.wdl b/modules/ww-fastp/testrun.wdl
@@ -0,0 +1,56 @@
+version 1.0
+
+# Import module in question as well as the testdata module for automatic demo functionality
+import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl" as ww_fastp
+import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-testdata/ww-testdata.wdl" as ww_testdata
+
+# Define data structure for paired-end sample inputs
+struct FastpSample {
+    String name
+    File r1_fastq
+    File r2_fastq
+}
+
+#### TEST WORKFLOW DEFINITION ####
+# Define test workflow to demonstrate fastp module functionality
+
+workflow fastp_example {
+  # Auto-download test data for testing purposes
+  call ww_testdata.download_fastq_data as download_demo_data { }
+
+  # Create samples array using test data
+  Array[FastpSample] final_samples = [
+    {
+      "name": "demo_sample",
+      "r1_fastq": download_demo_data.r1_fastq,
+      "r2_fastq": download_demo_data.r2_fastq
+    }
+  ]
+
+  # Process each sample with fastp paired-end and single-end trimming
+  scatter (sample in final_samples) {
+    call ww_fastp.fastp_paired { input:
+        sample_name = sample.name,
+        r1_fastq = sample.r1_fastq,
+        r2_fastq = sample.r2_fastq,
+        cpu_cores = 1,
+        memory_gb = 4
+    }
+    call ww_fastp.fastp_single { input:
+        sample_name = sample.name,
+        fastq = sample.r1_fastq,
+        cpu_cores = 1,
+        memory_gb = 4
+    }
+  }
+
+  output {
+    Array[File] r1_trimmed = fastp_paired.r1_trimmed
+    Array[File] r2_trimmed = fastp_paired.r2_trimmed
+    Array[File] paired_html_reports = fastp_paired.html_report
+    Array[File] paired_json_reports = fastp_paired.json_report
+    Array[File] single_trimmed = fastp_single.trimmed_fastq
+    Array[File] single_html_reports = fastp_single.html_report
+    Array[File] single_json_reports = fastp_single.json_report
+  }
+}
diff --git a/modules/ww-fastp/ww-fastp.wdl b/modules/ww-fastp/ww-fastp.wdl
@@ -0,0 +1,152 @@
+## WILDS WDL fastp Module
+## Ultra-fast all-in-one FASTQ preprocessor for quality control and adapter trimming
+## fastp performs quality filtering, adapter trimming, and generates QC reports
+## in a single pass over the data
+
+version 1.0
+
+#### TASK DEFINITIONS ####
+
+task fastp_paired {
+  meta {
+    author: "Taylor Firman"
+    email: "tfirman@fredhutch.org"
+    description: "Run fastp on paired-end FASTQ files for quality filtering, adapter trimming, and QC reporting"
+    url: "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl"
+    outputs: {
+        r1_trimmed: "Trimmed and filtered R1 FASTQ file",
+        r2_trimmed: "Trimmed and filtered R2 FASTQ file",
+        html_report: "HTML quality control report",
+        json_report: "JSON quality control report"
+    }
+  }
+
+  parameter_meta {
+    sample_name: "Name identifier for the sample"
+    r1_fastq: "Read 1 input FASTQ file (gzipped or uncompressed)"
+    r2_fastq: "Read 2 input FASTQ file (gzipped or uncompressed)"
+    qualified_quality_phred: "Minimum base quality score for a base to be qualified (default: 15)"
+    length_required: "Minimum read length after trimming (default: 15)"
+    detect_adapter_for_pe: "Enable auto-detection of adapters for paired-end data (default: true)"
+    adapter_fasta: "Optional FASTA file with custom adapter sequences"
+    cpu_cores: "Number of CPU cores allocated for the task (default: 4)"
+    memory_gb: "Memory allocated for the task in GB (default: 8)"
+  }
+
+  input {
+    String sample_name
+    File r1_fastq
+    File r2_fastq
+    Int qualified_quality_phred = 15
+    Int length_required = 15
+    Boolean detect_adapter_for_pe = true
+    File? adapter_fasta
+    Int cpu_cores = 4
+    Int memory_gb = 8
+  }
+
+  command <<<
+    set -eo pipefail
+
+    fastp_cmd=(fastp
+      --in1 "~{r1_fastq}"
+      --in2 "~{r2_fastq}"
+      --out1 "~{sample_name}_R1_trimmed.fastq.gz"
+      --out2 "~{sample_name}_R2_trimmed.fastq.gz"
+      --html "~{sample_name}_fastp.html"
+      --json "~{sample_name}_fastp.json"
+      --qualified_quality_phred ~{qualified_quality_phred}
+      --length_required ~{length_required}
+      --thread ~{cpu_cores})
+
+    if [ "~{detect_adapter_for_pe}" == "true" ]; then
+      fastp_cmd+=(--detect_adapter_for_pe)
+    fi
+
+    if [ -n "~{adapter_fasta}" ]; then
+      fastp_cmd+=(--adapter_fasta "~{adapter_fasta}")
+    fi
+
+    echo "Running: ${fastp_cmd[*]}"
+    "${fastp_cmd[@]}"
+  >>>
+
+  output {
+    File r1_trimmed = "~{sample_name}_R1_trimmed.fastq.gz"
+    File r2_trimmed = "~{sample_name}_R2_trimmed.fastq.gz"
+    File html_report = "~{sample_name}_fastp.html"
+    File json_report = "~{sample_name}_fastp.json"
+  }
+
+  runtime {
+    docker: "getwilds/fastp:1.1.0"
+    cpu: cpu_cores
+    memory: "~{memory_gb} GB"
+  }
+}
+
+task fastp_single {
+  meta {
+    author: "Taylor Firman"
+    email: "tfirman@fredhutch.org"
+    description: "Run fastp on single-end FASTQ files for quality filtering, adapter trimming, and QC reporting"
+    url: "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl"
+    outputs: {
+        trimmed_fastq: "Trimmed and filtered FASTQ file",
+        html_report: "HTML quality control report",
+        json_report: "JSON quality control report"
+    }
+  }
+
+  parameter_meta {
+    sample_name: "Name identifier for the sample"
+    fastq: "Input FASTQ file (gzipped or uncompressed)"
+    qualified_quality_phred: "Minimum base quality score for a base to be qualified (default: 15)"
+    length_required: "Minimum read length after trimming (default: 15)"
+    adapter_fasta: "Optional FASTA file with custom adapter sequences"
+    cpu_cores: "Number of CPU cores allocated for the task (default: 4)"
+    memory_gb: "Memory allocated for the task in GB (default: 8)"
+  }
+
+  input {
+    String sample_name
+    File fastq
+    Int qualified_quality_phred = 15
+    Int length_required = 15
+    File? adapter_fasta
+    Int cpu_cores = 4
+    Int memory_gb = 8
+  }
+
+  command <<<
+    set -eo pipefail
+
+    fastp_cmd=(fastp
+      --in1 "~{fastq}"
+      --out1 "~{sample_name}_trimmed.fastq.gz"
+      --html "~{sample_name}_fastp.html"
+      --json "~{sample_name}_fastp.json"
+      --qualified_quality_phred ~{qualified_quality_phred}
+      --length_required ~{length_required}
+      --thread ~{cpu_cores})
+
+    if [ -n "~{adapter_fasta}" ]; then
+      fastp_cmd+=(--adapter_fasta "~{adapter_fasta}")
+    fi
+
+    echo "Running: ${fastp_cmd[*]}"
+    "${fastp_cmd[@]}"
+  >>>
+
+  output {
+    File trimmed_fastq = "~{sample_name}_trimmed.fastq.gz"
+    File html_report = "~{sample_name}_fastp.html"
+    File json_report = "~{sample_name}_fastp.json"
+  }
+
+  runtime {
+    docker: "getwilds/fastp:1.1.0"
+    cpu: cpu_cores
+    memory: "~{memory_gb} GB"
+  }
+}
diff --git a/sprocket.toml b/sprocket.toml
@@ -1,4 +1,4 @@
 [check]
 deny_warnings = true
 hide_notes = true
-except = ['TodoComment', 'ContainerUri', 'TrailingComma', 'CommentWhitespace', 'UnusedInput']
+except = ['TodoComment', 'ContainerUri', 'UnusedInput']