
Commit 4c0538c

tefirman and emjbishop authored
Adding initial version of ww-fastp WILDS WDL module (#278)
* Adding initial version of ww-fastp WILDS WDL module
* Removing unused linting rules from Sprocket config
* Adding fastp_single to ww-fastp test run WDL
* Updating README accordingly
* Apply suggestions from code review

  Co-authored-by: Emma Bishop <46635347+emjbishop@users.noreply.github.com>
* Switching imports to URLs
* Switching imports to URLs

---------

Co-authored-by: Emma Bishop <46635347+emjbishop@users.noreply.github.com>
1 parent e796614 commit 4c0538c

4 files changed: +392 -1 lines changed


modules/ww-fastp/README.md

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
1+
# ww-fastp Module
2+
3+
[![Project Status: Prototype – Useable, some support, open to feedback, unstable API.](https://getwilds.org/badges/badges/prototype.svg)](https://getwilds.org/badges/#prototype)
4+
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5+
6+
A WILDS WDL module for [fastp](https://github.com/OpenGene/fastp), an ultra-fast all-in-one FASTQ preprocessor. fastp performs quality filtering, adapter trimming, and comprehensive QC reporting in a single pass over the data.
7+
8+
## Overview
9+
10+
This module provides WDL tasks for running fastp on both paired-end and single-end sequencing data. fastp is significantly faster than similar tools (Trimmomatic, Cutadapt) while performing multiple preprocessing operations simultaneously, including adapter auto-detection, quality filtering, length filtering, and HTML/JSON report generation.
11+
12+
## Module Structure
13+
14+
This module is part of the [WILDS WDL Library](https://github.com/getwilds/wilds-wdl-library) and follows the standard WILDS module structure:
15+
16+
- **Main WDL file**: `ww-fastp.wdl` - Contains task definitions for the module
17+
- **Test workflow**: `testrun.wdl` - Demonstration workflow for testing and examples
18+
- **Documentation**: This README with usage examples and parameter descriptions
19+
20+
## Available Tasks
21+
22+
### `fastp_paired`
23+
24+
Run fastp on paired-end FASTQ files for quality filtering, adapter trimming, and QC reporting.
25+
26+
**Inputs:**
27+
- `sample_name` (String): Name identifier for the sample
28+
- `r1_fastq` (File): Read 1 input FASTQ file (gzipped or uncompressed)
29+
- `r2_fastq` (File): Read 2 input FASTQ file (gzipped or uncompressed)
30+
- `qualified_quality_phred` (Int, default=15): Minimum base quality score for a base to be qualified
31+
- `length_required` (Int, default=15): Minimum read length after trimming
32+
- `detect_adapter_for_pe` (Boolean, default=true): Enable auto-detection of adapters for paired-end data
33+
- `adapter_fasta` (File?, optional): FASTA file with custom adapter sequences
34+
- `cpu_cores` (Int, default=4): Number of CPU cores allocated for the task
35+
- `memory_gb` (Int, default=8): Memory allocated for the task in GB
36+
37+
**Outputs:**
38+
- `r1_trimmed` (File): Trimmed and filtered R1 FASTQ file
39+
- `r2_trimmed` (File): Trimmed and filtered R2 FASTQ file
40+
- `html_report` (File): HTML quality control report
41+
- `json_report` (File): JSON quality control report
42+
43+
### `fastp_single`
44+
45+
Run fastp on single-end FASTQ files for quality filtering, adapter trimming, and QC reporting.
46+
47+
**Inputs:**
48+
- `sample_name` (String): Name identifier for the sample
49+
- `fastq` (File): Input FASTQ file (gzipped or uncompressed)
50+
- `qualified_quality_phred` (Int, default=15): Minimum base quality score for a base to be qualified
51+
- `length_required` (Int, default=15): Minimum read length after trimming
52+
- `adapter_fasta` (File?, optional): FASTA file with custom adapter sequences
53+
- `cpu_cores` (Int, default=4): Number of CPU cores allocated for the task
54+
- `memory_gb` (Int, default=8): Memory allocated for the task in GB
55+
56+
**Outputs:**
57+
- `trimmed_fastq` (File): Trimmed and filtered FASTQ file
58+
- `html_report` (File): HTML quality control report
59+
- `json_report` (File): JSON quality control report
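
The output file names follow directly from `sample_name`. The sketch below only restates the naming patterns used in the command blocks of `ww-fastp.wdl`; the two helper functions are hypothetical and not part of the module.

```bash
#!/usr/bin/env bash
# Print the file names each task writes for a given sample name.
# Patterns are copied from the command blocks in ww-fastp.wdl;
# the function names are hypothetical helpers for illustration only.

fastp_paired_outputs() {
  local sample_name="$1"
  printf '%s\n' \
    "${sample_name}_R1_trimmed.fastq.gz" \
    "${sample_name}_R2_trimmed.fastq.gz" \
    "${sample_name}_fastp.html" \
    "${sample_name}_fastp.json"
}

fastp_single_outputs() {
  local sample_name="$1"
  printf '%s\n' \
    "${sample_name}_trimmed.fastq.gz" \
    "${sample_name}_fastp.html" \
    "${sample_name}_fastp.json"
}

fastp_paired_outputs demo_sample
fastp_single_outputs demo_sample
```

Both tasks use the same report names (`<sample_name>_fastp.html` and `<sample_name>_fastp.json`), so if outputs from both tasks for one sample are copied into a single directory, it is probably worth giving the calls distinct `sample_name` values.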
60+
61+
## Usage as a Module
62+
63+
### Importing into Your Workflow
64+
65+
```wdl
66+
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl" as fastp_tasks
67+
68+
struct FastpSample {
69+
String name
70+
File r1_fastq
71+
File r2_fastq
72+
}
73+
74+
workflow my_preprocessing_pipeline {
75+
input {
76+
Array[FastpSample] samples
77+
}
78+
79+
scatter (sample in samples) {
80+
call fastp_tasks.fastp_paired {
81+
input:
82+
sample_name = sample.name,
83+
r1_fastq = sample.r1_fastq,
84+
r2_fastq = sample.r2_fastq
85+
}
86+
}
87+
88+
output {
89+
Array[File] trimmed_r1 = fastp_paired.r1_trimmed
90+
Array[File] trimmed_r2 = fastp_paired.r2_trimmed
91+
Array[File] reports = fastp_paired.html_report
92+
}
93+
}
94+
```
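
To run the workflow above, the `samples` array can be supplied through a JSON inputs file. The following is a minimal sketch, assuming the workflow is saved with a `version 1.0` header; the sample name and FASTQ paths are hypothetical placeholders.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Write an inputs file for my_preprocessing_pipeline.
# Keys follow the "<workflow name>.<input name>" convention;
# the sample name and paths below are hypothetical.
cat > inputs.json <<'EOF'
{
  "my_preprocessing_pipeline.samples": [
    {
      "name": "sample1",
      "r1_fastq": "data/sample1_R1.fastq.gz",
      "r2_fastq": "data/sample1_R2.fastq.gz"
    }
  ]
}
EOF

echo "Wrote inputs.json"
```

The file can then be passed to the runner, for example with `miniwdl run <workflow>.wdl -i inputs.json` or `java -jar cromwell.jar run <workflow>.wdl --inputs inputs.json`.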
95+
96+
### Advanced Usage Examples
97+
98+
**Custom quality filtering:**
99+
```wdl
100+
call fastp_tasks.fastp_paired {
101+
input:
102+
sample_name = "stringent_sample",
103+
r1_fastq = r1_file,
104+
r2_fastq = r2_file,
105+
qualified_quality_phred = 20,
106+
length_required = 50,
107+
cpu_cores = 8,
108+
memory_gb = 16
109+
}
110+
```
111+
112+
### Integration Examples
113+
114+
This module integrates seamlessly with other WILDS components:
115+
- **ww-fastqc**: Can be used alongside fastp for additional QC checks
116+
- **ww-bwa / ww-star**: Trimmed reads can be passed directly to alignment modules
117+
118+
## Testing the Module
119+
120+
The module includes a test workflow (`testrun.wdl`) that can be run independently:
121+
122+
```bash
123+
# Using miniWDL
124+
miniwdl run testrun.wdl
125+
126+
# Using Sprocket
127+
sprocket run testrun.wdl --entrypoint fastp_example
128+
129+
# Using Cromwell
130+
java -jar cromwell.jar run testrun.wdl
131+
```
132+
133+
### Automatic Demo Mode
134+
135+
The test workflow automatically:
136+
1. Downloads paired-end test FASTQ data using `ww-testdata`
137+
2. Runs fastp paired-end trimming and quality filtering
138+
3. Runs fastp single-end trimming and quality filtering (using R1 as input)
139+
4. Produces trimmed FASTQ files and HTML/JSON QC reports for both modes
140+
141+
## Docker Container
142+
143+
This module uses the `getwilds/fastp:1.1.0` container image, which includes:
144+
- fastp v1.1.0
145+
- All necessary system dependencies for FASTQ preprocessing
146+
147+
## Citation
148+
149+
> Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor.
150+
> Bioinformatics. 2018;34(17):i884-i890.
151+
> DOI: https://doi.org/10.1093/bioinformatics/bty560
152+
153+
## Parameters and Resource Requirements
154+
155+
### Default Resources
156+
- **CPU**: 4 cores
157+
- **Memory**: 8 GB
158+
- **Runtime**: Typically under 5 minutes for standard WGS samples
159+
160+
### Resource Scaling
161+
- `cpu_cores`: fastp supports multi-threading; 4 cores is a good default
162+
- `memory_gb`: 8 GB is sufficient for most use cases; increase for very large files
163+
164+
## Contributing
165+
166+
To improve this module or report issues:
167+
1. Fork the [WILDS WDL Library repository](https://github.com/getwilds/wilds-wdl-library)
168+
2. Make your changes following WILDS conventions
169+
3. Test thoroughly with the demonstration workflow
170+
4. Submit a pull request with detailed documentation
171+
172+
## Support and Feedback
173+
174+
For questions about this module or to report issues:
175+
- Open an issue in the [WILDS WDL Library repository](https://github.com/getwilds/wilds-wdl-library/issues)
176+
- Contact us via the Fred Hutch Office of the Chief Data Officer (OCDO) at wilds@fredhutch.org
177+
- See the library's [Contributor Guide](https://github.com/getwilds/wilds-wdl-library/blob/main/.github/CONTRIBUTING.md) for detailed guidelines
178+
179+
## Related Resources
180+
181+
- **[WILDS Docker Library](https://github.com/getwilds/wilds-docker-library)**: Container images used by WDL workflows
182+
- **[WILDS Documentation](https://getwilds.org/)**: Comprehensive guides and best practices
183+
- **[WDL Specification](https://openwdl.org/)**: Official WDL language documentation

modules/ww-fastp/testrun.wdl

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
version 1.0
2+
3+
# Import module in question as well as the testdata module for automatic demo functionality
4+
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/add-fastp-module/modules/ww-fastp/ww-fastp.wdl" as ww_fastp
5+
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/add-fastp-module/modules/ww-testdata/ww-testdata.wdl" as ww_testdata
6+
7+
# Define data structure for paired-end sample inputs
8+
struct FastpSample {
9+
String name
10+
File r1_fastq
11+
File r2_fastq
12+
}
13+
14+
#### TEST WORKFLOW DEFINITION ####
15+
# Define test workflow to demonstrate fastp module functionality
16+
17+
workflow fastp_example {
18+
# Auto-download test data for testing purposes
19+
call ww_testdata.download_fastq_data as download_demo_data { }
20+
21+
# Create samples array using test data
22+
Array[FastpSample] final_samples = [
23+
{
24+
"name": "demo_sample",
25+
"r1_fastq": download_demo_data.r1_fastq,
26+
"r2_fastq": download_demo_data.r2_fastq
27+
}
28+
]
29+
30+
# Process each sample with fastp paired-end and single-end trimming
31+
scatter (sample in final_samples) {
32+
call ww_fastp.fastp_paired { input:
33+
sample_name = sample.name,
34+
r1_fastq = sample.r1_fastq,
35+
r2_fastq = sample.r2_fastq,
36+
cpu_cores = 1,
37+
memory_gb = 4
38+
}
39+
call ww_fastp.fastp_single { input:
40+
sample_name = sample.name,
41+
fastq = sample.r1_fastq,
42+
cpu_cores = 1,
43+
memory_gb = 4
44+
}
45+
}
46+
47+
output {
48+
Array[File] r1_trimmed = fastp_paired.r1_trimmed
49+
Array[File] r2_trimmed = fastp_paired.r2_trimmed
50+
Array[File] paired_html_reports = fastp_paired.html_report
51+
Array[File] paired_json_reports = fastp_paired.json_report
52+
Array[File] single_trimmed = fastp_single.trimmed_fastq
53+
Array[File] single_html_reports = fastp_single.html_report
54+
Array[File] single_json_reports = fastp_single.json_report
55+
}
56+
}

modules/ww-fastp/ww-fastp.wdl

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
1+
## WILDS WDL fastp Module
2+
## Ultra-fast all-in-one FASTQ preprocessor for quality control and adapter trimming
3+
## fastp performs quality filtering, adapter trimming, and generates QC reports
4+
## in a single pass over the data
5+
6+
version 1.0
7+
8+
#### TASK DEFINITIONS ####
9+
10+
task fastp_paired {
11+
meta {
12+
author: "Taylor Firman"
13+
email: "tfirman@fredhutch.org"
14+
description: "Run fastp on paired-end FASTQ files for quality filtering, adapter trimming, and QC reporting"
15+
url: "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl"
16+
outputs: {
17+
r1_trimmed: "Trimmed and filtered R1 FASTQ file",
18+
r2_trimmed: "Trimmed and filtered R2 FASTQ file",
19+
html_report: "HTML quality control report",
20+
json_report: "JSON quality control report"
21+
}
22+
}
23+
24+
parameter_meta {
25+
sample_name: "Name identifier for the sample"
26+
r1_fastq: "Read 1 input FASTQ file (gzipped or uncompressed)"
27+
r2_fastq: "Read 2 input FASTQ file (gzipped or uncompressed)"
28+
qualified_quality_phred: "Minimum base quality score for a base to be qualified (default: 15)"
29+
length_required: "Minimum read length after trimming (default: 15)"
30+
detect_adapter_for_pe: "Enable auto-detection of adapters for paired-end data (default: true)"
31+
adapter_fasta: "Optional FASTA file with custom adapter sequences"
32+
cpu_cores: "Number of CPU cores allocated for the task (default: 4)"
33+
memory_gb: "Memory allocated for the task in GB (default: 8)"
34+
}
35+
36+
input {
37+
String sample_name
38+
File r1_fastq
39+
File r2_fastq
40+
Int qualified_quality_phred = 15
41+
Int length_required = 15
42+
Boolean detect_adapter_for_pe = true
43+
File? adapter_fasta
44+
Int cpu_cores = 4
45+
Int memory_gb = 8
46+
}
47+
48+
command <<<
49+
set -eo pipefail
50+
51+
FASTP_CMD="fastp \
52+
--in1 ~{r1_fastq} \
53+
--in2 ~{r2_fastq} \
54+
--out1 ~{sample_name}_R1_trimmed.fastq.gz \
55+
--out2 ~{sample_name}_R2_trimmed.fastq.gz \
56+
--html ~{sample_name}_fastp.html \
57+
--json ~{sample_name}_fastp.json \
58+
--qualified_quality_phred ~{qualified_quality_phred} \
59+
--length_required ~{length_required} \
60+
--thread ~{cpu_cores}"
61+
62+
if [ "~{detect_adapter_for_pe}" == "true" ]; then
63+
FASTP_CMD="${FASTP_CMD} --detect_adapter_for_pe"
64+
fi
65+
66+
if [ -n "~{adapter_fasta}" ]; then
67+
FASTP_CMD="${FASTP_CMD} --adapter_fasta ~{adapter_fasta}"
68+
fi
69+
70+
echo "Running: ${FASTP_CMD}"
71+
${FASTP_CMD}
72+
>>>
73+
74+
output {
75+
File r1_trimmed = "~{sample_name}_R1_trimmed.fastq.gz"
76+
File r2_trimmed = "~{sample_name}_R2_trimmed.fastq.gz"
77+
File html_report = "~{sample_name}_fastp.html"
78+
File json_report = "~{sample_name}_fastp.json"
79+
}
80+
81+
runtime {
82+
docker: "getwilds/fastp:1.1.0"
83+
cpu: cpu_cores
84+
memory: "~{memory_gb} GB"
85+
}
86+
}
87+
88+
task fastp_single {
89+
meta {
90+
author: "Taylor Firman"
91+
email: "tfirman@fredhutch.org"
92+
description: "Run fastp on single-end FASTQ files for quality filtering, adapter trimming, and QC reporting"
93+
url: "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-fastp/ww-fastp.wdl"
94+
outputs: {
95+
trimmed_fastq: "Trimmed and filtered FASTQ file",
96+
html_report: "HTML quality control report",
97+
json_report: "JSON quality control report"
98+
}
99+
}
100+
101+
parameter_meta {
102+
sample_name: "Name identifier for the sample"
103+
fastq: "Input FASTQ file (gzipped or uncompressed)"
104+
qualified_quality_phred: "Minimum base quality score for a base to be qualified (default: 15)"
105+
length_required: "Minimum read length after trimming (default: 15)"
106+
adapter_fasta: "Optional FASTA file with custom adapter sequences"
107+
cpu_cores: "Number of CPU cores allocated for the task (default: 4)"
108+
memory_gb: "Memory allocated for the task in GB (default: 8)"
109+
}
110+
111+
input {
112+
String sample_name
113+
File fastq
114+
Int qualified_quality_phred = 15
115+
Int length_required = 15
116+
File? adapter_fasta
117+
Int cpu_cores = 4
118+
Int memory_gb = 8
119+
}
120+
121+
command <<<
122+
set -eo pipefail
123+
124+
FASTP_CMD="fastp \
125+
--in1 ~{fastq} \
126+
--out1 ~{sample_name}_trimmed.fastq.gz \
127+
--html ~{sample_name}_fastp.html \
128+
--json ~{sample_name}_fastp.json \
129+
--qualified_quality_phred ~{qualified_quality_phred} \
130+
--length_required ~{length_required} \
131+
--thread ~{cpu_cores}"
132+
133+
if [ -n "~{adapter_fasta}" ]; then
134+
FASTP_CMD="${FASTP_CMD} --adapter_fasta ~{adapter_fasta}"
135+
fi
136+
137+
echo "Running: ${FASTP_CMD}"
138+
${FASTP_CMD}
139+
>>>
140+
141+
output {
142+
File trimmed_fastq = "~{sample_name}_trimmed.fastq.gz"
143+
File html_report = "~{sample_name}_fastp.html"
144+
File json_report = "~{sample_name}_fastp.json"
145+
}
146+
147+
runtime {
148+
docker: "getwilds/fastp:1.1.0"
149+
cpu: cpu_cores
150+
memory: "~{memory_gb} GB"
151+
}
152+
}
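
When a command line is assembled incrementally in bash, as in the tasks above, holding the arguments in an array rather than a flat string keeps each argument intact even if a sample name or path contains spaces. The sketch below illustrates the difference; `count_args` is a hypothetical stand-in for `fastp`, and the sample name is invented for the demonstration.

```bash
#!/usr/bin/env bash
# Sketch: flat command string versus argument array.
# count_args is a hypothetical stand-in for fastp that reports
# how many arguments it received.

count_args() {
  echo "$#"
}

sample_name="my sample"   # hypothetical name containing a space

# Flat string, expanded unquoted: the shell re-splits it on whitespace,
# so the output path is broken into two arguments.
cmd_string="count_args --out1 ${sample_name}_trimmed.fastq.gz"
string_count=$(${cmd_string})

# Array: each element stays exactly one argument when expanded quoted.
cmd_array=(count_args --out1 "${sample_name}_trimmed.fastq.gz")
array_count=$("${cmd_array[@]}")

echo "string form passed ${string_count} arguments"   # 3
echo "array form passed ${array_count} arguments"     # 2
```

Optional flags can be appended in the same style, for example `cmd_array+=(--detect_adapter_for_pe)` inside the existing `if` blocks.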

sprocket.toml

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 [check]
 deny_warnings = true
 hide_notes = true
-except = ['TodoComment', 'ContainerUri', 'TrailingComma', 'CommentWhitespace', 'UnusedInput']
+except = ['TodoComment', 'ContainerUri', 'UnusedInput']
