Skip to content

Commit e5fb56e

Browse files
tefirmanjayoungahberger
authored
Enabling single-end alignment in ww-star, ww-salmon, and related pipelines (#290)
* Allowing for single-ended alignment in ww-star and ww-sra-star * Adjusting conditional logic in ww-sra-star * Enabling single-end alignment functionality to ww-salmon, ww-sra-salmon, and ww-ena-star * Adding single-end sequencing sample to ww-sra-star and ww-sra-salmon test runs * Updating ww-sra-salmon test run import URL * Updating ww-ena-star test run import URL * Fixing single-end alignment functionality * Adding acknowledgment in README's for Fred Hutch contributors Co-Authored-By: Alice Berger <ahberger@fredhutch.org> Co-Authored-By: Janet Young <jayoung@fredhutch.org> * Updating ww-ena README based on EB suggestion * Adjusting acknowledgment sections for ww-sra-star and ww-sra salmon Co-Authored-By: Alice Berger <43681151+ahberger@users.noreply.github.com> --------- Co-authored-by: Janet Young <jayoung@fredhutch.org> Co-authored-by: Alice Berger <43681151+ahberger@users.noreply.github.com>
1 parent 75b337a commit e5fb56e

File tree

17 files changed

+334
-160
lines changed

17 files changed

+334
-160
lines changed

modules/ww-ena/README.md

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -66,17 +66,18 @@ Downloads sequencing data files from ENA using a search query. Allows filtering
6666

6767
### `extract_fastq_pairs`
6868

69-
Extracts R1 and R2 FASTQ files from downloaded ENA files for downstream paired-end processing. This task identifies all paired-end FASTQ files by common naming patterns, creates standardized outputs, and automatically extracts the accession ID from each filename. Supports multiple accessions in a single download.
69+
Extracts FASTQ files from downloaded ENA files for downstream processing. Supports both paired-end and single-end data. This task identifies FASTQ files by common naming patterns, creates standardized outputs, automatically extracts the accession ID from each filename, and detects whether each sample is paired-end or single-end. Supports multiple accessions in a single download.
7070

7171
**Inputs:**
7272
- `downloaded_files` (Array[File]): Array of files downloaded from ENA (typically from `download_files` task)
7373

7474
**Outputs:**
75-
- `r1_files` (Array[File]): Array of Read 1 FASTQ files, parallel with `r2_files` and `accessions`
76-
- `r2_files` (Array[File]): Array of Read 2 FASTQ files, parallel with `r1_files` and `accessions`
77-
- `accessions` (Array[String]): Array of ENA accession IDs extracted from filenames, parallel with `r1_files` and `r2_files`
75+
- `r1_files` (Array[File]): Array of Read 1 FASTQ files, parallel with `r2_files`, `accessions`, and `is_paired_end_list`
76+
- `r2_files` (Array[File]): Array of Read 2 FASTQ files (empty placeholder for single-end samples), parallel with `r1_files`, `accessions`, and `is_paired_end_list`
77+
- `accessions` (Array[String]): Array of ENA accession IDs extracted from filenames, parallel with `r1_files`, `r2_files`, and `is_paired_end_list`
78+
- `is_paired_end_list` (Array[String]): Array of strings ("true"/"false") indicating whether each sample is paired-end, parallel with `r1_files`, `r2_files`, and `accessions`
7879

79-
**Usage Note:** This task is designed for FASTQ workflows requiring separate R1/R2 files. It searches for common paired-end naming patterns including `_1.fastq.gz`/`_2.fastq.gz`, `_R1.fastq.gz`/`_R2.fastq.gz`, and their uncompressed equivalents. The accession ID is automatically extracted from each filename (e.g., `ERR000001_1.fastq.gz``ERR000001`). The output arrays are parallel, meaning `r1_files[i]`, `r2_files[i]`, and `accessions[i]` all correspond to the same sample. If you're downloading other file formats (BAM, analysis files), you don't need this task.
80+
**Usage Note:** This task is designed for FASTQ workflows requiring separate R1/R2 files. It searches for common naming patterns including `_1.fastq.gz`/`_2.fastq.gz`, `_R1.fastq.gz`/`_R2.fastq.gz`, and their uncompressed equivalents. The accession ID is automatically extracted from each filename (e.g., `ERR000001_1.fastq.gz``ERR000001`). The output arrays are parallel, meaning `r1_files[i]`, `r2_files[i]`, `accessions[i]`, and `is_paired_end_list[i]` all correspond to the same sample. If you're downloading other file formats (BAM, analysis files), you don't need this task.
8081

8182
## Usage as a Module
8283

modules/ww-ena/ww-ena.wdl

Lines changed: 55 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -177,12 +177,13 @@ task extract_fastq_pairs {
177177
meta {
178178
author: "Taylor Firman"
179179
email: "tfirman@fredhutch.org"
180-
description: "Extract R1 and R2 FASTQ file pairs from ENA downloads for downstream processing."
180+
description: "Extract FASTQ files from ENA downloads for downstream processing. Supports both paired-end and single-end data."
181181
url: "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-ena/ww-ena.wdl"
182182
outputs: {
183-
r1_files: "Array of Read 1 FASTQ files, parallel with r2_files and accessions",
184-
r2_files: "Array of Read 2 FASTQ files, parallel with r1_files and accessions",
185-
accessions: "Array of ENA accession IDs extracted from filenames, parallel with r1_files and r2_files"
183+
r1_files: "Array of Read 1 FASTQ files, parallel with accessions",
184+
r2_files: "Array of Read 2 FASTQ files (empty file for single-end samples), parallel with r1_files and accessions",
185+
accessions: "Array of ENA accession IDs extracted from filenames, parallel with r1_files and r2_files",
186+
is_paired_end_list: "Array of booleans indicating whether each sample is paired-end"
186187
}
187188
}
188189

@@ -208,55 +209,67 @@ task extract_fastq_pairs {
208209
# Look for patterns: *_1.fastq.gz, *_R1.fastq.gz, *_1.fq.gz, *_R1.fq.gz, etc.
209210
R1_FILES=$(ls ~{sep=' ' downloaded_files} | grep -E "(_1\.fastq|_R1\.fastq|_1\.fq|_R1\.fq)" | sort || echo "")
210211
211-
if [ -z "$R1_FILES" ]; then
212-
echo "ERROR: Could not identify any R1 FASTQ files"
213-
echo "Looking for files matching pattern *_1.fastq.gz or *_R1.fastq.gz"
214-
echo "Available files:"
215-
ls -lh ~{sep=' ' downloaded_files}
216-
exit 1
217-
fi
218-
219212
# Initialize output files
220213
> r1_files.txt
221214
> r2_files.txt
222215
> accessions.txt
216+
> is_paired_end.txt
217+
218+
if [ -n "$R1_FILES" ]; then
219+
# Paired-end: process each R1 file and find its matching R2
220+
echo "$R1_FILES" | while read R1_FILE; do
221+
echo "Processing R1: $R1_FILE"
222+
223+
# Extract accession ID from filename (everything before _1 or _R1)
224+
BASENAME=$(basename "$R1_FILE")
225+
ACCESSION=$(echo "$BASENAME" | sed -E 's/(_1\.fastq|_R1\.fastq|_1\.fq|_R1\.fq).*//')
226+
echo "Extracted accession: $ACCESSION"
227+
228+
# Find matching R2 file
229+
R2_FILE=$(ls ~{sep=' ' downloaded_files} | grep -E "^.*${ACCESSION}(_2\.fastq|_R2\.fastq|_2\.fq|_R2\.fq)" | head -1 || echo "")
230+
231+
if [ -z "$R2_FILE" ]; then
232+
echo "WARNING: No matching R2 file for accession: $ACCESSION, treating as single-end"
233+
cp "$R1_FILE" "fastq_pairs/${ACCESSION}_r1.fastq.gz"
234+
touch "fastq_pairs/${ACCESSION}_r2.fastq.gz"
235+
echo "false" >> is_paired_end.txt
236+
else
237+
echo "Matched R2: $R2_FILE"
238+
cp "$R1_FILE" "fastq_pairs/${ACCESSION}_r1.fastq.gz"
239+
cp "$R2_FILE" "fastq_pairs/${ACCESSION}_r2.fastq.gz"
240+
echo "true" >> is_paired_end.txt
241+
fi
242+
243+
echo "fastq_pairs/${ACCESSION}_r1.fastq.gz" >> r1_files.txt
244+
echo "fastq_pairs/${ACCESSION}_r2.fastq.gz" >> r2_files.txt
245+
echo "$ACCESSION" >> accessions.txt
246+
done
247+
else
248+
# Single-end: no R1 pattern found, treat all FASTQ files as single-end reads
249+
echo "No paired-end naming pattern found, treating files as single-end"
250+
for FASTQ_FILE in $(ls ~{sep=' ' downloaded_files} | grep -E "\.(fastq|fq)" | sort); do
251+
BASENAME=$(basename "$FASTQ_FILE")
252+
ACCESSION=$(echo "$BASENAME" | sed -E 's/\.(fastq|fq).*//')
253+
echo "Processing single-end: $ACCESSION"
254+
255+
cp "$FASTQ_FILE" "fastq_pairs/${ACCESSION}_r1.fastq.gz"
256+
touch "fastq_pairs/${ACCESSION}_r2.fastq.gz"
257+
258+
echo "fastq_pairs/${ACCESSION}_r1.fastq.gz" >> r1_files.txt
259+
echo "fastq_pairs/${ACCESSION}_r2.fastq.gz" >> r2_files.txt
260+
echo "$ACCESSION" >> accessions.txt
261+
echo "false" >> is_paired_end.txt
262+
done
263+
fi
223264
224-
# Process each R1 file and find its matching R2
225-
echo "$R1_FILES" | while read R1_FILE; do
226-
echo "Processing R1: $R1_FILE"
227-
228-
# Extract accession ID from filename (everything before _1 or _R1)
229-
BASENAME=$(basename "$R1_FILE")
230-
ACCESSION=$(echo "$BASENAME" | sed -E 's/(_1\.fastq|_R1\.fastq|_1\.fq|_R1\.fq).*//')
231-
echo "Extracted accession: $ACCESSION"
232-
233-
# Find matching R2 file
234-
R2_FILE=$(ls ~{sep=' ' downloaded_files} | grep -E "^.*${ACCESSION}(_2\.fastq|_R2\.fastq|_2\.fq|_R2\.fq)" | head -1 || echo "")
235-
236-
if [ -z "$R2_FILE" ]; then
237-
echo "ERROR: Could not find matching R2 file for accession: $ACCESSION"
238-
exit 1
239-
fi
240-
241-
echo "Matched R2: $R2_FILE"
242-
243-
# Copy files with accession-prefixed names to maintain uniqueness
244-
cp "$R1_FILE" "fastq_pairs/${ACCESSION}_r1.fastq.gz"
245-
cp "$R2_FILE" "fastq_pairs/${ACCESSION}_r2.fastq.gz"
246-
247-
# Append to output lists
248-
echo "fastq_pairs/${ACCESSION}_r1.fastq.gz" >> r1_files.txt
249-
echo "fastq_pairs/${ACCESSION}_r2.fastq.gz" >> r2_files.txt
250-
echo "$ACCESSION" >> accessions.txt
251-
done
252-
253-
echo "Successfully processed $(wc -l < accessions.txt) FASTQ pairs"
265+
echo "Successfully processed $(wc -l < accessions.txt) FASTQ sample(s)"
254266
>>>
255267

256268
output {
257269
Array[File] r1_files = read_lines("r1_files.txt")
258270
Array[File] r2_files = read_lines("r2_files.txt")
259271
Array[String] accessions = read_lines("accessions.txt")
272+
Array[String] is_paired_end_list = read_lines("is_paired_end.txt")
260273
}
261274

262275
runtime {

modules/ww-salmon/README.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,13 +32,13 @@ Builds Salmon index from a reference transcriptome FASTA file.
3232
- `salmon_index` (File): Compressed tarball containing the Salmon index for quantification
3333

3434
### `quantify`
35-
Quantifies transcript expression from paired-end RNA-seq reads using Salmon.
35+
Quantifies transcript expression from RNA-seq reads using Salmon. Supports both paired-end and single-end data.
3636

3737
**Inputs:**
3838
- `salmon_index_dir` (File): Compressed tarball containing Salmon index from `build_index`
3939
- `sample_name` (String): Sample name identifier for output files
4040
- `fastq_r1` (File): FASTQ file for read 1
41-
- `fastq_r2` (File): FASTQ file for read 2
41+
- `fastq_r2` (File?, optional): FASTQ file for read 2 (omit for single-end data)
4242
- `cpu_cores` (Int): Number of CPU cores (default: 8)
4343
- `memory_gb` (Int): Memory allocation in GB (default: 16)
4444

@@ -182,6 +182,7 @@ The module supports flexible resource configuration:
182182

183183
## Features
184184

185+
- **Single-end and paired-end support**: Works with both single-end and paired-end sequencing data
185186
- **Multi-sample support**: Process multiple samples in parallel using scatter-gather patterns
186187
- **Result aggregation**: Automatically merge quantification results across samples into TPM and count matrices
187188
- **Module integration**: Seamlessly combines with ww-sra, ww-testdata, and ww-deseq2 modules

modules/ww-salmon/testrun.wdl

Lines changed: 27 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,11 @@
11
version 1.0
22

33
# Import module in question as well as the testdata module for automatic demo functionality
4-
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-salmon/ww-salmon.wdl" as ww_salmon
5-
import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-testdata/ww-testdata.wdl" as ww_testdata
6-
7-
# Define data structure for sample inputs
8-
struct SalmonSample {
9-
String name
10-
File r1_fastq
11-
File r2_fastq
12-
}
4+
# import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/fix-sra-star-jyoung/modules/ww-salmon/ww-salmon.wdl" as ww_salmon
5+
# import "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-testdata/ww-testdata.wdl" as ww_testdata
6+
import "ww-salmon.wdl" as ww_salmon
7+
import "../ww-testdata/ww-testdata.wdl" as ww_testdata
8+
139

1410
#### TEST WORKFLOW DEFINITION ####
1511
# Define test workflow to demonstrate module functionality
@@ -26,45 +22,44 @@ workflow salmon_example {
2622
memory_gb = 8
2723
}
2824
29-
# Create samples array using test data
30-
Array[SalmonSample] final_samples = [
31-
{
32-
"name": "demo_sample",
33-
"r1_fastq": download_demo_data.r1_fastq,
34-
"r2_fastq": download_demo_data.r2_fastq
35-
}
36-
]
37-
38-
# Quantify each sample
39-
scatter (sample in final_samples) {
40-
call ww_salmon.quantify { input:
41-
salmon_index_dir = build_index.salmon_index,
42-
sample_name = sample.name,
43-
fastq_r1 = sample.r1_fastq,
44-
fastq_r2 = sample.r2_fastq,
45-
cpu_cores = 2,
46-
memory_gb = 8
47-
}
25+
# Paired-end quantification
26+
call ww_salmon.quantify as quantify_paired { input:
27+
salmon_index_dir = build_index.salmon_index,
28+
sample_name = "demo_paired",
29+
fastq_r1 = download_demo_data.r1_fastq,
30+
fastq_r2 = download_demo_data.r2_fastq,
31+
cpu_cores = 2,
32+
memory_gb = 8
33+
}
34+
35+
# Single-end quantification (using R1 only)
36+
call ww_salmon.quantify as quantify_single { input:
37+
salmon_index_dir = build_index.salmon_index,
38+
sample_name = "demo_single",
39+
fastq_r1 = download_demo_data.r1_fastq,
40+
cpu_cores = 2,
41+
memory_gb = 8
4842
}
4943
5044
# Merge results
5145
call ww_salmon.merge_results { input:
52-
salmon_quant_dirs = quantify.salmon_quant_dir,
53-
sample_names = quantify.output_sample_name,
46+
salmon_quant_dirs = [quantify_paired.salmon_quant_dir, quantify_single.salmon_quant_dir],
47+
sample_names = [quantify_paired.output_sample_name, quantify_single.output_sample_name],
5448
cpu_cores = 1,
5549
memory_gb = 4
5650
}
5751
5852
# Validate outputs
5953
call validate_outputs { input:
60-
salmon_quant_dirs = quantify.salmon_quant_dir,
54+
salmon_quant_dirs = [quantify_paired.salmon_quant_dir, quantify_single.salmon_quant_dir],
6155
tpm_matrix = merge_results.tpm_matrix,
6256
counts_matrix = merge_results.counts_matrix
6357
}
6458
6559
output {
6660
File salmon_index_tar = build_index.salmon_index
67-
Array[File] salmon_quant_dirs = quantify.salmon_quant_dir
61+
File salmon_quant_paired = quantify_paired.salmon_quant_dir
62+
File salmon_quant_single = quantify_single.salmon_quant_dir
6863
File merged_tpm_matrix = merge_results.tpm_matrix
6964
File merged_counts_matrix = merge_results.counts_matrix
7065
File sample_list = merge_results.sample_list

modules/ww-salmon/ww-salmon.wdl

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,7 @@ task quantify {
6666
meta {
6767
author: "WILDS Team"
6868
email: "wilds@fredhutch.org"
69-
description: "Quantify transcript expression from paired-end RNA-seq reads using Salmon"
69+
description: "Quantify transcript expression from RNA-seq reads using Salmon. Supports both paired-end and single-end data."
7070
url: "https://raw.githubusercontent.com/getwilds/wilds-wdl-library/refs/heads/main/modules/ww-salmon/ww-salmon.wdl"
7171
outputs: {
7272
salmon_quant_dir: "Compressed tarball containing Salmon quantification results including abundance estimates",
@@ -78,7 +78,7 @@ task quantify {
7878
salmon_index_dir: "Compressed tarball containing Salmon genome index"
7979
sample_name: "Name identifier for the sample"
8080
fastq_r1: "FASTQ file for read 1"
81-
fastq_r2: "FASTQ file for read 2"
81+
fastq_r2: "Optional FASTQ file for read 2 (omit for single-end data)"
8282
cpu_cores: "Number of CPU cores allocated for the task"
8383
memory_gb: "Memory allocated for the task in GB"
8484
}
@@ -87,7 +87,7 @@ task quantify {
8787
File salmon_index_dir
8888
String sample_name
8989
File fastq_r1
90-
File fastq_r2
90+
File? fastq_r2
9191
Int cpu_cores = 8
9292
Int memory_gb = 16
9393
}
@@ -99,12 +99,19 @@ task quantify {
9999
mkdir -p salmon_index
100100
tar -xzf ~{salmon_index_dir} -C ./
101101
102-
# Paired-end quantification with best practice parameters
102+
# Build read arguments based on paired-end or single-end data
103+
R2_FILE="~{if defined(fastq_r2) then select_first([fastq_r2]) else ""}"
104+
if [ -n "$R2_FILE" ]; then
105+
READ_ARGS="-1 ~{fastq_r1} -2 $R2_FILE"
106+
else
107+
READ_ARGS="-r ~{fastq_r1}"
108+
fi
109+
110+
# Quantification with best practice parameters
103111
salmon quant \
104112
-i salmon_index \
105113
--libType A \
106-
-1 ~{fastq_r1} \
107-
-2 ~{fastq_r2} \
114+
$READ_ARGS \
108115
-o ~{sample_name}_quant \
109116
-p ~{cpu_cores} \
110117
--validateMappings \

modules/ww-star/README.md

Lines changed: 19 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ Performs RNA-seq alignment using STAR's two-pass methodology.
3838
**Inputs:**
3939
- `star_genome_tar` (File): STAR genome index from `build_index`
4040
- `r1` (File): R1 FASTQ file
41-
- `r2` (File): R2 FASTQ file
41+
- `r2` (File?, optional): R2 FASTQ file (omit for single-end data)
4242
- `name` (String): Sample name for output files
4343
- `sjdb_overhang` (Int): Length of genomic sequence around junctions (default: 100)
4444
- `memory_gb` (Int): Memory allocation in GB (default: 62)
@@ -71,13 +71,13 @@ workflow my_rna_seq_pipeline {
7171
File reference_fasta
7272
File reference_gtf
7373
}
74-
74+
7575
call star_tasks.build_index {
7676
input:
7777
reference_fasta = reference_fasta,
7878
reference_gtf = reference_gtf
7979
}
80-
80+
8181
scatter (sample in samples) {
8282
call star_tasks.align_two_pass {
8383
input:
@@ -87,14 +87,25 @@ workflow my_rna_seq_pipeline {
8787
name = sample.name
8888
}
8989
}
90-
90+
9191
output {
9292
Array[File] aligned_bams = align_two_pass.bam
9393
Array[File] gene_counts = align_two_pass.gene_counts
9494
}
9595
}
9696
```
9797

98+
**Single-end data:**
99+
```wdl
100+
# Simply omit the r2 parameter for single-end samples
101+
call star_tasks.align_two_pass {
102+
input:
103+
star_genome_tar = build_index.star_index_tar,
104+
r1 = my_single_end_fastq,
105+
name = "my_sample"
106+
}
107+
```
108+
98109
### Advanced Usage Examples
99110

100111
**Custom splice junction parameters:**
@@ -143,8 +154,9 @@ The test workflow (`star_example`) automatically:
143154
1. Downloads reference genome data using `ww-testdata`
144155
2. Downloads demonstration FASTQ data using `ww-testdata`
145156
3. Builds STAR genome index
146-
4. Performs RNA-seq alignment using STAR two-pass methodology
147-
5. Validates all outputs
157+
4. Performs paired-end RNA-seq alignment using STAR two-pass methodology
158+
5. Performs single-end RNA-seq alignment using STAR two-pass methodology
159+
6. Validates all outputs (both paired-end and single-end)
148160

149161
## Configuration Guidelines
150162

@@ -177,6 +189,7 @@ The module supports flexible resource configuration:
177189
## Features
178190

179191
- **Two-pass methodology**: Optimal splice junction detection using STAR's two-pass approach
192+
- **Single-end and paired-end support**: Works with both single-end and paired-end sequencing data
180193
- **Comprehensive outputs**: BAM files, gene counts, splice junctions, and detailed logs
181194
- **Multi-sample support**: Process multiple samples in parallel
182195
- **Module integration**: Seamlessly combines with ww-sra and ww-testdata

0 commit comments

Comments
 (0)