
improve storage usage, minimize duplication of FASTQ data #109

@gpertea

Description


Just had a local run of SPEAQeasy on a 30-sample dataset where the compressed raw data (fastq.gz) total about 175 GB.
Running on a fast SSD with about 1.8 TB of available storage, the SSD filled up quickly and the pipeline aborted after running out of space. This seems unreasonable.

The main space hog appears to be the use of uncompressed FASTQ files internally, in the working directories. This could and should be avoided: most (all?) programs in the pipeline accept fastq.gz as input, and where a tool does not, the FASTQ data can be decompressed on the fly instead of being written out uncompressed.
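As a rough sketch of the on-the-fly alternative (not SPEAQeasy's actual commands; `wc -l` stands in for a hypothetical downstream tool that expects plain FASTQ):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Make a tiny gzipped FASTQ file for the demo (one 4-line record).
printf '@r1\nACGT\n+\nIIII\n' | gzip > sample.fastq.gz

# Streaming decompression via a pipe: the downstream tool reads plain
# FASTQ from stdin, but no uncompressed copy ever lands on disk.
zcat sample.fastq.gz | wc -l

# Process substitution works for tools that insist on a file path
# rather than stdin; bash exposes the stream as a readable path.
wc -l < <(zcat sample.fastq.gz)
```

Either form keeps only the compressed copy in the work directory; the uncompressed bytes exist transiently in the pipe buffer.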

Metadata

Labels: enhancement (New feature or request)
