[TheiaProk] Add Sequins read extraction/decontamination#1021

Draft
xonq wants to merge 34 commits into main from kzm-sequins-dev

@xonq xonq commented Mar 13, 2026

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

Partners would like a module that identifies and extracts reads matching a FASTA of mirrored DNA sequences (Sequins) as a means of qualitative sample inference. To this end, this PR implements the host_decontaminate workflow within read_QC_trim, exposing only the workflow's FASTA input. More broadly, this enables read decontamination across all workflows that import read_QC_trim.

As part of this PR, a sequence-specific read mapping statistics task was created and propagated to workflows whose references may contain multiple sequence headers: host_decontaminate (multi-contig hosts/contaminants) and theiaviral* (segmented viruses).

⚡ Impacted Workflows/Tasks

Workflows

+ read_decontaminate

Δ read_QC_trim_pe

Δ theiacov_illumina_pe

Δ theiaeuk_illumina_pe

Δ theiameta_illumina_pe

Δ theiaprok_illumina_pe

Δ theiaviral_illumina_pe

Δ theiaviral_ont

Δ theiaviral_panel

- host_decontaminate

Tasks

+ contaminant_check

+ mapping_stats

This PR may lead to different results in pre-existing outputs: Yes/No

This PR uses an element that could cause duplicate runs to have different results: Yes/No

🛠️ Changes

  • add host decontaminate to read_QC_trim_pe
  • create sequence-specific stats tasks to calculate the breadth and depth of coverage for each sequence in the input FASTA
  • expose outputs (host_decontaminate -> read_QC_trim -> theiaprok)
  • expose mapping stats to other theia*illumina_pe workflows
  • test and compare to expected data
  • add documentation modules for theia* workflows
  • propagate segment mapping stats module to theiaviral
  • update documentation
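
The per-sequence breadth and depth calculation mentioned in the list above could look roughly like the following. This is a minimal sketch, not the task added in this PR: it assumes the input is the three-column, tab-separated table printed by `samtools depth -a` (sequence name, position, depth), and the function name `per_sequence_coverage` and the `min_depth` parameter are hypothetical.

```python
from collections import defaultdict


def per_sequence_coverage(depth_lines, min_depth=1):
    """Summarise breadth (%) and mean depth for each reference sequence.

    depth_lines: iterable of 'name<TAB>position<TAB>depth' rows, assumed to
    include zero-depth positions (as `samtools depth -a` reports them).
    min_depth: hypothetical cutoff; a position counts as covered at or above it.
    """
    positions = defaultdict(int)    # positions seen per sequence
    covered = defaultdict(int)      # positions with depth >= min_depth
    total_depth = defaultdict(int)  # summed depth per sequence

    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        name, _position, depth = line.split("\t")
        depth = int(depth)
        positions[name] += 1
        total_depth[name] += depth
        if depth >= min_depth:
            covered[name] += 1

    return {
        name: {
            "breadth_pct": 100.0 * covered[name] / positions[name],
            "mean_depth": total_depth[name] / positions[name],
        }
        for name in positions
    }


if __name__ == "__main__":
    rows = [
        "seqA\t1\t4", "seqA\t2\t0", "seqA\t3\t2", "seqA\t4\t2",
        "seqB\t1\t0", "seqB\t2\t0",
    ]
    for name, stats in per_sequence_coverage(rows).items():
        print(name, stats)
```

Reporting the two values per sequence header, rather than one pair for the whole reference, is what makes the statistics meaningful for multi-contig or segmented references.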

⚙️ Algorithm

➡️ Inputs

read_QC_trim_pe +5 -0
+ expected_contaminants
+ min_contaminant_coverage
+ min_contaminant_depth
+ read_decontaminate_fasta
+ read_decontaminate_memory
theiaviral_illumina_pe +3 -0
+ host_decontaminate.expected_sequences
+ host_decontaminate.min_expected_coverage
+ host_decontaminate.min_expected_depth
freyja_fastq +5 -0
+ read_QC_trim_pe.expected_contaminants
+ read_QC_trim_pe.min_contaminant_coverage
+ read_QC_trim_pe.min_contaminant_depth
+ read_QC_trim_pe.read_decontaminate_fasta
+ read_QC_trim_pe.read_decontaminate_memory
theiaviral_ont +3 -0
+ host_decontaminate.expected_sequences
+ host_decontaminate.min_expected_coverage
+ host_decontaminate.min_expected_depth
theiacov_illumina_pe +5 -0
+ read_QC_trim.expected_contaminants
+ read_QC_trim.min_contaminant_coverage
+ read_QC_trim.min_contaminant_depth
+ read_QC_trim.read_decontaminate_fasta
+ read_QC_trim.read_decontaminate_memory
theiaeuk_illumina_pe +5 -0
+ read_QC_trim.expected_contaminants
+ read_QC_trim.min_contaminant_coverage
+ read_QC_trim.min_contaminant_depth
+ read_QC_trim.read_decontaminate_fasta
+ read_QC_trim.read_decontaminate_memory
theiaprok_illumina_pe +5 -0
+ read_QC_trim.expected_contaminants
+ read_QC_trim.min_contaminant_coverage
+ read_QC_trim.min_contaminant_depth
+ read_QC_trim.read_decontaminate_fasta
+ read_QC_trim.read_decontaminate_memory
theiaviral_panel +3 -0
+ host_decontaminate.expected_sequences
+ host_decontaminate.min_expected_coverage
+ host_decontaminate.min_expected_depth
read_decontaminate +46 -0
+ bam_to_unaligned_fastq.cpu
+ bam_to_unaligned_fastq.disk_size
+ bam_to_unaligned_fastq.docker
+ bam_to_unaligned_fastq.memory
+ complete_only
+ contaminant
+ contaminant_check.cpu
+ contaminant_check.disk_size
+ contaminant_check.docker
+ contaminant_check.memory
+ download_accession.cpu
+ download_accession.disk_size
+ download_accession.docker
+ download_accession.include_gbff
+ download_accession.include_gff3
+ download_accession.memory
+ expected_sequences
+ is_accession
+ is_genome
+ min_expected_coverage
+ min_expected_depth
+ minimap2_memory
+ minimap2_ont.cpu
+ minimap2_ont.disk_size
+ minimap2_ont.docker
+ minimap2_ont.query2
+ minimap2_pe.cpu
+ minimap2_pe.disk_size
+ minimap2_pe.docker
+ ncbi_identify.cpu
+ ncbi_identify.disk_size
+ ncbi_identify.docker
+ ncbi_identify.memory
+ parse_mapping.cpu
+ parse_mapping.disk_size
+ parse_mapping.docker
+ parse_mapping.memory
+ parse_mapping.min_qual
+ read1
+ read2
+ read_mapping_stats.cpu
+ read_mapping_stats.disk_size
+ read_mapping_stats.docker
+ read_mapping_stats.memory
+ refseq
+ samplename
theiameta_illumina_pe +5 -0
+ read_QC_trim.expected_contaminants
+ read_QC_trim.min_contaminant_coverage
+ read_QC_trim.min_contaminant_depth
+ read_QC_trim.read_decontaminate_fasta
+ read_QC_trim.read_decontaminate_memory
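
The input names above suggest that a list of expected sequences is compared against minimum coverage and depth thresholds. The PR description does not state how that comparison is made, so the following is only an illustrative sketch: the function `check_expected_sequences`, the statistics layout, and the assumption that coverage is expressed as a percentage are all hypothetical and do not describe the actual contaminant_check task.

```python
def check_expected_sequences(stats, expected, min_coverage, min_depth):
    """Split expected sequence names into detected and missing.

    stats: dict mapping sequence name -> {"breadth_pct": float, "mean_depth": float}
           (hypothetical layout, breadth assumed to be a percentage)
    expected: iterable of sequence names that should be present
    min_coverage, min_depth: hypothetical thresholds; both must be met
    """
    detected, missing = [], []
    for name in expected:
        entry = stats.get(name)
        if (
            entry is not None
            and entry["breadth_pct"] >= min_coverage
            and entry["mean_depth"] >= min_depth
        ):
            detected.append(name)
        else:
            missing.append(name)
    return detected, missing


if __name__ == "__main__":
    example_stats = {
        "sequin_1": {"breadth_pct": 98.5, "mean_depth": 40.2},
        "sequin_2": {"breadth_pct": 12.0, "mean_depth": 0.4},
    }
    found, absent = check_expected_sequences(
        example_stats,
        expected=["sequin_1", "sequin_2", "sequin_3"],
        min_coverage=90.0,
        min_depth=10.0,
    )
    print("detected:", found)
    print("missing:", absent)
```

A sequence that is absent from the statistics is treated the same as one that falls below either threshold, which keeps the result a simple two-way split.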

⬅️ Outputs

read_QC_trim_pe +11 -0
+ contaminant_bai
+ contaminant_bam
+ contaminant_cov_hist
+ contaminant_coverage
+ contaminant_mapping_flagstat
+ contaminant_mapping_stats
+ contaminant_mean_depth
+ contaminant_percent_mapped_reads
+ contaminant_sequence_coverage
+ contaminant_sequence_depth
+ contaminant_status
theiaviral_illumina_pe +5 -2
+ dehost_wf_host_coverage_by_sequence
+ dehost_wf_host_depth_by_sequence
+ dehost_wf_host_sequence_check
+ read_mapping_coverage_by_sequence
+ read_mapping_depth_by_sequence
- dehost_wf_host_mapping_metrics
- read_mapping_report
theiaviral_ont +5 -2
+ dehost_wf_host_coverage_by_sequence
+ dehost_wf_host_depth_by_sequence
+ dehost_wf_host_sequence_check
+ read_mapping_coverage_by_sequence
+ read_mapping_depth_by_sequence
- dehost_wf_host_mapping_metrics
- read_mapping_report
theiacov_illumina_pe +11 -0
+ contaminant_bai
+ contaminant_bam
+ contaminant_cov_hist
+ contaminant_coverage
+ contaminant_coverage_by_sequence
+ contaminant_depth_by_sequence
+ contaminant_mapping_flagstat
+ contaminant_mapping_stats
+ contaminant_mean_depth
+ contaminant_percent_mapped_reads
+ contaminant_status
theiaeuk_illumina_pe +11 -0
+ contaminant_bai
+ contaminant_bam
+ contaminant_cov_hist
+ contaminant_coverage
+ contaminant_coverage_by_sequence
+ contaminant_depth_by_sequence
+ contaminant_mapping_flagstat
+ contaminant_mapping_stats
+ contaminant_mean_depth
+ contaminant_percent_mapped_reads
+ contaminant_status
theiaprok_illumina_pe +11 -0
+ contaminant_bai
+ contaminant_bam
+ contaminant_cov_hist
+ contaminant_coverage
+ contaminant_coverage_by_sequence
+ contaminant_depth_by_sequence
+ contaminant_mapping_flagstat
+ contaminant_mapping_stats
+ contaminant_mean_depth
+ contaminant_percent_mapped_reads
+ contaminant_status
theiaviral_panel +3 -1
+ dehost_wf_host_coverage_by_sequence
+ dehost_wf_host_depth_by_sequence
+ dehost_wf_host_sequence_check
- dehost_wf_host_mapping_metrics
read_decontaminate +19 -0
+ contaminant_bam
+ contaminant_check_status
+ contaminant_coverage_by_sequence
+ contaminant_depth_by_sequence
+ contaminant_flagstat
+ contaminant_genome_accession
+ contaminant_genome_data_report_json
+ contaminant_genome_fasta
+ contaminant_mapped_sorted_bai
+ contaminant_mapped_sorted_bam
+ contaminant_mapping_cov_hist
+ contaminant_mapping_coverage
+ contaminant_mapping_mean_depth
+ contaminant_mapping_stats
+ contaminant_percent_mapped_reads
+ decontaminate_read1
+ decontaminate_read2
+ ncbi_datasets_version
+ samtools_version
theiameta_illumina_pe +11 -0
+ contaminant_bai
+ contaminant_bam
+ contaminant_cov_hist
+ contaminant_coverage
+ contaminant_coverage_by_sequence
+ contaminant_depth_by_sequence
+ contaminant_mapping_flagstat
+ contaminant_mapping_stats
+ contaminant_mean_depth
+ contaminant_percent_mapped_reads
+ contaminant_status

🧪 Testing

  • theiaviral_illumina_pe: host_decontaminate, mapping_stats, theiaviral_illumina_pe

  • theiaviral_ont: host_decontaminate, mapping_stats, theiaviral_ont

  • theiaviral_panel: host_decontaminate, mapping_stats, theiaviral_illumina_pe, theiaviral_panel

  • freyja_fastq: read_QC_trim_pe

  • theiacov_illumina_pe: read_QC_trim_pe

  • theiaeuk_illumina_pe: read_QC_trim_pe

  • theiameta_illumina_pe: read_QC_trim_pe

  • theiaprok_illumina_pe: read_QC_trim_pe, theiaprok_illumina_pe

Suggested Scenarios for Reviewer to Test

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable and follow the documentation style guide
    • You have updated the "Last Known Changes" field for any affected workflows in the respective workflow documentation page and for the entry in the docs/assets/tables/all_workflows.tsv table to be the tag for the next upcoming release. If you do not know the tag, please put "vX.X.X"

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated and adheres to the documentation style guide

@xonq xonq Mar 25, 2026

Please note this file was moved from "host_decontaminate"
