v1.7.0 Birds of a feather
SqueezeMeta
Compatibility note
SqueezeMeta will now expect the CheckM2 database to be present in its database directory. If you had downloaded the SqueezeMeta database before, you can just download that extra file from here (make sure to uncompress it too!)
New features
- We have revamped all the documentation and moved it to Read The Docs! We will no longer provide a PDF version of the documentation
- SqueezeMeta can now be used to annotate a set of pre-existing genomes/bins and quantify their abundance in different samples. A directory containing genomes/bins can be provided through the `-extbins` parameter, which will run the pipeline on a pre-existing set of bins/genomes. This is similar to what `-extassembly` would do with a single FASTA file, but will treat each FASTA file in the input directory as a different bin
- SqueezeMeta can now be used to quickly obtain bins from metagenomes, skipping the taxonomic/functional annotation of contigs and ORFs. We have added the `--onlybins` flag to SqueezeMeta.pl, in order to quickly perform assembly, binning and bin QC/annotation
- SqueezeMeta can now optionally run GTDB-Tk for the taxonomic classification of bins, if the `--gtdbtk` flag is provided when calling the pipeline. Note that we do not redistribute the GTDB-Tk databases and they must be obtained separately. By default we expect them to be in a directory named `gtdb` inside the SqueezeMeta database directory, but a custom location can be provided via the `-gtdbtk_data_path` argument
- Switched to using CheckM2 for the calculation of bin completeness/contamination. This gets rid of several bugs related to CheckM1 not having updated its taxonomy to the current standard (e.g. "Pseudomonadota" instead of "Proteobacteria"). As a consequence, strain heterogeneity is no longer available in the bin results (though we've left an empty column there for backwards compatibility reasons)
- `-taxbinmode` has been deprecated, as GTDB-Tk can provide better bin-level taxonomies
- Added the `--fastnr` flag, which in turn will pass the `--fast` flag to DIAMOND when running classification against the nr database in Step 4 of the pipeline. This is significantly faster at the expense of some accuracy, but didn't seem to change the results significantly in our test
- We have simplified the way we calculate disparity for contigs and bins, see details here
- `sqm2tables.py` is now called at the end of SqueezeMeta runs
- We're moving towards using conda packages rather than vendoring SqueezeMeta's dependencies, see details here
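As an illustration of how the new options listed above might be combined, here is a minimal Python sketch that only assembles a `SqueezeMeta.pl` command line and prints it, without running anything. The helper `build_squeezemeta_command` and the project, sample file and directory names are hypothetical. The `-m`, `-p`, `-s` and `-f` arguments reflect the usual way of calling the pipeline as we understand it and should be checked against the documentation; the remaining flags are the ones named in this release.

```python
import shlex


def build_squeezemeta_command(project, samples_file, reads_dir, mode="coassembly",
                              extbins=None, onlybins=False, gtdbtk=False,
                              gtdbtk_data_path=None, fastnr=False):
    """Assemble (but do not run) an argument list for SqueezeMeta.pl.

    This helper is hypothetical and only illustrates how the flags fit together.
    """
    # Base call: mode, project name, samples file and raw reads directory
    # (assumed to be the standard required arguments).
    cmd = ["SqueezeMeta.pl", "-m", mode, "-p", project,
           "-s", samples_file, "-f", reads_dir]
    if extbins is not None:
        # Directory in which each FASTA file is treated as a separate bin.
        cmd += ["-extbins", extbins]
    if onlybins:
        # Skip contig/ORF annotation; only assembly, binning and bin QC/annotation.
        cmd.append("--onlybins")
    if gtdbtk:
        # Classify bins with GTDB-Tk (databases must be obtained separately).
        cmd.append("--gtdbtk")
        if gtdbtk_data_path is not None:
            # Custom location of the GTDB-Tk databases.
            cmd += ["-gtdbtk_data_path", gtdbtk_data_path]
    if fastnr:
        # Pass --fast to DIAMOND when searching against nr.
        cmd.append("--fastnr")
    return cmd


# Hypothetical project: quick binning with GTDB-Tk classification of the bins.
cmd = build_squeezemeta_command("myproject", "my.samples", "raw_reads",
                                onlybins=True, gtdbtk=True,
                                gtdbtk_data_path="/data/gtdb")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so paths containing spaces are quoted correctly when printed with `shlex.join` or passed to `subprocess.run`.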
Minor changes / bugfixes
- Contig names and bin names now start with the project name, to make it easy to distinguish contigs/bins coming from different SqueezeMeta runs
- Added read group tags to the BAM files produced in step 10, identifying the sample each read comes from
- Removed the `make_databases_alt.pl` and `configure_nodb_alt.pl` scripts, as the standard `make_databases.pl`, `download_databases.pl` and `configure_nodb.pl` scripts are now able to switch to a mirror if our server is unavailable
- Added the `-g` parameter, which controls the value of the `-g|--global-ranking` parameter in DIAMOND when running it against the nr database
- Use forking instead of threads in scripts 06 and 10 to reduce memory usage when multithreading
- Fixed a bug that prevented `sqm_hmm_reads.pl` from working since it was trying to download legacy PFAM databases that are no longer reachable
- Fixed a bug in which some ORFs would be duplicated if the pipeline went through step 13 on restart
- Fixed the calculation of present pathways in step 20
- Fixed a bug preventing SqueezeMeta from working with newer versions of MetaBAT2
- Several bugfixes to SqueezeMeta's behaviour when restarting a run
- We now use the `scaffolds.fasta` result instead of the `contigs.fasta` one when running SPAdes with the `-a spades` or `-a spades-base` modes (we still use the `transcripts.fasta` result if running it with the `-a rnaspades` mode)
- Fixed a bug in which `sqm_annot.pl` wasn't passing the right number of threads to subprocesses
- Fixed a bug in step 10 when the total number of contigs was smaller than the number of available threads
SQMtools
New features
- We have revamped all the documentation and moved it to Read The Docs! A PDF version will still be present as part of the CRAN release
- SQMtools now supports loading more than one project into the same object. `loadSQM` can now be used to load the output of different SqueezeMeta runs into a single object that can be subsetted and plotted as a standard SQM object (see details here). This facilitates the analysis of e.g. sequential runs in which each sample was processed independently
- We now provide basic functions for defining/modifying/curating bins within SQMtools, and the possibility of recalculating bin completeness/contamination after adding/removing contigs to the bin (either manually or through a subset function). See details here and here
- Added `exportContigs`, `exportORFs` and `exportBins` to export the sequences present in a `SQM` or `SQMbunch` object
- We changed the default way of calculating copy numbers from using RecA as a reference to using the median coverage of 10 Universal Single Copy Genes. This behaviour can be controlled via the `single_copy_genes` parameter in `loadSQM`
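The idea behind the new copy number default can be sketched in a few lines of Python. This is an illustration of the concept only, not the SQMtools implementation (which is written in R): the function `copy_numbers`, the gene identifiers and the coverage values are all hypothetical, and the sketch assumes that copy number is obtained by dividing each function's coverage by the median coverage of the reference single copy genes in the same sample.

```python
from statistics import median


def copy_numbers(function_coverage, single_copy_genes):
    """Normalise per-function coverage by the median coverage of reference genes.

    function_coverage: dict mapping function ID -> coverage in one sample.
    single_copy_genes: IDs of the genes used as single copy references.
    """
    # Collect the coverage of the reference genes that are present in the sample.
    reference = [function_coverage[g] for g in single_copy_genes
                 if g in function_coverage]
    if not reference:
        raise ValueError("none of the reference genes were found in the sample")
    reference_coverage = median(reference)
    if reference_coverage == 0:
        raise ValueError("median reference coverage is zero")
    # A value of 1 means "as abundant as a typical single copy gene".
    return {f: cov / reference_coverage for f, cov in function_coverage.items()}


# Toy data: three hypothetical reference genes and two functions of interest.
coverage = {"uscg_a": 10.0, "uscg_b": 12.0, "uscg_c": 8.0,
            "gene_x": 20.0, "gene_y": 5.0}
result = copy_numbers(coverage, ["uscg_a", "uscg_b", "uscg_c"])
print(result["gene_x"], result["gene_y"])
```

Using the median of several reference genes rather than a single gene such as RecA makes the estimate less sensitive to one reference gene being poorly assembled or unevenly covered in a given sample.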
Minor changes / bugfixes
- Added the `load_sequences` argument to `loadSQM` to control whether contig/ORF sequences should be loaded. Setting it to `FALSE` will decrease memory usage
- Added an `output_dir` parameter to `exportPathway`
- Start and end positions of ORFs are now tracked explicitly in `SQM$orfs$table`
- `copy_number` is now the default quantification method used by `plotFunctions` and `exportPathway`, when available
- Fixed some IDs missing from SQM names and paths vectors after running `combineSQMlite`
- Fixed a bug in which the `data.table` package wasn't attached when loading `SQMtools`
- Fixed a bug when subsetting was attempted with only one ORF/contig