Comparing various languages for building bioinformatics applications
Using 00python3-start-here as a template for other languages, write the
following solutions:
- FASTA iterator in a shared library
- dust filter: reads FASTA, outputs masked sequence
- kmer counter: reads FASTA, reports kmer frequencies
- genotyping simulator: reports genotype probabilities given nt counts
- get exon sequences from a FASTA/GFF3 sqlite database
- read a parameter file in JSON
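As a starting point for the first item, here is a minimal sketch of a FASTA iterator in Python. The names `read_fasta` and `open_fasta` are hypothetical and are not taken from the repository's `mylib.py`; the sketch assumes records start with `>` and that sequence lines may wrap.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (name, sequence) tuples from an open text handle in FASTA format."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


if __name__ == "__main__":
    demo = io.StringIO(">seq1 demo\nACGT\nACGT\n>seq2\nTTTT\n")
    for name, seq in read_fasta(demo):
        print(name, len(seq))
```

Because the iterator yields one record at a time, memory use stays proportional to the longest sequence rather than to the whole file.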
Which languages?
- Classic systems-level: C
- Modern systems-level: Go, Rust, Zig
- Less common but maybe interested: Crystal, D, Mojo, Nim, V
- Probably not interested
  - Compiled: C++, Free Pascal
  - JIT-based: C#, F#, Java, JavaScript, Julia, LuaJIT, Scala
  - Interpreted: Lua, PHP, Raku, Ruby
- `README.md`: this document
- `data`: files used for testing
  - `ce1pct.fa.gz`: 1% of the C. elegans genome in FASTA
  - `ce1pct.gff3.gz`: 1% of the C. elegans genome in GFF3 (for ref, not used)
  - `ce.db`: a sqlite database of the files above
  - `hmm.json`: a simple HMM parameter file
- `00python-start-here`: pure Python to inspire other solutions
- `c-klib`: a C solution based partly on klib
Status: acceptable, requires Python >= 3.10
Other languages should have programs with similar names and produce nearly
identical output. There should be a run.sh that builds and runs the programs.
- `run.sh`: use this to run all programs (then `rm *.out` later)
- `mylib.py`: contains the FASTA iterator and `anti()` function
- `dust.py`: masks low complexity sequence, optionally lowercase
- `kmers.py`: counts kmers, optionally double-stranded
- `genotype.py`: simulates genotyping by sequencing, threaded
- `exons.py`: reports exon sequences
- `params.py`: reads a JSON and spits it back out reformatted
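To illustrate what "optionally double-stranded" kmer counting can look like, here is a minimal sketch. It assumes that `anti()` returns the reverse complement of a sequence; that is a guess from the name, not a statement about the actual `mylib.py`. The function `count_kmers` and its parameters are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def anti(seq):
    """Return the reverse complement of a DNA sequence (assumed meaning of anti)."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(seq, k, double_stranded=False):
    """Count all overlapping kmers of length k, optionally on both strands."""
    counts = Counter()
    strands = [seq, anti(seq)] if double_stranded else [seq]
    for strand in strands:
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    for kmer, n in sorted(count_kmers("AACGTT", 2, double_stranded=True).items()):
        print(kmer, n)
```

Counting the reverse complement as a second pass keeps the single-stranded path unchanged, at the cost of building one extra copy of the sequence.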
Status: dust complete
The C implementation uses Klib at its core, which is a great library for C-based bioinformatics work. It is used in htslib, minimap2, etc. The sqlite interaction uses the sqlite amalgamation. The JSON parser is JSMN.
The programs all have their own directory with a Makefile and a main.c.

    Makefile
    lib/
        include/
            jsmn.h
            khash.h krng.h kseq.h kvec.h
            mylib.h
            sqlite3.h
        src/
            mylib.c
            sqlite3.c
    dust/
        Makefile
        main.c
    kmers/
    genotypes/
    exons/
    params/
Status: major features complete. The FASTA iterator could use SIMD acceleration for parsing. The extra HMM implementation is mostly complete.
- `fasta` iterator
- `exons` extraction from SQLite 3
- `params` de/serialize (with pretty printing)
  - Remarks: as a strongly typed language, Rust inherently has a different model for data de/serialization. It is possible to skip the schema and work with raw values instead, but that is extremely error-prone. This should have been implemented in the library. However, I did not have a well-defined schema for the HMM params yet, so the current implementation lives in the binary, with a best-effort schema guessed from the test data.
- kmer counter
- genotyping simulator
- dust filter
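The schema-versus-raw-values trade-off in the remarks above can be sketched in Python, used here only as a stand-in; this is not the Rust code and does not use serde. The `State` class and its field names are hypothetical and do not reflect the real layout of `hmm.json`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class State:
    """Hypothetical schema for one HMM state; the field names are assumptions."""
    name: str
    init: float
    emissions: dict


def load_states(text):
    """Parse JSON into State objects, failing early on malformed input."""
    states = []
    for raw in json.loads(text)["states"]:
        # Raises TypeError if a key is missing or unexpected.
        state = State(**raw)
        if not isinstance(state.init, (int, float)):
            raise TypeError(f"init must be a number, got {state.init!r}")
        states.append(state)
    return states


def dump_states(states):
    """Serialize State objects back to pretty-printed JSON."""
    return json.dumps({"states": [asdict(s) for s in states]}, indent=2)


if __name__ == "__main__":
    text = '{"states": [{"name": "F", "init": 0.5, "emissions": {"6": 0.1667}}]}'
    print(dump_states(load_states(text)))
```

With a schema, a misspelled or missing field fails at load time; with raw dictionaries the same mistake would surface later, wherever the value is first used.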
Extras:
- Hidden Markov Model
- Structural Modeling
- Viterbi Algorithm
- Parse Structural JSON Parameters
- Parse Non-Structural JSON Parameters
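For readers unfamiliar with the Viterbi algorithm listed above, here is a minimal log-space sketch in Python. It is not a port of the Rust code. The two-state casino model (a fair die and a loaded die) and all of its probabilities are assumptions chosen for illustration, as are the names `viterbi`, `STATES`, `START`, `TRANS`, and `EMIT`.

```python
import math

# Assumed toy model: F = fair die, L = loaded die that favors sixes.
STATES = ["F", "L"]
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def viterbi(observations, states, start, trans, emit):
    """Return the most probable state path for the observations."""
    if not observations:
        return []
    # scores[s] is the best log probability of any path ending in state s.
    scores = {s: math.log(start[s]) + math.log(emit[s][observations[0]]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = (
                scores[best_prev] + math.log(trans[best_prev][s]) + math.log(emit[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    rolls = "3151162464466442453113216311641521336"
    print("".join(viterbi(rolls, STATES, START, TRANS, EMIT)))
```

Working in log space avoids underflow on long sequences, since products of many small probabilities become sums.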
The Rust implementation ties together several crates, as the Rust ecosystem does
not seem to have bioinformatics libraries just yet. The SQLite interaction uses
the awesome rusqlite crate. JSON parsing uses serde and serde_json.
The library resides at the root of the src/ directory. Each file in the bin/
subdirectory gets compiled into a separate binary (in a separate
crate). Cargo is the official way to build Rust programs and manage dependencies.

    rust/
        Cargo.lock
        Cargo.toml
        run.sh
        src/
            bin/
                dust.rs
                exons.rs
                genotypes.rs
                kmers.rs
                occassionally_dishonest_casino.rs
                params.rs
            collections/
                vec2.rs
            collections.rs
            dust.rs
            fasta.rs
            genotype.rs
            hidden_markov.rs
            kmer.rs
            lib.rs
            sequence.rs