Palette quantization library that powers pngquant and other PNG optimizers
Updated Feb 11, 2026 (Rust)
RDNA-native LLM inference engine in Rust.
PMetal: high-performance Apple Silicon framework for local LLM inference, LoRA/QLoRA fine-tuning, serving, quantization, and MLX/Metal acceleration.
Rust implementation of Google's TurboQuant algorithm for vector quantization
258 KB WASM runtime for Needle, a 26M-parameter tool-calling transformer. Runs in the browser, Cloudflare Workers, and Node.js. No backend required.
Fast library for converting RGBA images to 8-bit palette images. Written in Rust; can be used in C programs
PyTorch-equivalent ML framework in pure Rust — 22 crates, CUDA GPU, biometrics, IR detection, LLMs, ONNX, distributed training
Apple Neural Engine (ANE) LLM inference engine: reverse-engineered private APIs, Metal GPU shaders, and hybrid ANE+GPU+CPU execution on Apple Silicon. 32 tok/s, matching llama.cpp; 3.6 TFLOPS from fused ANE mega-kernels.
Rust implementation of a k-d tree for efficient color quantization to predefined color sets
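The k-d tree approach can be sketched as follows. This is a minimal illustration, not the listed project's code: the names `PaletteTree`, `build`, `nearest`, and `dist2` are hypothetical, colors are assumed to be plain RGB triples, and "closest" is assumed to mean squared Euclidean distance in RGB space.

```rust
// Hypothetical sketch: k-d tree over an RGB palette for nearest-color lookup.
// Assumes squared Euclidean distance in plain RGB space.
struct Node {
    color: [u8; 3],
    index: usize, // position of this color in the original palette
    axis: usize,  // splitting channel: 0 = R, 1 = G, 2 = B
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn build(points: &mut [([u8; 3], usize)], depth: usize) -> Option<Box<Node>> {
    if points.is_empty() {
        return None;
    }
    let axis = depth % 3;
    // Split at the median along the current channel.
    points.sort_by_key(|(c, _)| c[axis]);
    let mid = points.len() / 2;
    let (color, index) = points[mid];
    let (lo, rest) = points.split_at_mut(mid);
    Some(Box::new(Node {
        color,
        index,
        axis,
        left: build(lo, depth + 1),
        right: build(&mut rest[1..], depth + 1),
    }))
}

fn dist2(a: [u8; 3], b: [u8; 3]) -> u32 {
    (0..3)
        .map(|i| {
            let d = a[i] as i32 - b[i] as i32;
            (d * d) as u32
        })
        .sum()
}

fn nearest(node: &Option<Box<Node>>, target: [u8; 3], best: &mut (u32, usize)) {
    let Some(n) = node else { return };
    let d = dist2(n.color, target);
    if d < best.0 {
        *best = (d, n.index);
    }
    // Descend first into the side of the splitting plane that contains the target.
    let diff = target[n.axis] as i32 - n.color[n.axis] as i32;
    let (near, far) = if diff < 0 { (&n.left, &n.right) } else { (&n.right, &n.left) };
    nearest(near, target, best);
    // Visit the other side only if the plane is closer than the best match so far.
    if ((diff * diff) as u32) < best.0 {
        nearest(far, target, best);
    }
}

struct PaletteTree {
    root: Option<Box<Node>>,
}

impl PaletteTree {
    fn new(palette: &[[u8; 3]]) -> Self {
        let mut points: Vec<([u8; 3], usize)> =
            palette.iter().enumerate().map(|(i, &c)| (c, i)).collect();
        PaletteTree { root: build(&mut points, 0) }
    }

    /// Index of the closest palette entry, or None for an empty palette.
    fn nearest_index(&self, target: [u8; 3]) -> Option<usize> {
        if self.root.is_none() {
            return None;
        }
        let mut best = (u32::MAX, 0);
        nearest(&self.root, target, &mut best);
        Some(best.1)
    }
}
```

Each lookup prunes whole subtrees whose splitting plane is farther away than the current best match, which is what makes a k-d tree cheaper than scanning every palette entry for large palettes.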
⚡ Vectro: Lightning-fast embedding quantization. Hits 12M+ vec/s throughput (4.85× faster than FAISS C++) while drastically cutting the memory footprint for vector databases and local AI research.
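Vectro's own method is not described here. As a generic illustration of embedding quantization, the sketch below shows sign (binary) quantization: each f32 dimension is reduced to one bit, a 32× storage reduction, and Hamming distance between the packed codes serves as a cheap proxy for angular distance. The functions `binarize` and `hamming` are hypothetical names.

```rust
// Hypothetical sketch of sign (binary) embedding quantization; not Vectro's algorithm.

/// Packs one bit per dimension into u64 words: bit i is set when v[i] > 0.0.
fn binarize(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance between two packed codes: the number of differing sign bits.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Comparing codes then costs one XOR and one popcount per 64 dimensions, which is why binary codes are often used as a fast first-pass filter before re-ranking with higher-precision vectors.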
High-performance Rust integration for aggressive KV cache quantization on Apple Silicon GPUs (Metal). Features a multi-turn TUI, smart context windowing, and full LLM observability.
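As a generic illustration of KV-cache quantization (not the scheme used by this project, nor TurboQuant), the sketch below applies asymmetric min/max quantization to 4 bits per value within one group of cache entries, packing two codes per byte and storing an f32 scale and minimum per group. The group layout and the names `Q4Group`, `quantize_q4`, and `dequantize_q4` are assumptions.

```rust
// Hypothetical sketch: group-wise asymmetric 4-bit quantization of cache values.

/// One quantized group: 4-bit codes packed two per byte, plus per-group scale and minimum.
struct Q4Group {
    packed: Vec<u8>,
    scale: f32,
    min: f32,
    len: usize,
}

fn quantize_q4(x: &[f32]) -> Q4Group {
    let min = x.iter().copied().fold(f32::INFINITY, f32::min);
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // 4 bits give 16 levels (0..=15); use scale 1.0 for a constant or empty group.
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let mut packed = vec![0u8; (x.len() + 1) / 2];
    for (i, &v) in x.iter().enumerate() {
        let q = ((v - min) / scale).round().clamp(0.0, 15.0) as u8;
        // Even indices go in the low nibble, odd indices in the high nibble.
        packed[i / 2] |= q << (4 * (i % 2));
    }
    Q4Group { packed, scale, min, len: x.len() }
}

fn dequantize_q4(g: &Q4Group) -> Vec<f32> {
    (0..g.len)
        .map(|i| {
            let q = (g.packed[i / 2] >> (4 * (i % 2))) & 0x0F;
            g.min + q as f32 * g.scale
        })
        .collect()
}
```

With round-to-nearest, each reconstructed value is within `scale / 2` of the original; smaller groups track the value range more tightly at the cost of more per-group metadata.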
Rust-native MoE inference runtime with custom CUDA kernels for Blackwell GPUs. Includes DFlash speculative decoding, multi-tier Engram memory, and entropy-adaptive routing. Targets Qwen3.5-35B-A3B on a single RTX 5060 Ti 16GB.
Rust KV-cache compression for LLM inference. Implements TurboQuant (Zandieh et al., ICLR 2026) plus PQO — our variant that drops QJL, adds a fused CUDA kernel, and shrinks the cache to ~20% of FP16 (49% total VRAM at 32K). mistral.rs integration.
Rust API wrapper for chatllm.cpp, an LLM inference engine
VectorPrime takes a model file and your hardware, then finds the fastest way to run it. It profiles your CPU, GPU, and RAM
Train tiny neural networks, quantize to INT8, generate C code, flash to ESP32. No TFLite runtime, just compiled inference kernels. 17 µs per inference on real hardware.
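The INT8 step can be illustrated with symmetric per-tensor quantization and an integer dot product, the core operation of a quantized dense layer. This is a generic sketch in Rust, not the project's generated C code; the per-tensor scale and the names `quantize_i8` and `dot_i8` are assumptions.

```rust
// Hypothetical sketch: symmetric per-tensor INT8 quantization.

/// q = round(x / scale) with scale = max|x| / 127, so values map into [-127, 127].
fn quantize_i8(x: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = x.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    // An all-zero tensor gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = x
        .iter()
        .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Integer dot product with an i32 accumulator, rescaled to f32 once at the end.
fn dot_i8(a: &[i8], sa: f32, b: &[i8], sb: f32) -> f32 {
    let acc: i32 = a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum();
    acc as f32 * sa * sb
}
```

Keeping the inner loop in integer arithmetic and applying the two scales once per output is what lets such kernels run without floating-point work in the hot path.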
Fajar Lang (fj) — Systems programming language for embedded ML & OS development. Compiler-enforced safety with @kernel/@device/@safe contexts. Rust-based compiler with Cranelift/LLVM backends. Made in Indonesia.