FlashInfer: Kernel Library for LLM Serving
GPU cluster manager for optimized AI model deployment
Analyze and generate unstructured data using LLMs, from quick experiments to billion token jobs.
Source code of the paper "Private Collaborative Edge Inference via Over-the-Air Computation".
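For readers unfamiliar with over-the-air computation, the core idea is that multiple edge devices transmit analog-encoded values simultaneously so the wireless channel's superposition delivers their aggregate to the receiver. The toy NumPy sketch below illustrates only that aggregation principle, with hypothetical per-device scaling and Gaussian receiver noise; it is not the paper's actual protocol, which additionally handles privacy and collaborative inference.

```python
import numpy as np

# Toy illustration of over-the-air aggregation (not the paper's protocol):
# each device transmits a scaled analog signal in the same time slot, and the
# receiver observes the superposition of all transmissions plus channel noise.

rng = np.random.default_rng(0)

num_devices = 8
feature_dim = 16

# Hypothetical local inference results (e.g. per-device feature vectors).
local_values = rng.normal(size=(num_devices, feature_dim))

# Simple pre-scaling so the superposed signal approximates the mean.
tx_signals = local_values / num_devices

# The multiple-access channel sums the simultaneous transmissions;
# the receiver also picks up additive Gaussian noise.
noise_std = 0.01
rx_signal = tx_signals.sum(axis=0) + rng.normal(scale=noise_std, size=feature_dim)

true_mean = local_values.mean(axis=0)
print("max aggregation error:", np.abs(rx_signal - true_mean).max())
```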
Accelerate reproducible inference experiments for large language models with LLM-D! This lab automates the setup of a complete evaluation environment on OpenShift/OKD: GPU worker pools, core operators, observability, traffic control, and ready-to-run example workloads.
Super Ollama Load Balancer - Performance-aware routing for distributed Ollama deployments with Ray, Dask, and adaptive metrics
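As a rough illustration of what "performance-aware routing" can mean here, the sketch below forwards each generation request to the Ollama backend with the lowest recent latency. The backend URLs, the model name, and the exponential moving average are assumptions for illustration only; the actual project layers Ray/Dask and richer adaptive metrics on top of this idea.

```python
import time
import requests

# Minimal sketch of performance-aware routing (illustrative assumptions only):
# track an exponential moving average of each backend's latency and send the
# next request to the currently fastest Ollama instance.

BACKENDS = ["http://node-a:11434", "http://node-b:11434"]  # hypothetical hosts
ewma = {url: 1.0 for url in BACKENDS}  # seconds; optimistic initial estimate
ALPHA = 0.3  # smoothing factor for the moving average

def generate(prompt: str, model: str = "llama3") -> str:
    url = min(BACKENDS, key=lambda u: ewma[u])  # pick fastest backend so far
    start = time.monotonic()
    resp = requests.post(
        f"{url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    ewma[url] = ALPHA * elapsed + (1 - ALPHA) * ewma[url]  # update latency estimate
    return resp.json()["response"]

if __name__ == "__main__":
    print(generate("Summarize distributed inference in one sentence."))
```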
Official implementation of the ACM MM paper "Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds". A distributed inference model for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system, with joint training of pedestrian attribute recognition and re-ID.
A comprehensive framework for multi-node, multi-GPU scalable LLM inference on HPC systems using vLLM and Ollama. Includes distributed deployment templates, benchmarking workflows, and chatbot/RAG pipelines for high-throughput, production-grade AI services.
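As a hedged sketch of the single-node building block such a framework would wrap, the snippet below runs offline batch inference with vLLM, sharding one model across several GPUs via tensor parallelism. The checkpoint name and GPU count are placeholders; multi-node deployment, benchmarking, and the RAG pipelines mentioned above rely on the project's own templates rather than this snippet.

```python
from vllm import LLM, SamplingParams

# Offline batch inference sketch: shard one model across multiple GPUs on a
# single node with tensor parallelism. Model and GPU count are placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed example checkpoint
    tensor_parallel_size=4,                    # split weights over 4 GPUs
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain tensor parallelism in two sentences.",
    "List three metrics to benchmark an LLM serving stack.",
]

# vLLM batches and schedules the prompts internally for high throughput.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```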
Encrypted Decentralized Inference and Learning (E.D.I.L.)