Tags: Leeaandrob/neurogrid
v0.3.0 Distributed Inference - Multi-GPU Pipeline Parallelism

Major Features:
- Coordinator/worker architecture for distributed inference
- P2P weight distribution via libp2p
- Remote layer execution across multiple GPUs/machines
- Mistral 7B model support with chat templates
- Llama 2 13B benchmarks on a distributed setup

Infrastructure:
- Network notifee for automatic worker detection
- `--skip-weight-transfer` flag for pre-loaded workers
- `--bootstrap` flag for explicit peer connection
- Generic HuggingFace model download (`make download REPO=org/model`)

Performance:
- Llama 2 7B: ~5.2 tokens/sec (single RTX 4090)
- Llama 2 13B: ~3.1 tokens/sec (distributed RTX 4090 + GH200)

Testing:
- 1311 lines of distributed inference E2E tests
- 688 lines of model loader tests
- Full router and scheduler test coverage
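The coordinator/worker pipeline-parallel split above can be illustrated with a small sketch. This is a toy, not NeuroGrid's actual code: the `Worker` class, `partition_layers`, and `run_pipeline` names are hypothetical, and the "layers" are plain callables standing in for transformer layers on remote GPUs. It shows the core idea: the coordinator assigns each worker a contiguous slice of the model's layers, then passes activations through each stage in order.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    """Stand-in for a remote GPU worker holding a contiguous slice of layers."""
    name: str
    layers: list  # each layer is a callable: hidden_state -> hidden_state

    def forward(self, hidden):
        # In a real deployment this would execute on the worker's GPU.
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

def partition_layers(layers, num_workers):
    """Split layers into contiguous chunks, giving any remainder to the earliest stages."""
    base, extra = divmod(len(layers), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(layers[start:start + size])
        start += size
    return chunks

def run_pipeline(workers, hidden):
    """Coordinator: route activations through each worker's stage in order."""
    for worker in workers:
        hidden = worker.forward(hidden)
    return hidden

# Toy "model": 32 identical layers that each add 1, split across two workers.
layers = [(lambda h: h + 1) for _ in range(32)]
stages = partition_layers(layers, 2)
workers = [Worker(name, stage) for name, stage in zip(["gpu0", "gpu1"], stages)]
print(run_pipeline(workers, 0))  # 32 (each of the 32 layers adds 1)
```

In the real system the per-stage `forward` call would be a remote execution request to a worker discovered over libp2p, rather than an in-process function call.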