Tags: Leeaandrob/neurogrid

v0.3.0

v0.3.0 Distributed Inference - Multi-GPU Pipeline Parallelism

Major Features:
- Coordinator/Worker architecture for distributed inference
- P2P weight distribution via libp2p
- Remote layer execution across multiple GPUs/machines
- Mistral 7B model support with chat templates
- Llama 2 13B benchmarks on distributed setup
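The pipeline parallelism described above implies assigning each worker a contiguous slice of the model's transformer layers. A minimal sketch of such a partition, assuming an even contiguous split (the function name and scheme are hypothetical, not taken from the neurogrid codebase):

```go
package main

import "fmt"

// partitionLayers splits numLayers transformer layers into contiguous
// [start, end) ranges, one per worker, distributing any remainder to
// the earliest workers. Hypothetical helper for illustration only.
func partitionLayers(numLayers, numWorkers int) [][2]int {
	ranges := make([][2]int, 0, numWorkers)
	base := numLayers / numWorkers
	extra := numLayers % numWorkers
	start := 0
	for w := 0; w < numWorkers; w++ {
		size := base
		if w < extra {
			size++
		}
		ranges = append(ranges, [2]int{start, start + size})
		start += size
	}
	return ranges
}

func main() {
	// Llama 2 13B has 40 transformer layers; split across 2 machines.
	fmt.Println(partitionLayers(40, 2)) // [[0 20] [20 40]]
}
```

Each worker would then execute only its range and forward activations to the holder of the next range, which is what "remote layer execution" amounts to in a coordinator/worker setup.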

Infrastructure:
- Network notifee for automatic worker detection
- --skip-weight-transfer flag for pre-loaded workers
- --bootstrap flag for explicit peer connection
- Generic HuggingFace model download (make download REPO=org/model)
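The two worker flags above could be wired up with Go's standard `flag` package; the flag names match the release notes, but the function, defaults, and the example multiaddr are assumptions for illustration:

```go
package main

import (
	"flag"
	"fmt"
)

// parseWorkerFlags is a hypothetical sketch of parsing the worker CLI
// flags named in the release notes. Only the flag names come from the
// notes; defaults and help text are assumed.
func parseWorkerFlags(args []string) (skipWeightTransfer bool, bootstrap string, err error) {
	fs := flag.NewFlagSet("worker", flag.ContinueOnError)
	fs.BoolVar(&skipWeightTransfer, "skip-weight-transfer", false,
		"weights are already present locally; skip P2P weight transfer")
	fs.StringVar(&bootstrap, "bootstrap", "",
		"multiaddr of a peer to connect to explicitly at startup")
	err = fs.Parse(args)
	return
}

func main() {
	// Placeholder multiaddr for illustration only.
	skip, boot, _ := parseWorkerFlags([]string{
		"--skip-weight-transfer",
		"--bootstrap", "/ip4/10.0.0.2/tcp/4001",
	})
	fmt.Println(skip, boot)
}
```

A pre-loaded worker would pass `--skip-weight-transfer` to avoid re-downloading weights over libp2p, and `--bootstrap` to dial a known coordinator instead of waiting for discovery.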

Performance:
- Llama 2 7B: ~5.2 tokens/sec (single RTX 4090)
- Llama 2 13B: ~3.1 tokens/sec (distributed RTX 4090 + GH200)

Testing:
- 1311 lines of distributed inference E2E tests
- 688 lines of model loader tests
- Full router and scheduler test coverage

v0.2.0

v0.2.0 HyperGrid Foundation - SSE streaming, Prometheus metrics, NAT traversal, peer reconnection