All-in-one desktop management suite for local LLM inference — real-time monitoring, KV cache benchmarking, and smart Modelfile generation. Currently supports Ollama as the inference backend.
- Real-time Dashboard — VRAM usage, model status, KV cache pressure, time-series metrics via WebSocket
- KV Cache Benchmarker — Automated testing across f16/q8_0/q4_0 configurations with standardized prompts
- Smart Modelfile Generator — Hardware-aware parameter optimization with use-case templates (chat, coding, analysis, creative, agent)
```bash
# Prerequisites: Node.js >= 18, Ollama running on localhost:11434

# Install dependencies
npm install

# Start development (backend + frontend)
npm run dev

# Open http://localhost:3000
```

Monorepo with two packages:
| Package | Description | Port |
|---|---|---|
| `@inference-forge/server` | Express + WebSocket backend | 3001 |
| `@inference-forge/dashboard` | React + Vite frontend | 3000 |
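For a sense of how the dashboard's real-time metrics arrive from the backend, here is a minimal consumer sketch. The endpoint path (`/ws`) and the `VramMetrics` message shape are assumptions for illustration; check the server's `ws/` handlers for the actual protocol.

```typescript
// Minimal sketch of a WebSocket metrics consumer. The /ws path and
// VramMetrics shape are hypothetical -- see the server source for the
// real protocol.
import WebSocket from "ws";

interface VramMetrics {
  timestamp: number;      // epoch millis
  vramUsedBytes: number;  // VRAM currently in use
  vramTotalBytes: number; // total VRAM available
  loadedModels: string[]; // models currently resident
}

const ws = new WebSocket("ws://localhost:3001/ws");

ws.on("message", (raw) => {
  const metrics = JSON.parse(raw.toString()) as VramMetrics;
  const pct = (100 * metrics.vramUsedBytes) / metrics.vramTotalBytes;
  console.log(`VRAM ${pct.toFixed(1)}% | models: ${metrics.loadedModels.join(", ")}`);
});
```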
Ollama supports KV cache quantization via environment variables:
Linux / macOS:

```bash
export OLLAMA_KV_CACHE_TYPE=q8_0   # Half memory, minimal quality loss
export OLLAMA_FLASH_ATTENTION=1    # Required for KV quantization
ollama serve
```

Windows (PowerShell):

```powershell
$env:OLLAMA_KV_CACHE_TYPE = "q8_0"
$env:OLLAMA_FLASH_ATTENTION = "1"
ollama serve
```

| Type | Memory vs f16 | Quality Impact |
|---|---|---|
| f16 | 1x (default) | None |
| q8_0 | ~0.5x | Very small |
| q4_0 | ~0.25x | Small-medium |
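To see where these ratios come from: the KV cache stores one key vector and one value vector per layer per token. The sketch below computes approximate cache sizes using illustrative dimensions (roughly 7B-class with grouped-query attention); exact numbers depend on the architecture, and quantized formats carry a small per-block scale overhead on top of these figures.

```typescript
// Back-of-the-envelope KV cache sizing. Model dimensions below are
// illustrative, not taken from any specific model card.
const nLayers = 32;
const nKvHeads = 8;   // grouped-query attention: fewer KV heads than query heads
const headDim = 128;
const ctxLen = 8192;

// Approximate bytes per element; quantized types also store per-block
// scales, so real usage is slightly higher.
const bytesPerElement = { f16: 2, q8_0: 1, q4_0: 0.5 };

for (const [type, bytes] of Object.entries(bytesPerElement)) {
  // 2x for keys and values
  const total = 2 * nLayers * nKvHeads * headDim * ctxLen * bytes;
  console.log(`${type}: ${(total / 1024 ** 3).toFixed(2)} GiB`);
}
// f16: 1.00 GiB, q8_0: 0.50 GiB, q4_0: 0.25 GiB
```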
TypeScript, Node.js, Express, WebSocket, React 18, Vite, TailwindCSS, Recharts
- GPU hardware detection (NVIDIA via `nvidia-smi`, AMD via `rocm-smi`; a detection sketch follows this list)
- Per-model token throughput tracking over time
- Alert thresholds for VRAM pressure and model eviction
- Perplexity estimation via log-likelihood comparison across KV cache types
- Custom prompt sets and configurable run parameters
- Export benchmark reports to PDF and JSON
- Side-by-side model comparison charts
- Visual Modelfile editor with live preview
- Import/export Modelfile library
- Community template gallery
- One-click model creation via API
- Concurrent model orchestration dashboard
- Agent workflow builder with model routing
- Session and conversation memory management
- Resource allocation across running agents
- Advanced KV cache compression techniques (e.g. PolarQuant-style quantization) when available in llama.cpp
- Electron desktop app packaging
- Remote instance management
- Plugin system for custom metrics and tools
- Additional inference backend support (vLLM, llama.cpp server)
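The GPU detection item above can be prototyped by shelling out to the vendor tools. Below is a minimal sketch for the NVIDIA side only; the query flags are standard `nvidia-smi` options, but the `GpuMemory` shape and `detectNvidiaVram` helper are invented here for illustration (the `rocm-smi` path would need its own parsing).

```typescript
// Sketch of NVIDIA VRAM detection by shelling out to nvidia-smi.
// An illustration of the roadmap item, not code from this repository.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface GpuMemory {
  usedMiB: number;
  totalMiB: number;
}

async function detectNvidiaVram(): Promise<GpuMemory[]> {
  const { stdout } = await execFileAsync("nvidia-smi", [
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
  ]);
  // One line per GPU, e.g. "3456, 24576"
  return stdout
    .trim()
    .split("\n")
    .map((line) => {
      const [used, total] = line.split(",").map((v) => parseInt(v.trim(), 10));
      return { usedMiB: used, totalMiB: total };
    });
}

detectNvidiaVram()
  .then((gpus) => console.log(gpus))
  .catch(() => console.log("nvidia-smi not found; no NVIDIA GPU detected"));
```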
Contributions are welcome! Here's how to get started.
```bash
git clone https://github.com/DjimIT/inference-forge.git
cd inference-forge
npm install
npm run dev
```

The backend runs on http://localhost:3001 and the dashboard on http://localhost:3000, with hot reload enabled for both.
```
inference-forge/
├── packages/server/       # Express + WebSocket backend
│   └── src/
│       ├── api/           # REST API routes
│       ├── services/      # Ollama client, monitor, benchmark, modelfile
│       └── ws/            # WebSocket handlers
├── packages/dashboard/    # React + Vite frontend
│   └── src/
│       ├── components/    # UI components
│       └── hooks/         # WebSocket and API hooks
└── docs/                  # Documentation and screenshots
```
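As a hypothetical sketch of the `api/` to `services/` layering above (the route path, `OllamaClient` interface, and `createModelsRouter` factory are invented for illustration, not taken from the codebase):

```typescript
// Hypothetical route sketch showing api/ delegating to services/.
import { Router } from "express";

interface OllamaClient {
  listModels(): Promise<{ name: string; size: number }[]>;
}

export function createModelsRouter(ollama: OllamaClient): Router {
  const router = Router();

  // GET /api/models -- list models known to the local Ollama instance
  router.get("/models", async (_req, res) => {
    try {
      res.json(await ollama.listModels());
    } catch (err) {
      res.status(502).json({ error: "Ollama unreachable", detail: String(err) });
    }
  });

  return router;
}
```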
- TypeScript — all code must be fully typed, no `any` in production code
- Branching — create feature branches from `main` (e.g. `feature/gpu-detection`)
- Commits — use conventional commits (`feat:`, `fix:`, `docs:`, `refactor:`); examples follow this list
- Pull requests — include a description of what changed and why, plus testing steps
- Tests — add tests for new services and API routes (test framework TBD in v0.2)
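Conventional commit messages look like, for example (illustrative messages only):

```bash
git commit -m "feat: add per-model throughput chart"
git commit -m "fix: handle Ollama connection timeout in monitor service"
git commit -m "docs: document KV cache environment variables"
```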
Open an issue on GitHub with:
- Your OS and Node.js version
- Ollama version and running models
- Steps to reproduce the problem
- Expected vs actual behavior
MIT — DjimIT B.V.
