All-in-one desktop management suite for local LLM inference — real-time monitoring, KV cache benchmarking, and smart Modelfile generation. Currently supports Ollama as the inference backend.
- Real-time Dashboard — VRAM usage, model status, KV cache pressure, time-series metrics via WebSocket
- KV Cache Benchmarker — Automated testing across f16/q8_0/q4_0 configurations with standardized prompts
- Smart Modelfile Generator — Hardware-aware parameter optimization with use-case templates (chat, coding, analysis, creative, agent)
```bash
# Prerequisites: Node.js >= 18, Ollama running on localhost:11434

# Install dependencies
npm install

# Start development (backend + frontend)
npm run dev

# Open http://localhost:3000
```

Monorepo with two packages:
| Package | Description | Port |
|---|---|---|
| `@inference-forge/server` | Express + WebSocket backend | 3001 |
| `@inference-forge/dashboard` | React + Vite frontend | 3000 |
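For a sense of how the dashboard's real-time metrics arrive from the backend, here is a minimal consumer sketch. The endpoint path (`/ws`) and the `VramMetrics` message shape are assumptions for illustration; check the server's `ws/` handlers for the actual protocol.

```typescript
// Minimal sketch of a WebSocket metrics consumer. The /ws path and
// VramMetrics shape are hypothetical -- see the server source for the
// real protocol.
import WebSocket from "ws";

interface VramMetrics {
  timestamp: number;      // epoch millis
  vramUsedBytes: number;  // VRAM currently in use
  vramTotalBytes: number; // total VRAM available
  loadedModels: string[]; // models currently resident
}

const ws = new WebSocket("ws://localhost:3001/ws");

ws.on("message", (raw) => {
  const metrics = JSON.parse(raw.toString()) as VramMetrics;
  const pct = (100 * metrics.vramUsedBytes) / metrics.vramTotalBytes;
  console.log(`VRAM ${pct.toFixed(1)}% | models: ${metrics.loadedModels.join(", ")}`);
});
```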
Ollama supports KV cache quantization via environment variables:
Linux / macOS:

```bash
export OLLAMA_KV_CACHE_TYPE=q8_0   # Half memory, minimal quality loss
export OLLAMA_FLASH_ATTENTION=1    # Required for KV quantization
ollama serve
```

Windows (PowerShell):

```powershell
$env:OLLAMA_KV_CACHE_TYPE = "q8_0"
$env:OLLAMA_FLASH_ATTENTION = "1"
ollama serve
```

| Type | Memory vs f16 | Quality Impact |
|---|---|---|
| f16 | 1x (default) | None |
| q8_0 | ~0.5x | Very small |
| q4_0 | ~0.25x | Small-medium |
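To see where these ratios come from: the KV cache stores one key vector and one value vector per layer per token. The sketch below computes approximate cache sizes using illustrative dimensions (roughly 7B-class with grouped-query attention); exact numbers depend on the architecture, and quantized formats carry a small per-block scale overhead on top of these figures.

```typescript
// Back-of-the-envelope KV cache sizing. Model dimensions below are
// illustrative, not taken from any specific model card.
const nLayers = 32;
const nKvHeads = 8;   // grouped-query attention: fewer KV heads than query heads
const headDim = 128;
const ctxLen = 8192;

// Approximate bytes per element; quantized types also store per-block
// scales, so real usage is slightly higher.
const bytesPerElement = { f16: 2, q8_0: 1, q4_0: 0.5 };

for (const [type, bytes] of Object.entries(bytesPerElement)) {
  // 2x for keys and values
  const total = 2 * nLayers * nKvHeads * headDim * ctxLen * bytes;
  console.log(`${type}: ${(total / 1024 ** 3).toFixed(2)} GiB`);
}
// f16: 1.00 GiB, q8_0: 0.50 GiB, q4_0: 0.25 GiB
```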
TypeScript, Node.js, Express, WebSocket, React 18, Vite, TailwindCSS, Recharts
- GPU hardware detection (NVIDIA via `nvidia-smi`, AMD via `rocm-smi`; a detection sketch follows this list)
- Per-model token throughput tracking over time
- Alert thresholds for VRAM pressure and model eviction
- Perplexity estimation via log-likelihood comparison across KV cache types
- Custom prompt sets and configurable run parameters
- Export benchmark reports to PDF and JSON
- Side-by-side model comparison charts
- Visual Modelfile editor with live preview
- Import/export Modelfile library
- Community template gallery
- One-click model creation via API
- Concurrent model orchestration dashboard
- Agent workflow builder with model routing
- Session and conversation memory management
- Resource allocation across running agents
- Advanced KV cache compression techniques (e.g. PolarQuant-style quantization) when available in llama.cpp
- Electron desktop app packaging
- Remote instance management
- Plugin system for custom metrics and tools
- Additional inference backend support (vLLM, llama.cpp server)
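The GPU detection item above can be prototyped by shelling out to the vendor tools. Below is a minimal sketch for the NVIDIA side only; the query flags are standard `nvidia-smi` options, but the `GpuMemory` shape and `detectNvidiaVram` helper are invented here for illustration (the `rocm-smi` path would need its own parsing).

```typescript
// Sketch of NVIDIA VRAM detection by shelling out to nvidia-smi.
// An illustration of the roadmap item, not code from this repository.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface GpuMemory {
  usedMiB: number;
  totalMiB: number;
}

async function detectNvidiaVram(): Promise<GpuMemory[]> {
  const { stdout } = await execFileAsync("nvidia-smi", [
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
  ]);
  // One line per GPU, e.g. "3456, 24576"
  return stdout
    .trim()
    .split("\n")
    .map((line) => {
      const [used, total] = line.split(",").map((v) => parseInt(v.trim(), 10));
      return { usedMiB: used, totalMiB: total };
    });
}

detectNvidiaVram()
  .then((gpus) => console.log(gpus))
  .catch(() => console.log("nvidia-smi not found; no NVIDIA GPU detected"));
```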
Contributions are welcome! Here's how to get started.
```bash
git clone https://github.com/DjimIT/inference-forge.git
cd inference-forge
npm install
npm run dev
```

The backend runs on http://localhost:3001 and the dashboard on http://localhost:3000, with hot reload enabled for both.
```
inference-forge/
├── packages/server/       # Express + WebSocket backend
│   └── src/
│       ├── api/           # REST API routes
│       ├── services/      # Ollama client, monitor, benchmark, modelfile
│       └── ws/            # WebSocket handlers
├── packages/dashboard/    # React + Vite frontend
│   └── src/
│       ├── components/    # UI components
│       └── hooks/         # WebSocket and API hooks
└── docs/                  # Documentation and screenshots
```
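As a hypothetical sketch of the `api/` to `services/` layering above (the route path, `OllamaClient` interface, and `createModelsRouter` factory are invented for illustration, not taken from the codebase):

```typescript
// Hypothetical route sketch showing api/ delegating to services/.
import { Router } from "express";

interface OllamaClient {
  listModels(): Promise<{ name: string; size: number }[]>;
}

export function createModelsRouter(ollama: OllamaClient): Router {
  const router = Router();

  // GET /api/models -- list models known to the local Ollama instance
  router.get("/models", async (_req, res) => {
    try {
      res.json(await ollama.listModels());
    } catch (err) {
      res.status(502).json({ error: "Ollama unreachable", detail: String(err) });
    }
  });

  return router;
}
```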
- TypeScript — all code must be fully typed, no `any` in production code
- Branching — create feature branches from `main` (e.g. `feature/gpu-detection`)
- Commits — use conventional commits (`feat:`, `fix:`, `docs:`, `refactor:`); examples follow this list
- Pull requests — include a description of what changed and why, plus testing steps
- Tests — add tests for new services and API routes (test framework TBD in v0.2)
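Conventional commit messages look like, for example (illustrative messages only):

```bash
git commit -m "feat: add per-model throughput chart"
git commit -m "fix: handle Ollama connection timeout in monitor service"
git commit -m "docs: document KV cache environment variables"
```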
Open an issue on GitHub with:
- Your OS and Node.js version
- Ollama version and running models
- Steps to reproduce the problem
- Expected vs actual behavior
MIT — DjimIT B.V.
