LLM API Gateway with Intelligent Cache Optimization
LLM providers charge 10x more for cache misses vs cache hits. TokenRouter transforms your LLM infrastructure:
ββββββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ βββββββββββββββ
β Client A ββββββΆβ ββββββΆβ DeepSeek β
ββββββββββββββββ€ β TokenRouter Gateway β βββββββββββββββ€
β Client B ββββββΆβ Cache Optimization β’ Deduplication β’ Cost Tracking ββββββΆβ OpenAI β
ββββββββββββββββ€ β β βββββββββββββββ€
β Client C ββββββΆβ ββββββΆβ Anthropic β
ββββββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ βββββββββββββββ
| Problem | TokenRouter Solution | Impact |
|---|---|---|
| Low cache hit rate (<30%) | Structural convergence via Chunker + Arranger + Canonicalizer | Cache hits >70% |
| Inconsistent tool ordering | Alphabetical normalization for cross-user cache sharing | Cross-user cache sharing |
| Duplicate concurrent requests | In-memory deduplication (zero upstream calls) | Eliminate redundant calls |
| No cost visibility | Real-time Prometheus metrics (cache savings, dedup savings) | Track every dollar saved |
Result: Cache hit rates >70%, cost reduction up to 90%
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TokenRouter Performance Dashboard β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Throughput P99 Latency Cache Hit Rate Cost Savings β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β 10,000 β β <50ms β β >70% β β Up to β β
β β req/s β β β β β β 90% β β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ 95% β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Based on load testing with 10,000 concurrent requests:
| Metric | Value | Baseline | Improvement |
|---|---|---|---|
| Throughput | 10,000 req/s | 1,000 req/s | 10x |
| P99 Latency | <50ms | 200ms | 75%β |
| Cache Hit Rate | >70% | <30% | 2.3x |
| Cost Savings | Up to 90% | 0% | 90%β |
| Dedup Rate | >5% | 0% | New |
Every incoming request flows through this pipeline:
βββββββββββ βββββββββββ ββββββββββββ βββββββββββββββββ βββββββββββββββ βββββββββ ββββββββ βββββββββββ βββββββββ
βInbound ββββΆβChunker ββββΆβArranger ββββΆβCanonicalizer ββββΆβCacheInjectorββββΆβHasher ββββΆβDedup ββββΆβOutbound ββββΆβProxy β
βAdapter β β β β β β β β β β β β β βAdapter β β β
βββββββββββ βββββββββββ ββββββββββββ βββββββββββββββββ βββββββββββββββ βββββββββ ββββββββ βββββββββββ βββββββββ
β β β β β β β β
β β β β β β β β
Parse to Split into Order blocks: Deterministic Inject vendor- Compute Check Build Forward
Envelope Block types SystemβToolβ JSON serialization specific cache hashes for vendor- to upstream
HistoryβQuery directives duplicates specific format
| Component | Function | Impact | Performance |
|---|---|---|---|
| Chunker | Splits messages into System/Tool/History/Query blocks | Structured processing | <1ms |
| Arranger | Orders blocks: System β Tool (sorted) β History β Query | Cache prefix alignment | <1ms |
| Canonicalizer | Deterministic JSON serialization | Byte-perfect hash stability | <2ms |
| CacheInjector | Vendor-specific cache directives | Maximize vendor KV cache | <1ms |
| Hasher | PrefixHash (cache) + FullHash (dedup) | Intelligent routing | <1ms |
| Dedup | In-flight request deduplication | Zero redundant calls | <1ms |
Total Pipeline Overhead: <10ms (P99)
| Feature | TokenRouter | Cloudflare AI Gateway | LiteLLM |
|---|---|---|---|
| KV Cache Optimization | β Structural convergence | β Passthrough only | β Passthrough only |
| Request Deduplication | β In-memory | β No | β No |
| Tool Normalization | β Alphabetical sort | β No | β No |
| Cost Tracking | β Real-time Prometheus | ||
| Open Source | β Full | β Proprietary | β Full |
| Self-Hosted | β Yes | β Cloud only | β Yes |
| Streaming Support | β Full | β Limited | β Full |
| Multi-Provider | β DeepSeek/OpenAI/Anthropic | β Multiple | β Multiple |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Cost per 1M Tokens (USD) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Direct API Call ββββββββββββββββββββββββββββββββββ $1.00 β
β (no optimization) β β β
β β β β
β With TokenRouter ββββββ β $0.10 β
β (70% cache hit) β β β
β β β β
β Savings βββββββββββββββββββββββββββββ β 90% β β
β β β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# Clone repository
git clone https://github.com/GouBuliya/TokenRouter.git
cd TokenRouter/deployments
# Start all services
docker compose up -d
# View logs
docker compose logs -fAccess:
- TokenRouter API: http://localhost:8080
- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
# Clone repository
git clone https://github.com/GouBuliya/TokenRouter.git
cd TokenRouter
# Build
make build
# Run tests
make test
# Run locally (requires Postgres and Redis)
cp .env.example .env
# Edit .env with your API keys
make devcurl -X POST http://localhost:8080/admin/api-keys \
-H "Content-Type: application/json" \
-d '{
"name": "my-key",
"quota_usd": 100
}'Response:
{
"id": "uuid-here",
"key": "sk-tr-abc123...",
"quota_usd": 100
}
β οΈ Save the key immediately - it's only shown once!
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-tr-abc123..." \
-d '{
"model": "deepseek-v4-flash",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-tr-abc123..." \
-d '{
"model": "deepseek-v4-flash",
"messages": [
{"role": "user", "content": "What is the weather in Beijing?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}
]
}'| Variable | Description | Default | Required |
|---|---|---|---|
PORT |
HTTP server port | 8080 |
β |
DATABASE_URL |
Postgres connection string | - | β |
REDIS_URL |
Redis connection string | - | β |
DEEPSEEK_API_KEY |
DeepSeek API key | - | β |
CACHE_INJECT_ENABLED |
Enable cache injection | true |
β |
DEDUP_ENABLED |
Enable request deduplication | true |
β |
TOOL_SORT_ENABLED |
Enable tool alphabetical sorting | true |
β |
DEDUP_TTL |
Deduplication TTL | 2m |
β |
LOG_LEVEL |
Log level (debug/info/warn/error) | info |
β |
See .env.example for full list.
Development Environment
PORT=8080
LOG_LEVEL=debug
DATABASE_URL=postgres://tokenrouter:tokenrouter@localhost:5432/tokenrouter?sslmode=disable
REDIS_URL=redis://localhost:6379/0
DEEPSEEK_API_KEY=sk-xxx
DEDUP_ENABLED=true
CACHE_INJECT_ENABLED=true
RATE_LIMIT_ENABLED=false # Disable for developmentProduction Environment (Small Scale)
PORT=8080
LOG_LEVEL=warn
DATABASE_URL=postgres://user:pass@db.example.com:5432/tokenrouter?sslmode=require
REDIS_URL=redis://redis.example.com:6379/0
DEEPSEEK_API_KEY=sk-xxx
DB_MAX_OPEN_CONNS=50
DB_MAX_IDLE_CONNS=10
DB_CONN_MAX_LIFETIME=30m
AUTH_CACHE_TTL=5mProduction Environment (High Concurrency)
PORT=8080
LOG_LEVEL=error
DATABASE_URL=postgres://user:pass@db.example.com:5432/tokenrouter?sslmode=require
REDIS_URL=redis://redis-cluster.example.com:6379/0
# High concurrency settings
GLOBAL_CONCURRENT_LIMIT=10000
STREAM_CONCURRENT_LIMIT=6000
NON_STREAM_CONCURRENT_LIMIT=4000
PROVIDER_CONCURRENT_LIMIT=1000
DB_MAX_OPEN_CONNS=100
DB_MAX_IDLE_CONNS=25
DB_CONN_MAX_LIFETIME=1h
# Connection pool optimization
PROXY_MAX_IDLE_CONNS=10000
PROXY_MAX_IDLE_CONNS_PER_HOST=1000
PROXY_MAX_CONNS_PER_HOST=10000
PROXY_IDLE_CONN_TIMEOUT=90s- π Installation Guide - Complete setup instructions
- βοΈ Configuration Guide - Environment variables and tuning
- π Quick Start - Development environment setup
- π‘ Usage Examples - API call examples
- π System Architecture - Core architecture and module design
- π Adapter Design - Inbound/Outbound adapter patterns
- πΎ Cache Intelligence - Cache optimization strategies
- π‘ Chat Completions API - API endpoint specifications
- π§ Admin API - Management endpoints
- π€ Contributing Guide - How to contribute
- π§ͺ Testing Guide - End-to-end testing
- π Adapter Development - Building new provider adapters
We welcome contributions! Please see our Contributing Guide for details.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/TokenRouter.git
cd TokenRouter
# Create branch
git checkout -b feature/your-feature
# Make changes and test
make test
make lint
# Commit and push
git commit -am "feat: add your feature"
git push origin feature/your-feature
# Open Pull RequestLook for issues labeled good first issue to get started.
This project is licensed under the Apache License 2.0.
- Inspired by Cloudflare AI Gateway
- Cache optimization concepts from Anthropic
- Built with Gin and GORM
- GitHub Issues: Report bugs or request features
- Discussions: Join the conversation
- Email: Contact maintainers
- Twitter: @TokenRouter (coming soon)
- Discord: Join our community (coming soon)