Troubleshooting

Common problems and solutions for OmniRoute.

Quick Fixes

Problem	Solution
First login not working	Set `INITIAL_PASSWORD` in `.env` (no hardcoded default)
Dashboard opens on wrong port	Set `PORT=20128` and `NEXT_PUBLIC_BASE_URL=http://localhost:20128`
No request logs under `logs/`	Set `ENABLE_REQUEST_LOGS=true`
EACCES: permission denied	Set `DATA_DIR=/path/to/writable/dir` to override `~/.omniroute`
Routing strategy not saving	Update to v1.4.11+ (Zod schema fix for settings persistence)

Provider Issues

"Language model did not provide messages"

Cause: Provider quota exhausted.

Fix:

Check dashboard quota tracker
Use a combo with fallback tiers
Switch to cheaper/free tier

Rate Limiting

Cause: Subscription quota exhausted.

Fix:

Add fallback: cc/claude-opus-4-6 → glm/glm-4.7 → if/kimi-k2-thinking
Use GLM/MiniMax as cheap backup

OAuth Token Expired

OmniRoute auto-refreshes tokens. If issues persist:

Dashboard → Provider → Reconnect
Delete and re-add the provider connection

Cloud Issues

Cloud Sync Errors

Verify BASE_URL points to your running instance (e.g., http://localhost:20128)
Verify CLOUD_URL points to your cloud endpoint (e.g., https://omniroute.dev)
Keep NEXT_PUBLIC_* values aligned with server-side values

Cloud `stream=false` Returns 500

Symptom: Unexpected token 'd'... on cloud endpoint for non-streaming calls.

Cause: Upstream returns SSE payload while client expects JSON.

Workaround: Use stream=true for cloud direct calls. Local runtime includes SSE→JSON fallback.

Cloud Says Connected but "Invalid API key"

Create a fresh key from local dashboard (/api/keys)
Run cloud sync: Enable Cloud → Sync Now
Old/non-synced keys can still return 401 on cloud

Docker Issues

CLI Tool Shows Not Installed

Check runtime fields: curl http://localhost:20128/api/cli-tools/runtime/codex | jq
For portable mode: use image target runner-cli (bundled CLIs)
For host mount mode: set CLI_EXTRA_PATHS and mount host bin directory as read-only
If installed=true and runnable=false: binary was found but failed healthcheck

Quick Runtime Validation

curl -s http://localhost:20128/api/cli-tools/codex-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/claude-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/openclaw-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'

Cost Issues

High Costs

Check usage stats in Dashboard → Usage
Switch primary model to GLM/MiniMax
Use free tier (Gemini CLI, iFlow) for non-critical tasks
Set cost budgets per API key: Dashboard → API Keys → Budget

Debugging

Enable Request Logs

Set ENABLE_REQUEST_LOGS=true in your .env file. Logs appear under logs/ directory.

Check Provider Health

# Health dashboard
http://localhost:20128/dashboard/health

# API health check
curl http://localhost:20128/api/monitoring/health

Runtime Storage

Main state: ${DATA_DIR}/storage.sqlite (providers, combos, aliases, keys, settings)
Usage: SQLite tables in storage.sqlite (usage_history, call_logs, proxy_logs) + optional ${DATA_DIR}/log.txt and ${DATA_DIR}/call_logs/
Request logs: <repo>/logs/... (when ENABLE_REQUEST_LOGS=true)

Circuit Breaker Issues

Provider stuck in OPEN state

When a provider's circuit breaker is OPEN, requests are blocked until the cooldown expires.

Fix:

Go to Dashboard → Settings → Resilience
Check the circuit breaker card for the affected provider
Click Reset All to clear all breakers, or wait for the cooldown to expire
Verify the provider is actually available before resetting

Provider keeps tripping the circuit breaker

If a provider repeatedly enters OPEN state:

Check Dashboard → Health → Provider Health for the failure pattern
Go to Settings → Resilience → Provider Profiles and increase the failure threshold
Check if the provider has changed API limits or requires re-authentication
Review latency telemetry — high latency may cause timeout-based failures

Audio Transcription Issues

"Unsupported model" error

Ensure you're using the correct prefix: deepgram/nova-3 or assemblyai/best
Verify the provider is connected in Dashboard → Providers

Transcription returns empty or fails

Check supported audio formats: mp3, wav, m4a, flac, ogg, webm
Verify file size is within provider limits (typically < 25MB)
Check provider API key validity in the provider card

Translator Debugging

Use Dashboard → Translator to debug format translation issues:

Mode	When to Use
Playground	Compare input/output formats side by side — paste a failing request to see how it translates
Chat Tester	Send live messages and inspect the full request/response payload including headers
Test Bench	Run batch tests across format combinations to find which translations are broken
Live Monitor	Watch real-time request flow to catch intermittent translation issues

Common format issues

Thinking tags not appearing — Check if the target provider supports thinking and the thinking budget setting
Tool calls dropping — Some format translations may strip unsupported fields; verify in Playground mode
System prompt missing — Claude and Gemini handle system prompts differently; check translation output
SDK returns raw string instead of object — Fixed in v1.1.0: response sanitizer now strips non-standard fields (x_groq, usage_breakdown, etc.) that cause OpenAI SDK Pydantic validation failures
GLM/ERNIE rejects system role — Fixed in v1.1.0: role normalizer automatically merges system messages into user messages for incompatible models
developer role not recognized — Fixed in v1.1.0: automatically converted to system for non-OpenAI providers
json_schema not working with Gemini — Fixed in v1.1.0: response_format is now converted to Gemini's responseMimeType + responseSchema

Resilience Settings

Auto rate-limit not triggering

Auto rate-limit only applies to API key providers (not OAuth/subscription)
Verify Settings → Resilience → Provider Profiles has auto-rate-limit enabled
Check if the provider returns 429 status codes or Retry-After headers

Tuning exponential backoff

Provider profiles support these settings:

Base delay — Initial wait time after first failure (default: 1s)
Max delay — Maximum wait time cap (default: 30s)
Multiplier — How much to increase delay per consecutive failure (default: 2x)

Anti-thundering herd

When many concurrent requests hit a rate-limited provider, OmniRoute uses mutex + auto rate-limiting to serialize requests and prevent cascading failures. This is automatic for API key providers.

Optional RAG / LLM failure taxonomy (16 problems)

Some OmniRoute users place the gateway in front of RAG or agent stacks. In those setups it is common to see a strange pattern: OmniRoute looks healthy (providers up, routing profiles ok, no rate limit alerts) but the final answer is still wrong.

In practice these incidents usually come from the downstream RAG pipeline, not from the gateway itself.

If you want a shared vocabulary to describe those failures you can use the WFGY ProblemMap, an external MIT license text resource that defines sixteen recurring RAG / LLM failure patterns. At a high level it covers:

retrieval drift and broken context boundaries
empty or stale indexes and vector stores
embedding versus semantic mismatch
prompt assembly and context window issues
logic collapse and overconfident answers
long chain and agent coordination failures
multi agent memory and role drift
deployment and bootstrap ordering problems

The idea is simple:

When you investigate a bad response, capture:
- user task and request
- route or provider combo in OmniRoute
- any RAG context used downstream (retrieved documents, tool calls, etc)
Map the incident to one or two WFGY ProblemMap numbers (No.1 … No.16).
Store the number in your own dashboard, runbook, or incident tracker next to the OmniRoute logs.
Use the corresponding WFGY page to decide whether you need to change your RAG stack, retriever, or routing strategy.

Full text and concrete recipes live here (MIT license, text only):

WFGY ProblemMap README

You can ignore this section if you do not run RAG or agent pipelines behind OmniRoute.

Still Stuck?

GitHub Issues: github.com/diegosouzapw/OmniRoute/issues
Architecture: See docs/ARCHITECTURE.md for internal details
API Reference: See docs/API_REFERENCE.md for all endpoints
Health Dashboard: Check Dashboard → Health for real-time system status
Translator: Use Dashboard → Translator to debug format issues

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Troubleshooting

Quick Fixes

Provider Issues

"Language model did not provide messages"

Rate Limiting

OAuth Token Expired

Cloud Issues

Cloud Sync Errors

Cloud `stream=false` Returns 500

Cloud Says Connected but "Invalid API key"

Docker Issues

CLI Tool Shows Not Installed

Quick Runtime Validation

Cost Issues

High Costs

Debugging

Enable Request Logs

Check Provider Health

Runtime Storage

Circuit Breaker Issues

Provider stuck in OPEN state

Provider keeps tripping the circuit breaker

Audio Transcription Issues

"Unsupported model" error

Transcription returns empty or fails

Translator Debugging

Common format issues

Resilience Settings

Auto rate-limit not triggering

Tuning exponential backoff

Anti-thundering herd

Optional RAG / LLM failure taxonomy (16 problems)

Still Stuck?

FilesExpand file tree

TROUBLESHOOTING.md

Latest commit

History

TROUBLESHOOTING.md

File metadata and controls

Troubleshooting

Quick Fixes

Provider Issues

"Language model did not provide messages"

Rate Limiting

OAuth Token Expired

Cloud Issues

Cloud Sync Errors

Cloud stream=false Returns 500

Cloud Says Connected but "Invalid API key"

Docker Issues

CLI Tool Shows Not Installed

Quick Runtime Validation

Cost Issues

High Costs

Debugging

Enable Request Logs

Check Provider Health

Runtime Storage

Circuit Breaker Issues

Provider stuck in OPEN state

Provider keeps tripping the circuit breaker

Audio Transcription Issues

"Unsupported model" error

Transcription returns empty or fails

Translator Debugging

Common format issues

Resilience Settings

Auto rate-limit not triggering

Tuning exponential backoff

Anti-thundering herd

Optional RAG / LLM failure taxonomy (16 problems)

Still Stuck?

Cloud `stream=false` Returns 500