fix(provider): restore native OpenRouter endpoint when switching back from a direct profile #280

Open
prateekjain-afk wants to merge 5 commits into 1jehuang:master from prateekjain-afk:fix/openrouter-nvidia-endpoint-leak

@prateekjain-afk

Problem

Switching models from OpenRouter to NVIDIA NIM and then back to a native OpenRouter model (e.g. openrouter/owl-alpha) fails with a 404 because the request is sent to the wrong endpoint:

endpoint: https://integrate.api.nvidia.com/v1/chat/completions
model:    openrouter/owl-alpha
auth:     NVIDIA_API_KEY
→ status: 404 Not Found  (404 page not found)

Root cause

Selecting a built-in OpenAI-compatible profile (e.g. NVIDIA NIM via nvidia-nim:...) calls force_apply_openai_compatible_profile_env(Some(profile)), which stamps that profile's endpoint + API key into the process-global JCODE_OPENROUTER_* env vars. The ActiveProvider::OpenRouter arm of set_model never cleared those overrides when switching back, so a native OpenRouter model was POSTed to the stale profile endpoint with the wrong key.

Because the leak lives in process-global env, even brand-new sessions kept failing until the server was fully restarted. Other providers (Claude/OpenAI/Gemini/Copilot) are unaffected because they use self-contained providers and never touch this shared env.

Fix

In the OpenRouter set_model arm: when the previous selection was a built-in direct profile (profile_id.is_some()) and the target is a native openrouter.ai catalog model (id starts with openrouter/), reset the profile env to None and rebuild the provider so it talks to the native endpoint again.

Deliberately left untouched:

  • raw/custom endpoints configured directly via JCODE_OPENROUTER_API_BASE (profile_id == None)
  • @provider-pinned or opaque model ids on forced-OpenRouter providers
  • locked named profiles (JCODE_PROVIDER_PROFILE_ACTIVE)
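
The guard described above can be sketched as a small pure function. This is a minimal illustration, not the code in the PR: the `Selection` struct, its `profile_locked` field, and the function name `should_restore_native_endpoint` are hypothetical, and only the two conditions (`profile_id.is_some()` and the `openrouter/` prefix) come from the description.

```rust
/// Hypothetical snapshot of the previous OpenRouter-side selection.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Selection {
    /// `Some(id)` when a built-in direct profile (e.g. NVIDIA NIM) was active;
    /// `None` for native OpenRouter or a raw JCODE_OPENROUTER_API_BASE endpoint.
    profile_id: Option<String>,
    /// Assumption: true when a named profile is locked via
    /// JCODE_PROVIDER_PROFILE_ACTIVE and must not be reset.
    profile_locked: bool,
}

/// Returns true when the profile env should be reset to `None` and the
/// provider rebuilt so it talks to the native endpoint again.
fn should_restore_native_endpoint(prev: &Selection, target_model: &str) -> bool {
    let was_direct_profile = prev.profile_id.is_some();
    let is_native_catalog_model = target_model.starts_with("openrouter/");
    was_direct_profile && is_native_catalog_model && !prev.profile_locked
}
```

With this shape, a raw endpoint (`profile_id == None`) never triggers a reset, which matches the "deliberately left untouched" list; how the real code detects @provider-pinned ids is not shown here.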

Tests

Adds test_switch_back_to_native_openrouter_restores_endpoint_after_nvidia, which reproduces the OpenRouter → NVIDIA → OpenRouter switch-back and asserts the endpoint override is cleared. It fails without the fix with the exact integrate.api.nvidia.com URL, and passes with it. Verified no regressions in the provider test suite (the only remaining failures are pre-existing parallel-env-contamination flakes present on master that pass in isolation).

… from a direct profile

Selecting a built-in OpenAI-compatible profile (e.g. NVIDIA NIM via
"nvidia-nim:...") calls force_apply_openai_compatible_profile_env(Some(profile)),
which stamps that profile's endpoint and API key into the global
JCODE_OPENROUTER_* env. Switching back to a native OpenRouter catalog model
("openrouter/owl-alpha") never cleared those overrides, so the native model
was POSTed to the stale profile endpoint (https://integrate.api.nvidia.com/v1)
with the wrong key and returned 404. Because the leak lives in process-global
env, even brand-new sessions kept failing until the server was restarted.

Fix: in the OpenRouter set_model arm, when the previous selection was a built-in
direct profile (profile_id is Some) and the target is a native openrouter.ai
catalog model, reset the profile env to None and rebuild the provider so it
talks to the native endpoint again. Raw/custom JCODE_OPENROUTER_API_BASE
endpoints (profile_id == None), @-pinned ids, and locked named profiles are
deliberately left untouched.

Adds a regression test that reproduces the OpenRouter -> NVIDIA -> OpenRouter
switch-back and asserts the endpoint override is cleared (fails without the fix
with the exact integrate.api.nvidia.com URL).

Spawned swarm agents got stuck forever at 'startup queued' because the
default spawn mode was Visible: the server forks a terminal launcher
(e.g. 'open -a Terminal'), the fork succeeds, but on a server/headless
host (jcode serve shared server, no GUI) no interactive client ever
attaches to drive the agent loop. The member sits 'running / startup
queued', DMs land in an unread mailbox, and wake/resume fail because no
task ever ran.

Fixes:
- Auto mode now verifies a visible launch actually produced a live
  client attachment (SwarmMember.event_txs becomes non-empty) within a
  short timeout; if not, it tears down the orphaned visible session and
  falls back to the in-process headless runner, which always executes.
- register_visible_spawned_member no longer clobbers a member that a
  real client already attached to (avoids a race when a client connects
  during the Auto attach-wait window).
- Default swarm_spawn_mode changed Visible -> Auto so swarm works out of
  the box on both desktop and headless hosts.

Adds unit tests for attach detection, timeout fallback, and the
non-clobber guard.
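
The attach-wait and fallback behaviour can be sketched roughly as follows. This is a synchronous, std-only illustration under stated assumptions: the real code is likely async and inspects SwarmMember.event_txs, whereas here the attachment check is an injected closure, and `SpawnOutcome`, `wait_for_attachment`, and `spawn_auto` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Hypothetical result of an Auto-mode spawn.
#[derive(Debug, PartialEq, Eq)]
enum SpawnOutcome {
    Visible,
    HeadlessFallback,
}

/// Polls `is_attached` until it returns true or `timeout` elapses.
fn wait_for_attachment(
    mut is_attached: impl FnMut() -> bool,
    timeout: Duration,
    poll: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if is_attached() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(poll);
    }
}

/// Keeps the visible session if a client attached in time; otherwise
/// reports that the caller should fall back to the headless runner.
fn spawn_auto(is_attached: impl FnMut() -> bool, timeout: Duration) -> SpawnOutcome {
    if wait_for_attachment(is_attached, timeout, Duration::from_millis(5)) {
        SpawnOutcome::Visible
    } else {
        // The real implementation would tear down the orphaned visible
        // session here before starting the in-process headless runner.
        SpawnOutcome::HeadlessFallback
    }
}
```

Passing the check as a closure keeps the timeout logic testable without a terminal or a running server.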

prateekjain-afk force-pushed the fix/openrouter-nvidia-endpoint-leak branch from 00f546a to 8e35c8d on May 29, 2026 14:02

…ng/master

Provides a safe, one-command way to pull upstream (origin/master) updates
promptly while keeping local fix commits on top. Tags a backup before
rewriting history and aborts cleanly on conflict.
…ing "action missing" label

Two swarm UX bugs surfaced when running a research swarm on a headless
`jcode serve` shared server:

1. Auto spawn opened a useless bare-jcode Terminal window per child and
   then waited out the 8s attach timeout before falling back to headless.
   Now Auto checks up-front whether the requesting coordinator itself has
   a live interactive client (event_txs). If not (headless server), it
   skips the visible attempt entirely and spawns the child headless
   immediately, eliminating the orphan window and the per-spawn delay.
   The post-launch wait_for_live_attachment safety net is retained for
   the case where an attached coordinator opens a child window that fails
   to attach.

2. The TUI rendered swarm/memory/initiative/side_panel tool calls as
   "action missing" (with a warning logged) whenever the streamed tool
   input had not yet populated its arguments (empty object). This flashed
   "swarm action missing" for every spawned agent. Added
   tool_input_is_unpopulated + resolve_tool_action_for_display so an
   unpopulated/streaming call shows a neutral "…" without logging, while
   a genuinely malformed (populated-but-action-less) call still surfaces
   the diagnostic.

Adds 6 unit tests (3 in jcode-app-core, 3 in jcode-tui).
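
The display rule from item 2 can be sketched like this. The function names come from the commit message, but their signatures are assumptions, and a `BTreeMap<String, String>` stands in for the streamed JSON tool input so the example stays std-only; `ActionDisplay` is a hypothetical type.

```rust
use std::collections::BTreeMap;

/// Stand-in for the streamed JSON tool input (assumption: a flat object).
type ToolInput = BTreeMap<String, String>;

/// What the TUI would show for a tool call's action.
#[derive(Debug, PartialEq, Eq)]
enum ActionDisplay {
    /// A populated call with an action, e.g. "spawn".
    Action(String),
    /// Arguments have not streamed in yet: show a neutral "…", log nothing.
    Pending,
    /// Populated but action-less: surface the "action missing" diagnostic.
    Missing,
}

/// True while the streamed input is still an empty object.
fn tool_input_is_unpopulated(input: &ToolInput) -> bool {
    input.is_empty()
}

/// Distinguishes "still streaming" from "genuinely malformed".
fn resolve_tool_action_for_display(input: &ToolInput) -> ActionDisplay {
    if tool_input_is_unpopulated(input) {
        return ActionDisplay::Pending;
    }
    match input.get("action") {
        Some(action) => ActionDisplay::Action(action.clone()),
        None => ActionDisplay::Missing,
    }
}
```

Checking for the empty object first is what keeps the warning from firing on every spawned agent while its arguments are still arriving.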
…ed serve server

The previous Auto gate keyed off the coordinator session having a live
event channel, but an interactive coordinator attached to a *detached*
`jcode serve` shared server still reports as attached, so visible spawns
were still attempted (orphan window + 8s attach-timeout per child).

Switch the signal to whether THIS process has a controlling TTY: a
detached `jcode serve` server has none (ps TTY `??`), while an
interactive jcode/desktop run by a user does. When detached, Auto spawns
children headless directly. Add JCODE_SWARM_FORCE_VISIBLE=1 escape hatch.
session_has_live_attachment is retained under #[cfg(test)].

Adds running_as_detached_server_respects_force_visible_override test.
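
A rough sketch of the detached-server decision, assuming the TTY check and the override are combined in one predicate. The name `running_as_detached_server` is taken from the test name above, but its signature is an assumption; both inputs are passed in so the logic can be tested without touching the process environment. `stdin_is_terminal` is a hypothetical helper and only approximates "has a controlling TTY" by asking whether stdin is a terminal.

```rust
use std::io::IsTerminal;

/// Returns true when Auto should spawn children headless directly.
/// `force_visible` is the value of JCODE_SWARM_FORCE_VISIBLE, if set.
fn running_as_detached_server(has_controlling_tty: bool, force_visible: Option<&str>) -> bool {
    // Escape hatch: JCODE_SWARM_FORCE_VISIBLE=1 always allows visible spawns.
    if force_visible == Some("1") {
        return false;
    }
    !has_controlling_tty
}

/// Approximation only: checks whether stdin is a terminal, which is not
/// the same as having a controlling TTY but is available in std.
fn stdin_is_terminal() -> bool {
    std::io::stdin().is_terminal()
}
```

A caller could wire it up as `running_as_detached_server(stdin_is_terminal(), std::env::var("JCODE_SWARM_FORCE_VISIBLE").ok().as_deref())`.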