feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #692

Open
heavygee wants to merge 9 commits into tiann:main from heavygee:feat/pluggable-voice-backend

Conversation

@heavygee
Contributor

@heavygee heavygee commented May 25, 2026

Overview

Rebased and completed @Overbaker's #401 onto current main after it went fallow (~4 weeks). All original design and implementation credit belongs to @Overbaker and @TennyDDDD — I've only done the merge work and fixed up the test runner.

Adds a pluggable voice backend architecture extending the existing ElevenLabs integration with two new providers:

  • Gemini 2.5 Live (VOICE_BACKEND=gemini-live): Google's real-time audio WebSocket API with full function calling (messageCodingAgent, processPermissionRequest) via a hub-side proxy that handles region restrictions
  • Qwen Realtime (VOICE_BACKEND=qwen-realtime): Alibaba DashScope via hub WebSocket proxy (browser WebSocket API cannot set Authorization headers, so the hub proxies and injects the key server-side)
  • ElevenLabs remains the default — existing behaviour completely unchanged

Changes from original PR

  • Rebased 135 upstream commits cleanly, including scheduling, goal state, conversation outline, composer enter behaviour setting, and more
  • HappyComposer: kept upstream's configurable enter-behaviour setting (feat(web): add configurable enter behavior setting #586) instead of Overbaker's hard-coded Ctrl+Enter change — the upstream setting covers both behaviours and is better for all users
  • Test runner fix: converted gemini/toolAdapter.test.ts and gemini/pcmUtils.test.ts from bun:test to vitest — the web package uses vitest, not bun's test runner
  • All HAPI Bot findings from the original PR were addressed by @TennyDDDD; this rebase inherits those fixes

Configuration

# Gemini Live (free tier available, full function calling)
VOICE_BACKEND=gemini-live
GEMINI_API_KEY=your-google-api-key   # also accepts GOOGLE_API_KEY

# Qwen Realtime (voice conversation; function calling limited by model support)
VOICE_BACKEND=qwen-realtime
DASHSCOPE_API_KEY=your-dashscope-key  # also accepts QWEN_API_KEY

# ElevenLabs (default — unchanged)
VOICE_BACKEND=elevenlabs
ELEVENLABS_API_KEY=your-elevenlabs-key
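
The variables above could be resolved on the hub roughly as follows. This is a minimal sketch, not the PR's code: `resolveVoiceConfig`, `VoiceConfig`, and `Env` are hypothetical names, and treating an unrecognised `VOICE_BACKEND` value as ElevenLabs is an assumption. Only the variable names, the fallback key names, and the ElevenLabs default come from the configuration above.

```typescript
type VoiceBackend = 'elevenlabs' | 'gemini-live' | 'qwen-realtime'

interface VoiceConfig {
    backend: VoiceBackend
    apiKey: string | undefined
}

type Env = Record<string, string | undefined>

// Hypothetical helper: maps the env vars listed above to a backend and its key.
function resolveVoiceConfig(env: Env): VoiceConfig {
    switch (env['VOICE_BACKEND']) {
        case 'gemini-live':
            // GOOGLE_API_KEY is accepted as a fallback name
            return { backend: 'gemini-live', apiKey: env['GEMINI_API_KEY'] ?? env['GOOGLE_API_KEY'] }
        case 'qwen-realtime':
            // QWEN_API_KEY is accepted as a fallback name
            return { backend: 'qwen-realtime', apiKey: env['DASHSCOPE_API_KEY'] ?? env['QWEN_API_KEY'] }
        default:
            // Unset (and, by assumption, unrecognised) values keep the ElevenLabs default
            return { backend: 'elevenlabs', apiKey: env['ELEVENLABS_API_KEY'] }
    }
}
```

On a Bun or Node hub this would presumably be called as `resolveVoiceConfig(process.env)`.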

Files changed

| Area | Files | Description |
| --- | --- | --- |
| Shared | shared/src/voice.ts | Backend types, Gemini/Qwen model constants, improved system prompt with explicit tool-call priority rule, Chinese language block separated from ElevenLabs config |
| Hub routes | hub/src/web/routes/voice.ts | Backend discovery (GET /voice/backend), POST /voice/gemini-token, POST /voice/qwen-token |
| Hub server | hub/src/web/server.ts | Gemini/Qwen WebSocket proxy handlers with JWT auth and message queueing during upstream connect |
| Hub socket | hub/src/socket/server.ts | maxHttpBufferSize: 55 MB to match upload limit |
| Gemini session | web/src/realtime/GeminiLiveVoiceSession.tsx | Full Gemini Live implementation: WebSocket + AudioWorklet, serial tool calls, mobile AudioContext |
| Qwen session | web/src/realtime/QwenVoiceSession.tsx | Qwen Realtime via hub proxy, OpenAI-compatible realtime protocol |
| Backend selector | web/src/realtime/VoiceBackendSession.tsx | Dynamic backend selector with React.lazy, gates voice button until module is registered |
| Audio pipeline | web/src/realtime/gemini/ | PCM utils, AudioWorklet recorder, 24 kHz player, Gemini function-call adapter |
| Integration | web/src/components/SessionChat.tsx | Uses VoiceBackendSession, gates voice toggle on backend readiness |
| Config | web/tsconfig.json | Exclude test files from TS compilation |
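
The "message queueing during upstream connect" behaviour of the hub proxy can be illustrated with a small sketch. It assumes a simplified socket shape (`SendTarget`) rather than Bun's `ServerWebSocket`, and `UpstreamQueue` is a hypothetical name; the actual handlers in hub/src/web/server.ts may be structured differently.

```typescript
// Minimal socket shape used by this sketch (not the Bun or browser WebSocket type).
interface SendTarget {
    send(data: string): void
}

// Buffers client frames until the upstream socket is open, then flushes them in order,
// so an early setup frame is not dropped while the upstream connection is still opening.
class UpstreamQueue {
    private pending: string[] = []
    private upstream: SendTarget | null = null

    // Called for every frame received from the browser.
    forward(data: string): void {
        if (this.upstream) {
            this.upstream.send(data)
        } else {
            this.pending.push(data)
        }
    }

    // Called once the upstream connection has opened.
    attach(upstream: SendTarget): void {
        this.upstream = upstream
        for (const frame of this.pending) upstream.send(frame)
        this.pending = []
    }
}
```

Frames forwarded before `attach()` are delivered first and in their original order; later frames pass straight through.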

Test plan

  • All 221 hub tests pass (bun test hub/src)
  • All 636 web tests pass (bun run test in web/)
  • TypeScript clean (tsc --noEmit in both web/ and hub/)
  • PCM audio round-trip tests pass
  • Gemini tool adapter tests pass
  • ElevenLabs default flow: existing voice behaviour unchanged
  • Gemini Live end-to-end: voice + function calling (requires GEMINI_API_KEY)
  • Qwen Realtime end-to-end: voice via hub proxy (requires DASHSCOPE_API_KEY)
  • Mobile browsers: iOS Safari, Android Chrome (AudioContext created in user gesture)
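
As an illustration of what a PCM round-trip check exercises, here is a minimal sketch of Float32 to 16-bit PCM conversion and back. The function names are hypothetical and the scaling convention is an assumption; the helpers in web/src/realtime/gemini/ may differ.

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit signed PCM (out-of-range input is clamped).
function floatToPcm16(input: Float32Array): Int16Array {
    const out = new Int16Array(input.length)
    for (let i = 0; i < input.length; i++) {
        const s = Math.max(-1, Math.min(1, input[i]))
        // Negative samples scale by 32768, non-negative by 32767, to use the full int16 range
        out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff)
    }
    return out
}

// Inverse conversion: 16-bit signed PCM back to Float32 samples in [-1, 1].
function pcm16ToFloat(input: Int16Array): Float32Array {
    const out = new Float32Array(input.length)
    for (let i = 0; i < input.length; i++) {
        out[i] = input[i] < 0 ? input[i] / 0x8000 : input[i] / 0x7fff
    }
    return out
}
```

A round trip through both functions should reproduce each in-range sample to within one quantisation step (about 1/32767).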

Rebased from Overbaker/hapi#401 onto current main. Adds a pluggable voice
backend architecture that extends the existing ElevenLabs integration:

- Gemini 2.5 Live (gemini-live): Google real-time audio via WebSocket
  with full function calling (messageCodingAgent, processPermissionRequest)
- Qwen Realtime (qwen-realtime): Alibaba DashScope via hub WebSocket
  proxy (browser cannot set Authorization header directly)
- VoiceBackendSession: dynamic backend selector with React.lazy loading,
  gates voice button until backend module is registered
- Hub WS proxies: JWT-authenticated /api/voice/gemini-ws and
  /api/voice/qwen-ws endpoints in Bun.serve, with message queueing during
  upstream connect to prevent dropped setup frames
- AudioWorklet pipeline: inline Blob URL recorder, 24 kHz PCM player,
  serial tool call execution, AudioContext created in user gesture for mobile
- Backend discovery: GET /voice/backend + POST /voice/gemini-token /
  POST /voice/qwen-token hub routes; frontend auto-detects active backend

Merge notes:
- Rebased 135 upstream commits cleanly; HappyComposer keeps upstream's
  configurable enter-behavior setting (supersedes hard-coded Ctrl+Enter)
- Converted gemini test files from bun:test to vitest (web package uses vitest)
- All 221 hub tests and 636 web tests pass; TypeScript clean
@heavygee heavygee force-pushed the feat/pluggable-voice-backend branch from 21d2417 to 9081a9b Compare May 25, 2026 18:10

@github-actions github-actions Bot left a comment


Findings

  • [Major] Gemini turn completion ignores the user mute state — when Gemini starts speaking, the recorder is force-muted, but turnComplete always calls setMuted(false). If the user had muted the mic, the next model turn re-enables the MediaStream track and can stream microphone audio while the UI still shows muted, evidence web/src/realtime/GeminiLiveVoiceSession.tsx:217.
    Suggested fix:
    state.modelSpeaking = false
    state.recorder?.setMuted(state.micMuted)

Summary

  • Review mode: initial
  • One major issue found in the new Gemini voice backend mute handling.

Testing

  • Not run (automation; bun is not installed in this runner).

HAPI Bot

Comment thread web/src/realtime/GeminiLiveVoiceSession.tsx Outdated
turnComplete handler was unconditionally calling setMuted(false), which
re-enabled the mic track even when the user had manually muted. Now
restores to state.micMuted instead.
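
The fix can be sketched in isolation. The types and handler names below are hypothetical; only `micMuted`, `modelSpeaking`, `recorder`, and `setMuted` appear in the review comment above.

```typescript
interface Recorder {
    setMuted(muted: boolean): void
}

interface SessionState {
    micMuted: boolean       // what the user selected in the UI
    modelSpeaking: boolean  // true while model audio is playing
    recorder: Recorder | null
}

// While the model speaks, the recorder is force-muted regardless of the user's choice.
function onModelAudioStart(state: SessionState): void {
    state.modelSpeaking = true
    state.recorder?.setMuted(true)
}

// On turn completion, restore the user's choice instead of always unmuting.
function onTurnComplete(state: SessionState): void {
    state.modelSpeaking = false
    state.recorder?.setMuted(state.micMuted)
}
```

With `micMuted === true`, the recorder stays muted after the model's turn ends, so the UI and the MediaStream track agree.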

@github-actions github-actions Bot left a comment


Findings

  • [Major] Gemini restart can keep microphone input disabled after stopping during playback — cleanup() closes the old session but leaves state.modelSpeaking unchanged. If the user stops or the socket closes while Gemini is speaking, the next startSession() begins with modelSpeaking === true, and sendAudioChunk() drops every mic chunk until a page reload or a later model turn flips the flag, evidence web/src/realtime/GeminiLiveVoiceSession.tsx:60 and related guard at web/src/realtime/GeminiLiveVoiceSession.tsx:301.
    Suggested fix:
    function cleanup() {
        // existing recorder/player/ws cleanup...
        state.ws = null
        state.modelSpeaking = false
    }

Summary

  • Review mode: follow-up after new commits
  • One major issue remains in the new Gemini Live backend lifecycle handling. The previous mute-state finding is addressed by the latest commit.

Testing

  • Not run (automation; bun is not installed in this runner).

HAPI Bot

Comment thread web/src/realtime/GeminiLiveVoiceSession.tsx
heavygee added 2 commits May 25, 2026 21:11
buildGeminiLiveConfig was appending VOICE_CHINESE_LANGUAGE_BLOCK which
forced Gemini to always respond in Mandarin regardless of user locale.
Gemini now uses the neutral base prompt and responds in the language the
user speaks to it, consistent with the ElevenLabs behaviour.

If the session closes while Gemini is mid-speech, cleanup() left
state.modelSpeaking=true. The next startSession() would then drop all
mic audio in sendAudioChunk() until a model turn eventually flipped
the flag — effectively deaf until page reload.
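
A minimal sketch of the lifecycle issue, assuming the state is passed explicitly rather than held at module level as in the PR. `GeminiState`, `Closable`, and `shouldSendAudio` are hypothetical names standing in for the guard inside `sendAudioChunk()`.

```typescript
interface Closable {
    close(): void
}

interface GeminiState {
    ws: Closable | null
    modelSpeaking: boolean
}

// Stand-in for the guard in sendAudioChunk(): mic chunks are dropped
// when there is no socket or while the model is speaking.
function shouldSendAudio(state: GeminiState): boolean {
    return state.ws !== null && !state.modelSpeaking
}

// Cleanup resets the speaking flag so the next session starts able to send audio.
function cleanup(state: GeminiState): void {
    state.ws?.close()
    state.ws = null
    state.modelSpeaking = false
}
```

Without the reset in `cleanup()`, a session that ended mid-speech would leave `modelSpeaking` set, and the guard would reject every chunk in the next session.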

@github-actions github-actions Bot left a comment


Findings

  • [Major] Stale Gemini close handlers can tear down the next session — ws.onclose always calls module-level cleanup(), but cleanup() operates on state.ws, not necessarily the socket that fired the event. If the user stops/restarts quickly, the old socket can close after the new socket is assigned and close the new session, leaving voice disconnected, evidence web/src/realtime/GeminiLiveVoiceSession.tsx:258.
    Suggested fix:
    ws.onclose = (event) => {
        if (state.ws !== ws) return
        cleanup()
        resetRealtimeSessionState()
        // existing setupDone/status handling...
    }
  • [Major] Stale Qwen close handlers can tear down the next session — same module-level cleanup race exists in the Qwen backend, so an old socket close can reset/close a newer session after a quick reconnect, evidence web/src/realtime/QwenVoiceSession.tsx:285.
    Suggested fix:
    ws.onclose = (event) => {
        if (state.ws !== ws) return
        cleanup()
        resetRealtimeSessionState()
        // existing sessionReady/status handling...
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Gemini modelSpeaking restart issue is fixed in the latest head. Two lifecycle races remain in the new WebSocket-backed voice sessions.

Testing

  • Not run (automation; bun is not installed in this runner).

HAPI Bot

Comment thread web/src/realtime/GeminiLiveVoiceSession.tsx
Comment thread web/src/realtime/QwenVoiceSession.tsx
@heavygee
Contributor Author

Follow-on: per-user backend picker in Settings

The current VOICE_BACKEND env var approach works for initial setup, but switching backends requires SSH access to the hub and a restart. Now that multiple backends are viable, a settings dropdown would be much more useful.

Rough shape:

  • GET /api/voice/backends returns only the backends with API keys configured on the hub (e.g. ["elevenlabs", "gemini-live"] — no Qwen entry if DASHSCOPE_API_KEY is unset)
  • Settings page shows a Voice backend picker alongside the existing voice picker, hidden if only one backend is available
  • Selection stored in localStorage; passed to VoiceBackendSession to route to the right session component
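
The discovery part of this proposal could look roughly like the sketch below. It is hypothetical throughout: `listConfiguredBackends` and `shouldShowBackendPicker` are illustrative names, and the key-to-backend mapping simply reuses the variable names from the Configuration section.

```typescript
type VoiceBackend = 'elevenlabs' | 'gemini-live' | 'qwen-realtime'

// Env var names that enable each backend (taken from the Configuration section).
const BACKEND_KEYS: Record<VoiceBackend, string[]> = {
    'elevenlabs': ['ELEVENLABS_API_KEY'],
    'gemini-live': ['GEMINI_API_KEY', 'GOOGLE_API_KEY'],
    'qwen-realtime': ['DASHSCOPE_API_KEY', 'QWEN_API_KEY']
}

// Returns only the backends that have at least one non-empty API key configured.
function listConfiguredBackends(env: Record<string, string | undefined>): VoiceBackend[] {
    return (Object.keys(BACKEND_KEYS) as VoiceBackend[]).filter((backend) =>
        BACKEND_KEYS[backend].some((key) => Boolean(env[key]))
    )
}

// The picker is hidden when there is nothing to choose between.
function shouldShowBackendPicker(backends: VoiceBackend[]): boolean {
    return backends.length > 1
}
```

A hub with only `ELEVENLABS_API_KEY` and `GEMINI_API_KEY` set would report two backends and show the picker; with a single key the picker stays hidden.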

This composes cleanly with the voice picker from #690 — if you switch back to ElevenLabs, the voice picker reappears with your account voices (including clones). Gemini/Qwen don't have per-voice selection yet, so that section would hide itself when a non-ElevenLabs backend is active.

Will file this as a follow-on once #690 and this PR are settled, since both touch the settings page and it makes sense to land them together.

ws.onclose operated on module-level state.ws, not the socket that fired
the event. A rapid stop/restart could cause the old socket's onclose to
call cleanup() after the new socket was assigned, tearing down the live
session. Guard with `if (state.ws !== ws) return` before cleanup.
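
The race and its guard can be reproduced with plain objects. This is a sketch under simplifying assumptions: `FakeSocket`, `SharedState`, and `connect` are hypothetical stand-ins for the browser WebSocket and the module-level state, and a counter replaces the real `cleanup()`.

```typescript
interface FakeSocket {
    onclose: (() => void) | null
}

interface SharedState {
    ws: FakeSocket | null
    cleanups: number
}

// Opens a "socket" and registers a close handler that ignores stale sockets.
function connect(state: SharedState): FakeSocket {
    const ws: FakeSocket = { onclose: null }
    ws.onclose = () => {
        // A newer session owns the shared state; this close event is stale.
        if (state.ws !== ws) return
        state.ws = null
        state.cleanups++
    }
    state.ws = ws
    return ws
}
```

If a first socket's close event fires after a second `connect()`, the guard returns early and the second session is left untouched.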

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment


Findings

  • [Minor] Gemini Live omits the non-ElevenLabs language block — VOICE_CHINESE_LANGUAGE_BLOCK is documented as the block appended for Gemini/Qwen backends, and Qwen appends it before sending session instructions, but buildGeminiLiveConfig() still sends only VOICE_SYSTEM_PROMPT. With VOICE_BACKEND=gemini-live, the assistant will not get the Mandarin response instruction that the new Qwen path gets, evidence shared/src/voice.ts:347.
    Suggested fix:
    export function buildGeminiLiveConfig(): GeminiLiveConfig {
        return {
            model: GEMINI_LIVE_MODEL,
            systemInstruction: VOICE_SYSTEM_PROMPT + VOICE_CHINESE_LANGUAGE_BLOCK,
            tools: [
                {
                    functionDeclarations: buildGeminiLiveFunctionDeclarations()
                }
            ],
            responseModalities: ['AUDIO']
        }
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous stale-close findings are fixed in the latest head. One Gemini/Qwen behavior mismatch remains.

Testing

  • Not run (automation; PR code not executed).

HAPI Bot

Comment thread shared/src/voice.ts
@heavygee
Contributor Author

Qwen Realtime untested — DashScope API signup was inaccessible during dogfooding. Will update once a key is obtainable.

Matches the Gemini fix — both backends now use VOICE_SYSTEM_PROMPT
without the Chinese language block, giving consistent English-default
behaviour across all non-ElevenLabs backends.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
@heavygee
Contributor Author

On removing VOICE_CHINESE_LANGUAGE_BLOCK from both backends: language still works correctly for users.

VOICE_CHINESE_LANGUAGE_BLOCK forced Mandarin unconditionally — the right behaviour for a Chinese-only deployment but wrong for anyone else. Both Gemini and Qwen will naturally mirror the user's spoken language without an explicit instruction: a Mandarin speaker gets Mandarin responses, an English speaker gets English. The block was over-constraining, not enabling.

ElevenLabs handles this separately via its language field in the agent config (with override support per-conversation), so it was never affected.


@github-actions github-actions Bot left a comment


Findings

  • [Major] Global visibility suppresses native push for unrelated sessions — the new always-on { all: true } SSE connection is also reported as visible, so PushNotificationChannel sees the namespace as visible and sends the toast to that global connection instead of falling back to Web Push. When the user is viewing session A, a ready/permission/task notification for session B is delivered only to the open tab and no native push is sent, even though the selected-session connection is not subscribed to B. Evidence web/src/App.tsx:327 and existing push fallback logic in hub/src/push/pushNotificationChannel.ts:39.
    Suggested fix:
    const isGlobalVisibilityAuthoritative = !sessionEventSubscription
    
    useVisibilityReporter({
        api,
        subscriptionId: globalSubscriptionId,
        enabled: sseEnabled && isGlobalVisibilityAuthoritative
    })
    
    useVisibilityReporter({
        api,
        subscriptionId: sessionSubscriptionId,
        enabled: sseEnabled && Boolean(sessionEventSubscription)
    })

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The new split-SSE wiring still keeps session-list updates live, but its visibility reporting can now swallow native notifications for sessions other than the selected one.

Testing

  • Not run (automation; PR code not executed). Add coverage for the split-SSE visibility case: selected session visible, notification for another session should still call Web Push rather than only delivering an in-app toast.

HAPI Bot

Comment thread web/src/App.tsx Outdated
Adds a "Proactive voice" toggle (default: off = reactive) to the Voice
Assistant settings section.

Reactive (default): initial context and agent-ready events are fed
silently; the assistant waits for the user to speak first.

Proactive: original behaviour — Gemini/Qwen narrate context on connect
and speak unprompted when the agent finishes a task. ElevenLabs is also
affected via onReady sending a user message rather than a silent update.

Covers all three backends uniformly. localStorage key: hapi-voice-proactive.
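
A rough sketch of how the toggle might be read and applied. Only the storage key `hapi-voice-proactive` and the reactive default come from the description above; the stored value format (`'true'`), the `KeyValueStore` interface, and the function names are assumptions.

```typescript
// Subset of the Web Storage API needed here; localStorage satisfies it in a browser.
interface KeyValueStore {
    getItem(key: string): string | null
}

const PROACTIVE_KEY = 'hapi-voice-proactive'

// Assumption: the toggle is stored as the string 'true'; anything else means reactive (off).
function isProactiveVoice(store: KeyValueStore): boolean {
    return store.getItem(PROACTIVE_KEY) === 'true'
}

type ContextDelivery = 'silent-update' | 'user-message'

// Reactive mode feeds context silently; proactive mode sends it as a message
// so the assistant speaks unprompted.
function contextDeliveryFor(proactive: boolean): ContextDelivery {
    return proactive ? 'user-message' : 'silent-update'
}
```

In a browser this would presumably be called as `contextDeliveryFor(isProactiveVoice(localStorage))`.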

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
@heavygee
Contributor Author

On the proactive/reactive toggle (scope increase explanation)

The HAPI Bot flagged that VOICE_CHINESE_LANGUAGE_BLOCK was present in Qwen but absent from Gemini. Looking into it, the bot had the direction backwards — we'd intentionally removed it from Gemini because it was causing the assistant to respond in Chinese regardless of the user's language. The right fix was to remove it from Qwen too for consistency.

But that review led somewhere useful: digging into why the original code behaved that way, it became clear the Chinese block wasn't just a language setting — it was tied to a proactive narration style where the assistant speaks unprompted when context changes or the agent finishes a task. That's a legitimate and useful mode; it's just not the right default for everyone.

Rather than silently drop that behaviour (which was part of @Overbaker's original design intent in #401), we've surfaced it as a Proactive voice toggle in Settings, defaulting to off (reactive, matching current main behaviour). Users who want the assistant to narrate agent activity unprompted can enable it.

The toggle applies uniformly to all three backends: Gemini, Qwen, and ElevenLabs.


@github-actions github-actions Bot left a comment


Findings

  • [Major] Normalize upstream close codes before closing the client WebSocket — both new voice proxies forward event.code from the upstream socket. Abnormal upstream closes commonly surface as 1006, and 1005/1006/1015 are reserved codes that cannot be sent in a close frame. If clientWs.close(event.code, ...) throws, the catch swallows it after the upstream is removed, leaving the browser socket open while later messages are dropped. Evidence hub/src/web/server.ts:68.
    Suggested fix:
    function toClientCloseCode(code: number): number {
        return code >= 1000 && code <= 4999 && code !== 1005 && code !== 1006 && code !== 1015
            ? code
            : 1011
    }
    
    function closeClientFromUpstream(clientWs: ServerWebSocket<unknown>, event: CloseEvent): void {
        try { clientWs.close(toClientCloseCode(event.code), event.reason || 'Upstream closed') } catch { /* client gone */ }
    }
    
    upstream.onclose = (event) => {
        pendingMap.delete(clientWs)
        closeClientFromUpstream(clientWs, event)
        upstreamMap.delete(clientWs)
    }
  • [Minor] Barrel export defeats lazy backend loading — VoiceBackendSession lazy-imports Gemini/Qwen to keep alternate backends out of the initial path, but the barrel now statically re-exports those same modules. Any @/realtime import creates static dependencies on both backend modules, so the lazy split is unreliable. Evidence web/src/realtime/index.ts:20.
    Suggested fix:
    export { RealtimeVoiceSession, type RealtimeVoiceSessionProps } from './RealtimeVoiceSession'
    export { VoiceBackendSession, type VoiceBackendSessionProps } from './VoiceBackendSession'

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • Found two issues in the final diff: upstream abnormal-close handling can leave proxied voice sockets hung, and the realtime barrel export undermines the intended lazy backend split.

Testing

  • Not run (automation; PR code not executed). Static fallback attempted: bun typecheck is unavailable in this runner (bun: command not found), and direct tsc -p web/tsconfig.json --noEmit / tsc -p hub/tsconfig.json --noEmit are blocked by missing workspace type packages (vite/client, bun-types, node).

HAPI Bot

Comment thread hub/src/web/server.ts Outdated
Comment thread web/src/realtime/index.ts Outdated
… visibility

- hub/server.ts: add toClientCloseCode() to normalize reserved upstream
  close codes (1005/1006/1015) to 1011 before forwarding to browser;
  abnormal upstream drops (1006) would otherwise throw on clientWs.close()
  and leave the browser socket open

- realtime/index.ts: remove static GeminiLiveVoiceSession and QwenVoiceSession
  barrel exports; VoiceBackendSession lazy-imports both, so barrel re-exports
  created static dependencies that defeated the intended code-split

- App.tsx: gate global useVisibilityReporter on !sessionEventSubscription so
  the always-on SSE connection does not suppress native Web Push notifications
  for sessions the user is not currently viewing

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>

@github-actions github-actions Bot left a comment


Findings

  • [Major] Keep the session SSE connection responsible for session catch-up — after the split, only the global SSE passes onConnect, so reconnecting while a session is open invalidates the session query and fetches latest messages from the global connection lifecycle. That global connection does not restart when selectedSessionId changes or when the session-scoped SSE reconnects, so a dropped /api/events?sessionId=... stream can reconnect at web/src/hooks/useSSE.ts:543 without running the catch-up at web/src/App.tsx:206, leaving the active chat stale until another event arrives. Evidence web/src/App.tsx:312.
    Suggested fix:
    const handleSessionSseConnect = useCallback(() => {
        if (!selectedSessionId || !api) return
        void Promise.all([
            queryClient.invalidateQueries({ queryKey: queryKeys.session(selectedSessionId) }),
            fetchLatestMessages(api, selectedSessionId)
        ]).catch((error) => {
            console.error('Failed to refresh session after SSE connect:', error)
        })
    }, [api, queryClient, selectedSessionId])
    
    const { subscriptionId: sessionSubscriptionId } = useSSE({
        enabled: sseEnabled && Boolean(sessionEventSubscription),
        token: token ?? '',
        baseUrl,
        subscription: sessionEventSubscription ?? undefined,
        scope: 'full',
        onConnect: handleSessionSseConnect,
        onEvent: handleSseEvent
    })
  • [Major] Do not disable visibility reporting for the global subscription while a session is open — useVisibilityReporter only clears its retry/local refs when disabled; it does not POST hidden or unregister the SSE subscription. With this new enabled: sseEnabled && !sessionEventSubscription, the all-sessions connection stays registered as visible from its initial visibility query, and SSEManager.sendToast delivers to any visible connection in the namespace. Result: while the user is reading a session, global toast delivery can count as visible and suppress push/Telegram fallback even if the session-scoped reporter later goes hidden or fails. Evidence web/src/App.tsx:330.
    Suggested fix:
    useVisibilityReporter({
        api,
        subscriptionId: globalSubscriptionId,
        enabled: sseEnabled
    })
    
    useVisibilityReporter({
        api,
        subscriptionId: sessionSubscriptionId,
        enabled: sseEnabled && Boolean(sessionEventSubscription)
    })

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous bot findings were addressed, but the SSE split introduces two regressions: active-session reconnects no longer run session catch-up, and the global SSE subscription can remain visible without a reporter while a session is open.

Testing

  • Not run (automation; bun is unavailable in this runner: bun: command not found).

HAPI Bot

Comment thread web/src/App.tsx
subscription: eventSubscription,
subscription: globalEventSubscription,
scope: 'global',
onConnect: handleSseConnect,


[MAJOR] Only the global SSE currently gets onConnect, so a reconnect of the session-scoped /api/events?sessionId=... stream runs useSSE's onopen path but never executes the catch-up that invalidates the active session and fetches latest messages. Because the global connection stays mounted across session changes/reconnects, the active chat can remain stale after the session stream drops until a later event happens to arrive.

Suggested fix:

const handleSessionSseConnect = useCallback(() => {
    if (!selectedSessionId || !api) return
    void Promise.all([
        queryClient.invalidateQueries({ queryKey: queryKeys.session(selectedSessionId) }),
        fetchLatestMessages(api, selectedSessionId)
    ]).catch((error) => {
        console.error('Failed to refresh session after SSE connect:', error)
    })
}, [api, queryClient, selectedSessionId])

const { subscriptionId: sessionSubscriptionId } = useSSE({
    enabled: sseEnabled && Boolean(sessionEventSubscription),
    token: token ?? '',
    baseUrl,
    subscription: sessionEventSubscription ?? undefined,
    scope: 'full',
    onConnect: handleSessionSseConnect,
    onEvent: handleSseEvent
})

Comment thread web/src/App.tsx Outdated
useVisibilityReporter({
api,
subscriptionId: globalSubscriptionId,
enabled: sseEnabled && !sessionEventSubscription


[MAJOR] Disabling the reporter for the global subscription while a session is selected leaves that SSE connection registered with whatever visibility it had when opened. useVisibilityReporter does not POST hidden when disabled, and SSEManager.sendToast treats any visible connection in the namespace as enough to deliver in-app, so the stale global visibility can suppress push/Telegram fallback while the user is in a session.

Suggested fix:

useVisibilityReporter({
    api,
    subscriptionId: globalSubscriptionId,
    enabled: sseEnabled
})

useVisibilityReporter({
    api,
    subscriptionId: sessionSubscriptionId,
    enabled: sseEnabled && Boolean(sessionEventSubscription)
})

…#694

The following files were accidentally included from work intended for a
separate PR (session-list-status / SSE split subscriptions) due to that
branch being submitted from main rather than its own worktree branch:

- web/src/App.tsx (reverted to upstream/main)
- web/src/hooks/useSSE.ts (reverted to upstream/main)
- web/src/hooks/useSSE.test.ts (removed — added by other branch)
- web/src/lib/appSseSubscriptions.ts (removed — added by other branch)
- web/src/lib/appSseSubscriptions.test.ts (removed — added by other branch)
- hub/src/sync/syncEngine.ts (reverted to upstream/main)

Apologies for the noise. This PR (tiann#692) is scoped to the pluggable voice
backend only.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
@heavygee
Contributor Author

Apologies for the noise on the previous push — a separate agent was working on SSE split-subscription changes (for a different PR) and accidentally submitted from main rather than its own worktree branch, which caused those files to land here. The revert commit above removes them. This PR is now clean and scoped to the pluggable voice backend only.


@github-actions github-actions Bot left a comment


Findings

  • [Minor] Preserve the voice language preference for Gemini/Qwen — VoiceProvider still passes the selected ElevenLabs language code into startRealtimeSession, but the new Gemini and Qwen backends never read config.language. The PR also adds VOICE_CHINESE_LANGUAGE_BLOCK specifically for Gemini/Qwen, but buildGeminiLiveConfig() is called without the language and Qwen builds instructions from VOICE_SYSTEM_PROMPT only, so selecting Chinese in settings silently works only for ElevenLabs. Evidence web/src/realtime/GeminiLiveVoiceSession.tsx:127, web/src/realtime/QwenVoiceSession.tsx:160.
    Suggested fix:
    // shared/src/voice.ts
    export function buildGeminiLiveConfig(language?: string): GeminiLiveConfig {
        const systemInstruction = language === 'zh'
            ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}`
            : VOICE_SYSTEM_PROMPT
    
        return {
            model: GEMINI_LIVE_MODEL,
            systemInstruction,
            tools: [{ functionDeclarations: buildGeminiLiveFunctionDeclarations() }],
            responseModalities: ['AUDIO']
        }
    }
    
    // web/src/realtime/GeminiLiveVoiceSession.tsx
    const liveConfig = buildGeminiLiveConfig(config.language)
    
    // web/src/realtime/QwenVoiceSession.tsx
    const basePrompt = config.language === 'zh'
        ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}`
        : VOICE_SYSTEM_PROMPT

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous bot findings were reviewed as context. I found one integration gap in the new voice backends: the existing voice language setting is ignored outside ElevenLabs.

Testing

  • Not run (automation; bun is unavailable in this runner).

HAPI Bot

ws.onopen = () => {
if (DEBUG) console.log('[GeminiLive] WebSocket connected, sending setup')

const liveConfig = buildGeminiLiveConfig()


[MINOR] VoiceProvider still passes the selected voice language into startRealtimeSession, but the Gemini path calls buildGeminiLiveConfig() without config.language, and Qwen builds its instructions from VOICE_SYSTEM_PROMPT only. Since this PR adds VOICE_CHINESE_LANGUAGE_BLOCK for Gemini/Qwen, selecting Chinese in settings currently has no effect for the new backends.

Suggested fix:

// shared/src/voice.ts
export function buildGeminiLiveConfig(language?: string): GeminiLiveConfig {
    const systemInstruction = language === 'zh'
        ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}`
        : VOICE_SYSTEM_PROMPT

    return {
        model: GEMINI_LIVE_MODEL,
        systemInstruction,
        tools: [{ functionDeclarations: buildGeminiLiveFunctionDeclarations() }],
        responseModalities: ['AUDIO']
    }
}

// web/src/realtime/GeminiLiveVoiceSession.tsx
const liveConfig = buildGeminiLiveConfig(config.language)

// web/src/realtime/QwenVoiceSession.tsx
const basePrompt = config.language === 'zh'
    ? `${VOICE_SYSTEM_PROMPT}${VOICE_CHINESE_LANGUAGE_BLOCK}`
    : VOICE_SYSTEM_PROMPT
