feat(voice): dictation config, hotkey lifecycle, and WebSocket streaming (#332) #371

oxoxDev wants to merge 3 commits into tinyhumansai:main
Conversation
…ing (tinyhumansai#332)

Add the foundational infrastructure for voice dictation (EPIC tinyhumansai#332):

**Rust core:**
- New `DictationConfig` schema with serde defaults and env var overrides (enabled, hotkey, activation_mode, llm_refinement, streaming, interval)
- RPC controllers: `config_get_dictation_settings` / `config_update_dictation_settings`
- WebSocket endpoint `/ws/dictation` for streaming PCM16 transcription with periodic partial inference and final LLM refinement
- Microphone permission declaration (`NSMicrophoneUsageDescription`) in Tauri macOS bundle config

**Frontend:**
- `useDictationHotkey` hook: fetches config from core RPC, auto-registers global hotkey, listens for `dictation://toggle` events
- `DictationHotkeyManager` headless component mounted in App.tsx
- Fix voice RPC response type mismatch: voice handlers return flat results (no `{result, logs}` wrapper), so remove incorrect `CommandResponse<T>` wrapping from `openhumanVoiceStatus`, `openhumanVoiceTranscribe`, `openhumanVoiceTranscribeBytes`, and `openhumanVoiceTts`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
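The commit describes the `DictationConfig` schema only in outline. A minimal sketch of what such a serde-backed schema can look like (field names follow this PR's `DictationSettingsPatch`; the concrete default values are illustrative assumptions, not the project's actual defaults):

```rust
use serde::{Deserialize, Serialize};

// Sketch only: field names mirror DictationSettingsPatch from this PR;
// the default values below are illustrative assumptions.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(default)]
pub struct DictationConfig {
    pub enabled: bool,
    pub hotkey: String,
    pub activation_mode: String,
    pub llm_refinement: bool,
    pub streaming: bool,
    pub streaming_interval_ms: u64,
}

impl Default for DictationConfig {
    fn default() -> Self {
        Self {
            enabled: false,
            hotkey: String::new(),
            activation_mode: "toggle".into(), // assumed mode name
            llm_refinement: true,
            streaming: true,
            streaming_interval_ms: 1500,
        }
    }
}
```

With `#[serde(default)]`, any field missing from the config file falls back to the `Default` impl, and env var overrides (like the `OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS` variable seen later in this review) can patch individual fields after deserialization.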
The `infoPlist` field in tauri.conf.json expects a string path, not an inline object. Remove it for now — microphone permission will be added via a proper Info.plist supplement in the production build pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

This PR introduces voice dictation with global hotkey support across the full stack. Frontend changes add a dictation hotkey manager component and hook that register/unregister hotkeys and listen for toggle events. Backend adds a dictation configuration schema, RPC endpoints for reading/updating settings, and a WebSocket streaming transcription handler. Voice RPC wrappers are refactored to return unwrapped payloads instead of envelope types.
Sequence Diagrams

```mermaid
sequenceDiagram
actor User
participant App as App.tsx
participant HotkeyMgr as DictationHotkeyManager
participant Hook as useDictationHotkey
participant Tauri as Tauri RPC
participant Core as Core Backend
User->>App: Launch application
App->>HotkeyMgr: Mount component
HotkeyMgr->>Hook: Call useDictationHotkey()
Hook->>Tauri: callCoreRpc(config_get_dictation_settings)
Tauri->>Core: openhuman.config_get_dictation_settings
Core-->>Tauri: { dictationEnabled, hotkey, activationMode }
Tauri-->>Hook: Settings payload
Hook->>Hook: Update state (enabled, hotkey)
alt Dictation enabled and hotkey configured
Hook->>Tauri: registerDictationHotkey(hotkey)
Tauri->>Core: Register global hotkey
Core-->>Tauri: Success
Tauri-->>Hook: hotkeyRegistered = true
end
Hook->>Tauri: listen('dictation://toggle')
Tauri-->>Hook: Event listener registered
HotkeyMgr-->>App: Component mounted (renders null)
User->>User: Press hotkey
Tauri->>Hook: dictation://toggle event
Hook->>Hook: Increment toggleCount
HotkeyMgr->>HotkeyMgr: Log state change
```

```mermaid
sequenceDiagram
participant Client as WebSocket Client
participant Handler as handle_dictation_ws
participant Buffer as Audio Buffer
participant Inference as STT Engine
participant Refine as LLM Refinement
participant Client2 as Client (response)
Client->>Handler: Upgrade WebSocket connection
Handler->>Handler: Load config, create buffer
Handler->>Handler: Spawn periodic inference task
loop Stream Audio Frames
Client->>Handler: Send binary PCM16 frame
Handler->>Buffer: Validate & append samples
Buffer->>Buffer: Accumulate audio
end
par Periodic Inference
Handler->>Handler: Timer fires
Handler->>Inference: Transcribe buffered audio (partial)
Inference-->>Handler: Interim result
Handler->>Client2: Send {"type":"partial","text":"..."}
and Main Loop Continues
Client->>Handler: Send {"type":"stop"} text frame
Handler->>Handler: Trigger finalization
end
Handler->>Inference: Transcribe full buffer (final)
Inference-->>Handler: Full transcription result
alt LLM refinement enabled
Handler->>Refine: cleanup_transcription(text)
Refine-->>Handler: Refined text
Handler->>Client2: Send {"type":"final","text":"refined","raw_text":"original"}
else No refinement
Handler->>Client2: Send {"type":"final","text":"...","raw_text":"..."}
end
Client->>Handler: Close or disconnect
Handler->>Handler: Abort background tasks, cleanup
```
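To make the streaming protocol above concrete, here is a minimal client sketch. It assumes the `tokio-tungstenite` and `futures-util` crates; the endpoint URL and the binary-PCM16 / `{"type":"stop"}` / partial / final message shapes come from this PR, while everything else (frame count, the silence payload) is illustrative:

```rust
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Endpoint from this PR's test plan; adjust host/port for your setup.
    let (mut ws, _resp) = connect_async("ws://127.0.0.1:7788/ws/dictation").await?;

    // Stream binary PCM16 LE frames (16 kHz mono). We send silence here;
    // a real client would forward microphone capture.
    let frame = vec![0u8; 3200]; // 100 ms of 16 kHz mono PCM16
    for _ in 0..10 {
        ws.send(Message::Binary(frame.clone().into())).await?;
    }

    // Ask the server to finalize the transcription.
    ws.send(Message::Text(r#"{"type":"stop"}"#.into())).await?;

    // Read JSON messages until the final result arrives:
    //   {"type":"partial","text":"..."} then {"type":"final","text":"...","raw_text":"..."}
    while let Some(msg) = ws.next().await {
        if let Message::Text(text) = msg? {
            println!("server: {text}");
            if text.contains(r#""type":"final""#) {
                break;
            }
        }
    }
    Ok(())
}
```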
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🧹 Nitpick comments (6)
app/src/components/DictationHotkeyManager.tsx (1)
12-12: Prefer arrow component declaration for consistency.

Line 12 uses a function declaration; convert to a `const` arrow component to match project style guidelines.

♻️ Suggested refactor

```diff
-export default function DictationHotkeyManager() {
+const DictationHotkeyManager = () => {
   const { dictationEnabled, hotkeyRegistered, toggleCount, hotkey } = useDictationHotkey();
@@
   return null;
-}
+};
+
+export default DictationHotkeyManager;
```

Aligns with the project guideline for `**/*.{js,jsx,ts,tsx}`: "Prefer arrow functions over function declarations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/DictationHotkeyManager.tsx` at line 12, Convert the function declaration for DictationHotkeyManager into a const arrow component to match project style: replace "export default function DictationHotkeyManager()" with a const arrow assignment like "const DictationHotkeyManager = () => { ... }" and keep the default export (either export default on the const or export default DictationHotkeyManager at the end); preserve all existing props, return value and internal logic inside the new arrow function and retain any TypeScript/React typings if present.

src/openhuman/config/ops.rs (1)
539-546: Missing derive attributes on `DictationSettingsPatch`.

Other patch structs in this file (e.g., `ModelSettingsPatch`, `MemorySettingsPatch`) derive `Debug`, `Clone`, and `Default`. This struct is missing those attributes, which may cause issues if debugging or cloning is needed.

♻️ Proposed fix to add consistent derives
```diff
+#[derive(Debug, Clone, Default)]
 pub struct DictationSettingsPatch {
     pub enabled: Option<bool>,
     pub hotkey: Option<String>,
     pub activation_mode: Option<String>,
     pub llm_refinement: Option<bool>,
     pub streaming: Option<bool>,
     pub streaming_interval_ms: Option<u64>,
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/ops.rs` around lines 539 - 546, The struct DictationSettingsPatch is missing the consistent derive attributes used by other patch structs; update DictationSettingsPatch to derive Debug, Clone, and Default (matching ModelSettingsPatch and MemorySettingsPatch) so it can be debug-printed, cloned, and default-constructed where expected; locate the DictationSettingsPatch definition and add the derives above the struct declaration.

app/src/hooks/useDictationHotkey.ts (3)
116-129: Potential race condition in listener setup.

If the component unmounts before `listen()` resolves, `unlisten` will be `undefined` and the cleanup won't properly unsubscribe. Consider using an async IIFE with a disposed flag check, similar to the first `useEffect`.

♻️ Proposed fix with disposal tracking
```diff
 useEffect(() => {
   if (!isTauri()) return;

-  let unlisten: (() => void) | undefined;
+  let disposed = false;
+  let unlistenFn: (() => void) | undefined;

-  listen('dictation://toggle', () => {
-    console.debug('[dictation] hotkey toggle event received');
-    setToggleCount(c => c + 1);
-  })
-    .then(fn => {
-      unlisten = fn;
-    })
-    .catch(err => {
-      console.warn('[dictation] failed to listen for dictation toggle', err);
-    });
+  (async () => {
+    try {
+      const fn = await listen('dictation://toggle', () => {
+        console.debug('[dictation] hotkey toggle event received');
+        setToggleCount(c => c + 1);
+      });
+      if (disposed) {
+        fn();
+      } else {
+        unlistenFn = fn;
+      }
+    } catch (err) {
+      console.warn('[dictation] failed to listen for dictation toggle', err);
+    }
+  })();

   return () => {
-    unlisten?.();
+    disposed = true;
+    unlistenFn?.();
   };
 }, []);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/hooks/useDictationHotkey.ts` around lines 116 - 129, The listener setup can race with unmount because listen(...) is async and may resolve after cleanup; modify the useDictationHotkey effect that calls listen to use an async IIFE and a disposed flag (e.g., let disposed = false) so that when listen(...) resolves you only assign unlisten = fn and call setToggleCount if not disposed, and if disposed is true call fn() immediately to unsubscribe; ensure the cleanup sets disposed = true and calls unlisten?.() so the listener is always removed even if listen resolved after unmount (refer to the listen, unlisten and setToggleCount symbols).
72-74: Type-unsafe RpcOutcome wrapper handling.

The check `'result' in settings` and the subsequent cast are fragile. If the backend response shape changes, this could silently extract wrong data. Consider defining a more explicit type for the wrapped response or trusting the backend to return consistent shapes.

♻️ Proposed improvement with explicit typing
```diff
+interface RpcOutcomeWrapper<T> {
+  result: T;
+}
+
+type DictationSettingsResponse = DictationSettings | RpcOutcomeWrapper<DictationSettings>;
+
 // Inside init():
-const settings = await callCoreRpc<DictationSettings>({
+const response = await callCoreRpc<DictationSettingsResponse>({
   method: 'openhuman.config_get_dictation_settings',
 });
 if (disposed) return;

-if (!settings || typeof settings !== 'object') {
+if (!response || typeof response !== 'object') {
   console.debug('[dictation] no dictation settings from core');
   return;
 }

 // Handle RpcOutcome wrapper — the result may be nested in .result
-const s = (
-  'result' in settings ? (settings as Record<string, unknown>).result : settings
-) as DictationSettings;
+const s: DictationSettings =
+  'result' in response && typeof response.result === 'object'
+    ? (response.result as DictationSettings)
+    : (response as DictationSettings);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/hooks/useDictationHotkey.ts` around lines 72 - 74, The current extraction of DictationSettings using "'result' in settings" is type-unsafe; add an explicit RpcOutcome<T> interface and a type guard like isRpcOutcome(obj): obj is RpcOutcome<DictationSettings>, then in useDictationHotkey replace the ad-hoc check around the local variable settings (and the assignment to s) with a guarded extraction that returns settings.result when isRpcOutcome(settings) is true, otherwise treats settings as DictationSettings; reference the RpcOutcome generic, the isRpcOutcome type guard, the useDictationHotkey function, the settings variable, and the local variable s to locate and update the logic.
45-45: Prefer arrow function for hook definition.

As per coding guidelines, arrow functions are preferred over function declarations.
♻️ Proposed fix
```diff
-export function useDictationHotkey(): DictationHotkeyState {
+export const useDictationHotkey = (): DictationHotkeyState => {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/hooks/useDictationHotkey.ts` at line 45, The hook is declared with a function declaration; convert it to an exported arrow function to follow project guidelines: replace the declaration export function useDictationHotkey(): DictationHotkeyState { ... } with an exported const arrow form (export const useDictationHotkey = (): DictationHotkeyState => { ... }) preserving the existing body and return type, and ensure any internal references (closures, hooks) remain unchanged.

src/openhuman/voice/streaming.rs (1)
57-65: Buffer clone on each inference pass could be optimized.

Cloning the entire accumulated buffer (`guard.clone()`) for each partial inference may become expensive as audio accumulates (e.g., 30s of audio = ~960KB). Consider tracking a processed sample count and only cloning new data, or accepting this trade-off for simplicity in the initial implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/streaming.rs` around lines 57 - 65, The current loop clones the entire buffer (guard.clone()) into samples each inference which grows expensive; instead add a processed offset (e.g., processed_samples) and only clone the new slice: inside the block that locks buf_clone, compute let new_len = guard.len(); if new_len <= processed_samples || (new_len - processed_samples) < 8000 { continue; } let samples: Vec<i16> = guard[processed_samples..new_len].to_vec(); processed_samples = new_len; thereby replacing last_len and guard.clone() usage in the samples creation logic (update any conditions that used last_len to use processed_samples and the new_len delta).
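For illustration, a standalone sketch of the delta-tracking approach the prompt describes. The names (`take_new_samples`, `processed_samples`) are hypothetical, and note the trade-off: transcribing only the new slice gives up some cross-chunk context in exchange for the cheaper copy:

```rust
use parking_lot::Mutex;
use std::sync::Arc;

// Hypothetical helper: copy only the samples appended since the last
// pass, instead of cloning the whole accumulated buffer each tick.
fn take_new_samples(
    buffer: &Arc<Mutex<Vec<i16>>>,
    processed_samples: &mut usize,
    min_new: usize, // e.g. 8000 samples = 0.5 s at 16 kHz
) -> Option<Vec<i16>> {
    let guard = buffer.lock();
    let new_len = guard.len();
    if new_len <= *processed_samples || new_len - *processed_samples < min_new {
        return None; // not enough fresh audio to justify an inference pass
    }
    let samples = guard[*processed_samples..new_len].to_vec();
    *processed_samples = new_len;
    Some(samples)
}
```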
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/core/jsonrpc.rs`:
- Around line 331-344: The handler dictation_ws_handler currently performs the
WebSocket upgrade before loading config; change it to load the config with
crate::openhuman::config::rpc::load_config_with_timeout().await (creating
Arc::new on success) before calling ws.on_upgrade, and if loading fails return
an appropriate non-upgrade Response (e.g., 500 error) instead of performing the
upgrade; once config is loaded, call ws.on_upgrade(move |socket| async move {
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
}) so handle_dictation_ws receives the preloaded Arc config.
In `@src/openhuman/config/schema/load.rs`:
- Around line 735-738: When parsing OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS in
the load routine, validate the parsed u64 before assigning to
self.dictation.streaming_interval_ms: reject 0 (and optionally cap to a sensible
max) so you don't create a tight loop; update the parsing block in load.rs (the
section that reads OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS and currently sets
self.dictation.streaming_interval_ms = ms) to only assign when ms >= 1 (or
within your chosen min/max bounds) and log or fallback to a safe default when
out of range.
In `@src/openhuman/voice/streaming.rs`:
- Around line 67-85: The call to whisper_engine::transcribe_pcm_i16 obtains a
synchronous parking_lot::Mutex via service.whisper and can block other WebSocket
sessions; update the streaming task to run the blocking inference on a blocking
thread (e.g., tokio::task::spawn_blocking) instead of calling transcribe_pcm_i16
directly on the async path, or alternatively document the limitation on
concurrent sessions. Locate the code around local_ai::global(&config_clone) and
the transcribe_pcm_i16 usage and move the heavy/locking work into spawn_blocking
(await its JoinHandle) so partial_tx.send(trimmed).await runs only after the
blocking call completes without holding up the async reactor.
- Around line 139-164: The code moves inference_handle twice causing an
ownership error; change all places that destructure it inside the loop and after
the loop to take ownership via Option::take() instead of pattern-matching
directly (i.e., replace uses like "if let Some(h) = inference_handle {
h.abort(); }" inside the match arms and the final stop block with "if let
Some(h) = inference_handle.take() { h.abort(); }") so the Option is emptied on
first use and subsequent checks are safe; reference the variable
inference_handle and the match arms handling Message::Close/None and
Some(Err(e)) as well as the final "Stop the periodic inference task" block.
---
Nitpick comments:
In `@app/src/components/DictationHotkeyManager.tsx`:
- Line 12: Convert the function declaration for DictationHotkeyManager into a
const arrow component to match project style: replace "export default function
DictationHotkeyManager()" with a const arrow assignment like "const
DictationHotkeyManager = () => { ... }" and keep the default export (either
export default on the const or export default DictationHotkeyManager at the
end); preserve all existing props, return value and internal logic inside the
new arrow function and retain any TypeScript/React typings if present.
In `@app/src/hooks/useDictationHotkey.ts`:
- Around line 116-129: The listener setup can race with unmount because
listen(...) is async and may resolve after cleanup; modify the
useDictationHotkey effect that calls listen to use an async IIFE and a disposed
flag (e.g., let disposed = false) so that when listen(...) resolves you only
assign unlisten = fn and call setToggleCount if not disposed, and if disposed is
true call fn() immediately to unsubscribe; ensure the cleanup sets disposed =
true and calls unlisten?.() so the listener is always removed even if listen
resolved after unmount (refer to the listen, unlisten and setToggleCount
symbols).
- Around line 72-74: The current extraction of DictationSettings using "'result'
in settings" is type-unsafe; add an explicit RpcOutcome<T> interface and a type
guard like isRpcOutcome(obj): obj is RpcOutcome<DictationSettings>, then in
useDictationHotkey replace the ad-hoc check around the local variable settings
(and the assignment to s) with a guarded extraction that returns settings.result
when isRpcOutcome(settings) is true, otherwise treats settings as
DictationSettings; reference the RpcOutcome generic, the isRpcOutcome type
guard, the useDictationHotkey function, the settings variable, and the local
variable s to locate and update the logic.
- Line 45: The hook is declared with a function declaration; convert it to an
exported arrow function to follow project guidelines: replace the declaration
export function useDictationHotkey(): DictationHotkeyState { ... } with an
exported const arrow form (export const useDictationHotkey = ():
DictationHotkeyState => { ... }) preserving the existing body and return type,
and ensure any internal references (closures, hooks) remain unchanged.
In `@src/openhuman/config/ops.rs`:
- Around line 539-546: The struct DictationSettingsPatch is missing the
consistent derive attributes used by other patch structs; update
DictationSettingsPatch to derive Debug, Clone, and Default (matching
ModelSettingsPatch and MemorySettingsPatch) so it can be debug-printed, cloned,
and default-constructed where expected; locate the DictationSettingsPatch
definition and add the derives above the struct declaration.
In `@src/openhuman/voice/streaming.rs`:
- Around line 57-65: The current loop clones the entire buffer (guard.clone())
into samples each inference which grows expensive; instead add a processed
offset (e.g., processed_samples) and only clone the new slice: inside the block
that locks buf_clone, compute let new_len = guard.len(); if new_len <=
processed_samples || (new_len - processed_samples) < 8000 { continue; } let
samples: Vec<i16> = guard[processed_samples..new_len].to_vec();
processed_samples = new_len; thereby replacing last_len and guard.clone() usage
in the samples creation logic (update any conditions that used last_len to use
processed_samples and the new_len delta).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 423af558-c16f-40a0-81d6-a0ab9794dc15
📒 Files selected for processing (15)
- app/src/App.tsx
- app/src/components/DictationHotkeyManager.tsx
- app/src/hooks/useDictationHotkey.ts
- app/src/pages/Conversations.tsx
- app/src/utils/tauriCommands.ts
- src/core/jsonrpc.rs
- src/openhuman/config/mod.rs
- src/openhuman/config/ops.rs
- src/openhuman/config/schema/dictation.rs
- src/openhuman/config/schema/load.rs
- src/openhuman/config/schema/mod.rs
- src/openhuman/config/schema/types.rs
- src/openhuman/config/schemas.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/streaming.rs
```rust
/// WebSocket upgrade handler for streaming voice dictation.
async fn dictation_ws_handler(ws: WebSocketUpgrade) -> Response {
    log::info!("[ws] dictation WebSocket upgrade requested");
    ws.on_upgrade(|socket| async move {
        let config = match crate::openhuman::config::rpc::load_config_with_timeout().await {
            Ok(c) => Arc::new(c),
            Err(e) => {
                log::error!("[ws] failed to load config for dictation: {e}");
                return;
            }
        };
        crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
    })
}
```
Load config before upgrading the WebSocket.
At line 334, the server upgrades first and only then attempts the config load (lines 335-340). On failure, the socket is dropped post-101, which is a misleading handshake success for clients.
🔧 Proposed fix

```diff
 async fn dictation_ws_handler(ws: WebSocketUpgrade) -> Response {
     log::info!("[ws] dictation WebSocket upgrade requested");
-    ws.on_upgrade(|socket| async move {
-        let config = match crate::openhuman::config::rpc::load_config_with_timeout().await {
-            Ok(c) => Arc::new(c),
-            Err(e) => {
-                log::error!("[ws] failed to load config for dictation: {e}");
-                return;
-            }
-        };
-        crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
-    })
+    let config = match crate::openhuman::config::rpc::load_config_with_timeout().await {
+        Ok(c) => Arc::new(c),
+        Err(e) => {
+            log::error!("[ws] failed to load config for dictation: {e}");
+            return (
+                StatusCode::SERVICE_UNAVAILABLE,
+                Json(json!({ "ok": false, "error": "dictation_unavailable" })),
+            )
+                .into_response();
+        }
+    };
+    ws.on_upgrade(move |socket| async move {
+        crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
+    })
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/jsonrpc.rs` around lines 331 - 344, The handler dictation_ws_handler
currently performs the WebSocket upgrade before loading config; change it to
load the config with
crate::openhuman::config::rpc::load_config_with_timeout().await (creating
Arc::new on success) before calling ws.on_upgrade, and if loading fails return
an appropriate non-upgrade Response (e.g., 500 error) instead of performing the
upgrade; once config is loaded, call ws.on_upgrade(move |socket| async move {
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
}) so handle_dictation_ws receives the preloaded Arc config.
```rust
if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
    if let Ok(ms) = val.trim().parse::<u64>() {
        self.dictation.streaming_interval_ms = ms;
    }
```
Add bounds validation for dictation streaming interval.
Lines 736-738 accept any u64, including 0. For a periodic inference interval, that can cause a tight loop and severe CPU pressure.
🔧 Proposed fix

```diff
 if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
     if let Ok(ms) = val.trim().parse::<u64>() {
-        self.dictation.streaming_interval_ms = ms;
+        if (100..=60_000).contains(&ms) {
+            self.dictation.streaming_interval_ms = ms;
+        } else {
+            tracing::warn!(
+                interval_ms = ms,
+                "ignoring invalid OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS (valid: 100..=60000)"
+            );
+        }
     }
 }
```
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
    if let Ok(ms) = val.trim().parse::<u64>() {
        if (100..=60_000).contains(&ms) {
            self.dictation.streaming_interval_ms = ms;
        } else {
            tracing::warn!(
                interval_ms = ms,
                "ignoring invalid OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS (valid: 100..=60000)"
            );
        }
    }
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/config/schema/load.rs` around lines 735 - 738, When parsing
OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS in the load routine, validate the
parsed u64 before assigning to self.dictation.streaming_interval_ms: reject 0
(and optionally cap to a sensible max) so you don't create a tight loop; update
the parsing block in load.rs (the section that reads
OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS and currently sets
self.dictation.streaming_interval_ms = ms) to only assign when ms >= 1 (or
within your chosen min/max bounds) and log or fallback to a safe default when
out of range.
```rust
let service = local_ai::global(&config_clone);
match whisper_engine::transcribe_pcm_i16(&service.whisper, &samples, None) {
    Ok(text) => {
        let trimmed = text.trim().to_string();
        if !trimmed.is_empty() {
            log::debug!(
                "{LOG_PREFIX} partial transcription ({} samples): {}",
                samples.len(),
                &trimmed[..trimmed.len().min(80)]
            );
            if partial_tx.send(trimmed).await.is_err() {
                break; // receiver dropped
            }
        }
    }
    Err(e) => {
        log::warn!("{LOG_PREFIX} partial inference error: {e}");
    }
}
```
Synchronous lock contention on shared Whisper engine.
The `transcribe_pcm_i16` call uses a `parking_lot::Mutex` (per whisper_engine.rs), which is a synchronous blocking lock. Multiple concurrent WebSocket sessions will serialize their inference calls, potentially causing one session to block for seconds while another completes. Consider documenting this limitation or using `spawn_blocking` if inference is CPU-bound.
This may be acceptable for initial implementation since concurrent dictation sessions are unlikely, but worth noting for future scalability.
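A sketch of the `spawn_blocking` route, under the assumption that the inference is CPU-bound and holds a synchronous lock; `transcribe_blocking` is a hypothetical stand-in for the engine call, not the project's actual API:

```rust
use tokio::sync::mpsc;

// Hypothetical stand-in for the real engine call, which is assumed to
// acquire a synchronous parking_lot::Mutex internally.
fn transcribe_blocking(samples: &[i16]) -> Result<String, String> {
    let _ = samples;
    Ok(String::new())
}

// Run the lock-holding, CPU-bound inference on Tokio's blocking pool so
// the async reactor (and other WebSocket sessions) keeps making progress.
async fn partial_pass(samples: Vec<i16>, partial_tx: &mpsc::Sender<String>) {
    match tokio::task::spawn_blocking(move || transcribe_blocking(&samples)).await {
        Ok(Ok(text)) => {
            let trimmed = text.trim().to_string();
            if !trimmed.is_empty() {
                // A send error just means the session ended; nothing to do.
                let _ = partial_tx.send(trimmed).await;
            }
        }
        Ok(Err(e)) => log::warn!("partial inference error: {e}"),
        Err(join_err) => log::warn!("inference task panicked: {join_err}"),
    }
}
```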
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/streaming.rs` around lines 67 - 85, The call to
whisper_engine::transcribe_pcm_i16 obtains a synchronous parking_lot::Mutex via
service.whisper and can block other WebSocket sessions; update the streaming
task to run the blocking inference on a blocking thread (e.g.,
tokio::task::spawn_blocking) instead of calling transcribe_pcm_i16 directly on
the async path, or alternatively document the limitation on concurrent sessions.
Locate the code around local_ai::global(&config_clone) and the
transcribe_pcm_i16 usage and move the heavy/locking work into spawn_blocking
(await its JoinHandle) so partial_tx.send(trimmed).await runs only after the
blocking call completes without holding up the async reactor.
```rust
            Some(Ok(Message::Close(_))) | None => {
                log::info!("{LOG_PREFIX} client disconnected");
                if let Some(h) = inference_handle {
                    h.abort();
                }
                return;
            }
            Some(Err(e)) => {
                log::warn!("{LOG_PREFIX} websocket error: {e}");
                if let Some(h) = inference_handle {
                    h.abort();
                }
                return;
            }
            _ => {}
        }
    }
}

// Stop the periodic inference task
if let Some(h) = inference_handle {
    h.abort();
}
```
Ownership error: inference_handle moved twice.
The inference_handle is moved into the match arms at lines 141-143 and 149-151 (inside the loop), but then accessed again at line 162 after the loop breaks via the stop command path. This will fail to compile because the value may have been moved.
🐛 Proposed fix using Option::take()

```diff
-let inference_handle = if do_streaming {
+let mut inference_handle = if do_streaming {
     let handle = tokio::spawn(async move {
         // ... inference task
     });
     Some(handle)
 } else {
     None
 };

 loop {
     tokio::select! {
         // ... partial_rx branch unchanged ...
         msg = socket.recv() => {
             match msg {
                 // ... Binary and Text branches unchanged ...
                 Some(Ok(Message::Close(_))) | None => {
                     log::info!("{LOG_PREFIX} client disconnected");
-                    if let Some(h) = inference_handle {
+                    if let Some(h) = inference_handle.take() {
                         h.abort();
                     }
                     return;
                 }
                 Some(Err(e)) => {
                     log::warn!("{LOG_PREFIX} websocket error: {e}");
-                    if let Some(h) = inference_handle {
+                    if let Some(h) = inference_handle.take() {
                         h.abort();
                     }
                     return;
                 }
                 _ => {}
             }
         }
     }
 }

 // Stop the periodic inference task
-if let Some(h) = inference_handle {
+if let Some(h) = inference_handle.take() {
     h.abort();
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/streaming.rs` around lines 139 - 164, The code moves
inference_handle twice causing an ownership error; change all places that
destructure it inside the loop and after the loop to take ownership via
Option::take() instead of pattern-matching directly (i.e., replace uses like "if
let Some(h) = inference_handle { h.abort(); }" inside the match arms and the
final stop block with "if let Some(h) = inference_handle.take() { h.abort(); }")
so the Option is emptied on first use and subsequent checks are safe; reference
the variable inference_handle and the match arms handling Message::Close/None
and Some(Err(e)) as well as the final "Stop the periodic inference task" block.
I'm merging this into #368, thanks bro
…cycle, WebSocket streaming)

Merge feat/332-voice-dictation into feat/stt to combine:
- Our standalone voice server (hotkey → record → transcribe → insert)
- PR tinyhumansai#371's DictationConfig, WebSocket streaming endpoint, frontend hotkey hook, and voice RPC type fixes

Resolved conflict in src/openhuman/voice/mod.rs — kept both server and streaming modules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Foundational infrastructure for voice dictation (EPIC #332, Issue #333):
- `config_get_dictation_settings` / `config_update_dictation_settings` controllers following the existing registry pattern
- `/ws/dictation` accepts PCM16 LE audio chunks (16kHz mono), runs periodic whisper inference on the accumulated buffer, and sends partial/final transcription results as JSON
- `useDictationHotkey` hook fetches config from core RPC, auto-registers the global hotkey via the Tauri shell, and listens for `dictation://toggle` events; `DictationHotkeyManager` headless component mounted in App.tsx
- Voice handlers return flat results (no `{result, logs}` wrapper); fixed the incorrect `CommandResponse<T>` wrapping on `openhumanVoiceStatus`, `openhumanVoiceTranscribe`, `openhumanVoiceTranscribeBytes`, and `openhumanVoiceTts` that caused "Could not check voice availability" errors

Files changed
- `src/openhuman/config/schema/dictation.rs`
- `src/openhuman/config/schema/{mod,types,load}.rs`
- `src/openhuman/config/{mod,ops,schemas}.rs`
- `src/openhuman/voice/streaming.rs`
- `src/openhuman/voice/mod.rs`
- `src/core/jsonrpc.rs` (`/ws/dictation` route + WebSocket upgrade handler)
- `app/src/hooks/useDictationHotkey.ts`
- `app/src/components/DictationHotkeyManager.tsx`
- `app/src/App.tsx`
- `app/src/utils/tauriCommands.ts`
- `app/src/pages/Conversations.tsx`

Test plan
- `cargo check` — compiles clean
- `cargo fmt --check` — no formatting issues
- `yarn lint` — 0 errors (6 pre-existing warnings)
- `yarn typecheck` (tsc --noEmit) — passes
- Manual verification via `[dictation]` logs
- `ws://127.0.0.1:7788/ws/dictation`

Closes #333
🤖 Generated with Claude Code
Summary by CodeRabbit

New Features
- Voice dictation with a configurable global hotkey, dictation settings endpoints, and a WebSocket streaming transcription endpoint delivering partial and final results.

Improvements
- Voice RPC wrappers now return unwrapped payloads, fixing response type mismatches in voice status, transcription, and TTS calls.