
feat(voice): dictation config, hotkey lifecycle, and WebSocket streaming (#332) #371

Closed
oxoxDev wants to merge 3 commits into tinyhumansai:main from oxoxDev:feat/332-voice-dictation

Conversation

@oxoxDev
Contributor

@oxoxDev oxoxDev commented Apr 6, 2026

Summary

Foundational infrastructure for voice dictation (EPIC #332, Issue #333):

  • DictationConfig schema — new config module with serde defaults and env var overrides (enabled, hotkey, activation_mode, llm_refinement, streaming, streaming_interval_ms); a sketch of the shape follows this list
  • RPC surface — config_get_dictation_settings / config_update_dictation_settings controllers following the existing registry pattern
  • WebSocket streaming endpoint — /ws/dictation accepts PCM16 LE audio chunks (16 kHz mono), runs periodic Whisper inference on the accumulated buffer, and sends partial/final transcription results as JSON
  • Frontend hotkey lifecycle — useDictationHotkey hook fetches config from core RPC, auto-registers the global hotkey via the Tauri shell, and listens for dictation://toggle events; the DictationHotkeyManager headless component is mounted in App.tsx
  • Voice RPC type fix — voice handlers return flat results (no {result, logs} wrapper); removed the incorrect CommandResponse<T> wrapping from openhumanVoiceStatus, openhumanVoiceTranscribe, openhumanVoiceTranscribeBytes, and openhumanVoiceTts that caused the "Could not check voice availability" error
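A minimal sketch of the schema described above, assuming serde defaults; the field names and the toggle/push activation modes come from this PR, while the derive set and the concrete default values are illustrative:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative only — the real definitions live in
/// src/openhuman/config/schema/dictation.rs.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum DictationActivationMode {
    Toggle, // press once to start, press again to stop
    Push,   // hold to dictate
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DictationConfig {
    #[serde(default)]
    pub enabled: bool,
    #[serde(default = "default_hotkey")]
    pub hotkey: String,
    #[serde(default = "default_activation_mode")]
    pub activation_mode: DictationActivationMode,
    #[serde(default)]
    pub llm_refinement: bool,
    #[serde(default)]
    pub streaming: bool,
    #[serde(default = "default_streaming_interval_ms")]
    pub streaming_interval_ms: u64,
}

fn default_hotkey() -> String {
    // Assumed placeholder; the real default hotkey is set in the PR.
    "CmdOrCtrl+Shift+D".into()
}

fn default_activation_mode() -> DictationActivationMode {
    DictationActivationMode::Toggle
}

fn default_streaming_interval_ms() -> u64 {
    // Assumed; the review thread below suggests valid values around 100..=60000 ms.
    1000
}
```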

Files changed

| File | What |
| --- | --- |
| `src/openhuman/config/schema/dictation.rs` | New DictationConfig + DictationActivationMode |
| `src/openhuman/config/schema/{mod,types,load}.rs` | Wire dictation into Config struct + env overrides |
| `src/openhuman/config/{mod,ops,schemas}.rs` | RPC get/update handlers for dictation settings |
| `src/openhuman/voice/streaming.rs` | WebSocket streaming transcription handler |
| `src/openhuman/voice/mod.rs` | Export streaming module |
| `src/core/jsonrpc.rs` | Add /ws/dictation route + WebSocket upgrade handler |
| `app/src/hooks/useDictationHotkey.ts` | Frontend hotkey hook |
| `app/src/components/DictationHotkeyManager.tsx` | Headless hotkey manager component |
| `app/src/App.tsx` | Mount DictationHotkeyManager |
| `app/src/utils/tauriCommands.ts` | Fix voice RPC return types |
| `app/src/pages/Conversations.tsx` | Fix voice status/transcribe/tts consumers |

Test plan

  • cargo check — compiles clean
  • cargo fmt --check — no formatting issues
  • yarn lint — 0 errors (6 pre-existing warnings)
  • yarn typecheck (tsc --noEmit) — passes
  • Manual: verify dictation hotkey registration in DevTools console ([dictation] logs)
  • Manual: test the WebSocket endpoint ws://127.0.0.1:7788/ws/dictation (a throwaway client sketch follows this list)
  • Manual: verify voice status shows "Ready" instead of "Could not check voice availability"
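For the manual WebSocket check, a hedged client sketch using the tokio-tungstenite and futures-util crates (a testing convenience, not part of this PR); the endpoint URL and the partial/final/stop message shapes come from the PR description and review diagrams, and the exact Message constructors vary slightly across tungstenite versions:

```rust
// Cargo.toml (assumed): tokio = { version = "1", features = ["macros", "rt-multi-thread"] },
// tokio-tungstenite = "0.21", futures-util = "0.3"
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (mut ws, _) = connect_async("ws://127.0.0.1:7788/ws/dictation").await?;

    // Send one second of silence as PCM16 LE, 16 kHz mono (16000 samples * 2 bytes).
    let silence = vec![0u8; 16000 * 2];
    ws.send(Message::Binary(silence.into())).await?;

    // Ask the server to finalize the transcription.
    ws.send(Message::Text(r#"{"type":"stop"}"#.into())).await?;

    // Print partial/final JSON results until the final message or close.
    while let Some(msg) = ws.next().await {
        match msg? {
            Message::Text(txt) => {
                println!("{txt}"); // e.g. {"type":"partial","text":"..."}
                if txt.contains(r#""type":"final""#) {
                    break;
                }
            }
            Message::Close(_) => break,
            _ => {}
        }
    }
    Ok(())
}
```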

Closes #333

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added dictation hotkey support for quick activation via configurable keyboard shortcuts.
    • Introduced WebSocket-based streaming transcription for real-time dictation feedback.
    • Added dictation configuration options including activation mode (toggle/push), hotkey customization, and optional LLM-based transcription refinement.
    • Enabled environment variable configuration for dictation settings.
  • Improvements

    • Simplified voice API response handling for cleaner integration.

oxoxDev and others added 3 commits April 7, 2026 00:28
…ing (tinyhumansai#332)

Add the foundational infrastructure for voice dictation (EPIC tinyhumansai#332):

**Rust core:**
- New `DictationConfig` schema with serde defaults and env var overrides
  (enabled, hotkey, activation_mode, llm_refinement, streaming, interval)
- RPC controllers: `config_get_dictation_settings` / `config_update_dictation_settings`
- WebSocket endpoint `/ws/dictation` for streaming PCM16 transcription
  with periodic partial inference and final LLM refinement
- Microphone permission declaration (`NSMicrophoneUsageDescription`) in
  Tauri macOS bundle config

**Frontend:**
- `useDictationHotkey` hook: fetches config from core RPC, auto-registers
  global hotkey, listens for `dictation://toggle` events
- `DictationHotkeyManager` headless component mounted in App.tsx
- Fix voice RPC response type mismatch: voice handlers return flat results
  (no `{result, logs}` wrapper), so remove incorrect `CommandResponse<T>`
  wrapping from `openhumanVoiceStatus`, `openhumanVoiceTranscribe`,
  `openhumanVoiceTranscribeBytes`, and `openhumanVoiceTts`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `infoPlist` field in tauri.conf.json expects a string path, not an
inline object. Remove it for now — microphone permission will be added
via a proper Info.plist supplement in the production build pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 6, 2026

📝 Walkthrough

This PR introduces voice dictation with global hotkey support across the full stack. Frontend changes add a dictation hotkey manager component and hook that register/unregister hotkeys and listen for toggle events. Backend adds dictation configuration schema, RPC endpoints for reading/updating settings, and a WebSocket streaming transcription handler. Voice RPC wrappers are refactored to return unwrapped payloads instead of envelope types.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Frontend Dictation Management**<br>`app/src/App.tsx`, `app/src/components/DictationHotkeyManager.tsx`, `app/src/hooks/useDictationHotkey.ts` | Added headless DictationHotkeyManager component and useDictationHotkey hook to manage global hotkey registration, event listening for toggle events, and state synchronization with core dictation settings. |
| **Voice RPC Type Unwrapping**<br>`app/src/utils/tauriCommands.ts`, `app/src/pages/Conversations.tsx` | Updated voice function signatures (status, transcribe, transcribeBytes, TTS) to return unwrapped payloads instead of CommandResponse envelopes; adjusted call sites to use unwrapped result fields. |
| **Backend Dictation Configuration Schema**<br>`src/openhuman/config/schema/dictation.rs`, `src/openhuman/config/schema/mod.rs`, `src/openhuman/config/schema/types.rs`, `src/openhuman/config/schema/load.rs` | Added DictationActivationMode and DictationConfig schema types with serde/JSON-schema support; integrated into Config struct with environment variable override handling for all dictation settings. |
| **Configuration API & Persistence**<br>`src/openhuman/config/mod.rs`, `src/openhuman/config/ops.rs`, `src/openhuman/config/schemas.rs` | Exposed public re-exports for dictation types; implemented get/update RPC endpoints for dictation settings; registered controller handlers for config.get_dictation_settings and config.update_dictation_settings. |
| **WebSocket Streaming Transcription**<br>`src/core/jsonrpc.rs`, `src/openhuman/voice/mod.rs`, `src/openhuman/voice/streaming.rs` | Added WebSocket route handler and new streaming module; implemented handle_dictation_ws to accumulate PCM16 audio frames, run periodic interim transcription at configured intervals, and send final results with optional LLM refinement. |
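The partial/final payloads shown in the diagrams below imply a tagged JSON shape; a hedged sketch of what the server-to-client message type could look like (only the field names appear in this PR, the enum itself is illustrative):

```rust
use serde::Serialize;

/// Illustrative shape for /ws/dictation server->client messages.
#[derive(Serialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum DictationServerMsg {
    /// Interim transcription of the buffered audio so far.
    Partial { text: String },
    /// Final transcription; `raw_text` holds the pre-refinement output.
    Final { text: String, raw_text: String },
}
// Serializes to {"type":"partial","text":"..."} and
// {"type":"final","text":"...","raw_text":"..."}.
```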

Sequence Diagrams

sequenceDiagram
    actor User
    participant App as App.tsx
    participant HotkeyMgr as DictationHotkeyManager
    participant Hook as useDictationHotkey
    participant Tauri as Tauri RPC
    participant Core as Core Backend

    User->>App: Launch application
    App->>HotkeyMgr: Mount component
    HotkeyMgr->>Hook: Call useDictationHotkey()
    Hook->>Tauri: callCoreRpc(config_get_dictation_settings)
    Tauri->>Core: openhuman.config_get_dictation_settings
    Core-->>Tauri: { dictationEnabled, hotkey, activationMode }
    Tauri-->>Hook: Settings payload
    Hook->>Hook: Update state (enabled, hotkey)
    
    alt Dictation enabled and hotkey configured
        Hook->>Tauri: registerDictationHotkey(hotkey)
        Tauri->>Core: Register global hotkey
        Core-->>Tauri: Success
        Tauri-->>Hook: hotkeyRegistered = true
    end
    
    Hook->>Tauri: listen('dictation://toggle')
    Tauri-->>Hook: Event listener registered
    HotkeyMgr-->>App: Component mounted (renders null)
    
    User->>User: Press hotkey
    Tauri->>Hook: dictation://toggle event
    Hook->>Hook: Increment toggleCount
    HotkeyMgr->>HotkeyMgr: Log state change
sequenceDiagram
    participant Client as WebSocket Client
    participant Handler as handle_dictation_ws
    participant Buffer as Audio Buffer
    participant Inference as STT Engine
    participant Refine as LLM Refinement
    participant Client2 as Client (response)

    Client->>Handler: Upgrade WebSocket connection
    Handler->>Handler: Load config, create buffer
    Handler->>Handler: Spawn periodic inference task
    
    loop Stream Audio Frames
        Client->>Handler: Send binary PCM16 frame
        Handler->>Buffer: Validate & append samples
        Buffer->>Buffer: Accumulate audio
    end
    
    par Periodic Inference
        Handler->>Handler: Timer fires
        Handler->>Inference: Transcribe buffered audio (partial)
        Inference-->>Handler: Interim result
        Handler->>Client2: Send {"type":"partial","text":"..."}
    and Main Loop Continues
        Client->>Handler: Send {"type":"stop"} text frame
        Handler->>Handler: Trigger finalization
    end
    
    Handler->>Inference: Transcribe full buffer (final)
    Inference-->>Handler: Full transcription result
    
    alt LLM refinement enabled
        Handler->>Refine: cleanup_transcription(text)
        Refine-->>Handler: Refined text
        Handler->>Client2: Send {"type":"final","text":"refined","raw_text":"original"}
    else No refinement
        Handler->>Client2: Send {"type":"final","text":"...","raw_text":"..."}
    end
    
    Client->>Handler: Close or disconnect
    Handler->>Handler: Abort background tasks, cleanup

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Possibly related PRs

  • PR #278: Introduces overlapping dictation hotkey registration/unregistration logic and frontend dictation components that are foundational to this PR's DictationHotkeyManager and hook architecture.
  • PR #178: Earlier voice/dictation streaming feature that establishes the baseline voice pipeline and RPC wrappers extended by this PR's WebSocket streaming endpoint and type unwrapping changes.

Suggested reviewers

  • YellowSnnowmann
  • graycyrus

Poem

🐰 Hop, hop! A hotkey is born,
Dictation calls at dawn,
WebSockets sing in streaming tune,
Voice flows like drops of moon. 🎤✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.26%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly and concisely describes the main feature additions: dictation config, hotkey lifecycle, and WebSocket streaming capabilities. |
| Linked Issues check | ✅ Passed | PR addresses all coding requirements from #333: configurable hotkey via useDictationHotkey hook, hotkey registration/cleanup lifecycle, dictation enabled/disabled state management, and WebSocket streaming endpoint for voice input. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to dictation infrastructure: config schema, RPC handlers, WebSocket endpoint, frontend hooks/components, and voice RPC type fixes. No unrelated refactoring or feature creep detected. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (6)
app/src/components/DictationHotkeyManager.tsx (1)

12-12: Prefer arrow component declaration for consistency.

Line 12 uses a function declaration; convert to a const arrow component to match project style guidelines.

♻️ Suggested refactor
-export default function DictationHotkeyManager() {
+const DictationHotkeyManager = () => {
   const { dictationEnabled, hotkeyRegistered, toggleCount, hotkey } = useDictationHotkey();
@@
   return null;
-}
+};
+
+export default DictationHotkeyManager;

Aligns with project guideline for **/*.{js,jsx,ts,tsx}: "Prefer arrow functions over function declarations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/DictationHotkeyManager.tsx` at line 12, Convert the
function declaration for DictationHotkeyManager into a const arrow component to
match project style: replace "export default function DictationHotkeyManager()"
with a const arrow assignment like "const DictationHotkeyManager = () => { ...
}" and keep the default export (either export default on the const or export
default DictationHotkeyManager at the end); preserve all existing props, return
value and internal logic inside the new arrow function and retain any
TypeScript/React typings if present.
src/openhuman/config/ops.rs (1)

539-546: Missing derive attributes on DictationSettingsPatch.

Other patch structs in this file (e.g., ModelSettingsPatch, MemorySettingsPatch) derive Debug, Clone, and Default. This struct is missing those attributes, which may cause issues if debugging or cloning is needed.

♻️ Proposed fix to add consistent derives
+#[derive(Debug, Clone, Default)]
 pub struct DictationSettingsPatch {
     pub enabled: Option<bool>,
     pub hotkey: Option<String>,
     pub activation_mode: Option<String>,
     pub llm_refinement: Option<bool>,
     pub streaming: Option<bool>,
     pub streaming_interval_ms: Option<u64>,
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/ops.rs` around lines 539 - 546, The struct
DictationSettingsPatch is missing the consistent derive attributes used by other
patch structs; update DictationSettingsPatch to derive Debug, Clone, and Default
(matching ModelSettingsPatch and MemorySettingsPatch) so it can be
debug-printed, cloned, and default-constructed where expected; locate the
DictationSettingsPatch definition and add the derives above the struct
declaration.
app/src/hooks/useDictationHotkey.ts (3)

116-129: Potential race condition in listener setup.

If the component unmounts before listen() resolves, unlisten will be undefined and the cleanup won't properly unsubscribe. Consider using an async IIFE with a disposed flag check, similar to the first useEffect.

♻️ Proposed fix with disposal tracking
   useEffect(() => {
     if (!isTauri()) return;

-    let unlisten: (() => void) | undefined;
+    let disposed = false;
+    let unlistenFn: (() => void) | undefined;

-    listen('dictation://toggle', () => {
-      console.debug('[dictation] hotkey toggle event received');
-      setToggleCount(c => c + 1);
-    })
-      .then(fn => {
-        unlisten = fn;
-      })
-      .catch(err => {
-        console.warn('[dictation] failed to listen for dictation toggle', err);
-      });
+    (async () => {
+      try {
+        const fn = await listen('dictation://toggle', () => {
+          console.debug('[dictation] hotkey toggle event received');
+          setToggleCount(c => c + 1);
+        });
+        if (disposed) {
+          fn();
+        } else {
+          unlistenFn = fn;
+        }
+      } catch (err) {
+        console.warn('[dictation] failed to listen for dictation toggle', err);
+      }
+    })();

     return () => {
-      unlisten?.();
+      disposed = true;
+      unlistenFn?.();
     };
   }, []);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/hooks/useDictationHotkey.ts` around lines 116 - 129, The listener
setup can race with unmount because listen(...) is async and may resolve after
cleanup; modify the useDictationHotkey effect that calls listen to use an async
IIFE and a disposed flag (e.g., let disposed = false) so that when listen(...)
resolves you only assign unlisten = fn and call setToggleCount if not disposed,
and if disposed is true call fn() immediately to unsubscribe; ensure the cleanup
sets disposed = true and calls unlisten?.() so the listener is always removed
even if listen resolved after unmount (refer to the listen, unlisten and
setToggleCount symbols).

72-74: Type-unsafe RpcOutcome wrapper handling.

The check 'result' in settings and subsequent cast is fragile. If the backend response shape changes, this could silently extract wrong data. Consider defining a more explicit type for the wrapped response or trusting the backend to return consistent shapes.

♻️ Proposed improvement with explicit typing
+interface RpcOutcomeWrapper<T> {
+  result: T;
+}
+
+type DictationSettingsResponse = DictationSettings | RpcOutcomeWrapper<DictationSettings>;
+
 // Inside init():
-        const settings = await callCoreRpc<DictationSettings>({
+        const response = await callCoreRpc<DictationSettingsResponse>({
           method: 'openhuman.config_get_dictation_settings',
         });

         if (disposed) return;

-        if (!settings || typeof settings !== 'object') {
+        if (!response || typeof response !== 'object') {
           console.debug('[dictation] no dictation settings from core');
           return;
         }

         // Handle RpcOutcome wrapper — the result may be nested in .result
-        const s = (
-          'result' in settings ? (settings as Record<string, unknown>).result : settings
-        ) as DictationSettings;
+        const s: DictationSettings =
+          'result' in response && typeof response.result === 'object'
+            ? (response.result as DictationSettings)
+            : (response as DictationSettings);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/hooks/useDictationHotkey.ts` around lines 72 - 74, The current
extraction of DictationSettings using "'result' in settings" is type-unsafe; add
an explicit RpcOutcome<T> interface and a type guard like isRpcOutcome(obj): obj
is RpcOutcome<DictationSettings>, then in useDictationHotkey replace the ad-hoc
check around the local variable settings (and the assignment to s) with a
guarded extraction that returns settings.result when isRpcOutcome(settings) is
true, otherwise treats settings as DictationSettings; reference the RpcOutcome
generic, the isRpcOutcome type guard, the useDictationHotkey function, the
settings variable, and the local variable s to locate and update the logic.

45-45: Prefer arrow function for hook definition.

As per coding guidelines, arrow functions are preferred over function declarations.

♻️ Proposed fix
-export function useDictationHotkey(): DictationHotkeyState {
+export const useDictationHotkey = (): DictationHotkeyState => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/hooks/useDictationHotkey.ts` at line 45, The hook is declared with a
function declaration; convert it to an exported arrow function to follow project
guidelines: replace the declaration export function useDictationHotkey():
DictationHotkeyState { ... } with an exported const arrow form (export const
useDictationHotkey = (): DictationHotkeyState => { ... }) preserving the
existing body and return type, and ensure any internal references (closures,
hooks) remain unchanged.
src/openhuman/voice/streaming.rs (1)

57-65: Buffer clone on each inference pass could be optimized.

Cloning the entire accumulated buffer (guard.clone()) for each partial inference may become expensive as audio accumulates (e.g., 30s of audio = ~960KB). Consider tracking processed sample count and only cloning new data, or accepting this trade-off for simplicity in the initial implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/streaming.rs` around lines 57 - 65, The current loop
clones the entire buffer (guard.clone()) into samples each inference which grows
expensive; instead add a processed offset (e.g., processed_samples) and only
clone the new slice: inside the block that locks buf_clone, compute let new_len
= guard.len(); if new_len <= processed_samples || (new_len - processed_samples)
< 8000 { continue; } let samples: Vec<i16> =
guard[processed_samples..new_len].to_vec(); processed_samples = new_len; thereby
replacing last_len and guard.clone() usage in the samples creation logic (update
any conditions that used last_len to use processed_samples and the new_len
delta).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/jsonrpc.rs`:
- Around line 331-344: The handler dictation_ws_handler currently performs the
WebSocket upgrade before loading config; change it to load the config with
crate::openhuman::config::rpc::load_config_with_timeout().await (creating
Arc::new on success) before calling ws.on_upgrade, and if loading fails return
an appropriate non-upgrade Response (e.g., 500 error) instead of performing the
upgrade; once config is loaded, call ws.on_upgrade(move |socket| async move {
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
}) so handle_dictation_ws receives the preloaded Arc config.

In `@src/openhuman/config/schema/load.rs`:
- Around line 735-738: When parsing OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS in
the load routine, validate the parsed u64 before assigning to
self.dictation.streaming_interval_ms: reject 0 (and optionally cap to a sensible
max) so you don't create a tight loop; update the parsing block in load.rs (the
section that reads OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS and currently sets
self.dictation.streaming_interval_ms = ms) to only assign when ms >= 1 (or
within your chosen min/max bounds) and log or fallback to a safe default when
out of range.

In `@src/openhuman/voice/streaming.rs`:
- Around line 67-85: The call to whisper_engine::transcribe_pcm_i16 obtains a
synchronous parking_lot::Mutex via service.whisper and can block other WebSocket
sessions; update the streaming task to run the blocking inference on a blocking
thread (e.g., tokio::task::spawn_blocking) instead of calling transcribe_pcm_i16
directly on the async path, or alternatively document the limitation on
concurrent sessions. Locate the code around local_ai::global(&config_clone) and
the transcribe_pcm_i16 usage and move the heavy/locking work into spawn_blocking
(await its JoinHandle) so partial_tx.send(trimmed).await runs only after the
blocking call completes without holding up the async reactor.
- Around line 139-164: The code moves inference_handle twice causing an
ownership error; change all places that destructure it inside the loop and after
the loop to take ownership via Option::take() instead of pattern-matching
directly (i.e., replace uses like "if let Some(h) = inference_handle {
h.abort(); }" inside the match arms and the final stop block with "if let
Some(h) = inference_handle.take() { h.abort(); }") so the Option is emptied on
first use and subsequent checks are safe; reference the variable
inference_handle and the match arms handling Message::Close/None and
Some(Err(e)) as well as the final "Stop the periodic inference task" block.

---

Nitpick comments:
In `@app/src/components/DictationHotkeyManager.tsx`:
- Line 12: Convert the function declaration for DictationHotkeyManager into a
const arrow component to match project style: replace "export default function
DictationHotkeyManager()" with a const arrow assignment like "const
DictationHotkeyManager = () => { ... }" and keep the default export (either
export default on the const or export default DictationHotkeyManager at the
end); preserve all existing props, return value and internal logic inside the
new arrow function and retain any TypeScript/React typings if present.

In `@app/src/hooks/useDictationHotkey.ts`:
- Around line 116-129: The listener setup can race with unmount because
listen(...) is async and may resolve after cleanup; modify the
useDictationHotkey effect that calls listen to use an async IIFE and a disposed
flag (e.g., let disposed = false) so that when listen(...) resolves you only
assign unlisten = fn and call setToggleCount if not disposed, and if disposed is
true call fn() immediately to unsubscribe; ensure the cleanup sets disposed =
true and calls unlisten?.() so the listener is always removed even if listen
resolved after unmount (refer to the listen, unlisten and setToggleCount
symbols).
- Around line 72-74: The current extraction of DictationSettings using "'result'
in settings" is type-unsafe; add an explicit RpcOutcome<T> interface and a type
guard like isRpcOutcome(obj): obj is RpcOutcome<DictationSettings>, then in
useDictationHotkey replace the ad-hoc check around the local variable settings
(and the assignment to s) with a guarded extraction that returns settings.result
when isRpcOutcome(settings) is true, otherwise treats settings as
DictationSettings; reference the RpcOutcome generic, the isRpcOutcome type
guard, the useDictationHotkey function, the settings variable, and the local
variable s to locate and update the logic.
- Line 45: The hook is declared with a function declaration; convert it to an
exported arrow function to follow project guidelines: replace the declaration
export function useDictationHotkey(): DictationHotkeyState { ... } with an
exported const arrow form (export const useDictationHotkey = ():
DictationHotkeyState => { ... }) preserving the existing body and return type,
and ensure any internal references (closures, hooks) remain unchanged.

In `@src/openhuman/config/ops.rs`:
- Around line 539-546: The struct DictationSettingsPatch is missing the
consistent derive attributes used by other patch structs; update
DictationSettingsPatch to derive Debug, Clone, and Default (matching
ModelSettingsPatch and MemorySettingsPatch) so it can be debug-printed, cloned,
and default-constructed where expected; locate the DictationSettingsPatch
definition and add the derives above the struct declaration.

In `@src/openhuman/voice/streaming.rs`:
- Around line 57-65: The current loop clones the entire buffer (guard.clone())
into samples each inference which grows expensive; instead add a processed
offset (e.g., processed_samples) and only clone the new slice: inside the block
that locks buf_clone, compute let new_len = guard.len(); if new_len <=
processed_samples || (new_len - processed_samples) < 8000 { continue; } let
samples: Vec<i16> = guard[processed_samples..new_len].to_vec();
processed_samples = new_len; thereby replacing last_len and guard.clone() usage
in the samples creation logic (update any conditions that used last_len to use
processed_samples and the new_len delta).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 423af558-c16f-40a0-81d6-a0ab9794dc15

📥 Commits

Reviewing files that changed from the base of the PR and between 9b73d00 and 1261874.

📒 Files selected for processing (15)
  • app/src/App.tsx
  • app/src/components/DictationHotkeyManager.tsx
  • app/src/hooks/useDictationHotkey.ts
  • app/src/pages/Conversations.tsx
  • app/src/utils/tauriCommands.ts
  • src/core/jsonrpc.rs
  • src/openhuman/config/mod.rs
  • src/openhuman/config/ops.rs
  • src/openhuman/config/schema/dictation.rs
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/config/schema/types.rs
  • src/openhuman/config/schemas.rs
  • src/openhuman/voice/mod.rs
  • src/openhuman/voice/streaming.rs

Comment thread src/core/jsonrpc.rs
Comment on lines +331 to +344
/// WebSocket upgrade handler for streaming voice dictation.
async fn dictation_ws_handler(ws: WebSocketUpgrade) -> Response {
    log::info!("[ws] dictation WebSocket upgrade requested");
    ws.on_upgrade(|socket| async move {
        let config = match crate::openhuman::config::rpc::load_config_with_timeout().await {
            Ok(c) => Arc::new(c),
            Err(e) => {
                log::error!("[ws] failed to load config for dictation: {e}");
                return;
            }
        };
        crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
    })
}
Contributor

⚠️ Potential issue | 🟠 Major

Load config before upgrading the WebSocket.

At Line [334], the server upgrades first and only then attempts config load (Lines [335]-[340]). On failure, the socket is dropped post-101, which is a misleading handshake success for clients.

🔧 Proposed fix
 async fn dictation_ws_handler(ws: WebSocketUpgrade) -> Response {
     log::info!("[ws] dictation WebSocket upgrade requested");
-    ws.on_upgrade(|socket| async move {
-        let config = match crate::openhuman::config::rpc::load_config_with_timeout().await {
-            Ok(c) => Arc::new(c),
-            Err(e) => {
-                log::error!("[ws] failed to load config for dictation: {e}");
-                return;
-            }
-        };
-        crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
-    })
+    let config = match crate::openhuman::config::rpc::load_config_with_timeout().await {
+        Ok(c) => Arc::new(c),
+        Err(e) => {
+            log::error!("[ws] failed to load config for dictation: {e}");
+            return (
+                StatusCode::SERVICE_UNAVAILABLE,
+                Json(json!({ "ok": false, "error": "dictation_unavailable" })),
+            )
+                .into_response();
+        }
+    };
+    ws.on_upgrade(move |socket| async move {
+        crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
+    })
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/jsonrpc.rs` around lines 331 - 344, The handler dictation_ws_handler
currently performs the WebSocket upgrade before loading config; change it to
load the config with
crate::openhuman::config::rpc::load_config_with_timeout().await (creating
Arc::new on success) before calling ws.on_upgrade, and if loading fails return
an appropriate non-upgrade Response (e.g., 500 error) instead of performing the
upgrade; once config is loaded, call ws.on_upgrade(move |socket| async move {
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await;
}) so handle_dictation_ws receives the preloaded Arc config.

Comment on lines +735 to +738
if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
    if let Ok(ms) = val.trim().parse::<u64>() {
        self.dictation.streaming_interval_ms = ms;
    }
Contributor

⚠️ Potential issue | 🟠 Major

Add bounds validation for dictation streaming interval.

Lines [736]-[738] accept any u64, including 0. For a periodic inference interval, that can cause a tight loop and severe CPU pressure.

🔧 Proposed fix
         if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
             if let Ok(ms) = val.trim().parse::<u64>() {
-                self.dictation.streaming_interval_ms = ms;
+                if (100..=60_000).contains(&ms) {
+                    self.dictation.streaming_interval_ms = ms;
+                } else {
+                    tracing::warn!(
+                        interval_ms = ms,
+                        "ignoring invalid OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS (valid: 100..=60000)"
+                    );
+                }
             }
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
    if let Ok(ms) = val.trim().parse::<u64>() {
        self.dictation.streaming_interval_ms = ms;
    }

if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
    if let Ok(ms) = val.trim().parse::<u64>() {
        if (100..=60_000).contains(&ms) {
            self.dictation.streaming_interval_ms = ms;
        } else {
            tracing::warn!(
                interval_ms = ms,
                "ignoring invalid OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS (valid: 100..=60000)"
            );
        }
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/schema/load.rs` around lines 735 - 738, When parsing
OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS in the load routine, validate the
parsed u64 before assigning to self.dictation.streaming_interval_ms: reject 0
(and optionally cap to a sensible max) so you don't create a tight loop; update
the parsing block in load.rs (the section that reads
OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS and currently sets
self.dictation.streaming_interval_ms = ms) to only assign when ms >= 1 (or
within your chosen min/max bounds) and log or fallback to a safe default when
out of range.

Comment on lines +67 to +85
let service = local_ai::global(&config_clone);
match whisper_engine::transcribe_pcm_i16(&service.whisper, &samples, None) {
    Ok(text) => {
        let trimmed = text.trim().to_string();
        if !trimmed.is_empty() {
            log::debug!(
                "{LOG_PREFIX} partial transcription ({} samples): {}",
                samples.len(),
                &trimmed[..trimmed.len().min(80)]
            );
            if partial_tx.send(trimmed).await.is_err() {
                break; // receiver dropped
            }
        }
    }
    Err(e) => {
        log::warn!("{LOG_PREFIX} partial inference error: {e}");
    }
}
Contributor

⚠️ Potential issue | 🟡 Minor

Synchronous lock contention on shared Whisper engine.

The transcribe_pcm_i16 call uses a parking_lot::Mutex (per whisper_engine.rs), which is a synchronous blocking lock. Multiple concurrent WebSocket sessions will serialize their inference calls, potentially causing one session to block for seconds while another completes. Consider documenting this limitation or using spawn_blocking if inference is CPU-bound.

This may be acceptable for initial implementation since concurrent dictation sessions are unlikely, but worth noting for future scalability.
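A hedged sketch of the spawn_blocking variant suggested here, assuming the samples and config handles can be cloned into the closure (the real ownership details depend on the local_ai/whisper_engine internals):

```rust
// Sketch only: run the mutex-guarded Whisper call on a blocking thread so it
// cannot stall the async reactor. Names mirror the PR; exact types are assumed.
let samples_owned = samples.clone();
let config_for_task = config_clone.clone();
let outcome = tokio::task::spawn_blocking(move || {
    let service = local_ai::global(&config_for_task);
    whisper_engine::transcribe_pcm_i16(&service.whisper, &samples_owned, None)
})
.await;

match outcome {
    Ok(Ok(text)) => {
        // Forward the partial result exactly as the existing code does.
        let trimmed = text.trim().to_string();
        if !trimmed.is_empty() && partial_tx.send(trimmed).await.is_err() {
            break; // receiver dropped
        }
    }
    Ok(Err(e)) => log::warn!("{LOG_PREFIX} partial inference error: {e}"),
    Err(join_err) => log::warn!("{LOG_PREFIX} inference task panicked: {join_err}"),
}
```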

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/streaming.rs` around lines 67 - 85, The call to
whisper_engine::transcribe_pcm_i16 obtains a synchronous parking_lot::Mutex via
service.whisper and can block other WebSocket sessions; update the streaming
task to run the blocking inference on a blocking thread (e.g.,
tokio::task::spawn_blocking) instead of calling transcribe_pcm_i16 directly on
the async path, or alternatively document the limitation on concurrent sessions.
Locate the code around local_ai::global(&config_clone) and the
transcribe_pcm_i16 usage and move the heavy/locking work into spawn_blocking
(await its JoinHandle) so partial_tx.send(trimmed).await runs only after the
blocking call completes without holding up the async reactor.

Comment on lines +139 to +164
                    Some(Ok(Message::Close(_))) | None => {
                        log::info!("{LOG_PREFIX} client disconnected");
                        if let Some(h) = inference_handle {
                            h.abort();
                        }
                        return;
                    }

                    Some(Err(e)) => {
                        log::warn!("{LOG_PREFIX} websocket error: {e}");
                        if let Some(h) = inference_handle {
                            h.abort();
                        }
                        return;
                    }

                    _ => {}
                }
            }
        }
    }

    // Stop the periodic inference task
    if let Some(h) = inference_handle {
        h.abort();
    }
Contributor

⚠️ Potential issue | 🔴 Critical

Ownership error: inference_handle moved twice.

The inference_handle is moved into the match arms at lines 141-143 and 149-151 (inside the loop), but then accessed again at line 162 after the loop breaks via the stop command path. This will fail to compile because the value may have been moved.

🐛 Proposed fix using Option::take()
-    let inference_handle = if do_streaming {
+    let mut inference_handle = if do_streaming {
         let handle = tokio::spawn(async move {
             // ... inference task
         });
         Some(handle)
     } else {
         None
     };

     loop {
         tokio::select! {
             // ... partial_rx branch unchanged ...

             msg = socket.recv() => {
                 match msg {
                     // ... Binary and Text branches unchanged ...

                     Some(Ok(Message::Close(_))) | None => {
                         log::info!("{LOG_PREFIX} client disconnected");
-                        if let Some(h) = inference_handle {
+                        if let Some(h) = inference_handle.take() {
                             h.abort();
                         }
                         return;
                     }

                     Some(Err(e)) => {
                         log::warn!("{LOG_PREFIX} websocket error: {e}");
-                        if let Some(h) = inference_handle {
+                        if let Some(h) = inference_handle.take() {
                             h.abort();
                         }
                         return;
                     }

                     _ => {}
                 }
             }
         }
     }

     // Stop the periodic inference task
-    if let Some(h) = inference_handle {
+    if let Some(h) = inference_handle.take() {
         h.abort();
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/streaming.rs` around lines 139 - 164, The code moves
inference_handle twice causing an ownership error; change all places that
destructure it inside the loop and after the loop to take ownership via
Option::take() instead of pattern-matching directly (i.e., replace uses like "if
let Some(h) = inference_handle { h.abort(); }" inside the match arms and the
final stop block with "if let Some(h) = inference_handle.take() { h.abort(); }")
so the Option is emptied on first use and subsequent checks are safe; reference
the variable inference_handle and the match arms handling Message::Close/None
and Some(Err(e)) as well as the final "Stop the periodic inference task" block.

@senamakel
Member

I'm merging this into #368 thanks bro

@senamakel senamakel closed this Apr 6, 2026
senamakel added a commit to senamakel/openhuman that referenced this pull request Apr 6, 2026
…cycle, WebSocket streaming)

Merge feat/332-voice-dictation into feat/stt to combine:
- Our standalone voice server (hotkey → record → transcribe → insert)
- PR tinyhumansai#371's DictationConfig, WebSocket streaming endpoint, frontend
  hotkey hook, and voice RPC type fixes

Resolved conflict in src/openhuman/voice/mod.rs — kept both server
and streaming modules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

[Feature] Voice dictation: global hotkey and overlay start/stop

2 participants