
feat(voice): standalone voice dictation server with hotkey support #368

Merged
senamakel merged 23 commits into tinyhumansai:main from senamakel:feat/stt
Apr 6, 2026

Conversation

senamakel (Member) commented Apr 6, 2026

Summary

  • Add standalone voice dictation server: hotkey → record → transcribe (whisper) → insert text into active field
  • Support tap (toggle on/off) and push (hold-to-record) activation modes via configurable hotkey
  • Auto-enable LLM transcription cleanup when local model is downloaded and ready
  • New CLI command: openhuman voice [--hotkey <combo>] [--mode <tap|push>]
  • New RPC endpoints: voice_server_start, voice_server_stop, voice_server_status
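For illustration, a start request against the new RPC might look like the sketch below. The JSON-RPC envelope and the parameter names ("hotkey", "activation_mode") are assumptions inferred from the config schema discussed in review, not a confirmed wire format.

```rust
// Hypothetical voice_server_start request shape (requires the serde_json crate).
fn main() {
    let req = serde_json::json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "voice_server_start",
        "params": { "hotkey": "ctrl+shift+space", "activation_mode": "tap" }
    });
    println!("{req}");
}
```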

New files

  • src/openhuman/voice/audio_capture.rs: microphone recording via cpal (16kHz mono WAV)
  • src/openhuman/voice/hotkey.rs: global hotkey listener via rdev
  • src/openhuman/voice/text_input.rs: text insertion into the active field via enigo
  • src/openhuman/voice/server.rs: voice server orchestrating the full pipeline
  • src/openhuman/config/schema/voice_server.rs: config schema for voice server settings

Dependencies added

cpal (audio capture), hound (WAV encoding), enigo (text insertion), rdev (global hotkey)
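To make the pipeline concrete, here is a minimal capture-and-encode sketch using these crates. It assumes cpal 0.15 and hound 3.x and an input device that supports f32 samples; the PR's actual audio_capture.rs additionally downmixes and resamples to 16kHz mono, which is elided here.

```rust
use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use std::sync::{Arc, Mutex};

/// Record from the default input device for `secs` seconds.
fn record_seconds(secs: u64) -> Result<(Vec<f32>, u32), String> {
    let host = cpal::default_host();
    let device = host
        .default_input_device()
        .ok_or("no default input device")?;
    let supported = device.default_input_config().map_err(|e| e.to_string())?;
    let sample_rate = supported.sample_rate().0;

    let samples = Arc::new(Mutex::new(Vec::<f32>::new()));
    let sink = Arc::clone(&samples);
    let stream = device
        .build_input_stream(
            &supported.into(),
            move |data: &[f32], _: &cpal::InputCallbackInfo| {
                sink.lock().unwrap().extend_from_slice(data);
            },
            |err| eprintln!("stream error: {err}"),
            None, // no build timeout
        )
        .map_err(|e| e.to_string())?;
    stream.play().map_err(|e| e.to_string())?;
    std::thread::sleep(std::time::Duration::from_secs(secs));
    drop(stream); // stops capture

    let buf = samples.lock().unwrap().clone();
    Ok((buf, sample_rate))
}

/// Encode f32 samples as 16-bit PCM WAV (downmix/resample omitted).
fn write_wav(path: &str, samples: &[f32], sample_rate: u32) -> Result<(), String> {
    let spec = hound::WavSpec {
        channels: 1,
        sample_rate,
        bits_per_sample: 16,
        sample_format: hound::SampleFormat::Int,
    };
    let mut writer = hound::WavWriter::create(path, spec).map_err(|e| e.to_string())?;
    for &s in samples {
        writer
            .write_sample((s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
            .map_err(|e| e.to_string())?;
    }
    writer.finalize().map_err(|e| e.to_string())
}
```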

Test plan

  • All new unit tests pass (cargo test — 38+ voice tests)
  • cargo check clean (zero errors, zero warnings)
  • CLI help renders correctly (openhuman voice --help)
  • Manual test: run openhuman voice and dictate with Ctrl+Shift+Space
  • Manual test: push-to-talk mode with --mode push
  • Manual test: verify text insertion into various apps (terminal, browser, editor)
  • Test on macOS with accessibility permissions granted

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Voice dictation server with configurable hotkey (toggle or push-to-talk) and start/stop/status controls.
    • Real-time streaming transcription over WebSocket with interim and final results.
    • Automatic insertion of transcribed text into the active application.
    • Desktop hotkey manager and UI hook for dictation toggles.
    • New app-level config endpoints and settings UI for dictation (hotkey, mode, refinement, streaming).

senamakel and others added 6 commits April 6, 2026 10:36
- Introduced a new `voice` subcommand to the CLI for running a standalone voice dictation server that listens for a hotkey, records audio, transcribes it using Whisper, and inserts the result into the active text field.
- Implemented configuration options for the voice server, including hotkey combination, activation mode (tap or push), and an option to skip LLM post-processing.
- Added audio capture functionality using the `cpal` crate and integrated hotkey listening with the `rdev` crate for global key event handling.
- Enhanced the configuration schema to include voice server settings and updated the main configuration structure accordingly.
- Updated relevant modules and tests to ensure consistent behavior and functionality across the application.

This feature enhances user interaction by allowing voice dictation directly into any active text field, improving accessibility and usability.
- Introduced a standalone voice dictation server that listens for a configurable hotkey to start recording audio, transcribes it using whisper, and inserts the transcribed text into the active text field.
- Added CLI support for the `voice` command, allowing users to manage the voice server's configuration, including hotkey and activation mode settings.
- Implemented configuration structures for the voice server, including options for automatic start, hotkey combination, activation mode, and cleanup behavior.
- Enhanced audio capture functionality using the `cpal` library for microphone input and integrated text insertion using the `enigo` library for simulating keyboard input.
- Updated relevant modules and schemas to support the new voice server features, ensuring a cohesive integration within the OpenHuman platform.
…unctionality

- Updated the `run_voice_server_command` function to initialize the configuration with environment overrides instead of loading from a file, improving performance and flexibility.
- Refactored the audio capture logic in `start_recording` to enhance thread management and error handling, ensuring a more robust audio stream setup.
- Improved the handling of audio stream creation and playback, ensuring that all cpal objects are managed on the same thread as required, enhancing stability during recording operations.
- Eliminated the `HotkeyListenerHandle` import from the `server.rs` file, streamlining the code and improving clarity by removing unnecessary dependencies.
The postprocessor now checks the local LLM state and automatically
enables transcription cleanup when the model is downloaded and ready,
even if not explicitly configured. Falls back gracefully to raw text
when the LLM is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai (bot) commented Apr 6, 2026

Warning

Rate limit exceeded

@senamakel has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 12 minutes and 7 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 12 minutes and 7 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4190bdb5-2853-4bde-a9bd-ca1c2edeefb8

📥 Commits

Reviewing files that changed from the base of the PR and between 7721269 and 89f5e31.

📒 Files selected for processing (10)
  • .github/Dockerfile
  • app/src/pages/Conversations.tsx
  • app/src/utils/tauriCommands.ts
  • src/core/cli.rs
  • src/core/jsonrpc.rs
  • src/core/socketio.rs
  • src/openhuman/config/mod.rs
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/config/schema/types.rs
📝 Walkthrough

Adds a voice dictation system: hotkey-driven audio capture, system-wide hotkey listener, WAV resampling, WebSocket streaming transcription with optional LLM refinement, text insertion, config schema/ops/RPC, frontend hotkey hook/component, CLI and Docker/CI adjustments.

Changes

  • Dependencies (Cargo.toml): added runtime deps cpal, hound, enigo, rdev.
  • Config schema & loading (src/openhuman/config/schema/mod.rs, src/openhuman/config/schema/types.rs, src/openhuman/config/schema/dictation.rs, src/openhuman/config/schema/voice_server.rs, src/openhuman/config/schema/load.rs): new DictationConfig and VoiceServerConfig, activation enums, serde defaults, integration into the top-level Config, and env-var overrides for dictation.
  • Config ops & RPC (src/openhuman/config/mod.rs, src/openhuman/config/ops.rs, src/openhuman/config/schemas.rs): added dictation settings RPCs and patch/update types (get/update handlers and patch struct).
  • Audio capture, hotkey, text input (src/openhuman/voice/audio_capture.rs, src/openhuman/voice/hotkey.rs, src/openhuman/voice/text_input.rs): implemented microphone capture (resample to 16kHz mono WAV), system-wide hotkey parsing/listener, and simulated-typing text insertion.
  • Voice server & streaming WS (src/openhuman/voice/server.rs, src/openhuman/voice/streaming.rs, src/openhuman/voice/mod.rs): new voice server orchestration (hotkey → record → transcribe → insert), global server singleton, standalone run, and the /ws/dictation streaming handler with partial/final messages and optional LLM cleanup.
  • Voice RPCs, CLI, postprocess (src/openhuman/voice/schemas.rs, src/openhuman/voice/postprocess.rs, src/core/cli.rs, src/core/jsonrpc.rs): added voice_server start/stop/status controllers, CLI voice/dictate handling, the /ws/dictation endpoint, and LLM-readiness gating for cleanup.
  • Frontend: Tauri bindings & UI (app/src/utils/tauriCommands.ts, app/src/pages/Conversations.tsx, app/src/App.tsx, app/src/components/DictationHotkeyManager.tsx, app/src/hooks/useDictationHotkey.ts): unwrapped voice RPC return types, wired the hotkey manager into App, added the useDictationHotkey hook and a headless manager component.
  • Socket / broadcast & dictation listener (src/core/socketio.rs, src/openhuman/voice/dictation_listener.rs): broadcast bus for dictation events, Socket.IO bridge emission of dictation:toggle/dictation_toggle, and the start_if_enabled startup hook.
  • CI / Docker (.github/Dockerfile, .github/workflows/test.yml): added Linux packages needed for audio/input/X11 (cmake, libasound2-dev, libxdo-dev, libxtst-dev, libx11-dev, libevdev-dev).

Sequence Diagram(s)

sequenceDiagram
    participant User as User (OS)
    participant Listener as Hotkey Listener
    participant Server as Voice Server
    participant Audio as Audio Device
    participant Whisper as Whisper Engine
    participant LLM as LLM Service
    participant TextInput as Text Input

    User->>Listener: Press hotkey
    Listener->>Server: Emit HotkeyEvent::Pressed
    Server->>Audio: start_recording()
    Audio->>Audio: Capture PCM samples
    User->>Listener: Release hotkey (push mode) / Press again (tap mode)
    Listener->>Server: Emit HotkeyEvent::Released or Pressed
    Server->>Audio: stop()
    Audio->>Server: WAV bytes
    Server->>Whisper: transcribe WAV
    Whisper->>Server: raw_text
    alt LLM refinement enabled
        Server->>LLM: cleanup_transcription(raw_text)
        LLM->>Server: refined_text
    else
        Server->>Server: use raw_text
    end
    Server->>TextInput: insert_text(final_text)
    TextInput->>User: text appears in focused field
sequenceDiagram
    participant Client as Browser Client
    participant WS as /ws/dictation
    participant Buffer as Audio Buffer
    participant Inference as Periodic Inference
    participant Whisper as Whisper Engine
    participant Cleanup as LLM Cleanup

    Client->>WS: Connect
    WS->>Buffer: init buffer
    WS->>Inference: spawn periodic task (if streaming)
    Client->>WS: Send binary audio frame
    WS->>Buffer: append PCM16
    Inference->>Whisper: transcribe partial
    Whisper->>Inference: partial text
    Inference->>Client: {type:"partial", text}
    Client->>WS: Send {"type":"stop"}
    WS->>Inference: stop task
    WS->>Whisper: final transcribe(all)
    Whisper->>WS: raw_text
    alt LLM refinement enabled
        WS->>Cleanup: cleanup_transcription(raw_text)
        Cleanup->>WS: refined_text
    else
        WS->>WS: use raw_text
    end
    WS->>Client: {type:"final", text, raw_text}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • graycyrus
  • YellowSnnowmann

Poem

🐇 I pressed the key, the soft mic woke,
Waves turned to words in a whispered stroke.
With rdev and cpal I dance and play,
Enigo types what my ears relay.
One hop, one key — dictation's day! ✨

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The PR title accurately summarizes the main feature being added: a standalone voice dictation server with hotkey support, which is the primary focus of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai (bot) left a comment


Actionable comments posted: 6

🧹 Nitpick comments (11)
src/openhuman/voice/hotkey.rs (1)

203-281: Consider adding arrow keys and navigation keys.

The string_to_key mapping covers common hotkey candidates but omits arrow keys (up, down, left, right), home, end, pageup, pagedown, and insert. These are occasionally used in hotkey combos. Low priority since the current set covers typical use cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/hotkey.rs` around lines 203 - 281, The mapping in
string_to_key is missing navigation and arrow keys; add patterns for
"up"/"down"/"left"/"right" mapping to Key::Up/Key::Down/Key::Left/Key::Right,
and also add "home" => Key::Home, "end" => Key::End, "pageup"/"pgup" =>
Key::PageUp, "pagedown"/"pgdn" => Key::PageDown, and "insert"/"ins" =>
Key::Insert (use the exact rdev Key enum variants available) so these keys are
recognized in hotkey parsing.
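A sketch of the missing arms, using rdev 0.5's actual variant names (string_to_key_extra is a hypothetical helper standing in for the existing mapping function, and the Option return shape is assumed):

```rust
use rdev::Key;

fn string_to_key_extra(s: &str) -> Option<Key> {
    match s {
        "up" => Some(Key::UpArrow),
        "down" => Some(Key::DownArrow),
        "left" => Some(Key::LeftArrow),
        "right" => Some(Key::RightArrow),
        "home" => Some(Key::Home),
        "end" => Some(Key::End),
        "pageup" | "pgup" => Some(Key::PageUp),
        "pagedown" | "pgdn" => Some(Key::PageDown),
        "insert" | "ins" => Some(Key::Insert),
        _ => None,
    }
}
```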
src/openhuman/voice/postprocess.rs (2)

118-125: Test relies on implicit LLM state assumption.

This test assumes local_ai::global(config) returns a service with state other than "ready" or "degraded" in the test environment. If the LocalAiService initialization behavior changes (e.g., to auto-detect a local model), this test could become flaky. Consider adding a comment documenting this assumption, or mock the service state for deterministic testing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/postprocess.rs` around lines 118 - 125, The test
disabled_cleanup_returns_raw_text relies on the implicit state of the
LocalAiService returned by local_ai::global(config) which may be "ready" or
"degraded" and make the test flaky; update the test to explicitly control the
service state (e.g., mock or stub local_ai::global to return a LocalAiService
with a known non-ready/degraded state) or set the service state
deterministically before calling cleanup_transcription, and add a brief comment
documenting the required assumption; target the test function
disabled_cleanup_returns_raw_text and the call to cleanup_transcription (and any
helper used to obtain the LocalAiService) when making this change.

41-54: Logic is correct but doc comment could be clearer.

The gating logic works correctly: even when voice_llm_cleanup_enabled is true, you gracefully skip cleanup if the LLM isn't ready (lines 51-54). However, the doc comment (lines 23-27) says cleanup is enabled when LLM is "ready" but the code also treats "degraded" as ready (line 39). Consider updating the doc comment to mention "degraded" for accuracy.

The dual-check pattern (should_cleanup then llm_ready) is intentional graceful degradation—no changes needed to the logic itself.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/postprocess.rs` around lines 41 - 54, Update the doc
comment explaining when cleanup is enabled to accurately reflect the code: note
that cleanup is enabled when config.local_ai.voice_llm_cleanup_enabled is true
OR when llm_ready is true, and explicitly mention that the llm_ready check
treats both "ready" and "degraded" as acceptable states; leave the gating logic
(the should_cleanup variable and the subsequent llm_ready checks that return
raw_text) unchanged.
src/openhuman/config/schema/voice_server.rs (1)

17-19: Consider using the ActivationMode enum instead of String.

The hotkey.rs module already defines ActivationMode with proper serde support (#[serde(rename_all = "snake_case")]). Using the enum here would provide compile-time type safety and automatic validation during config deserialization, rejecting invalid values like "Tap" or "toggle".

Suggested change
+use crate::openhuman::voice::hotkey::ActivationMode;
+
 /// Configuration for the voice dictation server.
 #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
 pub struct VoiceServerConfig {
     // ...
 
-    /// Activation mode: "tap" (toggle) or "push" (hold-to-record).
-    #[serde(default = "default_activation_mode")]
-    pub activation_mode: String,
+    /// Activation mode: tap (toggle) or push (hold-to-record).
+    #[serde(default)]
+    pub activation_mode: ActivationMode,

This requires ensuring ActivationMode implements JsonSchema (add schemars::JsonSchema derive in hotkey.rs if not already present).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/schema/voice_server.rs` around lines 17 - 19, Replace
the activation_mode String field in voice_server.rs with the ActivationMode enum
(use the existing ActivationMode from hotkey.rs) so deserialization enforces
valid snake_case values; update the serde default handler
(default_activation_mode) to return ActivationMode (or implement Default for
ActivationMode and use #[serde(default)]) and adjust the field signature
accordingly; also ensure ActivationMode in hotkey.rs derives
schemars::JsonSchema in addition to serde traits so config schema generation
works.
src/core/cli.rs (1)

272-274: Potential duplicate hotkey listener if core server is also running.

Per the context in server.rs, run_standalone creates a new VoiceServer instance independent of the global_server() singleton used by RPC endpoints. If a user runs openhuman voice while the core server (with voice RPC) is already active, both would attempt to register the same hotkey, potentially causing conflicts.

Consider documenting this as a known limitation, or adding a check (e.g., via a lockfile or socket probe) to warn if the core server is already running.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/cli.rs` around lines 272 - 274, The CLI may start a duplicate hotkey
listener because run_standalone creates a new VoiceServer independent of the
global_server() used by RPC; before calling run_standalone from CLI, probe
whether the core server/global VoiceServer is already active (e.g., attempt to
acquire the same lockfile, open the core server socket, or call a small
RPC/health endpoint) and if detected, warn the user and either abort or start in
a mode that does not register the hotkey (or register with a no-op); update the
behavior in cli.rs around the run_standalone call and adjust VoiceServer startup
in server.rs to support a "skip_hotkey_registration" flag or early-exit when
global_server() is present, and add a short user-facing message documenting this
limitation.
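One possible shape for that probe, sketched with a lockfile; the path and messaging are illustrative, not the PR's:

```rust
use std::{fs, io::ErrorKind, path::PathBuf};

/// Try to claim a process-wide lock before registering the hotkey.
/// Note: a stale file left by a crash would need PID checking or cleanup.
fn acquire_voice_lock() -> Result<PathBuf, String> {
    let path = std::env::temp_dir().join("openhuman-voice.lock");
    match fs::OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(_) => Ok(path), // caller removes the file on shutdown
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Err(
            "another voice server appears to be running; refusing to register the hotkey"
                .into(),
        ),
        Err(e) => Err(format!("failed to create lockfile: {e}")),
    }
}
```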
src/openhuman/voice/server.rs (2)

263-289: run_standalone creates a new server instance, not using global singleton.

This is intentional for CLI isolation (the standalone server shouldn't interfere with RPC-managed instances), but it means CLI-started servers won't be queryable via voice_server_status RPC since that checks the global singleton. Consider documenting this behavior or having the CLI optionally register with the global.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/server.rs` around lines 263 - 289, The run_standalone
function creates a fresh VoiceServer via VoiceServer::new and runs it in an Arc
without registering it with the global singleton used by the voice_server_status
RPC, so CLI-started instances aren't queryable; either document this behavior in
the run_standalone comments or add an optional flag to register the created
server into the global singleton used by voice_server_status (e.g.
set_global_voice_server or similar) after creating server and before running it,
ensuring shutdown still works via the same stop() flow and preserving CLI
isolation when the flag is false.

119-133: 100ms sleep in select! creates unnecessary wake-ups.

The tokio::select! with a 100ms sleep branch causes the loop to wake every 100ms even when idle. Since hotkey_rx.recv() will block until an event arrives, and the loop checks self.running on each iteration, the sleep branch primarily serves as a check interval for the running flag. Consider using a cancellation token or watch channel for cleaner shutdown signaling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/server.rs` around lines 119 - 133, The loop is waking
every 100ms due to the sleep branch in the tokio::select!, causing unnecessary
wake-ups; replace that sleep-based shutdown check with a proper async shutdown
signal (e.g., a tokio::sync::watch<bool> or a
tokio_util::sync::CancellationToken) and select on that instead of
tokio::time::sleep so the task only wakes for real events or shutdown; update
the select! to await hotkey_rx.recv() and the chosen shutdown signal (e.g.,
watch.changed() or token.cancelled()), and remove the sleep and the polling of
self.running (or have the shutdown signal be driven by setting self.running) so
shutdown is immediate and efficient in the functions using self.running,
tokio::select!, and hotkey_rx.recv().
src/openhuman/voice/schemas.rs (2)

311-322: handle_voice_server_stop returns error when server not initialized.

Returning Err("voice server is not running") is reasonable behavior for stop when there's nothing to stop. However, consider whether returning a "Stopped" status (like handle_voice_server_status does) would be more consistent and user-friendly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 311 - 322,
handle_voice_server_stop currently returns Err("voice server is not running")
when try_global_server() yields None; change it to return the same "Stopped"
status as handle_voice_server_status instead. Update the else branch in
handle_voice_server_stop to produce a serialized stopped status (use the same
Status enum/value or call the existing status-generation logic used by
handle_voice_server_status) via serde_json::to_value(...) so callers receive a
consistent status JSON rather than an error.

273-276: Silent fallback to Tap mode for invalid activation_mode values.

Any value other than "push" defaults to Tap without warning. If a user passes "Push" (capitalized) or "hold", they'll get Tap mode without realizing. Consider logging a warning for unrecognized values or validating strictly.

🛡️ Proposed fix: warn on unrecognized mode
         let activation_mode = match mode_str {
             "push" => ActivationMode::Push,
-            _ => ActivationMode::Tap,
+            "tap" => ActivationMode::Tap,
+            other => {
+                log::warn!(
+                    "[voice_server] unrecognized activation_mode '{other}', defaulting to tap"
+                );
+                ActivationMode::Tap
+            }
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 273 - 276, The code currently
maps mode_str to ActivationMode with a silent default to Tap; modify the mapping
in the match that creates activation_mode (matching on mode_str) to explicitly
recognize accepted values (e.g., exact "push" and maybe canonicalized lowercased
variants) and emit a warning when mode_str is unrecognized (including the
received value) instead of silently defaulting—use the project's logging
facility (e.g., warn! or tracing::warn) to log something like "Unrecognized
activation_mode '{}', defaulting to Tap" and then fall back to
ActivationMode::Tap; ensure you reference mode_str, ActivationMode::Push, and
ActivationMode::Tap in the change.
src/openhuman/voice/text_input.rs (2)

17-41: paste_text and type_text have identical implementations.

Both functions call enigo.text(text) despite having different docstrings. The type_text docstring claims it types "character-by-character" and "doesn't touch the clipboard," but the implementation is identical to paste_text. Either consolidate into one function, or implement the actual character-by-character typing if that behavior is needed.

♻️ Option 1: Consolidate if both methods should behave the same
-/// Insert text into the currently active text field by simulating a
-/// clipboard paste (Cmd+V on macOS, Ctrl+V elsewhere).
-///
-/// This is more reliable than typing character-by-character because it
-/// handles Unicode, IME, and special characters correctly.
-pub fn paste_text(text: &str) -> Result<(), String> {
+/// Insert text into the active text field using enigo's platform-appropriate method.
+///
+/// Handles Unicode, IME, and special characters correctly.
+pub fn insert_text_raw(text: &str) -> Result<(), String> {
     if text.is_empty() {
-        debug!("{LOG_PREFIX} empty text, nothing to paste");
+        debug!("{LOG_PREFIX} empty text, nothing to insert");
         return Ok(());
     }
 
     info!(
-        "{LOG_PREFIX} pasting {} chars into active field",
+        "{LOG_PREFIX} inserting {} chars into active field",
         text.len()
     );
 
-    // Save current clipboard, set our text, paste, then restore.
-    // For simplicity we just overwrite — restoring clipboard is fragile
-    // across platforms and async contexts.
     let mut enigo = Enigo::new(&Settings::default())
         .map_err(|e| format!("failed to create enigo instance: {e}"))?;
 
-    // Use enigo's text method which handles the platform-appropriate paste.
     enigo
         .text(text)
         .map_err(|e| format!("failed to type text: {e}"))?;
 
-    debug!("{LOG_PREFIX} text pasted successfully");
+    debug!("{LOG_PREFIX} text inserted successfully");
     Ok(())
 }
-
-/// Type text character-by-character into the active field.
-/// Slower but doesn't touch the clipboard.
-pub fn type_text(text: &str) -> Result<(), String> {
-    if text.is_empty() {
-        debug!("{LOG_PREFIX} empty text, nothing to type");
-        return Ok(());
-    }
-
-    info!("{LOG_PREFIX} typing {} chars into active field", text.len());
-
-    let mut enigo = Enigo::new(&Settings::default())
-        .map_err(|e| format!("failed to create enigo instance: {e}"))?;
-
-    enigo
-        .text(text)
-        .map_err(|e| format!("failed to type text: {e}"))?;
-
-    debug!("{LOG_PREFIX} text typed successfully");
-    Ok(())
-}
♻️ Option 2: Implement actual character-by-character typing if needed
 /// Type text character-by-character into the active field.
 /// Slower but doesn't touch the clipboard.
 pub fn type_text(text: &str) -> Result<(), String> {
     if text.is_empty() {
         debug!("{LOG_PREFIX} empty text, nothing to type");
         return Ok(());
     }
 
     info!("{LOG_PREFIX} typing {} chars into active field", text.len());
 
     let mut enigo = Enigo::new(&Settings::default())
         .map_err(|e| format!("failed to create enigo instance: {e}"))?;
 
-    enigo
-        .text(text)
-        .map_err(|e| format!("failed to type text: {e}"))?;
+    for c in text.chars() {
+        enigo
+            .key(enigo::Key::Unicode(c), enigo::Direction::Click)
+            .map_err(|e| format!("failed to type character '{c}': {e}"))?;
+    }
 
     debug!("{LOG_PREFIX} text typed successfully");
     Ok(())
 }

Also applies to: 43-62

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/text_input.rs` around lines 17 - 41, Both paste_text and
type_text currently call Enigo::text(text) and are identical; change to either
consolidate them into a single function (remove duplicate) or implement real
per-character typing in type_text. If consolidating, remove type_text and update
call sites to use paste_text (or vice versa) and keep Enigo::text(text) as the
single behavior. If implementing character-by-character typing, modify type_text
to iterate over chars and call enigo.key_sequence or enigo.key_down/key_up per
character (or a small sleep between calls) while ensuring it does not touch
clipboard; keep paste_text using enigo.text(text) and maintain existing error
handling and logging in both functions.

68-75: Inconsistent whitespace handling between insert_text and paste_text.

insert_text trims text before checking emptiness (line 69), but paste_text only checks for empty string. If the caller passes " " directly to paste_text, it will attempt to paste whitespace, whereas insert_text would skip it. This is fine if paste_text is only called via insert_text, but since both are pub, consider documenting this difference or aligning the behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/text_input.rs` around lines 68 - 75, The issue is
inconsistent whitespace handling: insert_text trims before emptiness check but
paste_text only checks for exact empty string, so callers of pub paste_text can
paste pure-whitespace. Fix by making paste_text perform the same trim-and-empty
check (or have insert_text call paste_text with text.trim()), e.g., update
paste_text to call text.trim() and skip/panic/return Ok with the same warning
log used in insert_text when trimmed text is empty; ensure both functions'
behavior is aligned or document the intended contract in their public comments
(reference functions: insert_text and paste_text).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Cargo.toml`:
- Line 94: The enigo dependency entry currently disables default features which
breaks compilation on Linux because no platform backend (xdo, x11rb, wayland, or
libei) is enabled; update the Cargo.toml enigo dependency so it either
re-enables default features or explicitly enables a Linux backend (e.g., add the
appropriate feature like "xdo" or "x11rb" under the enigo = { ... } line) so the
crate has a platform-specific backend enabled.

In `@src/core/cli.rs`:
- Around line 257-258: The CLI currently constructs the config via
Config::default() and then apply_env_overrides(), which ignores user
config.toml; replace that logic to call Config::load_or_init() (which loads from
the config file or falls back to defaults) and then still call
config.apply_env_overrides() so env vars override file values; update the code
that references the config variable (the let mut config assignment and
subsequent uses) to use the loaded config from Config::load_or_init() instead of
Config::default().

In `@src/openhuman/voice/hotkey.rs`:
- Around line 59-71: The stop_flag only causes the callback in
HotkeyListenerHandle::stop/Drop to return early but does not terminate the
thread because rdev::listen is blocking; update the stop log in
HotkeyListenerHandle::stop to a more accurate message such as "hotkey listener
signaled to skip events" (replace the current "{LOG_PREFIX} stop signal sent to
hotkey listener"), and add a clear code comment in start_listener next to the
rdev::listen(callback) call documenting that rdev::listen blocks in the platform
event loop (no graceful cancellation in rdev 0.5.3) so the thread remains alive
and only process exit removes the hook — mention that this is a known limitation
and why the stop_flag only causes the callback to ignore events.
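The limitation is easier to see in code. A sketch of the pattern being documented, assuming rdev 0.5 (the function name and flag wiring are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn start_listener(stop_flag: Arc<AtomicBool>) {
    std::thread::spawn(move || {
        // rdev::listen blocks in the platform event loop; rdev 0.5.x offers
        // no graceful cancellation, so this thread lives until process exit.
        let _ = rdev::listen(move |event| {
            if stop_flag.load(Ordering::SeqCst) {
                return; // "stopped": events are ignored, but the OS hook remains
            }
            let _ = event; // forward to hotkey parsing in the real code
        });
    });
}
```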

In `@src/openhuman/voice/schemas.rs`:
- Around line 283-297: The code currently calls global_server(server_config)
which uses OnceCell::get_or_init and will silently ignore a new
VoiceServerConfig if the server already exists; update voice_server_start (the
caller creating VoiceServerConfig and calling global_server) to detect when an
existing server/config is present by retrieving the stored config from the
global server singleton (via the same global_server/OnceCell or a getter),
compare fields (hotkey, activation_mode, skip_cleanup, context) against the
incoming VoiceServerConfig, and if they differ either return an Err indicating
the server is already running with a different config or at minimum log a clear
warning via log::warn! that the provided config was ignored; include this
comparison and the warning/error before spawning the tokio task that calls
server.run(&config_clone).

In `@src/openhuman/voice/server.rs`:
- Around line 291-297: The current truncate_for_log function slices bytes with
&s[..max], which can panic on multi-byte UTF-8 boundaries; replace the
byte-slice approach with a safe character-based truncation (use
s.chars().take(max).collect::<String>() or iterate with s.char_indices()) and
append "..." only when the original string was longer than max so the function
never slices in the middle of a Unicode scalar and safely returns a valid UTF-8
String.
- Around line 46-66: The runtime config is missing the min_duration_secs field
declared in the schema; add pub min_duration_secs: f32 to VoiceServerConfig and
set its default to 0.3 in the Default impl, then replace the hardcoded threshold
literal used when checking transcription/audio length (the runtime comparison
currently using the embedded numeric threshold) to use self.min_duration_secs
(or config.min_duration_secs) instead so the runtime honors the schema value.
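For the truncate_for_log comment above, a char-boundary-safe version might look like this (the function name comes from the review; the exact signature is assumed):

```rust
fn truncate_for_log(s: &str, max: usize) -> String {
    match s.char_indices().nth(max) {
        // `i` is the byte offset of the char after the first `max` chars,
        // which is always a valid UTF-8 boundary.
        Some((i, _)) => format!("{}...", &s[..i]),
        None => s.to_string(), // already within the limit
    }
}
```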


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43969220-7598-4d34-b1ee-9c2bc5fcce2e

📥 Commits

Reviewing files that changed from the base of the PR and between 9b73d00 and b3bbd63.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • Cargo.toml
  • app/src/components/settings/panels/ScreenIntelligencePanel.tsx
  • src/core/cli.rs
  • src/core/screen_intelligence_cli.rs
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/config/schema/types.rs
  • src/openhuman/config/schema/voice_server.rs
  • src/openhuman/screen_intelligence/engine.rs
  • src/openhuman/voice/audio_capture.rs
  • src/openhuman/voice/hotkey.rs
  • src/openhuman/voice/mod.rs
  • src/openhuman/voice/postprocess.rs
  • src/openhuman/voice/schemas.rs
  • src/openhuman/voice/server.rs
  • src/openhuman/voice/text_input.rs

senamakel marked this pull request as ready for review April 6, 2026 18:24
oxoxDev and others added 4 commits April 7, 2026 00:28
…ing (tinyhumansai#332)

Add the foundational infrastructure for voice dictation (EPIC tinyhumansai#332):

**Rust core:**
- New `DictationConfig` schema with serde defaults and env var overrides
  (enabled, hotkey, activation_mode, llm_refinement, streaming, interval)
- RPC controllers: `config_get_dictation_settings` / `config_update_dictation_settings`
- WebSocket endpoint `/ws/dictation` for streaming PCM16 transcription
  with periodic partial inference and final LLM refinement
- Microphone permission declaration (`NSMicrophoneUsageDescription`) in
  Tauri macOS bundle config

**Frontend:**
- `useDictationHotkey` hook: fetches config from core RPC, auto-registers
  global hotkey, listens for `dictation://toggle` events
- `DictationHotkeyManager` headless component mounted in App.tsx
- Fix voice RPC response type mismatch: voice handlers return flat results
  (no `{result, logs}` wrapper), so remove incorrect `CommandResponse<T>`
  wrapping from `openhumanVoiceStatus`, `openhumanVoiceTranscribe`,
  `openhumanVoiceTranscribeBytes`, and `openhumanVoiceTts`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `infoPlist` field in tauri.conf.json expects a string path, not an
inline object. Remove it for now — microphone permission will be added
via a proper Info.plist supplement in the production build pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cycle, WebSocket streaming)

Merge feat/332-voice-dictation into feat/stt to combine:
- Our standalone voice server (hotkey → record → transcribe → insert)
- PR tinyhumansai#371's DictationConfig, WebSocket streaming endpoint, frontend
  hotkey hook, and voice RPC type fixes

Resolved conflict in src/openhuman/voice/mod.rs — kept both server
and streaming modules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai (bot) left a comment


Actionable comments posted: 1

🧹 Nitpick comments (9)
src/openhuman/voice/schemas.rs (2)

341-351: Synthesized "Stopped" status uses empty hotkey.

When no server exists, handle_voice_server_stop and handle_voice_server_status return a VoiceServerStatus with hotkey: String::new(). Consider using a placeholder like "(none)" or the configured default from config.voice_server.hotkey for better UI presentation.

Also applies to: 361-369

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 341 - 351, The synthesized
Stopped VoiceServerStatus currently sets hotkey to an empty string; update the
status construction in the server-status/stop handlers (e.g., where
VoiceServerStatus is created in handle_voice_server_stop and
handle_voice_server_status) to use a sensible placeholder such as "(none)" or
the configured default (config.voice_server.hotkey) instead of String::new(),
and ensure both occurrences (the block at ~341-351 and the similar block at
~361-369) are updated so the UI receives a meaningful hotkey value when no
server exists.

316-323: Fixed 200ms sleep after spawn is fragile.

The tokio::time::sleep(200ms) assumes the server initializes within that window. A more robust approach would poll try_global_server() with a short timeout/retry loop, or have server.run() signal readiness via a channel.

This is acceptable for an initial implementation but may cause intermittent "failed to initialize" errors under load.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 316 - 323, The fixed 200ms sleep
after spawning the server is fragile; replace it with a readiness wait that
polls try_global_server() (or use a oneshot/channel signalled by server.run)
until the server is available or a short timeout elapses. Specifically, after
tokio::spawn(async move { if let Err(e) = server.run(&config_clone).await { ...
} }); implement a retry loop that calls try_global_server() with a small sleep
(e.g., 10–50ms) between attempts and an overall timeout (e.g., a few seconds),
returning an error if the timeout elapses; alternatively modify server.run to
send a readiness signal on a oneshot::Sender and await the corresponding
Receiver here—use the names server.run and try_global_server to locate where to
change the code.
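A sketch of that retry loop; try_global_server is the accessor named in the review (stubbed here) and the timeout values are illustrative:

```rust
use std::time::Duration;

fn try_global_server() -> Option<()> {
    None // stub standing in for the PR's global-server accessor
}

async fn wait_for_server_ready() -> Result<(), String> {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(3);
    while try_global_server().is_none() {
        if tokio::time::Instant::now() >= deadline {
            return Err("voice server failed to initialize within 3s".into());
        }
        tokio::time::sleep(Duration::from_millis(25)).await;
    }
    Ok(())
}
```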
src/openhuman/config/schema/voice_server.rs (1)

6-6: Two distinct ActivationMode enums exist — verify this is intentional.

VoiceServerConfig uses voice::hotkey::ActivationMode (variants: Tap, Push), while DictationConfig in dictation.rs uses DictationActivationMode (variants: Toggle, Push). The semantic difference between "Tap" and "Toggle" is unclear and may confuse users or cause subtle bugs if the wrong enum is used.

If both are needed, consider adding doc comments clarifying when each is used. Otherwise, consolidate into a single shared enum.

Also applies to: 19-21

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/schema/voice_server.rs` at line 6, VoiceServerConfig
imports and uses voice::hotkey::ActivationMode (variants Tap, Push) while
DictationConfig uses a separate DictationActivationMode (variants Toggle, Push),
which is likely an accidental duplication causing semantic confusion; either
consolidate both into one shared enum (e.g., rename to ActivationMode with
unified variants like Tap/Toggle/Push as appropriate) and update usages in
VoiceServerConfig and DictationConfig to reference the single enum, or keep both
but add crate-level doc comments on the enum definitions clarifying their
distinct semantics and when to use each (update definitions where ActivationMode
and DictationActivationMode are declared and all references in VoiceServerConfig
and DictationConfig accordingly).
src/openhuman/config/schema/dictation.rs (1)

54-56: Different default hotkeys for dictation vs voice server.

DictationConfig defaults to "CmdOrCtrl+Shift+D" while VoiceServerConfig (in voice_server.rs) defaults to "ctrl+shift+space". If both are active simultaneously, users may be confused by which hotkey triggers which feature.

Consider documenting the distinction clearly or unifying the defaults if they serve the same purpose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/schema/dictation.rs` around lines 54 - 56,
DictationConfig's default_hotkey() returns "CmdOrCtrl+Shift+D" while
VoiceServerConfig uses "ctrl+shift+space", causing conflicting UX; pick a single
canonical hotkey or explicitly document distinct purposes and wire both to a
shared constant. Update the default_hotkey() function and VoiceServerConfig to
use a shared constant (e.g., DEFAULT_DICTATION_HOTKEY or DEFAULT_VOICE_HOTKEY),
add a brief doc comment on the DictationConfig and VoiceServerConfig structs
explaining their different behaviors if you keep distinct keys, and ensure any
user-facing docs/examples reflect the chosen/unified default.
src/openhuman/config/schema/load.rs (1)

735-739: Consider validating streaming_interval_ms bounds.

Very low values (e.g., 1 ms) could cause excessive CPU usage during streaming transcription. Other numeric overrides in this file validate ranges (e.g., temperature, max_results).

♻️ Optional: Add minimum threshold
         if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
             if let Ok(ms) = val.trim().parse::<u64>() {
-                self.dictation.streaming_interval_ms = ms;
+                if ms >= 100 {
+                    self.dictation.streaming_interval_ms = ms;
+                } else {
+                    tracing::warn!(
+                        ms = %ms,
+                        "ignoring OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS below 100ms"
+                    );
+                }
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/schema/load.rs` around lines 735 - 739, The code reads
OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS and assigns it to
self.dictation.streaming_interval_ms without bounds checking; add validation
after parsing (e.g., parse val to u64, then if ms < MIN_STREAMING_INTERVAL_MS or
ms > MAX_STREAMING_INTERVAL_MS clamp or reject and fallback to the existing
default) and ensure you log or warn when the provided value is out of range;
introduce constants like MIN_STREAMING_INTERVAL_MS and MAX_STREAMING_INTERVAL_MS
(or reuse an existing DEFAULT) and update the assignment in the block that
handles OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS so only validated/clamped
values are written to dictation.streaming_interval_ms.
src/openhuman/config/ops.rs (1)

539-546: Add derive macros for consistency with other patch structs.

Other *SettingsPatch structs in this file (e.g., ModelSettingsPatch, MemorySettingsPatch) derive Debug, Clone, and Default. Consider adding these for consistency and to enable debugging/cloning if needed:

♻️ Proposed fix
-pub struct DictationSettingsPatch {
+#[derive(Debug, Clone, Default)]
+pub struct DictationSettingsPatch {
     pub enabled: Option<bool>,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/config/ops.rs` around lines 539 - 546, The
DictationSettingsPatch struct lacks derive macros that other *SettingsPatch
structs have; update the declaration of DictationSettingsPatch to derive Debug,
Clone, and Default (add #[derive(Debug, Clone, Default)]) so it matches
ModelSettingsPatch and MemorySettingsPatch and allows easy
debugging/cloning/default construction for DictationSettingsPatch.
src/openhuman/voice/streaming.rs (2)

139-158: inference_handle ownership could be clearer.

The inference_handle is moved in the Close and Error arms but the variable is also used after the loop (line 162). This works because Close/Error return early, but consider using Option::take() for clarity:

♻️ Suggested alternative for clarity
-                    Some(Ok(Message::Close(_))) | None => {
-                        log::info!("{LOG_PREFIX} client disconnected");
-                        if let Some(h) = inference_handle {
-                            h.abort();
-                        }
-                        return;
-                    }
+                    Some(Ok(Message::Close(_))) | None => {
+                        log::info!("{LOG_PREFIX} client disconnected");
+                        if let Some(ref h) = inference_handle {
+                            h.abort();
+                        }
+                        return;
+                    }

Or make inference_handle mutable and use take() consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/streaming.rs` around lines 139 - 158, The review notes
unclear ownership of inference_handle being moved in the Close/Error match arms;
make inference_handle mutable and use Option::take() before aborting so you
don't move it repeatedly or depend on early returns—locate the websocket
handling loop (where inference_handle is matched in the Close/Err arms) and
replace direct moves like if let Some(h) = inference_handle { h.abort(); } with
let h = inference_handle.take(); if let Some(h) = h { h.abort(); } so the
variable is consistently consumed and safe to reference after the loop.

199-204: LLM refinement may silently fall back to raw text without indication.

When config.dictation.llm_refinement is true but the local LLM isn't ready, cleanup_transcription returns the raw text unchanged (per postprocess.rs lines 55-57). The client receives identical text and raw_text fields with no indication that refinement was skipped.

Consider logging when this fallback occurs, or including a "refinement_applied": bool field in the response so clients know whether cleanup actually ran.

♻️ Proposed enhancement
     // LLM refinement if enabled
-    let refined_text = if config.dictation.llm_refinement && !raw_text.is_empty() {
-        postprocess::cleanup_transcription(&config, &raw_text, None).await
+    let (refined_text, refinement_applied) = if config.dictation.llm_refinement && !raw_text.is_empty() {
+        let cleaned = postprocess::cleanup_transcription(&config, &raw_text, None).await;
+        let applied = cleaned != raw_text;
+        if !applied {
+            log::debug!("{LOG_PREFIX} LLM refinement requested but skipped (LLM not ready)");
+        }
+        (cleaned, applied)
     } else {
-        raw_text.clone()
+        (raw_text.clone(), false)
     };

     let msg = serde_json::json!({
         "type": "final",
         "text": refined_text,
         "raw_text": raw_text,
+        "refinement_applied": refinement_applied,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/streaming.rs` around lines 199 - 204, The LLM refinement
path can silently return the raw text (cleanup_transcription in postprocess.rs)
leaving clients unaware refinement was skipped; update the logic around
refined_text (the block checking config.dictation.llm_refinement and calling
postprocess::cleanup_transcription) to detect when cleanup_transcription
returned the unchanged text or when the local LLM wasn't ready and then either
(a) set and return an explicit boolean flag (e.g., "refinement_applied") in the
response alongside text/raw_text, or (b) emit a clear log message indicating
refinement was skipped; reference the refined_text variable and
cleanup_transcription function to implement the detection and add the new
response field or log.
src/core/jsonrpc.rs (1)

331-344: Consider adding authentication to the dictation WebSocket endpoint.

The /ws/dictation endpoint accepts connections without any authentication or authorization check. While the server defaults to binding on 127.0.0.1 (limiting exposure), if the host is configured to bind to 0.0.0.0 or a public interface, any network client could use the dictation service.

Other RPC endpoints appear to rely on session tokens for authenticated operations. Consider whether this endpoint should validate a session token (e.g., via query parameter or initial WebSocket message) before processing audio data.

#!/bin/bash
# Check if other WebSocket or SSE endpoints have authentication checks
rg -n -C3 'WebSocketUpgrade|events_handler' src/core/jsonrpc.rs
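
A minimal sketch of the suggested gating, assuming an axum-style handler; validate_session and the token query parameter are hypothetical stand-ins for whatever session-validation routine the other RPC endpoints use:

use std::collections::HashMap;

use axum::{
    extract::{ws::WebSocketUpgrade, Query},
    http::StatusCode,
    response::{IntoResponse, Response},
};

fn validate_session(token: &str) -> bool {
    // Placeholder: call the existing session-validation routine here.
    !token.is_empty()
}

async fn dictation_ws_handler(
    ws: WebSocketUpgrade,
    Query(params): Query<HashMap<String, String>>,
) -> Response {
    match params.get("token") {
        Some(token) if validate_session(token) => {
            ws.on_upgrade(|socket| async move {
                // Hand off to the streaming handler, e.g.
                // handle_dictation_ws(socket, config).await
                let _ = socket;
            })
        }
        _ => {
            log::warn!("rejected unauthenticated /ws/dictation upgrade");
            StatusCode::UNAUTHORIZED.into_response()
        }
    }
}

Rejecting before the upgrade keeps unauthenticated clients from ever opening the socket, which is cheaper than accepting and then validating a first message.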
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/jsonrpc.rs` around lines 331 - 344, The dictation_ws_handler
currently accepts WebSocket upgrades without authentication; modify it to
validate a session token before handing the socket to handle_dictation_ws:
extract a token (e.g., from query params on the upgrade request or require the
first WebSocket message to contain credentials), call your existing session
validation routine (the same mechanism other RPC endpoints use) and only call
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await
after successful validation; on failure, reject the upgrade or immediately close
the socket and log the reason. Ensure the changes are implemented inside
dictation_ws_handler (around WebSocketUpgrade handling and before invoking
handle_dictation_ws) and reuse
crate::openhuman::config::rpc::load_config_with_timeout() as needed for config
access.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/src/hooks/useDictationHotkey.ts`:
- Around line 110-130: The effect registering the dictation hotkey can leak if
the component unmounts before listen() resolves; update the useEffect (the block
using isTauri(), listen, unlisten, and setToggleCount) to use a disposal guard:
introduce a local disposed flag, call listen(...).then(fn => { if (disposed) {
fn(); } else { unlisten = fn; } }).catch(...), and in the cleanup set disposed =
true and call unlisten?.(); this ensures a late-resolving listen() will
immediately unregister instead of leaving a dangling listener.

---

Nitpick comments:
In `@src/core/jsonrpc.rs`:
- Around line 331-344: The dictation_ws_handler currently accepts WebSocket
upgrades without authentication; modify it to validate a session token before
handing the socket to handle_dictation_ws: extract a token (e.g., from query
params on the upgrade request or require the first WebSocket message to contain
credentials), call your existing session validation routine (the same mechanism
other RPC endpoints use) and only call
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await
after successful validation; on failure, reject the upgrade or immediately close
the socket and log the reason. Ensure the changes are implemented inside
dictation_ws_handler (around WebSocketUpgrade handling and before invoking
handle_dictation_ws) and reuse
crate::openhuman::config::rpc::load_config_with_timeout() as needed for config
access.

In `@src/openhuman/config/ops.rs`:
- Around line 539-546: The DictationSettingsPatch struct lacks derive macros
that other *SettingsPatch structs have; update the declaration of
DictationSettingsPatch to derive Debug, Clone, and Default (add #[derive(Debug,
Clone, Default)]) so it matches ModelSettingsPatch and MemorySettingsPatch and
allows easy debugging/cloning/default construction for DictationSettingsPatch.

In `@src/openhuman/config/schema/dictation.rs`:
- Around line 54-56: DictationConfig's default_hotkey() returns
"CmdOrCtrl+Shift+D" while VoiceServerConfig uses "ctrl+shift+space", causing
conflicting UX; pick a single canonical hotkey or explicitly document distinct
purposes and wire both to a shared constant. Update the default_hotkey()
function and VoiceServerConfig to use a shared constant (e.g.,
DEFAULT_DICTATION_HOTKEY or DEFAULT_VOICE_HOTKEY), add a brief doc comment on
the DictationConfig and VoiceServerConfig structs explaining their different
behaviors if you keep distinct keys, and ensure any user-facing docs/examples
reflect the chosen/unified default.

In `@src/openhuman/config/schema/load.rs`:
- Around line 735-739: The code reads OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS
and assigns it to self.dictation.streaming_interval_ms without bounds checking;
add validation after parsing (e.g., parse val to u64, then if ms <
MIN_STREAMING_INTERVAL_MS or ms > MAX_STREAMING_INTERVAL_MS clamp or reject and
fallback to the existing default) and ensure you log or warn when the provided
value is out of range; introduce constants like MIN_STREAMING_INTERVAL_MS and
MAX_STREAMING_INTERVAL_MS (or reuse an existing DEFAULT) and update the
assignment in the block that handles OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS
so only validated/clamped values are written to dictation.streaming_interval_ms. A sketch of this validation appears after this list.

In `@src/openhuman/config/schema/voice_server.rs`:
- Line 6: VoiceServerConfig imports and uses voice::hotkey::ActivationMode
(variants Tap, Push) while DictationConfig uses a separate
DictationActivationMode (variants Toggle, Push), which is likely an accidental
duplication causing semantic confusion; either consolidate both into one shared
enum (e.g., rename to ActivationMode with unified variants like Tap/Toggle/Push
as appropriate) and update usages in VoiceServerConfig and DictationConfig to
reference the single enum, or keep both but add crate-level doc comments on the
enum definitions clarifying their distinct semantics and when to use each
(update definitions where ActivationMode and DictationActivationMode are
declared and all references in VoiceServerConfig and DictationConfig
accordingly).

In `@src/openhuman/voice/schemas.rs`:
- Around line 341-351: The synthesized Stopped VoiceServerStatus currently sets
hotkey to an empty string; update the status construction in the
server-status/stop handlers (e.g., where VoiceServerStatus is created in
handle_voice_server_stop and handle_voice_server_status) to use a sensible
placeholder such as "(none)" or the configured default
(config.voice_server.hotkey) instead of String::new(), and ensure both
occurrences (the block at ~341-351 and the similar block at ~361-369) are
updated so the UI receives a meaningful hotkey value when no server exists.
- Around line 316-323: The fixed 200ms sleep after spawning the server is
fragile; replace it with a readiness wait that polls try_global_server() (or use
a oneshot/channel signalled by server.run) until the server is available or a
short timeout elapses. Specifically, after tokio::spawn(async move { if let
Err(e) = server.run(&config_clone).await { ... } }); implement a retry loop that
calls try_global_server() with a small sleep (e.g., 10–50ms) between attempts
and an overall timeout (e.g., a few seconds), returning an error if the timeout
elapses; alternatively modify server.run to send a readiness signal on a
oneshot::Sender and await the corresponding Receiver here—use the names
server.run and try_global_server to locate where to change the code.

In `@src/openhuman/voice/streaming.rs`:
- Around line 139-158: The review notes unclear ownership of inference_handle
being moved in the Close/Error match arms; make inference_handle mutable and use
Option::take() before aborting so you don't move it repeatedly or depend on
early returns—locate the websocket handling loop (where inference_handle is
matched in the Close/Err arms) and replace direct moves like if let Some(h) =
inference_handle { h.abort(); } with let h = inference_handle.take(); if let
Some(h) = h { h.abort(); } so the variable is consistently consumed and safe to
reference after the loop.
- Around line 199-204: The LLM refinement path can silently return the raw text
(cleanup_transcription in postprocess.rs) leaving clients unaware refinement was
skipped; update the logic around refined_text (the block checking
config.dictation.llm_refinement and calling postprocess::cleanup_transcription)
to detect when cleanup_transcription returned the unchanged text or when the
local LLM wasn't ready and then either (a) set and return an explicit boolean
flag (e.g., "refinement_applied") in the response alongside text/raw_text, or
(b) emit a clear log message indicating refinement was skipped; reference the
refined_text variable and cleanup_transcription function to implement the
detection and add the new response field or log.
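
As a concrete reference for the load.rs nitpick above, here is a minimal sketch of bounds-checked parsing for OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS. The MIN/MAX limits and the helper name are illustrative assumptions, not values from the codebase:

const MIN_STREAMING_INTERVAL_MS: u64 = 100;
const MAX_STREAMING_INTERVAL_MS: u64 = 10_000;

fn apply_streaming_interval_override(current: &mut u64) {
    if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
        match val.parse::<u64>() {
            Ok(ms) if (MIN_STREAMING_INTERVAL_MS..=MAX_STREAMING_INTERVAL_MS).contains(&ms) => {
                *current = ms;
            }
            Ok(ms) => {
                // Out-of-range values are clamped rather than rejected so a
                // misconfigured environment stays usable, with a visible warning.
                let clamped = ms.clamp(MIN_STREAMING_INTERVAL_MS, MAX_STREAMING_INTERVAL_MS);
                log::warn!("streaming interval {ms}ms out of range, clamping to {clamped}ms");
                *current = clamped;
            }
            Err(e) => log::warn!("invalid OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS: {e}"),
        }
    }
}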

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c543bfd3-25b1-47d4-b878-d80a71531077

📥 Commits

Reviewing files that changed from the base of the PR and between b3bbd63 and dfb1958.

📒 Files selected for processing (23)
  • Cargo.toml
  • app/src/App.tsx
  • app/src/components/DictationHotkeyManager.tsx
  • app/src/hooks/useDictationHotkey.ts
  • app/src/pages/Conversations.tsx
  • app/src/utils/tauriCommands.ts
  • src/core/cli.rs
  • src/core/jsonrpc.rs
  • src/openhuman/config/mod.rs
  • src/openhuman/config/ops.rs
  • src/openhuman/config/schema/dictation.rs
  • src/openhuman/config/schema/load.rs
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/config/schema/types.rs
  • src/openhuman/config/schema/voice_server.rs
  • src/openhuman/config/schemas.rs
  • src/openhuman/voice/hotkey.rs
  • src/openhuman/voice/mod.rs
  • src/openhuman/voice/postprocess.rs
  • src/openhuman/voice/schemas.rs
  • src/openhuman/voice/server.rs
  • src/openhuman/voice/streaming.rs
  • src/openhuman/voice/text_input.rs
✅ Files skipped from review due to trivial changes (2)
  • Cargo.toml
  • src/openhuman/voice/hotkey.rs
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/config/schema/types.rs
  • src/openhuman/voice/postprocess.rs
  • src/openhuman/voice/text_input.rs
  • src/core/cli.rs
  • src/openhuman/voice/mod.rs
  • src/openhuman/voice/server.rs

Comment thread app/src/hooks/useDictationHotkey.ts Outdated
- Added a global dictation hotkey listener that activates based on configuration.
- Implemented a web channel bridge to handle dictation events and broadcast them to connected clients.
- Updated the voice module to include the new dictation listener functionality.

This enhances the voice dictation capabilities by ensuring real-time event handling and client communication.
- Updated `Cargo.toml` to mark voice-related dependencies as optional.
- Introduced `VoiceActivationMode` enum for better control over voice server activation.
- Refactored voice server command handling and dictation event broadcasting to support new features.
- Added conditional compilation for voice features across various modules, ensuring they are only included when enabled.

This commit improves the modularity and configurability of the voice server, allowing for more flexible integration and usage.
- Removed unnecessary blank lines in `cli.rs`, `jsonrpc.rs`, `schemas.rs`, and `socketio.rs` to improve code readability.
- Adjusted import order in `mod.rs` for better organization.

This commit enhances the overall code quality by ensuring consistent formatting across multiple files.
…em dependencies

- Added installation of system dependencies (cmake, ALSA, X11) in the Dockerfile for improved build support.
- Updated the GitHub Actions workflow to reflect the new dependencies, ensuring consistent environment setup for testing.

This commit enhances the build environment by including necessary libraries for audio and GUI support.
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
src/openhuman/voice/dictation_listener.rs (1)

103-123: Consider moving normalized and mode_str ownership into the spawned task more efficiently.

The current implementation clones normalized on every event (line 117). Since the hotkey string is constant for the listener's lifetime, you could move it into the closure once.

♻️ Minor optimization to avoid per-event clones
     let mode_str = match mode {
         ActivationMode::Tap => "toggle",
         ActivationMode::Push => "push",
-    };
+    }
+    .to_string();
 
     log::info!("{LOG_PREFIX} dictation hotkey active: {normalized}");
 
     // Forward hotkey events to the broadcast channel.
     tokio::spawn(async move {
         // Keep the listener handle alive for the lifetime of this task.
         let _handle = listener_handle;
 
         while let Some(event) = hotkey_rx.recv().await {
             let event_type = match event {
                 HotkeyEvent::Pressed => "pressed",
                 HotkeyEvent::Released => "released",
             };
 
             log::debug!("{LOG_PREFIX} hotkey {event_type}");
 
             publish_dictation_event(DictationEvent {
                 event_type: event_type.to_string(),
-                hotkey: normalized.clone(),
-                activation_mode: mode_str.to_string(),
+                hotkey: normalized.clone(), // normalized is moved into closure, clone for each event
+                activation_mode: mode_str.clone(),
             });
         }

On reflection, the current implementation is already correct: normalized is moved into the closure, and the per-event .clone() is required because DictationEvent owns its strings and the loop iterates multiple times. This is fine for low-frequency hotkey events, so no change is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/dictation_listener.rs` around lines 103 - 123, The code
review suggests moving ownership of `normalized` and `mode_str` into the spawned
task to avoid cloning per event, but after inspection this is unnecessary
because `DictationEvent` requires owned Strings and the loop emits multiple
events; therefore no code change is required—keep the current
`normalized.clone()` and `mode_str.to_string()` usage inside the `tokio::spawn`
closure where `publish_dictation_event(DictationEvent { ... })` is called.
app/src/hooks/useDictationHotkey.ts (1)

67-70: Unnecessary RpcOutcome wrapper check — callCoreRpc already unwraps the result.

Per context snippet 2 (app/src/services/coreRpcClient.ts:180), callCoreRpc returns json.result as T directly, not a wrapper object. The server-side RpcOutcome is serialized into the JSON-RPC result field, which callCoreRpc then extracts. This defensive check will never match and can be removed.

♻️ Proposed fix
-        // Handle RpcOutcome wrapper — the result may be nested in .result
-        const s = (
-          'result' in settings ? (settings as Record<string, unknown>).result : settings
-        ) as DictationSettings;
+        const s = settings;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/hooks/useDictationHotkey.ts` around lines 67 - 70, The conditional
that checks for an RpcOutcome wrapper in useDictationHotkey (the 'result' in
settings branch that assigns to s) is unnecessary because callCoreRpc already
unwraps and returns the inner result; remove the ternary check and simply cast
settings to DictationSettings (e.g., assign settings as DictationSettings to s)
so code uses the unwrapped value directly and drop the Record<string, unknown>
cast/branch.
src/openhuman/voice/schemas.rs (2)

358-374: Consider extracting duplicated stopped-status construction.

The synthesized VoiceServerStatus for non-running servers is duplicated between handle_voice_server_stop (lines 346-352) and handle_voice_server_status (lines 364-370).

♻️ Extract helper function
+fn stopped_status() -> crate::openhuman::voice::server::VoiceServerStatus {
+    crate::openhuman::voice::server::VoiceServerStatus {
+        state: crate::openhuman::voice::server::ServerState::Stopped,
+        hotkey: String::new(),
+        activation_mode: crate::openhuman::voice::hotkey::ActivationMode::Tap,
+        transcription_count: 0,
+        last_error: None,
+    }
+}

 fn handle_voice_server_stop(_params: Map<String, Value>) -> ControllerFuture {
     Box::pin(async move {
         if let Some(server) = crate::openhuman::voice::server::try_global_server() {
             // ...
         } else {
-            let status = crate::openhuman::voice::server::VoiceServerStatus {
-                state: crate::openhuman::voice::server::ServerState::Stopped,
-                hotkey: String::new(),
-                activation_mode: crate::openhuman::voice::hotkey::ActivationMode::Tap,
-                transcription_count: 0,
-                last_error: None,
-            };
-            serde_json::to_value(status).map_err(|e| format!("serialize error: {e}"))
+            serde_json::to_value(stopped_status()).map_err(|e| format!("serialize error: {e}"))
         }
     })
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 358 - 374, Both
handle_voice_server_status and handle_voice_server_stop duplicate construction
of a stopped VoiceServerStatus object; extract a small helper (e.g., fn
default_stopped_status() -> crate::openhuman::voice::server::VoiceServerStatus)
that returns the same VoiceServerStatus with ServerState::Stopped, empty hotkey,
ActivationMode::Tap, transcription_count 0 and last_error None, then call that
helper from both handle_voice_server_status and handle_voice_server_stop and
serialize its result with serde_json::to_value as before.

325-333: Sleep-based synchronization is fragile but pragmatic.

The 200ms sleep is a heuristic to wait for the spawned server task to initialize. If initialization takes longer, the returned status may be stale. Consider documenting this limitation or using a synchronization primitive (e.g., oneshot channel) for deterministic readiness signaling in a future iteration.
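
A minimal sketch of the suggested polling approach, reusing the try_global_server accessor named above; the 25ms interval and 3s timeout are illustrative:

use std::time::Duration;

async fn wait_for_server_ready() -> Result<(), String> {
    let poll = Duration::from_millis(25);
    let deadline = tokio::time::Instant::now() + Duration::from_secs(3);

    loop {
        // Ready as soon as the spawned task has installed the global server.
        if crate::openhuman::voice::server::try_global_server().is_some() {
            return Ok(());
        }
        if tokio::time::Instant::now() >= deadline {
            return Err("voice server did not become ready within 3s".into());
        }
        tokio::time::sleep(poll).await;
    }
}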

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 325 - 333, The 200ms
tokio::time::sleep used before calling
crate::openhuman::voice::server::try_global_server() is a fragile heuristic;
replace it by implementing deterministic readiness signaling: change the server
spawn path to send a oneshot::Sender (or similar) when initialization completes
and in this routine await the corresponding oneshot::Receiver before calling
try_global_server() and s.status().await, or at minimum add a comment noting the
race and the 200ms limitation next to the sleep to document the fragility of
this approach.
src/core/cli.rs (1)

258-267: Config load errors are silently swallowed; consider logging a warning.

When Config::load_or_init() fails, unwrap_or_default() silently falls back to defaults. The user won't know their config file has syntax errors or permission issues. Additionally, --mode silently defaults to Tap for any unrecognized value (including typos like "Push" or "psh").

💡 Proposed improvement
     rt.block_on(async {
-        let mut config = crate::openhuman::config::Config::load_or_init()
-            .await
-            .unwrap_or_default();
+        let mut config = match crate::openhuman::config::Config::load_or_init().await {
+            Ok(c) => c,
+            Err(e) => {
+                log::warn!("[voice] failed to load config, using defaults: {e}");
+                crate::openhuman::config::Config::default()
+            }
+        };
         config.apply_env_overrides();

         let activation_mode = match mode.as_deref() {
             Some("push") => ActivationMode::Push,
-            _ => ActivationMode::Tap,
+            Some("tap") | None => ActivationMode::Tap,
+            Some(other) => {
+                log::warn!("[voice] unrecognized --mode '{other}', defaulting to tap");
+                ActivationMode::Tap
+            }
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/cli.rs` around lines 258 - 267, When loading configuration, don't
silently swallow errors from Config::load_or_init(); instead propagate or log
them: capture the Result error from
crate::openhuman::config::Config::load_or_init() and, if Err, call the global
logger (or e.g., eprintln!) with a concise warning including the error and note
that defaults are being used before calling apply_env_overrides on the fallback
Config; likewise make parsing of the CLI `mode` variable and setting of
activation_mode (ActivationMode::Push / ActivationMode::Tap) case-insensitive
and explicit—accept known values using mode.as_deref().map(|s| s.to_lowercase())
(or match on both cases) and log a warning when an unrecognized value is
provided before defaulting to ActivationMode::Tap so typos are clearly visible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/openhuman/voice/schemas.rs`:
- Around line 297-316: The current separate check (try_global_server()) and init
(global_server(server_config)) allow a TOCTOU race where two concurrent
voice_server_start calls can both pass the check and one wins with its config;
fix by serializing initialization: introduce a global initialization mutex
(e.g., SERVER_INIT_MUTEX) and acquire it at the start of voice_server_start (or
the public start path) before calling try_global_server() and before calling
global_server(server_config), perform the same existing_state vs server_config
comparison while holding the lock, then call global_server(server_config) and
release the lock; alternatively, move the config validation into the
OnceCell::get_or_init closure used by global_server so the chosen config is
validated atomically inside the init path (use ServerState, try_global_server,
global_server, server_config, and voice_server_start as the reference points).

---

Nitpick comments:
In `@app/src/hooks/useDictationHotkey.ts`:
- Around line 67-70: The conditional that checks for an RpcOutcome wrapper in
useDictationHotkey (the 'result' in settings branch that assigns to s) is
unnecessary because callCoreRpc already unwraps and returns the inner result;
remove the ternary check and simply cast settings to DictationSettings (e.g.,
assign settings as DictationSettings to s) so code uses the unwrapped value
directly and drop the Record<string, unknown> cast/branch.

In `@src/core/cli.rs`:
- Around line 258-267: When loading configuration, don't silently swallow errors
from Config::load_or_init(); instead propagate or log them: capture the Result
error from crate::openhuman::config::Config::load_or_init() and, if Err, call
the global logger (or e.g., eprintln!) with a concise warning including the
error and note that defaults are being used before calling apply_env_overrides
on the fallback Config; likewise make parsing of the CLI `mode` variable and
setting of activation_mode (ActivationMode::Push / ActivationMode::Tap)
case-insensitive and explicit—accept known values using mode.as_deref().map(|s|
s.to_lowercase()) (or match on both cases) and log a warning when an
unrecognized value is provided before defaulting to ActivationMode::Tap so typos
are clearly visible.

In `@src/openhuman/voice/dictation_listener.rs`:
- Around line 103-123: The code review suggests moving ownership of `normalized`
and `mode_str` into the spawned task to avoid cloning per event, but after
inspection this is unnecessary because `DictationEvent` requires owned Strings
and the loop emits multiple events; therefore no code change is required—keep
the current `normalized.clone()` and `mode_str.to_string()` usage inside the
`tokio::spawn` closure where `publish_dictation_event(DictationEvent { ... })`
is called.

In `@src/openhuman/voice/schemas.rs`:
- Around line 358-374: Both handle_voice_server_status and
handle_voice_server_stop duplicate construction of a stopped VoiceServerStatus
object; extract a small helper (e.g., fn default_stopped_status() ->
crate::openhuman::voice::server::VoiceServerStatus) that returns the same
VoiceServerStatus with ServerState::Stopped, empty hotkey, ActivationMode::Tap,
transcription_count 0 and last_error None, then call that helper from both
handle_voice_server_status and handle_voice_server_stop and serialize its result
with serde_json::to_value as before.
- Around line 325-333: The 200ms tokio::time::sleep used before calling
crate::openhuman::voice::server::try_global_server() is a fragile heuristic;
replace it by implementing deterministic readiness signaling: change the server
spawn path to send a oneshot::Sender (or similar) when initialization completes
and in this routine await the corresponding oneshot::Receiver before calling
try_global_server() and s.status().await, or at minimum add a comment noting the
race and the 200ms limitation next to the sleep to document the fragility of
this approach.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b981e000-bc36-4ea7-baa0-d9f7d0b766ad

📥 Commits

Reviewing files that changed from the base of the PR and between dfb1958 and 7721269.

📒 Files selected for processing (12)
  • .github/Dockerfile
  • .github/workflows/test.yml
  • app/src/hooks/useDictationHotkey.ts
  • src/core/cli.rs
  • src/core/jsonrpc.rs
  • src/core/socketio.rs
  • src/openhuman/config/mod.rs
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/config/schema/voice_server.rs
  • src/openhuman/voice/dictation_listener.rs
  • src/openhuman/voice/mod.rs
  • src/openhuman/voice/schemas.rs
✅ Files skipped from review due to trivial changes (1)
  • .github/Dockerfile
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/openhuman/config/mod.rs
  • src/openhuman/config/schema/mod.rs
  • src/openhuman/voice/mod.rs
  • src/core/jsonrpc.rs

Comment on lines +297 to +316
// Check if a server is already running with a different config.
if let Some(existing) = crate::openhuman::voice::server::try_global_server() {
    let existing_status = existing.status().await;
    if existing_status.state != crate::openhuman::voice::server::ServerState::Stopped {
        if existing_status.hotkey != server_config.hotkey
            || existing_status.activation_mode != server_config.activation_mode
        {
            return Err(format!(
                "voice server already running (hotkey={}, mode={:?}); \
                 stop it first before starting with different config",
                existing_status.hotkey, existing_status.activation_mode
            ));
        }
        // Same config, already running — return current status.
        return serde_json::to_value(existing_status)
            .map_err(|e| format!("serialize error: {e}"));
    }
}

let server = global_server(server_config);

⚠️ Potential issue | 🟡 Minor

TOCTOU race condition in server initialization check.

The check at line 298 (try_global_server()) and initialization at line 316 (global_server(server_config)) are not atomic. Two concurrent voice_server_start calls with different configs could both pass the check (seeing None), then one wins the OnceCell::get_or_init race while the other silently gets the first caller's config.

The window is narrow, and OnceCell ensures memory safety, but the "different config" validation can be bypassed. Since this is a user-facing CLI/RPC and concurrent starts are unlikely, this is low-severity.

🔒 Potential fix using a mutex for initialization
+use once_cell::sync::Lazy;
+use tokio::sync::Mutex;
+
+// Async-aware mutex: the guard is held across .await points inside the
+// handler, so a std::sync::Mutex guard (which is not Send) would not compile
+// in a boxed Send future. tokio::sync::Mutex is safe to hold across awaits.
+static START_LOCK: Lazy<Mutex<()>> = Lazy::new(|| Mutex::new(()));

 fn handle_voice_server_start(params: Map<String, Value>) -> ControllerFuture {
     Box::pin(async move {
+        // Serialize start attempts to prevent TOCTOU races.
+        let _guard = START_LOCK.lock().await;
+
         use crate::openhuman::voice::hotkey::ActivationMode;
         // ... rest of handler

Alternatively, accept that two concurrent starts may result in the first config winning (document this behavior).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/voice/schemas.rs` around lines 297 - 316, The current separate
check (try_global_server()) and init (global_server(server_config)) allow a
TOCTOU race where two concurrent voice_server_start calls can both pass the
check and one wins with its config; fix by serializing initialization: introduce
a global initialization mutex (e.g., SERVER_INIT_MUTEX) and acquire it at the
start of voice_server_start (or the public start path) before calling
try_global_server() and before calling global_server(server_config), perform the
same existing_state vs server_config comparison while holding the lock, then
call global_server(server_config) and release the lock; alternatively, move the
config validation into the OnceCell::get_or_init closure used by global_server
so the chosen config is validated atomically inside the init path (use
ServerState, try_global_server, global_server, server_config, and
voice_server_start as the reference points).

@senamakel senamakel merged commit 11f718d into tinyhumansai:main Apr 6, 2026
7 of 8 checks passed