feat(voice): standalone voice dictation server with hotkey support #368
senamakel merged 23 commits into tinyhumansai:main from
Conversation
- Introduced a new `voice` subcommand to the CLI for running a standalone voice dictation server that listens for a hotkey, records audio, transcribes it using Whisper, and inserts the result into the active text field.
- Implemented configuration options for the voice server, including hotkey combination, activation mode (tap or push), and an option to skip LLM post-processing.
- Added audio capture functionality using the `cpal` crate and integrated hotkey listening with the `rdev` crate for global key event handling.
- Enhanced the configuration schema to include voice server settings and updated the main configuration structure accordingly.
- Updated relevant modules and tests to ensure consistent behavior and functionality across the application.

This feature enhances user interaction by allowing voice dictation directly into any active text field, improving accessibility and usability.
- Introduced a standalone voice dictation server that listens for a configurable hotkey to start recording audio, transcribes it using Whisper, and inserts the transcribed text into the active text field.
- Added CLI support for the `voice` command, allowing users to manage the voice server's configuration, including hotkey and activation mode settings.
- Implemented configuration structures for the voice server, including options for automatic start, hotkey combination, activation mode, and cleanup behavior.
- Enhanced audio capture functionality using the `cpal` library for microphone input and integrated text insertion using the `enigo` library for simulating keyboard input.
- Updated relevant modules and schemas to support the new voice server features, ensuring a cohesive integration within the OpenHuman platform.
…unctionality
- Updated the `run_voice_server_command` function to initialize the configuration with environment overrides instead of loading from a file, improving performance and flexibility.
- Refactored the audio capture logic in `start_recording` to enhance thread management and error handling, ensuring a more robust audio stream setup.
- Improved the handling of audio stream creation and playback, ensuring that all cpal objects are managed on the same thread as required, enhancing stability during recording operations.
- Eliminated the `HotkeyListenerHandle` import from the `server.rs` file, streamlining the code and improving clarity by removing unnecessary dependencies.
The postprocessor now checks the local LLM state and automatically enables transcription cleanup when the model is downloaded and ready, even if not explicitly configured. Falls back gracefully to raw text when the LLM is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
⚠️ Warning: Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 12 minutes and 7 seconds.
⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered using the
We recommend that you space out your commits to avoid hitting the rate limit.
🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.
ℹ️ Review info
⚙️ Run configuration: Configuration used: defaults | Review profile: CHILL | Plan: Pro | Run ID:
📒 Files selected for processing (10)
📝 Walkthrough
Adds a voice dictation system: hotkey-driven audio capture, system-wide hotkey listener, WAV resampling, WebSocket streaming transcription with optional LLM refinement, text insertion, config schema/ops/RPC, frontend hotkey hook/component, CLI and Docker/CI adjustments.
Sequence Diagram(s)
sequenceDiagram
participant User as User (OS)
participant Listener as Hotkey Listener
participant Server as Voice Server
participant Audio as Audio Device
participant Whisper as Whisper Engine
participant LLM as LLM Service
participant TextInput as Text Input
User->>Listener: Press hotkey
Listener->>Server: Emit HotkeyEvent::Pressed
Server->>Audio: start_recording()
Audio->>Audio: Capture PCM samples
User->>Listener: Release hotkey (push mode) / Press again (tap mode)
Listener->>Server: Emit HotkeyEvent::Released or Pressed
Server->>Audio: stop()
Audio->>Server: WAV bytes
Server->>Whisper: transcribe WAV
Whisper->>Server: raw_text
alt LLM refinement enabled
Server->>LLM: cleanup_transcription(raw_text)
LLM->>Server: refined_text
else
Server->>Server: use raw_text
end
Server->>TextInput: insert_text(final_text)
TextInput->>User: text appears in focused field
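The tap/push activation handling in the diagram above can be sketched as a tiny state machine. This is an illustrative std-only sketch, not the PR's actual code: `ActivationMode`, `RecState`, and `HotkeyEvent` here are local stand-ins for the real types.

```rust
#[derive(Clone, Copy, PartialEq)]
enum ActivationMode { Tap, Push }

#[derive(Clone, Copy, PartialEq, Debug)]
enum RecState { Idle, Recording }

#[derive(Clone, Copy)]
enum HotkeyEvent { Pressed, Released }

/// Advance the recording state for one hotkey event, mirroring the diagram:
/// tap mode toggles on each press; push mode records only while the key is held.
fn step(mode: ActivationMode, state: RecState, ev: HotkeyEvent) -> RecState {
    match (mode, state, ev) {
        (ActivationMode::Tap, RecState::Idle, HotkeyEvent::Pressed) => RecState::Recording,
        (ActivationMode::Tap, RecState::Recording, HotkeyEvent::Pressed) => RecState::Idle,
        (ActivationMode::Push, _, HotkeyEvent::Pressed) => RecState::Recording,
        (ActivationMode::Push, _, HotkeyEvent::Released) => RecState::Idle,
        // Released is ignored in tap mode: the second press stops recording.
        (_, s, _) => s,
    }
}
```

A transition out of `Recording` is where the server would call `stop()`, transcribe, and insert the text.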
sequenceDiagram
participant Client as Browser Client
participant WS as /ws/dictation
participant Buffer as Audio Buffer
participant Inference as Periodic Inference
participant Whisper as Whisper Engine
participant Cleanup as LLM Cleanup
Client->>WS: Connect
WS->>Buffer: init buffer
WS->>Inference: spawn periodic task (if streaming)
Client->>WS: Send binary audio frame
WS->>Buffer: append PCM16
Inference->>Whisper: transcribe partial
Whisper->>Inference: partial text
Inference->>Client: {type:"partial", text}
Client->>WS: Send {"type":"stop"}
WS->>Inference: stop task
WS->>Whisper: final transcribe(all)
Whisper->>WS: raw_text
alt LLM refinement enabled
WS->>Cleanup: cleanup_transcription(raw_text)
Cleanup->>WS: refined_text
else
WS->>WS: use raw_text
end
WS->>Client: {type:"final", text, raw_text}
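The "append PCM16" step above implies decoding little-endian 16-bit frames from the binary WebSocket messages. A minimal sketch of that conversion — assuming, as Whisper-style pipelines typically do, that samples are normalized to f32 in [-1.0, 1.0]; this is an assumption, not the PR's verified code:

```rust
/// Decode a binary audio frame of little-endian PCM16 samples into f32
/// samples normalized to [-1.0, 1.0], ready to append to the session buffer.
fn append_pcm16_le(buffer: &mut Vec<f32>, frame: &[u8]) {
    // chunks_exact(2) silently drops a trailing odd byte; a real
    // implementation may want to report that instead.
    for pair in frame.chunks_exact(2) {
        let sample = i16::from_le_bytes([pair[0], pair[1]]);
        buffer.push(sample as f32 / 32768.0);
    }
}
```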
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 6
🧹 Nitpick comments (11)
src/openhuman/voice/hotkey.rs (1)
203-281: Consider adding arrow keys and navigation keys.
The `string_to_key` mapping covers common hotkey candidates but omits arrow keys (`up`, `down`, `left`, `right`), `home`, `end`, `pageup`, `pagedown`, and `insert`. These are occasionally used in hotkey combos. Low priority since the current set covers typical use cases.
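As a sketch of the suggested extension — using a local stand-in enum, since the exact rdev variant names (e.g. `UpArrow` vs `Up`) must be checked against the crate:

```rust
// Hypothetical sketch of extending a string-to-key lookup with navigation
// keys. `Key` here is a local stand-in for rdev's enum, so the variant
// names are illustrative only.
#[derive(Debug, PartialEq)]
enum Key {
    Up, Down, Left, Right,
    Home, End, PageUp, PageDown, Insert,
}

fn string_to_key(name: &str) -> Option<Key> {
    match name.to_ascii_lowercase().as_str() {
        "up" => Some(Key::Up),
        "down" => Some(Key::Down),
        "left" => Some(Key::Left),
        "right" => Some(Key::Right),
        "home" => Some(Key::Home),
        "end" => Some(Key::End),
        "pageup" | "pgup" => Some(Key::PageUp),
        "pagedown" | "pgdn" => Some(Key::PageDown),
        "insert" | "ins" => Some(Key::Insert),
        _ => None, // the existing arms for letters/modifiers would go here
    }
}
```

Accepting short aliases (`pgup`, `ins`) costs nothing and matches common hotkey-config conventions.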
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/hotkey.rs` around lines 203-281, the mapping in string_to_key is missing navigation and arrow keys; add patterns for "up"/"down"/"left"/"right" mapping to Key::Up/Key::Down/Key::Left/Key::Right, and also add "home" => Key::Home, "end" => Key::End, "pageup"/"pgup" => Key::PageUp, "pagedown"/"pgdn" => Key::PageDown, and "insert"/"ins" => Key::Insert (use the exact rdev Key enum variants available) so these keys are recognized in hotkey parsing.
src/openhuman/voice/postprocess.rs (2)
118-125: Test relies on implicit LLM state assumption.
This test assumes `local_ai::global(config)` returns a service with state other than "ready" or "degraded" in the test environment. If the LocalAiService initialization behavior changes (e.g., to auto-detect a local model), this test could become flaky. Consider adding a comment documenting this assumption, or mock the service state for deterministic testing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/postprocess.rs` around lines 118 - 125, The test disabled_cleanup_returns_raw_text relies on the implicit state of the LocalAiService returned by local_ai::global(config) which may be "ready" or "degraded" and make the test flaky; update the test to explicitly control the service state (e.g., mock or stub local_ai::global to return a LocalAiService with a known non-ready/degraded state) or set the service state deterministically before calling cleanup_transcription, and add a brief comment documenting the required assumption; target the test function disabled_cleanup_returns_raw_text and the call to cleanup_transcription (and any helper used to obtain the LocalAiService) when making this change.
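One way to make the test deterministic is to inject the LLM state instead of reading it from the global service. A minimal std-only sketch — `cleanup_transcription_with_state` is a hypothetical refactor, not the PR's API, and the "cleanup" body is a trim-only placeholder for the real LLM call:

```rust
/// Hypothetical testable variant: the cleanup decision takes the LLM state
/// as a parameter instead of querying local_ai::global() internally.
fn cleanup_transcription_with_state(raw: &str, llm_state: &str) -> String {
    let llm_ready = matches!(llm_state, "ready" | "degraded");
    if !llm_ready {
        // Graceful fallback to the raw transcription, as in the PR.
        return raw.to_string();
    }
    // Placeholder for the real LLM call: here we only trim stray whitespace.
    raw.trim().to_string()
}
```

A test can then pin the state explicitly (e.g. `"offline"`) rather than relying on whatever the global service happens to report.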
41-54: Logic is correct but doc comment could be clearer.
Even when `voice_llm_cleanup_enabled` is true, you gracefully skip cleanup (lines 51-54). However, the doc comment (lines 23-27) says cleanup is enabled when the LLM is "ready", but the code also treats "degraded" as ready (line 39). Consider updating the doc comment to mention "degraded" for accuracy.
The dual-check pattern (should_cleanup then llm_ready) is intentional graceful degradation; no changes needed to the logic itself.
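The described gating can be captured in a small predicate. This is a sketch following the review text (enable when configured OR the model is usable, but always require a usable model to actually run), not verified against the code:

```rust
/// Sketch of the dual check described above.
fn should_run_cleanup(configured: bool, llm_state: &str) -> bool {
    // Per the review, "degraded" counts as usable alongside "ready".
    let llm_ready = matches!(llm_state, "ready" | "degraded");
    // Enabled when explicitly configured OR when the local model is usable...
    let should_cleanup = configured || llm_ready;
    // ...but always degrade gracefully to raw text when the LLM is unavailable.
    should_cleanup && llm_ready
}
```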
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/postprocess.rs` around lines 41-54, update the doc comment explaining when cleanup is enabled to accurately reflect the code: note that cleanup is enabled when config.local_ai.voice_llm_cleanup_enabled is true OR when llm_ready is true, and explicitly mention that the llm_ready check treats both "ready" and "degraded" as acceptable states; leave the gating logic (the should_cleanup variable and the subsequent llm_ready checks that return raw_text) unchanged.
src/openhuman/config/schema/voice_server.rs (1)
17-19: Consider using the `ActivationMode` enum instead of String.
The `hotkey.rs` module already defines `ActivationMode` with proper serde support (`#[serde(rename_all = "snake_case")]`). Using the enum here would provide compile-time type safety and automatic validation during config deserialization, rejecting invalid values like "Tap" or "toggle".
Suggested change:
+use crate::openhuman::voice::hotkey::ActivationMode;
+
 /// Configuration for the voice dictation server.
 #[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
 pub struct VoiceServerConfig {
     // ...
-    /// Activation mode: "tap" (toggle) or "push" (hold-to-record).
-    #[serde(default = "default_activation_mode")]
-    pub activation_mode: String,
+    /// Activation mode: tap (toggle) or push (hold-to-record).
+    #[serde(default)]
+    pub activation_mode: ActivationMode,
This requires ensuring `ActivationMode` implements `JsonSchema` (add a `schemars::JsonSchema` derive in hotkey.rs if not already present).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/schema/voice_server.rs` around lines 17-19, replace the activation_mode String field with the ActivationMode enum (use the existing ActivationMode from hotkey.rs) so deserialization enforces valid snake_case values; update the serde default handler (default_activation_mode) to return ActivationMode (or implement Default for ActivationMode and use #[serde(default)]) and adjust the field signature accordingly; also ensure ActivationMode in hotkey.rs derives schemars::JsonSchema in addition to serde traits so config schema generation works.
src/core/cli.rs (1)
272-274: Potential duplicate hotkey listener if core server is also running.
Per the context in `server.rs`, `run_standalone` creates a new `VoiceServer` instance independent of the `global_server()` singleton used by RPC endpoints. If a user runs `openhuman voice` while the core server (with voice RPC) is already active, both would attempt to register the same hotkey, potentially causing conflicts.
Consider documenting this as a known limitation, or adding a check (e.g., via a lockfile or socket probe) to warn if the core server is already running.
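A lockfile probe like the one suggested can be sketched with the standard library alone. The lock path and function name are made up for illustration:

```rust
use std::fs::OpenOptions;
use std::path::Path;

/// Try to become the exclusive hotkey owner by creating a lockfile.
/// Returns false if another process (e.g. the core server) already holds it.
/// Note: a real implementation should also handle stale locks left behind by
/// a crashed process (e.g. by storing and checking the owner PID).
fn try_acquire_hotkey_lock(path: &Path) -> bool {
    OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists if the file is present
        .open(path)
        .is_ok()
}
```

On failure the CLI could print a warning and either abort or skip hotkey registration.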
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/cli.rs` around lines 272-274, the CLI may start a duplicate hotkey listener because run_standalone creates a new VoiceServer independent of the global_server() used by RPC; before calling run_standalone from CLI, probe whether the core server/global VoiceServer is already active (e.g., attempt to acquire the same lockfile, open the core server socket, or call a small RPC/health endpoint) and if detected, warn the user and either abort or start in a mode that does not register the hotkey (or register with a no-op); update the behavior in cli.rs around the run_standalone call and adjust VoiceServer startup in server.rs to support a "skip_hotkey_registration" flag or early-exit when global_server() is present, and add a short user-facing message documenting this limitation.
src/openhuman/voice/server.rs (2)
263-289: `run_standalone` creates a new server instance, not using the global singleton.
This is intentional for CLI isolation (the standalone server shouldn't interfere with RPC-managed instances), but it means CLI-started servers won't be queryable via the `voice_server_status` RPC since that checks the global singleton. Consider documenting this behavior or having the CLI optionally register with the global.
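Optional registration with a global singleton could look like this std-only sketch. `VoiceServer`, `set_global`, and `try_global` are simplified stand-ins for the PR's actual types, not its API:

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the real server state.
struct VoiceServer {
    name: String,
}

static GLOBAL: OnceLock<Arc<VoiceServer>> = OnceLock::new();

/// Register a CLI-created server so RPC status queries can see it.
/// Returns false if a global server was already registered.
fn set_global(server: Arc<VoiceServer>) -> bool {
    GLOBAL.set(server).is_ok()
}

/// What a voice_server_status-style handler would call.
fn try_global() -> Option<Arc<VoiceServer>> {
    GLOBAL.get().cloned()
}
```

Keeping registration behind a flag preserves the CLI-isolation behavior when it is off.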
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/server.rs` around lines 263-289, the run_standalone function creates a fresh VoiceServer via VoiceServer::new and runs it in an Arc without registering it with the global singleton used by the voice_server_status RPC, so CLI-started instances aren't queryable; either document this behavior in the run_standalone comments or add an optional flag to register the created server into the global singleton used by voice_server_status (e.g. set_global_voice_server or similar) after creating server and before running it, ensuring shutdown still works via the same stop() flow and preserving CLI isolation when the flag is false.
119-133: 100ms sleep in select! creates unnecessary wake-ups.
The `tokio::select!` with a 100ms sleep branch causes the loop to wake every 100ms even when idle. Since `hotkey_rx.recv()` will block until an event arrives, and the loop checks `self.running` on each iteration, the sleep branch primarily serves as a check interval for the `running` flag. Consider using a cancellation token or watch channel for cleaner shutdown signaling.
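The idea can be shown without tokio: carry shutdown through the event channel itself so the loop blocks until a real message arrives instead of polling a flag. A std-only analogue (the tokio version would instead select on `watch::changed()` or a `CancellationToken`, which this sketch does not reproduce):

```rust
use std::sync::mpsc;

enum Msg {
    Hotkey(&'static str), // stand-in for HotkeyEvent
    Shutdown,
}

/// Event loop that never polls: recv() blocks until a hotkey event or an
/// explicit Shutdown message wakes it, so shutdown is immediate.
fn run_loop(rx: mpsc::Receiver<Msg>) -> Vec<&'static str> {
    let mut handled = Vec::new();
    while let Ok(msg) = rx.recv() {
        match msg {
            Msg::Hotkey(ev) => handled.push(ev),
            Msg::Shutdown => break,
        }
    }
    handled
}
```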
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/server.rs` around lines 119-133, the loop is waking every 100ms due to the sleep branch in the tokio::select!, causing unnecessary wake-ups; replace that sleep-based shutdown check with a proper async shutdown signal (e.g., a tokio::sync::watch<bool> or a tokio_util::sync::CancellationToken) and select on that instead of tokio::time::sleep so the task only wakes for real events or shutdown; update the select! to await hotkey_rx.recv() and the chosen shutdown signal (e.g., watch.changed() or token.cancelled()), and remove the sleep and the polling of self.running (or have the shutdown signal be driven by setting self.running) so shutdown is immediate and efficient.
src/openhuman/voice/schemas.rs (2)
311-322: `handle_voice_server_stop` returns an error when the server is not initialized.
Returning `Err("voice server is not running")` is reasonable behavior for stop when there's nothing to stop. However, consider whether returning a "Stopped" status (like `handle_voice_server_status` does) would be more consistent and user-friendly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/schemas.rs` around lines 311 - 322, handle_voice_server_stop currently returns Err("voice server is not running") when try_global_server() yields None; change it to return the same "Stopped" status as handle_voice_server_status instead. Update the else branch in handle_voice_server_stop to produce a serialized stopped status (use the same Status enum/value or call the existing status-generation logic used by handle_voice_server_status) via serde_json::to_value(...) so callers receive a consistent status JSON rather than an error.
273-276: Silent fallback to `Tap` mode for invalid `activation_mode` values.
Any value other than "push" defaults to `Tap` without warning. If a user passes "Push" (capitalized) or "hold", they'll get `Tap` mode without realizing. Consider logging a warning for unrecognized values or validating strictly.
🛡️ Proposed fix: warn on unrecognized mode
 let activation_mode = match mode_str {
     "push" => ActivationMode::Push,
-    _ => ActivationMode::Tap,
+    "tap" => ActivationMode::Tap,
+    other => {
+        log::warn!(
+            "[voice_server] unrecognized activation_mode '{other}', defaulting to tap"
+        );
+        ActivationMode::Tap
+    }
 };
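Strict validation is the other option the comment mentions. A self-contained sketch — the enum mirrors the one described, and the parsing contract (error instead of fallback) is illustrative, not the PR's behavior:

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
enum ActivationMode {
    Tap,
    Push,
}

impl FromStr for ActivationMode {
    type Err = String;

    /// Accept only the documented snake_case values; anything else becomes
    /// an explicit error instead of a silent fallback to Tap.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "tap" => Ok(ActivationMode::Tap),
            "push" => Ok(ActivationMode::Push),
            other => Err(format!("unrecognized activation_mode '{other}'")),
        }
    }
}
```

With this in place, `"Push"` or `"hold"` would surface as a config error the user can fix rather than a surprising mode change.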
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/schemas.rs` around lines 273-276, the code currently maps mode_str to ActivationMode with a silent default to Tap; modify the mapping in the match that creates activation_mode to explicitly recognize accepted values (e.g., exact "push" and maybe canonicalized lowercased variants) and emit a warning when mode_str is unrecognized (including the received value) instead of silently defaulting — use the project's logging facility (e.g., warn! or tracing::warn) to log something like "Unrecognized activation_mode '{}', defaulting to Tap" and then fall back to ActivationMode::Tap; ensure you reference mode_str, ActivationMode::Push, and ActivationMode::Tap in the change.
src/openhuman/voice/text_input.rs (2)
17-41: `paste_text` and `type_text` have identical implementations.
Both functions call `enigo.text(text)` despite having different docstrings. The `type_text` docstring claims it types "character-by-character" and "doesn't touch the clipboard," but the implementation is identical to `paste_text`. Either consolidate into one function, or implement the actual character-by-character typing if that behavior is needed.
♻️ Option 1: Consolidate if both methods should behave the same
-/// Insert text into the currently active text field by simulating a
-/// clipboard paste (Cmd+V on macOS, Ctrl+V elsewhere).
-///
-/// This is more reliable than typing character-by-character because it
-/// handles Unicode, IME, and special characters correctly.
-pub fn paste_text(text: &str) -> Result<(), String> {
+/// Insert text into the active text field using enigo's platform-appropriate method.
+///
+/// Handles Unicode, IME, and special characters correctly.
+pub fn insert_text_raw(text: &str) -> Result<(), String> {
     if text.is_empty() {
-        debug!("{LOG_PREFIX} empty text, nothing to paste");
+        debug!("{LOG_PREFIX} empty text, nothing to insert");
         return Ok(());
     }
     info!(
-        "{LOG_PREFIX} pasting {} chars into active field",
+        "{LOG_PREFIX} inserting {} chars into active field",
         text.len()
     );
-    // Save current clipboard, set our text, paste, then restore.
-    // For simplicity we just overwrite — restoring clipboard is fragile
-    // across platforms and async contexts.
     let mut enigo = Enigo::new(&Settings::default())
         .map_err(|e| format!("failed to create enigo instance: {e}"))?;
-    // Use enigo's text method which handles the platform-appropriate paste.
     enigo
         .text(text)
         .map_err(|e| format!("failed to type text: {e}"))?;
-    debug!("{LOG_PREFIX} text pasted successfully");
+    debug!("{LOG_PREFIX} text inserted successfully");
     Ok(())
 }
-
-/// Type text character-by-character into the active field.
-/// Slower but doesn't touch the clipboard.
-pub fn type_text(text: &str) -> Result<(), String> {
-    if text.is_empty() {
-        debug!("{LOG_PREFIX} empty text, nothing to type");
-        return Ok(());
-    }
-
-    info!("{LOG_PREFIX} typing {} chars into active field", text.len());
-
-    let mut enigo = Enigo::new(&Settings::default())
-        .map_err(|e| format!("failed to create enigo instance: {e}"))?;
-
-    enigo
-        .text(text)
-        .map_err(|e| format!("failed to type text: {e}"))?;
-
-    debug!("{LOG_PREFIX} text typed successfully");
-    Ok(())
-}
♻️ Option 2: Implement actual character-by-character typing if needed
 /// Type text character-by-character into the active field.
 /// Slower but doesn't touch the clipboard.
 pub fn type_text(text: &str) -> Result<(), String> {
     if text.is_empty() {
         debug!("{LOG_PREFIX} empty text, nothing to type");
         return Ok(());
     }
     info!("{LOG_PREFIX} typing {} chars into active field", text.len());
     let mut enigo = Enigo::new(&Settings::default())
         .map_err(|e| format!("failed to create enigo instance: {e}"))?;
-    enigo
-        .text(text)
-        .map_err(|e| format!("failed to type text: {e}"))?;
+    for c in text.chars() {
+        enigo
+            .key(enigo::Key::Unicode(c), enigo::Direction::Click)
+            .map_err(|e| format!("failed to type character '{c}': {e}"))?;
+    }
     debug!("{LOG_PREFIX} text typed successfully");
     Ok(())
 }
Also applies to: 43-62
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/text_input.rs` around lines 17 - 41, Both paste_text and type_text currently call Enigo::text(text) and are identical; change to either consolidate them into a single function (remove duplicate) or implement real per-character typing in type_text. If consolidating, remove type_text and update call sites to use paste_text (or vice versa) and keep Enigo::text(text) as the single behavior. If implementing character-by-character typing, modify type_text to iterate over chars and call enigo.key_sequence or enigo.key_down/key_up per character (or a small sleep between calls) while ensuring it does not touch clipboard; keep paste_text using enigo.text(text) and maintain existing error handling and logging in both functions.
68-75: Inconsistent whitespace handling between `insert_text` and `paste_text`.
`insert_text` trims text before checking emptiness (line 69), but `paste_text` only checks for an empty string. If the caller passes " " directly to `paste_text`, it will attempt to paste whitespace, whereas `insert_text` would skip it. This is fine if `paste_text` is only called via `insert_text`, but since both are `pub`, consider documenting this difference or aligning the behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/text_input.rs` around lines 68-75, the issue is inconsistent whitespace handling: insert_text trims before the emptiness check but paste_text only checks for an exact empty string, so callers of pub paste_text can paste pure whitespace. Fix by making paste_text perform the same trim-and-empty check (or have insert_text call paste_text with text.trim()), e.g., update paste_text to trim the text and return Ok with the same warning log used in insert_text when the trimmed text is empty; ensure both functions' behavior is aligned or document the intended contract in their public comments (reference functions: insert_text and paste_text).
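The aligned contract could be factored into a shared pre-check. A std-only sketch — the helper name `normalize_for_insert` is invented for illustration; the real functions would then hand the result to enigo:

```rust
/// Shared pre-check for insert_text/paste_text-style entry points: trim the
/// input and return None when nothing meaningful would be inserted, so both
/// public functions agree on how pure-whitespace input is handled.
fn normalize_for_insert(text: &str) -> Option<&str> {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        None // caller logs "empty text, nothing to insert" and returns Ok(())
    } else {
        Some(trimmed)
    }
}
```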
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Cargo.toml`:
- Line 94: The enigo dependency entry currently disables default features which
breaks compilation on Linux because no platform backend (xdo, x11rb, wayland, or
libei) is enabled; update the Cargo.toml enigo dependency so it either
re-enables default features or explicitly enables a Linux backend (e.g., add the
appropriate feature like "xdo" or "x11rb" under the enigo = { ... } line) so the
crate has a platform-specific backend enabled.
In `@src/core/cli.rs`:
- Around line 257-258: The CLI currently constructs the config via
Config::default() and then apply_env_overrides(), which ignores user
config.toml; replace that logic to call Config::load_or_init() (which loads from
the config file or falls back to defaults) and then still call
config.apply_env_overrides() so env vars override file values; update the code
that references the config variable (the let mut config assignment and
subsequent uses) to use the loaded config from Config::load_or_init() instead of
Config::default().
In `@src/openhuman/voice/hotkey.rs`:
- Around line 59-71: The stop_flag only causes the callback in
HotkeyListenerHandle::stop/Drop to return early but does not terminate the
thread because rdev::listen is blocking; update the stop log in
HotkeyListenerHandle::stop to a more accurate message such as "hotkey listener
signaled to skip events" (replace the current "{LOG_PREFIX} stop signal sent to
hotkey listener"), and add a clear code comment in start_listener next to the
rdev::listen(callback) call documenting that rdev::listen blocks in the platform
event loop (no graceful cancellation in rdev 0.5.3) so the thread remains alive
and only process exit removes the hook — mention that this is a known limitation
and why the stop_flag only causes the callback to ignore events.
In `@src/openhuman/voice/schemas.rs`:
- Around line 283-297: The code currently calls global_server(server_config)
which uses OnceCell::get_or_init and will silently ignore a new
VoiceServerConfig if the server already exists; update voice_server_start (the
caller creating VoiceServerConfig and calling global_server) to detect when an
existing server/config is present by retrieving the stored config from the
global server singleton (via the same global_server/OnceCell or a getter),
compare fields (hotkey, activation_mode, skip_cleanup, context) against the
incoming VoiceServerConfig, and if they differ either return an Err indicating
the server is already running with a different config or at minimum log a clear
warning via log::warn! that the provided config was ignored; include this
comparison and the warning/error before spawning the tokio task that calls
server.run(&config_clone).
In `@src/openhuman/voice/server.rs`:
- Around line 291-297: The current truncate_for_log function slices bytes with
&s[..max], which can panic on multi-byte UTF-8 boundaries; replace the
byte-slice approach with a safe character-based truncation (use
s.chars().take(max).collect::<String>() or iterate with s.char_indices()) and
append "..." only when the original string was longer than max so the function
never slices in the middle of a Unicode scalar and safely returns a valid UTF-8
String.
- Around line 46-66: The runtime config is missing the min_duration_secs field
declared in the schema; add pub min_duration_secs: f32 to VoiceServerConfig and
set its default to 0.3 in the Default impl, then replace the hardcoded threshold
literal used when checking transcription/audio length (the runtime comparison
currently using the embedded numeric threshold) to use self.min_duration_secs
(or config.min_duration_secs) instead so the runtime honors the schema value.
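The char-safe truncation suggested for `truncate_for_log` in server.rs can be sketched as follows (signature assumed from the review text; `max` is interpreted as a character count here):

```rust
/// Truncate a string for logging without slicing inside a multi-byte UTF-8
/// character (byte slicing like &s[..max] can panic on such boundaries),
/// appending "..." only when the input was actually longer than max chars.
fn truncate_for_log(s: &str, max_chars: usize) -> String {
    if s.chars().count() <= max_chars {
        return s.to_string();
    }
    let truncated: String = s.chars().take(max_chars).collect();
    format!("{truncated}...")
}
```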
---
Nitpick comments:
In `@src/core/cli.rs`:
- Around line 272-274: The CLI may start a duplicate hotkey listener because
run_standalone creates a new VoiceServer independent of the global_server() used
by RPC; before calling run_standalone from CLI, probe whether the core
server/global VoiceServer is already active (e.g., attempt to acquire the same
lockfile, open the core server socket, or call a small RPC/health endpoint) and
if detected, warn the user and either abort or start in a mode that does not
register the hotkey (or register with a no-op); update the behavior in cli.rs
around the run_standalone call and adjust VoiceServer startup in server.rs to
support a "skip_hotkey_registration" flag or early-exit when global_server() is
present, and add a short user-facing message documenting this limitation.
In `@src/openhuman/config/schema/voice_server.rs`:
- Around line 17-19: Replace the activation_mode String field in voice_server.rs
with the ActivationMode enum (use the existing ActivationMode from hotkey.rs) so
deserialization enforces valid snake_case values; update the serde default
handler (default_activation_mode) to return ActivationMode (or implement Default
for ActivationMode and use #[serde(default)]) and adjust the field signature
accordingly; also ensure ActivationMode in hotkey.rs derives
schemars::JsonSchema in addition to serde traits so config schema generation
works.
In `@src/openhuman/voice/hotkey.rs`:
- Around line 203-281: The mapping in string_to_key is missing navigation and
arrow keys; add patterns for "up"/"down"/"left"/"right" mapping to
Key::Up/Key::Down/Key::Left/Key::Right, and also add "home" => Key::Home, "end"
=> Key::End, "pageup"/"pgup" => Key::PageUp, "pagedown"/"pgdn" => Key::PageDown,
and "insert"/"ins" => Key::Insert (use the exact rdev Key enum variants
available) so these keys are recognized in hotkey parsing.
In `@src/openhuman/voice/postprocess.rs`:
- Around line 118-125: The test disabled_cleanup_returns_raw_text relies on the
implicit state of the LocalAiService returned by local_ai::global(config) which
may be "ready" or "degraded" and make the test flaky; update the test to
explicitly control the service state (e.g., mock or stub local_ai::global to
return a LocalAiService with a known non-ready/degraded state) or set the
service state deterministically before calling cleanup_transcription, and add a
brief comment documenting the required assumption; target the test function
disabled_cleanup_returns_raw_text and the call to cleanup_transcription (and any
helper used to obtain the LocalAiService) when making this change.
- Around line 41-54: Update the doc comment explaining when cleanup is enabled
to accurately reflect the code: note that cleanup is enabled when
config.local_ai.voice_llm_cleanup_enabled is true OR when llm_ready is true, and
explicitly mention that the llm_ready check treats both "ready" and "degraded"
as acceptable states; leave the gating logic (the should_cleanup variable and
the subsequent llm_ready checks that return raw_text) unchanged.
In `@src/openhuman/voice/schemas.rs`:
- Around line 311-322: handle_voice_server_stop currently returns Err("voice
server is not running") when try_global_server() yields None; change it to
return the same "Stopped" status as handle_voice_server_status instead. Update
the else branch in handle_voice_server_stop to produce a serialized stopped
status (use the same Status enum/value or call the existing status-generation
logic used by handle_voice_server_status) via serde_json::to_value(...) so
callers receive a consistent status JSON rather than an error.
- Around line 273-276: The code currently maps mode_str to ActivationMode with a
silent default to Tap; modify the mapping in the match that creates
activation_mode (matching on mode_str) to explicitly recognize accepted values
(e.g., exact "push" and maybe canonicalized lowercased variants) and emit a
warning when mode_str is unrecognized (including the received value) instead of
silently defaulting—use the project's logging facility (e.g., warn! or
tracing::warn) to log something like "Unrecognized activation_mode '{}',
defaulting to Tap" and then fall back to ActivationMode::Tap; ensure you
reference mode_str, ActivationMode::Push, and ActivationMode::Tap in the change.
In `@src/openhuman/voice/server.rs`:
- Around line 263-289: The run_standalone function creates a fresh VoiceServer
via VoiceServer::new and runs it in an Arc without registering it with the
global singleton used by the voice_server_status RPC, so CLI-started instances
aren't queryable; either document this behavior in the run_standalone comments
or add an optional flag to register the created server into the global singleton
used by voice_server_status (e.g. set_global_voice_server or similar) after
creating server and before running it, ensuring shutdown still works via the
same stop() flow and preserving CLI isolation when the flag is false.
- Around line 119-133: The loop is waking every 100ms due to the sleep branch in
the tokio::select!, causing unnecessary wake-ups; replace that sleep-based
shutdown check with a proper async shutdown signal (e.g., a
tokio::sync::watch<bool> or a tokio_util::sync::CancellationToken) and select on
that instead of tokio::time::sleep so the task only wakes for real events or
shutdown; update the select! to await hotkey_rx.recv() and the chosen shutdown
signal (e.g., watch.changed() or token.cancelled()), and remove the sleep and
the polling of self.running (or have the shutdown signal be driven by setting
self.running) so shutdown is immediate and efficient in the functions using
self.running, tokio::select!, and hotkey_rx.recv().
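The shutdown pattern described above can be illustrated with a std-only analogue: the worker blocks on a channel and wakes only for real events or an explicit shutdown, with no periodic polling. The actual fix would use `tokio::select!` over `hotkey_rx.recv()` and a `watch` channel or `CancellationToken`; the names and `Event` enum here are illustrative.

```rust
use std::sync::mpsc;
use std::thread;

// Std-only analogue of the event-driven shutdown the review suggests.
enum Event {
    Hotkey,
    Shutdown,
}

fn run_worker(rx: mpsc::Receiver<Event>) -> usize {
    let mut hotkeys_handled = 0;
    // recv() parks the thread; it wakes only when something is sent,
    // so shutdown is immediate and there are no 100ms wake-ups.
    while let Ok(event) = rx.recv() {
        match event {
            Event::Hotkey => hotkeys_handled += 1,
            Event::Shutdown => break,
        }
    }
    hotkeys_handled
}

fn demo() -> usize {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || run_worker(rx));
    tx.send(Event::Hotkey).unwrap();
    tx.send(Event::Hotkey).unwrap();
    tx.send(Event::Shutdown).unwrap();
    handle.join().unwrap()
}
```

The same shape carries over to async: replace `rx.recv()` with a `select!` over the hotkey receiver and the cancellation future, and drop the sleep arm entirely.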
In `@src/openhuman/voice/text_input.rs`:
- Around line 17-41: Both paste_text and type_text currently call
Enigo::text(text) and are identical; change to either consolidate them into a
single function (remove duplicate) or implement real per-character typing in
type_text. If consolidating, remove type_text and update call sites to use
paste_text (or vice versa) and keep Enigo::text(text) as the single behavior. If
implementing character-by-character typing, modify type_text to iterate over
chars and call enigo.key_sequence or enigo.key_down/key_up per character (or a
small sleep between calls) while ensuring it does not touch clipboard; keep
paste_text using enigo.text(text) and maintain existing error handling and
logging in both functions.
- Around line 68-75: The issue is inconsistent whitespace handling: insert_text
trims before emptiness check but paste_text only checks for exact empty string,
so callers of pub paste_text can paste pure-whitespace. Fix by making paste_text
perform the same trim-and-empty check (or have insert_text call paste_text with
text.trim()), e.g., update paste_text to call text.trim() and skip/panic/return
Ok with the same warning log used in insert_text when trimmed text is empty;
ensure both functions' behavior is aligned or document the intended contract in
their public comments (reference functions: insert_text and paste_text).
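A minimal sketch of the aligned whitespace contract, with the function names from the review. Everything else is assumed: `Option<String>` stands in for the enigo-backed insertion so the skip-on-whitespace behavior is observable without a GUI.

```rust
/// Returns the text that would actually be inserted, or None when the
/// trimmed input is empty and insertion is skipped (mirroring the
/// warn-and-skip path the review describes for insert_text).
pub fn paste_text(text: &str) -> Option<String> {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        // Real code would log the same warning insert_text uses.
        return None;
    }
    Some(trimmed.to_string()) // placeholder for enigo.text(trimmed)
}

/// Delegates so the trim-and-empty contract lives in exactly one place.
pub fn insert_text(text: &str) -> Option<String> {
    paste_text(text)
}
```

With the delegation, callers of either public function get identical behavior for pure-whitespace input.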
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 43969220-7598-4d34-b1ee-9c2bc5fcce2e
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
- Cargo.toml
- app/src/components/settings/panels/ScreenIntelligencePanel.tsx
- src/core/cli.rs
- src/core/screen_intelligence_cli.rs
- src/openhuman/config/schema/mod.rs
- src/openhuman/config/schema/types.rs
- src/openhuman/config/schema/voice_server.rs
- src/openhuman/screen_intelligence/engine.rs
- src/openhuman/voice/audio_capture.rs
- src/openhuman/voice/hotkey.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/postprocess.rs
- src/openhuman/voice/schemas.rs
- src/openhuman/voice/server.rs
- src/openhuman/voice/text_input.rs
…ing (tinyhumansai#332)

Add the foundational infrastructure for voice dictation (EPIC tinyhumansai#332):

**Rust core:**
- New `DictationConfig` schema with serde defaults and env var overrides (enabled, hotkey, activation_mode, llm_refinement, streaming, interval)
- RPC controllers: `config_get_dictation_settings` / `config_update_dictation_settings`
- WebSocket endpoint `/ws/dictation` for streaming PCM16 transcription with periodic partial inference and final LLM refinement
- Microphone permission declaration (`NSMicrophoneUsageDescription`) in Tauri macOS bundle config

**Frontend:**
- `useDictationHotkey` hook: fetches config from core RPC, auto-registers global hotkey, listens for `dictation://toggle` events
- `DictationHotkeyManager` headless component mounted in App.tsx
- Fix voice RPC response type mismatch: voice handlers return flat results (no `{result, logs}` wrapper), so remove incorrect `CommandResponse<T>` wrapping from `openhumanVoiceStatus`, `openhumanVoiceTranscribe`, `openhumanVoiceTranscribeBytes`, and `openhumanVoiceTts`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `infoPlist` field in tauri.conf.json expects a string path, not an inline object. Remove it for now — microphone permission will be added via a proper Info.plist supplement in the production build pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cycle, WebSocket streaming)

Merge feat/332-voice-dictation into feat/stt to combine:
- Our standalone voice server (hotkey → record → transcribe → insert)
- PR tinyhumansai#371's DictationConfig, WebSocket streaming endpoint, frontend hotkey hook, and voice RPC type fixes

Resolved conflict in src/openhuman/voice/mod.rs — kept both server and streaming modules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (9)
src/openhuman/voice/schemas.rs (2)
341-351: Synthesized "Stopped" status uses empty hotkey. When no server exists, `handle_voice_server_stop` and `handle_voice_server_status` return a `VoiceServerStatus` with `hotkey: String::new()`. Consider using a placeholder like `"(none)"` or the configured default from `config.voice_server.hotkey` for better UI presentation.
Also applies to: 361-369
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/schemas.rs` around lines 341 - 351, The synthesized Stopped VoiceServerStatus currently sets hotkey to an empty string; update the status construction in the server-status/stop handlers (e.g., where VoiceServerStatus is created in handle_voice_server_stop and handle_voice_server_status) to use a sensible placeholder such as "(none)" or the configured default (config.voice_server.hotkey) instead of String::new(), and ensure both occurrences (the block at ~341-351 and the similar block at ~361-369) are updated so the UI receives a meaningful hotkey value when no server exists.
316-323: Fixed 200ms sleep after spawn is fragile. The `tokio::time::sleep(200ms)` assumes the server initializes within that window. A more robust approach would poll `try_global_server()` with a short timeout/retry loop, or have `server.run()` signal readiness via a channel. This is acceptable for an initial implementation but may cause intermittent "failed to initialize" errors under load.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/schemas.rs` around lines 316 - 323, The fixed 200ms sleep after spawning the server is fragile; replace it with a readiness wait that polls try_global_server() (or use a oneshot/channel signalled by server.run) until the server is available or a short timeout elapses. Specifically, after tokio::spawn(async move { if let Err(e) = server.run(&config_clone).await { ... } }); implement a retry loop that calls try_global_server() with a small sleep (e.g., 10–50ms) between attempts and an overall timeout (e.g., a few seconds), returning an error if the timeout elapses; alternatively modify server.run to send a readiness signal on a oneshot::Sender and await the corresponding Receiver here—use the names server.run and try_global_server to locate where to change the code.
src/openhuman/config/schema/voice_server.rs (1)
6-6: Two distinct `ActivationMode` enums exist — verify this is intentional.
`VoiceServerConfig` uses `voice::hotkey::ActivationMode` (variants: `Tap`, `Push`), while `DictationConfig` in `dictation.rs` uses `DictationActivationMode` (variants: `Toggle`, `Push`). The semantic difference between "Tap" and "Toggle" is unclear and may confuse users or cause subtle bugs if the wrong enum is used. If both are needed, consider adding doc comments clarifying when each is used. Otherwise, consolidate into a single shared enum.
Also applies to: 19-21
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/schema/voice_server.rs` at line 6, VoiceServerConfig imports and uses voice::hotkey::ActivationMode (variants Tap, Push) while DictationConfig uses a separate DictationActivationMode (variants Toggle, Push), which is likely an accidental duplication causing semantic confusion; either consolidate both into one shared enum (e.g., rename to ActivationMode with unified variants like Tap/Toggle/Push as appropriate) and update usages in VoiceServerConfig and DictationConfig to reference the single enum, or keep both but add crate-level doc comments on the enum definitions clarifying their distinct semantics and when to use each (update definitions where ActivationMode and DictationActivationMode are declared and all references in VoiceServerConfig and DictationConfig accordingly).
src/openhuman/config/schema/dictation.rs (1)
54-56: Different default hotkeys for dictation vs voice server.
`DictationConfig` defaults to `"CmdOrCtrl+Shift+D"` while `VoiceServerConfig` (in `voice_server.rs`) defaults to `"ctrl+shift+space"`. If both are active simultaneously, users may be confused by which hotkey triggers which feature. Consider documenting the distinction clearly or unifying the defaults if they serve the same purpose.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/schema/dictation.rs` around lines 54 - 56, DictationConfig's default_hotkey() returns "CmdOrCtrl+Shift+D" while VoiceServerConfig uses "ctrl+shift+space", causing conflicting UX; pick a single canonical hotkey or explicitly document distinct purposes and wire both to a shared constant. Update the default_hotkey() function and VoiceServerConfig to use a shared constant (e.g., DEFAULT_DICTATION_HOTKEY or DEFAULT_VOICE_HOTKEY), add a brief doc comment on the DictationConfig and VoiceServerConfig structs explaining their different behaviors if you keep distinct keys, and ensure any user-facing docs/examples reflect the chosen/unified default.
src/openhuman/config/schema/load.rs (1)
735-739: Consider validating `streaming_interval_ms` bounds. Very low values (e.g., 1ms) could cause excessive CPU usage during streaming transcription. Other numeric overrides in this file validate ranges (e.g., temperature, max_results).
♻️ Optional: Add minimum threshold
```diff
 if let Ok(val) = std::env::var("OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS") {
     if let Ok(ms) = val.trim().parse::<u64>() {
-        self.dictation.streaming_interval_ms = ms;
+        if ms >= 100 {
+            self.dictation.streaming_interval_ms = ms;
+        } else {
+            tracing::warn!(
+                ms = %ms,
+                "ignoring OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS below 100ms"
+            );
+        }
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/schema/load.rs` around lines 735 - 739, The code reads OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS and assigns it to self.dictation.streaming_interval_ms without bounds checking; add validation after parsing (e.g., parse val to u64, then if ms < MIN_STREAMING_INTERVAL_MS or ms > MAX_STREAMING_INTERVAL_MS clamp or reject and fallback to the existing default) and ensure you log or warn when the provided value is out of range; introduce constants like MIN_STREAMING_INTERVAL_MS and MAX_STREAMING_INTERVAL_MS (or reuse an existing DEFAULT) and update the assignment in the block that handles OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS so only validated/clamped values are written to dictation.streaming_interval_ms.
src/openhuman/config/ops.rs (1)
539-546: Add derive macros for consistency with other patch structs. Other `*SettingsPatch` structs in this file (e.g., `ModelSettingsPatch`, `MemorySettingsPatch`) derive `Debug`, `Clone`, and `Default`. Consider adding these for consistency and to enable debugging/cloning if needed:
♻️ Proposed fix
```diff
-pub struct DictationSettingsPatch {
+#[derive(Debug, Clone, Default)]
+pub struct DictationSettingsPatch {
     pub enabled: Option<bool>,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/ops.rs` around lines 539 - 546, The DictationSettingsPatch struct lacks derive macros that other *SettingsPatch structs have; update the declaration of DictationSettingsPatch to derive Debug, Clone, and Default (add #[derive(Debug, Clone, Default)]) so it matches ModelSettingsPatch and MemorySettingsPatch and allows easy debugging/cloning/default construction for DictationSettingsPatch.
src/openhuman/voice/streaming.rs (2)
139-158: `inference_handle` ownership could be clearer. The `inference_handle` is moved in the `Close` and `Error` arms but the variable is also used after the loop (line 162). This works because `Close`/`Error` return early, but consider using `Option::take()` for clarity:
♻️ Suggested alternative for clarity
```diff
-            Some(Ok(Message::Close(_))) | None => {
-                log::info!("{LOG_PREFIX} client disconnected");
-                if let Some(h) = inference_handle {
-                    h.abort();
-                }
-                return;
-            }
+            Some(Ok(Message::Close(_))) | None => {
+                log::info!("{LOG_PREFIX} client disconnected");
+                if let Some(ref h) = inference_handle {
+                    h.abort();
+                }
+                return;
+            }
```
Or make `inference_handle` mutable and use `take()` consistently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/streaming.rs` around lines 139 - 158, The review notes unclear ownership of inference_handle being moved in the Close/Error match arms; make inference_handle mutable and use Option::take() before aborting so you don't move it repeatedly or depend on early returns—locate the websocket handling loop (where inference_handle is matched in the Close/Err arms) and replace direct moves like if let Some(h) = inference_handle { h.abort(); } with let h = inference_handle.take(); if let Some(h) = h { h.abort(); } so the variable is consistently consumed and safe to reference after the loop.
199-204: LLM refinement may silently fall back to raw text without indication. When `config.dictation.llm_refinement` is `true` but the local LLM isn't ready, `cleanup_transcription` returns the raw text unchanged (per `postprocess.rs` lines 55-57). The client receives identical `text` and `raw_text` fields with no indication that refinement was skipped. Consider logging when this fallback occurs, or including a `"refinement_applied": bool` field in the response so clients know whether cleanup actually ran.
♻️ Proposed enhancement
```diff
     // LLM refinement if enabled
-    let refined_text = if config.dictation.llm_refinement && !raw_text.is_empty() {
-        postprocess::cleanup_transcription(&config, &raw_text, None).await
+    let (refined_text, refinement_applied) = if config.dictation.llm_refinement && !raw_text.is_empty() {
+        let cleaned = postprocess::cleanup_transcription(&config, &raw_text, None).await;
+        let applied = cleaned != raw_text;
+        if !applied {
+            log::debug!("{LOG_PREFIX} LLM refinement requested but skipped (LLM not ready)");
+        }
+        (cleaned, applied)
     } else {
-        raw_text.clone()
+        (raw_text.clone(), false)
     };

     let msg = serde_json::json!({
         "type": "final",
         "text": refined_text,
         "raw_text": raw_text,
+        "refinement_applied": refinement_applied,
     });
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/streaming.rs` around lines 199 - 204, The LLM refinement path can silently return the raw text (cleanup_transcription in postprocess.rs) leaving clients unaware refinement was skipped; update the logic around refined_text (the block checking config.dictation.llm_refinement and calling postprocess::cleanup_transcription) to detect when cleanup_transcription returned the unchanged text or when the local LLM wasn't ready and then either (a) set and return an explicit boolean flag (e.g., "refinement_applied") in the response alongside text/raw_text, or (b) emit a clear log message indicating refinement was skipped; reference the refined_text variable and cleanup_transcription function to implement the detection and add the new response field or log.
src/core/jsonrpc.rs (1)
331-344: Consider adding authentication to the dictation WebSocket endpoint. The `/ws/dictation` endpoint accepts connections without any authentication or authorization check. While the server defaults to binding on `127.0.0.1` (limiting exposure), if the host is configured to bind to `0.0.0.0` or a public interface, any network client could use the dictation service.
Other RPC endpoints appear to rely on session tokens for authenticated operations. Consider whether this endpoint should validate a session token (e.g., via query parameter or initial WebSocket message) before processing audio data.
```bash
#!/bin/bash
# Check if other WebSocket or SSE endpoints have authentication checks
rg -n -C3 'WebSocketUpgrade|events_handler' src/core/jsonrpc.rs
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/jsonrpc.rs` around lines 331 - 344, The dictation_ws_handler currently accepts WebSocket upgrades without authentication; modify it to validate a session token before handing the socket to handle_dictation_ws: extract a token (e.g., from query params on the upgrade request or require the first WebSocket message to contain credentials), call your existing session validation routine (the same mechanism other RPC endpoints use) and only call crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await after successful validation; on failure, reject the upgrade or immediately close the socket and log the reason. Ensure the changes are implemented inside dictation_ws_handler (around WebSocketUpgrade handling and before invoking handle_dictation_ws) and reuse crate::openhuman::config::rpc::load_config_with_timeout() as needed for config access.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/src/hooks/useDictationHotkey.ts`:
- Around line 110-130: The effect registering the dictation hotkey can leak if
the component unmounts before listen() resolves; update the useEffect (the block
using isTauri(), listen, unlisten, and setToggleCount) to use a disposal guard:
introduce a local disposed flag, call listen(...).then(fn => { if (disposed) {
fn(); } else { unlisten = fn; } }).catch(...), and in the cleanup set disposed =
true and call unlisten?.(); this ensures a late-resolving listen() will
immediately unregister instead of leaving a dangling listener.
---
Nitpick comments:
In `@src/core/jsonrpc.rs`:
- Around line 331-344: The dictation_ws_handler currently accepts WebSocket
upgrades without authentication; modify it to validate a session token before
handing the socket to handle_dictation_ws: extract a token (e.g., from query
params on the upgrade request or require the first WebSocket message to contain
credentials), call your existing session validation routine (the same mechanism
other RPC endpoints use) and only call
crate::openhuman::voice::streaming::handle_dictation_ws(socket, config).await
after successful validation; on failure, reject the upgrade or immediately close
the socket and log the reason. Ensure the changes are implemented inside
dictation_ws_handler (around WebSocketUpgrade handling and before invoking
handle_dictation_ws) and reuse
crate::openhuman::config::rpc::load_config_with_timeout() as needed for config
access.
In `@src/openhuman/config/ops.rs`:
- Around line 539-546: The DictationSettingsPatch struct lacks derive macros
that other *SettingsPatch structs have; update the declaration of
DictationSettingsPatch to derive Debug, Clone, and Default (add #[derive(Debug,
Clone, Default)]) so it matches ModelSettingsPatch and MemorySettingsPatch and
allows easy debugging/cloning/default construction for DictationSettingsPatch.
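The derive suggestion above can be sketched together with the reason it matters. The `hotkey` field and the `apply` helper are hypothetical additions used only to show why `Default` (for empty patches) and `Clone` are useful on a patch struct; only `enabled: Option<bool>` appears in the review's snippet.

```rust
// Illustrative shape of the patch struct with the suggested derives.
#[derive(Debug, Clone, Default)]
pub struct DictationSettingsPatch {
    pub enabled: Option<bool>,
    pub hotkey: Option<String>, // assumed field, for illustration
}

#[derive(Debug, Clone)]
pub struct DictationSettings {
    pub enabled: bool,
    pub hotkey: String,
}

impl DictationSettings {
    /// Apply only the fields the patch actually sets.
    pub fn apply(&mut self, patch: &DictationSettingsPatch) {
        if let Some(enabled) = patch.enabled {
            self.enabled = enabled;
        }
        if let Some(hotkey) = &patch.hotkey {
            self.hotkey = hotkey.clone();
        }
    }
}
```

`..Default::default()` then lets callers construct a patch that sets a single field, which is the usual ergonomic payoff of deriving `Default` on these structs.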
In `@src/openhuman/config/schema/dictation.rs`:
- Around line 54-56: DictationConfig's default_hotkey() returns
"CmdOrCtrl+Shift+D" while VoiceServerConfig uses "ctrl+shift+space", causing
conflicting UX; pick a single canonical hotkey or explicitly document distinct
purposes and wire both to a shared constant. Update the default_hotkey()
function and VoiceServerConfig to use a shared constant (e.g.,
DEFAULT_DICTATION_HOTKEY or DEFAULT_VOICE_HOTKEY), add a brief doc comment on
the DictationConfig and VoiceServerConfig structs explaining their different
behaviors if you keep distinct keys, and ensure any user-facing docs/examples
reflect the chosen/unified default.
In `@src/openhuman/config/schema/load.rs`:
- Around line 735-739: The code reads OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS
and assigns it to self.dictation.streaming_interval_ms without bounds checking;
add validation after parsing (e.g., parse val to u64, then if ms <
MIN_STREAMING_INTERVAL_MS or ms > MAX_STREAMING_INTERVAL_MS clamp or reject and
fallback to the existing default) and ensure you log or warn when the provided
value is out of range; introduce constants like MIN_STREAMING_INTERVAL_MS and
MAX_STREAMING_INTERVAL_MS (or reuse an existing DEFAULT) and update the
assignment in the block that handles OPENHUMAN_DICTATION_STREAMING_INTERVAL_MS
so only validated/clamped values are written to dictation.streaming_interval_ms.
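One way the validation could look, as a std-only sketch. The constant values and the clamp-rather-than-reject policy are assumptions (the review leaves both open), and the warn call is reduced to a comment.

```rust
// Assumed bounds; the review suggests introducing named constants.
const MIN_STREAMING_INTERVAL_MS: u64 = 100;
const MAX_STREAMING_INTERVAL_MS: u64 = 60_000;

/// Parse the env-var override. Returns None (keep the existing default)
/// for unparseable input, and clamps out-of-range values into bounds.
pub fn parse_streaming_interval(raw: &str) -> Option<u64> {
    let ms: u64 = raw.trim().parse().ok()?;
    if ms < MIN_STREAMING_INTERVAL_MS || ms > MAX_STREAMING_INTERVAL_MS {
        // Real code would tracing::warn! about the out-of-range value here.
        return Some(ms.clamp(MIN_STREAMING_INTERVAL_MS, MAX_STREAMING_INTERVAL_MS));
    }
    Some(ms)
}
```

The caller then assigns `parse_streaming_interval(&val)` to `streaming_interval_ms` only when it returns `Some`, so a garbled value never clobbers the default.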
In `@src/openhuman/config/schema/voice_server.rs`:
- Line 6: VoiceServerConfig imports and uses voice::hotkey::ActivationMode
(variants Tap, Push) while DictationConfig uses a separate
DictationActivationMode (variants Toggle, Push), which is likely an accidental
duplication causing semantic confusion; either consolidate both into one shared
enum (e.g., rename to ActivationMode with unified variants like Tap/Toggle/Push
as appropriate) and update usages in VoiceServerConfig and DictationConfig to
reference the single enum, or keep both but add crate-level doc comments on the
enum definitions clarifying their distinct semantics and when to use each
(update definitions where ActivationMode and DictationActivationMode are
declared and all references in VoiceServerConfig and DictationConfig
accordingly).
In `@src/openhuman/voice/schemas.rs`:
- Around line 341-351: The synthesized Stopped VoiceServerStatus currently sets
hotkey to an empty string; update the status construction in the
server-status/stop handlers (e.g., where VoiceServerStatus is created in
handle_voice_server_stop and handle_voice_server_status) to use a sensible
placeholder such as "(none)" or the configured default
(config.voice_server.hotkey) instead of String::new(), and ensure both
occurrences (the block at ~341-351 and the similar block at ~361-369) are
updated so the UI receives a meaningful hotkey value when no server exists.
- Around line 316-323: The fixed 200ms sleep after spawning the server is
fragile; replace it with a readiness wait that polls try_global_server() (or use
a oneshot/channel signalled by server.run) until the server is available or a
short timeout elapses. Specifically, after tokio::spawn(async move { if let
Err(e) = server.run(&config_clone).await { ... } }); implement a retry loop that
calls try_global_server() with a small sleep (e.g., 10–50ms) between attempts
and an overall timeout (e.g., a few seconds), returning an error if the timeout
elapses; alternatively modify server.run to send a readiness signal on a
oneshot::Sender and await the corresponding Receiver here—use the names
server.run and try_global_server to locate where to change the code.
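The retry loop suggested above can be sketched generically. This std-only version models `try_global_server()` as any closure returning `Option<T>`; it is not the project's actual API, just the shape of the readiness wait with a small poll interval and an overall deadline.

```rust
use std::time::{Duration, Instant};

/// Poll `try_get` until it yields a value or `timeout` elapses,
/// sleeping `poll_every` between attempts (10-50ms in the review's
/// suggestion). Replaces a single fixed sleep with a bounded wait.
pub fn wait_until_ready<T>(
    mut try_get: impl FnMut() -> Option<T>,
    poll_every: Duration,
    timeout: Duration,
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(v) = try_get() {
            return Ok(v);
        }
        if Instant::now() >= deadline {
            return Err("server failed to initialize within timeout".to_string());
        }
        std::thread::sleep(poll_every);
    }
}
```

An async variant would use `tokio::time::sleep` and the same deadline logic, or be replaced entirely by awaiting a `oneshot::Receiver` signalled from `server.run`, which the review also suggests.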
In `@src/openhuman/voice/streaming.rs`:
- Around line 139-158: The review notes unclear ownership of inference_handle
being moved in the Close/Error match arms; make inference_handle mutable and use
Option::take() before aborting so you don't move it repeatedly or depend on
early returns—locate the websocket handling loop (where inference_handle is
matched in the Close/Err arms) and replace direct moves like if let Some(h) =
inference_handle { h.abort(); } with let h = inference_handle.take(); if let
Some(h) = h { h.abort(); } so the variable is consistently consumed and safe to
reference after the loop.
- Around line 199-204: The LLM refinement path can silently return the raw text
(cleanup_transcription in postprocess.rs) leaving clients unaware refinement was
skipped; update the logic around refined_text (the block checking
config.dictation.llm_refinement and calling postprocess::cleanup_transcription)
to detect when cleanup_transcription returned the unchanged text or when the
local LLM wasn't ready and then either (a) set and return an explicit boolean
flag (e.g., "refinement_applied") in the response alongside text/raw_text, or
(b) emit a clear log message indicating refinement was skipped; reference the
refined_text variable and cleanup_transcription function to implement the
detection and add the new response field or log.
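The `Option::take()` pattern from the streaming.rs comment, reduced to a std-only illustration. `FakeHandle` is a stand-in for `tokio::task::JoinHandle` (whose real `abort` takes `&self`); the point is that `take()` consumes the `Option` exactly once while leaving the variable usable afterwards.

```rust
// Stand-in for a spawned task handle.
struct FakeHandle;

impl FakeHandle {
    // Consumes self here purely for illustration.
    fn abort(self) {}
}

/// Consume the handle at most once, leaving None behind so later code
/// can still reference the variable without a move error.
fn abort_if_running(handle: &mut Option<FakeHandle>) -> bool {
    if let Some(h) = handle.take() {
        h.abort();
        true
    } else {
        false
    }
}
```

A second call is a cheap no-op, which is exactly the behavior wanted in the `Close`/`Error` arms: no double-move, no reliance on early returns.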
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c543bfd3-25b1-47d4-b878-d80a71531077
📒 Files selected for processing (23)
- Cargo.toml
- app/src/App.tsx
- app/src/components/DictationHotkeyManager.tsx
- app/src/hooks/useDictationHotkey.ts
- app/src/pages/Conversations.tsx
- app/src/utils/tauriCommands.ts
- src/core/cli.rs
- src/core/jsonrpc.rs
- src/openhuman/config/mod.rs
- src/openhuman/config/ops.rs
- src/openhuman/config/schema/dictation.rs
- src/openhuman/config/schema/load.rs
- src/openhuman/config/schema/mod.rs
- src/openhuman/config/schema/types.rs
- src/openhuman/config/schema/voice_server.rs
- src/openhuman/config/schemas.rs
- src/openhuman/voice/hotkey.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/postprocess.rs
- src/openhuman/voice/schemas.rs
- src/openhuman/voice/server.rs
- src/openhuman/voice/streaming.rs
- src/openhuman/voice/text_input.rs
✅ Files skipped from review due to trivial changes (2)
- Cargo.toml
- src/openhuman/voice/hotkey.rs
🚧 Files skipped from review as they are similar to previous changes (7)
- src/openhuman/config/schema/mod.rs
- src/openhuman/config/schema/types.rs
- src/openhuman/voice/postprocess.rs
- src/openhuman/voice/text_input.rs
- src/core/cli.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/server.rs
- Added a global dictation hotkey listener that activates based on configuration.
- Implemented a web channel bridge to handle dictation events and broadcast them to connected clients.
- Updated the voice module to include the new dictation listener functionality.

This enhances the voice dictation capabilities by ensuring real-time event handling and client communication.
- Updated `Cargo.toml` to mark voice-related dependencies as optional.
- Introduced `VoiceActivationMode` enum for better control over voice server activation.
- Refactored voice server command handling and dictation event broadcasting to support new features.
- Added conditional compilation for voice features across various modules, ensuring they are only included when enabled.

This commit improves the modularity and configurability of the voice server, allowing for more flexible integration and usage.
- Removed unnecessary blank lines in `cli.rs`, `jsonrpc.rs`, `schemas.rs`, and `socketio.rs` to improve code readability.
- Adjusted import order in `mod.rs` for better organization.

This commit enhances the overall code quality by ensuring consistent formatting across multiple files.
…em dependencies

- Added installation of system dependencies (cmake, ALSA, X11) in the Dockerfile for improved build support.
- Updated the GitHub Actions workflow to reflect the new dependencies, ensuring consistent environment setup for testing.

This commit enhances the build environment by including necessary libraries for audio and GUI support.
Actionable comments posted: 1
🧹 Nitpick comments (5)
src/openhuman/voice/dictation_listener.rs (1)
103-123: Consider moving `normalized` and `mode_str` ownership into the spawned task more efficiently. The current implementation clones `normalized` on every event (line 117). Since the hotkey string is constant for the listener's lifetime, you could move it into the closure once.
♻️ Minor optimization to avoid per-event clones
```diff
 let mode_str = match mode {
     ActivationMode::Tap => "toggle",
     ActivationMode::Push => "push",
-};
+}
+.to_string();

 log::info!("{LOG_PREFIX} dictation hotkey active: {normalized}");

 // Forward hotkey events to the broadcast channel.
 tokio::spawn(async move {
     // Keep the listener handle alive for the lifetime of this task.
     let _handle = listener_handle;
     while let Some(event) = hotkey_rx.recv().await {
         let event_type = match event {
             HotkeyEvent::Pressed => "pressed",
             HotkeyEvent::Released => "released",
         };
         log::debug!("{LOG_PREFIX} hotkey {event_type}");
         publish_dictation_event(DictationEvent {
             event_type: event_type.to_string(),
-            hotkey: normalized.clone(),
-            activation_mode: mode_str.to_string(),
+            hotkey: normalized.clone(), // normalized is moved into closure, clone for each event
+            activation_mode: mode_str.clone(),
         });
     }
```
Actually, `normalized` is already moved into the closure — the `.clone()` is needed because the loop runs multiple times. This is already optimal given the broadcast channel requires owned values. No change needed.
On reflection, the current implementation is correct — `normalized` must be cloned per-event since `DictationEvent` owns its strings and the loop iterates multiple times. This is fine for low-frequency hotkey events.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/dictation_listener.rs` around lines 103 - 123, The code review suggests moving ownership of `normalized` and `mode_str` into the spawned task to avoid cloning per event, but after inspection this is unnecessary because `DictationEvent` requires owned Strings and the loop emits multiple events; therefore no code change is required—keep the current `normalized.clone()` and `mode_str.to_string()` usage inside the `tokio::spawn` closure where `publish_dictation_event(DictationEvent { ... })` is called.
app/src/hooks/useDictationHotkey.ts (1)
67-70: Unnecessary `RpcOutcome` wrapper check — `callCoreRpc` already unwraps the result. Per context snippet 2 (`app/src/services/coreRpcClient.ts:180`), `callCoreRpc` returns `json.result as T` directly, not a wrapper object. The server-side `RpcOutcome` is serialized into the JSON-RPC `result` field, which `callCoreRpc` then extracts. This defensive check will never match and can be removed.
♻️ Proposed fix
```diff
-      // Handle RpcOutcome wrapper — the result may be nested in .result
-      const s = (
-        'result' in settings ? (settings as Record<string, unknown>).result : settings
-      ) as DictationSettings;
+      const s = settings;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/hooks/useDictationHotkey.ts` around lines 67 - 70, The conditional that checks for an RpcOutcome wrapper in useDictationHotkey (the 'result' in settings branch that assigns to s) is unnecessary because callCoreRpc already unwraps and returns the inner result; remove the ternary check and simply cast settings to DictationSettings (e.g., assign settings as DictationSettings to s) so code uses the unwrapped value directly and drop the Record<string, unknown> cast/branch.
src/openhuman/voice/schemas.rs (2)
358-374: Consider extracting duplicated stopped-status construction. The synthesized `VoiceServerStatus` for non-running servers is duplicated between `handle_voice_server_stop` (lines 346-352) and `handle_voice_server_status` (lines 364-370).
♻️ Extract helper function
```diff
+fn stopped_status() -> crate::openhuman::voice::server::VoiceServerStatus {
+    crate::openhuman::voice::server::VoiceServerStatus {
+        state: crate::openhuman::voice::server::ServerState::Stopped,
+        hotkey: String::new(),
+        activation_mode: crate::openhuman::voice::hotkey::ActivationMode::Tap,
+        transcription_count: 0,
+        last_error: None,
+    }
+}

 fn handle_voice_server_stop(_params: Map<String, Value>) -> ControllerFuture {
     Box::pin(async move {
         if let Some(server) = crate::openhuman::voice::server::try_global_server() {
             // ...
         } else {
-            let status = crate::openhuman::voice::server::VoiceServerStatus {
-                state: crate::openhuman::voice::server::ServerState::Stopped,
-                hotkey: String::new(),
-                activation_mode: crate::openhuman::voice::hotkey::ActivationMode::Tap,
-                transcription_count: 0,
-                last_error: None,
-            };
-            serde_json::to_value(status).map_err(|e| format!("serialize error: {e}"))
+            serde_json::to_value(stopped_status()).map_err(|e| format!("serialize error: {e}"))
         }
     })
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/schemas.rs` around lines 358 - 374, Both handle_voice_server_status and handle_voice_server_stop duplicate construction of a stopped VoiceServerStatus object; extract a small helper (e.g., fn default_stopped_status() -> crate::openhuman::voice::server::VoiceServerStatus) that returns the same VoiceServerStatus with ServerState::Stopped, empty hotkey, ActivationMode::Tap, transcription_count 0 and last_error None, then call that helper from both handle_voice_server_status and handle_voice_server_stop and serialize its result with serde_json::to_value as before.
325-333: Sleep-based synchronization is fragile but pragmatic.

The 200ms sleep is a heuristic to wait for the spawned server task to initialize. If initialization takes longer, the returned status may be stale. Consider documenting this limitation or using a synchronization primitive (e.g., oneshot channel) for deterministic readiness signaling in a future iteration.
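The sleep-then-check pattern can be replaced with an explicit readiness signal. Below is a minimal sketch of the idea using std threads and an mpsc channel in place of tokio's oneshot; the `start_server` name and the 50ms stand-in for init work are hypothetical, not the PR's actual code:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Instead of sleeping a fixed 200ms and hoping the server task has
// initialized, the spawned task sends a signal once setup completes,
// and the caller blocks until that signal arrives.
fn start_server() -> mpsc::Receiver<()> {
    let (ready_tx, ready_rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // stand-in for real init work
        let _ = ready_tx.send(()); // signal readiness exactly once
        // ... server loop would run here ...
    });
    ready_rx
}

fn main() {
    let ready = start_server();
    // Deterministic: returns as soon as init finishes, and errors
    // if the task died before signaling (instead of returning stale status).
    ready.recv().expect("server task exited before signaling readiness");
    println!("server ready");
}
```

With tokio the shape is the same: the spawn path takes a `oneshot::Sender`, and the RPC handler awaits the `oneshot::Receiver` before calling `try_global_server()`.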
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/schemas.rs` around lines 325 - 333, The 200ms tokio::time::sleep used before calling crate::openhuman::voice::server::try_global_server() is a fragile heuristic; replace it by implementing deterministic readiness signaling: change the server spawn path to send a oneshot::Sender (or similar) when initialization completes and in this routine await the corresponding oneshot::Receiver before calling try_global_server() and s.status().await, or at minimum add a comment noting the race and the 200ms limitation next to the sleep to document the fragility of this approach.

src/core/cli.rs (1)
258-267: Config load errors are silently swallowed; consider logging a warning.

When `Config::load_or_init()` fails, `unwrap_or_default()` silently falls back to defaults. The user won't know their config file has syntax errors or permission issues. Additionally, `--mode` silently defaults to `Tap` for any unrecognized value (including typos like "Push" or "psh").

💡 Proposed improvement
```diff
 rt.block_on(async {
-    let mut config = crate::openhuman::config::Config::load_or_init()
-        .await
-        .unwrap_or_default();
+    let mut config = match crate::openhuman::config::Config::load_or_init().await {
+        Ok(c) => c,
+        Err(e) => {
+            log::warn!("[voice] failed to load config, using defaults: {e}");
+            crate::openhuman::config::Config::default()
+        }
+    };
     config.apply_env_overrides();
     let activation_mode = match mode.as_deref() {
         Some("push") => ActivationMode::Push,
-        _ => ActivationMode::Tap,
+        Some("tap") | None => ActivationMode::Tap,
+        Some(other) => {
+            log::warn!("[voice] unrecognized --mode '{other}', defaulting to tap");
+            ActivationMode::Tap
+        }
     };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/cli.rs` around lines 258 - 267, When loading configuration, don't silently swallow errors from Config::load_or_init(); instead propagate or log them: capture the Result error from crate::openhuman::config::Config::load_or_init() and, if Err, call the global logger (or e.g., eprintln!) with a concise warning including the error and note that defaults are being used before calling apply_env_overrides on the fallback Config; likewise make parsing of the CLI `mode` variable and setting of activation_mode (ActivationMode::Push / ActivationMode::Tap) case-insensitive and explicit—accept known values using mode.as_deref().map(|s| s.to_lowercase()) (or match on both cases) and log a warning when an unrecognized value is provided before defaulting to ActivationMode::Tap so typos are clearly visible.
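The case-insensitive, warn-on-typo parsing the prompt describes can be sketched in isolation. The `parse_mode` helper is hypothetical (in the PR the match sits inline in the voice command handler), and `eprintln!` stands in for the project's logger:

```rust
#[derive(Debug, PartialEq)]
enum ActivationMode {
    Tap,
    Push,
}

// Lowercase before matching so "Push", "PUSH", and "push" all work,
// and warn on anything unrecognized instead of silently defaulting.
fn parse_mode(mode: Option<&str>) -> ActivationMode {
    match mode.map(|s| s.to_ascii_lowercase()).as_deref() {
        Some("push") => ActivationMode::Push,
        Some("tap") | None => ActivationMode::Tap,
        Some(other) => {
            eprintln!("[voice] unrecognized --mode '{other}', defaulting to tap");
            ActivationMode::Tap
        }
    }
}

fn main() {
    assert_eq!(parse_mode(Some("Push")), ActivationMode::Push); // case-insensitive
    assert_eq!(parse_mode(None), ActivationMode::Tap); // flag omitted
    assert_eq!(parse_mode(Some("psh")), ActivationMode::Tap); // warns, then defaults
}
```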
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/voice/schemas.rs`:
- Around line 297-316: The current separate check (try_global_server()) and init
(global_server(server_config)) allow a TOCTOU race where two concurrent
voice_server_start calls can both pass the check and one wins with its config;
fix by serializing initialization: introduce a global initialization mutex
(e.g., SERVER_INIT_MUTEX) and acquire it at the start of voice_server_start (or
the public start path) before calling try_global_server() and before calling
global_server(server_config), perform the same existing_state vs server_config
comparison while holding the lock, then call global_server(server_config) and
release the lock; alternatively, move the config validation into the
OnceCell::get_or_init closure used by global_server so the chosen config is
validated atomically inside the init path (use ServerState, try_global_server,
global_server, server_config, and voice_server_start as the reference points).
---
Nitpick comments:
In `@app/src/hooks/useDictationHotkey.ts`:
- Around line 67-70: The conditional that checks for an RpcOutcome wrapper in
useDictationHotkey (the 'result' in settings branch that assigns to s) is
unnecessary because callCoreRpc already unwraps and returns the inner result;
remove the ternary check and simply cast settings to DictationSettings (e.g.,
assign settings as DictationSettings to s) so code uses the unwrapped value
directly and drop the Record<string, unknown> cast/branch.
In `@src/core/cli.rs`:
- Around line 258-267: When loading configuration, don't silently swallow errors
from Config::load_or_init(); instead propagate or log them: capture the Result
error from crate::openhuman::config::Config::load_or_init() and, if Err, call
the global logger (or e.g., eprintln!) with a concise warning including the
error and note that defaults are being used before calling apply_env_overrides
on the fallback Config; likewise make parsing of the CLI `mode` variable and
setting of activation_mode (ActivationMode::Push / ActivationMode::Tap)
case-insensitive and explicit—accept known values using mode.as_deref().map(|s|
s.to_lowercase()) (or match on both cases) and log a warning when an
unrecognized value is provided before defaulting to ActivationMode::Tap so typos
are clearly visible.
In `@src/openhuman/voice/dictation_listener.rs`:
- Around line 103-123: The code review suggests moving ownership of `normalized`
and `mode_str` into the spawned task to avoid cloning per event, but after
inspection this is unnecessary because `DictationEvent` requires owned Strings
and the loop emits multiple events; therefore no code change is required—keep
the current `normalized.clone()` and `mode_str.to_string()` usage inside the
`tokio::spawn` closure where `publish_dictation_event(DictationEvent { ... })`
is called.
In `@src/openhuman/voice/schemas.rs`:
- Around line 358-374: Both handle_voice_server_status and
handle_voice_server_stop duplicate construction of a stopped VoiceServerStatus
object; extract a small helper (e.g., fn default_stopped_status() ->
crate::openhuman::voice::server::VoiceServerStatus) that returns the same
VoiceServerStatus with ServerState::Stopped, empty hotkey, ActivationMode::Tap,
transcription_count 0 and last_error None, then call that helper from both
handle_voice_server_status and handle_voice_server_stop and serialize its result
with serde_json::to_value as before.
- Around line 325-333: The 200ms tokio::time::sleep used before calling
crate::openhuman::voice::server::try_global_server() is a fragile heuristic;
replace it by implementing deterministic readiness signaling: change the server
spawn path to send a oneshot::Sender (or similar) when initialization completes
and in this routine await the corresponding oneshot::Receiver before calling
try_global_server() and s.status().await, or at minimum add a comment noting the
race and the 200ms limitation next to the sleep to document the fragility of
this approach.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b981e000-bc36-4ea7-baa0-d9f7d0b766ad
📒 Files selected for processing (12)
- .github/Dockerfile
- .github/workflows/test.yml
- app/src/hooks/useDictationHotkey.ts
- src/core/cli.rs
- src/core/jsonrpc.rs
- src/core/socketio.rs
- src/openhuman/config/mod.rs
- src/openhuman/config/schema/mod.rs
- src/openhuman/config/schema/voice_server.rs
- src/openhuman/voice/dictation_listener.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/schemas.rs
✅ Files skipped from review due to trivial changes (1)
- .github/Dockerfile
🚧 Files skipped from review as they are similar to previous changes (4)
- src/openhuman/config/mod.rs
- src/openhuman/config/schema/mod.rs
- src/openhuman/voice/mod.rs
- src/core/jsonrpc.rs
```rust
// Check if a server is already running with a different config.
if let Some(existing) = crate::openhuman::voice::server::try_global_server() {
    let existing_status = existing.status().await;
    if existing_status.state != crate::openhuman::voice::server::ServerState::Stopped {
        if existing_status.hotkey != server_config.hotkey
            || existing_status.activation_mode != server_config.activation_mode
        {
            return Err(format!(
                "voice server already running (hotkey={}, mode={:?}); \
                 stop it first before starting with different config",
                existing_status.hotkey, existing_status.activation_mode
            ));
        }
        // Same config, already running — return current status.
        return serde_json::to_value(existing_status)
            .map_err(|e| format!("serialize error: {e}"));
    }
}

let server = global_server(server_config);
```
TOCTOU race condition in server initialization check.
The check at line 298 (try_global_server()) and initialization at line 316 (global_server(server_config)) are not atomic. Two concurrent voice_server_start calls with different configs could both pass the check (seeing None), then one wins the OnceCell::get_or_init race while the other silently gets the first caller's config.
The window is narrow, and OnceCell ensures memory safety, but the "different config" validation can be bypassed. Since this is a user-facing CLI/RPC and concurrent starts are unlikely, this is low-severity.
🔒 Potential fix using a mutex for initialization
```diff
+use std::sync::Mutex;
+use once_cell::sync::Lazy;
+
+static START_LOCK: Lazy<Mutex<()>> = Lazy::new(|| Mutex::new(()));
+
 fn handle_voice_server_start(params: Map<String, Value>) -> ControllerFuture {
     Box::pin(async move {
+        // Serialize start attempts to prevent TOCTOU races
+        let _guard = START_LOCK.lock().map_err(|e| format!("lock error: {e}"))?;
+
         use crate::openhuman::voice::hotkey::ActivationMode;
         // ... rest of handler
```

Alternatively, accept that two concurrent starts may result in the first config winning (document this behavior).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/schemas.rs` around lines 297 - 316, The current separate
check (try_global_server()) and init (global_server(server_config)) allow a
TOCTOU race where two concurrent voice_server_start calls can both pass the
check and one wins with its config; fix by serializing initialization: introduce
a global initialization mutex (e.g., SERVER_INIT_MUTEX) and acquire it at the
start of voice_server_start (or the public start path) before calling
try_global_server() and before calling global_server(server_config), perform the
same existing_state vs server_config comparison while holding the lock, then
call global_server(server_config) and release the lock; alternatively, move the
config validation into the OnceCell::get_or_init closure used by global_server
so the chosen config is validated atomically inside the init path (use
ServerState, try_global_server, global_server, server_config, and
voice_server_start as the reference points).
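The prompt's alternative fix, validating the requested config against whatever config won the atomic init, can be sketched with std's `OnceLock`, whose `get_or_init` runs its closure at most once. This is a simplification: it uses std instead of the crate's `once_cell`, reduces the config to a single hotkey field, and the names only loosely mirror the real `global_server`/`try_global_server` pair:

```rust
use std::sync::OnceLock;

struct VoiceServer {
    hotkey: String,
}

static SERVER: OnceLock<VoiceServer> = OnceLock::new();

// There is no separate "is it initialized?" check to race against:
// get_or_init atomically picks exactly one winner, and every caller
// then compares the winning config against what it requested.
fn global_server(hotkey: &str) -> Result<&'static VoiceServer, String> {
    let server = SERVER.get_or_init(|| VoiceServer {
        hotkey: hotkey.to_string(),
    });
    if server.hotkey != hotkey {
        return Err(format!(
            "voice server already running (hotkey={}); stop it first",
            server.hotkey
        ));
    }
    Ok(server)
}

fn main() {
    assert!(global_server("ctrl+shift+space").is_ok()); // first caller wins
    assert!(global_server("ctrl+shift+space").is_ok()); // same config: reuse
    assert!(global_server("ctrl+alt+v").is_err()); // different config: rejected
}
```

Unlike the mutex variant, a losing caller here gets an explicit error rather than silently running with the winner's config.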
Summary
- CLI: `openhuman voice [--hotkey <combo>] [--mode <tap|push>]`
- RPC methods: `voice_server_start`, `voice_server_stop`, `voice_server_status`

New files

- src/openhuman/voice/audio_capture.rs
- src/openhuman/voice/hotkey.rs
- src/openhuman/voice/text_input.rs
- src/openhuman/voice/server.rs
- src/openhuman/config/schema/voice_server.rs

Dependencies added

- `cpal` (audio capture), `hound` (WAV encoding), `enigo` (text insertion), `rdev` (global hotkey)

Test plan

- `cargo test` — 38+ voice tests
- `cargo check` clean (zero errors, zero warnings)
- `openhuman voice --help`
- `openhuman voice` and dictate with Ctrl+Shift+Space
- `--mode push`

🤖 Generated with Claude Code
Summary by CodeRabbit