fix(voice): reduce dictation hallucinations and improve Fn/focus reliability (#385) #409
Conversation
…inyhumansai#385) Reject whisper segments with avg token log-probability below -0.7 or entropy above 2.4. Return TranscriptionResult with confidence metadata instead of plain String. Update callers in speech.rs and streaming.rs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…#385) Base model produces significantly fewer hallucinations than tiny, especially in noisy/quiet conditions. User can still override via config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ai#385) Gate sustained silence (>500ms) from being sent to whisper to prevent hallucinations. Maintain 100ms look-ahead ring buffer so speech onset after pauses is not clipped. Thresholds adapt to source sample rate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nyhumansai#385) start_recording() blocks 1-7s on cpal device init but macOS fires Fn Release almost immediately, causing skipped cycles. Move recording start to spawn_blocking so the event loop stays responsive. Buffer Release events during setup and ensure minimum 1.5s recording duration when release arrives before recording handle is ready. Also includes: capture focused app on hotkey press, pass through pipeline for focus validation before paste. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ss (tinyhumansai#385) Add expected_app parameter to insert_text(). Before Cmd+V, validate focus via accessibility API and restore via AppleScript if shifted. Don't abort paste on focus validation failure — attempt insertion regardless so text is never silently lost. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
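The confidence gating described in the first commit can be sketched as follows. This is a rough stand-in, not the actual `whisper_engine.rs` code: `Segment`, `filter_segments`, and the entropy proxy are assumed names, and only the -0.7 / 2.4 thresholds come from the commit message.

```rust
// Sketch of per-segment confidence filtering, assuming whisper exposes
// per-token natural-log probabilities. Types are illustrative only.

const MIN_AVG_LOGPROB: f32 = -0.7;
const MAX_ENTROPY: f32 = 2.4;

struct Segment {
    text: String,
    token_logprobs: Vec<f32>,
}

struct TranscriptionResult {
    text: String,
    avg_logprob: f32,
    segments_kept: usize,
    segments_dropped: usize,
}

fn avg_logprob(lps: &[f32]) -> f32 {
    lps.iter().sum::<f32>() / lps.len().max(1) as f32
}

/// Rough entropy proxy computed from the chosen-token probabilities only;
/// the real metric would use the full predictive distribution per step.
fn entropy_proxy(lps: &[f32]) -> f32 {
    lps.iter().map(|lp| -lp.exp() * lp).sum()
}

fn filter_segments(segments: Vec<Segment>) -> TranscriptionResult {
    let mut kept_text = Vec::new();
    let mut kept_lps: Vec<f32> = Vec::new();
    let mut dropped = 0;
    for seg in segments {
        let lp = avg_logprob(&seg.token_logprobs);
        if lp < MIN_AVG_LOGPROB || entropy_proxy(&seg.token_logprobs) > MAX_ENTROPY {
            dropped += 1; // likely hallucination: low confidence or high spread
        } else {
            kept_lps.extend(seg.token_logprobs.iter());
            kept_text.push(seg.text);
        }
    }
    TranscriptionResult {
        text: kept_text.join(" "),
        avg_logprob: avg_logprob(&kept_lps),
        segments_kept: kept_text.len(),
        segments_dropped: dropped,
    }
}
```

Returning the metadata instead of a plain `String` is what lets callers in `speech.rs` and `streaming.rs` log confidence or suppress low-quality partials.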
📝 Walkthrough

Unwrapped voice RPC return shapes in frontend, changed frontend state reads to use refs, updated tests and tauri RPC wrappers; introduced per-segment transcription results and confidence metrics; added RMS-based silence gating and buffering in audio capture; altered hotkey/recording control flow (macOS focus capture and deferred stop); updated default STT model.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant Hotkey as Hotkey Loop
participant Capture as Audio Capture
participant Gate as Silence Gate
participant Stream as Streaming Engine
participant Whisper as Whisper Engine
participant Server as Voice Server
participant TextInput as Text Input
User->>Hotkey: Press hotkey
Hotkey->>Hotkey: capture_expected_app_name (macOS)
Hotkey->>Server: request start recording (spawn blocking task)
Server->>Capture: start_recording (blocking thread)
Capture->>Capture: CPAL collects chunk
Capture->>Gate: process(mono_chunk)
Gate-->>Capture: gated_chunk (or drop)
Capture->>Stream: append audio to buffers (audio_buf + full_audio_buf)
Stream->>Whisper: trigger partial transcription (on revision change)
Whisper-->>Stream: TranscriptionResult { text, avg_logprob, segments_* }
Stream->>Stream: log avg_logprob, maybe show partial
User->>Hotkey: Release hotkey
Hotkey->>Server: signal stop (may be buffered if setup pending)
Server->>Stream: finalize using full_audio_buf
Stream->>Whisper: final transcription
Whisper-->>Stream: final TranscriptionResult
Stream->>TextInput: insert_text(result.text, expected_app)
TextInput->>TextInput: validate/restore app focus (macOS) then paste
TextInput-->>User: Text inserted
sequenceDiagram
participant CPAL as CPAL Input
participant Gate as SilenceGate
participant Lookahead as Lookahead Buffer
participant Buffer as Recording Buffer
loop each audio chunk
CPAL->>Gate: process(chunk)
Gate->>Gate: compute chunk_rms()
alt above threshold or within lookahead
Gate->>Lookahead: preserve chunk
alt lookahead full
Lookahead->>Buffer: flush preserved audio
end
Gate-->>CPAL: output non-empty
else sustained silence
Gate->>Gate: increment silence duration
alt silence > SILENCE_GATE_MS
Gate-->>CPAL: drop chunk (suppress)
else
Gate-->>CPAL: output from lookahead
end
end
end
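The gate modeled in the diagram above can be sketched roughly as below, assuming the 500 ms gate and 100 ms look-ahead from the commit messages. `SilenceGate`'s real fields in `audio_capture.rs` may differ; the RMS threshold is taken as a constructor parameter here, as one review comment suggests.

```rust
const SILENCE_GATE_MS: usize = 500; // sustained silence before suppression
const LOOKAHEAD_MS: usize = 100;    // pre-speech audio replayed on onset

fn chunk_rms(chunk: &[f32]) -> f32 {
    if chunk.is_empty() {
        return 0.0;
    }
    (chunk.iter().map(|s| s * s).sum::<f32>() / chunk.len() as f32).sqrt()
}

struct SilenceGate {
    threshold: f32,
    gate_samples: usize,   // silent samples required before dropping audio
    silent_samples: usize, // running count of consecutive silent samples
    lookahead: Vec<f32>,   // bounded tail of suppressed audio
    lookahead_cap: usize,
}

impl SilenceGate {
    fn new(sample_rate: u32, threshold: f32) -> Self {
        Self {
            threshold,
            gate_samples: (sample_rate as usize * SILENCE_GATE_MS / 1000).max(1),
            silent_samples: 0,
            lookahead: Vec::new(),
            lookahead_cap: (sample_rate as usize * LOOKAHEAD_MS / 1000).max(1),
        }
    }

    /// Returns the audio to forward to whisper; empty when suppressed.
    fn process(&mut self, chunk: &[f32]) -> Vec<f32> {
        if chunk_rms(chunk) >= self.threshold {
            // Speech: flush the look-ahead so onset after a pause isn't clipped.
            self.silent_samples = 0;
            let mut out = std::mem::take(&mut self.lookahead);
            out.extend_from_slice(chunk);
            out
        } else {
            self.silent_samples += chunk.len();
            if self.silent_samples >= self.gate_samples {
                // Sustained silence: keep only a bounded look-ahead tail.
                self.lookahead.extend_from_slice(chunk);
                let overflow = self.lookahead.len().saturating_sub(self.lookahead_cap);
                self.lookahead.drain(..overflow);
                Vec::new()
            } else {
                chunk.to_vec() // brief pause: pass through
            }
        }
    }
}
```

Deriving `gate_samples` and `lookahead_cap` from the source sample rate is what makes the thresholds adapt when the input device is not 16 kHz.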
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/src/components/settings/panels/VoicePanel.tsx (1)
38-57: ⚠️ Potential issue | 🟠 Major: The polling guard still uses mount-time state.

`loadData()` closes over `settings` and `savedSettings`, but the interval is created once in `useEffect([])`. After the first render, the timer keeps calling the stale closure, so the guard on Lines 49-53 can still overwrite unsaved edits every 2 seconds.

Also applies to: 69-75
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/settings/panels/VoicePanel.tsx` around lines 38 - 57, loadData currently closes over mount-time settings and savedSettings so the polling interval created in useEffect([]) uses stale values and can clobber edits; update loadData to read the latest values via refs or by moving the interval creation so it reads current state: create React refs (e.g., settingsRef and savedSettingsRef), update them whenever setSettings/setSavedSettings run, and in loadData (used by the 2s timer) check settingsRef.current and savedSettingsRef.current instead of the closed-over settings/savedSettings; alternatively, define the interval inside a useEffect that depends on settings/savedSettings so the closure is fresh — ensure references to loadData, setSettings, setSavedSettings, settings, and savedSettings are updated accordingly.
🧹 Nitpick comments (1)
app/src/utils/tauriCommands/voice.ts (1)
63-82: Prefer arrow exports for these updated RPC helpers.

The return-type change looks fine, but the touched helpers still use function declarations. Converting them to `const ... = async () =>` keeps this module aligned with the repo's TypeScript style.

♻️ Suggested refactor

```diff
-export async function openhumanVoiceServerStatus(): Promise<VoiceServerStatus> {
+export const openhumanVoiceServerStatus = async (): Promise<VoiceServerStatus> => {
   return await callCoreRpc<VoiceServerStatus>({
     method: 'openhuman.voice_server_status',
     params: {},
   });
-}
+};

-export async function openhumanVoiceServerStart(params?: {
+export const openhumanVoiceServerStart = async (params?: {
   hotkey?: string;
   activation_mode?: 'tap' | 'push';
   skip_cleanup?: boolean;
-}): Promise<VoiceServerStatus> {
+}): Promise<VoiceServerStatus> => {
   return await callCoreRpc<VoiceServerStatus>({
     method: 'openhuman.voice_server_start',
     params: params ?? {},
   });
-}
+};

-export async function openhumanVoiceServerStop(): Promise<VoiceServerStatus> {
+export const openhumanVoiceServerStop = async (): Promise<VoiceServerStatus> => {
   return await callCoreRpc<VoiceServerStatus>({
     method: 'openhuman.voice_server_stop',
     params: {},
   });
-}
+};
```

As per coding guidelines, `**/*.{js,jsx,ts,tsx}`: Prefer arrow functions over function declarations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/utils/tauriCommands/voice.ts` around lines 63 - 82, Convert the three exported RPC helper functions to arrow-function exports: change the function declarations openhumanVoiceServerStatus, openhumanVoiceServerStart, and openhumanVoiceServerStop into const <name> = async () => style (exported), preserving their signatures, return types (Promise<VoiceServerStatus>), and the existing callCoreRpc calls/params; ensure exports remain named and behavior is unchanged so the module follows the repo's TypeScript arrow-function style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/openhuman/local_ai/model_ids.rs`:
- Around line 49-52: The effective_stt_model_id() fallback was changed to
"ggml-base-q5_1.bin" but the schema defaults for stt_model_id and
stt_download_url still set "ggml-tiny-q5_1.bin", causing new/empty configs to be
inconsistent; update the defaults in the local_ai schema (the stt_model_id and
corresponding stt_download_url default values) to match the new fallback
("ggml-base-q5_1.bin") so downloader config and effective_stt_model_id() are
consistent for fresh and empty-ID configs.
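A minimal sketch of the fallback under discussion, assuming `effective_stt_model_id` substitutes the default only when the configured ID is empty (the real signature in `model_ids.rs` may differ). The point of the finding is that this constant and the schema defaults must name the same model.

```rust
// Assumed default; must agree with the schema defaults for stt_model_id
// and stt_download_url, or fresh configs download one model and load another.
const DEFAULT_STT_MODEL: &str = "ggml-base-q5_1.bin";

fn effective_stt_model_id(configured: Option<&str>) -> String {
    match configured {
        Some(id) if !id.trim().is_empty() => id.to_string(),
        _ => DEFAULT_STT_MODEL.to_string(),
    }
}
```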
In `@src/openhuman/voice/audio_capture.rs`:
- Around line 21-22: The gate currently uses a hardcoded SILENCE_RMS_THRESHOLD
(const) while the UI exposes a configurable silence_threshold; update the code
to accept and pass that configured value into the live gate: add a threshold:
f32 parameter to SilenceGate::new (and any SilenceGate constructors), remove or
keep the const as a default only, and thread that parameter through
start_recording (and any call sites that construct SilenceGate) so the saved
silence_threshold from the UI is passed when creating the SilenceGate instance
instead of comparing against 0.002.
In `@src/openhuman/voice/server.rs`:
- Around line 253-273: The sleep that enforces MIN_RECORDING_AFTER_SETUP must be
removed from the main hotkey select! loop to avoid blocking cancellation and
hotkey handling; instead, when pending_release is observed set recording and
recording_expected_app as you do, then spawn a detached async task (or create a
deadline branch) that awaits tokio::time::sleep(MIN_RECORDING_AFTER_SETUP) and
after the delay calls self.spawn_process_recording(handle, app_config,
expected_app) with the moved handle/expected_app (or fails fast if cancelled via
a shared cancellation channel/flag). Apply the same pattern for the similar
block around lines 297-300 so both deferred-stop timers run off-thread and do
not block the hotkey/select! loop.
- Around line 192-205: The hotkey handler currently drops stop/toggle presses
when recording_pending_rx.is_some(), losing real stop requests; change it to
record the stop intent and apply it once the pending start completes: introduce
a small flag or channel (e.g., recording_stop_pending boolean or
recording_stop_tx) that you set when a hotkey arrives while
recording_pending_rx.is_some(), and modify the start_recording flow (where the
new Recording handle and recording_expected_app are installed) to check that
flag and immediately call spawn_process_recording(handle, app_config,
recording_expected_app.take()) if set; do the same fix for the analogous block
at lines 235-239 so pending stop presses are honored.
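The deferred-stop bookkeeping in the two server.rs findings above can be modeled with a small state struct. The real code uses channels inside a `select!` loop, so `stop_pending` and `on_recording_ready` are illustrative names only.

```rust
/// Minimal model of the hotkey/recording handshake: a release that arrives
/// while the blocking start is still pending is remembered, not dropped.
#[derive(Default)]
struct HotkeyState {
    recording_pending: bool, // start_recording() still initializing
    recording: bool,
    stop_pending: bool, // release arrived during initialization
}

impl HotkeyState {
    fn on_press(&mut self) {
        self.recording_pending = true;
    }

    fn on_release(&mut self) {
        if self.recording_pending {
            // Device init not finished yet: buffer the intent.
            self.stop_pending = true;
        } else {
            self.recording = false;
        }
    }

    /// Called once start_recording() hands back its handle.
    /// Returns true if the recording should be finalized immediately
    /// (after the minimum post-setup capture window elapses).
    fn on_recording_ready(&mut self) -> bool {
        self.recording_pending = false;
        self.recording = true;
        if self.stop_pending {
            self.stop_pending = false;
            self.recording = false;
            return true; // honor the buffered release
        }
        false
    }
}
```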
In `@src/openhuman/voice/streaming.rs`:
- Around line 132-143: The current sliding-window logic trims audio_buf
(protected by audio_buf.lock()) to MAX_STREAM_BUFFER_SAMPLES which is then later
used to produce the final "final" transcript, causing the start of long
dictations to be lost; fix it by keeping two separate buffers: keep the existing
capped buffer (audio_buf or sliding_buffer) for interim/partial results and add
a persistent full_audio buffer (e.g., full_audio_vec protected by its own mutex
or appended before any drain) that accumulates all incoming samples without
trimming; update producers/consumers so interim logic uses
audio_buf/sliding_buffer and the final-result generation (the code that
currently reads audio_buf to produce "final") reads from full_audio_vec and
audio_revision handling remains consistent.
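The two-buffer fix for streaming.rs reads roughly like this. `MAX_STREAM_BUFFER_SAMPLES` is assumed to be a ~30 s cap at 16 kHz, and the struct is a simplified stand-in for the real locking around `audio_buf` and `full_audio_buf`.

```rust
const MAX_STREAM_BUFFER_SAMPLES: usize = 16_000 * 30; // assumed cap (~30 s @ 16 kHz)

/// Interim transcription works on a capped sliding window, while the
/// final transcript is produced from an untrimmed copy of all audio.
#[derive(Default)]
struct StreamBuffers {
    sliding: Vec<f32>, // trimmed to MAX_STREAM_BUFFER_SAMPLES
    full: Vec<f32>,    // never trimmed; read by the final pass
}

impl StreamBuffers {
    fn push(&mut self, chunk: &[f32]) {
        self.full.extend_from_slice(chunk);
        self.sliding.extend_from_slice(chunk);
        // Drop the oldest samples so partial decodes stay cheap.
        let overflow = self.sliding.len().saturating_sub(MAX_STREAM_BUFFER_SAMPLES);
        self.sliding.drain(..overflow);
    }

    fn interim_window(&self) -> &[f32] {
        &self.sliding
    }

    fn final_audio(&self) -> &[f32] {
        &self.full
    }
}
```

Trimming only the sliding copy is what keeps the start of long dictations out of the failure mode the comment describes.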
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4d6b247f-ab83-4284-aa7e-9cca9d6a4314
⛔ Files ignored due to path filters (1)
`app/src-tauri/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (10)
- app/src/components/settings/panels/VoicePanel.tsx
- app/src/components/settings/panels/__tests__/VoicePanel.test.tsx
- app/src/utils/tauriCommands/voice.ts
- src/openhuman/local_ai/model_ids.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/whisper_engine.rs
- src/openhuman/voice/audio_capture.rs
- src/openhuman/voice/server.rs
- src/openhuman/voice/streaming.rs
- src/openhuman/voice/text_input.rs
```rust
/// RMS threshold below which audio is considered silence.
const SILENCE_RMS_THRESHOLD: f32 = 0.002;
```
Wire the configured silence threshold into the live gate.
The new gate compares against a fixed 0.002 on Lines 21-22/65-66, but SilenceGate::new() and start_recording() take no threshold input. The UI still exposes a configurable silence threshold on Lines 300-318 of app/src/components/settings/panels/VoicePanel.tsx, so quieter users can lower the setting and still have speech discarded here before Whisper ever sees it.
💡 Minimal shape for making the gate configurable

```diff
-const SILENCE_RMS_THRESHOLD: f32 = 0.002;
-
 struct SilenceGate {
+    threshold: f32,
     source_sample_rate: u32,
@@
-    fn new(source_sample_rate: u32) -> Self {
+    fn new(source_sample_rate: u32, threshold: f32) -> Self {
         let gate_samples = ((source_sample_rate as usize * SILENCE_GATE_MS) / 1000).max(1);
         let lookahead_samples = ((source_sample_rate as usize * LOOKAHEAD_MS) / 1000).max(1);
         Self {
+            threshold,
             source_sample_rate,
@@
-        let is_silent = rms < SILENCE_RMS_THRESHOLD;
+        let is_silent = rms < self.threshold;
```

Then pass the saved silence_threshold down when constructing SilenceGate.
Also applies to: 48-66, 223-229
🧹 Nitpick comments (1)
src/openhuman/voice/server.rs (1)
460-469: Consider extracting `process_recording_bg` arguments into a struct.

The `#[allow(clippy::too_many_arguments)]` suppress is acceptable for now, but bundling `state`, `transcription_count`, `last_error`, `recent_transcripts`, and `expected_app` into a context struct would improve maintainability if more parameters are added later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/server.rs` around lines 460 - 469, Extract the clustered arguments into a context struct (e.g., RecordingContext or ProcessRecordingContext) and replace the multiple parameters in async fn process_recording_bg(...) with a single context parameter; include fields state: Arc<Mutex<ServerState>>, transcription_count: Arc<AtomicU64>, last_error: Arc<Mutex<Option<String>>>, recent_transcripts: Arc<Mutex<Vec<String>>>, and expected_app: Option<String> in that struct, update all call sites that invoke process_recording_bg to construct and pass the new context instance, and adjust any code inside process_recording_bg that referenced the old parameter names to use the struct field access instead.
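As a sketch, the suggested context struct might look like this. The `state: Arc<Mutex<ServerState>>` field is omitted because `ServerState` is project-specific, and `record_success` is a hypothetical helper showing how call sites shrink once the handles travel together.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Bundles the shared handles previously passed as separate arguments
/// to process_recording_bg (field names per the review; layout assumed).
struct RecordingContext {
    transcription_count: Arc<AtomicU64>,
    last_error: Arc<Mutex<Option<String>>>,
    recent_transcripts: Arc<Mutex<Vec<String>>>,
    expected_app: Option<String>,
}

/// Hypothetical helper: one success path mutates everything via the context.
fn record_success(ctx: &RecordingContext, transcript: String) {
    ctx.transcription_count.fetch_add(1, Ordering::SeqCst);
    ctx.recent_transcripts.lock().unwrap().push(transcript);
    *ctx.last_error.lock().unwrap() = None;
}
```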
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2231ca11-e488-48eb-b260-17973bde5683
⛔ Files ignored due to path filters (1)
`app/src-tauri/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- app/src/components/settings/panels/VoicePanel.tsx
- src/openhuman/config/schema/local_ai.rs
- src/openhuman/voice/server.rs
- src/openhuman/voice/streaming.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/config/schema/local_ai.rs
Summary
#385: voice dictation reliability issues across STT quality, hotkey timing, and insertion target behavior.

Problem
- `start_recording()` could block long enough that `Released` arrived before recording became active, causing skipped or too-short captures.
- Frontend used `CommandResponse<T>` for server status/start/stop while the backend returns a flat `VoiceServerStatus`, breaking UI voice server controls.

Solution
- Per-segment confidence filtering in `whisper_engine`.
- Moved recording start off the event loop (`spawn_blocking` + pending handle channel).
- Buffered `Released` events and applied deferred-stop handling with minimum post-setup capture window.
- Updated `voice.ts`, `VoicePanel.tsx`, and `VoicePanel.test.tsx` for flat `VoiceServerStatus` responses.

Submission Checklist
- … (`app/`) and/or `cargo test` (core) for logic you add or change
- … (`app/test/e2e`, mock backend, `tests/json_rpc_e2e.rs` as appropriate)
- … `///` / `//!` (Rust), JSDoc or brief file/module headers (TS) on public APIs and non-obvious modules
- (Any feature related checklist can go in here)
- `cargo check`
- `cargo fmt --check`
- `cargo test --lib openhuman::voice::text_input::tests`
- `cargo test --lib openhuman::voice::server::tests`
- `cargo test --lib openhuman::local_ai::model_ids::tests::stt_tts_and_quantization_defaults_are_applied`
- `cargo run voice` debugging (Fn race + focus/paste paths)
- `cargo test --lib` clean run in this environment (unrelated long-running `screen_intelligence` noise/failures observed)
Related
Summary by CodeRabbit
New Features
Bug Fixes
Improvements