
fix(voice): anti-hallucination, clipboard paste, Fn key reliability #380

Merged
senamakel merged 18 commits into tinyhumansai:main from senamakel:fix/monday-prod
Apr 7, 2026

Conversation

@senamakel
Member

Summary

  • Add silence/energy detection (RMS threshold) to skip transcription of silent audio, the #1 cause of whisper hallucinations
  • Add hallucination output filtering (40+ patterns: [BLANK_AUDIO], "thank you", repeated words, single noise words)
  • Add whisper initial_prompt support with custom dictionary and recent transcript context for better accuracy across consecutive recordings
  • Switch text insertion from character-by-character typing (enigo text()) to clipboard-paste (Cmd+V) — fixes garbled/repeated output
  • Fix Fn key reliability on macOS — missed release events no longer break subsequent presses
  • Enable LLM cleanup by default (skip_cleanup: false) and align cleanup prompt with OpenWhispr's CLEANUP_PROMPT
  • Add silence_threshold and custom_dictionary to voice server config (TOML + RPC + UI)

Problem

The voice server had several issues making it unreliable for production use:

  1. Hallucinations: Pressing Fn with no speech produced "Thank you", "Hello world" repeated dozens of times, or other fabricated text. Whisper.cpp hallucinates when fed silent/noisy audio.
  2. Garbled output: Text insertion via enigo.text() typed character-by-character, causing macOS to duplicate/reorder characters at speed.
  3. Missed Fn presses: macOS doesn't reliably deliver KeyRelease for the Fn key. The is_active state machine got stuck, eating every other press.
  4. No LLM cleanup: skip_cleanup defaulted to true, so the LLM post-processing pass never ran — raw whisper output went straight to the text field.
  5. No vocabulary bias: Whisper had no context about what words to expect, making it guess wildly on ambiguous audio.

Solution

Anti-hallucination (inspired by OpenWhispr):

  • Compute peak RMS energy during recording via lock-free atomic tracking in the audio capture callback
  • Skip transcription entirely when peak_rms < silence_threshold (default 0.002, matching OpenWhispr)
  • Filter whisper output against 40+ known hallucination patterns before insertion
  • Set whisper.cpp temperature_inc=0 to disable temperature fallback (was retrying with randomness, producing "Thank you")
  • Enable suppress_blank, suppress_nst, single_segment mode, and explicit no_speech_thold=0.6
  • Pass initial_prompt to whisper with custom dictionary words + rolling recent transcripts
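
The silence gate and output filter described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the names `PeakRms`, `should_transcribe`, `is_hallucination`, and `KNOWN_HALLUCINATIONS` are hypothetical, and the pattern list is a tiny stand-in for the 40+ patterns mentioned. The lock-free part relies on the fact that non-negative `f32` values order the same way as their bit patterns, so an atomic `fetch_max` on the bits acts as a max on the floats.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Peak RMS observed during a recording, stored as the bit pattern of an f32.
/// (Hypothetical type; the PR only states that tracking is lock-free and atomic.)
pub struct PeakRms(AtomicU32);

impl PeakRms {
    pub fn new() -> Self {
        PeakRms(AtomicU32::new(0.0f32.to_bits()))
    }

    /// Intended to be called from the audio capture callback for each buffer.
    pub fn observe(&self, samples: &[f32]) {
        if samples.is_empty() {
            return;
        }
        let mean_sq = samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32;
        let rms = mean_sq.sqrt();
        // Skip NaN/inf so a bad buffer cannot pin the peak at a bogus value.
        if rms.is_finite() {
            self.0.fetch_max(rms.to_bits(), Ordering::Relaxed);
        }
    }

    pub fn get(&self) -> f32 {
        f32::from_bits(self.0.load(Ordering::Relaxed))
    }
}

/// Gate: transcribe only when the loudest buffer reached the threshold.
pub fn should_transcribe(peak_rms: f32, silence_threshold: f32) -> bool {
    peak_rms >= silence_threshold
}

/// Illustrative subset only; the real list is not reproduced here.
const KNOWN_HALLUCINATIONS: &[&str] = &["[blank_audio]", "thank you", "thank you."];

/// Returns true when the text looks like a known hallucination:
/// empty output, an exact known phrase, or one word repeated three or more times.
pub fn is_hallucination(text: &str) -> bool {
    let norm = text.trim().to_lowercase();
    if norm.is_empty() || KNOWN_HALLUCINATIONS.iter().any(|p| *p == norm.as_str()) {
        return true;
    }
    let words: Vec<&str> = norm.split_whitespace().collect();
    words.len() >= 3 && words.iter().all(|w| *w == words[0])
}
```

With the default `silence_threshold` of 0.002, a recording whose loudest buffer stays below that RMS would never reach whisper, and anything that does get transcribed would pass through `is_hallucination` before insertion.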

Clipboard-paste insertion (matching OpenWhispr's approach):

  • Save clipboard → write text → 120ms settle delay → simulate Cmd+V → 450ms restore delay → restore clipboard
  • Uses arboard crate for clipboard access, enigo only for the single Cmd+V keystroke
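
The save, write, paste, restore sequence can be sketched as below. To keep the example self-contained, the clipboard and the keystroke sit behind two hypothetical traits (`ClipboardBackend`, `PasteKey`). In the PR these roles are filled by arboard and enigo, whose exact calls are not shown here. The 120ms and 450ms values come from the description above; everything else is an assumption for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical abstraction over the system clipboard (arboard's role in the PR).
pub trait ClipboardBackend {
    fn get_text(&mut self) -> Option<String>;
    fn set_text(&mut self, text: &str) -> Result<(), String>;
}

/// Hypothetical abstraction over the single paste keystroke (enigo's role in the PR).
pub trait PasteKey {
    fn send_paste(&mut self) -> Result<(), String>;
}

pub struct PasteTiming {
    /// Wait after writing so the OS clipboard has settled before Cmd+V.
    pub settle: Duration,
    /// Wait after Cmd+V so the target app has read the clipboard before restoring.
    pub restore: Duration,
}

impl Default for PasteTiming {
    fn default() -> Self {
        PasteTiming {
            settle: Duration::from_millis(120),
            restore: Duration::from_millis(450),
        }
    }
}

/// Save clipboard -> write text -> settle -> paste -> wait -> restore clipboard.
pub fn paste_text<C: ClipboardBackend, K: PasteKey>(
    clipboard: &mut C,
    keys: &mut K,
    text: &str,
    timing: &PasteTiming,
) -> Result<(), String> {
    let saved = clipboard.get_text();
    clipboard.set_text(text)?;
    sleep(timing.settle);
    let pasted = keys.send_paste();
    sleep(timing.restore);
    // Restore even if the keystroke failed. If nothing textual was saved,
    // the dictated text simply stays on the clipboard.
    if let Some(previous) = saved {
        clipboard.set_text(&previous)?;
    }
    pasted
}
```

Keeping the delays in a struct makes them easy to shorten in tests and to tune per platform, since the right values depend on how quickly the target application reads the clipboard.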

Fn key fix:

  • Simplified hotkey callback: is_active tracks press/release normally
  • When the release IS delivered (your logs showed it sometimes is), the normal press/release path works
  • When release is missed, next press sees is_active=true and sends fallback Released
  • Transcription runs in background (tokio::spawn) so the hotkey loop is never blocked
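
One way to read the press/release handling above is as a small state machine. The sketch below is an interpretation, not the PR's code: `RawKey`, `HotkeyEvent`, and `FnKeyTracker` are hypothetical names, and it assumes that a press arriving while `is_active` is still true only emits the fallback `Released` (ending the stuck recording) rather than also starting a new one.

```rust
/// Raw events as delivered by the OS hook (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RawKey {
    Press,
    Release,
}

/// Events forwarded to the dictation logic (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HotkeyEvent {
    Pressed,
    Released,
}

#[derive(Default)]
pub struct FnKeyTracker {
    is_active: bool,
}

impl FnKeyTracker {
    /// Maps one raw event to at most one hotkey event.
    pub fn on_raw(&mut self, raw: RawKey) -> Option<HotkeyEvent> {
        match (raw, self.is_active) {
            // Normal press: start recording.
            (RawKey::Press, false) => {
                self.is_active = true;
                Some(HotkeyEvent::Pressed)
            }
            // Normal release: stop recording.
            (RawKey::Release, true) => {
                self.is_active = false;
                Some(HotkeyEvent::Released)
            }
            // The release was never delivered: treat this press as the
            // missing release so the state machine cannot stay stuck.
            (RawKey::Press, true) => {
                self.is_active = false;
                Some(HotkeyEvent::Released)
            }
            // Stray release with nothing active: ignore.
            (RawKey::Release, false) => None,
        }
    }
}
```

Under this reading, a missed release costs the user one extra press to end the recording, and the press after that starts a fresh one.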

LLM cleanup:

  • Changed skip_cleanup default to false
  • Replaced 2-line cleanup prompt with OpenWhispr's full CLEANUP_PROMPT (handles self-corrections, spoken punctuation, prompt injection protection)
  • Added info!-level logs showing LLM state so issues are diagnosable

Config & UI:

  • Added silence_threshold and custom_dictionary to VoiceServerConfig (TOML schema, RPC, controller schema inputs)
  • VoicePanel: custom dictionary tag editor, silence threshold control
  • Fixed 2s polling timer overwriting in-progress form edits

Submission Checklist

  • Unit tests — 57+ voice tests pass (cargo test -p openhuman -- voice::)
  • E2E / integration — Manual testing of Fn key press/release, silence detection, hallucination filtering, clipboard paste
  • Doc comments — All new public functions and config fields documented
  • Inline comments — Whisper params, hallucination patterns, clipboard timing documented with rationale

Impact

  • Desktop only — all changes in Rust core (src/openhuman/voice/) and React settings UI
  • New dependency: arboard = "3" for cross-platform clipboard access
  • Config migration: existing configs with skip_cleanup: true will keep that value (serde default only applies to missing fields)
  • Breaking behavior: LLM cleanup now runs by default if local LLM is ready. Users who prefer raw output can set skip_cleanup: true in settings.

Related

  • Reference: OpenWhispr (/references/openwhispr) — silence detection, clipboard paste, CLEANUP_PROMPT, custom dictionary patterns

senamakel added 18 commits April 6, 2026 16:18
- Changed the default global hotkey for dictation from "CmdOrCtrl+Shift+D" to "Fn" in both the configuration schema and the associated documentation.
- Updated the hotkey parsing function to recognize "Fn" as a valid key, enhancing the flexibility of hotkey configurations.
- Added a test case to ensure the "Fn" key can be parsed correctly, improving the robustness of the hotkey handling functionality.
- Changed the default activation mode for voice and dictation from "tap" to "push" in the respective configuration schemas.
- Updated the default hotkey for voice commands from "ctrl+shift+space" to "Fn" across various modules and documentation.
- Adjusted related tests to reflect the new defaults, ensuring consistency in behavior and expectations.
- Added a new asynchronous function `start_if_enabled` to the voice server module, which initializes the embedded voice server based on configuration settings.
- Updated the server run logic to check if the voice server should auto-start, enhancing the startup process for the core application.
- Integrated the new server startup function into the main server run logic, ensuring the voice server is launched if enabled in the configuration.
- Introduced a new `VoicePanel` component to handle voice server configurations, including startup options, hotkeys, and runtime controls.
- Updated routing in the settings page to include the new voice settings section.
- Enhanced the `useSettingsNavigation` hook to support navigation to the voice settings.
- Added tests for the `VoicePanel` to ensure functionality and reliability of the voice server management features.
…alization

- Revised comments in `DictationHotkeyManager` to clarify the component's mounting process within the app tree.
- Removed unused imports and unnecessary state management from `ServiceBlockingGate`, streamlining the component's logic.
- Updated tests for `ServiceBlockingGate` to reflect changes in behavior, ensuring accurate rendering of child components based on service status.
- Enhanced the `Cargo.lock` file by updating dependencies to their latest versions for improved stability and security.
…l options

- Changed the default value of `skip_cleanup` in the voice server configuration from `false` to `true` to improve transcription handling.
- Reordered options in the `VoicePanel` component to ensure "Natural cleanup" is displayed alongside "Verbatim transcription" for better user clarity.
- Updated tests to reflect the new default settings and ensure proper functionality of the VoicePanel component.
- Introduced a new module `window.ts` containing functions for managing window visibility and state in a Tauri application.
- Implemented commands to show, hide, toggle visibility, minimize, maximize, close, and set the title of the main window.
- Added checks to ensure commands are only executed in a Tauri environment, enhancing compatibility with web contexts.
- Introduced multiple new modules for Tauri commands, including `accessibility`, `autocomplete`, `config`, `conscious`, `core`, `cron`, `hardware`, `localAi`, and `window`.
- Each module contains functions for managing specific functionalities such as accessibility permissions, autocomplete suggestions, configuration settings, and hardware interactions.
- Implemented checks to ensure commands are executed only in a Tauri environment, enhancing compatibility and reliability.
- This addition significantly expands the command capabilities of the Tauri application, providing a robust framework for future development.
- Added functionality to transcribe audio with an optional initial prompt, allowing for vocabulary bias and improved conversational continuity.
- Updated the `transcribe_pcm_f32` and `transcribe_wav_file` functions to accept an `initial_prompt` parameter, enhancing recognition of specific vocabulary.
- Implemented peak RMS energy tracking during audio recording for silence detection, ensuring recordings below a defined threshold are skipped.
- Enhanced the voice server configuration to include a silence threshold and custom dictionary for better transcription context.
- Introduced methods to build and manage recent transcripts for improved continuity across consecutive recordings.
- Updated tests to validate new features and ensure proper functionality of the transcription process.
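
As a rough illustration of how an initial prompt could be assembled from the custom dictionary plus a rolling window of recent transcripts, here is a standard-library sketch. `PromptContext`, its fields, and the joining format (comma-separated dictionary followed by recent text) are assumptions made for the example. The commit only states that such a prompt is built and passed as `initial_prompt`.

```rust
use std::collections::VecDeque;

/// Hypothetical holder for vocabulary hints and recent transcripts.
pub struct PromptContext {
    dictionary: Vec<String>,
    recent: VecDeque<String>,
    max_recent: usize,
}

impl PromptContext {
    pub fn new(dictionary: Vec<String>, max_recent: usize) -> Self {
        PromptContext {
            dictionary,
            recent: VecDeque::new(),
            max_recent,
        }
    }

    /// Remember a finished transcript, dropping the oldest once the window is full.
    pub fn push_transcript(&mut self, text: &str) {
        let trimmed = text.trim();
        if trimmed.is_empty() || self.max_recent == 0 {
            return;
        }
        if self.recent.len() == self.max_recent {
            self.recent.pop_front();
        }
        self.recent.push_back(trimmed.to_string());
    }

    /// Build the prompt text, or None when there is nothing to bias with.
    pub fn initial_prompt(&self) -> Option<String> {
        let mut parts: Vec<String> = Vec::new();
        if !self.dictionary.is_empty() {
            parts.push(self.dictionary.join(", "));
        }
        parts.extend(self.recent.iter().cloned());
        if parts.is_empty() {
            None
        } else {
            Some(parts.join(" "))
        }
    }
}
```

Returning `None` when both sources are empty lets the caller leave the whisper prompt unset instead of passing an empty string.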
…tom dictionary

- Added `silence_threshold` to the `VoiceServerConfig` for improved silence detection, allowing recordings with low RMS energy to be skipped.
- Introduced `custom_dictionary` to bias transcription towards specific vocabulary, enhancing recognition of names and technical terms.
- Updated the `voice_transcribe` and `voice_transcribe_bytes` functions to utilize the new `initial_prompt` parameter for better context during transcription.
- Adjusted the `transcribe_pcm_i16` function to accept additional parameters for improved flexibility in handling audio input.
…VoicePanel

- Implemented a new input for setting the silence threshold, allowing recordings with low RMS energy to be skipped.
- Added functionality for a custom dictionary, enabling users to add specific vocabulary words to improve transcription accuracy.
- Updated the VoiceServerSettings interface and related functions to support the new features, ensuring seamless integration with existing settings.
- Enhanced the UI in the VoicePanel to facilitate user interaction with the new settings.
…ce server command

- Added `silence_threshold` and `custom_dictionary` parameters to the `run_voice_server_command` function, ensuring these settings are utilized during voice server operations.
- Enhanced integration with the existing voice server configuration to support improved transcription accuracy and silence detection.
- Adjusted import paths for `callCoreRpc` in multiple Tauri command modules to ensure correct referencing from the updated directory structure.
- This change enhances module organization and maintains consistency across the codebase.
… and versions

- Added new dependencies including `arboard`, `fax`, `fax_derive`, `gethostname`, `half`, `quick-error`, `tiff`, and `x11rb` to enhance functionality and support for clipboard operations, transcription improvements, and system interactions.
- Updated existing dependencies to their latest versions for better performance and compatibility.
- Modified `VoicePanel` to utilize the updated settings and ensure proper handling of voice server configurations.
- Enhanced the text input mechanism to use clipboard-paste for improved reliability in text insertion.
- Changed the default value of `skip_cleanup` in `VoiceServerConfig` from `true` to `false` to align with expected behavior and improve transcription handling.
- Added detailed logging in the transcription cleanup process to provide better insights into the LLM state and cleanup decisions.
- Removed unused functions related to unreliable key releases in hotkey handling to simplify the codebase.
- Updated tests to reflect the new default settings for `skip_cleanup` and ensure proper functionality across components.
…l tests

- Updated tests for the VoicePanel component to include new parameters: `silence_threshold` and `custom_dictionary`.
- Ensured that the tests reflect the latest configuration settings for improved transcription accuracy and functionality.
@senamakel senamakel merged commit 3851d1e into tinyhumansai:main Apr 7, 2026
4 checks passed
@senamakel senamakel deleted the fix/monday-prod branch April 7, 2026 01:41