feat(voice): dedicated voice assistance module for STT/TTS #178
senamakel merged 12 commits into tinyhumansai:main from
Conversation
Extracts speech-to-text (whisper.cpp) and text-to-speech (piper) into a dedicated `src/openhuman/voice/` domain module with its own RPC namespace (`openhuman.voice_*`). Adds proactive availability checking via `voice_status` so the UI can show clear errors when binaries/models are missing instead of failing silently at transcription time.

- New module: voice/types.rs, voice/ops.rs, voice/schemas.rs, voice/mod.rs
- 4 RPC endpoints: voice_status, voice_transcribe, voice_transcribe_bytes, voice_tts
- 21 unit tests + 1 integration test (json_rpc_e2e)
- Frontend updated to use voice_* endpoints with status check on mode switch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
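A hedged sketch of the proactive availability check described above; the field and helper names here are illustrative, not the module's actual API:

```rust
use std::path::PathBuf;

// Illustrative status DTO: report what is missing up front so the UI can
// explain why voice mode won't work instead of failing at transcription time.
struct VoiceStatus {
    stt_available: bool,
    tts_available: bool,
    whisper_binary: Option<PathBuf>,
    piper_binary: Option<PathBuf>,
}

fn which(bin: &str) -> Option<PathBuf> {
    // Minimal PATH lookup; a real implementation may also honor env overrides
    // such as WHISPER_BIN (as the tests in this PR do).
    std::env::var_os("PATH").and_then(|paths| {
        std::env::split_paths(&paths)
            .map(|dir| dir.join(bin))
            .find(|candidate| candidate.is_file())
    })
}

fn voice_status() -> VoiceStatus {
    let whisper_binary = which("whisper-cli");
    let piper_binary = which("piper");
    VoiceStatus {
        stt_available: whisper_binary.is_some(),
        tts_available: piper_binary.is_some(),
        whisper_binary,
        piper_binary,
    }
}
```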
Tests switching to voice input mode, verifying status check fires, recording button renders, and switching back to text mode restores text input. Also checks reply mode toggle visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

Adds a new Voice domain with STT (in-process Whisper via whisper-rs and subprocess fallback), TTS, RPC bindings, frontend UI/command wiring, LLM-based transcription cleanup, CI/tooling updates (cmake), and new unit/e2e tests integrating voice flows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Frontend (Client)
    participant RPC as RPC Layer
    participant VoiceOps as openhuman::voice ops
    participant LocalAI as LocalAiService
    participant Whisper as Whisper Engine
    Client->>RPC: openhumanVoiceTranscribeBytes(bytes, ext, context)
    RPC->>VoiceOps: voice_transcribe_bytes(params)
    VoiceOps->>VoiceOps: write bytes -> temp file
    VoiceOps->>LocalAI: transcribe(temp_path)
    alt whisper_in_process enabled
        LocalAI->>Whisper: is_loaded(handle)?
        alt engine loaded
            Whisper->>Whisper: transcribe_wav_file(path)
            Whisper-->>LocalAI: raw_text
        else not loaded
            LocalAI->>LocalAI: transcribe_subprocess(path)
            LocalAI-->>LocalAI: raw_text
        end
    else in-process disabled
        LocalAI->>LocalAI: transcribe_subprocess(path)
        LocalAI-->>LocalAI: raw_text
    end
    LocalAI-->>VoiceOps: LocalAiSpeechResult{raw_text, model_id}
    VoiceOps->>VoiceOps: cleanup_transcription(raw_text, context) (optional)
    VoiceOps-->>RPC: VoiceSpeechResult{text, raw_text, model_id}
    RPC-->>Client: RPC response
```
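A condensed sketch of the byte-level flow above, using the `tempfile` crate for the temp-file step; the function names are illustrative:

```rust
use std::io::Write;

// Write the uploaded bytes to a temp file, transcribe, then clean up.
async fn voice_transcribe_bytes(bytes: &[u8], ext: &str) -> Result<String, String> {
    let mut tmp = tempfile::Builder::new()
        .suffix(&format!(".{ext}"))
        .tempfile()
        .map_err(|e| format!("temp file: {e}"))?;
    tmp.write_all(bytes).map_err(|e| format!("write: {e}"))?;

    // NamedTempFile deletes itself on drop, so cleanup happens even on error.
    let raw_text = transcribe(tmp.path()).await?;
    Ok(raw_text)
}

async fn transcribe(_path: &std::path::Path) -> Result<String, String> {
    Ok(String::new()) // stand-in for the whisper call
}
```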
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- Add whisper-rs (0.16) for in-process whisper.cpp inference, eliminating cold-start latency from subprocess-per-call (~1-3s) to warm inference (~50ms). Model is loaded once during bootstrap and reused across calls. Falls back to whisper-cli subprocess if in-process loading fails.
- Add LLM post-processing layer that passes raw transcription through Ollama to fix grammar, punctuation, and filler words. Accepts optional conversation context to disambiguate names and technical terms. Gracefully degrades to raw whisper output if Ollama is unavailable.
- Update voice RPC endpoints with new optional params (context, skip_cleanup) and return both cleaned text and raw_text.
- Update frontend to pass conversation history as context for voice transcription cleanup, and update TypeScript interfaces to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
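A minimal sketch of the load-once shape, using the standard library's `OnceLock`; the model path and types are illustrative, and the PR loads the model during bootstrap rather than necessarily via this mechanism:

```rust
use std::sync::OnceLock;

struct WhisperEngine; // stand-in for the whisper-rs model context

impl WhisperEngine {
    fn load(_model_path: &str) -> Result<Self, String> {
        Ok(WhisperEngine) // the expensive model load happens exactly once here
    }
}

static ENGINE: OnceLock<Result<WhisperEngine, String>> = OnceLock::new();

// Later calls skip the load entirely and hit the warm engine (the ~50ms path).
fn engine() -> Result<&'static WhisperEngine, String> {
    ENGINE
        .get_or_init(|| WhisperEngine::load("ggml-base.en.bin"))
        .as_ref()
        .map_err(|e| e.clone())
}
```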
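And a hedged sketch of the degrade-to-raw cleanup described above; the `OllamaClient` stub and flag name are assumptions, not the project's actual API:

```rust
// Hypothetical minimal client standing in for the real Ollama integration.
struct OllamaClient;

impl OllamaClient {
    async fn generate(&self, _prompt: &str) -> Result<String, String> {
        Err("ollama unavailable".to_string()) // stub
    }
}

async fn cleanup_transcription(
    cleanup_enabled: bool,
    raw_text: &str,
    context: Option<&str>,
    ollama: &OllamaClient,
) -> String {
    if !cleanup_enabled || raw_text.trim().is_empty() {
        return raw_text.to_string(); // disabled by config, or nothing to clean
    }
    let prompt = match context {
        Some(ctx) => format!(
            "Fix grammar, punctuation, and filler words.\nContext:\n{ctx}\n\nTranscript:\n{raw_text}"
        ),
        None => format!("Fix grammar, punctuation, and filler words:\n{raw_text}"),
    };
    match ollama.generate(&prompt).await {
        Ok(cleaned) if !cleaned.trim().is_empty() => cleaned,
        // Graceful degradation: Ollama down or empty reply keeps raw whisper output.
        _ => raw_text.to_string(),
    }
}
```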
The whisper-rs crate requires cmake to compile whisper.cpp from source, which is not available in the CI environment. Move it behind an optional cargo feature so CI builds succeed without cmake. The whisper_engine module now compiles as a no-op stub when the feature is disabled, returning "whisper feature not compiled in" errors. Desktop builds can opt in with `--features whisper`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
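A minimal sketch of the stub pattern this commit describes, assuming the `whisper` feature name from the commit message (note the next commit reverts this approach in favor of installing cmake in CI):

```rust
use std::path::Path;

// Compiled only when the `whisper` feature (and thus whisper-rs/cmake) is enabled.
#[cfg(feature = "whisper")]
pub fn transcribe_wav_file(path: &Path) -> Result<String, String> {
    // ... the real whisper-rs-backed implementation would live here ...
    let _ = path;
    Err("sketch only".to_string())
}

// The no-op stub keeps the same signature so callers still compile without cmake.
#[cfg(not(feature = "whisper"))]
pub fn transcribe_wav_file(_path: &Path) -> Result<String, String> {
    Err("whisper feature not compiled in".to_string())
}
```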
Revert whisper-rs from optional to mandatory dependency. Add cmake installation to all CI workflows (build, typecheck, test, release) and the CI Docker image so whisper-rs can compile whisper.cpp from source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 8
🧹 Nitpick comments (5)
app/src/components/settings/panels/LocalModelPanel.tsx (1)
1132-1137: UI improvements look good, but the file exceeds the size guideline. The transcript display changes are sensible improvements (better spacing and labeled output). However, this file is ~1179 lines, exceeding the ≤500 line guideline. Consider splitting into smaller components in a future refactor.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/settings/panels/LocalModelPanel.tsx` around lines 1132 - 1137, The file LocalModelPanel.tsx has grown too large (~1179 lines) and should be split into smaller components; extract the transcript display block that references transcribeOutput (the Model: {transcribeOutput.model_id} and the <pre> showing transcribeOutput.text) into a new presentational component (e.g., TranscriptView) and optionally extract model header/info into ModelInfo, moving any related handler/state or helper functions into their own files or into a container component; update LocalModelPanel to import and render these new components (keeping prop names like transcribeOutput) so the main file size is reduced while behavior remains identical.
Cargo.toml (1)

89-89: Consider making `whisper-rs` optional. The `whisper-rs` dependency adds native whisper.cpp bindings, increasing binary size and compile time for all users. Since in-process Whisper is controlled by a config flag (`whisper_in_process`), making this an optional feature would reduce build overhead for users who don't need it.

Optional refactor approach

```diff
 [dependencies]
 # ... other deps ...
-whisper-rs = "0.16"
+whisper-rs = { version = "0.16", optional = true }

 [features]
 # ... existing features ...
+whisper-in-process = ["dep:whisper-rs"]
```

Then conditionally compile the whisper engine code with `#[cfg(feature = "whisper-in-process")]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Cargo.toml` at line 89, Make the whisper-rs dependency optional and gate the in-process whisper code behind a feature named "whisper-in-process": mark the dependency whisper-rs as optional in Cargo.toml (optional = true) and add a "whisper-in-process" feature in the [features] table that enables that optional dependency; then wrap the in-process engine code paths with #[cfg(feature = "whisper-in-process")] (and #[cfg(not(feature = "whisper-in-process"))] fallbacks where needed) for modules/functions that reference whisper-rs (e.g., the whisper engine initialization, any create_whisper_client / WhisperEngine / transcribe_* functions) so builds that don't opt into the feature won't pull in whisper-rs.
src/openhuman/voice/postprocess.rs (1)

83-101: Consider adding a test for the config-disabled path. The current tests cover the empty/whitespace early-return paths. Consider adding a test that verifies behavior when `voice_llm_cleanup_enabled` is explicitly `false` (config-disabled path) with non-empty input, to ensure the raw text is returned unchanged.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/postprocess.rs` around lines 83 - 101, Add a test that asserts cleanup_transcription returns the raw input when the config disables cleanup: create a Config with voice_llm_cleanup_enabled set to false (keeping other defaults), call cleanup_transcription(&config, "some non-empty text", None) inside a tokio runtime (similar to the existing tests), and assert the result equals the original string; reference the Config struct and the cleanup_transcription function to locate where to add the test.
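A minimal sketch of such a test, assuming a `Config` with a `voice_llm_cleanup_enabled` field and the `cleanup_transcription(&Config, &str, Option<&str>)` signature described in this review; names mirror the prompt, not verified project code:

```rust
#[tokio::test]
async fn cleanup_disabled_returns_raw_text() {
    let mut config = Config::default();
    config.voice_llm_cleanup_enabled = false;

    let raw = "um so like this is the raw transcript";
    let result = cleanup_transcription(&config, raw, None).await;

    // With cleanup disabled, the raw whisper output must pass through unchanged.
    assert_eq!(result, raw);
}
```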
src/openhuman/local_ai/service/speech.rs (1)

29-43: Minor TOCTOU race condition between `is_loaded()` and `transcribe_in_process()`. The engine state could change between the `is_loaded()` check (line 29) and the actual transcription call inside `transcribe_in_process()`. If the engine is unloaded between these calls, transcription will fail.

However, the current implementation handles this gracefully by falling back to the subprocess path on failure (lines 39-41), so the impact is limited to an unnecessary fallback rather than a crash or data corruption. This is acceptable for the current use case.
If stricter consistency is needed in the future (e.g., avoiding retry overhead), consider acquiring the lock once and holding it through the transcription, or using a try-transcribe pattern that returns a specific error for "not loaded."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/local_ai/service/speech.rs` around lines 29 - 43, There is a TOCTOU race between whisper_engine::is_loaded(&self.whisper) and calling self.transcribe_in_process(audio_path) which can let the engine unload between the two calls; to fix this either perform the "is loaded and transcribe" atomically by acquiring whatever internal lock or guard the engine provides (or hold the same self.whisper lock) while calling transcribe_in_process, or implement a try_transcribe method on the whisper engine that attempts transcription and returns a specific NotLoaded error so transcribe_in_process can detect it without a separate is_loaded check; update the branch that currently checks whisper_engine::is_loaded, calling the atomic/try_transcribe approach and keep the existing fallback behavior on real transcription errors while avoiding the separate is_loaded call.
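A hedged sketch of the try-transcribe variant; the types here are illustrative stand-ins, and the real handle in this codebase may differ:

```rust
use std::sync::Mutex;

struct WhisperEngine; // stand-in for the whisper-rs context

impl WhisperEngine {
    fn transcribe(&self, _wav_path: &str) -> Result<String, String> {
        Ok(String::new()) // placeholder for the real whisper call
    }
}

enum TranscribeError {
    NotLoaded,      // engine never loaded, or unloaded since
    Failed(String), // a real transcription failure
}

struct WhisperEngineHandle {
    inner: Mutex<Option<WhisperEngine>>, // None until load succeeds
}

impl WhisperEngineHandle {
    // Holding the lock across the whole call makes "is loaded" and
    // "transcribe" atomic, so the separate is_loaded() pre-check goes away.
    fn try_transcribe(&self, wav_path: &str) -> Result<String, TranscribeError> {
        let guard = self.inner.lock().expect("engine lock poisoned");
        match guard.as_ref() {
            None => Err(TranscribeError::NotLoaded),
            Some(engine) => engine
                .transcribe(wav_path)
                .map_err(TranscribeError::Failed),
        }
    }
}
```

The caller matches on `NotLoaded` to take the subprocess fallback and treats `Failed` as a genuine error, preserving the existing fallback behavior without the racy pre-check.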
app/test/e2e/specs/voice-mode.spec.ts (1)

136-139: Disambiguate the Input-mode toggle clicks. There are two `Voice` buttons and two `Text` buttons on this screen, so `clickText(...)` will hit whichever match is surfaced first. That makes this flow brittle and can flip Reply mode instead of Input mode if the accessibility tree order changes. Scope the click to the Input toggle group, or use a dedicated toggle helper/test id here.

As per coding guidelines: "Use clickNativeButton(), hasAppChrome(), waitForWebView(), clickToggle() and other cross-platform helper functions for element interaction in E2E specs".
Also applies to: 178-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/test/e2e/specs/voice-mode.spec.ts` around lines 136 - 139, The test uses clickText('Voice', ...) which can target either the Input or Reply toggle; replace these ambiguous clicks with a scoped click using the Input-mode toggle helper or a test id and the cross-platform helpers: use clickToggle('input-mode') or clickNativeButton(selectorForInputToggle) / clickToggle with the Input toggle's test-id/class instead of clickText, and update both occurrences (around clickText at the first block and the similar block at lines ~178-180) so the test explicitly targets the Input toggle element.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Around line 188-197: The test "shows reply mode toggle with text and voice
options" is not independent and only checks for a global "Text" label; make it
runnable in isolation by performing the same auth/navigation/setup that earlier
tests do (or move that setup to a shared beforeEach used by this spec) so the
conversations page is loaded, then scope DOM queries to the Reply toggle group
instead of global textExists calls: locate the Reply toggle container (e.g.,
query by accessible name "Reply" or a unique selector for the reply-toggle
component) and assert that both "Text" and "Voice" options exist inside that
container (use the local container.querySelector / getByText on the container or
equivalent helpers rather than global textExists).
- Around line 162-177: The test currently only logs when neither the voice CTA
nor the unavailable message appears; update the block that checks
waitForAnyText(...) (variable hasVoiceButton) and textExists('Speech-to-text
unavailable') (variable hasStatus) so that if both are false the spec fails the
test—e.g., replace the console logs with an assertion or throw an Error
indicating "Voice mode did not render CTA or unavailable state" inside the same
conditional branch; keep the existing checks for Start Talking/Transcribing/Stop
& Send and Speech-to-text unavailable and only assert when both checks fail.
In `@src/openhuman/local_ai/service/whisper_engine.rs`:
- Around line 192-215: The WAV parser currently reads sample rate (_sample_rate)
and num_channels but never enforces them, so add a validation after parsing the
"fmt " chunk and before calling convert_pcm_to_f32: check that _sample_rate ==
16000 and that num_channels == 1 (or explicitly reject num_channels > 2 if you
want to allow stereo only) and return Err(...) with a clear message like
"unsupported sample rate" or "unsupported channel count" if the checks fail;
apply the same validation in the other parsing branch handling fmt/data (the
other block around the 228-268 region) so files with 44.1kHz or >2 channels fail
fast instead of being fed to Whisper.
- Around line 33-55: The load_engine function (and other blocking operations
like transcribe_in_process) perform heavy CPU/filesystem work on Tokio worker
threads; update the async call sites (handle_local_ai_transcribe,
handle_voice_transcribe) to offload these operations by calling
tokio::task::spawn_blocking (or tokio::task::block_in_place inside an async
context) and run load_engine/WhisperContext::new_with_params and
transcribe_in_process inside that blocking closure, ensuring
WhisperEngineHandle/engine initialization is moved back onto the Tokio runtime
only after the blocking work completes; reference the functions load_engine,
transcribe_in_process and the RPC handlers handle_local_ai_transcribe and
handle_voice_transcribe when applying the change.
In `@src/openhuman/voice/ops.rs`:
- Around line 265-287: Test voice_status_detects_stub_binaries mutates the
global WHISPER_BIN env var unsafely; change it to save the prior value with
std::env::var_os("WHISPER_BIN"), set the stub with std::env::set_var, and
restore the original value in a drop-safe guard (RAII) so restore happens even
on panic; additionally protect access with a test-wide synchronization primitive
(e.g., a static Mutex or use serial_test) to prevent concurrent tests from
racing, and keep references to the test helpers (voice_status, tempfile stub
creation) intact while replacing the unconditional std::env::remove_var call
with the guarded restore.
- Around line 38-43: The debug logs currently print full filesystem paths
(whisper_bin, piper_bin, stt_model, tts_voice) which may contain PII; sanitize
them by logging only booleans or safe basenames instead of full paths. Update
the debug! invocations that use LOG_PREFIX and the variables whisper_bin,
piper_bin, stt_model, tts_voice (and the other occurrences you noted around the
file) to map Option<PathBuf>/PathBuf values to their file_name().and_then(|n|
n.to_str()).unwrap_or("redacted") or to a short tag like "<redacted>" before
formatting, and keep the original boolean flags (stt_available, tts_available,
whisper_in_process) as-is so the messages convey state without exposing absolute
paths.
- Around line 131-140: The temp-audio cleanup currently ignores errors from
tokio::fs::remove_file(&file_path), so transient failures can leave recordings
behind; update the code around file_path, service.transcribe, and the
remove_file call to ensure cleanup failures are observed: either use a
tempfile-based auto-cleaning path (e.g., tempfile::NamedTempFile or
tempfile::Builder) for automatic deletion, or capture the Result from
tokio::fs::remove_file and log any Err using the project's tracing/log (use
debug/trace for success checkpoints and error for failures) and include
file_path display in the log; also ensure remove_file is attempted in a
finally-like block (after service.transcribe completes or on its Err) so cleanup
always runs.
In `@src/openhuman/voice/schemas.rs`:
- Around line 83-140: The ControllerSchema outputs for "voice_status",
"voice_transcribe", "voice_transcribe_bytes", and "voice_tts" claim wrapped
top-level keys ("status", "speech", "tts") but your handlers' to_json (or the
DTO serialization) emits the inner DTO directly; update either the schemas or
the serialization so they match: either change the ControllerSchema.outputs
entries to list the actual DTO fields returned by the handlers, or modify the
handlers / to_json logic to wrap the returned payload under the declared keys
(e.g., return {"status": <dto>} / {"speech": <dto>} / {"tts": <dto>}); ensure
the change is applied consistently for the same pattern referenced around lines
215-218 as well.
---
Nitpick comments:
In `@app/src/components/settings/panels/LocalModelPanel.tsx`:
- Around line 1132-1137: The file LocalModelPanel.tsx has grown too large (~1179
lines) and should be split into smaller components; extract the transcript
display block that references transcribeOutput (the Model:
{transcribeOutput.model_id} and the <pre> showing transcribeOutput.text) into a
new presentational component (e.g., TranscriptView) and optionally extract model
header/info into ModelInfo, moving any related handler/state or helper functions
into their own files or into a container component; update LocalModelPanel to
import and render these new components (keeping prop names like
transcribeOutput) so the main file size is reduced while behavior remains
identical.
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Around line 136-139: The test uses clickText('Voice', ...) which can target
either the Input or Reply toggle; replace these ambiguous clicks with a scoped
click using the Input-mode toggle helper or a test id and the cross-platform
helpers: use clickToggle('input-mode') or
clickNativeButton(selectorForInputToggle) / clickToggle with the Input toggle's
test-id/class instead of clickText, and update both occurrences (around
clickText at the first block and the similar block at lines ~178-180) so the
test explicitly targets the Input toggle element.
In `@Cargo.toml`:
- Line 89: Make the whisper-rs dependency optional and gate the in-process
whisper code behind a feature named "whisper-in-process": mark the dependency
whisper-rs as optional in Cargo.toml (optional = true) and add a
"whisper-in-process" feature in the [features] table that enables that optional
dependency; then wrap the in-process engine code paths with #[cfg(feature =
"whisper-in-process")] (and #[cfg(not(feature = "whisper-in-process"))]
fallbacks where needed) for modules/functions that reference whisper-rs (e.g.,
the whisper engine initialization, any create_whisper_client / WhisperEngine /
transcribe_* functions) so builds that don't opt into the feature won’t pull in
whisper-rs.
In `@src/openhuman/local_ai/service/speech.rs`:
- Around line 29-43: There is a TOCTOU race between
whisper_engine::is_loaded(&self.whisper) and calling
self.transcribe_in_process(audio_path) which can let the engine unload between
the two calls; to fix this either perform the "is loaded and transcribe"
atomically by acquiring whatever internal lock or guard the engine provides (or
hold the same self.whisper lock) while calling transcribe_in_process, or
implement a try_transcribe method on the whisper engine that attempts
transcription and returns a specific NotLoaded error so transcribe_in_process
can detect it without a separate is_loaded check; update the branch that
currently checks whisper_engine::is_loaded, calling the atomic/try_transcribe
approach and keep the existing fallback behavior on real transcription errors
while avoiding the separate is_loaded call.
In `@src/openhuman/voice/postprocess.rs`:
- Around line 83-101: Add a test that asserts cleanup_transcription returns the
raw input when the config disables cleanup: create a Config with
voice_llm_cleanup_enabled set to false (keeping other defaults), call
cleanup_transcription(&config, "some non-empty text", None) inside a tokio
runtime (similar to the existing tests), and assert the result equals the
original string; reference the Config struct and the cleanup_transcription
function to locate where to add the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1ca30e83-412c-42ab-8252-9e36690ffe20
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (20)
- Cargo.toml
- app/src/components/settings/panels/LocalModelPanel.tsx
- app/src/pages/Conversations.tsx
- app/src/utils/tauriCommands.ts
- app/test/e2e/specs/voice-mode.spec.ts
- src/core/all.rs
- src/openhuman/config/schema/local_ai.rs
- src/openhuman/local_ai/mod.rs
- src/openhuman/local_ai/service/bootstrap.rs
- src/openhuman/local_ai/service/mod.rs
- src/openhuman/local_ai/service/public_infer.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/whisper_engine.rs
- src/openhuman/mod.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/ops.rs
- src/openhuman/voice/postprocess.rs
- src/openhuman/voice/schemas.rs
- src/openhuman/voice/types.rs
- tests/json_rpc_e2e.rs
```ts
// --- Verify the voice recording button is visible ---
// In voice mode, we should see "Start Talking" (or the unavailability message)
const hasVoiceButton = await waitForAnyText(
  ['Start Talking', 'Transcribing', 'Stop & Send'],
  10_000
);
// The button should be present even if STT is unavailable (it will be disabled)
if (!hasVoiceButton) {
  console.log('[VoiceModeE2E] Voice button not found, checking for status messages...');
  // If whisper is not available, the button may still be present but the status message dominates
  const hasStatus = await textExists('Speech-to-text unavailable');
  if (hasStatus) {
    console.log('[VoiceModeE2E] STT unavailable message displayed — voice mode active.');
  }
}
```
Fail the test when voice mode renders neither the CTA nor the unavailable state.
This branch only logs when the voice control is missing, so the spec still goes green even if voice mode shows neither a record button nor the expected unavailable UI. Please turn one of those outcomes into an assertion.
Based on learnings "Assert both UI outcomes and backend/mock effects in E2E specs when relevant to ensure full-stack correctness".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/test/e2e/specs/voice-mode.spec.ts` around lines 162 - 177, The test
currently only logs when neither the voice CTA nor the unavailable message
appears; update the block that checks waitForAnyText(...) (variable
hasVoiceButton) and textExists('Speech-to-text unavailable') (variable
hasStatus) so that if both are false the spec fails the test—e.g., replace the
console logs with an assertion or throw an Error indicating "Voice mode did not
render CTA or unavailable state" inside the same conditional branch; keep the
existing checks for Start Talking/Transcribing/Stop & Send and Speech-to-text
unavailable and only assert when both checks fail.
```rust
let mut _sample_rate: u32 = 0;
let mut bits_per_sample: u16 = 0;

while pos + 8 <= data.len() {
    let chunk_id = &data[pos..pos + 4];
    let chunk_size =
        u32::from_le_bytes([data[pos + 4], data[pos + 5], data[pos + 6], data[pos + 7]])
            as usize;

    if chunk_id == b"fmt " {
        if chunk_size < 16 || pos + 8 + chunk_size > data.len() {
            return Err("malformed fmt chunk".to_string());
        }
        let fmt = &data[pos + 8..];
        audio_format = u16::from_le_bytes([fmt[0], fmt[1]]);
        num_channels = u16::from_le_bytes([fmt[2], fmt[3]]);
        _sample_rate = u32::from_le_bytes([fmt[4], fmt[5], fmt[6], fmt[7]]);
        bits_per_sample = u16::from_le_bytes([fmt[14], fmt[15]]);
        fmt_found = true;
    }

    if chunk_id == b"data" && fmt_found {
        let pcm_data = &data[pos + 8..pos + 8 + chunk_size.min(data.len() - pos - 8)];
        return convert_pcm_to_f32(pcm_data, audio_format, num_channels, bits_per_sample);
```
Reject unsupported WAV sample rates and channel counts.
The decoder reads sample_rate and num_channels but never enforces them. A 44.1 kHz WAV or a num_channels > 2 file will be accepted and fed to Whisper as if it were valid 16 kHz mono input, which corrupts transcription instead of failing fast.
Suggested guard
```diff
- let mut _sample_rate: u32 = 0;
+ let mut sample_rate: u32 = 0;
  ...
- _sample_rate = u32::from_le_bytes([fmt[4], fmt[5], fmt[6], fmt[7]]);
+ sample_rate = u32::from_le_bytes([fmt[4], fmt[5], fmt[6], fmt[7]]);
  ...
- return convert_pcm_to_f32(pcm_data, audio_format, num_channels, bits_per_sample);
+ if sample_rate != 16_000 {
+     return Err(format!("unsupported WAV sample rate: {sample_rate}"));
+ }
+ if num_channels != 1 && num_channels != 2 {
+     return Err(format!("unsupported WAV channel count: {num_channels}"));
+ }
+ return convert_pcm_to_f32(pcm_data, audio_format, num_channels, bits_per_sample);
```

Also applies to: 228-268
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 192 - 215, The
WAV parser currently reads sample rate (_sample_rate) and num_channels but never
enforces them, so add a validation after parsing the "fmt " chunk and before
calling convert_pcm_to_f32: check that _sample_rate == 16000 and that
num_channels == 1 (or explicitly reject num_channels > 2 if you want to allow
stereo only) and return Err(...) with a clear message like "unsupported sample
rate" or "unsupported channel count" if the checks fail; apply the same
validation in the other parsing branch handling fmt/data (the other block around
the 228-268 region) so files with 44.1kHz or >2 channels fail fast instead of
being fed to Whisper.
```rust
async fn voice_status_detects_stub_binaries() {
    let tmp = tempfile::tempdir().expect("tempdir");

    let whisper_stub = tmp.path().join("whisper-cli");
    std::fs::write(&whisper_stub, b"#!/bin/sh\n").expect("write stub");
    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        std::fs::set_permissions(&whisper_stub, std::fs::Permissions::from_mode(0o755))
            .expect("chmod");
    }

    std::env::set_var("WHISPER_BIN", whisper_stub.display().to_string());

    let mut config = Config::default();
    config.workspace_dir = tmp.path().join("workspace");
    config.config_path = tmp.path().join("config.toml");

    let result = voice_status(&config).await.unwrap();
    assert!(result.value.whisper_binary.is_some());

    std::env::remove_var("WHISPER_BIN");
}
```
Serialize and restore env state in this test.
WHISPER_BIN is process-global, and this test overwrites it and then unconditionally removes it. Parallel Rust test execution can race with any other reader/writer, and even serial runs will destroy a pre-existing value if the suite or developer environment already had one set. Please guard this with shared test synchronization and restore the old value in a drop-safe way.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/ops.rs` around lines 265 - 287, Test
voice_status_detects_stub_binaries mutates the global WHISPER_BIN env var
unsafely; change it to save the prior value with
std::env::var_os("WHISPER_BIN"), set the stub with std::env::set_var, and
restore the original value in a drop-safe guard (RAII) so restore happens even
on panic; additionally protect access with a test-wide synchronization primitive
(e.g., a static Mutex or use serial_test) to prevent concurrent tests from
racing, and keep references to the test helpers (voice_status, tempfile stub
creation) intact while replacing the unconditional std::env::remove_var call
with the guarded restore.
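A drop-safe sketch of that guard; the `EnvVarGuard` type and `ENV_LOCK` are illustrative, not existing project helpers:

```rust
use std::ffi::OsString;
use std::sync::Mutex;

// Serializes all tests that touch WHISPER_BIN across parallel test threads.
static ENV_LOCK: Mutex<()> = Mutex::new(());

struct EnvVarGuard {
    key: &'static str,
    previous: Option<OsString>,
}

impl EnvVarGuard {
    fn set(key: &'static str, value: &str) -> Self {
        let previous = std::env::var_os(key);
        std::env::set_var(key, value);
        Self { key, previous }
    }
}

impl Drop for EnvVarGuard {
    // Runs even on panic, so any pre-existing value is always restored.
    fn drop(&mut self) {
        match self.previous.take() {
            Some(old) => std::env::set_var(self.key, old),
            None => std::env::remove_var(self.key),
        }
    }
}

// Usage inside the test:
// let _lock = ENV_LOCK.lock().unwrap();
// let _guard = EnvVarGuard::set("WHISPER_BIN", &whisper_stub.display().to_string());
```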
| "voice_status" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "status", | ||
| description: "Check availability of STT/TTS binaries and models.", | ||
| inputs: vec![], | ||
| outputs: vec![json_output("status", "Voice availability status.")], | ||
| }, | ||
| "voice_transcribe" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "transcribe", | ||
| description: | ||
| "Transcribe audio from a file path using whisper.cpp, with optional LLM cleanup.", | ||
| inputs: vec![ | ||
| required_string("audio_path", "Path to the audio file."), | ||
| optional_string("context", "Conversation context for LLM post-processing."), | ||
| optional_bool( | ||
| "skip_cleanup", | ||
| "Skip LLM cleanup, return raw whisper output.", | ||
| ), | ||
| ], | ||
| outputs: vec![json_output( | ||
| "speech", | ||
| "Transcription result with text and raw_text.", | ||
| )], | ||
| }, | ||
| "voice_transcribe_bytes" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "transcribe_bytes", | ||
| description: | ||
| "Transcribe audio from raw bytes using whisper.cpp, with optional LLM cleanup.", | ||
| inputs: vec![ | ||
| FieldSchema { | ||
| name: "audio_bytes", | ||
| ty: TypeSchema::Bytes, | ||
| comment: "Raw audio bytes.", | ||
| required: true, | ||
| }, | ||
| optional_string("extension", "Audio file extension (default: webm)."), | ||
| optional_string("context", "Conversation context for LLM post-processing."), | ||
| optional_bool( | ||
| "skip_cleanup", | ||
| "Skip LLM cleanup, return raw whisper output.", | ||
| ), | ||
| ], | ||
| outputs: vec![json_output( | ||
| "speech", | ||
| "Transcription result with text and raw_text.", | ||
| )], | ||
| }, | ||
| "voice_tts" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "tts", | ||
| description: "Synthesize speech from text using piper.", | ||
| inputs: vec![ | ||
| required_string("text", "Text to synthesize."), | ||
| optional_string("output_path", "Optional output file path."), | ||
| ], | ||
| outputs: vec![json_output("tts", "TTS result with output path.")], |
Align the declared outputs with the JSON you actually return.
These schemas advertise wrapped responses (status, speech, tts), but to_json serializes the inner DTO directly. Anything consuming ControllerSchema.outputs will infer the wrong top-level contract unless the handler wraps the payload under those keys or the schema lists the DTO’s real fields instead.
Also applies to: 215-218
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/schemas.rs` around lines 83 - 140, The ControllerSchema
outputs for "voice_status", "voice_transcribe", "voice_transcribe_bytes", and
"voice_tts" claim wrapped top-level keys ("status", "speech", "tts") but your
handlers' to_json (or the DTO serialization) emits the inner DTO directly;
update either the schemas or the serialization so they match: either change the
ControllerSchema.outputs entries to list the actual DTO fields returned by the
handlers, or modify the handlers / to_json logic to wrap the returned payload
under the declared keys (e.g., return {"status": <dto>} / {"speech": <dto>} /
{"tts": <dto>}); ensure the change is applied consistently for the same pattern
referenced around lines 215-218 as well.
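For the wrapping option, a minimal serde/serde_json sketch; the DTO and its fields are assumptions for illustration, not the project's actual types:

```rust
use serde::Serialize;
use serde_json::{json, Value};

// Illustrative DTO; the real VoiceStatus fields live in voice/types.rs.
#[derive(Serialize)]
struct VoiceStatusDto {
    stt_available: bool,
    tts_available: bool,
}

// Wrap the payload under the schema's declared top-level key so the
// emitted JSON matches ControllerSchema.outputs ("status" here).
fn voice_status_to_json(dto: &VoiceStatusDto) -> Value {
    json!({ "status": dto })
}

fn main() {
    let dto = VoiceStatusDto { stt_available: true, tts_available: false };
    // Prints: {"status":{"stt_available":true,"tts_available":false}}
    println!("{}", voice_status_to_json(&dto));
}
```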
- whisper_engine: validate WAV sample rate (must be 16kHz) and channel count (1 or 2) before feeding audio to whisper
- speech: offload load_engine and transcribe_in_process to tokio::task::spawn_blocking to avoid blocking the Tokio runtime
- ops: use RAII guard for WHISPER_BIN env var in test to prevent races and ensure restore on panic; log temp file cleanup failures instead of silently ignoring; sanitize paths in debug logs to basenames only
- postprocess: add test for disabled cleanup config returning raw text
- voice-mode.spec: assert failure when neither voice CTA nor unavailable message appears; make reply mode test runnable in isolation with auth/nav setup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
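A minimal sketch of the spawn_blocking offload mentioned above, assuming a synchronous, CPU-heavy `transcribe_in_process`; the stand-in types keep it self-contained:

```rust
// Offload the CPU-bound whisper call so it doesn't stall Tokio worker threads.
async fn transcribe_offloaded(
    handle: std::sync::Arc<WhisperEngineHandle>,
    wav_path: std::path::PathBuf,
) -> Result<String, String> {
    tokio::task::spawn_blocking(move || {
        // Runs on the blocking thread pool; the async runtime stays responsive.
        transcribe_in_process(&handle, &wav_path)
    })
    .await
    .map_err(|e| format!("blocking task panicked or was cancelled: {e}"))?
}

// Stand-ins so the sketch compiles on its own:
struct WhisperEngineHandle;
fn transcribe_in_process(
    _handle: &WhisperEngineHandle,
    _path: &std::path::Path,
) -> Result<String, String> {
    Ok(String::new())
}
```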
Actionable comments posted: 3
🧹 Nitpick comments (2)
app/test/e2e/specs/voice-mode.spec.ts (1)
1-1: Consider removing `@ts-nocheck` or documenting why it's necessary. Blanket disabling of TypeScript checks hides legitimate type errors. If specific imports or WebdriverIO globals cause issues, use targeted `// @ts-expect-error` comments with explanations instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/test/e2e/specs/voice-mode.spec.ts` at line 1, Remove the blanket `// @ts-nocheck` from the top of voice-mode.spec.ts and either fix the underlying type errors or replace with targeted comments: use specific `// @ts-expect-error` annotations directly above the offending imports or WebdriverIO global uses (e.g., any problematic import lines or references to browser, $, $$, or other test globals) and add a short comment explaining why the ignore is necessary; then run the typechecker to ensure only the intended lines are suppressed.
src/openhuman/local_ai/service/whisper_engine.rs (1)

179-236: Add debug logs on decode rejection/error branches. `decode_wav_to_f32` returns clear errors, but critical rejection branches currently have no debug/trace breadcrumbs, which makes field diagnosis harder.

As per coding guidelines: "Use log/tracing at debug or trace level in Rust for critical checkpoints (entry/exit points, branch decisions, external calls, retries/timeouts, state transitions, error handling paths)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 179 - 236, Add debug/trace logging to decode_wav_to_f32's rejection/error branches so failures are visible; instrument the function (and the branch that calls convert_pcm_to_f32) to emit logs using the project's tracing/log macro (e.g., trace! or debug!) with contextual info like chunk_id, chunk_size, pos, audio_format, num_channels, sample_rate, bits_per_sample and the error string before each early return or Err(...) (for cases: too small file, invalid RIFF/WAVE header, malformed fmt chunk, unsupported sample_rate/channel count, and missing data chunk) so callers can correlate failures to the input bytes and parsing state.
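A small sketch of one such breadcrumb on a rejection path, assuming the `tracing` crate and a `LOG_PREFIX` convention like the one already used in this module:

```rust
use tracing::debug;

const LOG_PREFIX: &str = "[whisper_engine]"; // illustrative; mirrors the module's convention

fn check_riff_header(data: &[u8]) -> Result<(), String> {
    if data.len() < 44 || &data[0..4] != b"RIFF" || &data[8..12] != b"WAVE" {
        // Breadcrumb before the early return so field failures can be
        // correlated with the parsing state that triggered them.
        debug!("{LOG_PREFIX} rejecting input: len={}, missing RIFF/WAVE header", data.len());
        return Err("not a valid WAV file".to_string());
    }
    Ok(())
}
```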
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Around line 182-205: The Reply-toggle test is using global textExists checks
so it can be fooled by the Input toggle; change the assertions to scope to the
Reply toggle container and assert both options exist. Locate the "shows reply
mode toggle with text and voice options" test (and helpers used there:
waitForAnyText, textExists) and: 1) find the Reply toggle element/container
(e.g., locate the element that contains the 'Reply' label or use the
waitForAnyText result as the anchor), 2) query only within that container for
the 'Text' and 'Voice' option labels (use an existing scoped helper if available
or get the Reply element then search its children), and 3) replace the global
expect(hasText).toBe(true) with two scoped assertions expecting both 'Text' and
'Voice' to be present inside the Reply toggle.
In `@src/openhuman/local_ai/service/whisper_engine.rs`:
- Around line 170-173: The code reads the entire WAV into memory unbounded (the
std::fs::read call) which can OOM; add a file-size guard before allocating:
define a MAX_WAV_BYTES constant and, in the function containing the lines using
std::fs::read, check the file size via std::fs::metadata or open the file and
use metadata.len(), returning an error if it exceeds the limit, then read the
file (or read up to the limit using a capped reader) and pass the resulting
bytes to decode_wav_to_f32 and transcribe_pcm_f32; update error messages to
indicate the file was too large and reference the existing functions
decode_wav_to_f32 and transcribe_pcm_f32 in the same flow.
- Around line 244-277: The code currently uses pcm.chunks_exact(...) in the
(audio_format, bits_per_sample) match arms which silently drops trailing partial
frames; instead, before chunking check the expected frame byte size (for PCM16:
frame_bytes = 2 * num_channels; for IEEE float32: frame_bytes = 4 *
num_channels) and if pcm.len() % frame_bytes != 0 return an Err indicating
malformed/truncated PCM rather than proceeding, so replace the silent truncation
around the pcm.chunks_exact(...) usage with an explicit length-check and early
error return in the same match branches.
---
Nitpick comments:
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Line 1: Remove the blanket "// `@ts-nocheck`" from the top of voice-mode.spec.ts
and either fix the underlying type errors or replace with targeted comments: use
specific "// `@ts-expect-error`" annotations directly above the offending imports
or WebdriverIO global uses (e.g., any problematic import lines or references to
browser, $, $$, or other test globals) and add a short comment explaining why
the ignore is necessary; then run the typechecker to ensure only the intended
lines are suppressed.
In `@src/openhuman/local_ai/service/whisper_engine.rs`:
- Around line 179-236: Add debug/trace logging to decode_wav_to_f32's
rejection/error branches so failures are visible; instrument the function (and
the branch that calls convert_pcm_to_f32) to emit logs using the project's
tracing/log macro (e.g., trace! or debug!) with contextual info like chunk_id,
chunk_size, pos, audio_format, num_channels, sample_rate, bits_per_sample and
the error string before each early return or Err(...) (for cases: too small
file, invalid RIFF/WAVE header, malformed fmt chunk, unsupported
sample_rate/channel count, and missing data chunk) so callers can correlate
failures to the input bytes and parsing state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee10696c-593f-4d97-8a76-03c8d3cee230
📒 Files selected for processing (6)
- app/test/e2e/specs/voice-mode.spec.ts
- src/openhuman/local_ai/service/bootstrap.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/whisper_engine.rs
- src/openhuman/voice/ops.rs
- src/openhuman/voice/postprocess.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/voice/ops.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- src/openhuman/voice/postprocess.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/bootstrap.rs
```ts
it('shows reply mode toggle with text and voice options', async () => {
  // Ensure conversations page is loaded (re-authenticate if state was lost).
  const onConversations = await waitForAnyText(
    ['Message OpenHuman', 'Type a message', 'Reply'],
    5_000
  );
  if (!onConversations) {
    await triggerAuthDeepLink('e2e-voice-token');
    await waitForWindowVisible(25_000);
    await waitForWebView(15_000);
    await waitForAppReady(15_000);
    await completeOnboardingIfVisible();
    await waitForHome(20_000);
  }

  // The Reply toggle should be visible on the conversations page
  const hasReplyLabel = await textExists('Reply');
  expect(hasReplyLabel).toBe(true);

  // Verify both reply mode options exist
  // (There are multiple "Text" and "Voice" buttons — Input + Reply groups)
  const hasText = await textExists('Text');
  expect(hasText).toBe(true);
});
```
Second test still uses global assertions rather than scoping to the Reply toggle group.
While the re-authentication fallback (lines 188-195) improves isolation, the assertions at lines 198-204 check for 'Reply' and 'Text' anywhere on the page. This can pass even if the Reply toggle is missing (since "Text" appears in the Input toggle as well). Additionally, the test doesn't verify the "Voice" option exists in the Reply toggle.
Scope assertions to the Reply toggle container and add the missing "Voice" check:
Proposed fix to scope assertions and add Voice check
```diff
   // The Reply toggle should be visible on the conversations page
   const hasReplyLabel = await textExists('Reply');
   expect(hasReplyLabel).toBe(true);

-  // Verify both reply mode options exist
-  // (There are multiple "Text" and "Voice" buttons — Input + Reply groups)
-  const hasText = await textExists('Text');
-  expect(hasText).toBe(true);
+  // Verify both reply mode options exist.
+  // Note: There are multiple "Text" and "Voice" buttons (Input + Reply groups),
+  // so this assertion confirms at least one of each exists on the page.
+  // Ideally, scope to the Reply toggle container for precision.
+  const hasTextOption = await textExists('Text');
+  expect(hasTextOption).toBe(true);
+
+  const hasVoiceOption = await textExists('Voice');
+  expect(hasVoiceOption).toBe(true);
 });
});
```

Based on learnings: "Ensure each E2E spec is runnable in isolation without dependencies on other specs".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
it('shows reply mode toggle with text and voice options', async () => {
  // Ensure conversations page is loaded (re-authenticate if state was lost).
  const onConversations = await waitForAnyText(
    ['Message OpenHuman', 'Type a message', 'Reply'],
    5_000
  );
  if (!onConversations) {
    await triggerAuthDeepLink('e2e-voice-token');
    await waitForWindowVisible(25_000);
    await waitForWebView(15_000);
    await waitForAppReady(15_000);
    await completeOnboardingIfVisible();
    await waitForHome(20_000);
  }

  // The Reply toggle should be visible on the conversations page
  const hasReplyLabel = await textExists('Reply');
  expect(hasReplyLabel).toBe(true);

  // Verify both reply mode options exist.
  // Note: There are multiple "Text" and "Voice" buttons (Input + Reply groups),
  // so this assertion confirms at least one of each exists on the page.
  // Ideally, scope to the Reply toggle container for precision.
  const hasTextOption = await textExists('Text');
  expect(hasTextOption).toBe(true);

  const hasVoiceOption = await textExists('Voice');
  expect(hasVoiceOption).toBe(true);
});
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/test/e2e/specs/voice-mode.spec.ts` around lines 182 - 205, The
Reply-toggle test is using global textExists checks so it can be fooled by the
Input toggle; change the assertions to scope to the Reply toggle container and
assert both options exist. Locate the "shows reply mode toggle with text and
voice options" test (and helpers used there: waitForAnyText, textExists) and: 1)
find the Reply toggle element/container (e.g., locate the element that contains
the 'Reply' label or use the waitForAnyText result as the anchor), 2) query only
within that container for the 'Text' and 'Voice' option labels (use an existing
scoped helper if available or get the Reply element then search its children),
and 3) replace the global expect(hasText).toBe(true) with two scoped assertions
expecting both 'Text' and 'Voice' to be present inside the Reply toggle.
```rust
let raw_bytes = std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;

let audio_f32 = decode_wav_to_f32(&raw_bytes)?;
transcribe_pcm_f32(handle, &audio_f32, language)
```
Bound WAV input size before reading into memory.
Line 170 reads the full file with no size cap. A very large WAV can cause high memory pressure/OOM in the RPC path.
💡 Suggested fix
```diff
 pub fn transcribe_wav_file(
     handle: &WhisperEngineHandle,
     wav_path: &Path,
     language: Option<&str>,
 ) -> Result<String, String> {
     debug!("{LOG_PREFIX} reading WAV file: {}", wav_path.display());
-    let raw_bytes = std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;
+    const MAX_WAV_BYTES: u64 = 25 * 1024 * 1024; // 25 MiB safety cap
+    let meta = std::fs::metadata(wav_path)
+        .map_err(|e| format!("failed to stat WAV file: {e}"))?;
+    if meta.len() > MAX_WAV_BYTES {
+        return Err(format!(
+            "WAV file too large: {} bytes (max {})",
+            meta.len(),
+            MAX_WAV_BYTES
+        ));
+    }
+    let raw_bytes =
+        std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;
     let audio_f32 = decode_wav_to_f32(&raw_bytes)?;
     transcribe_pcm_f32(handle, &audio_f32, language)
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
    debug!("{LOG_PREFIX} reading WAV file: {}", wav_path.display());
    const MAX_WAV_BYTES: u64 = 25 * 1024 * 1024; // 25 MiB safety cap
    let meta = std::fs::metadata(wav_path)
        .map_err(|e| format!("failed to stat WAV file: {e}"))?;
    if meta.len() > MAX_WAV_BYTES {
        return Err(format!(
            "WAV file too large: {} bytes (max {})",
            meta.len(),
            MAX_WAV_BYTES
        ));
    }
    let raw_bytes =
        std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;
    let audio_f32 = decode_wav_to_f32(&raw_bytes)?;
    transcribe_pcm_f32(handle, &audio_f32, language)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 170 - 173, The
code reads the entire WAV into memory unbounded (the std::fs::read call) which
can OOM; add a file-size guard before allocating: define a MAX_WAV_BYTES
constant and, in the function containing the lines using std::fs::read, check
the file size via std::fs::metadata or open the file and use metadata.len(),
returning an error if it exceeds the limit, then read the file (or read up to
the limit using a capped reader) and pass the resulting bytes to
decode_wav_to_f32 and transcribe_pcm_f32; update error messages to indicate the
file was too large and reference the existing functions decode_wav_to_f32 and
transcribe_pcm_f32 in the same flow.
```rust
match (audio_format, bits_per_sample) {
    // PCM 16-bit
    (1, 16) => {
        let samples: Vec<i16> = pcm
            .chunks_exact(2)
            .map(|c| i16::from_le_bytes([c[0], c[1]]))
            .collect();

        let mono = if num_channels == 2 {
            samples
                .chunks_exact(2)
                .map(|pair| ((pair[0] as i32 + pair[1] as i32) / 2) as i16)
                .collect::<Vec<_>>()
        } else {
            samples
        };

        Ok(mono.iter().map(|&s| s as f32 / 32768.0).collect())
    }
    // IEEE float 32-bit
    (3, 32) => {
        let samples: Vec<f32> = pcm
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();

        if num_channels == 2 {
            Ok(samples
                .chunks_exact(2)
                .map(|pair| (pair[0] + pair[1]) / 2.0)
                .collect())
        } else {
            Ok(samples)
        }
```
Reject malformed PCM frame alignment instead of silently truncating.
Both PCM branches can drop trailing partial frame data (chunks_exact behavior), which accepts malformed input and can skew transcripts.
💡 Suggested fix
```diff
 fn convert_pcm_to_f32(
     pcm: &[u8],
     audio_format: u16,
     num_channels: u16,
     bits_per_sample: u16,
 ) -> Result<Vec<f32>, String> {
     match (audio_format, bits_per_sample) {
         // PCM 16-bit
         (1, 16) => {
+            let frame_bytes = 2usize
+                .checked_mul(num_channels as usize)
+                .ok_or_else(|| "invalid channel count".to_string())?;
+            if frame_bytes == 0 || pcm.len() % frame_bytes != 0 {
+                return Err("malformed PCM16 WAV data (partial frame)".to_string());
+            }
             let samples: Vec<i16> = pcm
                 .chunks_exact(2)
                 .map(|c| i16::from_le_bytes([c[0], c[1]]))
                 .collect();
@@
         // IEEE float 32-bit
         (3, 32) => {
+            let frame_bytes = 4usize
+                .checked_mul(num_channels as usize)
+                .ok_or_else(|| "invalid channel count".to_string())?;
+            if frame_bytes == 0 || pcm.len() % frame_bytes != 0 {
+                return Err("malformed PCM32 WAV data (partial frame)".to_string());
+            }
             let samples: Vec<f32> = pcm
                 .chunks_exact(4)
                 .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                 .collect();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 244 - 277, The
code currently uses pcm.chunks_exact(...) in the (audio_format, bits_per_sample)
match arms which silently drops trailing partial frames; instead, before
chunking check the expected frame byte size (for PCM16: frame_bytes = 2 *
num_channels; for IEEE float32: frame_bytes = 4 * num_channels) and if pcm.len()
% frame_bytes != 0 return an Err indicating malformed/truncated PCM rather than
proceeding, so replace the silent truncation around the pcm.chunks_exact(...)
usage with an explicit length-check and early error return in the same match
branches.
Summary
- New `src/openhuman/voice/` domain module extracting speech-to-text (whisper.cpp) and text-to-speech (piper) from the `local_ai` module into its own RPC namespace (`openhuman.voice_*`)
- New `voice_status` endpoint that checks binary/model availability so the UI can show clear errors when voice mode can't work instead of failing silently
- Frontend switched to the `voice_*` endpoints, with status feedback when switching to voice input mode

Changes
New files
- `src/openhuman/voice/mod.rs` — Module declaration and re-exports
- `src/openhuman/voice/types.rs` — DTOs: `VoiceSpeechResult`, `VoiceTtsResult`, `VoiceStatus`
- `src/openhuman/voice/ops.rs` — Business logic: `voice_status`, `voice_transcribe`, `voice_transcribe_bytes`, `voice_tts`
- `src/openhuman/voice/schemas.rs` — 4 RPC endpoint schemas + handlers
- `app/test/e2e/specs/voice-mode.spec.ts` — E2E spec for voice mode UI integration

Modified files
- `src/openhuman/local_ai/mod.rs` — Made `paths` and `model_ids` `pub(crate)` for cross-module access
- `src/openhuman/mod.rs` — Added `pub mod voice`
- `src/core/all.rs` — Registered voice controllers/schemas + namespace description
- `app/src/utils/tauriCommands.ts` — Added voice RPC wrapper functions and types
- `app/src/pages/Conversations.tsx` — Switched to `voice_*` endpoints + added status check on mode switch

Test plan
- Unit tests (including `voice_status_returns_availability`) — passing
- E2E: `bash app/scripts/e2e-run-spec.sh test/e2e/specs/voice-mode.spec.ts voice-mode`

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Style
Tests
Chores