feat(voice): dedicated voice assistance module for STT/TTS #178
senamakel merged 12 commits into tinyhumansai:main from
Conversation
Extracts speech-to-text (whisper.cpp) and text-to-speech (piper) into a dedicated `src/openhuman/voice/` domain module with its own RPC namespace (`openhuman.voice_*`). Adds proactive availability checking via `voice_status` so the UI can show clear errors when binaries/models are missing instead of failing silently at transcription time.

- New module: voice/types.rs, voice/ops.rs, voice/schemas.rs, voice/mod.rs
- 4 RPC endpoints: voice_status, voice_transcribe, voice_transcribe_bytes, voice_tts
- 21 unit tests + 1 integration test (json_rpc_e2e)
- Frontend updated to use voice_* endpoints with status check on mode switch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
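A hedged sketch of the proactive availability check described above; the field and helper names here are illustrative, not the module's actual API:

```rust
use std::path::PathBuf;

// Illustrative status DTO: report what is missing up front so the UI can
// explain why voice mode won't work instead of failing at transcription time.
struct VoiceStatus {
    stt_available: bool,
    tts_available: bool,
    whisper_binary: Option<PathBuf>,
    piper_binary: Option<PathBuf>,
}

fn which(bin: &str) -> Option<PathBuf> {
    // Minimal PATH lookup; a real implementation may also honor env overrides
    // such as WHISPER_BIN (as the tests in this PR do).
    std::env::var_os("PATH").and_then(|paths| {
        std::env::split_paths(&paths)
            .map(|dir| dir.join(bin))
            .find(|candidate| candidate.is_file())
    })
}

fn voice_status() -> VoiceStatus {
    let whisper_binary = which("whisper-cli");
    let piper_binary = which("piper");
    VoiceStatus {
        stt_available: whisper_binary.is_some(),
        tts_available: piper_binary.is_some(),
        whisper_binary,
        piper_binary,
    }
}
```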
Tests switching to voice input mode, verifying status check fires, recording button renders, and switching back to text mode restores text input. Also checks reply mode toggle visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

Adds a new Voice domain with STT (in-process Whisper via whisper-rs and subprocess fallback), TTS, RPC bindings, frontend UI/command wiring, LLM-based transcription cleanup, CI/tooling updates (cmake), and new unit/e2e tests integrating voice flows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Frontend (Client)
    participant RPC as RPC Layer
    participant VoiceOps as openhuman::voice ops
    participant LocalAI as LocalAiService
    participant Whisper as Whisper Engine
    Client->>RPC: openhumanVoiceTranscribeBytes(bytes, ext, context)
    RPC->>VoiceOps: voice_transcribe_bytes(params)
    VoiceOps->>VoiceOps: write bytes -> temp file
    VoiceOps->>LocalAI: transcribe(temp_path)
    alt whisper_in_process enabled
        LocalAI->>Whisper: is_loaded(handle)?
        alt engine loaded
            Whisper->>Whisper: transcribe_wav_file(path)
            Whisper-->>LocalAI: raw_text
        else not loaded
            LocalAI->>LocalAI: transcribe_subprocess(path)
            LocalAI-->>LocalAI: raw_text
        end
    else in-process disabled
        LocalAI->>LocalAI: transcribe_subprocess(path)
        LocalAI-->>LocalAI: raw_text
    end
    LocalAI-->>VoiceOps: LocalAiSpeechResult{raw_text, model_id}
    VoiceOps->>VoiceOps: cleanup_transcription(raw_text, context) (optional)
    VoiceOps-->>RPC: VoiceSpeechResult{text, raw_text, model_id}
    RPC-->>Client: RPC response
```
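A condensed sketch of the byte-level flow above, using the `tempfile` crate for the temp-file step; the function names are illustrative:

```rust
use std::io::Write;

// Write the uploaded bytes to a temp file, transcribe, then clean up.
async fn voice_transcribe_bytes(bytes: &[u8], ext: &str) -> Result<String, String> {
    let mut tmp = tempfile::Builder::new()
        .suffix(&format!(".{ext}"))
        .tempfile()
        .map_err(|e| format!("temp file: {e}"))?;
    tmp.write_all(bytes).map_err(|e| format!("write: {e}"))?;

    // NamedTempFile deletes itself on drop, so cleanup happens even on error.
    let raw_text = transcribe(tmp.path()).await?;
    Ok(raw_text)
}

async fn transcribe(_path: &std::path::Path) -> Result<String, String> {
    Ok(String::new()) // stand-in for the whisper call
}
```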
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- Add whisper-rs (0.16) for in-process whisper.cpp inference, eliminating cold-start latency from subprocess-per-call (~1-3s) to warm inference (~50ms). Model is loaded once during bootstrap and reused across calls. Falls back to whisper-cli subprocess if in-process loading fails.
- Add LLM post-processing layer that passes raw transcription through Ollama to fix grammar, punctuation, and filler words. Accepts optional conversation context to disambiguate names and technical terms. Gracefully degrades to raw whisper output if Ollama is unavailable.
- Update voice RPC endpoints with new optional params (context, skip_cleanup) and return both cleaned text and raw_text.
- Update frontend to pass conversation history as context for voice transcription cleanup, and update TypeScript interfaces to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
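A minimal sketch of the load-once shape, using the standard library's `OnceLock`; the model path and types are illustrative, and the PR loads the model during bootstrap rather than necessarily via this mechanism:

```rust
use std::sync::OnceLock;

struct WhisperEngine; // stand-in for the whisper-rs model context

impl WhisperEngine {
    fn load(_model_path: &str) -> Result<Self, String> {
        Ok(WhisperEngine) // the expensive model load happens exactly once here
    }
}

static ENGINE: OnceLock<Result<WhisperEngine, String>> = OnceLock::new();

// Later calls skip the load entirely and hit the warm engine (the ~50ms path).
fn engine() -> Result<&'static WhisperEngine, String> {
    ENGINE
        .get_or_init(|| WhisperEngine::load("ggml-base.en.bin"))
        .as_ref()
        .map_err(|e| e.clone())
}
```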
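And a hedged sketch of the degrade-to-raw cleanup described above; the `OllamaClient` stub and flag name are assumptions, not the project's actual API:

```rust
// Hypothetical minimal client standing in for the real Ollama integration.
struct OllamaClient;

impl OllamaClient {
    async fn generate(&self, _prompt: &str) -> Result<String, String> {
        Err("ollama unavailable".to_string()) // stub
    }
}

async fn cleanup_transcription(
    cleanup_enabled: bool,
    raw_text: &str,
    context: Option<&str>,
    ollama: &OllamaClient,
) -> String {
    if !cleanup_enabled || raw_text.trim().is_empty() {
        return raw_text.to_string(); // disabled by config, or nothing to clean
    }
    let prompt = match context {
        Some(ctx) => format!(
            "Fix grammar, punctuation, and filler words.\nContext:\n{ctx}\n\nTranscript:\n{raw_text}"
        ),
        None => format!("Fix grammar, punctuation, and filler words:\n{raw_text}"),
    };
    match ollama.generate(&prompt).await {
        Ok(cleaned) if !cleaned.trim().is_empty() => cleaned,
        // Graceful degradation: Ollama down or empty reply keeps raw whisper output.
        _ => raw_text.to_string(),
    }
}
```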
The whisper-rs crate requires cmake to compile whisper.cpp from source, which is not available in the CI environment. Move it behind an optional cargo feature so CI builds succeed without cmake. The whisper_engine module now compiles as a no-op stub when the feature is disabled, returning "whisper feature not compiled in" errors. Desktop builds can opt in with `--features whisper`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
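A minimal sketch of the stub pattern this commit describes, assuming the `whisper` feature name from the commit message (note the next commit reverts this approach in favor of installing cmake in CI):

```rust
use std::path::Path;

// Compiled only when the `whisper` feature (and thus whisper-rs/cmake) is enabled.
#[cfg(feature = "whisper")]
pub fn transcribe_wav_file(path: &Path) -> Result<String, String> {
    // ... the real whisper-rs-backed implementation would live here ...
    let _ = path;
    Err("sketch only".to_string())
}

// The no-op stub keeps the same signature so callers still compile without cmake.
#[cfg(not(feature = "whisper"))]
pub fn transcribe_wav_file(_path: &Path) -> Result<String, String> {
    Err("whisper feature not compiled in".to_string())
}
```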
Revert whisper-rs from optional to mandatory dependency. Add cmake installation to all CI workflows (build, typecheck, test, release) and the CI Docker image so whisper-rs can compile whisper.cpp from source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 8
🧹 Nitpick comments (5)
app/src/components/settings/panels/LocalModelPanel.tsx (1)
1132-1137: UI improvements look good, but the file exceeds the size guideline. The transcript display changes are sensible improvements (better spacing and labeled output). However, this file is ~1179 lines, exceeding the ≤500 line guideline. Consider splitting into smaller components in a future refactor.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/settings/panels/LocalModelPanel.tsx` around lines 1132 - 1137, The file LocalModelPanel.tsx has grown too large (~1179 lines) and should be split into smaller components; extract the transcript display block that references transcribeOutput (the Model: {transcribeOutput.model_id} and the <pre> showing transcribeOutput.text) into a new presentational component (e.g., TranscriptView) and optionally extract model header/info into ModelInfo, moving any related handler/state or helper functions into their own files or into a container component; update LocalModelPanel to import and render these new components (keeping prop names like transcribeOutput) so the main file size is reduced while behavior remains identical.
Cargo.toml (1)

89-89: Consider making `whisper-rs` optional. The `whisper-rs` dependency adds native whisper.cpp bindings, increasing binary size and compile time for all users. Since in-process Whisper is controlled by a config flag (`whisper_in_process`), making this an optional feature would reduce build overhead for users who don't need it.

Optional refactor approach

```diff
 [dependencies]
 # ... other deps ...
-whisper-rs = "0.16"
+whisper-rs = { version = "0.16", optional = true }

 [features]
 # ... existing features ...
+whisper-in-process = ["dep:whisper-rs"]
```

Then conditionally compile the whisper engine code with `#[cfg(feature = "whisper-in-process")]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Cargo.toml` at line 89, Make the whisper-rs dependency optional and gate the in-process whisper code behind a feature named "whisper-in-process": mark the dependency whisper-rs as optional in Cargo.toml (optional = true) and add a "whisper-in-process" feature in the [features] table that enables that optional dependency; then wrap the in-process engine code paths with #[cfg(feature = "whisper-in-process")] (and #[cfg(not(feature = "whisper-in-process"))] fallbacks where needed) for modules/functions that reference whisper-rs (e.g., the whisper engine initialization, any create_whisper_client / WhisperEngine / transcribe_* functions) so builds that don't opt into the feature won't pull in whisper-rs.
src/openhuman/voice/postprocess.rs (1)

83-101: Consider adding a test for the config-disabled path. The current tests cover the empty/whitespace early-return paths. Consider adding a test that verifies behavior when `voice_llm_cleanup_enabled` is explicitly `false` (config-disabled path) with non-empty input, to ensure the raw text is returned unchanged.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/voice/postprocess.rs` around lines 83 - 101, Add a test that asserts cleanup_transcription returns the raw input when the config disables cleanup: create a Config with voice_llm_cleanup_enabled set to false (keeping other defaults), call cleanup_transcription(&config, "some non-empty text", None) inside a tokio runtime (similar to the existing tests), and assert the result equals the original string; reference the Config struct and the cleanup_transcription function to locate where to add the test.
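A minimal sketch of such a test, assuming a `Config` with a `voice_llm_cleanup_enabled` field and the `cleanup_transcription(&Config, &str, Option<&str>)` signature described in this review; names mirror the prompt, not verified project code:

```rust
#[tokio::test]
async fn cleanup_disabled_returns_raw_text() {
    let mut config = Config::default();
    config.voice_llm_cleanup_enabled = false;

    let raw = "um so like this is the raw transcript";
    let result = cleanup_transcription(&config, raw, None).await;

    // With cleanup disabled, the raw whisper output must pass through unchanged.
    assert_eq!(result, raw);
}
```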
src/openhuman/local_ai/service/speech.rs (1)

29-43: Minor TOCTOU race condition between `is_loaded()` and `transcribe_in_process()`. The engine state could change between the `is_loaded()` check (line 29) and the actual transcription call inside `transcribe_in_process()`. If the engine is unloaded between these calls, transcription will fail.

However, the current implementation handles this gracefully by falling back to the subprocess path on failure (lines 39-41), so the impact is limited to an unnecessary fallback rather than a crash or data corruption. This is acceptable for the current use case.
If stricter consistency is needed in the future (e.g., avoiding retry overhead), consider acquiring the lock once and holding it through the transcription, or using a try-transcribe pattern that returns a specific error for "not loaded."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/local_ai/service/speech.rs` around lines 29 - 43, There is a TOCTOU race between whisper_engine::is_loaded(&self.whisper) and calling self.transcribe_in_process(audio_path) which can let the engine unload between the two calls; to fix this either perform the "is loaded and transcribe" atomically by acquiring whatever internal lock or guard the engine provides (or hold the same self.whisper lock) while calling transcribe_in_process, or implement a try_transcribe method on the whisper engine that attempts transcription and returns a specific NotLoaded error so transcribe_in_process can detect it without a separate is_loaded check; update the branch that currently checks whisper_engine::is_loaded, calling the atomic/try_transcribe approach and keep the existing fallback behavior on real transcription errors while avoiding the separate is_loaded call.
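A hedged sketch of the try-transcribe variant; the types here are illustrative stand-ins, and the real handle in this codebase may differ:

```rust
use std::sync::Mutex;

struct WhisperEngine; // stand-in for the whisper-rs context

impl WhisperEngine {
    fn transcribe(&self, _wav_path: &str) -> Result<String, String> {
        Ok(String::new()) // placeholder for the real whisper call
    }
}

enum TranscribeError {
    NotLoaded,      // engine never loaded, or unloaded since
    Failed(String), // a real transcription failure
}

struct WhisperEngineHandle {
    inner: Mutex<Option<WhisperEngine>>, // None until load succeeds
}

impl WhisperEngineHandle {
    // Holding the lock across the whole call makes "is loaded" and
    // "transcribe" atomic, so the separate is_loaded() pre-check goes away.
    fn try_transcribe(&self, wav_path: &str) -> Result<String, TranscribeError> {
        let guard = self.inner.lock().expect("engine lock poisoned");
        match guard.as_ref() {
            None => Err(TranscribeError::NotLoaded),
            Some(engine) => engine
                .transcribe(wav_path)
                .map_err(TranscribeError::Failed),
        }
    }
}
```

The caller matches on `NotLoaded` to take the subprocess fallback and treats `Failed` as a genuine error, preserving the existing fallback behavior without the racy pre-check.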
app/test/e2e/specs/voice-mode.spec.ts (1)

136-139: Disambiguate the Input-mode toggle clicks. There are two `Voice` buttons and two `Text` buttons on this screen, so `clickText(...)` will hit whichever match is surfaced first. That makes this flow brittle and can flip Reply mode instead of Input mode if the accessibility tree order changes. Scope the click to the Input toggle group, or use a dedicated toggle helper/test id here.

As per coding guidelines: "Use clickNativeButton(), hasAppChrome(), waitForWebView(), clickToggle() and other cross-platform helper functions for element interaction in E2E specs".
Also applies to: 178-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/test/e2e/specs/voice-mode.spec.ts` around lines 136 - 139, The test uses clickText('Voice', ...) which can target either the Input or Reply toggle; replace these ambiguous clicks with a scoped click using the Input-mode toggle helper or a test id and the cross-platform helpers: use clickToggle('input-mode') or clickNativeButton(selectorForInputToggle) / clickToggle with the Input toggle's test-id/class instead of clickText, and update both occurrences (around clickText at the first block and the similar block at lines ~178-180) so the test explicitly targets the Input toggle element.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Around line 188-197: The test "shows reply mode toggle with text and voice
options" is not independent and only checks for a global "Text" label; make it
runnable in isolation by performing the same auth/navigation/setup that earlier
tests do (or move that setup to a shared beforeEach used by this spec) so the
conversations page is loaded, then scope DOM queries to the Reply toggle group
instead of global textExists calls: locate the Reply toggle container (e.g.,
query by accessible name "Reply" or a unique selector for the reply-toggle
component) and assert that both "Text" and "Voice" options exist inside that
container (use the local container.querySelector / getByText on the container or
equivalent helpers rather than global textExists).
- Around line 162-177: The test currently only logs when neither the voice CTA
nor the unavailable message appears; update the block that checks
waitForAnyText(...) (variable hasVoiceButton) and textExists('Speech-to-text
unavailable') (variable hasStatus) so that if both are false the spec fails the
test—e.g., replace the console logs with an assertion or throw an Error
indicating "Voice mode did not render CTA or unavailable state" inside the same
conditional branch; keep the existing checks for Start Talking/Transcribing/Stop
& Send and Speech-to-text unavailable and only assert when both checks fail.
In `@src/openhuman/local_ai/service/whisper_engine.rs`:
- Around line 192-215: The WAV parser currently reads sample rate (_sample_rate)
and num_channels but never enforces them, so add a validation after parsing the
"fmt " chunk and before calling convert_pcm_to_f32: check that _sample_rate ==
16000 and that num_channels == 1 (or explicitly reject num_channels > 2 if you
want to allow stereo only) and return Err(...) with a clear message like
"unsupported sample rate" or "unsupported channel count" if the checks fail;
apply the same validation in the other parsing branch handling fmt/data (the
other block around the 228-268 region) so files with 44.1kHz or >2 channels fail
fast instead of being fed to Whisper.
- Around line 33-55: The load_engine function (and other blocking operations
like transcribe_in_process) perform heavy CPU/filesystem work on Tokio worker
threads; update the async call sites (handle_local_ai_transcribe,
handle_voice_transcribe) to offload these operations by calling
tokio::task::spawn_blocking (or tokio::task::block_in_place inside an async
context) and run load_engine/WhisperContext::new_with_params and
transcribe_in_process inside that blocking closure, ensuring
WhisperEngineHandle/engine initialization is moved back onto the Tokio runtime
only after the blocking work completes; reference the functions load_engine,
transcribe_in_process and the RPC handlers handle_local_ai_transcribe and
handle_voice_transcribe when applying the change.
In `@src/openhuman/voice/ops.rs`:
- Around line 265-287: Test voice_status_detects_stub_binaries mutates the
global WHISPER_BIN env var unsafely; change it to save the prior value with
std::env::var_os("WHISPER_BIN"), set the stub with std::env::set_var, and
restore the original value in a drop-safe guard (RAII) so restore happens even
on panic; additionally protect access with a test-wide synchronization primitive
(e.g., a static Mutex or use serial_test) to prevent concurrent tests from
racing, and keep references to the test helpers (voice_status, tempfile stub
creation) intact while replacing the unconditional std::env::remove_var call
with the guarded restore.
- Around line 38-43: The debug logs currently print full filesystem paths
(whisper_bin, piper_bin, stt_model, tts_voice) which may contain PII; sanitize
them by logging only booleans or safe basenames instead of full paths. Update
the debug! invocations that use LOG_PREFIX and the variables whisper_bin,
piper_bin, stt_model, tts_voice (and the other occurrences you noted around the
file) to map Option<PathBuf>/PathBuf values to their file_name().and_then(|n|
n.to_str()).unwrap_or("redacted") or to a short tag like "<redacted>" before
formatting, and keep the original boolean flags (stt_available, tts_available,
whisper_in_process) as-is so the messages convey state without exposing absolute
paths.
- Around line 131-140: The temp-audio cleanup currently ignores errors from
tokio::fs::remove_file(&file_path), so transient failures can leave recordings
behind; update the code around file_path, service.transcribe, and the
remove_file call to ensure cleanup failures are observed: either use a
tempfile-based auto-cleaning path (e.g., tempfile::NamedTempFile or
tempfile::Builder) for automatic deletion, or capture the Result from
tokio::fs::remove_file and log any Err using the project's tracing/log (use
debug/trace for success checkpoints and error for failures) and include
file_path display in the log; also ensure remove_file is attempted in a
finally-like block (after service.transcribe completes or on its Err) so cleanup
always runs.
In `@src/openhuman/voice/schemas.rs`:
- Around line 83-140: The ControllerSchema outputs for "voice_status",
"voice_transcribe", "voice_transcribe_bytes", and "voice_tts" claim wrapped
top-level keys ("status", "speech", "tts") but your handlers' to_json (or the
DTO serialization) emits the inner DTO directly; update either the schemas or
the serialization so they match: either change the ControllerSchema.outputs
entries to list the actual DTO fields returned by the handlers, or modify the
handlers / to_json logic to wrap the returned payload under the declared keys
(e.g., return {"status": <dto>} / {"speech": <dto>} / {"tts": <dto>}); ensure
the change is applied consistently for the same pattern referenced around lines
215-218 as well.
---
Nitpick comments:
In `@app/src/components/settings/panels/LocalModelPanel.tsx`:
- Around line 1132-1137: The file LocalModelPanel.tsx has grown too large (~1179
lines) and should be split into smaller components; extract the transcript
display block that references transcribeOutput (the Model:
{transcribeOutput.model_id} and the <pre> showing transcribeOutput.text) into a
new presentational component (e.g., TranscriptView) and optionally extract model
header/info into ModelInfo, moving any related handler/state or helper functions
into their own files or into a container component; update LocalModelPanel to
import and render these new components (keeping prop names like
transcribeOutput) so the main file size is reduced while behavior remains
identical.
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Around line 136-139: The test uses clickText('Voice', ...) which can target
either the Input or Reply toggle; replace these ambiguous clicks with a scoped
click using the Input-mode toggle helper or a test id and the cross-platform
helpers: use clickToggle('input-mode') or
clickNativeButton(selectorForInputToggle) / clickToggle with the Input toggle's
test-id/class instead of clickText, and update both occurrences (around
clickText at the first block and the similar block at lines ~178-180) so the
test explicitly targets the Input toggle element.
In `@Cargo.toml`:
- Line 89: Make the whisper-rs dependency optional and gate the in-process
whisper code behind a feature named "whisper-in-process": mark the dependency
whisper-rs as optional in Cargo.toml (optional = true) and add a
"whisper-in-process" feature in the [features] table that enables that optional
dependency; then wrap the in-process engine code paths with #[cfg(feature =
"whisper-in-process")] (and #[cfg(not(feature = "whisper-in-process"))]
fallbacks where needed) for modules/functions that reference whisper-rs (e.g.,
the whisper engine initialization, any create_whisper_client / WhisperEngine /
transcribe_* functions) so builds that don't opt into the feature won’t pull in
whisper-rs.
In `@src/openhuman/local_ai/service/speech.rs`:
- Around line 29-43: There is a TOCTOU race between
whisper_engine::is_loaded(&self.whisper) and calling
self.transcribe_in_process(audio_path) which can let the engine unload between
the two calls; to fix this either perform the "is loaded and transcribe"
atomically by acquiring whatever internal lock or guard the engine provides (or
hold the same self.whisper lock) while calling transcribe_in_process, or
implement a try_transcribe method on the whisper engine that attempts
transcription and returns a specific NotLoaded error so transcribe_in_process
can detect it without a separate is_loaded check; update the branch that
currently checks whisper_engine::is_loaded, calling the atomic/try_transcribe
approach and keep the existing fallback behavior on real transcription errors
while avoiding the separate is_loaded call.
In `@src/openhuman/voice/postprocess.rs`:
- Around line 83-101: Add a test that asserts cleanup_transcription returns the
raw input when the config disables cleanup: create a Config with
voice_llm_cleanup_enabled set to false (keeping other defaults), call
cleanup_transcription(&config, "some non-empty text", None) inside a tokio
runtime (similar to the existing tests), and assert the result equals the
original string; reference the Config struct and the cleanup_transcription
function to locate where to add the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1ca30e83-412c-42ab-8252-9e36690ffe20
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (20)
- Cargo.toml
- app/src/components/settings/panels/LocalModelPanel.tsx
- app/src/pages/Conversations.tsx
- app/src/utils/tauriCommands.ts
- app/test/e2e/specs/voice-mode.spec.ts
- src/core/all.rs
- src/openhuman/config/schema/local_ai.rs
- src/openhuman/local_ai/mod.rs
- src/openhuman/local_ai/service/bootstrap.rs
- src/openhuman/local_ai/service/mod.rs
- src/openhuman/local_ai/service/public_infer.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/whisper_engine.rs
- src/openhuman/mod.rs
- src/openhuman/voice/mod.rs
- src/openhuman/voice/ops.rs
- src/openhuman/voice/postprocess.rs
- src/openhuman/voice/schemas.rs
- src/openhuman/voice/types.rs
- tests/json_rpc_e2e.rs
```ts
// --- Verify the voice recording button is visible ---
// In voice mode, we should see "Start Talking" (or the unavailability message)
const hasVoiceButton = await waitForAnyText(
  ['Start Talking', 'Transcribing', 'Stop & Send'],
  10_000
);
// The button should be present even if STT is unavailable (it will be disabled)
if (!hasVoiceButton) {
  console.log('[VoiceModeE2E] Voice button not found, checking for status messages...');
  // If whisper is not available, the button may still be present but the status message dominates
  const hasStatus = await textExists('Speech-to-text unavailable');
  if (hasStatus) {
    console.log('[VoiceModeE2E] STT unavailable message displayed — voice mode active.');
  }
}
```
Fail the test when voice mode renders neither the CTA nor the unavailable state.
This branch only logs when the voice control is missing, so the spec still goes green even if voice mode shows neither a record button nor the expected unavailable UI. Please turn one of those outcomes into an assertion.
Based on learnings "Assert both UI outcomes and backend/mock effects in E2E specs when relevant to ensure full-stack correctness".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/test/e2e/specs/voice-mode.spec.ts` around lines 162 - 177, The test
currently only logs when neither the voice CTA nor the unavailable message
appears; update the block that checks waitForAnyText(...) (variable
hasVoiceButton) and textExists('Speech-to-text unavailable') (variable
hasStatus) so that if both are false the spec fails the test—e.g., replace the
console logs with an assertion or throw an Error indicating "Voice mode did not
render CTA or unavailable state" inside the same conditional branch; keep the
existing checks for Start Talking/Transcribing/Stop & Send and Speech-to-text
unavailable and only assert when both checks fail.
```rust
let mut _sample_rate: u32 = 0;
let mut bits_per_sample: u16 = 0;

while pos + 8 <= data.len() {
    let chunk_id = &data[pos..pos + 4];
    let chunk_size =
        u32::from_le_bytes([data[pos + 4], data[pos + 5], data[pos + 6], data[pos + 7]])
            as usize;

    if chunk_id == b"fmt " {
        if chunk_size < 16 || pos + 8 + chunk_size > data.len() {
            return Err("malformed fmt chunk".to_string());
        }
        let fmt = &data[pos + 8..];
        audio_format = u16::from_le_bytes([fmt[0], fmt[1]]);
        num_channels = u16::from_le_bytes([fmt[2], fmt[3]]);
        _sample_rate = u32::from_le_bytes([fmt[4], fmt[5], fmt[6], fmt[7]]);
        bits_per_sample = u16::from_le_bytes([fmt[14], fmt[15]]);
        fmt_found = true;
    }

    if chunk_id == b"data" && fmt_found {
        let pcm_data = &data[pos + 8..pos + 8 + chunk_size.min(data.len() - pos - 8)];
        return convert_pcm_to_f32(pcm_data, audio_format, num_channels, bits_per_sample);
```
Reject unsupported WAV sample rates and channel counts.
The decoder reads sample_rate and num_channels but never enforces them. A 44.1 kHz WAV or a num_channels > 2 file will be accepted and fed to Whisper as if it were valid 16 kHz mono input, which corrupts transcription instead of failing fast.
Suggested guard
```diff
- let mut _sample_rate: u32 = 0;
+ let mut sample_rate: u32 = 0;
  ...
- _sample_rate = u32::from_le_bytes([fmt[4], fmt[5], fmt[6], fmt[7]]);
+ sample_rate = u32::from_le_bytes([fmt[4], fmt[5], fmt[6], fmt[7]]);
  ...
- return convert_pcm_to_f32(pcm_data, audio_format, num_channels, bits_per_sample);
+ if sample_rate != 16_000 {
+     return Err(format!("unsupported WAV sample rate: {sample_rate}"));
+ }
+ if num_channels != 1 && num_channels != 2 {
+     return Err(format!("unsupported WAV channel count: {num_channels}"));
+ }
+ return convert_pcm_to_f32(pcm_data, audio_format, num_channels, bits_per_sample);
```

Also applies to: 228-268
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 192 - 215, The
WAV parser currently reads sample rate (_sample_rate) and num_channels but never
enforces them, so add a validation after parsing the "fmt " chunk and before
calling convert_pcm_to_f32: check that _sample_rate == 16000 and that
num_channels == 1 (or explicitly reject num_channels > 2 if you want to allow
stereo only) and return Err(...) with a clear message like "unsupported sample
rate" or "unsupported channel count" if the checks fail; apply the same
validation in the other parsing branch handling fmt/data (the other block around
the 228-268 region) so files with 44.1kHz or >2 channels fail fast instead of
being fed to Whisper.
```rust
async fn voice_status_detects_stub_binaries() {
    let tmp = tempfile::tempdir().expect("tempdir");

    let whisper_stub = tmp.path().join("whisper-cli");
    std::fs::write(&whisper_stub, b"#!/bin/sh\n").expect("write stub");
    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        std::fs::set_permissions(&whisper_stub, std::fs::Permissions::from_mode(0o755))
            .expect("chmod");
    }

    std::env::set_var("WHISPER_BIN", whisper_stub.display().to_string());

    let mut config = Config::default();
    config.workspace_dir = tmp.path().join("workspace");
    config.config_path = tmp.path().join("config.toml");

    let result = voice_status(&config).await.unwrap();
    assert!(result.value.whisper_binary.is_some());

    std::env::remove_var("WHISPER_BIN");
}
```
Serialize and restore env state in this test.
WHISPER_BIN is process-global, and this test overwrites it and then unconditionally removes it. Parallel Rust test execution can race with any other reader/writer, and even serial runs will destroy a pre-existing value if the suite or developer environment already had one set. Please guard this with shared test synchronization and restore the old value in a drop-safe way.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/ops.rs` around lines 265 - 287, Test
voice_status_detects_stub_binaries mutates the global WHISPER_BIN env var
unsafely; change it to save the prior value with
std::env::var_os("WHISPER_BIN"), set the stub with std::env::set_var, and
restore the original value in a drop-safe guard (RAII) so restore happens even
on panic; additionally protect access with a test-wide synchronization primitive
(e.g., a static Mutex or use serial_test) to prevent concurrent tests from
racing, and keep references to the test helpers (voice_status, tempfile stub
creation) intact while replacing the unconditional std::env::remove_var call
with the guarded restore.
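A drop-safe sketch of that guard; the `EnvVarGuard` type and `ENV_LOCK` are illustrative, not existing project helpers:

```rust
use std::ffi::OsString;
use std::sync::Mutex;

// Serializes all tests that touch WHISPER_BIN across parallel test threads.
static ENV_LOCK: Mutex<()> = Mutex::new(());

struct EnvVarGuard {
    key: &'static str,
    previous: Option<OsString>,
}

impl EnvVarGuard {
    fn set(key: &'static str, value: &str) -> Self {
        let previous = std::env::var_os(key);
        std::env::set_var(key, value);
        Self { key, previous }
    }
}

impl Drop for EnvVarGuard {
    // Runs even on panic, so any pre-existing value is always restored.
    fn drop(&mut self) {
        match self.previous.take() {
            Some(old) => std::env::set_var(self.key, old),
            None => std::env::remove_var(self.key),
        }
    }
}

// Usage inside the test:
// let _lock = ENV_LOCK.lock().unwrap();
// let _guard = EnvVarGuard::set("WHISPER_BIN", &whisper_stub.display().to_string());
```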
| "voice_status" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "status", | ||
| description: "Check availability of STT/TTS binaries and models.", | ||
| inputs: vec![], | ||
| outputs: vec![json_output("status", "Voice availability status.")], | ||
| }, | ||
| "voice_transcribe" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "transcribe", | ||
| description: | ||
| "Transcribe audio from a file path using whisper.cpp, with optional LLM cleanup.", | ||
| inputs: vec![ | ||
| required_string("audio_path", "Path to the audio file."), | ||
| optional_string("context", "Conversation context for LLM post-processing."), | ||
| optional_bool( | ||
| "skip_cleanup", | ||
| "Skip LLM cleanup, return raw whisper output.", | ||
| ), | ||
| ], | ||
| outputs: vec![json_output( | ||
| "speech", | ||
| "Transcription result with text and raw_text.", | ||
| )], | ||
| }, | ||
| "voice_transcribe_bytes" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "transcribe_bytes", | ||
| description: | ||
| "Transcribe audio from raw bytes using whisper.cpp, with optional LLM cleanup.", | ||
| inputs: vec![ | ||
| FieldSchema { | ||
| name: "audio_bytes", | ||
| ty: TypeSchema::Bytes, | ||
| comment: "Raw audio bytes.", | ||
| required: true, | ||
| }, | ||
| optional_string("extension", "Audio file extension (default: webm)."), | ||
| optional_string("context", "Conversation context for LLM post-processing."), | ||
| optional_bool( | ||
| "skip_cleanup", | ||
| "Skip LLM cleanup, return raw whisper output.", | ||
| ), | ||
| ], | ||
| outputs: vec![json_output( | ||
| "speech", | ||
| "Transcription result with text and raw_text.", | ||
| )], | ||
| }, | ||
| "voice_tts" => ControllerSchema { | ||
| namespace: "voice", | ||
| function: "tts", | ||
| description: "Synthesize speech from text using piper.", | ||
| inputs: vec![ | ||
| required_string("text", "Text to synthesize."), | ||
| optional_string("output_path", "Optional output file path."), | ||
| ], | ||
| outputs: vec![json_output("tts", "TTS result with output path.")], |
Align the declared outputs with the JSON you actually return.
These schemas advertise wrapped responses (status, speech, tts), but to_json serializes the inner DTO directly. Anything consuming ControllerSchema.outputs will infer the wrong top-level contract unless the handler wraps the payload under those keys or the schema lists the DTO’s real fields instead.
Also applies to: 215-218
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/voice/schemas.rs` around lines 83 - 140, The ControllerSchema
outputs for "voice_status", "voice_transcribe", "voice_transcribe_bytes", and
"voice_tts" claim wrapped top-level keys ("status", "speech", "tts") but your
handlers' to_json (or the DTO serialization) emits the inner DTO directly;
update either the schemas or the serialization so they match: either change the
ControllerSchema.outputs entries to list the actual DTO fields returned by the
handlers, or modify the handlers / to_json logic to wrap the returned payload
under the declared keys (e.g., return {"status": <dto>} / {"speech": <dto>} /
{"tts": <dto>}); ensure the change is applied consistently for the same pattern
referenced around lines 215-218 as well.
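For the wrapping option, a minimal serde/serde_json sketch; the DTO and its fields are assumptions for illustration, not the project's actual types:

```rust
use serde::Serialize;
use serde_json::{json, Value};

// Illustrative DTO; the real VoiceStatus fields live in voice/types.rs.
#[derive(Serialize)]
struct VoiceStatusDto {
    stt_available: bool,
    tts_available: bool,
}

// Wrap the payload under the schema's declared top-level key so the
// emitted JSON matches ControllerSchema.outputs ("status" here).
fn voice_status_to_json(dto: &VoiceStatusDto) -> Value {
    json!({ "status": dto })
}

fn main() {
    let dto = VoiceStatusDto { stt_available: true, tts_available: false };
    // Prints: {"status":{"stt_available":true,"tts_available":false}}
    println!("{}", voice_status_to_json(&dto));
}
```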
- whisper_engine: validate WAV sample rate (must be 16kHz) and channel count (1 or 2) before feeding audio to whisper
- speech: offload load_engine and transcribe_in_process to tokio::task::spawn_blocking to avoid blocking the Tokio runtime
- ops: use RAII guard for WHISPER_BIN env var in test to prevent races and ensure restore on panic; log temp file cleanup failures instead of silently ignoring; sanitize paths in debug logs to basenames only
- postprocess: add test for disabled cleanup config returning raw text
- voice-mode.spec: assert failure when neither voice CTA nor unavailable message appears; make reply mode test runnable in isolation with auth/nav setup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
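A minimal sketch of the spawn_blocking offload mentioned above, assuming a synchronous, CPU-heavy `transcribe_in_process`; the stand-in types keep it self-contained:

```rust
// Offload the CPU-bound whisper call so it doesn't stall Tokio worker threads.
async fn transcribe_offloaded(
    handle: std::sync::Arc<WhisperEngineHandle>,
    wav_path: std::path::PathBuf,
) -> Result<String, String> {
    tokio::task::spawn_blocking(move || {
        // Runs on the blocking thread pool; the async runtime stays responsive.
        transcribe_in_process(&handle, &wav_path)
    })
    .await
    .map_err(|e| format!("blocking task panicked or was cancelled: {e}"))?
}

// Stand-ins so the sketch compiles on its own:
struct WhisperEngineHandle;
fn transcribe_in_process(
    _handle: &WhisperEngineHandle,
    _path: &std::path::Path,
) -> Result<String, String> {
    Ok(String::new())
}
```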
Actionable comments posted: 3
🧹 Nitpick comments (2)
app/test/e2e/specs/voice-mode.spec.ts (1)
1-1: Consider removing `@ts-nocheck` or documenting why it's necessary. Blanket disabling of TypeScript checks hides legitimate type errors. If specific imports or WebdriverIO globals cause issues, use targeted `// @ts-expect-error` comments with explanations instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/test/e2e/specs/voice-mode.spec.ts` at line 1, Remove the blanket `// @ts-nocheck` from the top of voice-mode.spec.ts and either fix the underlying type errors or replace with targeted comments: use specific `// @ts-expect-error` annotations directly above the offending imports or WebdriverIO global uses (e.g., any problematic import lines or references to browser, $, $$, or other test globals) and add a short comment explaining why the ignore is necessary; then run the typechecker to ensure only the intended lines are suppressed.
src/openhuman/local_ai/service/whisper_engine.rs (1)

179-236: Add debug logs on decode rejection/error branches. `decode_wav_to_f32` returns clear errors, but critical rejection branches currently have no debug/trace breadcrumbs, which makes field diagnosis harder.

As per coding guidelines: "Use log/tracing at debug or trace level in Rust for critical checkpoints (entry/exit points, branch decisions, external calls, retries/timeouts, state transitions, error handling paths)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 179 - 236, Add debug/trace logging to decode_wav_to_f32's rejection/error branches so failures are visible; instrument the function (and the branch that calls convert_pcm_to_f32) to emit logs using the project's tracing/log macro (e.g., trace! or debug!) with contextual info like chunk_id, chunk_size, pos, audio_format, num_channels, sample_rate, bits_per_sample and the error string before each early return or Err(...) (for cases: too small file, invalid RIFF/WAVE header, malformed fmt chunk, unsupported sample_rate/channel count, and missing data chunk) so callers can correlate failures to the input bytes and parsing state.
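A small sketch of one such breadcrumb on a rejection path, assuming the `tracing` crate and a `LOG_PREFIX` convention like the one already used in this module:

```rust
use tracing::debug;

const LOG_PREFIX: &str = "[whisper_engine]"; // illustrative; mirrors the module's convention

fn check_riff_header(data: &[u8]) -> Result<(), String> {
    if data.len() < 44 || &data[0..4] != b"RIFF" || &data[8..12] != b"WAVE" {
        // Breadcrumb before the early return so field failures can be
        // correlated with the parsing state that triggered them.
        debug!("{LOG_PREFIX} rejecting input: len={}, missing RIFF/WAVE header", data.len());
        return Err("not a valid WAV file".to_string());
    }
    Ok(())
}
```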
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Around line 182-205: The Reply-toggle test is using global textExists checks
so it can be fooled by the Input toggle; change the assertions to scope to the
Reply toggle container and assert both options exist. Locate the "shows reply
mode toggle with text and voice options" test (and helpers used there:
waitForAnyText, textExists) and: 1) find the Reply toggle element/container
(e.g., locate the element that contains the 'Reply' label or use the
waitForAnyText result as the anchor), 2) query only within that container for
the 'Text' and 'Voice' option labels (use an existing scoped helper if available
or get the Reply element then search its children), and 3) replace the global
expect(hasText).toBe(true) with two scoped assertions expecting both 'Text' and
'Voice' to be present inside the Reply toggle.
In `@src/openhuman/local_ai/service/whisper_engine.rs`:
- Around line 170-173: The code reads the entire WAV into memory unbounded (the
std::fs::read call) which can OOM; add a file-size guard before allocating:
define a MAX_WAV_BYTES constant and, in the function containing the lines using
std::fs::read, check the file size via std::fs::metadata or open the file and
use metadata.len(), returning an error if it exceeds the limit, then read the
file (or read up to the limit using a capped reader) and pass the resulting
bytes to decode_wav_to_f32 and transcribe_pcm_f32; update error messages to
indicate the file was too large and reference the existing functions
decode_wav_to_f32 and transcribe_pcm_f32 in the same flow.
- Around line 244-277: The code currently uses pcm.chunks_exact(...) in the
(audio_format, bits_per_sample) match arms which silently drops trailing partial
frames; instead, before chunking check the expected frame byte size (for PCM16:
frame_bytes = 2 * num_channels; for IEEE float32: frame_bytes = 4 *
num_channels) and if pcm.len() % frame_bytes != 0 return an Err indicating
malformed/truncated PCM rather than proceeding, so replace the silent truncation
around the pcm.chunks_exact(...) usage with an explicit length-check and early
error return in the same match branches.
---
Nitpick comments:
In `@app/test/e2e/specs/voice-mode.spec.ts`:
- Line 1: Remove the blanket "// `@ts-nocheck`" from the top of voice-mode.spec.ts
and either fix the underlying type errors or replace with targeted comments: use
specific "// `@ts-expect-error`" annotations directly above the offending imports
or WebdriverIO global uses (e.g., any problematic import lines or references to
browser, $, $$, or other test globals) and add a short comment explaining why
the ignore is necessary; then run the typechecker to ensure only the intended
lines are suppressed.
In `@src/openhuman/local_ai/service/whisper_engine.rs`:
- Around line 179-236: Add debug/trace logging to decode_wav_to_f32's
rejection/error branches so failures are visible; instrument the function (and
the branch that calls convert_pcm_to_f32) to emit logs using the project's
tracing/log macro (e.g., trace! or debug!) with contextual info like chunk_id,
chunk_size, pos, audio_format, num_channels, sample_rate, bits_per_sample and
the error string before each early return or Err(...) (for cases: too small
file, invalid RIFF/WAVE header, malformed fmt chunk, unsupported
sample_rate/channel count, and missing data chunk) so callers can correlate
failures to the input bytes and parsing state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee10696c-593f-4d97-8a76-03c8d3cee230
📒 Files selected for processing (6)
- app/test/e2e/specs/voice-mode.spec.ts
- src/openhuman/local_ai/service/bootstrap.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/whisper_engine.rs
- src/openhuman/voice/ops.rs
- src/openhuman/voice/postprocess.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/voice/ops.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- src/openhuman/voice/postprocess.rs
- src/openhuman/local_ai/service/speech.rs
- src/openhuman/local_ai/service/bootstrap.rs
```ts
it('shows reply mode toggle with text and voice options', async () => {
  // Ensure conversations page is loaded (re-authenticate if state was lost).
  const onConversations = await waitForAnyText(
    ['Message OpenHuman', 'Type a message', 'Reply'],
    5_000
  );
  if (!onConversations) {
    await triggerAuthDeepLink('e2e-voice-token');
    await waitForWindowVisible(25_000);
    await waitForWebView(15_000);
    await waitForAppReady(15_000);
    await completeOnboardingIfVisible();
    await waitForHome(20_000);
  }

  // The Reply toggle should be visible on the conversations page
  const hasReplyLabel = await textExists('Reply');
  expect(hasReplyLabel).toBe(true);

  // Verify both reply mode options exist
  // (There are multiple "Text" and "Voice" buttons — Input + Reply groups)
  const hasText = await textExists('Text');
  expect(hasText).toBe(true);
});
```
Second test still uses global assertions rather than scoping to the Reply toggle group.
While the re-authentication fallback (lines 188-195) improves isolation, the assertions at lines 198-204 check for 'Reply' and 'Text' anywhere on the page. This can pass even if the Reply toggle is missing (since "Text" appears in the Input toggle as well). Additionally, the test doesn't verify the "Voice" option exists in the Reply toggle.
Scope assertions to the Reply toggle container and add the missing "Voice" check:
Proposed fix to scope assertions and add Voice check
```diff
   // The Reply toggle should be visible on the conversations page
   const hasReplyLabel = await textExists('Reply');
   expect(hasReplyLabel).toBe(true);

-  // Verify both reply mode options exist
-  // (There are multiple "Text" and "Voice" buttons — Input + Reply groups)
-  const hasText = await textExists('Text');
-  expect(hasText).toBe(true);
+  // Verify both reply mode options exist.
+  // Note: There are multiple "Text" and "Voice" buttons (Input + Reply groups),
+  // so this assertion confirms at least one of each exists on the page.
+  // Ideally, scope to the Reply toggle container for precision.
+  const hasTextOption = await textExists('Text');
+  expect(hasTextOption).toBe(true);
+
+  const hasVoiceOption = await textExists('Voice');
+  expect(hasVoiceOption).toBe(true);
 });
});
```

Based on learnings: "Ensure each E2E spec is runnable in isolation without dependencies on other specs".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
it('shows reply mode toggle with text and voice options', async () => {
  // Ensure conversations page is loaded (re-authenticate if state was lost).
  const onConversations = await waitForAnyText(
    ['Message OpenHuman', 'Type a message', 'Reply'],
    5_000
  );
  if (!onConversations) {
    await triggerAuthDeepLink('e2e-voice-token');
    await waitForWindowVisible(25_000);
    await waitForWebView(15_000);
    await waitForAppReady(15_000);
    await completeOnboardingIfVisible();
    await waitForHome(20_000);
  }

  // The Reply toggle should be visible on the conversations page
  const hasReplyLabel = await textExists('Reply');
  expect(hasReplyLabel).toBe(true);

  // Verify both reply mode options exist.
  // Note: There are multiple "Text" and "Voice" buttons (Input + Reply groups),
  // so this assertion confirms at least one of each exists on the page.
  // Ideally, scope to the Reply toggle container for precision.
  const hasTextOption = await textExists('Text');
  expect(hasTextOption).toBe(true);

  const hasVoiceOption = await textExists('Voice');
  expect(hasVoiceOption).toBe(true);
});
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/test/e2e/specs/voice-mode.spec.ts` around lines 182 - 205, The
Reply-toggle test is using global textExists checks so it can be fooled by the
Input toggle; change the assertions to scope to the Reply toggle container and
assert both options exist. Locate the "shows reply mode toggle with text and
voice options" test (and helpers used there: waitForAnyText, textExists) and: 1)
find the Reply toggle element/container (e.g., locate the element that contains
the 'Reply' label or use the waitForAnyText result as the anchor), 2) query only
within that container for the 'Text' and 'Voice' option labels (use an existing
scoped helper if available or get the Reply element then search its children),
and 3) replace the global expect(hasText).toBe(true) with two scoped assertions
expecting both 'Text' and 'Voice' to be present inside the Reply toggle.
```rust
let raw_bytes = std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;

let audio_f32 = decode_wav_to_f32(&raw_bytes)?;
transcribe_pcm_f32(handle, &audio_f32, language)
```
Bound WAV input size before reading into memory.
Line 170 reads the full file with no size cap. A very large WAV can cause high memory pressure/OOM in the RPC path.
💡 Suggested fix
```diff
 pub fn transcribe_wav_file(
     handle: &WhisperEngineHandle,
     wav_path: &Path,
     language: Option<&str>,
 ) -> Result<String, String> {
     debug!("{LOG_PREFIX} reading WAV file: {}", wav_path.display());
-    let raw_bytes = std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;
+    const MAX_WAV_BYTES: u64 = 25 * 1024 * 1024; // 25 MiB safety cap
+    let meta = std::fs::metadata(wav_path)
+        .map_err(|e| format!("failed to stat WAV file: {e}"))?;
+    if meta.len() > MAX_WAV_BYTES {
+        return Err(format!(
+            "WAV file too large: {} bytes (max {})",
+            meta.len(),
+            MAX_WAV_BYTES
+        ));
+    }
+    let raw_bytes =
+        std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;
     let audio_f32 = decode_wav_to_f32(&raw_bytes)?;
     transcribe_pcm_f32(handle, &audio_f32, language)
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
    debug!("{LOG_PREFIX} reading WAV file: {}", wav_path.display());
    const MAX_WAV_BYTES: u64 = 25 * 1024 * 1024; // 25 MiB safety cap
    let meta = std::fs::metadata(wav_path)
        .map_err(|e| format!("failed to stat WAV file: {e}"))?;
    if meta.len() > MAX_WAV_BYTES {
        return Err(format!(
            "WAV file too large: {} bytes (max {})",
            meta.len(),
            MAX_WAV_BYTES
        ));
    }
    let raw_bytes =
        std::fs::read(wav_path).map_err(|e| format!("failed to read WAV file: {e}"))?;
    let audio_f32 = decode_wav_to_f32(&raw_bytes)?;
    transcribe_pcm_f32(handle, &audio_f32, language)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 170 - 173, The
code reads the entire WAV into memory unbounded (the std::fs::read call) which
can OOM; add a file-size guard before allocating: define a MAX_WAV_BYTES
constant and, in the function containing the lines using std::fs::read, check
the file size via std::fs::metadata or open the file and use metadata.len(),
returning an error if it exceeds the limit, then read the file (or read up to
the limit using a capped reader) and pass the resulting bytes to
decode_wav_to_f32 and transcribe_pcm_f32; update error messages to indicate the
file was too large and reference the existing functions decode_wav_to_f32 and
transcribe_pcm_f32 in the same flow.
```rust
match (audio_format, bits_per_sample) {
    // PCM 16-bit
    (1, 16) => {
        let samples: Vec<i16> = pcm
            .chunks_exact(2)
            .map(|c| i16::from_le_bytes([c[0], c[1]]))
            .collect();

        let mono = if num_channels == 2 {
            samples
                .chunks_exact(2)
                .map(|pair| ((pair[0] as i32 + pair[1] as i32) / 2) as i16)
                .collect::<Vec<_>>()
        } else {
            samples
        };

        Ok(mono.iter().map(|&s| s as f32 / 32768.0).collect())
    }
    // IEEE float 32-bit
    (3, 32) => {
        let samples: Vec<f32> = pcm
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();

        if num_channels == 2 {
            Ok(samples
                .chunks_exact(2)
                .map(|pair| (pair[0] + pair[1]) / 2.0)
                .collect())
        } else {
            Ok(samples)
        }
```
Reject malformed PCM frame alignment instead of silently truncating.
Both PCM branches can drop trailing partial frame data (chunks_exact behavior), which accepts malformed input and can skew transcripts.
💡 Suggested fix
```diff
 fn convert_pcm_to_f32(
     pcm: &[u8],
     audio_format: u16,
     num_channels: u16,
     bits_per_sample: u16,
 ) -> Result<Vec<f32>, String> {
     match (audio_format, bits_per_sample) {
         // PCM 16-bit
         (1, 16) => {
+            let frame_bytes = 2usize
+                .checked_mul(num_channels as usize)
+                .ok_or_else(|| "invalid channel count".to_string())?;
+            if frame_bytes == 0 || pcm.len() % frame_bytes != 0 {
+                return Err("malformed PCM16 WAV data (partial frame)".to_string());
+            }
             let samples: Vec<i16> = pcm
                 .chunks_exact(2)
                 .map(|c| i16::from_le_bytes([c[0], c[1]]))
                 .collect();
@@
         // IEEE float 32-bit
         (3, 32) => {
+            let frame_bytes = 4usize
+                .checked_mul(num_channels as usize)
+                .ok_or_else(|| "invalid channel count".to_string())?;
+            if frame_bytes == 0 || pcm.len() % frame_bytes != 0 {
+                return Err("malformed PCM32 WAV data (partial frame)".to_string());
+            }
             let samples: Vec<f32> = pcm
                 .chunks_exact(4)
                 .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                 .collect();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/openhuman/local_ai/service/whisper_engine.rs` around lines 244 - 277, The
code currently uses pcm.chunks_exact(...) in the (audio_format, bits_per_sample)
match arms which silently drops trailing partial frames; instead, before
chunking check the expected frame byte size (for PCM16: frame_bytes = 2 *
num_channels; for IEEE float32: frame_bytes = 4 * num_channels) and if pcm.len()
% frame_bytes != 0 return an Err indicating malformed/truncated PCM rather than
proceeding, so replace the silent truncation around the pcm.chunks_exact(...)
usage with an explicit length-check and early error return in the same match
branches.
Summary
- New `src/openhuman/voice/` domain module extracting speech-to-text (whisper.cpp) and text-to-speech (piper) from the `local_ai` module into its own RPC namespace (`openhuman.voice_*`)
- New `voice_status` endpoint that checks binary/model availability so the UI can show clear errors when voice mode can't work instead of failing silently
- Frontend switched to the `voice_*` endpoints, with status feedback when switching to voice input mode

Changes
New files
- `src/openhuman/voice/mod.rs` — Module declaration and re-exports
- `src/openhuman/voice/types.rs` — DTOs: `VoiceSpeechResult`, `VoiceTtsResult`, `VoiceStatus`
- `src/openhuman/voice/ops.rs` — Business logic: `voice_status`, `voice_transcribe`, `voice_transcribe_bytes`, `voice_tts`
- `src/openhuman/voice/schemas.rs` — 4 RPC endpoint schemas + handlers
- `app/test/e2e/specs/voice-mode.spec.ts` — E2E spec for voice mode UI integration

Modified files
- `src/openhuman/local_ai/mod.rs` — Made `paths` and `model_ids` `pub(crate)` for cross-module access
- `src/openhuman/mod.rs` — Added `pub mod voice`
- `src/core/all.rs` — Registered voice controllers/schemas + namespace description
- `app/src/utils/tauriCommands.ts` — Added voice RPC wrapper functions and types
- `app/src/pages/Conversations.tsx` — Switched to `voice_*` endpoints + added status check on mode switch

Test plan
- Unit tests (including `voice_status_returns_availability`) — passing
- E2E: `bash app/scripts/e2e-run-spec.sh test/e2e/specs/voice-mode.spec.ts voice-mode`

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Style
Tests
Chores