Your AI agent, always listening, always local, sounding like you.
Install · Quickstart · How it works · Config
Say the wake word. Ask anything. Your OpenClaw agent responds — in your own voice — while running 100% on-device. No cloud. No API keys. No latency.
This is what happens when your terminal AI gets ears and a mouth.
You already have a powerful AI agent that can read your codebase, run commands, search the web, and manage your machine. But you type to it. Through a text box. Like it's 2023.
openclaw-voice gives your agent a voice interface in 30 lines of config. Wake word detection, speech-to-text, LLM reasoning, and voice-cloned text-to-speech — all running locally on Apple Silicon. Zero ongoing cost.
You: "Hey Jarvis, what's using all my GPU memory right now?"
Agent: "Ollama has gemma4 loaded, using 9.2 GB of your 96 GB VRAM.
The fine-tuning job on PID 47182 is consuming another 31 GB.
Want me to kill it?"
You: "Yeah, kill it and pull the new model."
Agent: *kills process, pulls model*
Agent: "Done. Qwen3.5 35B is ready."
```bash
pip install openclaw-voice
```

Prerequisites (macOS, Apple Silicon recommended):

- OpenClaw with a configured agent
- WhisperKit CLI — `brew install whisperkit`
- Microphone access (macOS will prompt on first run)
```bash
# 1. Record your voice sample (32 seconds, 4 sentences)
openclaw-voice record
# 2. Test all subsystems
openclaw-voice test
# 3. Start listening (foreground)
openclaw-voice start
```

Say your wake word. Ask anything. That's it.
Always-on in background:
```bash
openclaw-voice install     # LaunchAgent, auto-starts on login
openclaw-voice logs        # tail the live log
openclaw-voice uninstall   # stop + remove
```

```
┌─────────┐    ┌──────────────┐    ┌───────────┐    ┌──────────┐    ┌──────────┐
│   Mic   │───▶│ openWakeWord │───▶│ WhisperKit│───▶│ OpenClaw │───▶│Chatterbox│──▶ speakers
│  16kHz  │    │  wake detect │    │    STT    │    │  Agent   │    │   TTS    │
└─────────┘    └──────────────┘    └───────────┘    └──────────┘    └──────────┘
               <0.5ms/check        Neural Engine    Your models     Voice cloned
               CPU only            Apple Silicon    Local only      from 30s sample
```
| Layer | What | Why |
|---|---|---|
| Wake word | openWakeWord | Trainable, CPU-only, zero cloud |
| STT | WhisperKit large-v3-turbo | Neural Engine accelerated on Apple Silicon |
| Agent | OpenClaw | Any model, any tool, full machine control |
| TTS (cloned) | Chatterbox MLX | 30s sample → your voice, MPS accelerated |
| TTS (fallback) | WhisperKit TTS | No sample needed, realtime streaming |
Total cost: $0/month. Everything runs on-device. The only network call is to your local OpenClaw gateway (or your configured LLM endpoint).
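To make the flow concrete, here is a minimal sketch of the wake-to-reply loop in Python. It is illustrative only, not the daemon's actual internals: it assumes the openwakeword and sounddevice packages, and `handle_utterance()` is a hypothetical stand-in for the WhisperKit, OpenClaw, and Chatterbox steps.

```python
# Illustrative only: a bare-bones version of the always-listening loop.
# Assumes: pip install openwakeword sounddevice numpy
# (pre-trained wake-word models may first need openwakeword.utils.download_models())
import time

import numpy as np
import sounddevice as sd
from openwakeword.model import Model

FRAME = 1280  # 80 ms of 16 kHz mono audio per wake-word check

def handle_utterance() -> None:
    """Hypothetical stand-in: record until silence, transcribe with WhisperKit,
    send the text to the OpenClaw agent, and speak the reply with Chatterbox."""
    print("wake word detected")

def listen_loop(threshold: float = 0.6, cooldown_sec: float = 2.0) -> None:
    oww = Model(wakeword_models=["hey_jarvis"])  # any installed openWakeWord model
    last_trigger = 0.0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as stream:
        while True:
            frame, _ = stream.read(FRAME)
            scores = oww.predict(np.squeeze(frame))  # {model_name: confidence}
            fired = max(scores.values()) >= threshold
            if fired and time.monotonic() - last_trigger > cooldown_sec:
                last_trigger = time.monotonic()
                handle_utterance()

if __name__ == "__main__":
    listen_loop()
```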
Config lives at ~/.openclaw/voice/config/voice.yaml. Created on first run with sensible defaults.
```yaml
wake_word:
  model: hey_jarvis          # or hey_mycroft, alexa, or your custom model
  threshold: 0.6             # lower = more sensitive, more false positives
  cooldown_sec: 2.0          # minimum time between wake triggers
stt:
  model: large-v3-turbo      # WhisperKit model (fastest large-class)
  language: en
  max_record_sec: 12         # stop recording after this long
  silence_trigger_sec: 1.5   # stop after this much silence
tts:
  backend: chatterbox        # primary: voice cloning
  fallback_speaker: aiden    # fallback if no voice sample
  voice_sample: ~/.openclaw/voice/samples/voice.wav
agent:
  id: main                   # your OpenClaw agent ID
  max_reply_chars: 800       # truncate long replies for voice UX
```

Train your own "hey <your word>" model with ~50 voice samples. See docs/custom-wake-word.md for the full guide.
Quick path: use the openWakeWord training Colab, download the ONNX model, drop it in ~/.openclaw/voice/models/, and update your config.
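For illustration, loading such a model directly with the openWakeWord Python API might look like this (the `.onnx` filename is a placeholder for whatever you trained, and this is not the project's own loader code):

```python
# Sketch: point openWakeWord at a custom ONNX wake-word model.
import os

from openwakeword.model import Model

# Placeholder filename for a model trained via the openWakeWord Colab.
custom_model = os.path.expanduser("~/.openclaw/voice/models/hey_openclaw.onnx")
oww = Model(wakeword_models=[custom_model], inference_framework="onnx")
```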
- Too many false wake-ups? Raise `wake_word.threshold` to 0.7
- Cuts off mid-thought? Raise `stt.silence_trigger_sec` to 2.5
- Want shorter replies? Lower `agent.max_reply_chars` to 400
- No voice sample yet? It falls back to WhisperKit TTS automatically
- Replies too slow? Configure your agent to use a faster model for voice mode
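All of these are plain edits to voice.yaml. If you prefer to script them, here is a minimal sketch using PyYAML (an assumption, not a project dependency; note that re-serializing drops the inline comments):

```python
# Sketch: tweak the voice config programmatically (assumes PyYAML).
from pathlib import Path

import yaml

cfg_path = Path("~/.openclaw/voice/config/voice.yaml").expanduser()
cfg = yaml.safe_load(cfg_path.read_text())

cfg["wake_word"]["threshold"] = 0.7       # fewer false wake-ups
cfg["stt"]["silence_trigger_sec"] = 2.5   # wait longer before cutting off
cfg["agent"]["max_reply_chars"] = 400     # shorter spoken replies

cfg_path.write_text(yaml.safe_dump(cfg, sort_keys=False))
```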
Why OpenClaw and not a raw LLM API? Because your voice assistant should be able to do things — run commands, read files, search code, manage your machine. OpenClaw gives your agent tools, memory, and multi-model routing. A raw API gives you text.
Why WhisperKit and not whisper.cpp? Neural Engine acceleration. On Apple Silicon, WhisperKit runs STT 2-3x faster than GPU-based whisper.cpp while using less power. For an always-listening daemon, that matters.
Why Chatterbox and not [other TTS]? Voice cloning from a 30-second sample with near-zero quality loss. No other OSS TTS does this well on Apple Silicon. The fallback to WhisperKit TTS means it works even without a sample.
Why openWakeWord? It's the only wake word engine that's truly local (no cloud), trainable (custom models), and runs on CPU with <5% of a single core. Perfect for always-on.
Contributions welcome. Areas of particular interest:
- Linux support — currently macOS only (WhisperKit + Apple Silicon). Would love PipeWire + Whisper.cpp support.
- More TTS backends — Coqui TTS, Bark, XTTS
- Streaming responses — start speaking before the full reply is generated
- Multi-language — Chatterbox supports multilingual, needs config exposure
- Custom wake word training tooling — make it a one-command experience
See CONTRIBUTING.md for guidelines.
MIT


