openclaw-voice

Your AI agent, always listening, always local, sounding like you.

Install · Quickstart · How it works · Config

demo


Say the wake word. Ask anything. Your OpenClaw agent responds — in your own voice — while running 100% on-device. No cloud. No API keys. No latency.

This is what happens when your terminal AI gets ears and a mouth.

Why

You already have a powerful AI agent that can read your codebase, run commands, search the web, and manage your machine. But you type to it. Through a text box. Like it's 2023.

openclaw-voice gives your agent a voice interface in 30 lines of config. Wake word detection, speech-to-text, LLM reasoning, and voice-cloned text-to-speech — all running locally on Apple Silicon. Zero ongoing cost.

You: "Hey Jarvis, what's using all my GPU memory right now?"
Agent: "Ollama has gemma4 loaded, using 9.2 GB of your 96 GB VRAM.
       The fine-tuning job on PID 47182 is consuming another 31 GB.
       Want me to kill it?"
You: "Yeah, kill it and pull the new model."
Agent: *kills process, pulls model*
Agent: "Done. Qwen3.5 35B is ready."

Install

pip install openclaw-voice

Prerequisites (macOS, Apple Silicon recommended):

  • OpenClaw with a configured agent
  • WhisperKit CLI: brew install whisperkit
  • Microphone access (macOS will prompt on first run)

Quickstart

# 1. Record your voice sample (32 seconds, 4 sentences)
openclaw-voice record

# 2. Test all subsystems
openclaw-voice test

# 3. Start listening (foreground)
openclaw-voice start

Say your wake word. Ask anything. That's it.

Always-on in background:

openclaw-voice install    # LaunchAgent, auto-starts on login
openclaw-voice logs       # tail the live log
openclaw-voice uninstall  # stop + remove

How it works

┌─────────┐    ┌──────────────┐    ┌───────────┐    ┌──────────┐    ┌─────────┐
│  Mic    │───▶│ openWakeWord │───▶│ WhisperKit│───▶│ OpenClaw │───▶│Chatterbox│──▶ speakers
│ 16kHz   │    │ wake detect  │    │    STT    │    │   Agent  │    │   TTS   │
└─────────┘    └──────────────┘    └───────────┘    └──────────┘    └─────────┘
               <0.5ms/check        Neural Engine     Your models     Voice cloned
               CPU only            Apple Silicon     Local only      from 30s sample
| Layer | What | Why |
|---|---|---|
| Wake word | openWakeWord | Trainable, CPU-only, zero cloud |
| STT | WhisperKit large-v3-turbo | Neural Engine accelerated on Apple Silicon |
| Agent | OpenClaw | Any model, any tool, full machine control |
| TTS (cloned) | Chatterbox MLX | 30s sample → your voice, MPS accelerated |
| TTS (fallback) | WhisperKit TTS | No sample needed, real-time streaming |

Total cost: $0/month. Everything runs on-device. The only network call is to your local OpenClaw gateway (or your configured LLM endpoint).
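
The event loop underneath is small: feed the mic to openWakeWord frame by frame, and hand off to the rest of the chain on a trigger. A minimal Python sketch of that loop (the openwakeword and sounddevice calls are those libraries' real APIs; record_until_silence, transcribe, ask_agent, and speak are hypothetical stand-ins for the actual WhisperKit, OpenClaw, and Chatterbox steps):

import numpy as np
import sounddevice as sd
from openwakeword.model import Model

FRAME = 1280  # 80 ms of 16 kHz mono audio, the chunk size openWakeWord expects

oww = Model(wakeword_models=["hey_jarvis"])  # bundled pretrained model

with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as mic:
    while True:
        frame, _ = mic.read(FRAME)
        scores = oww.predict(np.squeeze(frame))
        if scores["hey_jarvis"] >= 0.6:        # wake_word.threshold
            audio = record_until_silence(mic)  # hypothetical: VAD-stopped capture
            text = transcribe(audio)           # hypothetical: WhisperKit STT
            speak(ask_agent(text))             # hypothetical: OpenClaw -> Chatterbox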

Configuration

Config lives at ~/.openclaw/voice/config/voice.yaml. Created on first run with sensible defaults.

wake_word:
  model: hey_jarvis       # or hey_mycroft, alexa, or your custom model
  threshold: 0.6          # lower = more sensitive, more false positives
  cooldown_sec: 2.0       # minimum time between wake triggers

stt:
  model: large-v3-turbo   # WhisperKit model (fastest large-class)
  language: en
  max_record_sec: 12      # stop recording after this long
  silence_trigger_sec: 1.5 # stop after this much silence

tts:
  backend: chatterbox     # primary: voice cloning
  fallback_speaker: aiden # fallback if no voice sample
  voice_sample: ~/.openclaw/voice/samples/voice.wav

agent:
  id: main                # your OpenClaw agent ID
  max_reply_chars: 800    # truncate long replies for voice UX
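
The file is plain YAML, so scripting against it is straightforward. A small sketch, assuming PyYAML (not necessarily what openclaw-voice itself uses):

from pathlib import Path
import yaml  # PyYAML; an assumption, not a documented openclaw-voice dependency

cfg_path = Path("~/.openclaw/voice/config/voice.yaml").expanduser()
cfg = yaml.safe_load(cfg_path.read_text())
print(cfg["wake_word"]["threshold"])  # 0.6 with the defaults above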

Custom wake word

Train your own "hey <name>" model with ~50 voice samples. See docs/custom-wake-word.md for the full guide.

Quick path: use the openWakeWord training Colab, download the ONNX model, drop it in ~/.openclaw/voice/models/, and update your config.
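
Once the ONNX file is in place, pointing openWakeWord at it takes a couple of lines. A sketch using the openwakeword Python API (the hey_avasis filename is hypothetical):

import numpy as np
from pathlib import Path
from openwakeword.model import Model

model_path = Path("~/.openclaw/voice/models/hey_avasis.onnx").expanduser()  # hypothetical name
oww = Model(wakeword_models=[str(model_path)], inference_framework="onnx")

frame = np.zeros(1280, dtype=np.int16)  # one 80 ms frame of 16 kHz mono audio
print(oww.predict(frame))               # {"hey_avasis": <score>}

Then set wake_word.model in voice.yaml to the new model so the daemon picks it up.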

Tips

  • Too many false wake-ups? Raise wake_word.threshold to 0.7
  • Cuts off mid-thought? Raise stt.silence_trigger_sec to 2.5
  • Want shorter replies? Lower agent.max_reply_chars to 400
  • No voice sample yet? It falls back to WhisperKit TTS automatically
  • Replies too slow? Configure your agent to use a faster model for voice mode

Architecture decisions

Why OpenClaw and not a raw LLM API? Because your voice assistant should be able to do things — run commands, read files, search code, manage your machine. OpenClaw gives your agent tools, memory, and multi-model routing. A raw API gives you text.

Why WhisperKit and not whisper.cpp? Neural Engine acceleration. On Apple Silicon, WhisperKit runs STT 2-3x faster than GPU-based whisper.cpp while using less power. For an always-listening daemon, that matters.

Why Chatterbox and not [other TTS]? Voice cloning from a 30-second sample with near-zero quality loss. No other OSS TTS does this well on Apple Silicon. The fallback to WhisperKit TTS means it works even without a sample.

Why openWakeWord? It's the only wake word engine that's truly local (no cloud), trainable (custom models), and runs on CPU with <5% of a single core. Perfect for always-on.

Contributing

Contributions welcome. Areas of particular interest:

  • Linux support — currently macOS only (WhisperKit + Apple Silicon). Would love PipeWire + whisper.cpp support.
  • More TTS backends — Coqui TTS, Bark, XTTS
  • Streaming responses — start speaking before the full reply is generated (rough sketch after this list)
  • Multi-language — Chatterbox supports multilingual output; it just needs config exposure
  • Custom wake word training tooling — make it a one-command experience
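
On the streaming item: the usual approach is sentence-level chunking, where each complete sentence is spoken while the next is still generating. A rough sketch, with the token stream and speak call as hypothetical placeholders:

import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def speak_streaming(tokens, speak):
    """Speak each completed sentence instead of waiting for the full reply."""
    buf = ""
    for tok in tokens:                  # hypothetical token stream from the agent
        buf += tok
        *done, buf = SENTENCE_END.split(buf)
        for sentence in done:
            speak(sentence)             # hypothetical TTS call
    if buf.strip():
        speak(buf)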

See CONTRIBUTING.md for guidelines.

License

MIT


Built by Avasis · Part of the OpenClaw ecosystem
