Use HA streaming response API so TTS speaks sentence-by-sentence instead of waiting for full reply #30

@Darktex

Problem

On HA 2026.5.2, with the OpenClaw conversation agent set on an Assist pipeline (Nabu Casa Cloud TTS), TTS playback only begins after the entire OpenClaw response has been generated. For multi-sentence answers this means several seconds of silence before any audio plays, which defeats the purpose of token streaming.

Root cause

custom_components/openclaw/conversation.py does stream tokens internally:

async for chunk in client.async_stream_message(...):
    full_response += chunk

but then assembles the full string and hands it to the pipeline as a single blob:

intent_response = intent.IntentResponse(language=user_input.language)
intent_response.async_set_speech(full_response)

So the Assist pipeline never sees a stream and can't chunk into sentences for TTS.

Fix

HA exposes a streaming response API for conversation agents (available since 2025.x, fully landed by 2026.x). Two complementary surfaces:

  • intent_response.async_set_speech_async_iterator(async_iter_of_text_deltas) — pipeline consumes deltas, chunks into sentences, streams them into the TTS engine.
  • For richer flows, chat_log.async_add_delta_content_stream() to feed deltas into the chat log so other consumers (history, frontend) see the stream too.

Nabu Casa Cloud TTS and Wyoming engines consume the streamed sentence chunks today, so the user-visible win is immediate: TTS starts as soon as the first sentence completes.
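To illustrate why TTS can start after the first sentence, here is a minimal sketch of sentence chunking over a stream of text deltas. It is not HA's actual pipeline code: `iter_sentences`, `fake_deltas`, `collect_sentences`, and the boundary regex are all hypothetical names and rules made up for this example.

```python
import asyncio
import re
from typing import AsyncIterator

# Assumed boundary rule for illustration: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def iter_sentences(deltas: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield each complete sentence as soon as it appears in the delta stream."""
    buffer = ""
    async for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last ended at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be an unfinished sentence; keep buffering it.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


async def fake_deltas() -> AsyncIterator[str]:
    """Hypothetical token stream standing in for the LLM output."""
    for piece in ["Hel", "lo there. ", "How are", " you? I am", " fine."]:
        yield piece


async def collect_sentences() -> list[str]:
    return [s async for s in iter_sentences(fake_deltas())]


if __name__ == "__main__":
    print(asyncio.run(collect_sentences()))
```

In this sketch the first sentence is emitted after only the second delta, which is the point at which a streaming-capable TTS engine could begin speaking.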

Suggested change in _get_response_streaming (or wherever the streaming branch lives): yield the chunks directly instead of accumulating, and have the caller pass the iterator into async_set_speech_async_iterator. Keep the non-streaming fallback path for older HA cores or non-streaming providers.
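A rough sketch of the shape of that change, with no Home Assistant imports so it runs standalone. Only the method name `async_stream_message` comes from the snippet above. `FakeClient`, its `message` parameter, `stream_response`, and `full_response` are hypothetical stand-ins, and handing the iterator to HA is deliberately left out.

```python
import asyncio
from typing import AsyncIterator


class FakeClient:
    """Stand-in for the OpenClaw client; the real signature may differ."""

    async def async_stream_message(self, message: str) -> AsyncIterator[str]:
        for chunk in ["Sure. ", "The light ", "is on."]:
            await asyncio.sleep(0)  # simulate waiting on the network
            yield chunk


async def stream_response(client: FakeClient, message: str) -> AsyncIterator[str]:
    """Streaming path: pass each chunk on as it arrives instead of accumulating."""
    async for chunk in client.async_stream_message(message):
        if chunk:
            yield chunk


async def full_response(client: FakeClient, message: str) -> str:
    """Fallback path: join the same stream into one string for non-streaming callers."""
    return "".join([chunk async for chunk in stream_response(client, message)])


async def collect_chunks(client: FakeClient, message: str) -> list[str]:
    return [chunk async for chunk in stream_response(client, message)]


if __name__ == "__main__":
    print(asyncio.run(collect_chunks(FakeClient(), "is the light on?")))
    print(asyncio.run(full_response(FakeClient(), "is the light on?")))
```

Building the fallback on top of the generator keeps a single code path to the provider, so the streaming and non-streaming branches cannot drift apart.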

Why it matters

The whole point of a personality-driven assistant with a real LLM behind it is fast, conversational back-and-forth. The latency before voice starts is currently 3–8 s on longer answers, which makes the assistant feel broken even when generation is fast. Sentence streaming should cut the perceived latency to under 1 s.

Happy to test a patch / open a PR if helpful.
