An experiment in streaming text-to-speech (TTS), interfacing with Coqui's XTTSv2 pipeline.
This package implements a streaming server + client for TTS inference. Features / options include:
- playback of generated audio on the host (using PyAudio), with sub-1-second delay once the model has been warmed up
- download of the generated audio as a .wav file
- support for long texts through smart decomposition into a series of inference calls
- lip-syncing an animated robot face (using PyLips)
- support for CUDA-based GPUs and Apple's MPS (though with MPS, delays appear to be somewhat higher)
The API also exposes many of the original XTTS knobs (e.g., speaker, temperature).
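For illustration, a hypothetical client invocation (the flag names below are assumptions, not confirmed options; run the client with --help for the actual names):

```
# hypothetical flags -- check --help for the real option names
uv run streaming-tts-client --speaker "Ana Florence" --temperature 0.65 -p "Trying out a different voice."
```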
Credits:
- Coqui, for the seminal XTTS development: https://docs.coqui.ai/en/latest/models/xtts.html
- IDIAP, for maintaining a fork
- The creators of PyLips: https://github.com/interaction-lab
The lip-syncing feature additionally requires:
- PyLips (forked from the original development by students at the USC Interaction Lab)
For NVIDIA GPUs, the deepspeed Python package is strongly encouraged, as it reduces the time to first chunk during inference.
On the server side (where inference is performed and audio optionally plays):

```
uv sync
uv run streaming-xtts
```

On the client side:

```
uv run streaming-tts-client -p "The rain in Spain falls mainly on the plane."
```

You can also download the complete generated wave file:

```
uv run streaming-tts-client -pd "Check your project directory for a wave file with the current timestamp."
```

For both the server and client scripts above, use --help to see the available options.
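For example:

```
uv run streaming-xtts --help
uv run streaming-tts-client --help
```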
If you have an NVIDIA GPU, install the optional dependency:

```
uv sync --extra deepspeed
```

Then run the server with the --deepspeed flag:

```
uv run streaming-xtts --deepspeed
```

To enable lip-syncing, add the optional dependency:

```
uv sync --extra pylips
```

Warning: if you want the deepspeed extra as well, include both extras in the same uv sync command; otherwise you will end up with only one or the other.
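For example, to install both extras at once:

```
uv sync --extra deepspeed --extra pylips
```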
Use the --pylips argument when starting the server:
```
uv run streaming-xtts --pylips
```

To view the animated robot face, navigate to: http://localhost:8008/face.
Note: since the face was optimized for a mobile device, zoom out in your computer's web browser window for a better viewing experience.
To run PyLips as a separate server (e.g., on a different host or port), first serve PyLips yourself:
```
git clone https://github.com/witwicki/PyLips.git
cd PyLips
# in your favorite virtual environment
pip install .
python -m pylips.face.start --port <port, e.g., 8008>
```

Next, start the TTS server with the appropriate flags:
```
uv run streaming-xtts --pylips --pylipsserver <server_IP_or_hostname>:<port>
```

To view the animated robot face, navigate to: http://<server_IP_or_hostname>:<port>/face.
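As a concrete sketch (assuming a hypothetical PyLips host at 192.168.1.50 serving on port 8008):

```
# on the PyLips host (hypothetical address 192.168.1.50)
python -m pylips.face.start --port 8008

# on the machine running the TTS server
uv run streaming-xtts --pylips --pylipsserver 192.168.1.50:8008
```

The face would then be viewable at http://192.168.1.50:8008/face.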
Roadmap:
- Support for Apple Silicon
- FastAPI for a cleaner interface
- Emotional cues from text on the animated face
- Streaming audio over HTTP (e.g., via DASH) for accessibility on low-compute devices
- Streaming support for newer TTS models, e.g., F5