streaming_xtts

An experiment in streaming text-to-speech (TTS), interfacing with Coqui's XTTSv2 pipeline.

This package implements a streaming server + client for TTS inference. Features and options include:

  • playback of generated audio on the host (using PyAudio), with sub-1-second delay once the model has been warmed up
  • download of the generated audio as a .wav file
  • support for long texts through smart decomposition into a series of inference calls (see the sketch after this list)
  • lip-syncing an animated robot face (using PyLips)
  • support for CUDA-based GPUs and Apple's MPS (though with MPS, delays appear to be somewhat higher)
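
The decomposition might look roughly like the sketch below (the splitting heuristic shown is an assumption for illustration, not necessarily this package's actual logic):

# a rough sketch of long-text decomposition: split on sentence boundaries,
# then pack sentences into chunks small enough for one inference call each
import re

def decompose(text: str, max_chars: int = 250) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# each chunk becomes its own inference call, so playback of early chunks
# can begin while later chunks are still being synthesized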

The API also exposes many of the original XTTS knobs, e.g., speaker and temperature.
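
For reference, here is a minimal sketch of the kind of streaming loop the server builds on, using Coqui's documented XTTS streaming API with PyAudio playback; the checkpoint paths, speaker reference, and temperature value are placeholders, not this package's defaults:

import pyaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# load the model (paths are placeholders)
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/")
model.cuda()

# condition on a reference speaker recording
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["/path/to/speaker.wav"]
)

# play each audio chunk as soon as the model emits it (XTTSv2 outputs 24 kHz mono)
player = pyaudio.PyAudio()
stream = player.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True)
for chunk in model.inference_stream(
    "The rain in spain falls mainly on the plane.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,  # one of the XTTS knobs exposed through the API
):
    stream.write(chunk.squeeze().cpu().numpy().tobytes())
stream.close()
player.terminate()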

Acknowledgements

Requirements

The lip-syncing feature additionally requires:

  • PyLips (forked from the original development by students at the USC Interaction Lab)

For NVIDIA GPUs, the deepspeed Python package is strongly encouraged, as it reduces the time to first audio chunk during inference.

Quick start

On the server side (where inference is performed and audio optionally plays):

uv sync
uv run streaming-xtts

On the client side:

uv run streaming-tts-client -p "The rain in spain falls mainly on the plane."

You can also download the complete generated wave file:

uv run streaming-tts-client -pd "Check your project directory for a wave file with the current timestamp."

For both the server and client scripts above, use --help to see the available options.

Deepspeed (faster inference for CUDA)

If you have an NVIDIA GPU, install the optional dependency:

uv sync --extra deepspeed

Then run the server with the flag:

uv run streaming-xtts --deepspeed
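
For context, deepspeed in Coqui's XTTS API takes effect when the checkpoint is loaded; relative to the earlier sketch, the only change would be:

# relative to the earlier sketch: enable deepspeed at checkpoint-load time
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)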

Actuating the robot face

Prerequisite

Add the optional dependency:

uv sync --extra pylips

Warning: If you want the deepspeed extra as well, include both extras in the same uv sync command (e.g., uv sync --extra deepspeed --extra pylips); otherwise you will end up with only one or the other.

Locally

Use the --pylips argument when starting the server:

uv run streaming-xtts --pylips

To view the animated robot face, navigate to: http://localhost:8008/face.

Note: since the face was optimized for a mobile device, zoom out in your computer's web browser for a better viewing experience.
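
For reference, driving the face directly through PyLips looks roughly like this (a sketch based on the upstream PyLips project's usage; with --pylips, the server does this wiring for you):

from pylips.speech import RobotFace

# a sketch based on upstream PyLips usage; assumes the face server is
# already running (here, on the default localhost:8008)
face = RobotFace()
face.say("Hello from streaming_xtts!")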

On a different machine

First, serve PyLips:

git clone https://github.com/witwicki/PyLips.git
cd PyLips
# in your favorite virtual environment
pip install .
python -m pylips.face.start --port <port>  # e.g., 8008

Next, start the TTS server with appropriate flags:

uv run streaming-xtts --pylips --pylipsserver <server_IP_or_hostname>:<port>

To view the animated robot face, navigate to: http://<server_IP_or_hostname>:<port>/face.

Planned improvements

  • Support for Apple Silicon
  • FastAPI for cleaner interface
  • Emotional cues from text on animated face
  • Streaming audio over HTTP (e.g., via DASH) for accessibility on low-compute devices
  • Streaming support for newer TTS models, e.g., F5
