streaming_xtts

An experiment in streaming text-to-speech (TTS), interfacing with Coqui's XTTSv2 pipeline.

This package implements a streaming server + client for TTS inference. Features and options include:

  • playback of generated audio on the host (using PyAudio), with sub-1-second delay once the model has been warmed up
  • download of the generated audio as a .wav file
  • support for long texts through smart decomposition into a series of inference calls (see the sketch after this list)
  • lip-syncing an animated robot face (using PyLips)
  • support for CUDA-based GPUs and Apple's MPS (though with MPS, delays appear to be somewhat higher)
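
The decomposition might look roughly like the sketch below (the splitting heuristic shown is an assumption for illustration, not necessarily this package's actual logic):

# a rough sketch of long-text decomposition: split on sentence boundaries,
# then pack sentences into chunks small enough for one inference call each
import re

def decompose(text: str, max_chars: int = 250) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# each chunk becomes its own inference call, so playback of early chunks
# can begin while later chunks are still being synthesized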

The API also exposes many of the original XTTS knobs, e.g., speaker and temperature.
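
For reference, here is a minimal sketch of the kind of streaming loop the server builds on, using Coqui's documented XTTS streaming API with PyAudio playback; the checkpoint paths, speaker reference, and temperature value are placeholders, not this package's defaults:

import pyaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# load the model (paths are placeholders)
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/")
model.cuda()

# condition on a reference speaker recording
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["/path/to/speaker.wav"]
)

# play each audio chunk as soon as the model emits it (XTTSv2 outputs 24 kHz mono)
player = pyaudio.PyAudio()
stream = player.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True)
for chunk in model.inference_stream(
    "The rain in spain falls mainly on the plane.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,  # one of the XTTS knobs exposed through the API
):
    stream.write(chunk.squeeze().cpu().numpy().tobytes())
stream.close()
player.terminate()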

Acknowledgements

Requirements

The lip-syncing feature additionally requires:

  • PyLips (forked from the original development by students at the USC Interaction Lab)

For NVIDIA GPUs, the deepspeed Python package is strongly encouraged, as it reduces the time to first audio chunk during inference.

Quick start

On the server side (where inference is performed and audio optionally plays):

uv sync
uv run streaming-xtts

On the client side:

uv run streaming-tts-client -p "The rain in spain falls mainly on the plane."

You can also download the complete generated wave file:

uv run streaming-tts-client -pd "Check your project directory for a wave file with the current timestamp."

For both the server and client scripts above, use --help to see the available options.

Deepspeed (faster inference for CUDA)

If you have an NVIDIA GPU, install the optional dependency:

uv sync --extra deepspeed

Then run the server with the flag:

uv run streaming-xtts --deepspeed
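
For context, deepspeed in Coqui's XTTS API takes effect when the checkpoint is loaded; relative to the earlier sketch, the only change would be:

# relative to the earlier sketch: enable deepspeed at checkpoint-load time
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)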

Actuating the robot face

Prerequisite

Add the optional dependency:

uv sync --extra pylips

Warning: If you want the deepspeed extra as well, include both extras in the same uv sync command (e.g., uv sync --extra deepspeed --extra pylips); otherwise you will end up with only one or the other.

Locally

Use the --pylips argument when starting the server:

uv run streaming-xtts --pylips

To view the animated robot face, navigate to: http://localhost:8008/face.

Note: since the face was optimized for a mobile device, zoom out in your computer's web browser for a better viewing experience.
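
For reference, driving the face directly through PyLips looks roughly like this (a sketch based on the upstream PyLips project's usage; with --pylips, the server does this wiring for you):

from pylips.speech import RobotFace

# a sketch based on upstream PyLips usage; assumes the face server is
# already running (here, on the default localhost:8008)
face = RobotFace()
face.say("Hello from streaming_xtts!")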

On a different machine

First, serve PyLips:

git clone https://github.com/witwicki/PyLips.git
cd PyLips
# in your favorite virtual environment
pip install .
python -m pylips.face.start --port <port>  # e.g., 8008

Next, start the TTS server with appropriate flags:

uv run streaming-xtts --pylips --pylipsserver <server_IP_or_hostname>:<port>

To view the animated robot face, navigate to: http://<server_IP_or_hostname>:<port>/face.

Planned improvements

  • Support for Apple Silicon
  • FastAPI for cleaner interface
  • Emotional cues from text on animated face
  • Streaming audio over HTTP (e.g., via DASH) for accessibility on low-compute devices
  • Streaming support for newer TTS models, e.g., F5
