
DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec


πŸ“ Abstract

Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.


Figure 1: The structure and two-stage training of DisCodec.

Figure 2: The overview of DisCo-Speech.
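To make the two-stage design above concrete, here is a purely illustrative NumPy sketch of the DisCodec data flow, not the official implementation: all dimensions, the codebook size, and the use of fixed random projections in place of learned encoders are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 50, 64                 # frames and input feature dim (assumed)
D_C, D_P, D_S = 32, 16, 16    # content / prosody / timbre subspace dims (assumed)

def make_encoder(dim_out):
    """Stand-in for a learned parallel encoder: a fixed linear projection."""
    W = rng.standard_normal((D, dim_out)) / np.sqrt(D)
    return lambda x: x @ W

# Stage 1: tri-factor disentanglement via parallel encoders.
content_enc = make_encoder(D_C)
prosody_enc = make_encoder(D_P)
timbre_enc = make_encoder(D_S)

speech = rng.standard_normal((T, D))
content = content_enc(speech)              # frame-level content subspace
prosody = prosody_enc(speech)              # frame-level prosody subspace
timbre = timbre_enc(speech).mean(axis=0)   # utterance-level timbre vector

# Stage 2: fuse content and prosody into unified content-prosody tokens
# for LM prediction; timbre is held out and injected at the decoder.
fused = np.concatenate([content, prosody], axis=-1)   # (T, D_C + D_P)
codebook = rng.standard_normal((256, D_C + D_P))      # assumed codebook size
tokens = np.argmin(
    ((fused[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1
)

print(tokens.shape, timbre.shape)   # one token per frame; one timbre vector
```

The point of the sketch is the separation of responsibilities: the LM only ever sees the content-prosody token sequence, while the timbre vector bypasses it entirely and reaches the decoder, which is what allows timbre and prosody to be controlled independently at inference time.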

🎡 Audio Samples

We provide extensive samples demonstrating:

  • DisCo-Speech: Zero-shot Controllable Speech Generation
    • Voice Cloning Comparison with Other Methods
    • Zero-Shot Controllable Generation Comparison with Other Methods
  • DisCodec: Disentangled Speech Codec
    • DisCodec Reconstruction Performance
    • Zero-Shot Voice Conversion of DisCodec
    • Disentanglement Visual Analysis
  • Additional DisCo-Speech Zero-Shot Demos
    • Voice Cloning
    • Cross-lingual Voice Cloning
    • Zero-Shot Controllable Generation Demos

πŸ“’ News

  • [2025-12-16] The project page is now live! Check out the samples here.
  • [2025-12-16] Our paper is available on arXiv.

🚧 Code Release Status

We are actively preparing the source code for release. Currently, the code is undergoing internal review and refactoring to ensure it is easy to run and reproduce.

Please Star ⭐ this repository to get the latest updates!


πŸ—ΊοΈ Roadmap

  • Launch Project Page (Demo) 🌐
  • Release Paper on arXiv πŸ“„
  • Release Inference Code πŸ’»
  • Release Pretrained Models (Checkpoints) πŸ“¦
  • Release Training Scripts βš™οΈ

πŸ”— Citation

If you find DisCo-Speech useful for your research, please consider citing our paper:

```bibtex
@article{li2025disco,
  title={DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec},
  author={Li, Tao and Ge, Wenshuo and Wang, Zhichao and Cui, Zihao and Ma, Yong and Gao, Yingying and Deng, Chao and Zhang, Shilei and Feng, Junlan},
  journal={arXiv preprint arXiv:2512.13251},
  year={2025}
}
```
