
DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec


πŸ“ Abstract

Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.


Figure 1: The structure and two-stage training of DisCodec.

Figure 2: The overview of DisCo-Speech.
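To make the two-stage design above concrete, here is a purely illustrative NumPy sketch of the DisCodec data flow, not the official implementation: all dimensions, the codebook size, and the use of fixed random projections in place of learned encoders are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 50, 64                 # frames and input feature dim (assumed)
D_C, D_P, D_S = 32, 16, 16    # content / prosody / timbre subspace dims (assumed)

def make_encoder(dim_out):
    """Stand-in for a learned parallel encoder: a fixed linear projection."""
    W = rng.standard_normal((D, dim_out)) / np.sqrt(D)
    return lambda x: x @ W

# Stage 1: tri-factor disentanglement via parallel encoders.
content_enc = make_encoder(D_C)
prosody_enc = make_encoder(D_P)
timbre_enc = make_encoder(D_S)

speech = rng.standard_normal((T, D))
content = content_enc(speech)              # frame-level content subspace
prosody = prosody_enc(speech)              # frame-level prosody subspace
timbre = timbre_enc(speech).mean(axis=0)   # utterance-level timbre vector

# Stage 2: fuse content and prosody into unified content-prosody tokens
# for LM prediction; timbre is held out and injected at the decoder.
fused = np.concatenate([content, prosody], axis=-1)   # (T, D_C + D_P)
codebook = rng.standard_normal((256, D_C + D_P))      # assumed codebook size
tokens = np.argmin(
    ((fused[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1
)

print(tokens.shape, timbre.shape)   # one token per frame; one timbre vector
```

The point of the sketch is the separation of responsibilities: the LM only ever sees the content-prosody token sequence, while the timbre vector bypasses it entirely and reaches the decoder, which is what allows timbre and prosody to be controlled independently at inference time.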

🎡 Audio Samples

We provide extensive samples demonstrating:

  • DisCo-Speech: Zero-shot Controllable Speech Generation
    • Voice Cloning Comparison with Other Methods
    • Zero-Shot Controllable Generation Comparison with Other Methods
  • DisCodec: Disentangled Speech Codec
    • DisCodec Reconstruction Performance
    • Zero-Shot Voice Conversion of DisCodec
    • Disentanglement Visual Analysis
  • Additional DisCo-Speech Zero-Shot Demos
    • Voice Cloning
    • Cross-lingual Voice Cloning
    • Zero-Shot Controllable Generation Demos

πŸ“’ News

  • [2025-12-16] The project page is now live! Check out the samples here.
  • [2025-12-16] Our paper is available on arXiv.

🚧 Code Release Status

We are actively preparing the source code for release. Currently, the code is undergoing internal review and refactoring to ensure it is easy to run and reproduce.

Please Star ⭐ this repository to get the latest updates!


πŸ—ΊοΈ Roadmap

  • Launch Project Page (Demo) 🌐
  • Release Paper on arXiv πŸ“„
  • Release Inference Code πŸ’»
  • Release Pretrained Models (Checkpoints) πŸ“¦
  • Release Training Scripts βš™οΈ

πŸ”— Citation

If you find DisCo-Speech useful for your research, please consider citing our paper:

```bibtex
@article{li2025disco,
  title={DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec},
  author={Li, Tao and Ge, Wenshuo and Wang, Zhichao and Cui, Zihao and Ma, Yong and Gao, Yingying and Deng, Chao and Zhang, Shilei and Feng, Junlan},
  journal={arXiv preprint arXiv:2512.13251},
  year={2025}
}
```
