Recent codec-based language models~(LMs) have revolutionized text-to-speech~(TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.
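The two-stage design above can be sketched conceptually. The toy example below uses fixed NumPy linear projections as stand-ins for the paper's parallel neural encoders; all names, dimensions, and the mean-pooled timbre vector are illustrative assumptions, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper).
T, D = 50, 64                # frames, input feature dim
D_c, D_p, D_t = 32, 16, 16   # content / prosody / timbre subspace dims

# Toy stand-ins for the three parallel encoders.
W_content = rng.standard_normal((D, D_c))
W_prosody = rng.standard_normal((D, D_p))
W_timbre = rng.standard_normal((D, D_t))

def disentangle(features):
    """Stage 1 (sketch): factorize frame features into three subspaces."""
    content = features @ W_content            # (T, D_c) frame-level content
    prosody = features @ W_prosody            # (T, D_p) frame-level prosody
    timbre = (features @ W_timbre).mean(0)    # (D_t,) utterance-level timbre
    return content, prosody, timbre

def fuse(content, prosody):
    """Stage 2 (sketch): fuse content + prosody into unified tokens for the LM."""
    return np.concatenate([content, prosody], axis=-1)  # (T, D_c + D_p)

features = rng.standard_normal((T, D))
content, prosody, timbre = disentangle(features)
tokens = fuse(content, prosody)
print(tokens.shape, timbre.shape)  # (50, 48) (16,)
```

In this sketch the LM would operate only on the fused content-prosody tokens, while the timbre vector is withheld and injected at the decoder, which is what allows cloning a target voice independently of the prosody prompt.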
Click here to visit our Demo Page
We provide extensive samples demonstrating:
- DisCo-Speech: Zero-shot Controllable Speech Generation
- Voice Cloning Comparison With Other Methods
- Zero-Shot Controllable Generation Comparison With Other Methods
- DisCodec: Disentangled Speech Codec
- DisCodec Reconstruction Performance
- Zero-Shot Voice Conversion of DisCodec
- Disentanglement Visual Analysis
- Additional DisCo-Speech Zero-Shot Demos
- Voice Cloning
- Cross-lingual Voice Cloning
- Zero-Shot Controllable Generation Demos
- [2025-12-16] The project page is now live! Check out the samples here.
- [2025-12-16] Our paper is available on arXiv.
Code Release Status:
We are actively preparing the source code for release. The code is currently undergoing internal review and refactoring to ensure it is easy to run and reproduce.
Please star this repository to get the latest updates!
- Launch Project Page (Demo)
- Release Paper on arXiv
- Release Inference Code
- Release Pretrained Models (Checkpoints)
- Release Training Scripts
If you find DisCo-Speech useful for your research, please consider citing our paper:
@article{li2025disco,
  title={DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec},
  author={Li, Tao and Ge, Wenshuo and Wang, Zhichao and Cui, Zihao and Ma, Yong and Gao, Yingying and Deng, Chao and Zhang, Shilei and Feng, Junlan},
  journal={arXiv preprint arXiv:2512.13251},
  year={2025}
}


