Note: This fork's `main` branch is a patched version of the original yl4579/StyleTTS2 repository. It is maintained specifically as a git submodule for the kokoro-deutsch training recipe.
The Kokoro-82M TTS model is based on the StyleTTS 2 architecture. However, to fine-tune it using its published HuggingFace weights, several modifications to the upstream training code are required:
- PyTorch API Migration (Critical): Migrated `torch.nn.utils.weight_norm` and `spectral_norm` to the modern `torch.nn.utils.parametrizations` API. This is mandatory for compatibility with Kokoro's inference pipeline (`KModel`).
- Kokoro Symbols: Integrated Kokoro's specific 178-token IPA vocabulary (`kokoro_symbols.py`).
- Bug Fixes:
  - Fixed an `unsqueeze` shape mismatch crash at epoch boundaries involving F0 tensors.
  - Fixed checkpoint saving order to prevent data loss if TensorBoard audio generation fails.
  - Fixed missing `.train()` mode re-initializations after checkpoint loading in Stage 2.
  - Removed hardcoded `ipdb` breakpoints that caused silent hangs.
  - Added a monkey-patch that calls `torch.load` with `weights_only=False` for PyTorch 2.6+ compatibility.
  - Filtered long phoneme sequences (> 510 tokens) to prevent PLBERT position embedding overflows.
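The long-sequence filter can be sketched as follows. The sample layout (a dict with `id` and `tokens` keys) and the function name are assumptions made for illustration; only the 510-token threshold comes from the list above.

```python
MAX_PHONEME_TOKENS = 510  # threshold stated in the list above


def filter_long_sequences(samples, max_tokens=MAX_PHONEME_TOKENS):
    """Keep only samples whose token sequence has at most max_tokens entries."""
    return [s for s in samples if len(s["tokens"]) <= max_tokens]


# Hypothetical samples: token values are placeholders.
samples = [
    {"id": "short", "tokens": [1] * 120},
    {"id": "limit", "tokens": [1] * 510},
    {"id": "too_long", "tokens": [1] * 511},
]

kept = filter_long_sequences(samples)
kept_ids = [s["id"] for s in kept]
print(kept_ids)
```

Dropping over-length samples at dataset construction time keeps the check out of the training loop; truncating instead would be an alternative, but it would misalign the phonemes with the audio.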
This repository is not meant to be used standalone.
Please see the kokoro-deutsch repository for the full end-to-end training guide, dataset preparation scripts, and voicepack extraction tools.
For the original StyleTTS2 project and documentation, please visit the upstream repository.