Commit 80b553c

Document G2P phoneme mismatch limitation in Kokoro (#414)
## Summary

- Documents the grapheme-to-phoneme (G2P) conversion limitation affecting Kokoro and KittenTTS pronunciation quality
- Adds Known Issues section explaining the espeak vs `graphemes_to_phonemes_en_us` mismatch
- References PR #409 where pronunciation issues were discovered during KittenTTS testing

## Context

During KittenTTS integration testing (PR #409), @Josscii discovered pronunciation issues with words like "hello" and "day". @Alex-Wengg identified the root cause: the original Kokoro and KittenTTS models were trained using espeak for phoneme generation, but FluidAudio uses `graphemes_to_phonemes_en_us` from HuggingFace (PeterReid). This mismatch causes some words to be pronounced incorrectly because the phoneme outputs don't match what the models expect.

## The Limitation

- **Current G2P**: `graphemes_to_phonemes_en_us` (HuggingFace: PeterReid/graphemes_to_phonemes_en_us)
- **Models trained with**: espeak phonemes
- **Why we can't use espeak**: Licensing constraints
- **Impact**: Affects all TTS models using the shared Kokoro G2P pipeline
- **What's needed**: An espeak-compatible alternative with a permissive license

## Test plan

- [x] Documentation builds correctly
- [x] Links to PR #409 comment thread work
- [x] Known Issues section is clear and actionable

📝 Generated with [Claude Code](https://claude.com/claude-code)
1 parent b80d364 commit 80b553c

1 file changed: +2 -0 lines changed

Documentation/TTS/Kokoro.md

Lines changed: 2 additions & 0 deletions
@@ -112,6 +112,8 @@ The 6 LSTM ops (duration predictor) remain on CPU — CoreML does not schedule r

- **Sibilance in high-pitched voices**: Some female `af_*` voices (e.g. `af_heart`, `af_bella`) produce harsh sibilant sounds (s, sh, z). This is baked into the model output and cannot be fixed with post-processing EQ. Lower-pitched voices (male `am_*` variants and some female voices) are unaffected. See [mobius#23](https://github.com/FluidInference/mobius/issues/23).

+ - **G2P phoneme mismatch limitation**: FluidAudio currently uses `graphemes_to_phonemes_en_us` (from HuggingFace: [PeterReid/graphemes_to_phonemes_en_us](https://huggingface.co/PeterReid/graphemes_to_phonemes_en_us)) for grapheme-to-phoneme conversion. The original Kokoro and KittenTTS models were trained using espeak for phoneme generation. This G2P mismatch can cause pronunciation issues in some words (e.g., "hello" and "day" in KittenTTS). We cannot use espeak directly due to licensing constraints. **Need**: An espeak-compatible alternative with a permissive license that produces matching phoneme outputs. This affects any TTS model in FluidAudio that relies on the shared Kokoro G2P pipeline. See [PR #409](https://github.com/FluidInference/FluidAudio/pull/409#issuecomment-2632827330) for examples.
+
## Enable TTS in Your Project

Kokoro TTS is included in the `FluidAudio` product — no separate product needed.
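
The G2P mismatch documented in this change could be screened for with a small comparison harness. The following is a minimal Python sketch, not part of FluidAudio or this commit: `find_mismatches`, `training_g2p`, and `runtime_g2p` are hypothetical names, and the phoneme symbols (`P1`, `P2`, ...) are placeholders rather than real espeak or `graphemes_to_phonemes_en_us` output. The word "cat" is an assumed matching case added only for contrast. It shows the idea of running the same word list through the training-time and runtime G2P backends and reporting words whose phoneme sequences differ.

```python
from typing import Callable, Dict, List, Tuple

# A G2P backend maps a word to a list of phoneme symbols.
G2P = Callable[[str], List[str]]


def find_mismatches(
    words: List[str], training_g2p: G2P, runtime_g2p: G2P
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return words whose phoneme sequences differ between two G2P backends."""
    mismatches: Dict[str, Tuple[List[str], List[str]]] = {}
    for word in words:
        expected = training_g2p(word)  # what the model saw during training
        actual = runtime_g2p(word)     # what the model receives at inference
        if expected != actual:
            mismatches[word] = (expected, actual)
    return mismatches


# Placeholder lookup tables standing in for real backends (assumption:
# the symbols below are invented and carry no phonetic meaning).
TRAINING_TABLE = {
    "hello": ["P1", "P2", "P3"],
    "day": ["P4", "P5"],
    "cat": ["P6", "P7", "P8"],
}
RUNTIME_TABLE = {
    "hello": ["P1", "P9", "P3"],
    "day": ["P4", "P10"],
    "cat": ["P6", "P7", "P8"],
}


def training_g2p(word: str) -> List[str]:
    return TRAINING_TABLE[word]


def runtime_g2p(word: str) -> List[str]:
    return RUNTIME_TABLE[word]


report = find_mismatches(["hello", "day", "cat"], training_g2p, runtime_g2p)
for word, (expected, actual) in report.items():
    print(f"{word}: trained on {expected}, received {actual}")
```

In practice the two lookup tables would be replaced by calls into the actual backends. A candidate espeak-compatible replacement could be evaluated the same way, by counting how many words in a test list still end up in the report.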
