Document G2P phoneme mismatch limitation in Kokoro (#414)

Alex-Wengg · web-flow · commit 80b553cbb743 · 2026-03-22T23:04:33.000-04:00
## Summary

- Documents the grapheme-to-phoneme (G2P) conversion limitation affecting Kokoro and KittenTTS pronunciation quality
- Adds Known Issues section explaining the espeak vs `graphemes_to_phonemes_en_us` mismatch
- References PR #409 where pronunciation issues were discovered during KittenTTS testing

## Context

During KittenTTS integration testing (PR #409), @Josscii discovered pronunciation issues with words like "hello" and "day". @Alex-Wengg identified the root cause: the original Kokoro and KittenTTS models were trained using espeak for phoneme generation, but FluidAudio uses `graphemes_to_phonemes_en_us` from HuggingFace (PeterReid). This mismatch causes some words to be pronounced incorrectly because the phoneme outputs don't match what the models expect.

## The Limitation

- **Current G2P**: `graphemes_to_phonemes_en_us` (HuggingFace: PeterReid/graphemes_to_phonemes_en_us)
- **Models trained with**: espeak phonemes
- **Why we can't use espeak**: Licensing constraints
- **Impact**: Affects all TTS models using the shared Kokoro G2P pipeline
- **What's needed**: An espeak-compatible alternative with a permissive license

## Test plan

- [x] Documentation builds correctly
- [x] Links to PR #409 comment thread work
- [x] Known Issues section is clear and actionable

📝 Generated with [Claude Code](https://claude.com/claude-code)
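
One way to evaluate a candidate espeak-compatible replacement is to diff its phoneme output against a reference over a word list. The sketch below is illustrative only and is not part of FluidAudio: `find_phoneme_mismatches` and the `Phonemizer` type are hypothetical names, and the toy lookup tables use made-up ASCII symbols rather than real output from espeak or `graphemes_to_phonemes_en_us`.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type: any function mapping a word to a list of phoneme symbols.
Phonemizer = Callable[[str], List[str]]


def find_phoneme_mismatches(
    words: Iterable[str],
    reference: Phonemizer,
    candidate: Phonemizer,
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return the words whose phoneme sequences differ between two G2P backends."""
    mismatches: Dict[str, Tuple[List[str], List[str]]] = {}
    for word in words:
        ref = reference(word)
        cand = candidate(word)
        if ref != cand:
            mismatches[word] = (ref, cand)
    return mismatches


# Toy stand-ins for the two backends. The symbols are placeholders invented
# for this example and do NOT reflect what either real tool produces.
TOY_REFERENCE = {
    "hello": ["h", "@", "l", "o"],
    "day": ["d", "e"],
    "cat": ["k", "a", "t"],
}
TOY_CANDIDATE = {
    "hello": ["h", "E", "l", "o"],
    "day": ["d", "a"],
    "cat": ["k", "a", "t"],
}

report = find_phoneme_mismatches(
    ["hello", "day", "cat"],
    TOY_REFERENCE.__getitem__,
    TOY_CANDIDATE.__getitem__,
)

for word, (ref, cand) in report.items():
    print(f"{word}: reference={' '.join(ref)} candidate={' '.join(cand)}")
```

In practice the two callables would wrap the real reference and candidate G2P backends, and the word list would come from the problem cases collected in PR #409.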
diff --git a/Documentation/TTS/Kokoro.md b/Documentation/TTS/Kokoro.md
@@ -112,6 +112,8 @@ The 6 LSTM ops (duration predictor) remain on CPU — CoreML does not schedule r
 
 - **Sibilance in high-pitched voices**: Some female `af_*` voices (e.g. `af_heart`, `af_bella`) produce harsh sibilant sounds (s, sh, z). This is baked into the model output and cannot be fixed with post-processing EQ. Lower-pitched voices (male `am_*` variants and some female voices) are unaffected. See [mobius#23](https://github.com/FluidInference/mobius/issues/23).
 
+- **G2P phoneme mismatch limitation**: FluidAudio currently uses `graphemes_to_phonemes_en_us` (from HuggingFace: [PeterReid/graphemes_to_phonemes_en_us](https://huggingface.co/PeterReid/graphemes_to_phonemes_en_us)) for grapheme-to-phoneme conversion. The original Kokoro and KittenTTS models were trained using espeak for phoneme generation. This G2P mismatch can cause pronunciation issues in some words (e.g., "hello" and "day" in KittenTTS). We cannot use espeak directly due to licensing constraints. **Need**: An espeak-compatible alternative with a permissive license that produces matching phoneme outputs. This affects any TTS model in FluidAudio that relies on the shared Kokoro G2P pipeline. See [PR #409](https://github.com/FluidInference/FluidAudio/pull/409#issuecomment-2632827330) for examples.
+
 ## Enable TTS in Your Project
 
 Kokoro TTS is included in the `FluidAudio` product — no separate product needed.