Merged

21 commits
- 9600bb8 feat: Qwen3-TTS CoreML conversion pipeline (Alex-Wengg, Jan 28, 2026)
- b0942e5 feat: add best-of-n sampling for improved prosody (Alex-Wengg, Jan 28, 2026)
- f784c7d fix: use sampling for code_predictor to prevent muffled audio (Alex-Wengg, Jan 28, 2026)
- 9f69b33 fix: include RMS in audio scoring to select louder, clearer audio (Alex-Wengg, Jan 28, 2026)
- 2108b85 fix: eliminate double code_predictor calls to reduce background noise (Alex-Wengg, Jan 28, 2026)
- e3ee51c feat: PocketTTS pure CoreML pipeline — zero PyTorch dependency (Alex-Wengg, Jan 28, 2026)
- 5ca5880 chore: remove redundant debug, test, and old conversion scripts (Alex-Wengg, Jan 28, 2026)
- b44c071 docs: add trial log and conversion guide for PocketTTS CoreML (Alex-Wengg, Jan 28, 2026)
- 7f63055 feat: add PocketTTS Python package, conversion scripts, and docs (Alex-Wengg, Jan 29, 2026)
- 7d7f370 refactor: use spectral similarity instead of Resemblyzer (Alex-Wengg, Feb 4, 2026)
- 87fe41d feat: add Qwen3-TTS v9/v10 conversion, bilingual testing, and RAM mea… (Alex-Wengg, Feb 5, 2026)
- 3d87e9a docs: add issue documentation for Qwen3-TTS conversion and integration (Alex-Wengg, Feb 5, 2026)
- 4fd954e chore: remove debug, intermediate, and measurement scripts (Alex-Wengg, Feb 5, 2026)
- d8760e1 refactor: organize scripts into convert/, explore/, test/ subfolders (Alex-Wengg, Feb 5, 2026)
- a5c8e04 docs: add Swift integration and performance issue documentation (Alex-Wengg, Feb 5, 2026)
- c6cfa40 Merge main into feature/qwen3-tts-coreml (accept main for pocket_tts … (Alex-Wengg, Feb 5, 2026)
- b47105f chore: remove unrelated pocket_tts files from PR (Alex-Wengg, Feb 5, 2026)
- 4c69dd2 docs: add debugging methodology guide for CoreML model conversion (Alex-Wengg, Feb 5, 2026)
- 2a5a173 fix: correct EOS token ID and add uv.lock for Qwen3 TTS (Alex-Wengg, Mar 21, 2026)
- 1e77d03 revert root .gitignore to match main (Alex-Wengg, Mar 21, 2026)
- a4765ff Merge remote-tracking branch 'origin/main' into feature/qwen3-tts-coreml (Alex-Wengg, Mar 21, 2026)
feat: add PocketTTS Python package, conversion scripts, and docs
Organize conversion scripts into clean folder structure:
- convert_models/convert/ — CoreML model conversion scripts (4 models)
- convert_models/traceable/ — PyTorch wrappers for tracing (4 models)
- convert_assets/ — constant export script

Also includes: pocket_tts Python package, project config (pyproject.toml,
uv.lock), documentation, tests, and KV cache fix (200→512) in
generate_coreml_v4.py.
Alex-Wengg committed Jan 29, 2026
commit 7f63055142dd3078ebeed7d04013c1afc3331af4
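The KV cache fix called out in the commit message (200→512) shows up in the diffs below as a larger preallocated cache of shape `(2, 1, max_seq_len, 16, 64)`, with NaN marking unwritten slots. A minimal NumPy sketch of that bookkeeping, with illustrative function names not taken from the repo:

```python
import numpy as np

MAX_SEQ_LEN = 512  # raised from 200 so longer generations don't overflow the cache

def make_kv_cache(max_seq_len=MAX_SEQ_LEN, heads=16, head_dim=64):
    # One cache per layer: [key/value, batch, position, heads, head_dim].
    # NaN marks unwritten slots, matching torch.full(..., float('nan')) in the diff.
    return np.full((2, 1, max_seq_len, heads, head_dim), np.nan, dtype=np.float32)

def write_step(cache, pos, k, v):
    # Write this step's key/value at the current position and advance it.
    cache[0, 0, pos] = k
    cache[1, 0, pos] = v
    return pos + 1

cache = make_kv_cache()
pos = 0
for _ in range(3):
    pos = write_step(cache, pos,
                     np.zeros((16, 64), np.float32),
                     np.zeros((16, 64), np.float32))

print(pos)                              # 3
print(np.isnan(cache[0, 0, 3:]).all())  # True: remaining slots still empty
```

A fixed `max_seq_len` trades memory for a static tensor shape, which is what CoreML conversion requires; generation longer than 512 steps would still need a reset or rolling scheme.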
1 change: 1 addition & 0 deletions models/tts/pocket_tts/.python-version
@@ -0,0 +1 @@
3.10
23 changes: 23 additions & 0 deletions models/tts/pocket_tts/LICENSE
@@ -0,0 +1,23 @@
Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the
Software without restriction, including without
limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software
is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice
shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR
IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
161 changes: 161 additions & 0 deletions models/tts/pocket_tts/README.md
@@ -0,0 +1,161 @@
# Pocket TTS

<img width="1446" height="622" alt="pocket-tts-logo-v2-transparent" src="https://github.com/user-attachments/assets/637b5ed6-831f-4023-9b4c-741be21ab238" />

A lightweight text-to-speech (TTS) application designed to run efficiently on CPUs.
Forget about the hassle of using GPUs and web APIs serving TTS models. With Kyutai's Pocket TTS, generating audio is just a pip install and a function call away.

Supports Python 3.10, 3.11, 3.12, 3.13 and 3.14. Requires PyTorch 2.5+; the GPU build of PyTorch is not needed.

[🔊 Demo](https://kyutai.org/pocket-tts) |
[🐱‍💻 GitHub Repository](https://github.com/kyutai-labs/pocket-tts) |
[🤗 Hugging Face Model Card](https://huggingface.co/kyutai/pocket-tts) |
[⚙️ Tech report](https://kyutai.org/blog/2026-01-13-pocket-tts) |
[📄 Paper](https://arxiv.org/abs/2509.06926) |
[📚 Documentation](https://github.com/kyutai-labs/pocket-tts/tree/main/docs)


## Main takeaways
* Runs on CPU
* Small model size, 100M parameters
* Audio streaming
* Low latency, ~200ms to get the first audio chunk
* Faster than real-time, ~6x real-time on the CPU of a MacBook Air M4
* Uses only 2 CPU cores
* Python API and CLI
* Voice cloning
* English only at the moment
* Can handle infinitely long text inputs
* [Can run client-side in the browser](#in-browser-implementations)

## Trying it from the website, without installing anything

Navigate to the [Kyutai website](https://kyutai.org/pocket-tts) to try it out directly in your browser. You can input text, select different voices, and generate speech without any installation.

## Trying it with the CLI

### The `generate` command
You can use pocket-tts directly from the command line. We recommend using
`uv` as it installs any dependencies on the fly in an isolated environment (uv installation instructions [here](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer)).
You can also use `pip install pocket-tts` to install it manually.

The command below generates a wav file `./tts_output.wav` saying the default text with the default voice, and prints some speed statistics.
```bash
uvx pocket-tts generate
# or if you installed it manually with pip:
pocket-tts generate
```
Modify the voice with `--voice` and the text with `--text`. We provide a small catalog of voices.

You can take a look at [this page](https://huggingface.co/kyutai/tts-voices) which details the licenses
for each voice.

* [alba](https://huggingface.co/kyutai/tts-voices/blob/main/alba-mackenna/casual.wav)
* [marius](https://huggingface.co/kyutai/tts-voices/blob/main/voice-donations/Selfie.wav)
* [javert](https://huggingface.co/kyutai/tts-voices/blob/main/voice-donations/Butter.wav)
* [jean](https://huggingface.co/kyutai/tts-voices/blob/main/ears/p010/freeform_speech_01.wav)
* [fantine](https://huggingface.co/kyutai/tts-voices/blob/main/vctk/p244_023.wav)
* [cosette](https://huggingface.co/kyutai/tts-voices/blob/main/expresso/ex04-ex02_confused_001_channel1_499s.wav)
* [eponine](https://huggingface.co/kyutai/tts-voices/blob/main/vctk/p262_023.wav)
* [azelma](https://huggingface.co/kyutai/tts-voices/blob/main/vctk/p303_023.wav)

The `--voice` argument can also take a plain wav file as input for voice cloning.
You can use your own or check out our [voice repository](https://huggingface.co/kyutai/tts-voices).
We recommend [cleaning the sample](https://podcast.adobe.com/en/enhance) before using it with Pocket TTS, because the audio quality of the sample is also reproduced.

Feel free to check out the [generate documentation](https://github.com/kyutai-labs/pocket-tts/tree/main/docs/generate.md) for more details and examples.
For trying multiple voices and prompts quickly, prefer using the `serve` command.

### The `serve` command

You can also run a local server to generate audio via HTTP requests.
```bash
uvx pocket-tts serve
# or if you installed it manually with pip:
pocket-tts serve
```
Navigate to `http://localhost:8000` to try the web interface; it's faster than the command line because the model is kept in memory between requests.

You can check out the [serve documentation](https://github.com/kyutai-labs/pocket-tts/tree/main/docs/serve.md) for more details and examples.

### The `export-voice` command

Processing an audio file (e.g., a .wav or .mp3) for voice cloning is relatively slow, but loading a safetensors file -- a voice embedding converted from an audio file -- is very fast. You can use the `export-voice` command to do this conversion. See the [export-voice documentation](https://github.com/kyutai-labs/pocket-tts/tree/main/docs/export_voice.md) for more details and examples.


## Using it as a Python library

You can try out the Python library on Colab [here](https://colab.research.google.com/github/kyutai-labs/pocket-tts/blob/main/docs/pocket-tts-example.ipynb).

Install the package with
```bash
pip install pocket-tts
# or
uv add pocket-tts
```

You can use this package as a simple Python library to generate audio from text.
```python
from pocket_tts import TTSModel
import scipy.io.wavfile

tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt(
    "alba"  # One of the pre-made voices, see above
    # You can also use any voice file you have locally or from Hugging Face:
    # "./some_audio.wav"
    # or "hf://kyutai/tts-voices/expresso/ex01-ex02_default_001_channel2_198s.wav"
)
audio = tts_model.generate_audio(voice_state, "Hello world, this is a test.")
# Audio is a 1D torch tensor containing PCM data.
scipy.io.wavfile.write("output.wav", tts_model.sample_rate, audio.numpy())
```

You can keep multiple voice states around if you want to use several voices. `load_model()` and `get_state_for_audio_prompt()` are relatively slow operations, so we recommend keeping the model and voice states in memory if you can.
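That recommendation boils down to simple memoization. A self-contained sketch of the pattern, where a stub loader stands in for the slow `get_state_for_audio_prompt()` call (the stub and the load counter are illustrative only):

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts "expensive" loads so we can see the cache working

@lru_cache(maxsize=None)
def voice_state_for(voice: str):
    # Stub: a real version would call tts_model.get_state_for_audio_prompt(voice)
    global LOAD_COUNT
    LOAD_COUNT += 1
    return f"state({voice})"

voice_state_for("alba")
voice_state_for("marius")
voice_state_for("alba")  # served from the cache, no second load

print(LOAD_COUNT)  # 2
```

With this in place, a long-running process (like the `serve` command) pays the per-voice cost once and reuses the state on every request.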

You can check out the [Python API documentation](https://github.com/kyutai-labs/pocket-tts/tree/main/docs/python-api.md) for more details and examples.

## Unsupported features

At the moment, we do not support (but would love pull requests adding):
- [Running the TTS inside a web browser (WebAssembly)](https://github.com/kyutai-labs/pocket-tts/issues/1)
- [A compiled version using, for example, `torch.compile()` or `candle`.](https://github.com/kyutai-labs/pocket-tts/issues/2)
- [Adding silence in the text input to generate pauses.](https://github.com/kyutai-labs/pocket-tts/issues/6)
- [Quantization to run the computation in int8.](https://github.com/kyutai-labs/pocket-tts/issues/7)

We tried running this TTS model on the GPU but did not observe a speedup compared to CPU execution,
notably because we use a batch size of 1 and a very small model.

## Development and local setup

We accept contributions! Feel free to open issues or pull requests on GitHub.

You can find development instructions in the [CONTRIBUTING.md](https://github.com/kyutai-labs/pocket-tts/tree/main/CONTRIBUTING.md) file, including how to set up an editable install of the package for local development.

## In-browser implementations

Pocket TTS is small enough to run directly in your browser in WebAssembly/JavaScript.
We don't have official support for this yet, but you can try out one of these community implementations:

- [babybirdprd/pocket-tts](https://github.com/babybirdprd/pocket-tts): Candle version (Rust) with WebAssembly and PyO3 bindings, meaning it can run on the web too.
- [ekzhang/jax-js](https://github.com/ekzhang/jax-js/tree/main/website/src/routes/tts): Using jax-js, an ML library for the web. Demo [here](https://jax-js.com/tts).
- [KevinAHM/pocket-tts-onnx-export](https://github.com/KevinAHM/pocket-tts-onnx-export): Model exported to .onnx and run using [ONNX Runtime Web](https://onnxruntime.ai/docs/tutorials/web/). Demo [here](https://huggingface.co/spaces/KevinAHM/pocket-tts-web)

## Projects using Pocket TTS

- [lukasmwerner/pocket-reader](https://github.com/lukasmwerner/pocket-reader) - Browser screen reader
- [ikidd/pocket-tts-wyoming](https://github.com/ikidd/pocket-tts-wyoming) - Docker container for pocket-tts using Wyoming protocol, ready for Home Assistant Voice use.

## Prohibited use

Use of our model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation, voice impersonation or cloning without explicit and lawful consent; misinformation, disinformation, or deception (including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events); and the generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content. We disclaim all liability for any non-compliant use.


## Authors

Manu Orsini*, Simon Rouard*, Gabriel De Marmiesse*, Václav Volhejn, Neil Zeghidour, Alexandre Défossez

*equal contribution
@@ -13,10 +13,13 @@
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_COREML_DIR = os.path.dirname(_SCRIPT_DIR)
_PROJECT_DIR = os.path.dirname(_COREML_DIR)
sys.path.insert(0, _PROJECT_DIR) # for: from pocket_tts import ...
sys.path.insert(0, os.path.join(_COREML_DIR, "convert_models", "traceable")) # for: from traceable_* import ...

OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "constants")
OUTPUT_DIR = os.path.join(_COREML_DIR, "constants")


def export():
@@ -54,14 +57,18 @@ def export():
print(f"text_embed_table: {embed_table.shape}")
np.save(os.path.join(OUTPUT_DIR, "text_embed_table.npy"), embed_table)

# Also export the Mimi decoder init state shapes (for reference)
from traceable_decoder import TraceableMimiDecoder
decoder = TraceableMimiDecoder.from_mimi(model.mimi)
mimi_state = decoder.init_state(batch_size=1)
# Also export the Mimi decoder init state
from pocket_tts.modules.stateful_module import init_states

state = init_states(model.mimi.decoder, batch_size=1, sequence_length=256)
state.update(init_states(model.mimi.decoder_transformer, batch_size=1, sequence_length=256))
if hasattr(model.mimi, "upsample"):
state.update(init_states(model.mimi.upsample, batch_size=1, sequence_length=256))

mimi_state_np = {}
for k, v in mimi_state.items():
arr = v.numpy().astype(np.float32)
mimi_state_np[k] = arr
for mod_name, mod_state in state.items():
for key, tensor in mod_state.items():
mimi_state_np[key] = tensor.numpy().astype(np.float32)
np.savez(os.path.join(OUTPUT_DIR, "mimi_init_state.npz"), **mimi_state_np)
print(f"mimi_init_state: {len(mimi_state_np)} tensors")

@@ -5,8 +5,12 @@
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_CONVERT_MODELS_DIR = os.path.dirname(_SCRIPT_DIR)
_COREML_DIR = os.path.dirname(_CONVERT_MODELS_DIR)
_PROJECT_DIR = os.path.dirname(_COREML_DIR)
sys.path.insert(0, _PROJECT_DIR) # for: from pocket_tts import ...
sys.path.insert(0, os.path.join(_CONVERT_MODELS_DIR, "traceable")) # for: from traceable_* import ...

from traceable_cond_step import TraceableCondStep

@@ -17,12 +21,12 @@ def convert():
model = TTSModel.load_model(lsd_decode_steps=8)
model.eval()

cond_step = TraceableCondStep.from_flowlm(model.flow_lm, max_seq_len=200)
cond_step = TraceableCondStep.from_flowlm(model.flow_lm, max_seq_len=512)
cond_step.eval()

# Example inputs
conditioning = torch.randn(1, 1, 1024)
cache = torch.full((2, 1, 200, 16, 64), float('nan'))
cache = torch.full((2, 1, 512, 16, 64), float('nan'))
pos = torch.zeros(1)

example_inputs = (
@@ -39,7 +43,7 @@
print("Converting to CoreML...")
inputs = [ct.TensorType(name="conditioning", shape=(1, 1, 1024))]
for i in range(6):
inputs.append(ct.TensorType(name=f"cache{i}", shape=(2, 1, 200, 16, 64)))
inputs.append(ct.TensorType(name=f"cache{i}", shape=(2, 1, 512, 16, 64)))
inputs.append(ct.TensorType(name=f"position{i}", shape=(1,)))

mlmodel = ct.convert(
@@ -67,7 +71,7 @@ def convert():
'conditioning': np.random.randn(1, 1, 1024).astype(np.float32),
}
for i in range(6):
test_inputs[f'cache{i}'] = np.zeros((2, 1, 200, 16, 64), dtype=np.float32)
test_inputs[f'cache{i}'] = np.zeros((2, 1, 512, 16, 64), dtype=np.float32)
test_inputs[f'position{i}'] = np.array([0.0], dtype=np.float32)
out = coreml_model.predict(test_inputs)
print(f"Output keys: {len(out)}")
@@ -5,8 +5,12 @@
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_CONVERT_MODELS_DIR = os.path.dirname(_SCRIPT_DIR)
_COREML_DIR = os.path.dirname(_CONVERT_MODELS_DIR)
_PROJECT_DIR = os.path.dirname(_COREML_DIR)
sys.path.insert(0, _PROJECT_DIR) # for: from pocket_tts import ...
sys.path.insert(0, os.path.join(_CONVERT_MODELS_DIR, "traceable")) # for: from traceable_* import ...

from traceable_flow_decoder import TraceableFlowDecoder

@@ -5,8 +5,12 @@
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_CONVERT_MODELS_DIR = os.path.dirname(_SCRIPT_DIR)
_COREML_DIR = os.path.dirname(_CONVERT_MODELS_DIR)
_PROJECT_DIR = os.path.dirname(_COREML_DIR)
sys.path.insert(0, _PROJECT_DIR) # for: from pocket_tts import ...
sys.path.insert(0, os.path.join(_CONVERT_MODELS_DIR, "traceable")) # for: from traceable_* import ...

from traceable_flowlm_step import TraceableFlowLMStep

@@ -18,7 +22,7 @@ def convert_flowlm_step():
model.eval()

print("Creating traceable step model...")
max_seq_len = 200
max_seq_len = 512
step_model = TraceableFlowLMStep.from_flowlm(model.flow_lm, max_seq_len=max_seq_len)
step_model.eval()

@@ -0,0 +1,77 @@
"""Convert Mimi streaming decoder to CoreML.

NOTE: The Mimi decoder uses in-place state mutations (state[:] = ...) in its
streaming convolution layers (StreamingConv1d, StreamingConvTranspose1d).
coremltools cannot convert these in-place operations directly.

The existing mimi_decoder_v2.mlpackage was converted using a custom traceable
wrapper that rewrites all streaming ops as functional (returning new tensors
instead of mutating in place). This requires rewriting the forward pass of:
- StreamingConv1d.forward() (conv.py)
- StreamingConvTranspose1d.forward() (conv.py)
- MimiTransformerLayer attention cache updates (mimi_transformer.py)

The model has 26 streaming state tensors (see traceable_mimi_decoder.py for
the full list) and produces 1920 audio samples per frame at 24kHz.

To regenerate mimi_decoder_v2.mlpackage:
1. Create a functional TraceableMimiDecoder that avoids all in-place ops
2. Trace with sequence_length=256 for attention caches
3. Convert with compute_precision=FLOAT32, target=macOS15

Input: latent [1, 512, 1] + 26 state tensors
Output: audio [1, 1, 1920] + 26 updated state tensors
"""
import sys
import os

_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_CONVERT_MODELS_DIR = os.path.dirname(_SCRIPT_DIR)
_COREML_DIR = os.path.dirname(_CONVERT_MODELS_DIR)
_PROJECT_DIR = os.path.dirname(_COREML_DIR)
sys.path.insert(0, _PROJECT_DIR) # for: from pocket_tts import ...
sys.path.insert(0, os.path.join(_CONVERT_MODELS_DIR, "traceable")) # for: from traceable_* import ...


def convert():
"""Reference conversion — requires functional Mimi wrapper (see docstring)."""
import torch
import numpy as np
import coremltools as ct
from pocket_tts import TTSModel
from pocket_tts.modules.stateful_module import init_states

print("Loading model...")
model = TTSModel.load_model(lsd_decode_steps=8)
model.eval()

# Show the state structure for reference
print("\nMimi decoder streaming state:")
state = init_states(model.mimi.decoder, batch_size=1, sequence_length=256)
state.update(
init_states(model.mimi.decoder_transformer, batch_size=1, sequence_length=256)
)
if hasattr(model.mimi, "upsample"):
state.update(
init_states(model.mimi.upsample, batch_size=1, sequence_length=256)
)

total_params = 0
for mod_name, mod_state in state.items():
for key, tensor in mod_state.items():
total_params += tensor.numel()
print(f" {mod_name}.{key}: {list(tensor.shape)}")

print(f"\nTotal state elements: {total_params:,}")
print(f"State tensors: {sum(len(s) for s in state.values())}")

print(
"\nERROR: Direct conversion not supported due to in-place state mutations."
)
print("The existing mimi_decoder_v2.mlpackage uses a custom functional wrapper.")
print("See this file's docstring for details on how to regenerate it.")
return None


if __name__ == "__main__":
convert()
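The in-place vs. functional distinction described in the docstring above can be shown with a toy streaming convolution. NumPy stands in for torch and the 2-tap kernel and function names are purely illustrative, but the shape of the fix is the same: return fresh state tensors instead of writing through `state[:]`.

```python
import numpy as np

def streaming_conv_inplace(x, state, kernel):
    # Original style: mutate the caller's history buffer in place.
    ctx = np.concatenate([state, x])
    y = np.convolve(ctx, kernel, mode="valid")
    state[:] = ctx[-len(state):]  # in-place write: coremltools cannot trace this
    return y

def streaming_conv_functional(x, state, kernel):
    # Functional style: return the new state, leave the input state untouched.
    ctx = np.concatenate([state, x])
    y = np.convolve(ctx, kernel, mode="valid")
    new_state = ctx[-len(state):].copy()
    return y, new_state

kernel = np.array([0.5, 0.5])
state = np.zeros(1)            # one sample of history for a 2-tap kernel
x = np.array([1.0, 2.0, 3.0])

y, state = streaming_conv_functional(x, state, kernel)
print(y)      # [0.5 1.5 2.5]
print(state)  # [3.], carried to the next frame by the caller
```

The functional version makes the state an explicit input and output, which is exactly the interface a converted `.mlpackage` exposes (latent plus state tensors in, audio plus updated state tensors out).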