2024 MacBook Pro, 48GB Ram, M4 Pro, Tahoe 26.0
https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml
swift run fluidaudio fleurs-benchmark --languages all --samples all
[17:19:31.944] [INFO] [FluidAudio.FLEURSBenchmark] ----------------------------------------
[17:19:31.944] [INFO] [FluidAudio.FLEURSBenchmark] ================================================================================
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Results saved to fleurs_benchmark_results.json
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] ================================================================================
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] FLEURS BENCHMARK SUMMARY
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] ================================================================================
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark]
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Language | WER% | CER% | RTFx | Duration | Processed | Skipped
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] -----------------------------------------------------------------------------------------
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Bulgarian (Bulgaria) | 12.9 | 4.1 | 187.5 | 3468.0s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Croatian (Croatia) | 14.2 | 4.3 | 197.1 | 3647.0s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Czech (Czechia) | 12.5 | 4.1 | 205.7 | 4247.4s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Danish (Denmark) | 20.6 | 7.7 | 206.1 | 10579.1s | 930 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Dutch (Netherlands) | 7.9 | 2.7 | 184.9 | 3337.7s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] English (US) | 5.7 | 2.8 | 200.8 | 3442.9s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Estonian (Estonia) | 20.5 | 4.4 | 215.4 | 10825.4s | 893 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Finnish (Finland) | 15.5 | 3.5 | 211.9 | 11894.4s | 918 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] French (France) | 6.3 | 2.6 | 192.0 | 3667.3s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] German (Germany) | 7.1 | 2.8 | 206.6 | 4684.6s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Greek (Greece) | 37.1 | 13.8 | 175.2 | 6862.0s | 650 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Hungarian (Hungary) | 18.1 | 5.4 | 203.9 | 11050.9s | 905 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Italian (Italy) | 4.8 | 1.9 | 222.6 | 5098.7s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Latvian (Latvia) | 27.9 | 7.8 | 208.6 | 10218.6s | 851 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Lithuanian (Lithuania) | 25.3 | 7.0 | 193.7 | 10686.5s | 986 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Maltese (Malta) | 25.6 | 9.7 | 206.8 | 12770.6s | 926 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Polish (Poland) | 8.7 | 2.9 | 183.8 | 3409.6s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Romanian (Romania) | 14.7 | 4.8 | 192.3 | 9099.4s | 883 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Russian (Russia) | 7.4 | 2.4 | 199.6 | 3974.6s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Slovak (Slovakia) | 13.0 | 4.5 | 217.8 | 4169.6s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Slovenian (Slovenia) | 27.6 | 9.4 | 189.1 | 8173.1s | 834 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Spanish (Spain) | 6.5 | 3.9 | 214.2 | 4258.9s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Swedish (Sweden) | 17.0 | 5.2 | 211.7 | 8399.2s | 759 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] Ukrainian (Ukraine) | 7.4 | 2.5 | 195.4 | 3853.7s | 350 | -
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] -----------------------------------------------------------------------------------------
[17:19:31.945] [INFO] [FluidAudio.FLEURSBenchmark] AVERAGE | 15.2 | 5.0 | 200.9 | 161819.2s | 14085 | -
[17:19:31.954] [INFO] [FluidAudio.Main] Peak memory usage (process-wide): 0.487 GB
[02:01:49.655] [INFO] [Benchmark] 2620 files per dataset • Test runtime: 3m 2s • 09/25/2025, 2:01 AM EDT
[02:01:49.655] [INFO] [Benchmark] --- Benchmark Results ---
[02:01:49.655] [INFO] [Benchmark] Dataset: librispeech test-clean
[02:01:49.655] [INFO] [Benchmark] Files processed: 2620
[02:01:49.655] [INFO] [Benchmark] Average WER: 2.6%
[02:01:49.655] [INFO] [Benchmark] Median WER: 0.0%
[02:01:49.655] [INFO] [Benchmark] Average CER: 1.1%
[02:01:49.655] [INFO] [Benchmark] Median RTFx: 137.8x
[02:01:49.655] [INFO] [Benchmark] Overall RTFx: 153.4x (19452.5s / 126.8s)
[02:01:49.655] [INFO] [Benchmark] Results saved to: asr_benchmark_results.json
[02:01:49.655] [INFO] [Benchmark] ASR benchmark completed successfully
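The RTFx figures reported above are simply audio duration divided by wall-clock processing time; the overall line (19452.5s / 126.8s) works out as follows:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / processing_seconds

# Overall figure from the run above: 19452.5s of audio processed in 126.8s.
print(round(rtfx(19452.5, 126.8), 1))  # 153.4
```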
swift run fluidaudio asr-benchmark --max-files all --model-version v2
Use v2 if you only need English; it is slightly more accurate.
[01:35:16.880] [INFO] [Benchmark] File: 908-157963-0010.flac (WER: 15.4%) (Duration: 6.28s)
[01:35:16.880] [INFO] [Benchmark] ------------------------------------------------------------
[01:35:16.894] [INFO] [Benchmark] Normalized Reference: she ceasd and smild in tears then sat down in her silver shrine
[01:35:16.894] [INFO] [Benchmark] Normalized Hypothesis: she ceased and smiled in tears then sat down in her silver shrine
[01:35:16.894] [INFO] [Benchmark] Original Hypothesis: She ceased and smiled in tears, Then sat down in her silver shrine,
[01:35:16.894] [INFO] [Benchmark] 2620 files per dataset • Test runtime: 3m 25s • 09/26/2025, 1:35 AM EDT
[01:35:16.894] [INFO] [Benchmark] --- Benchmark Results ---
[01:35:16.894] [INFO] [Benchmark] Dataset: librispeech test-clean
[01:35:16.894] [INFO] [Benchmark] Files processed: 2620
[01:35:16.894] [INFO] [Benchmark] Average WER: 2.2%
[01:35:16.894] [INFO] [Benchmark] Median WER: 0.0%
[01:35:16.894] [INFO] [Benchmark] Average CER: 0.7%
[01:35:16.894] [INFO] [Benchmark] Median RTFx: 125.6x
[01:35:16.894] [INFO] [Benchmark] Overall RTFx: 141.2x (19452.5s / 137.7s)
[01:35:16.894] [INFO] [Benchmark] Results saved to: asr_benchmark_results.json
[01:35:16.894] [INFO] [Benchmark] ASR benchmark completed successfully
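The per-file WER in the example above follows from word-level edit distance over the normalized strings: two substitutions (the reference's archaic spellings "ceasd" and "smild" versus "ceased" and "smiled") out of 13 reference words give 15.4%. A minimal sketch of the metric (not the library's actual implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + cost)
    return d[len(hyp)] / len(ref)

ref = "she ceasd and smild in tears then sat down in her silver shrine"
hyp = "she ceased and smiled in tears then sat down in her silver shrine"
print(f"{word_error_rate(ref, hyp):.1%}")  # 15.4%
```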
Core ML first-load compile times captured on iPhone 16 Pro Max and iPhone 13 running the parakeet-tdt-0.6b-v3-coreml bundle. Cold-start compilation happens the first time each Core ML model is loaded; subsequent loads hit the cached binaries. Warm compile metrics were collected only on the iPhone 16 Pro Max run, and only for models that were reloaded during the session.
| Model | iPhone 16 Pro Max cold (ms) | iPhone 16 Pro Max warm (ms) | iPhone 13 cold (ms) | Compute units |
|---|---|---|---|---|
| Preprocessor | 9.15 | - | 632.63 | .all (rawValue: 2) |
| Encoder | 3361.23 | 162.05 | 4396.00 | .cpuAndGPU (rawValue: 1) |
| Decoder | 88.49 | 8.11 | 146.01 | .cpuAndGPU (rawValue: 1) |
| JointDecision | 48.46 | 7.97 | 71.85 | .cpuAndGPU (rawValue: 1) |
We generated audio from the same set of strings, producing between 1s and ~300s of output, to test speed across a range of input lengths on the PyTorch CPU, MPS, and MLX pipelines, and compared them against the native Swift version with Core ML models.
Each pipeline warmed up its models by running them once on pseudo inputs; we then compared raw inference time with the models already loaded. For the Core ML model, we traded lower memory use and slightly faster inference for a longer initial warm-up.
Note that the Kokoro PyTorch model has a known memory-leak issue: hexgrad/kokoro#152
The following tests were run on an M4 Pro MacBook Pro with 48GB RAM. If you have another device, please try replicating them as well!
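The measurement loop for each pipeline can be sketched as below. `synthesize` is a placeholder for whichever backend is under test (PyTorch, MLX, or the Swift/Core ML CLI), and the 24 kHz sample rate is an assumption based on Kokoro's output format:

```python
import time

def benchmark_tts(synthesize, texts, sample_rate=24_000):
    """Warm up once on a pseudo input, then time each synthesis with models loaded.

    `synthesize` is a hypothetical callable that returns a sequence of audio
    samples. RTFx = seconds of audio produced / wall-clock inference seconds.
    """
    synthesize("warm-up")  # first call absorbs model load/compile cost
    rows = []
    for text in texts:
        start = time.perf_counter()
        audio = synthesize(text)
        inference = time.perf_counter() - start
        output_seconds = len(audio) / sample_rate
        rows.append((len(text), output_seconds, inference, output_seconds / inference))
    return rows
```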
KPipeline benchmark for voice af_heart (warm-up took 0.175s) using hexgrad/kokoro
Test Chars Output (s) Inf(s) RTFx Peak GB
1 42 2.750 0.187 14.737x 1.44
2 129 8.625 0.530 16.264x 1.85
3 254 15.525 0.923 16.814x 2.65
4 93 6.125 0.349 17.566x 2.66
5 104 7.200 0.410 17.567x 2.70
6 130 9.300 0.504 18.443x 2.72
7 197 12.850 0.726 17.711x 2.83
8 6 1.350 0.098 13.823x 2.83
9 1228 76.200 4.342 17.551x 3.19
10 567 35.200 2.069 17.014x 4.85
11 4615 286.525 17.041 16.814x 4.78
Total - 461.650 27.177 16.987x 4.85
I wasn't able to run the MPS model for the longer durations; even with PYTORCH_ENABLE_MPS_FALLBACK=1 enabled, it kept crashing on the longer strings.
KPipeline benchmark for voice af_heart (warm-up took 0.568s) using pip package
Test Chars Output (s) Inf(s) RTFx Peak GB
1 42 2.750 0.414 6.649x 1.41
2 129 8.625 0.729 11.839x 1.54
Total - 11.375 1.142 9.960x 1.54
TTS benchmark for voice af_heart (warm-up took an extra 2.155s) using model prince-canuma/Kokoro-82M
Test Chars Output (s) Inf(s) RTFx Peak GB
1 42 2.750 0.347 7.932x 1.12
2 129 8.650 0.597 14.497x 2.47
3 254 15.525 0.825 18.829x 2.65
4 93 6.125 0.306 20.039x 2.65
5 104 7.200 0.343 21.001x 2.65
6 130 9.300 0.560 16.611x 2.65
7 197 12.850 0.596 21.573x 2.65
8 6 1.350 0.364 3.706x 2.65
9 1228 76.200 2.979 25.583x 3.29
10 567 35.200 1.374 25.615x 3.37
11 4615 286.500 11.112 25.783x 3.37
Total - 461.650 19.401 23.796x 3.37
Note that it takes ~15s to compile the model on the first run; subsequent runs are shorter, with loads of ~2s expected.
> swift run fluidaudio tts --benchmark
...
FluidAudio TTS benchmark for voice af_heart (warm-up took an extra 2.348s)
Test Chars Output (s) Inf(s) RTFx
1 42 2.825 0.440 6.424x
2 129 7.725 0.594 13.014x
3 254 13.400 0.776 17.278x
4 93 5.875 0.587 10.005x
5 104 6.675 0.613 10.889x
6 130 8.075 0.621 13.008x
7 197 10.650 0.627 16.983x
8 6 0.825 0.360 2.290x
9 1228 67.625 2.362 28.625x
10 567 33.025 1.341 24.619x
11 4269 247.600 9.087 27.248x
Total - 404.300 17.408 23.225x
Peak memory usage (process-wide): 1.503 GB
The model is nearly identical to the base model in quality; performance-wise, we see up to a ~3.5x improvement over the Silero PyTorch VAD model when using the 256ms batch model (8 chunks of 32ms).
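The 256ms batch is just 8 consecutive 32ms chunks; at a 16 kHz sample rate (an assumption here, standard for VAD models) each chunk is 512 samples. A sketch of the chunking:

```python
def batch_chunks(samples, sample_rate=16_000, chunk_ms=32, chunks_per_batch=8):
    """Split audio into fixed 32ms chunks, then group 8 of them per 256ms batch."""
    chunk_len = sample_rate * chunk_ms // 1000  # 512 samples at 16 kHz
    chunks = [samples[i:i + chunk_len]
              for i in range(0, len(samples) - chunk_len + 1, chunk_len)]
    return [chunks[i:i + chunks_per_batch]
            for i in range(0, len(chunks), chunks_per_batch)]
```

For example, one second of 16 kHz audio yields 31 full 512-sample chunks, grouped into four batches (the last one partial).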
Dataset: https://github.com/Lab41/VOiCES-subset
swift run fluidaudio vad-benchmark --dataset voices-subset --all-files --threshold 0.85
...
Timing Statistics:
[18:56:31.208] [INFO] [VAD] Total processing time: 0.29s
[18:56:31.208] [INFO] [VAD] Total audio duration: 351.05s
[18:56:31.208] [INFO] [VAD] RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD] Audio loading time: 0.00s (0.6%)
[18:56:31.208] [INFO] [VAD] VAD inference time: 0.28s (98.7%)
[18:56:31.208] [INFO] [VAD] Average per file: 0.011s
[18:56:31.208] [INFO] [VAD] Min per file: 0.001s
[18:56:31.208] [INFO] [VAD] Max per file: 0.020s
[18:56:31.208] [INFO] [VAD]
VAD Benchmark Results:
[18:56:31.208] [INFO] [VAD] Accuracy: 96.0%
[18:56:31.208] [INFO] [VAD] Precision: 100.0%
[18:56:31.208] [INFO] [VAD] Recall: 95.8%
[18:56:31.208] [INFO] [VAD] F1-Score: 97.9%
[18:56:31.208] [INFO] [VAD] Total Time: 0.29s
[18:56:31.208] [INFO] [VAD] RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD] Files Processed: 25
[18:56:31.208] [INFO] [VAD] Avg Time per File: 0.011s
swift run fluidaudio vad-benchmark --dataset musan-full --num-files all --threshold 0.8
...
[23:02:35.539] [INFO] [VAD] Total processing time: 322.31s
[23:02:35.539] [INFO] [VAD] Timing Statistics:
[23:02:35.539] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.539] [INFO] [VAD] Audio loading time: 1.20s (0.4%)
[23:02:35.539] [INFO] [VAD] VAD inference time: 319.57s (99.1%)
[23:02:35.539] [INFO] [VAD] Average per file: 0.160s
[23:02:35.539] [INFO] [VAD] Total audio duration: 393442.58s
[23:02:35.539] [INFO] [VAD] Min per file: 0.000s
[23:02:35.539] [INFO] [VAD] Max per file: 0.873s
[23:02:35.711] [INFO] [VAD] VAD Benchmark Results:
[23:02:35.711] [INFO] [VAD] Accuracy: 94.2%
[23:02:35.711] [INFO] [VAD] Precision: 92.6%
[23:02:35.711] [INFO] [VAD] Recall: 78.9%
[23:02:35.711] [INFO] [VAD] F1-Score: 85.2%
[23:02:35.711] [INFO] [VAD] Total Time: 322.31s
[23:02:35.711] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.711] [INFO] [VAD] Files Processed: 2016
[23:02:35.711] [INFO] [VAD] Avg Time per File: 0.160s
[23:02:35.744] [INFO] [VAD] Results saved to: vad_benchmark_results.json
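The accuracy figures above are standard binary-classification metrics over speech/non-speech decisions (a probability above the --threshold counts as speech); a minimal sketch:

```python
def vad_metrics(y_true, y_pred):
    """Accuracy/precision/recall/F1 for binary speech decisions (1 = speech)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return accuracy, precision, recall, f1
```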
The offline version uses the community-1 model; the online version uses the legacy speaker-diarization-3.1 model.
For a ~1.2% worse DER, we default to a higher step ratio and minimum segment duration than the baseline community-1 pipeline. This gets us nearly 2x the speed, as expected, since we process half as many embeddings. For highly critical use cases, use stepRatio = 0.1 and minSegmentDurationSeconds = 0.0.
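The speed trade-off follows directly from the sliding-window hop: with a window of duration w and step ratio r, the hop is r*w, so halving r roughly doubles the number of embedding windows to process. A sketch (the 10s window is a hypothetical value for illustration):

```python
import math

def num_windows(audio_seconds: float, window_seconds: float, step_ratio: float) -> int:
    """Number of sliding segmentation windows with hop = step_ratio * window."""
    if audio_seconds <= window_seconds:
        return 1
    hop = step_ratio * window_seconds
    return 1 + math.ceil((audio_seconds - window_seconds) / hop)

# Ten minutes of audio with a hypothetical 10s window:
print(num_windows(600, 10, 0.2))  # 296 windows
print(num_windows(600, 10, 0.1))  # 591 windows, roughly 2x the embedding work
```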
Running on the full voxconverse benchmark:
StepRatio = 0.2, minSegmentDurationSeconds= 1.0
Average DER: 15.07% | Median DER: 10.70% | Average JER: 39.40% | Median JER: 40.95% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 122.06 (from 232 clips)
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
StepRatio = 0.1, minSegmentDurationSeconds= 0
Average DER: 13.89% | Median DER: 10.49% | Average JER: 42.84% | Median JER: 43.30% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 64.75 (from 232 clips)
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
Note that the baseline PyTorch version is ~11% DER; we lost some precision dropping down to fp16 in order to run most of the embedding model on the Neural Engine. But as a result, we also significantly outperform the baseline MPS backend: pyannote community-1 runs at ~1.5-2 RTFx on CPU and ~20-25 RTFx on MPS.
Streaming is trickier and considerably more fragile with respect to clustering; expect 10-15% worse DER from the streaming implementation. Only use it when you critically need real-time streaming speaker diarization. In most cases, offline is more than enough.

