2024 MacBook Pro, 48 GB RAM, M4 Pro, macOS Tahoe 26.0
https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml
swift run fluidaudio fleurs-benchmark --languages en_us,it_it,es_419,fr_fr,de_de,ru_ru,uk_ua --samples all
================================================================================
FLEURS BENCHMARK SUMMARY
================================================================================
Language            | WER% | CER% | RTFx  | Duration | Processed | Skipped
--------------------|------|------|-------|----------|-----------|--------
English (US)        |  5.7 |  2.8 | 136.7 |  3442.9s |       350 |       -
French (France)     |  5.8 |  2.4 | 136.5 |   560.8s |        52 |     298
German (Germany)    |  3.1 |  1.2 | 152.2 |    62.1s |         5 |       -
Italian (Italy)     |  4.3 |  2.0 | 153.7 |   743.3s |        50 |       -
Russian (Russia)    |  7.7 |  2.8 | 134.1 |   621.2s |        50 |       -
Spanish (Spain)     |  6.5 |  3.0 | 152.3 |   586.9s |        50 |       -
Ukrainian (Ukraine) |  6.5 |  1.9 | 132.5 |   528.2s |        50 |       -
--------------------|------|------|-------|----------|-----------|--------
AVERAGE             |  5.6 |  2.3 | 142.6 |  6545.5s |       607 |     298
2620 files per dataset • Test runtime: 4m 1s • 09/04/2025, 1:55 AM EDT
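WER above is word error rate: the word-level Levenshtein edit distance between the hypothesis and the reference transcript, divided by the number of reference words (CER is the same computed over characters). A minimal sketch of the metric, not the benchmark's actual scoring code:

```swift
// Word error rate: Levenshtein distance over words / reference word count.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    guard !hyp.isEmpty else { return 1 } // all reference words deleted

    // Rolling DP rows: edit distance between prefixes of ref and hyp.
    var prev = Array(0...hyp.count)
    for i in 1...ref.count {
        var cur = [i] + Array(repeating: 0, count: hyp.count)
        for j in 1...hyp.count {
            let subCost = ref[i - 1] == hyp[j - 1] ? 0 : 1
            cur[j] = min(prev[j] + 1,            // deletion
                         cur[j - 1] + 1,         // insertion
                         prev[j - 1] + subCost)  // substitution
        }
        prev = cur
    }
    return Double(prev[hyp.count]) / Double(ref.count)
}
```

For example, one substituted word in a four-word reference yields a WER of 0.25; real scorers typically also apply text normalization (punctuation, numerals) before comparing.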
--- Benchmark Results ---
Dataset: librispeech test-clean
Files processed: 2620
Average WER: 2.7%
Median WER: 0.0%
Average CER: 1.1%
Median RTFx: 99.3x
Overall RTFx: 109.6x (19452.5s / 177.5s)
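RTFx is the real-time factor: seconds of audio processed per second of wall-clock compute, so higher is faster. The overall figure above is reproducible from the two logged totals:

```swift
import Foundation

// RTFx = total audio duration / total processing time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

let overall = rtfx(audioSeconds: 19452.5, processingSeconds: 177.5)
print(String(format: "%.1fx", overall)) // prints "109.6x", matching the summary line
```

Note that the overall RTFx (ratio of totals) and the median per-file RTFx differ because per-file values are skewed by short clips.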
The model is nearly identical to the base model in quality; performance-wise we see up to a ~3.5x improvement over the Silero PyTorch VAD model when using the 256 ms batch model (8 chunks of 32 ms).
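Assuming the model's usual 16 kHz input rate (an assumption; check the model card), the 256 ms batch configuration works out to the following chunk and batch sizes:

```swift
// Chunk/batch sizing for the 256 ms batched VAD configuration (8 x 32 ms).
let sampleRate = 16_000.0          // Hz (assumed; typical for VAD models)
let chunkMs = 32.0                 // one chunk, per the note above
let chunksPerBatch = 8

let samplesPerChunk = Int(sampleRate * chunkMs / 1000.0)  // 512 samples
let samplesPerBatch = samplesPerChunk * chunksPerBatch    // 4096 samples
let batchMs = Double(chunksPerBatch) * chunkMs            // 256 ms per batch
```

Batching 8 chunks amortizes Core ML dispatch overhead across the batch, which is where much of the speedup over per-chunk PyTorch inference comes from.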
Dataset: https://github.com/Lab41/VOiCES-subset
swift run fluidaudio vad-benchmark --dataset voices-subset --all-files --threshold 0.85
...
Timing Statistics:
[18:56:31.208] [INFO] [VAD] Total processing time: 0.29s
[18:56:31.208] [INFO] [VAD] Total audio duration: 351.05s
[18:56:31.208] [INFO] [VAD] RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD] Audio loading time: 0.00s (0.6%)
[18:56:31.208] [INFO] [VAD] VAD inference time: 0.28s (98.7%)
[18:56:31.208] [INFO] [VAD] Average per file: 0.011s
[18:56:31.208] [INFO] [VAD] Min per file: 0.001s
[18:56:31.208] [INFO] [VAD] Max per file: 0.020s
[18:56:31.208] [INFO] [VAD]
VAD Benchmark Results:
[18:56:31.208] [INFO] [VAD] Accuracy: 96.0%
[18:56:31.208] [INFO] [VAD] Precision: 100.0%
[18:56:31.208] [INFO] [VAD] Recall: 95.8%
[18:56:31.208] [INFO] [VAD] F1-Score: 97.9%
[18:56:31.208] [INFO] [VAD] Total Time: 0.29s
[18:56:31.208] [INFO] [VAD] RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD] Files Processed: 25
[18:56:31.208] [INFO] [VAD] Avg Time per File: 0.011s
swift run fluidaudio vad-benchmark --dataset musan-full --num-files all --threshold 0.8
...
[23:02:35.539] [INFO] [VAD] Timing Statistics:
[23:02:35.539] [INFO] [VAD] Total processing time: 322.31s
[23:02:35.539] [INFO] [VAD] Total audio duration: 393442.58s
[23:02:35.539] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.539] [INFO] [VAD] Audio loading time: 1.20s (0.4%)
[23:02:35.539] [INFO] [VAD] VAD inference time: 319.57s (99.1%)
[23:02:35.539] [INFO] [VAD] Average per file: 0.160s
[23:02:35.539] [INFO] [VAD] Min per file: 0.000s
[23:02:35.539] [INFO] [VAD] Max per file: 0.873s
[23:02:35.711] [INFO] [VAD] VAD Benchmark Results:
[23:02:35.711] [INFO] [VAD] Accuracy: 94.2%
[23:02:35.711] [INFO] [VAD] Precision: 92.6%
[23:02:35.711] [INFO] [VAD] Recall: 78.9%
[23:02:35.711] [INFO] [VAD] F1-Score: 85.2%
[23:02:35.711] [INFO] [VAD] Total Time: 322.31s
[23:02:35.711] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.711] [INFO] [VAD] Files Processed: 2016
[23:02:35.711] [INFO] [VAD] Avg Time per File: 0.160s
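The accuracy/precision/recall/F1 figures follow the standard confusion-matrix definitions; e.g. the MUSAN F1 of 85.2% is consistent with the logged precision and recall:

```swift
// F1 is the harmonic mean of precision and recall.
func f1Score(precision: Double, recall: Double) -> Double {
    2 * precision * recall / (precision + recall)
}

let f1 = f1Score(precision: 92.6, recall: 78.9) // ~85.2, matching the log
```

The lower recall on MUSAN relative to VOiCES is expected at a high threshold (0.8): music and noise segments adjacent to speech are more likely to suppress borderline speech frames.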
[23:02:35.744] [INFO] [VAD] Results saved to: vad_benchmark_results.json

