Skip to content

Latest commit

 

History

History
98 lines (84 loc) · 3.93 KB

File metadata and controls

98 lines (84 loc) · 3.93 KB

Benchmarks

2024 MacBook Pro, 48GB Ram, M4 Pro, Tahoe 26.0

Transcription

https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml

swift run fluidaudio fleurs-benchmark --languages en_us,it_it,es_419,fr_fr,de_de,ru_ru,uk_ua --samples all
================================================================================
FLEURS BENCHMARK SUMMARY
================================================================================

Language                  | WER%   | CER%   | RTFx    | Duration | Processed | Skipped
-----------------------------------------------------------------------------------------
English (US)              | 5.7    | 2.8    | 136.7   | 3442.9s  | 350       | -
French (France)           | 5.8    | 2.4    | 136.5   | 560.8s   | 52        | 298
German (Germany)          | 3.1    | 1.2    | 152.2   | 62.1s    | 5         | -
Italian (Italy)           | 4.3    | 2.0    | 153.7   | 743.3s   | 50        | -
Russian (Russia)          | 7.7    | 2.8    | 134.1   | 621.2s   | 50        | -
Spanish (Spain)           | 6.5    | 3.0    | 152.3   | 586.9s   | 50        | -
Ukrainian (Ukraine)       | 6.5    | 1.9    | 132.5   | 528.2s   | 50        | -
-----------------------------------------------------------------------------------------
AVERAGE                   | 5.6    | 2.3    | 142.6   | 6545.5s  | 607       | 298
2620 files per dataset • Test runtime: 4m 1s • 09/04/2025, 1:55 AM EDT
--- Benchmark Results ---
   Dataset: librispeech test-clean
   Files processed: 2620
   Average WER: 2.7%
   Median WER: 0.0%
   Average CER: 1.1%
   Median RTFx: 99.3x
   Overall RTFx: 109.6x (19452.5s / 177.5s)

Voice Activity Detection

assets/vad-graph.png

Dataset: https://github.com/Lab41/VOiCES-subset

swift run fluidaudio vad-benchmark --dataset voices-subset --all-files --threshold 0.85
...
Timing Statistics:
[18:56:31.208] [INFO] [VAD]    Total processing time: 0.29s
[18:56:31.208] [INFO] [VAD]    Total audio duration: 351.05s
[18:56:31.208] [INFO] [VAD]    RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD]    Audio loading time: 0.00s (0.6%)
[18:56:31.208] [INFO] [VAD]    VAD inference time: 0.28s (98.7%)
[18:56:31.208] [INFO] [VAD]    Average per file: 0.011s
[18:56:31.208] [INFO] [VAD]    Min per file: 0.001s
[18:56:31.208] [INFO] [VAD]    Max per file: 0.020s
[18:56:31.208] [INFO] [VAD]
VAD Benchmark Results:
[18:56:31.208] [INFO] [VAD]    Accuracy: 96.0%
[18:56:31.208] [INFO] [VAD]    Precision: 100.0%
[18:56:31.208] [INFO] [VAD]    Recall: 95.8%
[18:56:31.208] [INFO] [VAD]    F1-Score: 97.9%
[18:56:31.208] [INFO] [VAD]    Total Time: 0.29s
[18:56:31.208] [INFO] [VAD]    RTFx: 1230.6x faster than real-time
[18:56:31.208] [INFO] [VAD]    Files Processed: 25
[18:56:31.208] [INFO] [VAD]    Avg Time per File: 0.011s
swift run fluidaudio vad-benchmark --dataset musan-full --num-files all --threshold 0.8
...
[23:02:35.539] [INFO] [VAD] Total processing time: 322.31s
[23:02:35.539] [INFO] [VAD] Timing Statistics:
[23:02:35.539] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.539] [INFO] [VAD] Audio loading time: 1.20s (0.4%)
[23:02:35.539] [INFO] [VAD] VAD inference time: 319.57s (99.1%)
[23:02:35.539] [INFO] [VAD] Average per file: 0.160s
[23:02:35.539] [INFO] [VAD] Total audio duration: 393442.58s
[23:02:35.539] [INFO] [VAD] Min per file: 0.000s
[23:02:35.539] [INFO] [VAD] Max per file: 0.873s
[23:02:35.711] [INFO] [VAD] VAD Benchmark Results:
[23:02:35.711] [INFO] [VAD] Accuracy: 94.2%
[23:02:35.711] [INFO] [VAD] Precision: 92.6%
[23:02:35.711] [INFO] [VAD] Recall: 78.9%
[23:02:35.711] [INFO] [VAD] F1-Score: 85.2%
[23:02:35.711] [INFO] [VAD] Total Time: 322.31s
[23:02:35.711] [INFO] [VAD] RTFx: 1220.7x faster than real-time
[23:02:35.711] [INFO] [VAD] Files Processed: 2016
[23:02:35.711] [INFO] [VAD] Avg Time per File: 0.160s
[23:02:35.744] [INFO] [VAD] Results saved to: vad_benchmark_results.json

Speaker Diarization