- Model (multilingual):
FluidInference/parakeet-tdt-0.6b-v3-coreml - Model (English-only):
FluidInference/parakeet-tdt-0.6b-v2-coreml - Languages: v3 spans 25 European languages; v2 focuses on English accuracy
- Processing Mode: Batch transcription for complete audio files
- Real-time Factor: ~120x on M4 Pro (1 minute ≈ 0.5 seconds)
- Model:
FluidInference/parakeet-realtime-eou-120m-coreml - Chunk Sizes: 160ms (lowest latency), 320ms, 1600ms (highest throughput)
- End-of-Utterance Detection: Built-in silence detection with configurable debounce
- Prefer v2 when you only need English. It reuses the fused TDT decoder from v3 but ships with a tighter vocabulary, delivering better recall on long-form English audio.
- Use v3 for multilingual coverage (25 languages). English accuracy is still strong, but the broader vocab slightly trails v2 on rare words.
- Both versions share the same API surface—set
AsrModelVersionin code or pass--model-versionin the CLI.
// Download the English-only bundle when you only need English transcripts
let models = try await AsrModels.downloadAndLoad(version: .v2)import FluidAudio
// Batch transcription from an audio file
Task {
// 1) Initialize ASR manager and load models
let models = try await AsrModels.downloadAndLoad(version: .v3) // Switch to .v2 for English-only
let asrManager = AsrManager(config: .default)
try await asrManager.configure(models: models)
// 2) Prepare 16 kHz mono samples (see: Audio Conversion)
let samples = try await loadSamples16kMono(path: "path/to/audio.wav")
// 3) Transcribe the audio
let result = try await asrManager.transcribe(samples, source: .system)
print("Transcription: \(result.text)")
print("Confidence: \(result.confidence)")
}Important: Do not parse WAV/PCM bytes by hand (e.g., slicing headers or assuming 16-bit samples). Always convert with
AudioConverterso differing bit depths, channel layouts, metadata chunks, or compressed formats (MP3/M4A/FLAC) get normalized to the 16 kHz mono Float32 tensors that Parakeet expects. Manually decoded buffers frequently contain garbage values, which shows up as empty transcripts even though the models load successfully.
If you already have an audio file on disk you can skip manual sample loading—AsrManager.transcribe(_ url:source:)
handles format conversion internally via AudioConverter.
let models = try await AsrModels.downloadAndLoad(version: .v3)
let asrManager = AsrManager()
try await asrManager.loadModels(models)
let audioURL = URL(fileURLWithPath: "/path/to/audio.wav")
let result = try await asrManager.transcribe(audioURL, source: .system)
print(result.text)Working offline? Follow the Manual Model Loading guide to stage the CoreML bundles and call AsrModels.load without triggering HuggingFace downloads.
# Transcribe an audio file (batch)
swift run fluidaudiocli transcribe audio.wav
# English-only run (better recall)
swift run fluidaudiocli transcribe audio.wav --model-version v2
# Transcribe multiple files in parallel
swift run fluidaudiocli multi-stream audio1.wav audio2.wav
# Benchmark ASR on LibriSpeech
swift run fluidaudiocli asr-benchmark --subset test-clean --max-files 50
# Run the English-only benchmark
swift run fluidaudiocli asr-benchmark --subset test-clean --max-files 50 --model-version v2
# Multilingual ASR (FLEURS) benchmark
swift run fluidaudiocli fleurs-benchmark --languages en_us,fr_fr --samples 10
# Download LibriSpeech test sets
swift run fluidaudiocli download --dataset librispeech-test-clean
swift run fluidaudiocli download --dataset librispeech-test-other# Transcribe a file (--use-cache auto-downloads models)
swift run fluidaudiocli parakeet-eou --input audio.wav --use-cache
# Run benchmark on LibriSpeech test-clean
swift run fluidaudiocli parakeet-eou --benchmark --chunk-size 160 --max-files 100 --use-cacheOptions: --input <path>, --benchmark, --max-files <n>, --chunk-size <160|320|1600>, --eou-debounce <ms>, --use-cache, --models <path>, --output <path>, --verbose