Rust bindings for FluidAudio - a Swift library for ASR, VAD, Speaker Diarization, and TTS on Apple platforms.
- ASR (Automatic Speech Recognition)
  - Parakeet TDT - High-quality speech-to-text for 25 European languages
    - Regular ASR for maximum accuracy
    - Streaming ASR for 99.5% less memory usage
    - Real-time sample transcription
  - Qwen3-ASR - Multilingual transcription with excellent Japanese, Chinese, Vietnamese support
    - 30+ languages including East Asian, Southeast Asian, European, and more
    - Both one-shot and streaming modes available
    - Language hint support or automatic detection
- VAD (Voice Activity Detection) - Detect speech segments in audio
- Speaker Diarization - Identify and label different speakers in audio
- macOS 14+ or iOS 17+
- Apple Silicon (M1/M2/M3) recommended
- Rust 1.70+
- Swift 5.10+
Add to your Cargo.toml:
```toml
[dependencies]
fluidaudio-rs = "0.1"
```

```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;

    // Check system info
    let info = audio.system_info();
    println!("Running on: {} ({})", info.chip_name, info.platform);
    println!("Apple Silicon: {}", audio.is_apple_silicon());

    // Initialize ASR (downloads models on first run)
    audio.init_asr()?;

    // Transcribe an audio file
    let result = audio.transcribe_file("audio.wav")?;
    println!("Text: {}", result.text);
    println!("Confidence: {:.2}%", result.confidence * 100.0);
    println!("Processing speed: {:.1}x realtime", result.rtfx);

    Ok(())
}
```

For real-time audio applications, you can transcribe raw audio samples directly without file I/O:
```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;
    audio.init_asr()?;

    // Audio samples from microphone or streaming source
    // (16kHz mono, normalized to -1.0 to 1.0)
    let samples: Vec<f32> = capture_audio_from_mic();

    // Transcribe samples directly
    let result = audio.transcribe_samples(&samples)?;
    println!("Text: {}", result.text);

    Ok(())
}
```

This is ideal for meeting transcription apps, voice assistants, and other real-time scenarios where writing to temporary files adds unnecessary overhead.
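The `capture_audio_from_mic()` call above is a placeholder for whatever audio source you have. As a stand-in for testing, here is a minimal sketch that loads normalized samples from a 16 kHz mono WAV file; it assumes the `hound` crate as an extra dependency, and `load_wav_samples` is a hypothetical helper, not part of this crate:

```rust
use fluidaudio_rs::FluidAudio;

// Hypothetical helper: read a 16 kHz mono, 16-bit WAV file and
// normalize its samples to the -1.0..1.0 range FluidAudio expects.
fn load_wav_samples(path: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    let mut reader = hound::WavReader::open(path)?;
    let spec = reader.spec();
    // Resample/downmix beforehand if your audio doesn't match.
    assert_eq!(spec.sample_rate, 16_000, "expected 16 kHz audio");
    assert_eq!(spec.channels, 1, "expected mono audio");
    let samples: Result<Vec<f32>, hound::Error> = reader
        .samples::<i16>()
        .map(|s| s.map(|v| v as f32 / i16::MAX as f32))
        .collect();
    Ok(samples?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;
    audio.init_asr()?;
    let samples = load_wav_samples("audio.wav")?;
    let result = audio.transcribe_samples(&samples)?;
    println!("Text: {}", result.text);
    Ok(())
}
```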
Qwen3-ASR provides excellent multilingual support, especially for East Asian and Southeast Asian languages. Perfect for applications that need Japanese, Chinese, Vietnamese, or Korean transcription.
- East Asian: Japanese, Chinese (Mandarin), Cantonese, Korean
- Southeast Asian: Vietnamese, Indonesian, Malay, Thai, Filipino
- European: English, French, German, Spanish, Portuguese, Italian, Dutch, Swedish, Danish, Finnish, Polish, Czech, Greek, Hungarian, Romanian
- Other: Russian, Arabic, Hindi, Turkish, Persian, Macedonian
```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;

    // Initialize Qwen3-ASR
    audio.init_qwen3_asr()?;

    // Transcribe Japanese audio with a language hint
    let result = audio.qwen3_transcribe_file("japanese_meeting.wav", Some("ja"))?;
    println!("Japanese: {}", result.text);

    // Transcribe with automatic language detection
    let result = audio.qwen3_transcribe_file("multilingual_audio.wav", None)?;
    println!("Auto-detected: {}", result.text);

    Ok(())
}
```

```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;
    audio.init_qwen3_asr()?;

    // Audio samples from microphone (Japanese speaker)
    let samples: Vec<f32> = capture_japanese_audio();

    // Transcribe with a language hint
    let result = audio.qwen3_transcribe_samples(&samples, Some("Japanese"))?;
    println!("日本語: {}", result.text);

    Ok(())
}
```

For real-time meeting translation or live transcription:
```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;
    audio.init_qwen3_streaming()?;

    // Start a streaming session for Japanese
    // Parameters: language, min_audio_secs, chunk_secs, max_audio_secs
    audio.qwen3_streaming_start(Some("ja"), 1.0, 2.0, 30.0)?;

    // Feed audio chunks as they arrive from the microphone
    loop {
        let chunk: Vec<f32> = capture_audio_chunk(); // From mic

        // Get a partial transcription if one is ready
        if let Some(partial_text) = audio.qwen3_streaming_feed(&chunk)? {
            println!("Partial (JP): {}", partial_text);
            // Translate partial_text to English here...
        }

        if meeting_ended { // placeholder for your session-end condition
            break;
        }
    }

    // Get the final complete transcription
    let final_text = audio.qwen3_streaming_finish()?;
    println!("Final (JP): {}", final_text);

    Ok(())
}
```

Language Codes: Use ISO 639-1 codes ("ja", "zh", "vi", "ko") or English names ("Japanese", "Chinese", "Vietnamese", "Korean").
System Requirements: Qwen3-ASR requires macOS 15+ or iOS 18+ (uses Apple's latest CoreML stateful models).
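Because Qwen3-ASR needs a newer OS baseline than the rest of the crate, a downstream app may want to fall back to the regular Parakeet engine when Qwen3 initialization fails. Here is a minimal sketch using only the calls shown above; the fallback policy itself is an application-level choice, not built-in crate behavior:

```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;

    // Prefer Qwen3-ASR; fall back to Parakeet on systems that
    // don't meet the macOS 15+/iOS 18+ requirement.
    let use_qwen3 = audio.init_qwen3_asr().is_ok();
    if !use_qwen3 {
        eprintln!("Qwen3-ASR unavailable; falling back to Parakeet");
        audio.init_asr()?;
    }

    if use_qwen3 {
        let result = audio.qwen3_transcribe_file("audio.wav", None)?;
        println!("Text: {}", result.text);
    } else {
        let result = audio.transcribe_file("audio.wav")?;
        println!("Text: {}", result.text);
    }
    Ok(())
}
```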
Streaming ASR uses 99.5% less memory than regular ASR by processing audio in chunks rather than loading entire files. Perfect for long recordings or resource-constrained environments.
The easiest way to use streaming ASR - just like regular transcription but memory-efficient:
```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;
    audio.init_streaming_asr()?;

    // Process large files with minimal memory
    let result = audio.transcribe_file_streaming("long_meeting.wav")?;
    println!("Text: {}", result.text);
    println!("Speed: {:.1}x realtime", result.rtfx);

    Ok(())
}
```

For real-time streaming with full control:
```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;
    audio.init_streaming_asr()?;

    // Start a streaming session
    audio.streaming_asr_start()?;

    // Feed audio chunks as they arrive
    loop {
        let samples: Vec<f32> = capture_audio_chunk(); // From mic, network, etc.
        audio.streaming_asr_feed(&samples)?;

        if done { // placeholder for your end-of-stream condition
            break;
        }
    }

    // Get the final transcription
    let text = audio.streaming_asr_finish()?;
    println!("Transcription: {}", text);

    Ok(())
}
```

When to use Streaming ASR:
- Long audio files (> 5 minutes)
- Real-time transcription with live audio feed
- Memory-constrained environments
- Continuous streaming scenarios
When to use Regular ASR:
- Short audio clips (< 5 minutes)
- When you need maximum accuracy on complete audio
- Batch processing where memory isn't a concern
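These guidelines are easy to encode. The sketch below picks an engine from the audio duration, using the 5-minute cutoff from the lists above; the `hound`-based duration probe (WAV-only) and the `transcribe_auto` helper are illustrative assumptions, not crate API:

```rust
use fluidaudio_rs::FluidAudio;

// Hypothetical helper: streaming ASR for long files, regular ASR
// for short clips, following the guidelines above.
fn transcribe_auto(audio: &FluidAudio, path: &str) -> Result<String, Box<dyn std::error::Error>> {
    // Probe the duration with the hound crate (assumed dependency).
    let reader = hound::WavReader::open(path)?;
    let secs = reader.duration() as f64 / reader.spec().sample_rate as f64;

    let text = if secs > 300.0 {
        // Long recording: chunked processing, minimal memory.
        audio.init_streaming_asr()?;
        audio.transcribe_file_streaming(path)?.text
    } else {
        // Short clip: maximum accuracy on the complete audio.
        audio.init_asr()?;
        audio.transcribe_file(path)?.text
    };
    Ok(text)
}
```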
```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;

    // Initialize VAD with a threshold (0.0-1.0)
    audio.init_vad(0.85)?;
    println!("VAD available: {}", audio.is_vad_available());

    Ok(())
}
```

```rust
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;

    // Initialize diarization with a clustering threshold (0.0-1.0).
    // Lower = more speakers, higher = fewer speakers.
    audio.init_diarization(0.6)?;

    // Diarize an audio file
    let segments = audio.diarize_file("meeting.wav")?;
    for seg in &segments {
        println!(
            "[{:.2}s - {:.2}s] {}",
            seg.start_time, seg.end_time, seg.speaker_id
        );
    }

    Ok(())
}
```

First initialization downloads and compiles ML models (~500MB total). This can take 20-30 seconds as Apple's Neural Engine compiles the models. Subsequent loads use cached compilations (~1 second).
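You can observe the cached-compilation speedup by timing initialization with nothing more than `std::time`; run this twice and compare:

```rust
use std::time::Instant;
use fluidaudio_rs::FluidAudio;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let audio = FluidAudio::new()?;

    // First run: model download + Neural Engine compilation (~20-30s).
    // Subsequent runs: cached compilation (~1s).
    let start = Instant::now();
    audio.init_asr()?;
    println!("ASR ready in {:.1}s", start.elapsed().as_secs_f64());
    Ok(())
}
```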
| Platform | Status |
|---|---|
| macOS (Apple Silicon) | Full support |
| macOS (Intel) | Limited (no ASR) |
| iOS | Full support |
| Linux/Windows | Not supported |
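Since the crate only works on Apple platforms, projects that also target Linux or Windows can gate it at compile time. This is plain Rust `cfg` machinery, not crate-specific; one possible arrangement:

```rust
// In Cargo.toml, make the dependency platform-specific:
// [target.'cfg(target_os = "macos")'.dependencies]
// fluidaudio-rs = "0.1"

#[cfg(target_os = "macos")]
fn transcribe(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let audio = fluidaudio_rs::FluidAudio::new()?;
    audio.init_asr()?;
    Ok(audio.transcribe_file(path)?.text)
}

#[cfg(not(target_os = "macos"))]
fn transcribe(_path: &str) -> Result<String, Box<dyn std::error::Error>> {
    Err("transcription requires an Apple platform".into())
}
```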
This crate uses a C FFI bridge to communicate between Rust and Swift:
- The Swift layer (`FluidAudioBridge`) wraps the FluidAudio library
- C-compatible functions are exported using `@_cdecl`
- Rust calls these functions through `extern "C"` declarations
- The `build.rs` script compiles the Swift package and links it
License: MIT