Real-time speaker diarization for iOS and macOS, answering "who spoke when" in audio streams.
Pick the diarizer based on the workflow:
LS-EEND handles noisy environments and overlapping speakers well and supports up to 10 speakers. It is also much more lightweight than Sortformer, running entirely on an M4 Max CPU at a speed comparable to Sortformer on the ANE. However, it is more prone to false alarms and usually less stable than Sortformer, unless many speakers are talking simultaneously. Streaming updates occur every 100ms with about 900ms of tentative predictions.
Sortformer handles noisy environments and overlapping speakers well, but is limited to 4 speakers. Speaker identities are extremely stable, but it struggles when many loud voices are present. It also misses quiet speech as it's trained to ignore background conversations. The most common source of error is missed speech. Streaming updates occur every 480ms, with 560ms of tentative predictions.
DiarizerManager is the legacy online diarizer. It struggles with background noise, background conversations, overlapping speech with more than 2 speakers, short utterances, and similar-sounding speakers. The most common source of error is incorrect labeling. It performs poorly for low-latency streaming and requires external timestamp alignment between chunks when operating on a sliding window. It is also the most computationally heavy online diarizer.
Offline VBx pipeline is the best offline-quality option when you want a full batch pipeline with segmentation, embedding extraction, and clustering over a complete file.
| Environment | Sortformer | DiarizerManager | LS-EEND |
|---|---|---|---|
| Clean/silent room | Best | Good | Best |
| Background noise | Best | Poor | Good |
| Speech from another room | Poor | Good | Good |
| Whispers | Poor | Good | Best |
| High overlap | Good | Poor | Best |
| Max speakers | 4 | No max | 10 |
| Benchmarks | Good | Poor | Best |
| Remembering speakers across meetings | Great | Best | Good |
import FluidAudio
// 1. Download models (one-time setup)
let models = try await DiarizerModels.downloadIfNeeded()
// 2. Initialize with default config
let diarizer = DiarizerManager()
diarizer.initialize(models: models)
// 3. Normalize any audio file to 16kHz mono Float32 using AudioConverter
let converter = AudioConverter()
let url = URL(fileURLWithPath: "path/to/audio.wav")
let audioSamples = try converter.resampleAudioFile(url)
// 4. Run diarization (accepts any RandomAccessCollection<Float>)
let result = try diarizer.performCompleteDiarization(audioSamples)
// Alternative: Use ArraySlice for zero-copy processing
let audioSlice = audioSamples[1000..<5000] // No memory copy!
let sliceResult = try diarizer.performCompleteDiarization(audioSlice)
// 5. Get results
for segment in result.segments {
print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}

The diarizer module mirrors the three-stage pipeline and its offline counterpart. Files live under Sources/FluidAudio/Diarizer/:
Core/
├── DiarizerManager.swift # Real-time orchestrator and chunk scheduler
├── DiarizerTypes.swift # Public models/configs shared across stages
└── DiarizerModels.swift # Core ML bundle management
Segmentation/
├── SegmentationProcessor.swift # VAD + powerset segmentation inference
├── SlidingWindow.swift # Frame windowing helpers
└── AudioValidation.swift # Streaming quality/embedding validation
Extraction/
└── EmbeddingExtractor.swift # WeSpeaker embedding inference
Clustering/
├── SpeakerManager.swift # Active speaker tracking and assignment
├── SpeakerTypes.swift # Speaker/raw embedding representations
└── SpeakerOperations.swift # Distance/scoring utilities
Offline/
├── Core/
│ ├── OfflineDiarizerManager.swift
│ ├── OfflineDiarizerTypes.swift
│ └── OfflineDiarizerModels.swift
├── Segmentation/
│ └── OfflineSegmentationProcessor.swift
├── Extraction/
│ ├── OfflineEmbeddingExtractor.swift
│ ├── PLDATransform.swift
│ └── WeightInterpolation.swift
├── Clustering/
│ ├── AHCClustering.swift
│ └── VBxClustering.swift
└── Utils/
├── OfflineReconstruction.swift
└── VDSPOperations.swift
Use this layout as the reference when adding new diarization capabilities so orchestration, segmentation, embedding extraction, and clustering stay isolated.
If you deploy in an offline environment, stage the Core ML bundles manually and skip the automatic HuggingFace downloader.
Download the two .mlmodelc folders from FluidInference/speaker-diarization-coreml:
- pyannote_segmentation.mlmodelc
- wespeaker_v2.mlmodelc
Keep the folder names unchanged so the loader can find coremldata.bin inside each bundle.
Place the staged repo in persistent storage (replace /opt/models with your path):
/opt/models
└── speaker-diarization-coreml
├── pyannote_segmentation.mlmodelc
│ ├── coremldata.bin
│ └── ...
└── wespeaker_v2.mlmodelc
├── coremldata.bin
└── ...
You can clone with Git LFS, download the .tar archives from HuggingFace, or copy the directory from a machine that already ran DiarizerModels.downloadIfNeeded() (macOS cache: ~/Library/Application Support/FluidAudio/Models/speaker-diarization-coreml).
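For example, the Git LFS route might look like the following (the huggingface.co URL assumes the standard repo pattern for FluidInference/speaker-diarization-coreml; adjust the destination path to your environment):

```shell
# Fetch the staged Core ML bundles once, into persistent storage.
git lfs install
git clone https://huggingface.co/FluidInference/speaker-diarization-coreml \
    /opt/models/speaker-diarization-coreml

# Confirm both bundles contain coremldata.bin before pointing the loader at them.
ls /opt/models/speaker-diarization-coreml/pyannote_segmentation.mlmodelc/coremldata.bin
ls /opt/models/speaker-diarization-coreml/wespeaker_v2.mlmodelc/coremldata.bin
```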
Point DiarizerModels.load at the staged bundles and initialize the manager with the returned models:
import FluidAudio
@main
struct OfflineDiarizer {
static func main() async {
do {
let basePath = "/opt/models/speaker-diarization-coreml"
let segmentation = URL(fileURLWithPath: basePath).appendingPathComponent("pyannote_segmentation.mlmodelc")
let embedding = URL(fileURLWithPath: basePath).appendingPathComponent("wespeaker_v2.mlmodelc")
let models = try await DiarizerModels.load(
localSegmentationModel: segmentation,
localEmbeddingModel: embedding
)
let diarizer = DiarizerManager()
diarizer.initialize(models: models)
// ... run diarization as usual
} catch {
print("Failed to load diarizer models: \(error)")
}
}
}

Use FileManager to verify the two .mlmodelc folders exist before loading. When paths are correct, the loader never contacts the network and no auto-download occurs.
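As a pre-flight check, a small helper along these lines can confirm the staged bundles are complete before calling DiarizerModels.load (verifyStagedModels is a hypothetical helper sketched here, not part of FluidAudio):

```swift
import Foundation

/// Sketch: return true only if both staged bundles exist and each
/// contains the coremldata.bin the loader expects.
func verifyStagedModels(basePath: String) -> Bool {
    let required = ["pyannote_segmentation.mlmodelc", "wespeaker_v2.mlmodelc"]
    let fm = FileManager.default
    for name in required {
        let bundle = URL(fileURLWithPath: basePath).appendingPathComponent(name)
        var isDir: ObjCBool = false
        // The bundle must be a directory and contain coremldata.bin.
        guard fm.fileExists(atPath: bundle.path, isDirectory: &isDir), isDir.boolValue,
              fm.fileExists(atPath: bundle.appendingPathComponent("coremldata.bin").path)
        else { return false }
    }
    return true
}
```

Failing fast here turns a typo in the staging path into a clear error instead of a confusing load failure.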
For fine-tuning, you can customize the configuration:
let config = DiarizerConfig(
clusteringThreshold: 0.7, // Speaker separation sensitivity (0.5-0.9)
minSpeechDuration: 1.0, // Minimum speech segment (seconds)
minSilenceGap: 0.5 // Minimum gap between speakers (seconds)
)
let diarizer = DiarizerManager(config: config)

Requires macOS 14 / iOS 17 or later. The offline stack uses native C++ clustering and AsyncStream coordination that are unavailable on older OS releases.
When you need full parity with the pyannote/Core ML exporter (powerset segmentation + VBx clustering), use OfflineDiarizerManager. It orchestrates segmentation, soft mask interpolation, WeSpeaker embedding extraction, PLDA/VBx clustering, and timeline reconstruction in one place:
import FluidAudio
let config = OfflineDiarizerConfig()
let manager = OfflineDiarizerManager(config: config)
try await manager.prepareModels() // Downloads + compiles Core ML bundles when missing
let samples = try AudioConverter().resampleAudioFile(path: "meeting.wav")
let result = try await manager.process(audio: samples)
for segment in result.segments {
print("\(segment.speakerId) → \(segment.startTimeSeconds)s – \(segment.endTimeSeconds)s")
}

For file-based processing, use the memory-mapped streaming API, which automatically handles large audio files efficiently:
let url = URL(fileURLWithPath: "meeting.wav")
let result = try await manager.process(url)

The file-based API internally uses memory-mapped streaming to avoid materializing the entire buffer in memory.
The offline controller mirrors the reference pipeline:
- Segmentation: `SegmentationRunner` feeds 10 s / 160 k-sample chunks through the Core ML segmentation model. Each chunk yields 589 frame-level log probabilities over the 7 local powerset classes.
- Binarization: `Binarization.logProbsToWeights` converts log probabilities to soft VAD weights; binary masks are still available for diagnostics.
- Weight interpolation: `WeightInterpolation` applies the same half-pixel mapping as `scipy.ndimage.zoom`, preserving the Core ML exporter's alignment when resampling 589-frame masks to the embedding model's pooling rate.
- Embedding extraction: `EmbeddingRunner` batches audio + resampled weights and returns L2-normalized 256-d embeddings.
- VBx clustering: `VBxClustering` (with `AHCClustering` warm start and `PLDAScoring`) runs the full VBx refinement loop using the JSON parameters exported with the model bundle.
- Timeline reconstruction: `TimelineReconstruction` derives frame duration from the actual segmentation output and `OfflineDiarizerConfig.windowDuration`, ensuring timestamps stay correct if you swap in models with different hop sizes.
OfflineDiarizerConfig groups knobs by pipeline stage:
- `segmentation`: window length (default 10 s), step ratio, min on/off durations, and sample rate. These must align with the exported Core ML segmentation model.
- `embedding`: batch size and overlap handling. Keep `excludeOverlap` enabled for community-1 style powerset outputs.
- `clustering`: the VBx warm-start threshold plus pyannote's Fa/Fb priors.
- `vbx`: max iterations and convergence tolerance for the refinement loop.
- `postProcessing`: minimum gap duration when stitching segments back together.
- `export`: optional `embeddingsPath` for dumping per-speaker vectors to JSON.
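As an illustration, stage-level tuning might look like the sketch below. The exact property names (`clustering.threshold`, `vbx.maxIterations`, `export.embeddingsPath`) are assumptions based on the stage groups above; check `OfflineDiarizerConfig` in your FluidAudio version for the real fields.

```swift
import Foundation
import FluidAudio

var config = OfflineDiarizerConfig()
// Hypothetical field names; consult OfflineDiarizerConfig for the actual API.
config.clustering.threshold = 0.6                 // looser VBx warm-start threshold
config.vbx.maxIterations = 20                     // cap the refinement loop
config.export.embeddingsPath = URL(fileURLWithPath: "/tmp/embeddings.json")

let manager = OfflineDiarizerManager(config: config)
```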
prepareModels captures Core ML compilation timings (and download durations when a fresh fetch is needed), so DiarizationResult.timings reflects audio loading, segmentation, embedding, clustering, and post-processing costs in one place. Per-speaker embeddings are exposed in speakerDatabase for downstream analytics without toggling debug flags.
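For example, after a run you might surface both in one place (`segments`, `timings`, and `speakerDatabase` are the documented DiarizationResult fields; the print formatting is illustrative):

```swift
let result = try await manager.process(audio: samples)

// Per-stage costs: audio loading, segmentation, embedding, clustering, post-processing.
if let timings = result.timings {
    print("Pipeline timings: \(timings)")
}

// Per-speaker embeddings, available without enabling debug flags.
if let database = result.speakerDatabase {
    for (speakerId, embedding) in database {
        print("\(speakerId): \(embedding.count)-dimensional embedding")
    }
}
```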
The CLI exposes the same controller via fluidaudio process and the diarization benchmark tooling:
swift run fluidaudiocli process meeting.wav --mode offline --threshold 0.6 --debug --chunk-seconds 10.0 --overlap-seconds 0.0
swift run fluidaudiocli diarization-benchmark --mode offline --dataset ami-sdm --threshold 0.6 --auto-download

Add --rttm path/to/meeting.rttm when you have ground-truth annotations to emit DER/JER directly on the console, or --export-embeddings embeddings.json to inspect cluster assignments. The GitHub Actions workflow offline-pipeline.yml executes the single-file AMI benchmark on every PR, keeping downloads, PLDA transforms, and VBx clustering guard-railed.
Both commands reuse the shared model cache (OfflineDiarizerModels.defaultModelsDirectory()) and emit JSON payloads compatible with the streaming pipeline.
For use cases requiring fine-grained control over memory management or audio loading, you can manually construct the audio source using StreamingAudioSourceFactory:
import FluidAudio
let config = OfflineDiarizerConfig()
let manager = OfflineDiarizerManager(config: config)
try await manager.prepareModels()
let factory = StreamingAudioSourceFactory()
let (source, loadDuration) = try factory.makeDiskBackedSource(
from: URL(fileURLWithPath: "meeting.wav"),
targetSampleRate: config.segmentation.sampleRate
)
defer { source.cleanup() }
let result = try await manager.process(
audioSource: source,
audioLoadingSeconds: loadDuration
)

This approach is useful when you need to:
- Process the same file multiple times without reloading
- Measure audio loading time separately from diarization time
- Implement custom cleanup or caching logic
For most use cases, the simpler manager.process(url) API is recommended.
For online diarization, prefer LS-EEND or Sortformer first, and use WeSpeaker/Pyannote only when you specifically need its speaker-database behavior.
Use LSEENDDiarizer when you want a live diarizer with low latency and support for up to 10 speakers. It is very eager to detect speech and performs well even when many simultaneous speakers are present.
import FluidAudio
let diarizer = LSEENDDiarizer()
try await diarizer.initialize(variant: .dihard3)
try diarizer.addAudio(audioChunk, sourceSampleRate: 16_000)
if let update = try diarizer.process() {
for segment in update.finalizedSegments {
print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}
}

Notes:
- Supports streaming and complete-buffer inference with the same API.
- Returns `DiarizerTimelineUpdate` values with finalized and tentative segments.
- Use `enrollSpeaker(withSamples:named:overwritingAssignedSpeakerName:)` to pre-enroll speakers before streaming. This keeps the warmed LS-EEND session state while resetting the visible timeline.
- See LS-EEND.md for the full API and call flow.
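A minimal enrollment sketch, assuming the `nil` third argument is accepted when no existing assignment is being overwritten and using a placeholder `loadSamples(_:)` helper for audio loading (see LS-EEND.md for the authoritative call flow):

```swift
import FluidAudio

let diarizer = LSEENDDiarizer()
try await diarizer.initialize(variant: .dihard3)

// Pre-enroll a known speaker from a short reference clip before streaming.
// loadSamples(_:) is a stand-in for your own 16 kHz mono Float32 loader.
let aliceSamples: [Float] = loadSamples("alice_intro.wav")
try diarizer.enrollSpeaker(
    withSamples: aliceSamples,
    named: "Alice",
    overwritingAssignedSpeakerName: nil
)
// Streaming that follows can then label matching speech as "Alice".
```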
Use SortformerDiarizer when you want robust low-latency streaming with stronger speaker identity stability, and the 4-speaker limit is acceptable. It attempts to ignore background speakers, so it may miss quiet speech. It also sometimes gets overwhelmed when too many speakers are talking simultaneously.
import FluidAudio
let diarizer = SortformerDiarizer()
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)
try diarizer.addAudio(audioChunk, sourceSampleRate: 16_000)
if let update = try diarizer.process() {
for segment in update.finalizedSegments {
print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}
}

Notes:
- Best when speaker ordering and identity stability matter more than maximum speaker count.
- Limited to 4 speakers and can miss quiet speech more often than LS-EEND.
- Use `enrollSpeaker(withAudio:named:overwritingAssignedSpeakerName:)` to pre-enroll speakers before streaming. This preserves the speaker cache/FIFO state and clears the visible timeline.
- Also returns `DiarizerTimelineUpdate`/`DiarizerTimeline`, so the integration shape matches LS-EEND.
- See Sortformer.md for model loading, tuning, and behavior details.
Use DiarizerManager when you need the classic segmentation + embedding + speaker-database pipeline. This is the slowest streaming option and works best with larger chunks.
Process audio in chunks for real-time applications:
// Configure for streaming
let diarizer = DiarizerManager() // Default config works well
diarizer.initialize(models: models)
var stream = AudioStream(
chunkDuration: 5.0, // 10.0 for best accuracy, 3.0 for lowest latency
chunkSkip: 2.0, // Duration between successive chunk starts
streamStartTime: 0.0, // Audio stream start time
chunkingStrategy: .useMostRecent // Most recent
)
stream.bind { chunk, time in
let results = try diarizer.performCompleteDiarization(chunk, atTime: time)
for segment in results.segments {
handleSpeakerSegment(segment)
}
}
for audioSamples in audioStream {
try stream.write(from: audioSamples)
}

Notes:
- Keep one `DiarizerManager` instance per stream so `SpeakerManager` maintains ID consistency.
- Always rebase per-chunk timestamps by `(chunkStartSample / sampleRate)`.
- Provide 16 kHz mono Float32 samples; pad the final chunk to the model window.
- Tune `speakerThreshold` and `embeddingThreshold` to trade off ID stability vs. sensitivity.
- Prefer chunk sizes of at least 5 seconds; 10 seconds is usually better.
Speaker Enrollment: The Speaker class includes a name field for enrollment workflows. When users introduce themselves ("My name is Alice"), update the speaker's name from the default (e.g. "Speaker_1") to enable personalized identification.
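A minimal sketch of that rename step, assuming `Speaker` is a reference type with a settable `name` (if it is a value type in your version, re-register the updated copy through `SpeakerManager` instead):

```swift
// After detecting "My name is Alice" in Speaker_1's speech,
// promote the default label to the user's name.
if let speaker = diarizer.speakerManager.getSpeaker(for: "Speaker_1") {
    speaker.name = "Alice"  // later segments can be displayed as "Alice"
}
```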
The performCompleteDiarization function accepts audio of any length, but accuracy varies:
- < 3 seconds: May fail or produce unreliable results
- 3-5 seconds: Minimum viable chunk, reduced accuracy
- 10 seconds: Optimal balance of accuracy and latency (recommended)
- > 10 seconds: Good accuracy but higher latency
- Maximum: Limited only by memory
You can adjust chunk size based on your needs:
- Low latency: Use 3-5 second chunks (accept lower accuracy)
- High accuracy: Use 10+ second chunks (accept higher latency)
The diarizer doesn't automatically chunk audio - you need to:
- Accumulate incoming audio samples to your desired chunk size
- Process chunks with `performCompleteDiarization`
- Maintain speaker IDs across chunks using `SpeakerManager`
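The three steps above could be wired together as in this sketch (it assumes `diarizer` is an initialized `DiarizerManager` from earlier; the buffering helper itself is illustrative, while the 10 s window and timestamp rebasing follow the guidance in this document):

```swift
let sampleRate = 16_000
let chunkSize = sampleRate * 10          // 10 s windows (recommended)
var buffer: [Float] = []
var samplesConsumed = 0

func feed(_ incoming: [Float]) throws {
    // Step 1: accumulate incoming samples until a full chunk is available.
    buffer.append(contentsOf: incoming)
    while buffer.count >= chunkSize {
        let chunkStartTime = Double(samplesConsumed) / Double(sampleRate)
        // Step 2: process a zero-copy slice of the buffer.
        let result = try diarizer.performCompleteDiarization(buffer[0..<chunkSize])
        buffer.removeFirst(chunkSize)
        samplesConsumed += chunkSize
        // Step 3: SpeakerManager inside the single DiarizerManager instance
        // keeps speaker IDs consistent across chunks; rebase chunk-relative
        // timestamps to stream time before reporting them.
        for segment in result.segments {
            let start = chunkStartTime + Double(segment.startTimeSeconds)
            let end = chunkStartTime + Double(segment.endTimeSeconds)
            print("Speaker \(segment.speakerId): \(start)s - \(end)s")
        }
    }
}
```

For most applications the `AudioStream` helper shown below handles this accumulation for you.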
import AVFoundation
class RealTimeDiarizer {
private let audioEngine = AVAudioEngine()
private let diarizer: DiarizerManager
private var audioStream: AudioStream
init() async throws {
let models = try await DiarizerModels.downloadIfNeeded()
diarizer = DiarizerManager() // Default config
diarizer.initialize(models: models)
audioStream = AudioStream(
chunkDuration: 5.0, // 5 second chunks work well
chunkSkip: 3.0, // 3.0 second delay between chunks works well
streamStartTime: 0.0,
chunkingStrategy: .useFixedSkip // ensure chunks are evenly spaced
)
audioStream.bind { chunk, _ in
Task {
do {
let result = try diarizer.performCompleteDiarization(chunk)
await handleResults(result)
} catch {
print("Diarization error: \(error)")
}
}
}
}
func startCapture() throws {
let inputNode = audioEngine.inputNode
let recordingFormat = inputNode.outputFormat(forBus: 0)
// Install tap to capture audio
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
guard let self = self else { return }
try? self.audioStream.write(from: buffer)
}
audioEngine.prepare()
try audioEngine.start()
}
@MainActor
private func handleResults(_ result: DiarizationResult) {
for segment in result.segments {
print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}
}
}

Main entry point for diarization pipeline:
let diarizer = DiarizerManager() // Default config (recommended)
diarizer.initialize(models: models)
// Normalize with AudioConverter
let samples = try AudioConverter().resampleAudioFile(URL(fileURLWithPath: "path/to/audio.wav"))
let result = try diarizer.performCompleteDiarization(samples)

Tracks speaker identities across audio chunks:
let speakerManager = diarizer.speakerManager
// Get speaker information
print("Active speakers: \(speakerManager.speakerCount)")
for speakerId in speakerManager.speakerIds {
if let speaker = speakerManager.getSpeaker(for: speakerId) {
print("\(speaker.name): \(speaker.duration)s total")
}
}

Configuration parameters:
let config = DiarizerConfig(
clusteringThreshold: 0.7, // Speaker separation threshold (0.0-1.0)
minSpeechDuration: 1.0, // Minimum speech duration in seconds
minSilenceGap: 0.5, // Minimum silence between speakers
minActiveFramesCount: 10.0, // Minimum active frames for valid segment
debugMode: false // Enable debug logging
)

Pre-load known speaker profiles:
// Create embeddings for known speakers
let aliceAudio = loadAudioFile("alice_sample.wav")
let aliceEmbedding = try diarizer.extractEmbedding(aliceAudio)
// Initialize with known speakers
let alice = Speaker(id: "Alice", name: "Alice", currentEmbedding: aliceEmbedding)
let bob = Speaker(id: "Bob", name: "Bob", currentEmbedding: bobEmbedding)
speakerManager.initializeKnownSpeakers([alice, bob])
// Process - will use "Alice" instead of "Speaker_1" when matched
let result = try diarizer.performCompleteDiarization(audioSamples)

import SwiftUI
import FluidAudio
struct DiarizationView: View {
@StateObject private var processor = DiarizationProcessor()
var body: some View {
VStack {
Text("Speakers: \(processor.speakerCount)")
List(processor.activeSpeakers) { speaker in
HStack {
Circle()
.fill(speaker.isSpeaking ? Color.green : Color.gray)
.frame(width: 10, height: 10)
Text(speaker.name)
Spacer()
Text("\(speaker.duration, specifier: "%.1f")s")
}
}
Button(processor.isProcessing ? "Stop" : "Start") {
processor.toggleProcessing()
}
}
}
}
@MainActor
class DiarizationProcessor: ObservableObject {
@Published var speakerCount = 0
@Published var activeSpeakers: [SpeakerDisplay] = []
@Published var isProcessing = false
private var diarizer: DiarizerManager?
func toggleProcessing() {
if isProcessing {
stopProcessing()
} else {
startProcessing()
}
}
private func startProcessing() {
Task {
let models = try await DiarizerModels.downloadIfNeeded()
diarizer = DiarizerManager() // Default config
diarizer?.initialize(models: models)
isProcessing = true
// Start audio capture and process chunks
AudioCapture.start { [weak self] chunk in
self?.processChunk(chunk)
}
}
}
private func processChunk(_ audio: [Float]) {
Task { @MainActor in
guard let diarizer = diarizer else { return }
let result = try diarizer.performCompleteDiarization(audio)
speakerCount = diarizer.speakerManager.speakerCount
// Update UI with current speakers
activeSpeakers = diarizer.speakerManager.speakerIds.compactMap { id in
guard let speaker = diarizer.speakerManager.getSpeaker(for: id) else {
return nil
}
return SpeakerDisplay(
id: id,
name: speaker.name,
duration: speaker.duration,
isSpeaking: result.segments.contains { $0.speakerId == id }
)
}
}
}
}

let config = DiarizerConfig(
clusteringThreshold: 0.7,
minSpeechDuration: 1.0,
minSilenceGap: 0.5
)
// Lower latency for real-time
let config = DiarizerConfig(
clusteringThreshold: 0.7,
minSpeechDuration: 0.5, // Faster response
minSilenceGap: 0.3 // Quicker speaker switches
)

// Reset between sessions to free memory
diarizer.speakerManager.reset()
// Or cleanup completely
diarizer.cleanup()

Evaluate performance on your audio:
# Command-line benchmark
swift run fluidaudiocli diarization-benchmark --single-file ES2004a
# Results:
# DER: 17.7% (Miss: 10.3%, FA: 1.6%, Speaker Error: 5.8%)
# RTFx: 141.2x (real-time factor) on M1 (2022)

| Method | Description |
|---|---|
| `initialize(models:)` | Initialize with Core ML models |
| `performCompleteDiarization(_:sampleRate:)` | Process audio and return segments |
| `cleanup()` | Release resources |
| Method | Description |
|---|---|
| `assignSpeaker(_:speechDuration:)` | Assign embedding to speaker |
| `initializeKnownSpeakers(_:)` | Load known speaker profiles |
| `getSpeaker(for:)` | Get speaker details |
| `reset()` | Clear all speakers |
| Property | Type | Description |
|---|---|---|
| `segments` | `[TimedSpeakerSegment]` | Speaker segments with timing |
| `speakerDatabase` | `[String: [Float]]?` | Speaker embeddings keyed by speaker ID |
| `timings` | `PipelineTimings?` | Processing timings for the diarization pass |
- iOS 17.0+ / macOS 14.0+
- Swift 5.9+
- ~100MB for Core ML models (downloaded on first use)
| Device | RTFx | Notes |
|---|---|---|
| M2 MacBook Air | 150x | Apple Neural Engine |
| M1 iPad Pro | 120x | Neural Engine |
| iPhone 14 Pro | 80x | Neural Engine |
| GitHub Actions | 140x | CPU only |
- Check if audio has overlapping speech (not yet supported)
- Ensure 16kHz sampling rate
- Verify audio isn't too noisy
- Call `reset()` between sessions
- Process shorter chunks for streaming
- Reduce `minActiveFramesCount` if needed
- Check internet connection
- Verify ~100MB free space
- Models cached after first download
See main repository LICENSE file.