
Speaker Diarization

Real-time speaker diarization for iOS and macOS, answering "who spoke when" in audio streams.

Model Choice

Pick the diarizer based on the workflow:

LS-EEND handles noisy environments and overlapping speakers well and supports up to 10 speakers. It is also much more lightweight than Sortformer and can run entirely on an M4 Max CPU at a speed comparable to Sortformer on the ANE. However, it is more prone to false alarms and is usually less stable than Sortformer, unless many speakers are talking simultaneously. Streaming updates occur every 100ms, with about 900ms of tentative predictions.

Sortformer handles noisy environments and overlapping speakers well, but is limited to 4 speakers. Speaker identities are extremely stable, but it struggles when many loud voices are present. It also misses quiet speech as it's trained to ignore background conversations. The most common source of error is missed speech. Streaming updates occur every 480ms, with 560ms of tentative predictions.

DiarizerManager is the legacy online diarizer. It struggles with background noise, background conversations, overlapping speech with more than 2 speakers, short utterances, and similar-sounding speakers. The most common source of error is incorrect labeling. It performs poorly for low-latency streaming and requires external timestamp alignment between chunks when operating on a sliding window. It is also the most computationally heavy online diarizer.

Offline VBx pipeline is the best offline-quality option when you want a full batch pipeline with segmentation, embedding extraction, and clustering over a complete file.

| Environment | Sortformer | DiarizerManager | LS-EEND |
| --- | --- | --- | --- |
| Clean/silent room | Best | Good | Best |
| Background noise | Best | Poor | Good |
| Speech from another room | Poor | Good | Good |
| Whispers | Poor | Good | Best |
| High overlap | Good | Poor | Best |
| Max speakers | 4 | No max | 10 |
| Benchmarks | Good | Poor | Best |
| Remembering speakers across meetings | Great | Best | Good |
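As a rough rule of thumb, the table above can be distilled into a small selection helper. This is illustrative only; DiarizerChoice and pickDiarizer are hypothetical names, not FluidAudio API:

```swift
// Illustrative heuristic distilled from the comparison table above.
// DiarizerChoice and pickDiarizer are hypothetical, not FluidAudio API.
enum DiarizerChoice { case lsEEND, sortformer, offlineVBx }

func pickDiarizer(streaming: Bool, expectedSpeakers: Int, heavyOverlap: Bool) -> DiarizerChoice {
    guard streaming else { return .offlineVBx }                 // complete files: best batch quality
    if expectedSpeakers > 4 || heavyOverlap { return .lsEEND }  // >4 speakers or high overlap
    return .sortformer                                          // stable identities, up to 4 speakers
}
```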

Quick Start

import FluidAudio

// 1. Download models (one-time setup)
let models = try await DiarizerModels.downloadIfNeeded()

// 2. Initialize with default config
let diarizer = DiarizerManager()
diarizer.initialize(models: models)

// 3. Normalize any audio file to 16kHz mono Float32 using AudioConverter
let converter = AudioConverter()
let url = URL(fileURLWithPath: "path/to/audio.wav")
let audioSamples = try converter.resampleAudioFile(url)

// 4. Run diarization (accepts any RandomAccessCollection<Float>)
let result = try diarizer.performCompleteDiarization(audioSamples)

// Alternative: Use ArraySlice for zero-copy processing
let audioSlice = audioSamples[1000..<5000]  // No memory copy!
let sliceResult = try diarizer.performCompleteDiarization(audioSlice)

// 5. Get results
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}

Source Layout

The diarizer module mirrors the three-stage pipeline and its offline counterpart. Files live under Sources/FluidAudio/Diarizer/:

Core/
├── DiarizerManager.swift        # Real-time orchestrator and chunk scheduler
├── DiarizerTypes.swift          # Public models/configs shared across stages
└── DiarizerModels.swift         # Core ML bundle management

Segmentation/
├── SegmentationProcessor.swift  # VAD + powerset segmentation inference
├── SlidingWindow.swift          # Frame windowing helpers
└── AudioValidation.swift        # Streaming quality/embedding validation

Extraction/
└── EmbeddingExtractor.swift     # WeSpeaker embedding inference

Clustering/
├── SpeakerManager.swift         # Active speaker tracking and assignment
├── SpeakerTypes.swift           # Speaker/raw embedding representations
└── SpeakerOperations.swift      # Distance/scoring utilities

Offline/
├── Core/
│   ├── OfflineDiarizerManager.swift
│   ├── OfflineDiarizerTypes.swift
│   └── OfflineDiarizerModels.swift
├── Segmentation/
│   └── OfflineSegmentationProcessor.swift
├── Extraction/
│   ├── OfflineEmbeddingExtractor.swift
│   ├── PLDATransform.swift
│   └── WeightInterpolation.swift
├── Clustering/
│   ├── AHCClustering.swift
│   └── VBxClustering.swift
└── Utils/
    ├── OfflineReconstruction.swift
    └── VDSPOperations.swift

Use this layout as the reference when adding new diarization capabilities so orchestration, segmentation, embedding extraction, and clustering stay isolated.

Manual Model Loading

If you deploy in an offline environment, stage the Core ML bundles manually and skip the automatic HuggingFace downloader.

Required assets

Download the two .mlmodelc folders from FluidInference/speaker-diarization-coreml:

  • pyannote_segmentation.mlmodelc
  • wespeaker_v2.mlmodelc

Keep the folder names unchanged so the loader can find coremldata.bin inside each bundle.

Folder layout

Place the staged repo in persistent storage (replace /opt/models with your path):

/opt/models
└── speaker-diarization-coreml
    ├── pyannote_segmentation.mlmodelc
    │   ├── coremldata.bin
    │   └── ...
    └── wespeaker_v2.mlmodelc
        ├── coremldata.bin
        └── ...

You can clone with Git LFS, download the .tar archives from HuggingFace, or copy the directory from a machine that already ran DiarizerModels.downloadIfNeeded() (macOS cache: ~/Library/Application Support/FluidAudio/Models/speaker-diarization-coreml).

Loading without downloads

Point DiarizerModels.load at the staged bundles and initialize the manager with the returned models:

import FluidAudio

@main
struct OfflineDiarizer {
    static func main() async {
        do {
            let basePath = "/opt/models/speaker-diarization-coreml"
            let segmentation = URL(fileURLWithPath: basePath).appendingPathComponent("pyannote_segmentation.mlmodelc")
            let embedding = URL(fileURLWithPath: basePath).appendingPathComponent("wespeaker_v2.mlmodelc")

            let models = try await DiarizerModels.load(
                localSegmentationModel: segmentation,
                localEmbeddingModel: embedding
            )

            let diarizer = DiarizerManager()
            diarizer.initialize(models: models)

            // ... run diarization as usual
        } catch {
            print("Failed to load diarizer models: \(error)")
        }
    }
}

Use FileManager to verify the two .mlmodelc folders exist before loading. When paths are correct, the loader never contacts the network and no auto-download occurs.
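A minimal pre-flight check could look like the following sketch. Only Foundation is assumed; verifyStagedModels is a hypothetical helper, not FluidAudio API:

```swift
import Foundation

// Returns true only if every expected .mlmodelc bundle exists as a directory
// and contains the compiled coremldata.bin the loader looks for.
// Hypothetical helper, not FluidAudio API.
func verifyStagedModels(at basePath: String,
                        bundles: [String] = ["pyannote_segmentation.mlmodelc",
                                             "wespeaker_v2.mlmodelc"]) -> Bool {
    let fm = FileManager.default
    for bundle in bundles {
        let dir = URL(fileURLWithPath: basePath).appendingPathComponent(bundle)
        var isDir: ObjCBool = false
        guard fm.fileExists(atPath: dir.path, isDirectory: &isDir), isDir.boolValue,
              fm.fileExists(atPath: dir.appendingPathComponent("coremldata.bin").path)
        else { return false }
    }
    return true
}
```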

Custom Configuration (Optional)

For fine-tuning, you can customize the configuration:

let config = DiarizerConfig(
    clusteringThreshold: 0.7,  // Speaker separation sensitivity (0.5-0.9)
    minSpeechDuration: 1.0,     // Minimum speech segment (seconds)
    minSilenceGap: 0.5          // Minimum gap between speakers (seconds)
)
let diarizer = DiarizerManager(config: config)
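The clusteringThreshold is compared against a distance between speaker embeddings. A minimal cosine-distance sketch shows the mechanics; this is illustrative only and not FluidAudio's internal implementation:

```swift
// Cosine distance between two embeddings: 0 = identical direction, 2 = opposite.
// A new embedding whose distance to every known speaker exceeds the clustering
// threshold would be treated as a new speaker. Illustrative sketch only.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embedding dimensions must match")
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return 1 - dot / (na.squareRoot() * nb.squareRoot())
}

func isNewSpeaker(_ embedding: [Float], known: [[Float]], threshold: Float = 0.7) -> Bool {
    known.allSatisfy { cosineDistance(embedding, $0) > threshold }
}
```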

Offline VBx Pipeline (Batch Diarization)

Requires macOS 14 / iOS 17 or later. The offline stack uses native C++ clustering and AsyncStream coordination that are unavailable on older OS releases.

When you need full parity with the pyannote/Core ML exporter (powerset segmentation + VBx clustering), use OfflineDiarizerManager. It orchestrates segmentation, soft mask interpolation, WeSpeaker embedding extraction, PLDA/VBx clustering, and timeline reconstruction in one place:

import FluidAudio

let config = OfflineDiarizerConfig()
let manager = OfflineDiarizerManager(config: config)
try await manager.prepareModels()  // Downloads + compiles Core ML bundles when missing

let samples = try AudioConverter().resampleAudioFile(URL(fileURLWithPath: "meeting.wav"))
let result = try await manager.process(audio: samples)

for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s – \(segment.endTimeSeconds)s")
}

For file-based processing, use the memory-mapped streaming API which automatically handles large audio files efficiently:

let url = URL(fileURLWithPath: "meeting.wav")
let result = try await manager.process(url)

The file-based API internally uses memory-mapped streaming to avoid materializing the entire buffer in memory.

The offline controller mirrors the reference pipeline:

  • Segmentation: SegmentationRunner feeds 10 s/160 k sample chunks through the Core ML segmentation model. Each chunk yields 589 frame-level log probabilities over the 7 local powerset classes.
  • Binarization: Binarization.logProbsToWeights converts log probabilities to soft VAD weights; binary masks are still available for diagnostics.
  • Weight interpolation: WeightInterpolation applies the same half-pixel mapping as scipy.ndimage.zoom, preserving the Core ML exporter’s alignment when resampling 589-frame masks to the embedding model’s pooling rate.
  • Embedding extraction: EmbeddingRunner batches audio + resampled weights and returns L2-normalized 256-d embeddings.
  • VBx clustering: VBxClustering (with AHCClustering warm start and PLDAScoring) runs the full VBx refinement loop using the JSON parameters exported with the model bundle.
  • Timeline reconstruction: TimelineReconstruction now derives frame duration from the actual segmentation output and OfflineDiarizerConfig.windowDuration, ensuring timestamps stay correct if you swap in models with different hop sizes.
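The half-pixel mapping mentioned above can be sketched in a few lines, assuming linear interpolation with edge clamping analogous to scipy.ndimage.zoom. This is a sketch, not the actual WeightInterpolation source:

```swift
// Resample a 1-D weight mask to a new length using the half-pixel convention:
// output index i maps to input coordinate (i + 0.5) * (inCount / outCount) - 0.5,
// with linear interpolation and clamping at the edges. Sketch only.
func resampleWeights(_ weights: [Float], to outCount: Int) -> [Float] {
    precondition(!weights.isEmpty && outCount > 0)
    let scale = Float(weights.count) / Float(outCount)
    return (0..<outCount).map { i in
        let x = (Float(i) + 0.5) * scale - 0.5
        let clamped = min(max(x, 0), Float(weights.count - 1))
        let lo = Int(clamped)
        let hi = min(lo + 1, weights.count - 1)
        let frac = clamped - Float(lo)
        return weights[lo] * (1 - frac) + weights[hi] * frac
    }
}
```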

OfflineDiarizerConfig groups knobs by pipeline stage:

  • segmentation: Window length (default 10 s), step ratio, min on/off durations, and sample rate. These must align with the exported Core ML segmentation model.
  • embedding: Batch size and overlap handling. Keep excludeOverlap enabled for community-1 style powerset outputs.
  • clustering: The VBx warm-start threshold plus pyannote's Fa/Fb priors.
  • vbx: Max iterations and convergence tolerance for the refinement loop.
  • postProcessing: Minimum gap duration when stitching segments back together.
  • export: Optional embeddingsPath for dumping per-speaker vectors to JSON.
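The postProcessing minimum-gap behavior can be illustrated with a simple merge pass. SimpleSegment and stitch are hypothetical stand-ins for the library's types, shown only to convey the idea:

```swift
// Merge consecutive same-speaker segments whose silence gap is below minGap.
// SimpleSegment and stitch are hypothetical stand-ins, not FluidAudio API.
struct SimpleSegment: Equatable {
    var speakerId: String
    var start: Double
    var end: Double
}

func stitch(_ segments: [SimpleSegment], minGap: Double) -> [SimpleSegment] {
    var result: [SimpleSegment] = []
    for seg in segments {
        if var last = result.last,
           last.speakerId == seg.speakerId,
           seg.start - last.end < minGap {
            last.end = max(last.end, seg.end)   // bridge the short gap
            result[result.count - 1] = last
        } else {
            result.append(seg)
        }
    }
    return result
}
```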

prepareModels captures Core ML compilation timings (and download durations when a fresh fetch is needed), so DiarizationResult.timings reflects audio loading, segmentation, embedding, clustering, and post-processing costs in one place. Per-speaker embeddings are exposed in speakerDatabase for downstream analytics without toggling debug flags.

CLI shortcut

The CLI exposes the same controller via the fluidaudiocli process command and the diarization benchmark tooling:

swift run fluidaudiocli process meeting.wav --mode offline --threshold 0.6 --debug --chunk-seconds 10.0 --overlap-seconds 0.0
swift run fluidaudiocli diarization-benchmark --mode offline --dataset ami-sdm --threshold 0.6 --auto-download

Add --rttm path/to/meeting.rttm when you have ground-truth annotations to emit DER/JER directly on the console, or --export-embeddings embeddings.json to inspect cluster assignments. The GitHub Actions workflow offline-pipeline.yml executes the single-file AMI benchmark on every PR, keeping downloads, PLDA transforms, and VBx clustering guard-railed.

Both commands reuse the shared model cache (OfflineDiarizerModels.defaultModelsDirectory()) and emit JSON payloads compatible with the streaming pipeline.

Advanced: Manual Audio Source Control

For use cases requiring fine-grained control over memory management or audio loading, you can manually construct the audio source using StreamingAudioSourceFactory:

import FluidAudio

let config = OfflineDiarizerConfig()
let manager = OfflineDiarizerManager(config: config)
try await manager.prepareModels()

let factory = StreamingAudioSourceFactory()
let (source, loadDuration) = try factory.makeDiskBackedSource(
    from: URL(fileURLWithPath: "meeting.wav"),
    targetSampleRate: config.segmentation.sampleRate
)
defer { source.cleanup() }

let result = try await manager.process(
    audioSource: source,
    audioLoadingSeconds: loadDuration
)

This approach is useful when you need to:

  • Process the same file multiple times without reloading
  • Measure audio loading time separately from diarization time
  • Implement custom cleanup or caching logic

For most use cases, the simpler manager.process(url) API is recommended.

Streaming/Real-time Processing

For online diarization, prefer LS-EEND or Sortformer first, and use WeSpeaker/Pyannote only when you specifically need its speaker-database behavior.

LS-EEND Streaming

Use LSEENDDiarizer when you want a live diarizer with low latency and support for up to 10 speakers. It is very eager to detect speech and performs well even when many simultaneous speakers are present.

import FluidAudio

let diarizer = LSEENDDiarizer()
try await diarizer.initialize(variant: .dihard3)

try diarizer.addAudio(audioChunk, sourceSampleRate: 16_000)
if let update = try diarizer.process() {
    for segment in update.finalizedSegments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}

Notes:

  • Supports streaming and complete-buffer inference with the same API.
  • Returns DiarizerTimelineUpdate values with finalized and tentative segments.
  • Use enrollSpeaker(withSamples:named:overwritingAssignedSpeakerName:) to pre-enroll speakers before streaming. This keeps the warmed LS-EEND session state while resetting the visible timeline.
  • See LS-EEND.md for the full API and call flow.

Sortformer Streaming

Use SortformerDiarizer when you want robust low-latency streaming with stronger speaker identity stability, and the 4-speaker limit is acceptable. It attempts to ignore background speakers, so it may miss quiet speech. It also sometimes gets overwhelmed when too many speakers are talking simultaneously.

import FluidAudio

let diarizer = SortformerDiarizer()
let models = try await SortformerModels.loadFromHuggingFace(config: .default)
diarizer.initialize(models: models)

try diarizer.addAudio(audioChunk, sourceSampleRate: 16_000)
if let update = try diarizer.process() {
    for segment in update.finalizedSegments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}

Notes:

  • Best when speaker ordering and identity stability matter more than maximum speaker count.
  • Limited to 4 speakers and can miss quiet speech more often than LS-EEND.
  • Use enrollSpeaker(withAudio:named:overwritingAssignedSpeakerName:) to pre-enroll speakers before streaming. This preserves the speaker cache/FIFO state and clears the visible timeline.
  • Also returns DiarizerTimelineUpdate / DiarizerTimeline so the integration shape matches LS-EEND.
  • See Sortformer.md for model loading, tuning, and behavior details.

WeSpeaker/Pyannote Streaming

Use DiarizerManager when you need the classic segmentation + embedding + speaker-database pipeline. This is the slowest streaming option and works best with larger chunks.

Process audio in chunks for real-time applications:

// Configure for streaming
let diarizer = DiarizerManager()  // Default config works well
diarizer.initialize(models: models)

var stream = AudioStream(
    chunkDuration: 5.0, // 10.0 for best accuracy, 3.0 for lowest latency
    chunkSkip: 2.0, // Duration between successive chunk starts
    streamStartTime: 0.0, // Audio stream start time
    chunkingStrategy: .useMostRecent // Most recent
)

stream.bind { chunk, time in
    let results = try diarizer.performCompleteDiarization(chunk, atTime: time)
    for segment in results.segments {
        handleSpeakerSegment(segment)
    }
}

for audioSamples in audioStream {
    try stream.write(from: audioSamples)
}

Notes:

  • Keep one DiarizerManager instance per stream so SpeakerManager maintains ID consistency.
  • Always rebase per-chunk timestamps by (chunkStartSample / sampleRate).
  • Provide 16 kHz mono Float32 samples; pad final chunk to the model window.
  • Tune speakerThreshold and embeddingThreshold to trade off ID stability vs. sensitivity.
  • Prefer chunk sizes of at least 5 seconds; 10 seconds is usually better.
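Rebasing per-chunk timestamps, as noted above, is just an offset by the chunk's start position converted to seconds. A sketch (rebase is a hypothetical helper):

```swift
// Convert chunk-relative segment times to stream-absolute times by adding
// the chunk's start offset in seconds. Hypothetical helper for illustration.
func rebase(startTime: Double, endTime: Double,
            chunkStartSample: Int, sampleRate: Double = 16_000) -> (Double, Double) {
    let offset = Double(chunkStartSample) / sampleRate
    return (startTime + offset, endTime + offset)
}
```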

Speaker Enrollment: The Speaker class includes a name field for enrollment workflows. When users introduce themselves ("My name is Alice"), update the speaker's name from the default (e.g. "Speaker_1") to enable personalized identification.

Chunk Size Considerations

The performCompleteDiarization function accepts audio of any length, but accuracy varies:

  • < 3 seconds: May fail or produce unreliable results
  • 3-5 seconds: Minimum viable chunk, reduced accuracy
  • 10 seconds: Optimal balance of accuracy and latency (recommended)
  • > 10 seconds: Good accuracy but higher latency
  • Maximum: Limited only by memory

You can adjust chunk size based on your needs:

  • Low latency: Use 3-5 second chunks (accept lower accuracy)
  • High accuracy: Use 10+ second chunks (accept higher latency)

The diarizer doesn't automatically chunk audio - you need to:

  1. Accumulate incoming audio samples to your desired chunk size
  2. Process chunks with performCompleteDiarization
  3. Maintain speaker IDs across chunks using SpeakerManager
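Steps 1 and 2 can be sketched as a small accumulator that buffers incoming samples and emits fixed-size chunks. ChunkAccumulator is a hypothetical helper, not FluidAudio API:

```swift
// Buffers incoming 16 kHz samples and emits fixed-size chunks suitable for
// performCompleteDiarization. Hypothetical helper, not FluidAudio API.
struct ChunkAccumulator {
    let chunkSize: Int                 // e.g. 10 s * 16_000 = 160_000 samples
    private var buffer: [Float] = []

    init(seconds: Double, sampleRate: Double = 16_000) {
        chunkSize = Int(seconds * sampleRate)
    }

    // Append samples; returns every full chunk that became available.
    mutating func append(_ samples: [Float]) -> [[Float]] {
        buffer.append(contentsOf: samples)
        var chunks: [[Float]] = []
        while buffer.count >= chunkSize {
            chunks.append(Array(buffer.prefix(chunkSize)))
            buffer.removeFirst(chunkSize)
        }
        return chunks
    }
}
```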

Real-time Audio Capture Example

import AVFoundation

class RealTimeDiarizer {
    private let audioEngine = AVAudioEngine()
    private let diarizer: DiarizerManager
    private var audioStream: AudioStream
    
    init() async throws {
        let models = try await DiarizerModels.downloadIfNeeded()
        diarizer = DiarizerManager()  // Default config
        diarizer.initialize(models: models)
        audioStream = AudioStream(
            chunkDuration: 5.0, // 5 second chunks work well
            chunkSkip: 3.0, // 3.0 second delay between chunks works well
            streamStartTime: 0.0,
            chunkingStrategy: .useFixedSkip // ensure chunks are evenly spaced
        )
        audioStream.bind { [weak self] chunk, _ in
            guard let self else { return }
            Task {
                do {
                    let result = try self.diarizer.performCompleteDiarization(chunk)
                    await self.handleResults(result)
                } catch {
                    print("Diarization error: \(error)")
                }
            }
        }
    }

    func startCapture() throws {
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)

        // Install tap to capture audio
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
            guard let self = self else { return }
            try? self.audioStream.write(from: buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
    }

    @MainActor
    private func handleResults(_ result: DiarizationResult) {
        for segment in result.segments {
            print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
        }
    }
}

Core Components

DiarizerManager

Main entry point for diarization pipeline:

let diarizer = DiarizerManager()  // Default config (recommended)
diarizer.initialize(models: models)

// Normalize with AudioConverter
let samples = try AudioConverter().resampleAudioFile(URL(fileURLWithPath: "path/to/audio.wav"))
let result = try diarizer.performCompleteDiarization(samples)

SpeakerManager

Tracks speaker identities across audio chunks:

let speakerManager = diarizer.speakerManager

// Get speaker information
print("Active speakers: \(speakerManager.speakerCount)")
for speakerId in speakerManager.speakerIds {
    if let speaker = speakerManager.getSpeaker(for: speakerId) {
        print("\(speaker.name): \(speaker.duration)s total")
    }
}

DiarizerConfig

Configuration parameters:

let config = DiarizerConfig(
    clusteringThreshold: 0.7,      // Speaker separation threshold (0.0-1.0)
    minSpeechDuration: 1.0,         // Minimum speech duration in seconds
    minSilenceGap: 0.5,             // Minimum silence between speakers
    minActiveFramesCount: 10.0,     // Minimum active frames for valid segment
    debugMode: false                // Enable debug logging
)

Known Speaker Recognition

Pre-load known speaker profiles:

// Create embeddings for known speakers
let aliceAudio = loadAudioFile("alice_sample.wav")
let aliceEmbedding = try diarizer.extractEmbedding(aliceAudio)

// Initialize with known speakers
let alice = Speaker(id: "Alice", name: "Alice", currentEmbedding: aliceEmbedding)
let bob = Speaker(id: "Bob", name: "Bob", currentEmbedding: bobEmbedding)
speakerManager.initializeKnownSpeakers([alice, bob])

// Process - will use "Alice" instead of "Speaker_1" when matched
let result = try diarizer.performCompleteDiarization(audioSamples)

SwiftUI Integration

import SwiftUI
import FluidAudio

struct DiarizationView: View {
    @StateObject private var processor = DiarizationProcessor()

    var body: some View {
        VStack {
            Text("Speakers: \(processor.speakerCount)")

            List(processor.activeSpeakers) { speaker in
                HStack {
                    Circle()
                        .fill(speaker.isSpeaking ? Color.green : Color.gray)
                        .frame(width: 10, height: 10)
                    Text(speaker.name)
                    Spacer()
                    Text("\(speaker.duration, specifier: "%.1f")s")
                }
            }

            Button(processor.isProcessing ? "Stop" : "Start") {
                processor.toggleProcessing()
            }
        }
    }
}

@MainActor
class DiarizationProcessor: ObservableObject {
    @Published var speakerCount = 0
    @Published var activeSpeakers: [SpeakerDisplay] = []
    @Published var isProcessing = false

    private var diarizer: DiarizerManager?

    func toggleProcessing() {
        if isProcessing {
            stopProcessing()
        } else {
            startProcessing()
        }
    }

    private func startProcessing() {
        Task {
            let models = try await DiarizerModels.downloadIfNeeded()
            diarizer = DiarizerManager()  // Default config
            diarizer?.initialize(models: models)
            isProcessing = true

            // Start audio capture and process chunks
            AudioCapture.start { [weak self] chunk in
                self?.processChunk(chunk)
            }
        }
    }

    private func processChunk(_ audio: [Float]) {
        Task { @MainActor in
            guard let diarizer = diarizer else { return }

            let result = try diarizer.performCompleteDiarization(audio)
            speakerCount = diarizer.speakerManager.speakerCount

            // Update UI with current speakers
            activeSpeakers = diarizer.speakerManager.speakerIds.compactMap { id in
                guard let speaker = diarizer.speakerManager.getSpeaker(for: id) else {
                    return nil
                }
                return SpeakerDisplay(
                    id: id,
                    name: speaker.name,
                    duration: speaker.duration,
                    isSpeaking: result.segments.contains { $0.speakerId == id }
                )
            }
        }
    }
}

Performance Optimization

// Balanced defaults
let defaultConfig = DiarizerConfig(
    clusteringThreshold: 0.7,
    minSpeechDuration: 1.0,
    minSilenceGap: 0.5
)

// Lower latency for real-time
let lowLatencyConfig = DiarizerConfig(
    clusteringThreshold: 0.7,
    minSpeechDuration: 0.5,    // Faster response
    minSilenceGap: 0.3         // Quicker speaker switches
)

Memory Management

// Reset between sessions to free memory
diarizer.speakerManager.reset()

// Or cleanup completely
diarizer.cleanup()

Benchmarking

Evaluate performance on your audio:

# Command-line benchmark
swift run fluidaudiocli diarization-benchmark --single-file ES2004a

# Results:
# DER: 17.7% (Miss: 10.3%, FA: 1.6%, Speaker Error: 5.8%)
# RTFx: 141.2x (real-time factor) M1 2022
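DER is simply the sum of the three reported error components (missed speech, false alarm, speaker confusion). A quick sanity check against the numbers above, as illustrative arithmetic only:

```swift
// DER = missed speech + false alarm + speaker confusion (all as percentages).
let miss = 10.3, falseAlarm = 1.6, speakerError = 5.8
let der = miss + falseAlarm + speakerError
// matches the 17.7% reported by the benchmark above
```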

API Reference

DiarizerManager

| Method | Description |
| --- | --- |
| initialize(models:) | Initialize with Core ML models |
| performCompleteDiarization(_:sampleRate:) | Process audio and return segments |
| cleanup() | Release resources |

SpeakerManager

| Method | Description |
| --- | --- |
| assignSpeaker(_:speechDuration:) | Assign embedding to speaker |
| initializeKnownSpeakers(_:) | Load known speaker profiles |
| getSpeaker(for:) | Get speaker details |
| reset() | Clear all speakers |

DiarizationResult

| Property | Type | Description |
| --- | --- | --- |
| segments | [TimedSpeakerSegment] | Speaker segments with timing |
| speakerDatabase | [String: [Float]]? | Speaker embeddings keyed by speaker ID |
| timings | PipelineTimings? | Processing timings for the diarization pass |

Requirements

  • iOS 17.0+ / macOS 14.0+
  • Swift 5.9+
  • ~100MB for Core ML models (downloaded on first use)

Performance

| Device | RTFx | Notes |
| --- | --- | --- |
| M2 MacBook Air | 150x | Apple Neural Engine |
| M1 iPad Pro | 120x | Neural Engine |
| iPhone 14 Pro | 80x | Neural Engine |
| GitHub Actions | 140x | CPU only |

Troubleshooting

High DER on certain audio

  • Check if audio has heavy overlapping speech (handled poorly by the legacy DiarizerManager pipeline; prefer LS-EEND or Sortformer)
  • Ensure 16kHz sampling rate
  • Verify audio isn't too noisy

Memory issues

  • Call reset() between sessions
  • Process shorter chunks for streaming
  • Reduce minActiveFramesCount if needed

Model download fails

  • Check internet connection
  • Verify ~100MB free space
  • Models cached after first download

License

See main repository LICENSE file.