LS-EEND (Long-Form Streaming End-to-End Neural Diarization) answers "who spoke when" in real time. A causal Conformer encoder with a retention mechanism feeds an online attractor decoder that tracks speaker identities frame by frame, without separate VAD, segmentation, or clustering.
Key specs:

- 4–10 simultaneous speakers depending on variant (see below)
- ~100 ms frame resolution (10 Hz output)
- Handles recordings up to one hour
- 8000 Hz input sample rate (automatic resampling via `processComplete(audioFileURL:)`)
- Frame-in/frame-out streaming with speculative preview frames
- CoreML-optimized for Apple Silicon

Limitations:

- 8000 Hz sample rate — lower audio fidelity than 16 kHz models
- Speaker identity is local to the recording; persistent speaker enrollment may be unreliable
- Variants are domain-specialized: using the wrong variant for a domain hurts accuracy
Each variant is a separate CoreML model trained on a specific corpus. Choose the one that best matches your audio.
**AMI**: Multi-speaker conference room recordings with close-talk and distant microphones. Best for: boardroom meetings, panel discussions, speakers in a shared physical space.

- DER (AMI test set): 20.76%
- Max speakers: 4

**CALLHOME**: Telephone conversations with codec noise and narrow bandwidth. Best for: call center recordings, customer service calls, telephony audio.

- DER (CALLHOME test set): 12.11%
- Max speakers: 7

**DIHARD II**: Dinner parties, clinical interviews, conference rooms, multi-channel arrays, child speech. Best for: challenging acoustics, heavy overlap, non-standard recording setups.

- DER (DIHARD II test set): 27.58%
- Max speakers: 10

**DIHARD III**: Podcasts, audiobooks, broadcast media, YouTube, field recordings — deliberately broad. Best for: unknown or mixed recording conditions; the safest general-purpose choice.

- DER (DIHARD III test set): 19.61%
- Max speakers: 10
```
LSEENDDiarizer.processComplete(_:sourceSampleRate:)
 |
 |-- normalizeSamplesLocked()       resample to 8 kHz when needed
 |-- processCompleteInternal()
 |     |
 |     |-- engine.createSession()   fresh streaming session
 |     |-- session.pushAudio()      run full buffer through LS-EEND
 |     |-- session.finalize()       flush remaining committed frames
 |     |-- DiarizerTimeline.addChunk()
 |     |-- DiarizerTimeline.finalize()
 |
 |-- return DiarizerTimeline
```
Under the hood, offline engine inference follows this path:
```
LSEENDInferenceHelper.infer(samples:sampleRate:)
 |
 |-- resampleIfNeeded()
 |-- offlineFeatureExtractor.extractFeatures(audio:)
 |     |
 |     |-- computeFlatTransposed()   STFT -> mel spectrogram
 |     |-- applyLogMelCumMeanNormalization()
 |     |-- spliceAndSubsample()
 |
 |-- createSession(inputSampleRate:)
 |-- session.ingestFeatures(features)
 |     |
 |     |-- FOR EACH MODEL FRAME:
 |     |     |
 |     |     |-- predictStep(frame:state:ingest:decode:)
 |     |     |     |
 |     |     |     |-- write frame into preallocated MLMultiArray
 |     |     |     |-- pass 6 recurrent state tensors:
 |     |     |     |     encRetKv, encRetScale, encConvCache,
 |     |     |     |     decRetKv, decRetScale, topBuffer
 |     |     |     |-- CoreML model.prediction()
 |     |     |     |-- read logits + next recurrent state tensors
 |     |     |
 |     |     |-- append committed full-output logits
 |
 |-- flushTail(from:pendingFrames:)   decode remaining tail frames
 |-- cropRealTracks()                 drop 2 boundary tracks
 |-- applyingSigmoid()                logits -> probabilities
 |-- snapshot()
 |-- return LSEENDInferenceResult
```
```
LSEENDDiarizer.addAudio(_:sourceSampleRate:)
 |
 |-- normalizeSamplesLocked()   resample to 8 kHz when needed
 |-- append to pendingAudio
 |
 v
LSEENDDiarizer.process()
 |
 |-- engine.createSession()     first call only
 |-- session.pushAudio(pendingAudio)
 |     |
 |     |-- featureExtractor.pushAudio()
 |     |     |
 |     |     |-- append raw audio to buffer
 |     |     |-- appendSTFTFrames()
 |     |     |-- applyLogMelCumMeanNormalization()
 |     |     |-- emitModelFrames()
 |     |
 |     |-- ingestFeatures()
 |     |     |
 |     |     |-- predictStep(... ingest=1, decode=0) for stable frames
 |     |     |-- update 6 recurrent state tensors each step
 |     |     |-- append committed full-output logits
 |     |
 |     |-- copy current state
 |     |-- flushTail(from:pendingFrames:) on copied state
 |     |      -> speculative preview logits
 |     |
 |     |-- cropRealTracks()    remove boundary tracks
 |     |-- applyingSigmoid()   logits -> probabilities
 |     |-- return committed + preview update
 |
 |-- DiarizerTimeline.addChunk()
 |-- return DiarizerTimelineUpdate
```
```
Sources/FluidAudio/Diarizer/LS-EEND/
├── LSEENDDiarizer.swift          # High-level Diarizer protocol implementation
├── LSEENDInference.swift         # LSEENDInferenceHelper, LSEENDStreamingSession
├── LSEENDFeatureExtraction.swift # Internal offline + streaming feature extraction
├── LSEENDSupport.swift           # Supporting data types, metadata, result structs, errors
└── LSEENDEvaluation.swift        # DER computation, RTTM parsing/writing, collar masking,
                                  # optimal speaker assignment
```
The primary entry point. Implements the Diarizer protocol — the same API as SortformerDiarizer.
```swift
// Simple init (all parameters optional)
let diarizer = LSEENDDiarizer(
    computeUnits: .cpuOnly,  // Default: .cpuOnly (fastest for this model)
    onsetThreshold: 0.5,     // Probability to start a speech segment
    offsetThreshold: 0.5,    // Probability to end a speech segment
    onsetPadFrames: 0,       // Frames prepended to each segment
    offsetPadFrames: 0,      // Frames appended to each segment
    minFramesOn: 0,          // Discard segments shorter than this
    minFramesOff: 0,         // Close gaps shorter than this
    maxStoredFrames: nil     // Cap on retained finalized frames (nil = unlimited)
)

// Or pass a DiarizerTimelineConfig directly
let config = DiarizerTimelineConfig(onsetThreshold: 0.4, onsetPadFrames: 1)
let diarizer = LSEENDDiarizer(computeUnits: .cpuOnly, timelineConfig: config)
```

```swift
// Download from HuggingFace (cached after first call)
try await diarizer.initialize(variant: .dihard3)  // default variant

// From a pre-built descriptor
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(variant: .ami)
try diarizer.initialize(descriptor: descriptor)

// From a pre-loaded engine
let engine = try LSEENDInferenceHelper(descriptor: descriptor)
diarizer.initialize(engine: engine)
```

```swift
// From a file URL (resamples to 8kHz automatically)
let timeline = try diarizer.processComplete(audioFileURL: audioURL)

// From raw samples (specify sample rate if it's not 8kHz already)
let timeline = try diarizer.processComplete(
    samples,
    sourceSampleRate: 8000,
    finalizeOnCompletion: true,
    progressCallback: { processed, total, chunks in
        print("\(processed)/\(total) samples")
    }
)
```

The `sourceSampleRate` argument is only needed if the audio samples are not already at 8 kHz.
```swift
// Push audio in chunks
diarizer.addAudio(audioChunk, sourceSampleRate: 8000)  // [Float] or any Collection<Float>
if let update = try diarizer.process() {
    for segment in update.finalizedSegments { ... }
    for tentative in update.tentativeSegments { ... }
}

// Convenience: add + process in one call
if let update = try diarizer.process(samples: audioChunk) { ... }

// Flush remaining frames at end of stream
try diarizer.finalizeSession()
let finalTimeline = diarizer.timeline
```

Use speaker enrollment to warm LS-EEND with a known speaker before the live stream starts. Enrollment keeps the active streaming session, resets the visible timeline back to frame 0, and preserves the speaker name inside the DiarizerTimeline.
```swift
let speaker = try diarizer.enrollSpeaker(
    withSamples: enrollmentAudio,
    sourceSampleRate: 16_000,
    named: "Alice",
    overwritingAssignedSpeakerName: false
)

// Later complete-buffer runs can keep the enrolled session state.
let timeline = try diarizer.processComplete(
    meetingAudio,
    sourceSampleRate: 16_000,
    keepingEnrolledSpeakers: true
)
```

Notes:

- Enrollment is per diarizer instance. Recreate or `reset()` the diarizer to start a fresh session.
- Enrollment can help with live identity continuity, but it is still less reliable than the WeSpeaker/Pyannote speaker database.
- Speaker slots are still chronological. Use `overwritingAssignedSpeakerName: false` if you want enrollment to fail instead of replacing the name on an already-named slot.
| Property | Type | Description |
|---|---|---|
| `timeline` | `DiarizerTimeline` | Accumulated finalized results |
| `isAvailable` | `Bool` | Whether the model is loaded |
| `numFramesProcessed` | `Int` | Total committed frames processed |
| `targetSampleRate` | `Int?` | Expected input sample rate (8000) |
| `modelFrameHz` | `Double?` | Output frame rate (~10.0 Hz) |
| `numSpeakers` | `Int?` | Real speaker track count (`realOutputDim`) |
| `streamingLatencySeconds` | `Double?` | Minimum latency before first frame |
| `decodeMaxSpeakers` | `Int?` | Total model output slots (including boundary tracks) |
| `computeUnits` | `MLComputeUnits` | CoreML compute units |
| `timelineConfig` | `DiarizerTimelineConfig` | Current post-processing config |
```swift
diarizer.reset()    // Reset streaming state for a new audio stream (keeps model loaded)
diarizer.cleanup()  // Release all resources including the loaded model
```

Lower-level engine for direct CoreML inference. Use this when you need access to raw logits, want to manage sessions manually, or are building tooling around the model.
```swift
// Synchronous — model loading happens here
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(variant: .dihard3)
let engine = try LSEENDInferenceHelper(
    descriptor: descriptor,
    computeUnits: .cpuOnly  // default
)
```

```swift
// From raw samples + sample rate (resamples internally if needed)
let result: LSEENDInferenceResult = try engine.infer(samples: audio, sampleRate: 16000)

// From a file (reads and resamples to targetSampleRate)
let result: LSEENDInferenceResult = try engine.infer(audioFileURL: url)
```

```swift
// Create a session (inputSampleRate must equal engine.targetSampleRate)
let session = try engine.createSession(inputSampleRate: engine.targetSampleRate)

// Or with a caller-owned mel spectrogram (for thread isolation)
let mel = NeMoMelSpectrogram(...)
let session = try engine.createSession(inputSampleRate: engine.targetSampleRate, melSpectrogram: mel)
```

| Property | Type | Description |
|---|---|---|
| `descriptor` | `LSEENDModelDescriptor` | Model variant and file paths |
| `computeUnits` | `MLComputeUnits` | CoreML compute units |
| `metadata` | `LSEENDModelMetadata` | Decoded model configuration |
| `model` | `MLModel` | Loaded CoreML model |
| `targetSampleRate` | `Int` | Expected input sample rate |
| `modelFrameHz` | `Double` | Output frame rate |
| `streamingLatencySeconds` | `Double` | Minimum latency before first output |
| `decodeMaxSpeakers` | `Int` | Total output slots including boundary tracks |
A stateful streaming session created by LSEENDInferenceHelper.createSession(inputSampleRate:). Maintains all six recurrent state tensors across calls.
Not thread-safe. All calls must be serialized.
```swift
let session = try engine.createSession(inputSampleRate: 8000)

// Feed audio incrementally
while let chunk = audioSource.next() {
    if let update = try session.pushAudio(chunk) {
        // update.probabilities — committed, final frames
        // update.previewProbabilities — speculative frames, will be refined
    }
}

// Flush remaining frames and close the session
if let final = try session.finalize() {
    // Process any remaining frames
}

// Get the complete assembled result at any point
let result: LSEENDInferenceResult = session.snapshot()
```

| Method | Returns | Description |
|---|---|---|
| `pushAudio(_ chunk: [Float])` | `LSEENDStreamingUpdate?` | Feed audio; returns committed + preview frames, or `nil` if no frames ready |
| `finalize()` | `LSEENDStreamingUpdate?` | Flush remaining frames and seal the session |
| `snapshot()` | `LSEENDInferenceResult` | Assemble full result from all frames emitted so far |
| Property | Type | Description |
|---|---|---|
| `inputSampleRate` | `Int` | Sample rate this session was created with |
A row-major 2D Float matrix used throughout LS-EEND. Rows are time frames; columns are speakers or feature dimensions.
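To make the row-major layout concrete, here is a minimal self-contained sketch of how a frames-by-speakers matrix indexes into its flat storage. `RowMajorMatrix` and its `subscript` are hypothetical illustrations, not the library's actual `LSEENDMatrix` API.

```swift
// Hypothetical stand-in for a row-major 2D Float matrix like LSEENDMatrix.
struct RowMajorMatrix {
    let rows: Int        // time frames
    let cols: Int        // speakers or feature dimensions
    let values: [Float]  // rows * cols elements, stored row by row

    // Row-major offset of element (frame, column) is frame * cols + column
    subscript(frame: Int, column: Int) -> Float {
        values[frame * cols + column]
    }
}

// 2 frames x 3 speakers: frame 0 is [0, 1, 2], frame 1 is [10, 11, 12]
let m = RowMajorMatrix(rows: 2, cols: 3, values: [0, 1, 2, 10, 11, 12])
let p = m[1, 2]  // speaker 2's value at frame 1
```

The same offset arithmetic applies whether the columns are speaker tracks (probabilities) or feature dimensions.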
Output from LSEENDInferenceHelper.infer(...) or LSEENDStreamingSession.snapshot().
| Property | Type | Description |
|---|---|---|
| `logits` | `LSEENDMatrix` | Speaker logits, boundary tracks removed. Shape: [frames, realOutputDim] |
| `probabilities` | `LSEENDMatrix` | Sigmoid of logits. Shape: [frames, realOutputDim] |
| `fullLogits` | `LSEENDMatrix` | Raw logits including boundary tracks. Shape: [frames, fullOutputDim] |
| `fullProbabilities` | `LSEENDMatrix` | Sigmoid of fullLogits |
| `frameHz` | `Double` | Output frame rate in Hz |
| `durationSeconds` | `Double` | Duration of input audio processed |
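The logits-to-probabilities relationship above is a plain element-wise sigmoid. A minimal sketch of that step, plus the kind of thresholding the timeline layer performs on top of it (the `sigmoid` helper and the 0.5 cutoff here are illustrative, not the library's internals):

```swift
import Foundation

// Element-wise sigmoid: maps a logit to a probability in (0, 1)
func sigmoid(_ x: Float) -> Float { 1 / (1 + exp(-x)) }

// One frame of per-speaker logits (3 speaker tracks)
let frameLogits: [Float] = [2.0, -1.5, 0.0]
let probs = frameLogits.map(sigmoid)

// Indices of speakers at or above a 0.5 activity threshold
let active = probs.enumerated().filter { $0.element >= 0.5 }.map { $0.offset }
```

A logit of 0 maps to exactly 0.5, so the default threshold marks it active; positive logits mean "speaking", negative logits mean "silent".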
Returned by LSEENDStreamingSession.pushAudio(_:) and finalize(). Contains two regions:
| Property | Type | Description |
|---|---|---|
| `startFrame` | `Int` | Frame index where committed region begins |
| `logits` | `LSEENDMatrix` | Committed speaker logits (boundary tracks removed) |
| `probabilities` | `LSEENDMatrix` | Committed speaker probabilities |
| `previewStartFrame` | `Int` | Frame index where preview region begins |
| `previewLogits` | `LSEENDMatrix` | Speculative logits for buffered-but-unconfirmed frames |
| `previewProbabilities` | `LSEENDMatrix` | Speculative probabilities (will be refined by future audio) |
| `frameHz` | `Double` | Output frame rate |
| `durationSeconds` | `Double` | Cumulative audio duration fed so far |
| `totalEmittedFrames` | `Int` | Running total of committed frames across all updates |
Committed vs preview: Committed frames have passed through the full causal encoder and are final. Preview frames are decoded by zero-padding the pending encoder state — they are a speculative "look ahead" that will be updated by the next pushAudio call.
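The practical consequence for callers: append committed frames once and keep them, but throw away the previous preview every time a new update arrives. A self-contained sketch of that accumulation pattern, using a local `Update` struct as a stand-in for `LSEENDStreamingUpdate` (plain `[[Float]]` frames instead of the real matrix type):

```swift
// Hypothetical stand-in for LSEENDStreamingUpdate, illustration only.
struct Update {
    let committed: [[Float]]  // final frames: append-only
    let preview: [[Float]]    // speculative frames: replaced on every update
}

var timeline: [[Float]] = []  // committed frames accumulated so far
var preview: [[Float]] = []   // overwritten every update

func apply(_ update: Update) {
    timeline.append(contentsOf: update.committed)  // committed frames never change
    preview = update.preview                       // preview is always superseded
}

apply(Update(committed: [[0.9], [0.8]], preview: [[0.5]]))
apply(Update(committed: [[0.7]], preview: [[0.4], [0.3]]))
// timeline now holds 3 committed frames; only the latest preview survives
```

Rendering `timeline + preview` gives a low-latency view; rendering `timeline` alone gives only frames that will never be revised.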
Feature extraction is internal. LSEENDDiarizer and LSEENDInferenceHelper handle it automatically.
```swift
public typealias LSEENDVariant = ModelNames.LSEEND.Variant

// Cases
LSEENDVariant.ami       // rawValue: "AMI"
LSEENDVariant.callhome  // rawValue: "CALLHOME"
LSEENDVariant.dihard2   // rawValue: "DIHARD II"
LSEENDVariant.dihard3   // rawValue: "DIHARD III"
```

| Property | Type | Description |
|---|---|---|
| `rawValue` | `String` | Dataset name string (e.g. "DIHARD III") |
| `description` | `String` | Same as rawValue (CustomStringConvertible) |
| `id` | `String` | Same as rawValue (Identifiable) |
| `name` | `String` | Internal checkpoint name (e.g. "ls_eend_dih3_step") |
| `stem` | `String` | "<rawValue>/<name>" — path prefix within the repo |
| `modelFile` | `String` | Relative path to the .mlmodelc file |
| `configFile` | `String` | Relative path to the .json metadata file |
| `fileNames` | `[String]` | [modelFile, configFile] |
Locates the CoreML model and metadata JSON for a variant.
```swift
// Download from HuggingFace (cached after first call)
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(
    variant: .dihard3,                // default
    cacheDirectory: customURL,        // optional; defaults to ~/Library/Application Support/FluidAudio/Models
    computeUnits: .cpuOnly,           // optional
    progressHandler: { progress in }  // optional
)

// From explicit local paths
let descriptor = LSEENDModelDescriptor(
    variant: .dihard3,
    modelURL: URL(fileURLWithPath: "/path/to/model.mlmodelc"),
    metadataURL: URL(fileURLWithPath: "/path/to/metadata.json")
)
```

| Property | Type | Description |
|---|---|---|
| `variant` | `LSEENDVariant` | Model variant |
| `modelURL` | `URL` | Path to .mlmodelc or .mlpackage |
| `metadataURL` | `URL` | Path to JSON metadata file |
Decoded from the JSON file at descriptor.metadataURL. Read via engine.metadata.
| Property | Type | Description |
|---|---|---|
| `realOutputDim` | `Int` | Usable speaker tracks (fullOutputDim - 2) |
| `frameHz` | `Double` | Output frame rate (frames per second) |
| `targetSampleRate` | `Int` | Required audio sample rate |
| `streamingLatencySeconds` | `Double` | Minimum startup latency before the first stable output |
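For orientation only, a metadata file with those four fields might look roughly like this. The key names are assumed to mirror the Swift property names and the values are illustrative; the actual JSON on disk may use different keys and contain more fields.

```json
{
  "realOutputDim": 10,
  "frameHz": 10.0,
  "targetSampleRate": 8000,
  "streamingLatencySeconds": 1.0
}
```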
LSEENDEvaluation provides RTTM parsing/writing and DER computation for benchmarks. If you need detailed scoring workflows, read the source or move that material into a separate evaluation-specific doc.
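As context for the DER figures quoted in the variant descriptions, diarization error rate is conventionally defined as:

```
DER = (missed speech + false alarm + speaker confusion) / total reference speech time
```

A collar, when configured, excludes a small window around each reference segment boundary from scoring, which is why collar settings change reported DER.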
All LS-EEND errors conform to LocalizedError and are thrown as LSEENDError.
| Case | Thrown when |
|---|---|
| `.invalidMetadata(String)` | Metadata JSON is malformed or contains invalid values |
| `.invalidMatrixShape(String)` | Matrix dimensions are mismatched or negative |
| `.unsupportedAudio(String)` | Wrong sample rate, empty buffer, or finalized session |
| `.modelPredictionFailed(String)` | CoreML forward pass failed, or model not initialized |
| `.missingFeature(String)` | Required output tensor absent from CoreML prediction |
| `.invalidPath(String)` | File path cannot be resolved |
| `.modelLoadFailed(String)` | CoreML model could not be loaded or compiled |
```swift
do {
    let timeline = try diarizer.processComplete(audioFileURL: url)
} catch let error as LSEENDError {
    switch error {
    case .unsupportedAudio(let message): print("Audio problem: \(message)")
    case .modelLoadFailed(let message): print("Model problem: \(message)")
    default: print(error.localizedDescription)
    }
}
```

```swift
let diarizer = LSEENDDiarizer()
try await diarizer.initialize(variant: .ami)
let timeline = try diarizer.processComplete(audioFileURL: URL(fileURLWithPath: "meeting.wav"))
for segment in timeline.allSegments {
    print("Speaker \(segment.speakerIndex): \(segment.startTime)s–\(segment.endTime)s")
}
```

```swift
let diarizer = LSEENDDiarizer(computeUnits: .cpuOnly)
try await diarizer.initialize(variant: .dihard3)

// Feed 8kHz mono chunks from AVAudioEngine
audioEngine.installTap(onBus: 0, bufferSize: 1600, format: format) { buffer, _ in
    let samples = Array(UnsafeBufferPointer(
        start: buffer.floatChannelData![0], count: Int(buffer.frameLength)))
    diarizer.addAudio(samples)
    if let update = try? diarizer.process() {
        DispatchQueue.main.async { updateUI(diarizer.timeline) }
    }
}
```

```swift
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(variant: .callhome)
let engine = try LSEENDInferenceHelper(descriptor: descriptor)
let session = try engine.createSession(inputSampleRate: engine.targetSampleRate)

for chunk in chunkedAudio(samples, chunkSize: 800) {
    guard let update = try session.pushAudio(chunk) else { continue }
    // Committed frames: update.probabilities [newFrames × speakers]
    // Preview frames: update.previewProbabilities [previewFrames × speakers]
}

let final = try session.finalize()
let result = session.snapshot()  // LSEENDInferenceResult
```

```bash
# Diarize a single file (default variant: dihard3)
swift run fluidaudiocli lseend audio.wav
swift run fluidaudiocli lseend audio.wav --variant callhome
swift run fluidaudiocli lseend audio.wav --threshold 0.4 --median-width 5 --output result.json

# Benchmark on AMI (downloads dataset automatically)
swift run fluidaudiocli lseend-benchmark --auto-download --variant ami
swift run fluidaudiocli lseend-benchmark --variant callhome --threshold 0.35 --collar 0.25
swift run fluidaudiocli lseend-benchmark --variant dihard3 --output results.json --max-files 10
```

| Flag | Default | Description |
|---|---|---|
| `--variant` | `dihard3` | `ami` \| `callhome` \| `dihard2` \| `dihard3` |
| `--threshold` | `0.5` | Speaker activity binarization threshold |
| `--median-width` | `1` | Median filter width in frames (1 = disabled) |
| `--collar` | `0.0` | Collar in seconds around transitions (benchmark only) |
| `--onset` | — | Override onset threshold separately from `--threshold` |
| `--offset` | — | Override offset threshold separately from `--threshold` |
| `--pad-onset` | `0` | Frames prepended to each segment |
| `--pad-offset` | `0` | Frames appended to each segment |
| `--min-duration-on` | `0.0` | Minimum segment duration in seconds |
| `--min-duration-off` | `0.0` | Minimum gap duration in seconds |
| `--output` | — | Path to save JSON results |
| `--auto-download` | — | Auto-download AMI dataset if missing (benchmark only) |
| `--max-files` | — | Limit number of files processed (benchmark only) |
| `--verbose` | — | Print per-meeting debug output (benchmark only) |
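To illustrate what `--median-width` does conceptually, here is a self-contained majority-vote median filter over a binarized per-frame activity track. This is an illustrative sketch, not the CLI's actual implementation:

```swift
// Smooth per-frame speaker activity with a sliding majority vote.
// width is the window size in frames; width 1 means no filtering.
func medianFilter(_ activity: [Bool], width: Int) -> [Bool] {
    guard width > 1 else { return activity }  // width 1 = disabled
    let half = width / 2
    return activity.indices.map { i in
        let lo = max(0, i - half)
        let hi = min(activity.count - 1, i + half)
        let activeCount = (lo...hi).filter { activity[$0] }.count
        // Majority vote within the window removes single-frame flips
        return activeCount * 2 > (hi - lo + 1)
    }
}

// A single-frame dropout in the middle of a speech run is smoothed away
let noisy = [true, true, false, true, true]
let smoothed = medianFilter(noisy, width: 3)
```

Because a Boolean majority vote equals a median over {0, 1}, this matches the usual "median filter in frames" description; wider windows suppress longer glitches at the cost of blunting real short segments.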
Hosted at `FluidInference/lseend-coreml`. Downloaded automatically on first use and cached at `~/Library/Application Support/FluidAudio/Models/`.
| Variant | Model file | Config file |
|---|---|---|
| `.ami` | `AMI/ls_eend_ami_step.mlmodelc` | `AMI/ls_eend_ami_step.json` |
| `.callhome` | `CALLHOME/ls_eend_callhome_step.mlmodelc` | `CALLHOME/ls_eend_callhome_step.json` |
| `.dihard2` | `DIHARD II/ls_eend_dih2_step.mlmodelc` | `DIHARD II/ls_eend_dih2_step.json` |
| `.dihard3` | `DIHARD III/ls_eend_dih3_step.mlmodelc` | `DIHARD III/ls_eend_dih3_step.json` |
Pre-fetch before running:
```bash
swift run fluidaudiocli download --repo lseend
```

- LS-EEND Paper (arXiv 2410.06670) — Di Liang, Xiaofei Li. LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction. IEEE TASLP.
- LS-EEND GitHub Repository
- HuggingFace Models
- AMI Corpus
- CALLHOME Corpus
- DIHARD Challenge