LS-EEND (Long-Form Streaming End-to-End Neural Diarization) answers "who spoke when" in real time. A causal Conformer encoder with a retention mechanism feeds an online attractor decoder that tracks speaker identities frame by frame, without separate VAD, segmentation, or clustering.
Key specs:

- 4–10 simultaneous speakers depending on variant (see below)
- ~100 ms frame resolution (10 Hz output)
- Handles recordings up to one hour
- 8000 Hz input sample rate (automatic resampling via `processComplete(audioFileURL:)`)
- Frame-in/frame-out streaming with speculative preview frames
- CoreML-optimized for Apple Silicon

Limitations:

- 8000 Hz sample rate — lower audio fidelity than 16 kHz models
- Speaker identity is local to the recording; persistent speaker enrollment may be unreliable
- Variants are domain-specialized: using the wrong variant for a domain hurts accuracy
Each variant is a separate CoreML model trained on a specific corpus. Choose the one that best matches your audio.
**AMI**: Multi-speaker conference room recordings with close-talk and distant microphones. Best for: boardroom meetings, panel discussions, speakers in a shared physical space.

- DER (AMI test set): 20.76%
- Max speakers: 4

**CALLHOME**: Telephone conversations with codec noise and narrow bandwidth. Best for: call center recordings, customer service calls, telephony audio.

- DER (CALLHOME test set): 12.11%
- Max speakers: 7

**DIHARD II**: Dinner parties, clinical interviews, conference rooms, multi-channel arrays, child speech. Best for: challenging acoustics, heavy overlap, non-standard recording setups.

- DER (DIHARD II test set): 27.58%
- Max speakers: 10

**DIHARD III**: Podcasts, audiobooks, broadcast media, YouTube, field recordings — deliberately broad. Best for: unknown or mixed recording conditions; the safest general-purpose choice.

- DER (DIHARD III test set): 19.61%
- Max speakers: 10
```
LSEENDDiarizer.processComplete(_:sourceSampleRate:)
 |
 |-- normalizeSamplesLocked()       resample to 8 kHz when needed
 |-- processCompleteInternal()
 |     |
 |     |-- engine.createSession()   fresh streaming session
 |     |-- session.pushAudio()      run full buffer through LS-EEND
 |     |-- session.finalize()       flush remaining committed frames
 |     |-- DiarizerTimeline.addChunk()
 |     |-- DiarizerTimeline.finalize()
 |
 |-- return DiarizerTimeline
```
Under the hood, offline engine inference follows this path:
```
LSEENDInferenceHelper.infer(samples:sampleRate:)
 |
 |-- resampleIfNeeded()
 |-- offlineFeatureExtractor.extractFeatures(audio:)
 |     |
 |     |-- computeFlatTransposed()   STFT -> mel spectrogram
 |     |-- applyLogMelCumMeanNormalization()
 |     |-- spliceAndSubsample()
 |
 |-- createSession(inputSampleRate:)
 |-- session.ingestFeatures(features)
 |     |
 |     |-- FOR EACH MODEL FRAME:
 |     |     |
 |     |     |-- predictStep(frame:state:ingest:decode:)
 |     |     |     |
 |     |     |     |-- write frame into preallocated MLMultiArray
 |     |     |     |-- pass 6 recurrent state tensors:
 |     |     |     |     encRetKv, encRetScale, encConvCache,
 |     |     |     |     decRetKv, decRetScale, topBuffer
 |     |     |     |-- CoreML model.prediction()
 |     |     |     |-- read logits + next recurrent state tensors
 |     |     |
 |     |     |-- append committed full-output logits
 |
 |-- flushTail(from:pendingFrames:)   decode remaining tail frames
 |-- cropRealTracks()                 drop 2 boundary tracks
 |-- applyingSigmoid()                logits -> probabilities
 |-- snapshot()
 |-- return LSEENDInferenceResult
```
```
LSEENDDiarizer.addAudio(_:sourceSampleRate:)
 |
 |-- normalizeSamplesLocked()   resample to 8 kHz when needed
 |-- append to pendingAudio
 |
 v
LSEENDDiarizer.process()
 |
 |-- engine.createSession()     first call only
 |-- session.pushAudio(pendingAudio)
 |     |
 |     |-- featureExtractor.pushAudio()
 |     |     |
 |     |     |-- append raw audio to buffer
 |     |     |-- appendSTFTFrames()
 |     |     |-- applyLogMelCumMeanNormalization()
 |     |     |-- emitModelFrames()
 |     |
 |     |-- ingestFeatures()
 |     |     |
 |     |     |-- predictStep(... ingest=1, decode=0) for stable frames
 |     |     |-- update 6 recurrent state tensors each step
 |     |     |-- append committed full-output logits
 |     |
 |     |-- copy current state
 |     |-- flushTail(from:pendingFrames:) on copied state
 |     |      -> speculative preview logits
 |     |
 |     |-- cropRealTracks()    remove boundary tracks
 |     |-- applyingSigmoid()   logits -> probabilities
 |     |-- return committed + preview update
 |
 |-- DiarizerTimeline.addChunk()
 |-- return DiarizerTimelineUpdate
```
```
Sources/FluidAudio/Diarizer/LS-EEND/
├── LSEENDDiarizer.swift          # High-level Diarizer protocol implementation
├── LSEENDInference.swift         # LSEENDInferenceHelper, LSEENDStreamingSession
├── LSEENDFeatureExtraction.swift # Internal offline + streaming feature extraction
├── LSEENDSupport.swift           # Supporting data types, metadata, result structs, errors
└── LSEENDEvaluation.swift        # DER computation, RTTM parsing/writing, collar masking,
                                  # optimal speaker assignment
```
The primary entry point. Implements the Diarizer protocol — the same API as SortformerDiarizer.
```swift
// Simple init (all parameters optional)
let diarizer = LSEENDDiarizer(
    computeUnits: .cpuOnly,  // Default: .cpuOnly (fastest for this model)
    onsetThreshold: 0.5,     // Probability to start a speech segment
    offsetThreshold: 0.5,    // Probability to end a speech segment
    onsetPadFrames: 0,       // Frames prepended to each segment
    offsetPadFrames: 0,      // Frames appended to each segment
    minFramesOn: 0,          // Discard segments shorter than this
    minFramesOff: 0,         // Close gaps shorter than this
    maxStoredFrames: nil     // Cap on retained finalized frames (nil = unlimited)
)

// Or pass a DiarizerTimelineConfig directly
let config = DiarizerTimelineConfig(onsetThreshold: 0.4, onsetPadFrames: 1)
let diarizer = LSEENDDiarizer(computeUnits: .cpuOnly, timelineConfig: config)
```

```swift
// Download from HuggingFace (cached after first call)
try await diarizer.initialize(variant: .dihard3)  // default variant

// From a pre-built descriptor
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(variant: .ami)
try diarizer.initialize(descriptor: descriptor)

// From a pre-loaded engine
let engine = try LSEENDInferenceHelper(descriptor: descriptor)
diarizer.initialize(engine: engine)
```

```swift
// From a file URL (resamples to 8kHz automatically)
let timeline = try diarizer.processComplete(audioFileURL: audioURL)

// From raw samples (specify sample rate if it's not 8kHz already)
let timeline = try diarizer.processComplete(
    samples,
    sourceSampleRate: 8000,
    finalizeOnCompletion: true,
    progressCallback: { processed, total, chunks in
        print("\(processed)/\(total) samples")
    }
)
```

The `sourceSampleRate` argument is only needed if the audio samples are not already at 8 kHz.
```swift
// Push audio in chunks
diarizer.addAudio(audioChunk, sourceSampleRate: 8000)  // [Float] or any Collection<Float>
if let update = try diarizer.process() {
    for segment in update.finalizedSegments { ... }
    for tentative in update.tentativeSegments { ... }
}

// Convenience: add + process in one call
if let update = try diarizer.process(samples: audioChunk) { ... }

// Flush remaining frames at end of stream
try diarizer.finalizeSession()
let finalTimeline = diarizer.timeline
```

Use speaker enrollment to warm LS-EEND with a known speaker before the live stream starts. Enrollment keeps the active streaming session, resets the visible timeline back to frame 0, and preserves the speaker name inside the DiarizerTimeline.
```swift
let speaker = try diarizer.enrollSpeaker(
    withSamples: enrollmentAudio,
    sourceSampleRate: 16_000,
    named: "Alice",
    overwritingAssignedSpeakerName: false
)

// Later complete-buffer runs can keep the enrolled session state.
let timeline = try diarizer.processComplete(
    meetingAudio,
    sourceSampleRate: 16_000,
    keepingEnrolledSpeakers: true
)
```

Notes:

- Enrollment is per diarizer instance. Recreate or `reset()` the diarizer to start a fresh session.
- Enrollment can help with live identity continuity, but it is still less reliable than the WeSpeaker/Pyannote speaker database.
- Speaker slots are still chronological. Use `overwritingAssignedSpeakerName: false` if you want enrollment to fail instead of replacing the name on an already-named slot.
| Property | Type | Description |
|---|---|---|
| `timeline` | `DiarizerTimeline` | Accumulated finalized results |
| `isAvailable` | `Bool` | Whether the model is loaded |
| `numFramesProcessed` | `Int` | Total committed frames processed |
| `targetSampleRate` | `Int?` | Expected input sample rate (8000) |
| `modelFrameHz` | `Double?` | Output frame rate (~10.0 Hz) |
| `numSpeakers` | `Int?` | Real speaker track count (`realOutputDim`) |
| `streamingLatencySeconds` | `Double?` | Minimum latency before first frame |
| `decodeMaxSpeakers` | `Int?` | Total model output slots (including boundary tracks) |
| `computeUnits` | `MLComputeUnits` | CoreML compute units |
| `timelineConfig` | `DiarizerTimelineConfig` | Current post-processing config |
```swift
diarizer.reset()    // Reset streaming state for a new audio stream (keeps model loaded)
diarizer.cleanup()  // Release all resources including the loaded model
```

Lower-level engine for direct CoreML inference. Use this when you need access to raw logits, want to manage sessions manually, or are building tooling around the model.
```swift
// Synchronous — model loading happens here
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(variant: .dihard3)
let engine = try LSEENDInferenceHelper(
    descriptor: descriptor,
    computeUnits: .cpuOnly  // default
)
```

```swift
// From raw samples + sample rate (resamples internally if needed)
let result: LSEENDInferenceResult = try engine.infer(samples: audio, sampleRate: 16000)

// From a file (reads and resamples to targetSampleRate)
let result: LSEENDInferenceResult = try engine.infer(audioFileURL: url)
```

```swift
// Create a session (inputSampleRate must equal engine.targetSampleRate)
let session = try engine.createSession(inputSampleRate: engine.targetSampleRate)

// Or with a caller-owned mel spectrogram (for thread isolation)
let mel = NeMoMelSpectrogram(...)
let session = try engine.createSession(inputSampleRate: engine.targetSampleRate, melSpectrogram: mel)
```

| Property | Type | Description |
|---|---|---|
| `descriptor` | `LSEENDModelDescriptor` | Model variant and file paths |
| `computeUnits` | `MLComputeUnits` | CoreML compute units |
| `metadata` | `LSEENDModelMetadata` | Decoded model configuration |
| `model` | `MLModel` | Loaded CoreML model |
| `targetSampleRate` | `Int` | Expected input sample rate |
| `modelFrameHz` | `Double` | Output frame rate |
| `streamingLatencySeconds` | `Double` | Minimum latency before first output |
| `decodeMaxSpeakers` | `Int` | Total output slots including boundary tracks |
A stateful streaming session created by LSEENDInferenceHelper.createSession(inputSampleRate:). Maintains all six recurrent state tensors across calls.
Not thread-safe. All calls must be serialized.
```swift
let session = try engine.createSession(inputSampleRate: 8000)

// Feed audio incrementally
while let chunk = audioSource.next() {
    if let update = try session.pushAudio(chunk) {
        // update.probabilities — committed, final frames
        // update.previewProbabilities — speculative frames, will be refined
    }
}

// Flush remaining frames and close the session
if let final = try session.finalize() {
    // Process any remaining frames
}

// Get the complete assembled result at any point
let result: LSEENDInferenceResult = session.snapshot()
```

| Method | Returns | Description |
|---|---|---|
| `pushAudio(_ chunk: [Float])` | `LSEENDStreamingUpdate?` | Feed audio; returns committed + preview frames, or `nil` if no frames ready |
| `finalize()` | `LSEENDStreamingUpdate?` | Flush remaining frames and seal the session |
| `snapshot()` | `LSEENDInferenceResult` | Assemble full result from all frames emitted so far |
| Property | Type | Description |
|---|---|---|
| `inputSampleRate` | `Int` | Sample rate this session was created with |
A row-major 2D Float matrix used throughout LS-EEND. Rows are time frames; columns are speakers or feature dimensions.
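To make the row-major layout concrete, here is a minimal self-contained sketch of how a frames-by-speakers matrix indexes into its flat storage. `RowMajorMatrix` and its `subscript` are hypothetical illustrations, not the library's actual `LSEENDMatrix` API.

```swift
// Hypothetical stand-in for a row-major 2D Float matrix like LSEENDMatrix.
struct RowMajorMatrix {
    let rows: Int        // time frames
    let cols: Int        // speakers or feature dimensions
    let values: [Float]  // rows * cols elements, stored row by row

    // Row-major offset of element (frame, column) is frame * cols + column
    subscript(frame: Int, column: Int) -> Float {
        values[frame * cols + column]
    }
}

// 2 frames x 3 speakers: frame 0 is [0, 1, 2], frame 1 is [10, 11, 12]
let m = RowMajorMatrix(rows: 2, cols: 3, values: [0, 1, 2, 10, 11, 12])
let p = m[1, 2]  // speaker 2's value at frame 1
```

The same offset arithmetic applies whether the columns are speaker tracks (probabilities) or feature dimensions.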
Output from LSEENDInferenceHelper.infer(...) or LSEENDStreamingSession.snapshot().
| Property | Type | Description |
|---|---|---|
| `logits` | `LSEENDMatrix` | Speaker logits, boundary tracks removed. Shape: [frames, realOutputDim] |
| `probabilities` | `LSEENDMatrix` | Sigmoid of logits. Shape: [frames, realOutputDim] |
| `fullLogits` | `LSEENDMatrix` | Raw logits including boundary tracks. Shape: [frames, fullOutputDim] |
| `fullProbabilities` | `LSEENDMatrix` | Sigmoid of fullLogits |
| `frameHz` | `Double` | Output frame rate in Hz |
| `durationSeconds` | `Double` | Duration of input audio processed |
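The logits-to-probabilities relationship above is a plain element-wise sigmoid. A minimal sketch of that step, plus the kind of thresholding the timeline layer performs on top of it (the `sigmoid` helper and the 0.5 cutoff here are illustrative, not the library's internals):

```swift
import Foundation

// Element-wise sigmoid: maps a logit to a probability in (0, 1)
func sigmoid(_ x: Float) -> Float { 1 / (1 + exp(-x)) }

// One frame of per-speaker logits (3 speaker tracks)
let frameLogits: [Float] = [2.0, -1.5, 0.0]
let probs = frameLogits.map(sigmoid)

// Indices of speakers at or above a 0.5 activity threshold
let active = probs.enumerated().filter { $0.element >= 0.5 }.map { $0.offset }
```

A logit of 0 maps to exactly 0.5, so the default threshold marks it active; positive logits mean "speaking", negative logits mean "silent".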
Returned by LSEENDStreamingSession.pushAudio(_:) and finalize(). Contains two regions:
| Property | Type | Description |
|---|---|---|
| `startFrame` | `Int` | Frame index where committed region begins |
| `logits` | `LSEENDMatrix` | Committed speaker logits (boundary tracks removed) |
| `probabilities` | `LSEENDMatrix` | Committed speaker probabilities |
| `previewStartFrame` | `Int` | Frame index where preview region begins |
| `previewLogits` | `LSEENDMatrix` | Speculative logits for buffered-but-unconfirmed frames |
| `previewProbabilities` | `LSEENDMatrix` | Speculative probabilities (will be refined by future audio) |
| `frameHz` | `Double` | Output frame rate |
| `durationSeconds` | `Double` | Cumulative audio duration fed so far |
| `totalEmittedFrames` | `Int` | Running total of committed frames across all updates |
Committed vs preview: Committed frames have passed through the full causal encoder and are final. Preview frames are decoded by zero-padding the pending encoder state — they are a speculative "look ahead" that will be updated by the next pushAudio call.
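The practical consequence for callers: append committed frames once and keep them, but throw away the previous preview every time a new update arrives. A self-contained sketch of that accumulation pattern, using a local `Update` struct as a stand-in for `LSEENDStreamingUpdate` (plain `[[Float]]` frames instead of the real matrix type):

```swift
// Hypothetical stand-in for LSEENDStreamingUpdate, illustration only.
struct Update {
    let committed: [[Float]]  // final frames: append-only
    let preview: [[Float]]    // speculative frames: replaced on every update
}

var timeline: [[Float]] = []  // committed frames accumulated so far
var preview: [[Float]] = []   // overwritten every update

func apply(_ update: Update) {
    timeline.append(contentsOf: update.committed)  // committed frames never change
    preview = update.preview                       // preview is always superseded
}

apply(Update(committed: [[0.9], [0.8]], preview: [[0.5]]))
apply(Update(committed: [[0.7]], preview: [[0.4], [0.3]]))
// timeline now holds 3 committed frames; only the latest preview survives
```

Rendering `timeline + preview` gives a low-latency view; rendering `timeline` alone gives only frames that will never be revised.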
Feature extraction is internal. LSEENDDiarizer and LSEENDInferenceHelper handle it automatically.
```swift
public typealias LSEENDVariant = ModelNames.LSEEND.Variant

// Cases
LSEENDVariant.ami       // rawValue: "AMI"
LSEENDVariant.callhome  // rawValue: "CALLHOME"
LSEENDVariant.dihard2   // rawValue: "DIHARD II"
LSEENDVariant.dihard3   // rawValue: "DIHARD III"
```

| Property | Type | Description |
|---|---|---|
| `rawValue` | `String` | Dataset name string (e.g. "DIHARD III") |
| `description` | `String` | Same as rawValue (CustomStringConvertible) |
| `id` | `String` | Same as rawValue (Identifiable) |
| `name` | `String` | Internal checkpoint name (e.g. "ls_eend_dih3_step") |
| `stem` | `String` | "<rawValue>/<name>" — path prefix within the repo |
| `modelFile` | `String` | Relative path to the .mlmodelc file |
| `configFile` | `String` | Relative path to the .json metadata file |
| `fileNames` | `[String]` | [modelFile, configFile] |
Locates the CoreML model and metadata JSON for a variant.
```swift
// Download from HuggingFace (cached after first call)
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(
    variant: .dihard3,                // default
    cacheDirectory: customURL,        // optional; defaults to ~/Library/Application Support/FluidAudio/Models
    computeUnits: .cpuOnly,           // optional
    progressHandler: { progress in }  // optional
)

// From explicit local paths
let descriptor = LSEENDModelDescriptor(
    variant: .dihard3,
    modelURL: URL(fileURLWithPath: "/path/to/model.mlmodelc"),
    metadataURL: URL(fileURLWithPath: "/path/to/metadata.json")
)
```

| Property | Type | Description |
|---|---|---|
| `variant` | `LSEENDVariant` | Model variant |
| `modelURL` | `URL` | Path to .mlmodelc or .mlpackage |
| `metadataURL` | `URL` | Path to JSON metadata file |
Decoded from the JSON file at descriptor.metadataURL. Read via engine.metadata.
| Property | Type | Description |
|---|---|---|
| `realOutputDim` | `Int` | Usable speaker tracks (fullOutputDim - 2) |
| `frameHz` | `Double` | Output frame rate (frames per second) |
| `targetSampleRate` | `Int` | Required audio sample rate |
| `streamingLatencySeconds` | `Double` | Minimum startup latency before the first stable output |
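For orientation only, a metadata file with those four fields might look roughly like this. The key names are assumed to mirror the Swift property names and the values are illustrative; the actual JSON on disk may use different keys and contain more fields.

```json
{
  "realOutputDim": 10,
  "frameHz": 10.0,
  "targetSampleRate": 8000,
  "streamingLatencySeconds": 1.0
}
```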
LSEENDEvaluation provides RTTM parsing/writing and DER computation for benchmarks. If you need detailed scoring workflows, read the source or move that material into a separate evaluation-specific doc.
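As context for the DER figures quoted in the variant descriptions, diarization error rate is conventionally defined as:

```
DER = (missed speech + false alarm + speaker confusion) / total reference speech time
```

A collar, when configured, excludes a small window around each reference segment boundary from scoring, which is why collar settings change reported DER.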
All LS-EEND errors conform to LocalizedError and are thrown as LSEENDError.
| Case | Thrown when |
|---|---|
| `.invalidMetadata(String)` | Metadata JSON is malformed or contains invalid values |
| `.invalidMatrixShape(String)` | Matrix dimensions are mismatched or negative |
| `.unsupportedAudio(String)` | Wrong sample rate, empty buffer, or finalized session |
| `.modelPredictionFailed(String)` | CoreML forward pass failed, or model not initialized |
| `.missingFeature(String)` | Required output tensor absent from CoreML prediction |
| `.invalidPath(String)` | File path cannot be resolved |
| `.modelLoadFailed(String)` | CoreML model could not be loaded or compiled |
```swift
do {
    let timeline = try diarizer.processComplete(audioFileURL: url)
} catch let error as LSEENDError {
    switch error {
    case .unsupportedAudio(let message): print("Audio problem: \(message)")
    case .modelLoadFailed(let message): print("Model problem: \(message)")
    default: print(error.localizedDescription)
    }
}
```

```swift
let diarizer = LSEENDDiarizer()
try await diarizer.initialize(variant: .ami)
let timeline = try diarizer.processComplete(audioFileURL: URL(fileURLWithPath: "meeting.wav"))
for segment in timeline.allSegments {
    print("Speaker \(segment.speakerIndex): \(segment.startTime)s–\(segment.endTime)s")
}
```

```swift
let diarizer = LSEENDDiarizer(computeUnits: .cpuOnly)
try await diarizer.initialize(variant: .dihard3)

// Feed 8kHz mono chunks from AVAudioEngine
audioEngine.installTap(onBus: 0, bufferSize: 1600, format: format) { buffer, _ in
    let samples = Array(UnsafeBufferPointer(
        start: buffer.floatChannelData![0], count: Int(buffer.frameLength)))
    diarizer.addAudio(samples)
    if let update = try? diarizer.process() {
        DispatchQueue.main.async { updateUI(diarizer.timeline) }
    }
}
```

```swift
let descriptor = try await LSEENDModelDescriptor.loadFromHuggingFace(variant: .callhome)
let engine = try LSEENDInferenceHelper(descriptor: descriptor)
let session = try engine.createSession(inputSampleRate: engine.targetSampleRate)

for chunk in chunkedAudio(samples, chunkSize: 800) {
    guard let update = try session.pushAudio(chunk) else { continue }
    // Committed frames: update.probabilities [newFrames × speakers]
    // Preview frames: update.previewProbabilities [previewFrames × speakers]
}

let final = try session.finalize()
let result = session.snapshot()  // LSEENDInferenceResult
```

```bash
# Diarize a single file (default variant: dihard3)
swift run fluidaudiocli lseend audio.wav
swift run fluidaudiocli lseend audio.wav --variant callhome
swift run fluidaudiocli lseend audio.wav --threshold 0.4 --median-width 5 --output result.json

# Benchmark on AMI (downloads dataset automatically)
swift run fluidaudiocli lseend-benchmark --auto-download --variant ami
swift run fluidaudiocli lseend-benchmark --variant callhome --threshold 0.35 --collar 0.25
swift run fluidaudiocli lseend-benchmark --variant dihard3 --output results.json --max-files 10
```

| Flag | Default | Description |
|---|---|---|
| `--variant` | `dihard3` | `ami` \| `callhome` \| `dihard2` \| `dihard3` |
| `--threshold` | `0.5` | Speaker activity binarization threshold |
| `--median-width` | `1` | Median filter width in frames (1 = disabled) |
| `--collar` | `0.0` | Collar in seconds around transitions (benchmark only) |
| `--onset` | — | Override onset threshold separately from `--threshold` |
| `--offset` | — | Override offset threshold separately from `--threshold` |
| `--pad-onset` | `0` | Frames prepended to each segment |
| `--pad-offset` | `0` | Frames appended to each segment |
| `--min-duration-on` | `0.0` | Minimum segment duration in seconds |
| `--min-duration-off` | `0.0` | Minimum gap duration in seconds |
| `--output` | — | Path to save JSON results |
| `--auto-download` | — | Auto-download AMI dataset if missing (benchmark only) |
| `--max-files` | — | Limit number of files processed (benchmark only) |
| `--verbose` | — | Print per-meeting debug output (benchmark only) |
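To illustrate what `--median-width` does conceptually, here is a self-contained majority-vote median filter over a binarized per-frame activity track. This is an illustrative sketch, not the CLI's actual implementation:

```swift
// Smooth per-frame speaker activity with a sliding majority vote.
// width is the window size in frames; width 1 means no filtering.
func medianFilter(_ activity: [Bool], width: Int) -> [Bool] {
    guard width > 1 else { return activity }  // width 1 = disabled
    let half = width / 2
    return activity.indices.map { i in
        let lo = max(0, i - half)
        let hi = min(activity.count - 1, i + half)
        let activeCount = (lo...hi).filter { activity[$0] }.count
        // Majority vote within the window removes single-frame flips
        return activeCount * 2 > (hi - lo + 1)
    }
}

// A single-frame dropout in the middle of a speech run is smoothed away
let noisy = [true, true, false, true, true]
let smoothed = medianFilter(noisy, width: 3)
```

Because a Boolean majority vote equals a median over {0, 1}, this matches the usual "median filter in frames" description; wider windows suppress longer glitches at the cost of blunting real short segments.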
Hosted at `FluidInference/lseend-coreml`. Downloaded automatically on first use and cached at `~/Library/Application Support/FluidAudio/Models/`.
| Variant | Model file | Config file |
|---|---|---|
| `.ami` | `AMI/ls_eend_ami_step.mlmodelc` | `AMI/ls_eend_ami_step.json` |
| `.callhome` | `CALLHOME/ls_eend_callhome_step.mlmodelc` | `CALLHOME/ls_eend_callhome_step.json` |
| `.dihard2` | `DIHARD II/ls_eend_dih2_step.mlmodelc` | `DIHARD II/ls_eend_dih2_step.json` |
| `.dihard3` | `DIHARD III/ls_eend_dih3_step.mlmodelc` | `DIHARD III/ls_eend_dih3_step.json` |
Pre-fetch before running:
```bash
swift run fluidaudiocli download --repo lseend
```

- LS-EEND Paper (arXiv 2410.06670) — Di Liang, Xiaofei Li. LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction. IEEE TASLP.
- LS-EEND GitHub Repository
- HuggingFace Models
- AMI Corpus
- CALLHOME Corpus
- DIHARD Challenge