12 changes: 12 additions & 0 deletions Documentation/Diarization/GettingStarted.md
@@ -24,7 +24,19 @@ Pick the diarizer based on the workflow:
| Max speakers | 4 | No max | 10 |
| Benchmarks | Good | Poor | Best |
| Remembering speakers across meetings | Great | Best | Good |
| Pre-enrolled speaker mapping | Best | Good | Weak |

### Speaker Enrollment: Sortformer vs LS-EEND

For workflows that pre-enroll known speakers before live audio, Sortformer is the stronger choice:

- **Sortformer** auto-maps all speakers with high confidence, even when two voices are similar. It benefits from training on a large volume of real-world data and uses past context effectively through its speaker cache.
- **LS-EEND** can fail enrollment when voices are too similar (a "too close to existing speaker" collision). Its scores are bounded to roughly 0.27–0.73 because the model applies a sigmoid to cosine similarity internally. An external score-extraction plus global-assignment fallback avoids hard rejection but produces weaker mappings.
- **LS-EEND** is an end-to-end model, which makes it difficult to force speaker registration into a specific slot. There is no API for per-slot similarity outputs or explicit slot-lock assignment.

LS-EEND was trained primarily on simulated data (fine-tuned on real data), while Sortformer was trained on predominantly real-world data. This training data difference is the main reason for the enrollment accuracy gap.

See [LS-EEND.md](LS-EEND.md#enrollment-limitations-integration-feedback) and [Sortformer.md](Sortformer.md#enrollment-strengths-integration-feedback) for details.

## Quick Start

14 changes: 14 additions & 0 deletions Documentation/Diarization/LS-EEND.md
@@ -256,6 +256,20 @@ Notes:
- Enrollment can help with live identity continuity, but it is still less reliable than the WeSpeaker/Pyannote speaker database.
- Speaker slots are still chronological. Use `overwritingAssignedSpeakerName: false` if you want enrollment to fail instead of replacing the name on an already-named slot.

### Enrollment Limitations (Integration Feedback)

Real-world integration testing with 4-speaker audio reveals specific enrollment weaknesses compared to Sortformer:

**Score range:** LS-EEND scores are bounded between `sigmoid(-1)` and `sigmoid(1)`, approximately **0.27 to 0.73**. Internally the model applies a sigmoid to cosine similarity scores, so raw outputs will never reach the 0.9+ confidence levels that external post-processing might suggest.
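The bound follows directly from the sigmoid-over-cosine design: cosine similarity lives in `[-1, 1]`, so the post-sigmoid score can only span `sigmoid(-1)` to `sigmoid(1)`. A standalone arithmetic illustration (not the LS-EEND API):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Cosine similarity is bounded to [-1, 1], so a sigmoid applied to it
# can only produce values in [sigmoid(-1), sigmoid(1)].
lo, hi = sigmoid(-1.0), sigmoid(1.0)
print(f"score range: {lo:.3f} .. {hi:.3f}")
```

This is why a raw LS-EEND score near 0.73 is already at the ceiling, even though post-normalization may rescale it to look like 0.9+.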

**Close-voice slot collision:** When enrolling speakers one at a time (strict enrollment path), LS-EEND's internal collision logic can reject a speaker whose voice is too similar to an already-enrolled slot. In a 4-speaker test, 3 speakers enrolled with strong mapping (~0.9 post-normalized confidence), but the 4th failed due to "too close to existing speaker." Sortformer auto-mapped all 4 with high confidence in the same test.

**Score-extraction fallback is weaker:** An alternative integration strategy — extracting per-slot scores over a sample, building a full score matrix, then running global assignment (e.g. Hungarian algorithm + threshold) — avoids hard enrollment rejection but produces weaker results. Non-dominant speakers can drop to ~0.2 confidence and one speaker can dominate multiple slot assignments.
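The fallback described above reduces to a small global-assignment problem. The sketch below uses a hypothetical 4x4 score matrix (the values are illustrative, not measured), and for four speakers a brute-force search over permutations stands in for a real Hungarian-algorithm solver:

```python
from itertools import permutations

# Hypothetical score matrix: rows = enrolled speakers, cols = LS-EEND slots.
# Row 3 models a non-dominant speaker whose scores stay weak everywhere.
scores = [
    [0.71, 0.30, 0.28, 0.27],
    [0.29, 0.68, 0.31, 0.30],
    [0.30, 0.32, 0.65, 0.33],
    [0.28, 0.29, 0.34, 0.33],
]

THRESHOLD = 0.35  # flag assignments below this score as unreliable

def best_assignment(m):
    """Exhaustive global assignment; fine for <= 4 speakers.
    A Hungarian-algorithm solver would scale to larger matrices."""
    n = len(m)
    best, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(m[i][perm[i]] for i in range(n))
        if total > best_total:
            best, best_total = perm, total
    return best

perm = best_assignment(scores)
mapping = {i: (slot if scores[i][slot] >= THRESHOLD else None)
           for i, slot in enumerate(perm)}
print(mapping)  # speaker 3 maps below threshold and is left unresolved
```

The thresholding step is what turns "hard rejection" into "weak mapping": no speaker is refused outright, but low-scoring speakers end up unassigned or loosely assigned, matching the degraded results observed in testing.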

**Root cause:** LS-EEND is an end-to-end model, making it difficult to force speaker registration into a specific slot. There is currently no API for per-slot similarity outputs or explicit slot-lock assignment. Suppressing existing attractors may be a path forward, but this has not been validated.

**Training data gap:** Sortformer was trained on a large volume of real-world data, giving it stronger generalization for speaker identity. LS-EEND was trained primarily on simulated data and then fine-tuned on real data — the base model without fine-tuning performs poorly.

### Properties

| Property | Type | Description |
8 changes: 8 additions & 0 deletions Documentation/Diarization/Sortformer.md
@@ -491,6 +491,14 @@ Notes:
- Sortformer still uses chronological speaker slots, and it is still limited to four unique speakers.
- Use `overwritingAssignedSpeakerName: false` if you want enrollment to fail instead of replacing the name on an already-named slot.

### Enrollment Strengths (Integration Feedback)

In real-world 4-speaker integration testing, Sortformer's auto-mapping is consistently strong: all 4 speakers — including two with very similar voices — map with high confidence. This is the key advantage over LS-EEND for pre-enrolled speaker workflows.

**Why Sortformer wins here:** Sortformer was trained on a large volume of real-world data, which gives it better generalization for speaker disambiguation, and it makes effective use of past context through its speaker cache and FIFO mechanism.
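As rough intuition for the cache-plus-FIFO idea, the toy sketch below (an assumption-laden simplification, not Sortformer's actual implementation) promotes confidently attributed frames into a persistent speaker cache while a bounded FIFO holds only the most recent frames; both feed the model as context:

```python
from collections import deque

class StreamingContext:
    """Toy sketch of a speaker-cache + FIFO context buffer.
    Illustrates how past frames can stay available for speaker
    disambiguation after they fall out of the recent window."""
    def __init__(self, cache_size: int, fifo_size: int):
        self.speaker_cache = []               # confidently attributed frames
        self.cache_size = cache_size
        self.fifo = deque(maxlen=fifo_size)   # most recent frames only

    def push(self, frame, confidence: float, threshold: float = 0.8):
        self.fifo.append(frame)
        # Promote high-confidence frames into the long-lived cache
        if confidence >= threshold and len(self.speaker_cache) < self.cache_size:
            self.speaker_cache.append(frame)

    def context(self):
        # Model input = cached speaker frames + recent FIFO frames
        return self.speaker_cache + list(self.fifo)

ctx = StreamingContext(cache_size=4, fifo_size=3)
for i, conf in enumerate([0.9, 0.5, 0.95, 0.6, 0.7]):
    ctx.push(f"frame{i}", conf)
print(ctx.context())
```

Note how `frame0` survives in the context even after the FIFO has rotated past it; that persistence is what lets long-past evidence keep disambiguating similar voices.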

**LS-EEND comparison:** LS-EEND enrollment can fail when two speakers sound too similar; in the same 4-speaker test it rejected the 4th speaker due to a slot collision. Sortformer avoids this because its slot assignment is more tolerant of similar voices. See [LS-EEND Enrollment Limitations](LS-EEND.md#enrollment-limitations-integration-feedback) for details.

## References

- [NVIDIA Sortformer Paper](https://arxiv.org/abs/2409.06656)