introduce CoreML diarizer model on the SDK by Alex-Wengg · Pull Request #2 · FluidInference/FluidAudio

Alex-Wengg · 2025-06-25T04:26:27Z

SeamlessAudioSwift -> FluidAudioSwift
Removed SherpaOnnx dependencies and components
CoreML library introduction
Main code file of notice
https://github.com/FluidInference/FluidAudioSwift/blob/beta/Sources/FluidAudioSwift/DiarizerManager.swift
Github action to verify build compilation
Basic benchmarking https://github.com/FluidInference/FluidAudioSwift/blob/beta/Tests/FluidAudioSwiftTests/BenchmarkTests.swift

Sources/SeamlessAudioSwift/DiarizerManager.swift

Sources/SeamlessAudioSwift/SherpaOnnxDiarizerManager.swift

BREAKING CHANGE: Removed SherpaOnnx backend support - Remove SherpaOnnxWrapper and SherpaOnnxWrapperC targets - Remove SherpaOnnx.swift and SherpaOnnxDiarizerManager.swift - Remove SherpaOnnx integration tests - Update Package.swift to remove SherpaOnnx linker settings and targets - Update DiarizerManager.swift to use CoreML as default backend only - Clean up README.md to focus on CoreML backend - Fix iOS build issues with conditional UIKit imports - Update tests to only test CoreML backend (15 tests passing ✅) - Simplify API surface area and reduce binary size The package now provides a streamlined CoreML-only implementation for speaker diarization on Apple platforms.

- Updated package name and target names in Package.swift - Renamed source directory from Sources/SeamlessAudioSwift to Sources/FluidAudioSwift - Renamed test directory from Tests/SeamlessAudioSwiftTests to Tests/FluidAudioSwiftTests - Updated main Swift file from SeamlessAudioSwift.swift to FluidAudioSwift.swift - Updated import statements throughout codebase - Updated README.md with new package name and branding - Updated Xcode project references to use FluidAudioSwift - All tests passing (15/15) - Build successful

Package.swift

Sources/FluidAudioSwift/iOSCoreMLDiarizerManager.swift

Sources/FluidAudioSwift/CoreMLDiarizerManager.swift

- Add CITests.swift for fast CI testing without external dependencies - Add BenchmarkTests.swift for research-standard AMI corpus evaluation - Add GitHub Actions workflow with multi-job testing strategy - Add AMI download script for official corpus data - Add comprehensive benchmark documentation - Update BasicInitializationTests to work with real CoreML models Features: ✅ CI tests run in <1 second, no external deps ✅ Benchmark tests work with real AMI Meeting Corpus audio ✅ GitHub Actions tests on macOS with Xcode 15 ✅ Research-standard DER/JER metrics calculation ✅ Official AMI corpus integration (ES2002a, ES2003a) ✅ Performance testing and validation

…iftly-install

Sources/FluidAudioSwift/DiarizerManager.swift

Bharat0091 · 2025-06-26T08:54:57Z

Sources/FluidAudioSwift/DiarizerManager.swift

+            return nil
+        }
+
+        let embedding = embeddings[0]


getEmbedding model returns 3 embeddings, one for each speaker. why are we picking only first one ?

just to confirm we return 3 embeddings for the case of over lapping speakers ?

Scripts/download_ami_benchmark_data.sh

Bharat0091 · 2025-06-26T09:03:26Z

Sources/FluidAudioSwift/DiarizerManager.swift

+    }
+
+    private func getAnnotation(
+        annotation: inout [Segment: Int],


I was passing labelMapping so that even if the segmentation model thinks there are two speakers, but when we assignSpeaker from DB based on cosine distance, if both maps to same person, we want to consider them as one continous chunk. This is not an issue, if in UI we automatically merge continous speaker segments.

this seem like a minor issue but we could note of for later releases

Bharat0091 · 2025-06-26T09:07:39Z

Tests/FluidAudioSwiftTests/BenchmarkTests.swift

+            print("   Processing AMI IHM file \(index + 1)/\(amiData.samples.count): \(sample.id)")
+
+            do {
+                let predictedSegments = try await manager.performSegmentation(sample.audioSamples, sampleRate: sampleRate)


This will not return speaker_id right ?

it would be returning the raw index (0, 1, or 2) from the segmentation model output, not the actual speaker identities.

Tests/FluidAudioSwiftTests/BenchmarkTests.swift

Bharat0091

Overall looks good. Few minor comments.

Sources/FluidAudioSwift/DiarizerManager.swift

README.md

Sources/FluidAudioSwift/DiarizerManager.swift

BrandonWeng

### Why is this change needed?  Taking inspiration from the silero https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py Updating our segmentation implementation and supporitng streaming VAD ```bash %swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:02.812] [INFO] [VadManager] VAD model loaded successfully [00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation... [00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s [00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s [00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s [00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s [00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s [00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s [00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s [00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s [00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s [00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s [00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s [00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments... [00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s [00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events % swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:08.309] [INFO] [VadManager] VAD model loaded successfully [00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation... [00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s [00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s) [00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s) [00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s) [00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s) [00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s) [00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s) [00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s) % ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null - ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers built with Apple clang version 17.0.0 (clang-1700.0.13.3) ... libavutil 60. 8.100 / 60. 8.100 libavcodec 62. 11.100 / 62. 11.100 libavformat 62. 3.100 / 62. 3.100 libavdevice 62. 1.100 / 62. 1.100 libavfilter 11. 4.100 / 11. 4.100 libswscale 9. 1.100 / 9. 1.100 libswresample 6. 1.100 / 6. 1.100 [aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono Input #0, wav, from 'voiceink-issue-279.wav': Duration: 00:00:45.66, bitrate: 256 kb/s Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s Stream mapping: Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native)) Press [q] to stop, [?] for help Output #0, null, to 'pipe:': Metadata: encoder : Lavf62.3.100 Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s Metadata: encoder : Lavc62.11.100 pcm_s16le [silencedetect @ 0xb22c6c420] silence_start: 0 [silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364 [silencedetect @ 0xb22c6c420] silence_start: 2.305687 [silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125 [silencedetect @ 0xb22c6c420] silence_start: 7.579813 [silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125 [silencedetect @ 0xb22c6c420] silence_start: 15.845063 [silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687 [silencedetect @ 0xb22c6c420] silence_start: 18.692625 [silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813 [silencedetect @ 0xb22c6c420] silence_start: 30.367563 [silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445 [silencedetect @ 0xb22c6c420] silence_start: 41.454687 [silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125 [out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00 ```

introduce CoreML diarizer model on the SDK

### Why is this change needed?  Taking inspiration from the silero https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py Updating our segmentation implementation and supporitng streaming VAD ```bash %swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:02.812] [INFO] [VadManager] VAD model loaded successfully [00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation... [00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s [00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s [00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s [00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s [00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s [00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s [00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s [00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s [00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s [00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s [00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s [00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments... [00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s [00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events % swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:08.309] [INFO] [VadManager] VAD model loaded successfully [00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation... [00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s [00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s) [00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s) [00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s) [00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s) [00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s) [00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s) [00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s) % ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null - ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers built with Apple clang version 17.0.0 (clang-1700.0.13.3) ... libavutil 60. 8.100 / 60. 8.100 libavcodec 62. 11.100 / 62. 11.100 libavformat 62. 3.100 / 62. 3.100 libavdevice 62. 1.100 / 62. 1.100 libavfilter 11. 4.100 / 11. 4.100 libswscale 9. 1.100 / 9. 1.100 libswresample 6. 1.100 / 6. 1.100 [aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono Input #0, wav, from 'voiceink-issue-279.wav': Duration: 00:00:45.66, bitrate: 256 kb/s Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s Stream mapping: Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native)) Press [q] to stop, [?] for help Output #0, null, to 'pipe:': Metadata: encoder : Lavf62.3.100 Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s Metadata: encoder : Lavc62.11.100 pcm_s16le [silencedetect @ 0xb22c6c420] silence_start: 0 [silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364 [silencedetect @ 0xb22c6c420] silence_start: 2.305687 [silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125 [silencedetect @ 0xb22c6c420] silence_start: 7.579813 [silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125 [silencedetect @ 0xb22c6c420] silence_start: 15.845063 [silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687 [silencedetect @ 0xb22c6c420] silence_start: 18.692625 [silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813 [silencedetect @ 0xb22c6c420] silence_start: 30.367563 [silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445 [silencedetect @ 0xb22c6c420] silence_start: 41.454687 [silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125 [out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00 ```

## Summary This PR adds **experimental** Mandarin Chinese ASR support via the CTC zh-CN model and includes critical Swift 6 concurrency fixes for `SlidingWindowAsrManager`. > **⚠️ Experimental Feature**: CTC zh-CN Mandarin ASR is an early preview. The API and performance characteristics may change in future releases. ## Swift 6 Concurrency Fixes ### Fixed Issues - **Removed premature state mutations** in `processWindow()` that violated Swift 6 actor isolation - State updates (`accumulatedTokens`, `lastProcessedFrame`, `segmentIndex`, `processedChunks`) now occur **after** all async calls complete successfully - Prevents data races when async calls fail mid-execution ### Changes - `SlidingWindowAsrManager.processWindow()`: Moved state mutation to after async guard statements - Ensures atomic state updates only when processing succeeds ## CTC zh-CN Mandarin ASR Integration (Experimental) ### New Features #### Models - **CtcZhCnManager**: High-level API for Mandarin Chinese ASR using CTC decoder - **CtcZhCnModels**: Model management with int8/fp32 encoder variants - Int8: 571 MB (default) - FP32: 1.1 GB - Auto-downloads from HuggingFace: `FluidInference/parakeet-ctc-0.6b-zh-cn-coreml` #### CLI Commands ```bash # Transcribe Mandarin audio swift run fluidaudiocli ctc-zh-cn-transcribe audio.wav # Benchmark on THCHS-30 dataset (full 2,495 samples) swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download # Benchmark subset (100 samples for faster testing) swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download --samples 100 ``` #### Benchmark Results (THCHS-30 Full Test Set) **Full dataset** (2,495 samples): - **Mean CER**: 8.23% - **Median CER**: 6.45% - **CER = 0% (perfect)**: 435 samples (17.4%) - **Distribution**: 67.1% of samples <10% CER, 93.2% <20% CER - **Mean Latency**: 614 ms - **Mean RTFx**: 14.83x ### Dataset **THCHS-30** - Mandarin Chinese speech corpus from Tsinghua University - 30 hours of clean speech - 50 speakers - 2,495 test utterances (10 speakers, 250 unique sentences) - Content domain: News (not classical literature) - Source: http://www.openslr.org/18/ - HuggingFace: `FluidInference/THCHS-30-tests` ### Text Normalization CER calculation includes: - Chinese punctuation removal (，。！？、；：\u{201C}\u{201D}\u{2018}\u{2019}) - English punctuation removal (,.!?;:()[]{}\\<>"'-) - Arabic digit → Chinese character conversion (0→零, 1→一, etc.) - Whitespace normalization - Levenshtein distance calculation ## Devin Review Fixes ✅ Addressed all issues from [Devin code review](https://app.devin.ai/review/fluidinference/fluidaudio/pull/476): ### Review #1 (4 issues) 1. **✅ Fixed digit-to-Chinese conversion** - Added missing normalization (0→零, 1→一, etc.) that was inflating CER by ~1.66% 2. **✅ Added unit tests** - Created 13 comprehensive test cases for text normalization, CER calculation, and Levenshtein distance 3. **✅ Fixed CI dataset cache path** - Not applicable after CI workflow removal 4. **✅ Fixed CI model cache path** - Not applicable after CI workflow removal ### Review #2 (2 issues) 5. **✅ Fixed CER threshold mismatch** - Not applicable after CI workflow removal 6. **✅ Fixed saveResults NaN crash** - Added guard for empty results array to prevent division by zero ### Review #3 (2 issues) 7. **✅ Fixed FP32 encoder download** - Include both int8 and fp32 encoders in `requiredModels` set 8. **✅ Fixed AsrManager CTC-only handling** - Throw explicit error instead of routing to incompatible TDT decoder ### Additional Fixes - **✅ Fixed Unicode curly quotes** - Used escape sequences (`\u{201C}` etc.) in both source and tests - Added missing English punctuation removal - Added missing Chinese quotation mark handling ## Files Changed ### Swift 6 Concurrency - `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/SlidingWindowAsrManager.swift` - `Sources/FluidAudio/ASR/Parakeet/AsrManager.swift` (added .ctcZhCn case + error handling) ### CTC zh-CN Integration - `Sources/FluidAudio/ASR/Parakeet/CtcZhCnManager.swift` (new) - `Sources/FluidAudio/ASR/Parakeet/CtcZhCnModels.swift` (new) - `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnTranscribeCommand.swift` (new) - `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnBenchmark.swift` (new) - `Sources/FluidAudio/ModelNames.swift` (updated - both encoder variants) - `Documentation/Benchmarks.md` (updated - marked experimental) ### Tests - `Tests/FluidAudioTests/ASR/Parakeet/CtcZhCnTests.swift` (new - 13 test cases) ## Testing - [x] Swift 6 concurrency fixes pass existing tests - [x] CTC zh-CN transcription tested manually - [x] THCHS-30 full benchmark: 8.23% mean CER (2,495 samples) - [x] Unit tests: 13 test cases for normalization and CER (100% passing) - [x] Text normalization matches baseline exactly - [x] FP32 encoder download verified ## Notes - This PR is a clean rebase of #475 off main - Skipped conflicting decoder refactoring commit (superseded by #474) - **Experimental feature**: CTC zh-CN API may change in future releases - **No CI workflow**: Benchmarks are run manually for experimental features

Alex-Wengg added 2 commits June 24, 2025 23:22

introduce CoreML diarizer model on the SDK

637d99d

Comment changes and test revert

f1924dd

Alex-Wengg marked this pull request as draft June 25, 2025 04:34