introduce CoreML diarizer model on the SDK #2

Merged
Alex-Wengg merged 13 commits into main from beta
Jun 27, 2025

Conversation

@Alex-Wengg Alex-Wengg commented Jun 25, 2025

@Alex-Wengg Alex-Wengg marked this pull request as draft June 25, 2025 04:34
BREAKING CHANGE: Removed SherpaOnnx backend support
- Remove SherpaOnnxWrapper and SherpaOnnxWrapperC targets
- Remove SherpaOnnx.swift and SherpaOnnxDiarizerManager.swift
- Remove SherpaOnnx integration tests
- Update Package.swift to remove SherpaOnnx linker settings and targets
- Update DiarizerManager.swift to use CoreML as default backend only
- Clean up README.md to focus on CoreML backend
- Fix iOS build issues with conditional UIKit imports
- Update tests to only test CoreML backend (15 tests passing ✅)
- Simplify API surface area and reduce binary size

The package now provides a streamlined CoreML-only implementation
for speaker diarization on Apple platforms.
- Updated package name and target names in Package.swift
- Renamed source directory from Sources/SeamlessAudioSwift to Sources/FluidAudioSwift
- Renamed test directory from Tests/SeamlessAudioSwiftTests to Tests/FluidAudioSwiftTests
- Updated main Swift file from SeamlessAudioSwift.swift to FluidAudioSwift.swift
- Updated import statements throughout codebase
- Updated README.md with new package name and branding
- Updated Xcode project references to use FluidAudioSwift
- All tests passing (15/15)
- Build successful
@Alex-Wengg Alex-Wengg marked this pull request as ready for review June 25, 2025 18:14
@Alex-Wengg Alex-Wengg requested a review from BrandonWeng June 25, 2025 18:14
- Add CITests.swift for fast CI testing without external dependencies
- Add BenchmarkTests.swift for research-standard AMI corpus evaluation
- Add GitHub Actions workflow with multi-job testing strategy
- Add AMI download script for official corpus data
- Add comprehensive benchmark documentation
- Update BasicInitializationTests to work with real CoreML models

Features:
✅ CI tests run in <1 second, no external deps
✅ Benchmark tests work with real AMI Meeting Corpus audio
✅ GitHub Actions tests on macOS with Xcode 15
✅ Research-standard DER/JER metrics calculation
✅ Official AMI corpus integration (ES2002a, ES2003a)
✅ Performance testing and validation
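
For context on the DER metric listed above: diarization error rate is conventionally (missed speech + false alarm + speaker confusion) divided by total reference speech. The following is a minimal frame-level sketch, not the code in BenchmarkTests.swift. It assumes reference and hypothesis are already aligned to the same frame grid, that there is at most one speaker per frame, and that hypothesis labels have already been mapped to reference labels. The name `frameDER` and the `nil`-for-silence encoding are hypothetical.

```swift
/// Frame-level diarization error rate (simplified sketch).
/// Each element is the speaker label active in that frame, or nil for silence.
/// Assumption: hypothesis labels are already mapped onto reference labels.
func frameDER(reference: [Int?], hypothesis: [Int?]) -> Double? {
    precondition(reference.count == hypothesis.count, "frame grids must match")
    var missed = 0
    var falseAlarm = 0
    var confusion = 0
    var referenceSpeech = 0

    for (ref, hyp) in zip(reference, hypothesis) {
        if let r = ref {
            referenceSpeech += 1
            if let h = hyp {
                // Both say speech: an error only if the speakers differ.
                if r != h { confusion += 1 }
            } else {
                // Reference has speech, hypothesis is silent.
                missed += 1
            }
        } else if hyp != nil {
            // Hypothesis has speech where the reference is silent.
            falseAlarm += 1
        }
    }

    // DER is undefined when the reference contains no speech at all.
    guard referenceSpeech > 0 else { return nil }
    return Double(missed + falseAlarm + confusion) / Double(referenceSpeech)
}
```

Returning `nil` for a reference without speech avoids a division by zero; a real evaluation would also need overlap handling and a forgiveness collar, which this sketch leaves out.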
@Alex-Wengg Alex-Wengg force-pushed the beta branch 3 times, most recently from 545cc70 to 976a987 Compare June 25, 2025 23:34
return nil
}

let embedding = embeddings[0]

The getEmbedding model returns 3 embeddings, one for each speaker. Why are we picking only the first one?

Member Author

Just to confirm: do we return 3 embeddings for the case of overlapping speakers?

}

private func getAnnotation(
annotation: inout [Segment: Int],

I was passing labelMapping so that, even if the segmentation model thinks there are two speakers, when assignSpeaker maps both to the same person in the DB based on cosine distance, we treat them as one continuous chunk. This is not an issue if the UI automatically merges continuous speaker segments.

Member Author

This seems like a minor issue, but we could note it for later releases.
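
To make the merging idea in this thread concrete, here is a rough sketch of applying a label mapping and then collapsing neighbouring segments that resolve to the same identity. It is an illustration only: `SpeakerSegment`, `mergeAdjacent`, the tuple shape, and the `maxGap` parameter are hypothetical and do not reflect the `Segment` type or `getAnnotation` signature in this PR.

```swift
/// A segment labelled with a resolved identity (hypothetical type for this sketch).
struct SpeakerSegment: Equatable {
    var start: Double    // seconds
    var end: Double      // seconds
    var speaker: String  // identity after lookup, not the raw model index
}

/// Maps raw model indices to identities, then merges neighbours sharing an identity.
func mergeAdjacent(
    _ segments: [(start: Double, end: Double, rawIndex: Int)],
    labelMapping: [Int: String],
    maxGap: Double = 0.0
) -> [SpeakerSegment] {
    var merged: [SpeakerSegment] = []
    for seg in segments.sorted(by: { $0.start < $1.start }) {
        // Fall back to a placeholder name when the index has no known identity.
        let name = labelMapping[seg.rawIndex] ?? "unknown-\(seg.rawIndex)"
        if let last = merged.last, last.speaker == name, seg.start - last.end <= maxGap {
            // Same person and no gap larger than maxGap: extend the previous segment.
            merged[merged.count - 1].end = max(last.end, seg.end)
        } else {
            merged.append(SpeakerSegment(start: seg.start, end: seg.end, speaker: name))
        }
    }
    return merged
}
```

With a mapping such as `[0: "alice", 1: "alice", 2: "bob"]`, two back-to-back segments with raw indices 0 and 1 come out as a single "alice" segment, which is the behaviour described above whether it is done in the SDK or in the UI.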

print(" Processing AMI IHM file \(index + 1)/\(amiData.samples.count): \(sample.id)")

do {
let predictedSegments = try await manager.performSegmentation(sample.audioSamples, sampleRate: sampleRate)

This will not return speaker_id, right?

Member Author

It would return the raw index (0, 1, or 2) from the segmentation model output, not the actual speaker identity.
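
As an illustration of the gap between a raw index and a speaker identity, the sketch below resolves an embedding against a database of known speakers by cosine distance, as mentioned earlier in the review. It is not the SDK's implementation: the function names, the dictionary-based database, and the 0.5 threshold are all assumptions made for the example.

```swift
/// Cosine distance in [0, 2]; 0 means the vectors point the same way.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty, "embeddings must have equal, non-zero length")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    // Treat a zero vector as maximally uninformative rather than dividing by zero.
    return denominator == 0 ? 1 : 1 - dot / denominator
}

/// Returns the closest known speaker, or nil if nobody is within the threshold.
/// The threshold value is a placeholder, not a tuned number.
func assignSpeaker(
    embedding: [Float],
    database: [String: [Float]],
    threshold: Float = 0.5
) -> String? {
    var best: (id: String, distance: Float)?
    for (id, known) in database {
        let distance = cosineDistance(embedding, known)
        if distance <= threshold, distance < (best?.distance ?? .infinity) {
            best = (id, distance)
        }
    }
    return best?.id
}
```

A caller would run this once per raw index (0, 1, 2) using that index's embedding, which yields the kind of index-to-identity mapping discussed in the labelMapping thread.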

@Alex-Wengg Alex-Wengg requested a review from Bharat0091 June 26, 2025 14:50

@Bharat0091 Bharat0091 left a comment


Overall looks good. A few minor comments.

Member

@BrandonWeng BrandonWeng left a comment


@Alex-Wengg Alex-Wengg merged commit ed0ece8 into main Jun 27, 2025
1 check passed
BrandonWeng added a commit that referenced this pull request Sep 17, 2025
### Why is this change needed?

Taking inspiration from the Silero VAD utilities:
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporting streaming VAD.

```bash
% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
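
The Speech Start / Speech End events in the streaming log can be pictured as a small hysteresis state machine over per-chunk speech probabilities. The sketch below is a simplified illustration of that idea, not the VadManager code from this change. The type names, the two thresholds, the silence-chunk count, and the 0.256 s chunk length are assumptions chosen for the example.

```swift
/// Events emitted by the sketch; times are in seconds.
enum VadEvent: Equatable {
    case speechStart(Double)
    case speechEnd(Double)
}

/// Simplified streaming detector: start above `threshold`, end after
/// `minSilenceChunks` chunks below `negativeThreshold`.
struct StreamingVad {
    let threshold: Double
    let negativeThreshold: Double
    let minSilenceChunks: Int
    let chunkSeconds: Double

    private var triggered = false
    private var silentRun = 0
    private var chunkIndex = 0

    init(
        threshold: Double = 0.5,
        negativeThreshold: Double = 0.35,
        minSilenceChunks: Int = 2,
        chunkSeconds: Double = 0.256
    ) {
        self.threshold = threshold
        self.negativeThreshold = negativeThreshold
        self.minSilenceChunks = minSilenceChunks
        self.chunkSeconds = chunkSeconds
    }

    /// Feed one chunk's speech probability; returns an event when the state flips.
    mutating func process(probability: Double) -> VadEvent? {
        defer { chunkIndex += 1 }
        if probability >= threshold {
            // Speech resets any pending silence run.
            silentRun = 0
            if !triggered {
                triggered = true
                return .speechStart(Double(chunkIndex) * chunkSeconds)
            }
            return nil
        }
        if triggered && probability < negativeThreshold {
            silentRun += 1
            if silentRun >= minSilenceChunks {
                triggered = false
                // Report the end at the start of the silent run, not at "now".
                let end = Double(chunkIndex + 1 - silentRun) * chunkSeconds
                silentRun = 0
                return .speechEnd(end)
            }
        }
        return nil
    }
}
```

Using a lower threshold for leaving speech than for entering it keeps a probability that hovers around 0.5 from toggling the state on every chunk. A segment still open when the stream ends has to be closed by feeding trailing silence, which is the flushing step shown in the log.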
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
introduce CoreML diarizer model on the SDK
Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026
Alex-Wengg added a commit that referenced this pull request Apr 3, 2026
## Summary

This PR adds **experimental** Mandarin Chinese ASR support via the CTC
zh-CN model and includes critical Swift 6 concurrency fixes for
`SlidingWindowAsrManager`.

> **⚠️ Experimental Feature**: CTC zh-CN Mandarin ASR is an early
preview. The API and performance characteristics may change in future
releases.

## Swift 6 Concurrency Fixes

### Fixed Issues
- **Removed premature state mutations** in `processWindow()` that
violated Swift 6 actor isolation
- State updates (`accumulatedTokens`, `lastProcessedFrame`,
`segmentIndex`, `processedChunks`) now occur **after** all async calls
complete successfully
- Prevents data races when async calls fail mid-execution

### Changes
- `SlidingWindowAsrManager.processWindow()`: Moved state mutation to
after async guard statements
- Ensures atomic state updates only when processing succeeds
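
The pattern described above (do the fallible async work first, commit state afterwards) can be sketched as follows. This is a hedged illustration rather than the actual `SlidingWindowAsrManager` code: `SlidingWindowSketch`, `WindowResult`, `DecodeError`, and the injected `decode` closure are hypothetical stand-ins.

```swift
/// Hypothetical result of decoding one window.
struct WindowResult: Sendable {
    let tokens: [Int]
    let lastFrame: Int
}

/// Hypothetical error type for the sketch.
struct DecodeError: Error {}

actor SlidingWindowSketch {
    private(set) var accumulatedTokens: [Int] = []
    private(set) var lastProcessedFrame = 0
    private(set) var processedChunks = 0

    // Stand-in for the real async model call.
    private let decode: @Sendable ([Float]) async throws -> WindowResult

    init(decode: @escaping @Sendable ([Float]) async throws -> WindowResult) {
        self.decode = decode
    }

    func processWindow(_ samples: [Float]) async throws {
        // 1. Perform all fallible async work, keeping results in locals.
        //    If this throws, no actor state has been touched.
        let result = try await decode(samples)

        // 2. Commit. No suspension point separates these mutations, so other
        //    callers of the actor never observe a half-applied update.
        accumulatedTokens.append(contentsOf: result.tokens)
        lastProcessedFrame = result.lastFrame
        processedChunks += 1
    }
}
```

Because an actor can interleave other work at every `await`, mutating state before the await means a failure (or a reentrant call) can leave counters advanced with no tokens to match. Grouping the mutations after the last await avoids that.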

## CTC zh-CN Mandarin ASR Integration (Experimental)

### New Features

#### Models
- **CtcZhCnManager**: High-level API for Mandarin Chinese ASR using CTC
decoder
- **CtcZhCnModels**: Model management with int8/fp32 encoder variants
  - Int8: 571 MB (default)
  - FP32: 1.1 GB
- Auto-downloads from HuggingFace:
`FluidInference/parakeet-ctc-0.6b-zh-cn-coreml`

#### CLI Commands
```bash
# Transcribe Mandarin audio
swift run fluidaudiocli ctc-zh-cn-transcribe audio.wav

# Benchmark on THCHS-30 dataset (full 2,495 samples)
swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download

# Benchmark subset (100 samples for faster testing)
swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download --samples 100
```

#### Benchmark Results (THCHS-30 Full Test Set)

**Full dataset** (2,495 samples):
- **Mean CER**: 8.23%
- **Median CER**: 6.45%
- **CER = 0% (perfect)**: 435 samples (17.4%)
- **Distribution**: 67.1% of samples <10% CER, 93.2% <20% CER
- **Mean Latency**: 614 ms
- **Mean RTFx**: 14.83x

### Dataset

**THCHS-30** - Mandarin Chinese speech corpus from Tsinghua University
- 30 hours of clean speech
- 50 speakers
- 2,495 test utterances (10 speakers, 250 unique sentences)
- Content domain: News (not classical literature)
- Source: http://www.openslr.org/18/
- HuggingFace: `FluidInference/THCHS-30-tests`

### Text Normalization

CER calculation includes:
- Chinese punctuation removal (,。!?、;:\u{201C}\u{201D}\u{2018}\u{2019})
- English punctuation removal (,.!?;:()[]{}\\<>"'-)
- Arabic digit → Chinese character conversion (0→零, 1→一, etc.)
- Whitespace normalization
- Levenshtein distance calculation
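
The normalization and CER steps listed above can be sketched roughly as below. This is an approximation for illustration, not the benchmark's code: it strips punctuation with `Character.isPunctuation` instead of the explicit character lists (so symbols such as `<` and `>` are not removed), it drops whitespace entirely, and it converts digits one at a time. The function names and `digitMap` are hypothetical.

```swift
/// Character-level Levenshtein distance using two rolling rows.
func levenshtein(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for (i, ca) in a.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: b.count)
        for (j, cb) in b.enumerated() {
            let substitution = previous[j] + (ca == cb ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return previous[b.count]
}

/// Digit-by-digit mapping (0→零, 1→一, ...); multi-digit numbers are not read as a whole.
let digitMap: [Character: Character] = [
    "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
]

/// Removes whitespace and punctuation, then converts Arabic digits.
func normalize(_ text: String) -> [Character] {
    text.compactMap { ch -> Character? in
        if ch.isWhitespace || ch.isPunctuation { return nil }
        return digitMap[ch] ?? ch
    }
}

/// Character error rate; nil when the normalized reference is empty.
func cer(reference: String, hypothesis: String) -> Double? {
    let ref = normalize(reference)
    let hyp = normalize(hypothesis)
    guard !ref.isEmpty else { return nil }
    return Double(levenshtein(ref, hyp)) / Double(ref.count)
}
```

Normalizing both sides before the distance is computed is what keeps a reference containing `1` and a hypothesis containing `一` from being scored as a substitution.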

## Devin Review Fixes ✅

Addressed all issues from [Devin code
review](https://app.devin.ai/review/fluidinference/fluidaudio/pull/476):

### Review #1 (4 issues)
1. **✅ Fixed digit-to-Chinese conversion** - Added missing normalization
(0→零, 1→一, etc.) that was inflating CER by ~1.66%
2. **✅ Added unit tests** - Created 13 comprehensive test cases for text
normalization, CER calculation, and Levenshtein distance
3. **✅ Fixed CI dataset cache path** - Not applicable after CI workflow
removal
4. **✅ Fixed CI model cache path** - Not applicable after CI workflow
removal

### Review #2 (2 issues)
5. **✅ Fixed CER threshold mismatch** - Not applicable after CI workflow
removal
6. **✅ Fixed saveResults NaN crash** - Added guard for empty results
array to prevent division by zero

### Review #3 (2 issues)
7. **✅ Fixed FP32 encoder download** - Include both int8 and fp32
encoders in `requiredModels` set
8. **✅ Fixed AsrManager CTC-only handling** - Throw explicit error
instead of routing to incompatible TDT decoder

### Additional Fixes
- **✅ Fixed Unicode curly quotes** - Used escape sequences (`\u{201C}`
etc.) in both source and tests
- Added missing English punctuation removal
- Added missing Chinese quotation mark handling

## Files Changed

### Swift 6 Concurrency
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/SlidingWindowAsrManager.swift`
- `Sources/FluidAudio/ASR/Parakeet/AsrManager.swift` (added .ctcZhCn
case + error handling)

### CTC zh-CN Integration
- `Sources/FluidAudio/ASR/Parakeet/CtcZhCnManager.swift` (new)
- `Sources/FluidAudio/ASR/Parakeet/CtcZhCnModels.swift` (new)
- `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnTranscribeCommand.swift`
(new)
- `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnBenchmark.swift` (new)
- `Sources/FluidAudio/ModelNames.swift` (updated - both encoder
variants)
- `Documentation/Benchmarks.md` (updated - marked experimental)

### Tests
- `Tests/FluidAudioTests/ASR/Parakeet/CtcZhCnTests.swift` (new - 13 test
cases)

## Testing

- [x] Swift 6 concurrency fixes pass existing tests
- [x] CTC zh-CN transcription tested manually
- [x] THCHS-30 full benchmark: 8.23% mean CER (2,495 samples)
- [x] Unit tests: 13 test cases for normalization and CER (100% passing)
- [x] Text normalization matches baseline exactly
- [x] FP32 encoder download verified

## Notes

- This PR is a clean rebase of #475 off main
- Skipped conflicting decoder refactoring commit (superseded by #474)
- **Experimental feature**: CTC zh-CN API may change in future releases
- **No CI workflow**: Benchmarks are run manually for experimental
features
3 participants