Skip to content

Streaming VAD and Speech Segmentation#110

Merged
BrandonWeng merged 13 commits intomainfrom
speech-semgentation
Sep 17, 2025
Merged

Streaming VAD and Speech Segmentation#110
BrandonWeng merged 13 commits intomainfrom
speech-semgentation

Conversation

@BrandonWeng
Copy link
Copy Markdown
Member

@BrandonWeng BrandonWeng commented Sep 16, 2025

Why is this change needed?

Taking inspiration from the silero https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00

@BrandonWeng BrandonWeng added enhancement New feature or request voice-activity-detection Issues related to VAD labels Sep 16, 2025
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Sep 16, 2025

Claude finished @BrandonWeng's task —— View job


Code Review: Streaming VAD and Speech Segmentation ✅

Todo List:

  • Start pending GitHub review
  • Get PR diff information
  • Analyze core VAD streaming implementation
  • Review new CLI command structure
  • Check test coverage and quality
  • Review documentation updates
  • Provide inline feedback on specific issues
  • Submit complete review

🔍 Review Complete

I've completed a comprehensive code review of PR #110. The implementation successfully adds streaming VAD capabilities inspired by Silero's reference implementation with solid architecture and good test coverage.

Strengths

  1. Excellent Architecture: The streaming implementation follows Silero's reference design with proper state machine handling
  2. Comprehensive API Design: Clear separation between offline segmentation and streaming with flexible configuration
  3. Good Test Coverage: New VadStreamingTests.swift provides solid unit test coverage for the streaming functionality
  4. Thorough Documentation: Updated CLI and VAD documentation with clear examples and parameter explanations
  5. Consistent Code Style: Follows the project's Swift formatting and naming conventions

⚠️ Key Issues Identified

Streaming Implementation

  • Timing Calculation Edge Cases: Start/end sample calculations may have inconsistencies with negative values and chunkSampleCount adjustments
  • Parameter Validation: Missing bounds checking for timeResolution parameter

Code Organization

  • Complex State Machine: The segmentation logic is quite dense and could benefit from refactoring
  • Test Relaxation: Several test tolerances were widened, suggesting algorithm changes

🎯 Recommendation

LGTM with minor fixes - The core functionality is solid and well-tested. The timing edge cases should be addressed but aren't blockers for the streaming VAD implementation.

The submitted GitHub review contains detailed feedback on specific lines and suggestions for improvement.


Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR successfully implements streaming VAD and speech segmentation inspired by Silero's reference implementation. The overall architecture is solid and the API design is well-thought-out.

Key Strengths:

  • ✅ Clean separation between offline/streaming APIs
  • ✅ Comprehensive test coverage for new functionality
  • ✅ Good documentation updates with clear examples
  • ✅ Follows project coding standards

Areas for Improvement:

Timing Calculations (Priority: Medium)

  • Stream event timing has potential edge cases with negative values
  • Inconsistent chunkSampleCount adjustments between start/end events
  • Missing bounds validation for timeResolution parameter

Code Organization (Priority: Low)

  • Complex segmentation state machine could benefit from refactoring
  • Some test tolerances were relaxed, suggesting algorithm changes
  • Consider making test-specific APIs internal

The implementation appears functionally correct and the test relaxations likely reflect improved accuracy from the new Silero-style implementation. The timing edge cases should be addressed but aren't blockers for the core functionality.

Recommendation: LGTM with minor fixes suggested for timing calculations and parameter validation.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 16, 2025

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric Value Target Status Description
DER 15.1% <30% Diarization Error Rate (lower is better)
JER 24.9% <25% Jaccard Error Rate
RTFx 24.39x >1.0x Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage Time (s) % Description
Model Download 8.020 18.6 Fetching diarization models
Model Compile 3.437 8.0 CoreML compilation
Audio Load 0.089 0.2 Loading audio file
Segmentation 12.902 30.0 Detecting speech regions
Embedding 21.503 50.0 Extracting speaker voices
Clustering 8.601 20.0 Grouping same speakers
Total 43.025 100 Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method DER Notes
FluidAudio 15.1% On-device CoreML
Research baseline 18-30% Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 43.0s diarization time • Test runtime: 1m 15s • 09/16/2025, 04:31 PM EST

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 16, 2025

VAD Benchmark Results

Performance Comparison

Dataset Accuracy Precision Recall F1-Score RTFx Files
MUSAN 92.0% 86.2% 100.0% 92.6% 444.9x faster 50
VOiCES 92.0% 86.2% 100.0% 92.6% 454.3x faster 50

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex Review: Here are some suggestions.

About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".

Comment on lines 57 to 60
guard
let melspectrogramOutput = try melspectrogramModel?.prediction(
let melspectrogramOutput = try await melspectrogramModel?.prediction(
from: melspectrogramInput,
options: predictionOptions
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] Remove await from synchronous melspectrogram prediction

The newly added await before melspectrogramModel?.prediction(...) will not compile because MLModel.prediction is a synchronous API—there is no async overload like there is for the surrounding helper call. Every other model prediction in this file remains synchronous, so this line now produces a build error ('await' cannot be applied to a non-async function). Drop the await or switch to an asynchronous wrapper so the ASR build succeeds.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

@BrandonWeng BrandonWeng Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to support for newer versions of swift + os

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 16, 2025

ASR Benchmark Results ✅

Status: All benchmarks passed

Dataset WER Avg WER Med RTFx Status
test-clean 0.57% 0.00% 2.09x
test-other 1.19% 0.00% 1.63x

Streaming Infrastructure Test

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.32x Streaming real-time factor
Avg Chunk Time 2.787s Average time to process each chunk
Max Chunk Time 4.041s Maximum chunk processing time
First Token 3.331s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming test uses 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 4m50s • 09/16/2025, 04:35 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

let samples = try AudioConverter().resampleAudioFile(audioURL)

var segmentation = VadSegmentationConfig.default
segmentation.minSpeechDuration = 0.25
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be concern if these minSpeechDuraiton and maxSpeechduration ever somehow conflict

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lool yeah we can add a check


Configuration for turning raw VAD probabilities into stable speech segments.

This struct applies rules for minimum durations, thresholds, and hysteresis to avoid jittery cuts and to produce clean, ASR-ready segments.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standard fields that VAD segmentation exposes, users may expect to be able to tune these

@BrandonWeng
Copy link
Copy Markdown
Member Author

swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --export-wav segments-vi.wav

with VAD:

 Transcribing file: segments-vi.wav -- file:///Users/brandonweng/code/FluidAudio/ ...
[16:25:22.416] [INFO] [Transcribe]==================================================
[16:25:22.416] [INFO] [Transcribe] BATCH TRANSCRIPTION RESULTS
[16:25:22.416] [INFO] [Transcribe] ==================================================
[16:25:22.416] [INFO] [Transcribe] My gut feeling, regardless, is that these particular features are extremely seldom used. So if anything, they need a big revamp
[16:25:22.416] [INFO] [Transcribe]

without VAD:

[16:25:13.012] [INFO] [Transcribe] Transcribing file: voiceink-issue-279.wav -- file:///Users/brandonweng/code/FluidAudio/ ...
[16:25:13.337] [INFO] [Transcribe]==================================================
[16:25:13.337] [INFO] [Transcribe] BATCH TRANSCRIPTION RESULTS
[16:25:13.337] [INFO] [Transcribe] ==================================================
[16:25:13.337] [INFO] [Transcribe]
Final transcription:
[16:25:13.337] [INFO] [Transcribe] My gut feeling, regardless, is that these particular features. So if anything yeah, I think it's a good idea.,
[16:25:13.337] [INFO] [Transcribe]

@BrandonWeng BrandonWeng enabled auto-merge (squash) September 16, 2025 20:33

/// Process raw 16kHz mono samples.
/// Processes audio in 4096-sample chunks (256ms at 16kHz).
/// ```swift
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the purpose of these code comments

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err mostly to show how to use it. You dont think its useful?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better in readme, idk i guess our code is getting too complicated now

@BrandonWeng BrandonWeng merged commit 1416b2f into main Sep 17, 2025
13 checks passed
@BrandonWeng BrandonWeng deleted the speech-semgentation branch September 17, 2025 06:37
Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request voice-activity-detection Issues related to VAD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants