Fluid Audio ships the Silero VAD model converted to Core ML, together with Silero-style timestamp extraction and streaming hysteresis. If you need help tuning the parameters for your use case, reach out on Discord.
For a comparison of Silero VAD against other models, see the quality metrics wiki (we run v6): https://github.com/snakers4/silero-vad/wiki/Quality-Metrics
Stage the Core ML bundle yourself when the runtime cannot reach HuggingFace. The `silero-vad-unified-256ms-v6.0.0.mlmodelc` bundle lives in the FluidInference/silero-vad-coreml repo. Keep the folder name intact so `coremldata.bin` remains discoverable.
```
/opt/models
└── silero-vad-coreml
    └── silero-vad-unified-256ms-v6.0.0.mlmodelc
        ├── coremldata.bin
        └── ...
```
Clone with Git LFS, download the archive from the HuggingFace UI, or copy from a machine that already initialized `VadManager()` (cache path: `~/Library/Application Support/FluidAudio/Models/silero-vad-coreml`).
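For the Git LFS route, a sketch of the clone into the layout shown above (the HuggingFace URL is inferred from the repo name and is an assumption):

```bash
# Assumed repo URL; requires git-lfs to pull the Core ML weights.
git lfs install
git clone https://huggingface.co/FluidInference/silero-vad-coreml /opt/models/silero-vad-coreml
```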
Supply the staged bundle to the `VadManager(config:vadModel:)` initializer:
```swift
import FluidAudio
import CoreML

Task {
    do {
        let modelURL = URL(
            fileURLWithPath: "/opt/models/silero-vad-coreml/silero-vad-unified-256ms-v6.0.0.mlmodelc",
            isDirectory: true
        )
        var configuration = MLModelConfiguration()
        configuration.computeUnits = .cpuOnly
        let vadModel = try MLModel(contentsOf: modelURL, configuration: configuration)
        let manager = VadManager(config: .default, vadModel: vadModel)
        // Ready for segmenting or streaming without network downloads
    } catch {
        print("Failed to load VAD model: \(error)")
    }
}
```

Use `FileManager` to confirm the `.mlmodelc` directory exists before constructing the manager. When the bundle is present, no fallback download attempts occur.
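A minimal sketch of that pre-flight check, reusing the staging path from the example above:

```swift
import Foundation

// Verify the staged .mlmodelc directory exists before constructing VadManager.
let bundlePath = "/opt/models/silero-vad-coreml/silero-vad-unified-256ms-v6.0.0.mlmodelc"
var isDirectory: ObjCBool = false
let exists = FileManager.default.fileExists(atPath: bundlePath, isDirectory: &isDirectory)

guard exists, isDirectory.boolValue else {
    fatalError("Staged VAD bundle not found at \(bundlePath)")
}
```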
`VadManager` can now emit ready-to-use speech intervals directly from PCM samples. The segmentation logic mirrors the Silero reference implementation, including minimum speech duration, silence padding, and max-duration splitting.
```swift
import FluidAudio

Task {
    let manager = try await VadManager(
        config: VadConfig(defaultThreshold: 0.75)
    )

    // Convert any supported file to 16 kHz mono Float32
    let audioURL = URL(fileURLWithPath: "path/to/audio.wav")
    let samples = try AudioConverter().resampleAudioFile(audioURL)

    // Tune segmentation behavior with VadSegmentationConfig
    var segmentation = VadSegmentationConfig.default
    segmentation.minSpeechDuration = 0.25
    segmentation.minSilenceDuration = 0.4
    segmentation.speechPadding = 0.12

    let segments = try await manager.segmentSpeech(samples, config: segmentation)
    for (index, segment) in segments.enumerated() {
        print(String(
            format: "Segment %02d: %.2f–%.2fs",
            index + 1,
            segment.startTime,
            segment.endTime
        ))
    }

    // Need audio chunks instead of timestamps?
    let clips = try await manager.segmentSpeechAudio(samples, config: segmentation)
    print("Extracted \(clips.count) buffered segments ready for ASR")
}
```

Need chunk-level probabilities for each 256 ms hop? Use `process(_:)` and inspect `VadResult` directly:
```swift
let results = try await manager.process(samples)
for (index, chunk) in results.enumerated() {
    print(
        String(
            format: "Chunk %02d: prob=%.3f, inference=%.4fs",
            index,
            chunk.probability,
            chunk.processingTime
        )
    )
}
```

Key knobs in `VadSegmentationConfig`:

- `minSpeechDuration`: discard very short bursts.
- `minSilenceDuration`: silence length required to close a segment.
- `maxSpeechDuration`: automatically split long spans at the last detected silence (default 14 s).
- `speechPadding`: context added on both sides of each returned segment.
- `negativeThreshold` / `negativeThresholdOffset`: control hysteresis the same way as Silero's `threshold` / `neg_threshold`.
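Two of these knobs in action, continuing the segmentation example above (the values are illustrative, not tuned recommendations):

```swift
// Tighten the splitter and the hysteresis, reusing `manager` and `samples`.
var strictConfig = VadSegmentationConfig.default
strictConfig.maxSpeechDuration = 10.0       // split spans longer than 10 s at the last detected silence
strictConfig.negativeThresholdOffset = 0.15 // close speech once probability drops 0.15 below the threshold
let strictSegments = try await manager.segmentSpeech(samples, config: strictConfig)
print("Found \(strictSegments.count) segments with stricter splitting")
```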
If you prefer to keep the per-chunk `VadResult` output, you can measure the real-time factor (RTFx) of non-streaming runs by comparing total inference time with the audio duration:
```swift
let results = try await manager.process(samples)
let totalInference = results.reduce(0.0) { $0 + $1.processingTime }
let audioSeconds = Double(samples.count) / Double(VadManager.sampleRate)
let rtf = audioSeconds / totalInference
print(String(format: "VAD RTFx: %.1f", rtf))
```

`VadResult.processingTime` is reported per 4096-sample chunk, so summing across the array yields the full-pass latency.
For streaming workloads you control the chunk size and maintain a `VadStreamState`. Each call emits at most one `VadStreamEvent` describing a speech start or end boundary, along with the raw probability for the chunk.
```swift
import FluidAudio

Task {
    let manager = try await VadManager()
    var state = await manager.makeStreamState()

    for chunk in microphoneChunks { // chunk length ~256 ms at 16 kHz
        let result = try await manager.processStreamingChunk(
            chunk,
            state: state,
            config: .default,
            returnSeconds: true,
            timeResolution: 2
        )
        state = result.state

        // Access the raw VAD probability (0.0-1.0) for this chunk
        print(String(format: "Probability: %.3f", result.probability))

        if let event = result.event {
            switch event.kind {
            case .speechStart:
                print("Speech began at \(event.time ?? 0) s")
            case .speechEnd:
                print("Speech ended at \(event.time ?? 0) s")
            }
        }
    }
}
```

The `VadStreamResult` contains:

- `state`: updated state to pass to the next chunk
- `event`: optional speech start/end event (only emitted at boundaries)
- `probability`: raw VAD probability (0.0-1.0) for the current chunk
Notes:

- Stream chunks do not need to be exactly 4096 samples; choose what matches your input cadence.
- Call `makeStreamState()` whenever you reset your audio stream (equivalent to Silero's `reset_states`).
- When requesting seconds (`returnSeconds: true`), timestamps are rounded to `timeResolution` decimal places.
- Use `probability` for custom thresholding logic or confidence tracking alongside the built-in hysteresis, as in the sketch below.
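A minimal sketch of that last note: a hypothetical helper that applies a custom confidence cutoff on top of the stream (the helper name, the 0.9 cutoff, and the `[[Float]]` chunk type are illustrative assumptions):

```swift
import FluidAudio

/// Count chunks that clear a custom confidence cutoff, independent of the
/// built-in hysteresis events.
func countHighConfidenceChunks(_ chunks: [[Float]], manager: VadManager) async throws -> Int {
    var state = await manager.makeStreamState()
    var count = 0
    for chunk in chunks {
        let result = try await manager.processStreamingChunk(
            chunk,
            state: state,
            config: .default,
            returnSeconds: true,
            timeResolution: 2
        )
        state = result.state
        if result.probability > 0.9 { count += 1 } // custom cutoff, example value
    }
    return count
}
```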
Start with the general-purpose `process` command, which runs the diarization pipeline (and therefore VAD) end-to-end on a single file:
```bash
swift run fluidaudiocli process path/to/audio.wav
```

Once you need to experiment with the VAD-specific heuristics directly, use the CLI commands below:
```bash
# Inspect offline segments (default mode is offline only)
swift run fluidaudiocli vad-analyze path/to/audio.wav

# Streaming only, 128 ms chunks, tighter silence rules (timestamps are emitted in seconds)
swift run fluidaudiocli vad-analyze path/to/audio.wav --streaming --min-silence-ms 300

# Run both offline + streaming in one pass
swift run fluidaudiocli vad-analyze path/to/audio.wav --mode both

# Classic benchmark tooling remains available
swift run fluidaudiocli vad-benchmark --num-files 50 --threshold 0.3
```

`swift run fluidaudiocli vad-analyze --help` prints the full list of tuning options, including negative-threshold overrides and max-duration splitting.
Offline runs emit an RTFx summary calculated from per-chunk inference time. Use `--mode both` if you also want to see streaming start/end events in the same run. Datasets for benchmarking can be fetched with `swift run fluidaudiocli download --dataset vad`.