Fix Kokoro v2 source_noise dtype and distribution #447
Conversation
Claude finished @Alex-Wengg's task — View job

PR Review: Fix Kokoro v2 source_noise dtype and distribution

✅ Review Complete — I've thoroughly reviewed this PR and submitted a comprehensive code review. The fixes address two critical bugs.

Recommendation: This is a high-quality fix that should be merged. The changes improve correctness without introducing regressions.
PR Review: Fix Kokoro v2 source_noise dtype and distribution
This PR successfully addresses two critical bugs in Kokoro v2's source_noise generation. The fixes are technically sound and well-implemented. Here's my detailed analysis:
✅ Bug Fixes - Excellent Work
1. dtype mismatch (.float16 → .float32)
- Fix: Correctly addresses the Core ML model expectation mismatch
- Performance: Eliminates unnecessary vImage conversion overhead
- Consistency: Both synthesis and warm-up paths now use `.float32` consistently
2. Wrong distribution (uniform → Gaussian)
- Fix: Properly implements a standard normal distribution matching PyTorch's `torch.randn()`
- Accuracy: Box-Muller transform correctly generates N(0, 1) noise as expected by `SourceModuleHnNSF`
🔧 Code Quality Improvements
Shared RandomUtils.swift
- Good consolidation of Box-Muller implementation from PocketTTS
- Clear documentation linking to PyTorch equivalence
- Proper parameterization with RNG for testability
Code Cleanup
- Removal of complex vImage conversion logic simplifies maintenance
- Eliminated unused Accelerate import from TtsModels.swift
- Consistent implementation across KokoroSynthesizer.swift and TtsModels.swift
💡 Minor Suggestions
RandomUtils.swift:12 - Nit: Consider Float.ulpOfOne instead of Float.leastNonzeroMagnitude for the lower bound. Box-Muller requires u1 > 0, but leastNonzeroMagnitude (~1e-45) could cause numerical issues with logf(). Float.ulpOfOne (~1e-7) provides better stability.
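To illustrate the suggested bound, here is a minimal Box-Muller sketch (function and parameter names are illustrative, not the actual `RandomUtils` API). Clamping `u1` to `Float.ulpOfOne` keeps `logf(u1)` comfortably finite, whereas `leastNonzeroMagnitude` (~1e-45) would let `-2 * log(u1)` grow to roughly 207, producing extreme ~14σ outliers:

```swift
import Foundation

/// Box-Muller transform: two U(0,1) samples → one N(0,1) sample.
/// u1 is clamped to Float.ulpOfOne so logf(u1) stays well-behaved.
func gaussianSample<G: RandomNumberGenerator>(using rng: inout G) -> Float {
    let u1 = max(Float.random(in: 0..<1, using: &rng), Float.ulpOfOne)
    let u2 = Float.random(in: 0..<1, using: &rng)
    return sqrtf(-2.0 * logf(u1)) * cosf(2.0 * Float.pi * u2)
}

var rng = SystemRandomNumberGenerator()
let noise = (0..<4096).map { _ in gaussianSample(using: &rng) }
let mean = noise.reduce(0, +) / Float(noise.count)
// mean should be close to 0 for N(0,1) noise
```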
Performance consideration: The current implementation generates noise sample-by-sample in a loop. For large arrays, vectorized generation might be more efficient, but given this is likely not a performance bottleneck, the current readable implementation is fine.
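If vectorized generation ever becomes worthwhile, one option is a bulk Box-Muller using Accelerate's vForce routines. This is a sketch of the idea, not the PR's implementation; the function name is hypothetical:

```swift
import Accelerate
import Foundation

/// Bulk Box-Muller: generate `count` N(0,1) samples, replacing the
/// per-sample transcendental calls with vectorized vForce equivalents.
func gaussianNoiseBulk(count: Int) -> [Float] {
    var n = Int32(count)
    let u1 = (0..<count).map { _ in max(Float.random(in: 0..<1), Float.ulpOfOne) }
    let theta = (0..<count).map { _ in Float.random(in: 0..<1) * 2 * Float.pi }
    var logU1 = [Float](repeating: 0, count: count)
    vvlogf(&logU1, u1, &n)                       // log(u1), vectorized
    let minusTwoLog = logU1.map { -2 * $0 }
    var radius = [Float](repeating: 0, count: count)
    vvsqrtf(&radius, minusTwoLog, &n)            // sqrt(-2 log u1)
    var cosTheta = [Float](repeating: 0, count: count)
    vvcosf(&cosTheta, theta, &n)                 // cos(2π u2)
    return zip(radius, cosTheta).map(*)          // r · cos θ ~ N(0, 1)
}
```

The uniform draws still happen sample-by-sample here; only the transcendentals are vectorized, which is where most of the loop cost sits.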
🎯 Architecture & Maintainability
- Good: Centralized gaussian random generation in shared utilities
- Good: Consistent application across both synthesis pipeline and model warm-up
- Good: Clear comments explaining the mathematical requirements
- Good: Maintains existing API contracts while fixing internal implementation
⚡ Performance Impact
- Positive: Eliminates Float32→Float16 conversion overhead
- Positive: Removes unnecessary temporary buffer allocation
- Neutral: Sample-by-sample generation vs. previous bulk generation (likely negligible impact)
🔍 Security & Safety
- No security concerns identified
- Proper memory management with existing multiArrayPool patterns
- Safe pointer operations with correct capacity bounds
📋 Overall Assessment
This is a high-quality fix that addresses real bugs with proper technical solutions. The implementation follows Swift best practices and maintains consistency with the existing codebase architecture. The fixes should resolve the identified dtype and distribution issues in Kokoro v2 models.
Recommendation: ✅ Approve - These are objective bug fixes that improve correctness without introducing regressions.
VAD Benchmark Results — Performance Comparison and Dataset Details (✅ = Average F1-Score above 70%)
Offline VBx Pipeline Results

- Speaker Diarization Performance (VBx Batch Mode): optimal clustering with the Hungarian algorithm for maximum accuracy
- Timing Breakdown: time spent in each stage of batch diarization
- Research Comparison: Offline VBx achieves competitive accuracy with batch processing

Pipeline Details: 🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 286.7s processing • Test runtime: 5m 1s • 03/27/2026, 08:57 PM EST
PocketTTS Smoke Test ✅

Runtime: 0m33s. Note: PocketTTS uses the Core ML MLState (macOS 15) KV cache plus Mimi streaming state. The CI VM lacks a physical GPU, so audio quality may differ from Apple Silicon.
Sortformer High-Latency Benchmark Results — ES2004a performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 2m 27s • 2026-03-28T00:45:58.740Z
Speaker Diarization Benchmark Results

- Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
- Timing Breakdown: time spent in each stage of speaker diarization
- Research Comparison: research baselines typically achieve 18-30% DER on standard datasets

Note: RTFx shown above is from the GitHub Actions runner, not Apple Silicon with ANE.

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 35.7s diarization time • Test runtime: 3m 23s • 03/27/2026, 08:53 PM EST
Parakeet EOU Benchmark Results ✅ — Status: Benchmark passed (Performance Metrics and Streaming Metrics)

Test runtime: 1m0s • 03/27/2026, 08:58 PM EST. RTFx = Real-Time Factor (higher is better) • Processing includes model inference, audio preprocessing, state management, and file I/O
Qwen3-ASR int8 Smoke Test ✅

Runtime: 3m13s. Note: the CI VM lacks a physical GPU, so the Core ML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
ASR Benchmark Results ✅ — Status: All benchmarks passed

- Parakeet v3 (multilingual)
- Parakeet v2 (English-optimized)
- Streaming (v3)
- Streaming (v2)

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming. 25 files per dataset • Test runtime: 7m8s • 03/27/2026, 08:43 PM EST. RTFx = Real-Time Factor (higher is better), calculated as total audio duration ÷ total processing time. Expected RTFx on physical M1 hardware: ~28x (clean), ~25x (other). Testing methodology follows the HuggingFace Open ASR Leaderboard.
Force-pushed from b9ab93d to c13d259
Switches to v1 models on all platforms to avoid source_noise issues in v2. Fixes audio endpoint trimming by computing length from pred_dur output.

Changes:
- ModelNames.swift: Use v1 models (.mlmodelc) on all platforms instead of v2 (_v2.mlmodelc)
- KokoroSynthesizer.swift: Compute audio length from pred_dur (frames * 600) instead of broken audio_length_samples

Results:
- "Hello world" → 1.5s (was 5s)
- "This is a test of kokoro" → 2.35s (was 5s)
- Proper trimming without cutting off speech

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Force-pushed from c13d259 to 9368acc
```swift
var totalFrames: Float = 0.0
let predDurPtr = predDurArray.dataPointer.bindMemory(to: Float.self, capacity: predDurArray.count)
for i in 0..<predDurArray.count {
    totalFrames += predDurPtr[i]
}
```
🔴 pred_dur dataPointer bound as Float.self without checking actual MLMultiArray data type
The new pred_dur reading code at KokoroSynthesizer.swift:420 assumes the array is float32 by using bindMemory(to: Float.self), but never checks predDurArray.dataType. If the CoreML model outputs pred_dur in float16 format (2 bytes/element), binding as Float.self (4 bytes/element) causes an out-of-bounds memory read (reading count * 4 bytes from a count * 2 byte buffer) and produces garbage frame counts, leading to incorrect audio trimming. This is inconsistent with the audio output handling just below at KokoroSynthesizer.swift:448, which correctly checks audioArrayUnwrapped.dataType == .float32 before pointer-binding and has a safe fallback using [i].floatValue.
Safe pattern used for audio (line 448) but not for pred_dur:

```swift
// Audio output: correctly checks dataType
if audioArrayUnwrapped.dataType == .float32 {
    let sourcePointer = audioArrayUnwrapped.dataPointer.bindMemory(...)
} else {
    // safe fallback via NSNumber
}

// pred_dur: no dataType check
let predDurPtr = predDurArray.dataPointer.bindMemory(to: Float.self, ...)
```

Suggested change:

```swift
// Before
var totalFrames: Float = 0.0
let predDurPtr = predDurArray.dataPointer.bindMemory(to: Float.self, capacity: predDurArray.count)
for i in 0..<predDurArray.count {
    totalFrames += predDurPtr[i]
}

// After
var totalFrames: Float = 0.0
if predDurArray.dataType == .float32 {
    let predDurPtr = predDurArray.dataPointer.bindMemory(to: Float.self, capacity: predDurArray.count)
    for i in 0..<predDurArray.count {
        totalFrames += predDurPtr[i]
    }
} else {
    for i in 0..<predDurArray.count {
        totalFrames += predDurArray[i].floatValue
    }
}
```
Fixes audio trimming issues in Kokoro TTS by switching to v1 models and computing audio length from `pred_dur` output.

Changes

1. Switch to v1 models on all platforms
   - `audio_length_samples` output (always returns 0)
2. Fix audio trimming using pred_dur
   - `audio_length_samples` output is broken (returns 0)
   - `pred_dur` output: sum(pred_dur) * 600 samples/frame

Technical Details

v1 models don't have the `source_noise` input (it's internalized), avoiding the dtype and distribution issues entirely. The `pred_dur` output provides accurate frame counts that can be reliably converted to sample counts.

Fixes #445