refactor(parakeet): Improve consistency across ASR managers #494
Alex-Wengg merged 5 commits into main
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 50.4s diarization time • Test runtime: 2m 21s • 04/07/2026, 04:03 PM EST
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 1m57s • 04/07/2026, 04:14 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Consolidates ~700 lines of duplicated boilerplate across three language-specific model files into a generic implementation.

Changes:
- Add ParakeetLanguageModels<Config> generic struct (337 lines)
- Refactor CtcJaModels.swift: 229 → 22 lines (config + typealias)
- Refactor CtcZhCnModels.swift: 265 → 22 lines (config + typealias)
- Refactor TdtJaModels.swift: 237 → 22 lines (config + typealias)
- Make Repo enum Sendable for concurrency safety
- Add joint model validation in TdtJaManager

Pattern: Protocol-based configuration with generic implementation. The ParakeetLanguageModelConfig protocol defines language-specific settings (blankId, repository, model files, int8 support). Type aliases maintain backward compatibility.

Reduces codebase by 328 lines (~45% reduction) while maintaining identical functionality. All CI tests pass.

Resolves #457
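The protocol-plus-generic pattern described in this commit can be sketched roughly as follows. This is only an illustration: the `Sketch…` type names, the member names, the file names, and the placeholder values are all assumptions made for the example and are not taken from the project's code.

```swift
// Hypothetical configuration protocol: each language supplies static settings.
protocol SketchLanguageModelConfig {
    static var blankId: Int { get }
    static var repository: String { get }
    static var modelFiles: [String] { get }
    static var supportsInt8: Bool { get }
}

// One generic implementation replaces the per-language boilerplate.
struct SketchLanguageModels<Config: SketchLanguageModelConfig> {
    var blankId: Int { Config.blankId }

    // Returns the files to load, using an assumed "int8/" prefix
    // only when the configuration says int8 variants exist.
    func requiredFiles(useInt8: Bool) -> [String] {
        guard useInt8 && Config.supportsInt8 else { return Config.modelFiles }
        return Config.modelFiles.map { "int8/" + $0 }
    }
}

// A language-specific file shrinks to a config plus a typealias.
enum SketchCtcJaConfig: SketchLanguageModelConfig {
    static let blankId = 1                      // placeholder value
    static let repository = "example/ctc-ja"    // placeholder value
    static let modelFiles = ["Encoder.mlmodelc"] // placeholder value
    static let supportsInt8 = true
}

typealias SketchCtcJaModels = SketchLanguageModels<SketchCtcJaConfig>
```

Existing call sites can keep using the alias name, which is how a typealias preserves source compatibility while the shared logic lives in one place.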
Replace force unwrap with guard let statement and proper error handling. This follows project guidelines which prohibit force unwrapping.

Changes:
- Replace models.joint! with guard let jointModel = models.joint
- Throw ASRError.processingFailed if joint model is missing
- Remove precondition from init (guard let provides better error handling)

Addresses review feedback on PR #492.
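A minimal sketch of the guard-let replacement described above, using stand-in types. `SketchAsrError`, `SketchModels`, and `requireJointModel` are hypothetical names for illustration; the real `ASRError` and model types are not shown in this conversation.

```swift
// Stand-in error type; the project's ASRError is assumed to look similar.
enum SketchAsrError: Error, Equatable {
    case processingFailed(String)
}

// Stand-in container with an optional joint model (a String here for simplicity).
struct SketchModels {
    var joint: String?
}

// Unwraps the optional safely and throws instead of crashing on nil.
func requireJointModel(from models: SketchModels) throws -> String {
    guard let jointModel = models.joint else {
        throw SketchAsrError.processingFailed("Joint model is missing")
    }
    return jointModel
}
```

Compared with `models.joint!`, a missing model surfaces as a catchable error at the call site rather than a runtime trap.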
This PR addresses three high-priority consistency improvements in the Parakeet ASR folder (issue #457):

## 1. Standardize Lifecycle Method Names

Unified naming conventions across all ASR managers to eliminate confusion:
- `AsrManager`: `loadModels(_:)` → `configure(models:)` (clarifies it accepts pre-loaded models)
- `SlidingWindowAsrSession`: `initialize()` → `loadModels()` (consistent with download methods)
- `SlidingWindowAsrManager`: `start()` → `startStreaming()` (clearer intent)
- `StreamingEouAsrManager`: `loadModelsFromHuggingFace()` → `loadModels()`
- Added `loadModels(from:)` overloads for consistency

**Files updated:** 5 managers + 8 CLI commands (13 total)

## 2. Consolidate Token Deduplication Logic

Extracted ~230 lines of duplicate matching algorithms into reusable utilities:

**New files:**
- `SequenceMatch.swift`: Data structure for sequence matches
- `SequenceMatcher.swift`: Generic matching algorithms (5 methods)
  - `findSuffixPrefixMatch()`: O(n) greedy boundary detection
  - `findBoundedSubstringMatch()`: Windowed search with offset
  - `findLongestCommonSubsequence()`: O(n²) dynamic programming LCS
  - `findContiguousMatches()`: Longest consecutive match run
  - `consolidateMatches()`: Merge adjacent matches
- `TokenDeduplicationRegressionTests.swift`: 12 comprehensive tests

**Refactored:**
- `AsrManager+TokenProcessing.swift`: Reduced from ~65 to ~40 lines (-38%)
- `ChunkProcessor.swift`: Removed ~77 lines of duplicate code

**Verified:** WER 0.4%, RTFx 43.3x (no regression)

## 3. Extract Shared EOU/Nemotron Streaming Code

Created reusable utilities for common streaming patterns:

**New utilities:**
- `EncoderCacheManager`: Cache initialization and extraction
  - `createInitialCaches()`: Zero-initialized cache arrays
  - `extractCachesFromOutput()`: Parse encoder outputs
  - `createZeroArray()`: Helper for array creation
- `StreamingAsrUtils`: Common operations
  - `appendAudio()`: Buffer audio with resampling
  - `resetSharedState()`: Clear audio/tokens/counters
  - `processRemainingAudio()`: Final chunk padding
  - `decodeTokens()`: Token-to-text conversion

**Refactored:**
- `StreamingNemotronAsrManager`: Cache management, state reset
- `StreamingEouAsrManager`: Cache management, state reset

## Impact

- **Code reduction:** ~230 duplicate lines eliminated
- **Reusable utilities:** 430 lines of generic, type-safe code
- **Test coverage:** +12 comprehensive regression tests
- **API consistency:** Unified lifecycle naming across all managers
- **Performance:** No regression (verified via benchmarks)
- **Tests:** 25/25 passing ✅

Closes #457

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
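To illustrate the kind of matching the deduplication utilities perform, here is a small sketch of suffix/prefix overlap detection between two token chunks. It is a plain longest-first scan written for this example, not the project's `findSuffixPrefixMatch()`; the function name, signature, and complexity are assumptions.

```swift
// Returns the length of the longest suffix of `previous` that equals a prefix
// of `current`, considering at most `maxOverlap` tokens. Returns 0 if none.
func sketchSuffixPrefixOverlap<T: Equatable>(
    previous: [T],
    current: [T],
    maxOverlap: Int
) -> Int {
    let limit = min(maxOverlap, previous.count, current.count)
    guard limit > 0 else { return 0 }
    // Try the longest candidate first so the first hit is the best one.
    for length in stride(from: limit, through: 1, by: -1) {
        if previous.suffix(length).elementsEqual(current.prefix(length)) {
            return length
        }
    }
    return 0
}
```

A caller would then drop that many leading tokens from the new chunk, for example `Array(current.dropFirst(overlap))`, before appending it to the transcript.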
Removed the consolidation step before mapping LCS matches to index pairs. The mergeUsingMatches function requires one pair per matched element to work correctly. When consecutive LCS matches are merged, tokens between anchors get lost or misaligned, causing the final matched token to be lost as an anchor and potentially duplicating trailing content. Fixes #494
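The "one pair per matched element" requirement can be pictured with a textbook LCS that backtracks into index pairs. This is a generic dynamic-programming sketch under assumed names (`sketchLcsIndexPairs`); it is not the project's `findLongestCommonSubsequence()` or `mergeUsingMatches`.

```swift
// Computes a longest common subsequence of `left` and `right` and returns
// one (leftIndex, rightIndex) pair for every matched element.
func sketchLcsIndexPairs<T: Equatable>(_ left: [T], _ right: [T]) -> [(Int, Int)] {
    let n = left.count
    let m = right.count
    guard n > 0, m > 0 else { return [] }

    // table[i][j] holds the LCS length of left[0..<i] and right[0..<j].
    var table = Array(repeating: Array(repeating: 0, count: m + 1), count: n + 1)
    for i in 1...n {
        for j in 1...m {
            if left[i - 1] == right[j - 1] {
                table[i][j] = table[i - 1][j - 1] + 1
            } else {
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
            }
        }
    }

    // Backtrack from the bottom-right corner, emitting a pair per match.
    var pairs: [(Int, Int)] = []
    var i = n
    var j = m
    while i > 0 && j > 0 {
        if left[i - 1] == right[j - 1] {
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        } else if table[i - 1][j] >= table[i][j - 1] {
            i -= 1
        } else {
            j -= 1
        }
    }
    return pairs.reversed()
}
```

If consecutive pairs were first merged into runs, a consumer that expects one anchor per matched element would see fewer anchors than matches, which is the class of problem this commit describes.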
PocketTTS Smoke Test ✅
Runtime: 0m54s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality may differ from Apple Silicon.
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 292.4s processing • Test runtime: 4m 54s • 04/07/2026, 04:12 PM EST
Kokoro TTS Smoke Test ✅
Runtime: 0m53s
Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE; performance may differ from Apple Silicon.
✅ Japanese ASR Benchmark Results (CTC)
Status: Passed
✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 3m 12s • 2026-04-07T20:12:54.963Z
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 4m54s
Note: CI VM lacks physical GPU; CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 7m34s • 04/07/2026, 04:07 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
9ab252f to 4a63d1d
    let maxSearchLength = min(15, previous.count)

    if let match = SequenceMatcher.findBoundedSubstringMatch(
        previous: previous,
        current: workingCurrent,
        maxSearchLength: maxSearchLength,
        boundarySearchFrames: boundarySearchFrames,
        matcher: exactMatcher
    ) {
        logger.debug(
            "Found duplicate sequence length=\(match.length) at currStart=\(match.rightStartIndex) (boundarySearch=\(boundarySearchFrames))"
        )
        let finalRemoved = removedCount + match.rightStartIndex + match.length
        return (Array(workingCurrent.dropFirst(match.rightStartIndex + match.length)), finalRemoved)
🟡 Stage 3 bounded substring search drops `maxOverlap` constraint, allowing longer matches than before
The refactoring of `removeDuplicateTokenSequence` Stage 3 loses the `maxOverlap` (default 12) constraint on match length. In the old code, both Stage 2 and Stage 3 shared `maxMatchLength = min(maxOverlap, workingCurrent.count)`, which capped the overlap search loop at `min(maxSearchLength=15, maxMatchLength=12, ...)`, effectively `min(12, previous.count, current.count)`. The new code passes `maxSearchLength = min(15, previous.count)` to `SequenceMatcher.findBoundedSubstringMatch` (SequenceMatcher.swift:85), which loops `(2...min(maxSearchLength, current.count))`, effectively `min(15, previous.count, current.count)`. The `maxOverlap` bound is not propagated. When both `previous.count > 12` and `workingCurrent.count > 12`, Stage 3 now searches for overlaps up to 15 tokens instead of 12, potentially removing more tokens than the old code intended.
Prompt for agents
The `maxOverlap` constraint from `removeDuplicateTokenSequence` is not passed through to `SequenceMatcher.findBoundedSubstringMatch` for Stage 3. In the old code, the overlap loop was bounded by `min(maxSearchLength, maxMatchLength)` where `maxMatchLength = min(maxOverlap, workingCurrent.count)`. The new code only passes `maxSearchLength` to `findBoundedSubstringMatch`, which internally loops `(2...min(maxSearchLength, current.count))` — losing the `maxOverlap` cap on match length.
To fix: Either add a `maxMatchLength` parameter to `findBoundedSubstringMatch` in SequenceMatcher.swift that limits the overlap search range (similar to how `findSuffixPrefixMatch` uses `maxOverlap`), or pass `min(maxOverlap, workingCurrent.count)` as the `maxSearchLength` parameter instead of `min(15, previous.count)`. The cleanest fix would be adding an optional `maxOverlapLength` parameter to `findBoundedSubstringMatch` that defaults to `Int.max` and is applied as `min(maxSearchLength, maxOverlapLength, current.count)` in the loop range.
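The suggested cap can be sketched in isolation as a helper that computes the overlap lengths to try. The helper name `sketchOverlapLengths` and its standalone shape are assumptions for illustration; in the actual change the bound would be applied inside `findBoundedSubstringMatch`, whose full signature is not shown here.

```swift
// Computes the range of overlap lengths to search. `maxOverlapLength` defaults
// to Int.max so callers that omit it keep the uncapped behavior.
// Returning nil when fewer than 2 tokens are available is a choice made for
// this sketch, since a range like 2...1 cannot be formed.
func sketchOverlapLengths(
    maxSearchLength: Int,
    currentCount: Int,
    maxOverlapLength: Int = Int.max
) -> ClosedRange<Int>? {
    let upper = min(maxSearchLength, maxOverlapLength, currentCount)
    return upper >= 2 ? 2...upper : nil
}
```

With `maxSearchLength` of 15 and 20 current tokens, omitting the new parameter yields lengths 2 through 15, while passing `maxOverlapLength: 12` restores the old ceiling of 12.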
This PR addresses three high-priority consistency improvements in the Parakeet ASR folder from issue #457.
Summary
Changes
1. Lifecycle Method Standardization
Unified naming conventions to eliminate confusion:
| Manager | Before | After |
| --- | --- | --- |
| `AsrManager` | `loadModels(_:)` | `configure(models:)` |
| `SlidingWindowAsrSession` | `initialize()` | `loadModels()` |
| `SlidingWindowAsrManager` | `start()` | `startStreaming()` |
| `StreamingEouAsrManager` | `loadModelsFromHuggingFace()` | `loadModels()` |

Files updated: 5 managers + 8 CLI commands
2. Token Deduplication Consolidation
Extracted duplicate matching algorithms into generic, type-safe utilities:
New Files:
- `SequenceMatch.swift` - Data structure for sequence matches
- `SequenceMatcher.swift` - 5 reusable matching algorithms:
  - `findSuffixPrefixMatch()` - O(n) greedy boundary detection
  - `findBoundedSubstringMatch()` - Windowed search
  - `findLongestCommonSubsequence()` - O(n²) LCS via DP
  - `findContiguousMatches()` - Longest consecutive run
  - `consolidateMatches()` - Merge adjacent matches
- `TokenDeduplicationRegressionTests.swift` - 12 comprehensive tests

Refactored:
- `AsrManager+TokenProcessing.swift` - Reduced from ~65 to ~40 lines (-38%)
- `ChunkProcessor.swift` - Removed ~77 lines of duplicate code

3. Streaming Code Extraction
Created utilities for common patterns in both `StreamingEouAsrManager` and `StreamingNemotronAsrManager`:

New Utilities:
- `EncoderCacheManager` - Cache initialization and extraction
- `StreamingAsrUtils` - Audio buffering, state reset, token decoding

Impact
Testing
Breaking Changes
Closes #457