Implementation of NVIDIA RNNT streaming patterns for last chunk handling in FluidAudio's Token-and-Duration Transducer (TDT) decoder.
Based on speech_to_text_streaming_infer_rnnt.py from NeMo Parakeet TDT documentation:
- Last Chunk Detection

```python
# NVIDIA approach
is_last_chunk_batch = chunk_length >= rest_audio_lengths
is_last_chunk = right_sample >= audio_batch.shape[1]
```

- Variable Chunk Length Handling

```python
chunk_lengths_batch = torch.where(
    is_last_chunk_batch,
    rest_audio_lengths,  # use remaining audio length
    torch.full_like(rest_audio_lengths, fill_value=chunk_length),
)
```

- Buffer Processing with Last Chunk Flag

```python
buffer.add_audio_batch_(
    audio_batch[:, left_sample:right_sample],
    audio_lengths=chunk_lengths_batch,
    is_last_chunk=is_last_chunk,
    is_last_chunk_batch=is_last_chunk_batch,
)
```

- State Management Across Chunks

```python
if current_batched_hyps is None:
    current_batched_hyps = chunk_batched_hyps
else:
    current_batched_hyps.merge_(chunk_batched_hyps)
```

The Swift port uses the same detection test:

```swift
// ChunkProcessor.swift
let isLastChunk = (centerStart + centerSamples) >= audioSamples.count
```

Method signatures were updated throughout:
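The last-chunk test and the variable chunk length select compose into a simple per-chunk schedule. A plain-Python sketch (no torch; `chunk_schedule` and its tuple layout are illustrative names, not part of either codebase):

```python
def chunk_schedule(total_samples: int, chunk_length: int):
    """Yield (start, end, n_samples, is_last_chunk) for each chunk of a stream."""
    left = 0
    while left < total_samples:
        rest = total_samples - left            # rest_audio_lengths
        is_last = chunk_length >= rest         # same test as is_last_chunk_batch
        n = rest if is_last else chunk_length  # same select as the torch.where above
        yield left, left + n, n, is_last
        left += n

schedule = list(chunk_schedule(total_samples=10_000, chunk_length=4_000))
# three chunks: two full 4,000-sample chunks, then a short 2,000-sample last chunk
```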
```swift
func executeMLInferenceWithTimings(
    _ paddedAudio: [Float],
    isLastChunk: Bool = false
) async throws -> (tokens: [Int], timestamps: [Int], encoderSequenceLength: Int)

func decodeWithTimings(
    encoderOutput: MLMultiArray,
    isLastChunk: Bool = false
) async throws -> (tokens: [Int], timestamps: [Int])
```

In TdtDecoder.swift, last-chunk draining follows the NVIDIA consecutive-blank pattern:
```swift
if isLastChunk {
    var additionalSteps = 0
    var consecutiveBlanks = 0
    let maxConsecutiveBlanks = 2  // exit after 2 blanks in a row
    while additionalSteps < maxSymbolsPerStep && consecutiveBlanks < maxConsecutiveBlanks {
        // Use the last valid encoder frame if beyond bounds
        let frameIndex = min(timeIndices, encoderFrames.count - 1)
        let encoderStep = encoderFrames[frameIndex]
        // Continue processing beyond the encoder length;
        // exit naturally when the decoder produces consecutive blanks
    }
}
```
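The drain loop above can be sketched in plain Python to show the termination behavior. `decode_step` stands in for the real joint/predictor call and `BLANK = -1` is an arbitrary sentinel; both are assumptions for illustration:

```python
BLANK = -1  # arbitrary sentinel for the blank token (assumption)

def drain_last_chunk(encoder_frames, time_index, decode_step,
                     max_symbols_per_step=10, max_consecutive_blanks=2):
    """Keep decoding against the (clamped) final encoder frame until the
    decoder emits max_consecutive_blanks blanks in a row or the per-step
    symbol budget is exhausted."""
    emitted = []
    steps = 0
    consecutive_blanks = 0
    while steps < max_symbols_per_step and consecutive_blanks < max_consecutive_blanks:
        # Use the last valid encoder frame if time_index is beyond bounds
        frame_index = min(time_index, len(encoder_frames) - 1)
        token = decode_step(encoder_frames[frame_index])
        if token == BLANK:
            consecutive_blanks += 1
        else:
            consecutive_blanks = 0
            emitted.append(token)
        steps += 1
    return emitted
```

With a scripted decoder that emits a token, a blank, a token, then two blanks, the loop emits both tokens and stops on the second consecutive blank rather than burning the full symbol budget.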
```swift
// TdtDecoderState.swift
mutating func finalizeLastChunk() {
    predictorOutput = nil  // clear cache
    timeJump = nil         // no more chunks
    // Keep lastToken and the LSTM states for context
}
```

- 15-second hard limit: Models cannot process audio longer than 240,000 samples (15s at 16kHz)
- No single-pass mode: Audio > 15s must use ChunkProcessor
- Fixed model architecture: Cannot modify Core ML models at runtime
- ANE alignment required: All arrays must be ANE-aligned for optimal performance
- Zero-copy operations: Minimize memory allocations during streaming
- State persistence: LSTM states must be maintained across chunk boundaries
- No @unchecked Sendable: All code must be properly thread-safe
- Actor-based concurrency: Use Swift actors for thread safety
- Persistent decoder states: States maintained across async boundaries
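The numbers behind these constraints can be sanity-checked in a few lines. The 80 ms-per-encoder-frame figure below is an assumption (8x subsampling of a 10 ms hop, typical for Parakeet-style encoders), not something stated above:

```python
SAMPLE_RATE = 16_000       # Hz, per the 15 s / 240,000-sample limit
SECONDS_PER_FRAME = 0.08   # assumed 80 ms per encoder frame (8x of a 10 ms hop)

def samples(seconds: float) -> int:
    return round(seconds * SAMPLE_RATE)

def frames(seconds: float) -> int:
    return round(seconds / SECONDS_PER_FRAME)

assert samples(15.0) == 240_000                           # hard model limit
assert frames(11.2) == 140 and frames(1.6) == 20          # chunk geometry below
assert samples(11.2) + 2 * samples(1.6) == samples(14.4)  # total stays under 15 s
```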
```swift
// Current chunking parameters (frame-aligned)
centerSeconds: 11.2        // 140 encoder frames
leftContextSeconds: 1.6    // 20 encoder frames
rightContextSeconds: 1.6   // 20 encoder frames
// Total: 14.4s (within 15s limit)
```

- timeIndices: Current position in encoder frames
- timeJump: Tracks processing beyond current chunk for streaming
- Bounds checking: Always clamp indices to prevent crashes
- lastToken: Maintains linguistic context between chunks
- predictorOutput: Cached LSTM output for performance
- hiddenState/cellState: LSTM memory across boundaries
- Only emit tokens beyond startFrameOffset to avoid duplicates
- Update decoder state regardless of emission for context preservation
- Force advancement after maxSymbolsPerStep to prevent infinite loops
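The de-duplication rule can be sketched as a filter over (token, frame) pairs. Names here are illustrative, and whether the boundary comparison is inclusive (>=) or strict (>) in the real decoder is an assumption:

```python
def filter_emissions(tokens_with_frames, start_frame_offset):
    """tokens_with_frames: (token, frame_index) pairs in decode order.
    Drops tokens from frames the previous chunk already emitted; the
    inclusive >= boundary is an assumption for this sketch."""
    return [tok for tok, frame in tokens_with_frames if frame >= start_frame_offset]

filter_emissions([(3, 18), (4, 20), (5, 25)], start_frame_offset=20)  # → [4, 5]
```

Decoder state (lastToken, LSTM memory) would still be updated for the dropped token at frame 18, which is what "update decoder state regardless of emission" means above.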
```
Audio Input (>15s)
        ↓
ChunkProcessor
  ├─ Chunk 0: isLastChunk=false
  └─ Chunk N: isLastChunk=true
        ↓
TdtDecoder.decodeWithTimings(isLastChunk: true)
  ├─ Main decoding loop
  └─ Last chunk finalization (if isLastChunk=true)
      ├─ Continue beyond encoder frames
      ├─ Use consecutive blank detection
      └─ Natural termination
```
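The routing decision at the top of the pipeline can be sketched as follows. This is illustrative only: it assumes audio within the 15 s limit takes a single pass, uses 230,400 samples (14.4 s at 16 kHz) as the chunk stride, and omits the left/right context overlap the real ChunkProcessor maintains:

```python
MAX_SINGLE_PASS_SAMPLES = 240_000   # 15 s at 16 kHz (hard model limit)
DEFAULT_CHUNK_SAMPLES = 230_400     # 14.4 s total chunk at 16 kHz (assumption)

def route(audio_len: int, chunk_len: int = DEFAULT_CHUNK_SAMPLES):
    """Return (start, end, is_last_chunk) segments. Audio within the model
    limit is one 'last' chunk; longer audio is split, with only the final
    segment flagged so the decoder runs its last-chunk finalization."""
    if audio_len <= MAX_SINGLE_PASS_SAMPLES:
        return [(0, audio_len, True)]
    segments, start = [], 0
    while start < audio_len:
        end = min(start + chunk_len, audio_len)
        segments.append((start, end, end == audio_len))
        start = end
    return segments
```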
- Sources/FluidAudio/ASR/Parakeet/ChunkProcessor.swift: Chunk detection logic
- Sources/FluidAudio/ASR/Parakeet/Decoder/TdtDecoderV3.swift: Finalization logic
- Sources/FluidAudio/ASR/Parakeet/Decoder/TdtDecoderState.swift: State management
- Sources/FluidAudio/ASR/Parakeet/AsrManager+Transcription.swift: Pipeline routing