# Qwen3-ForcedAligner-0.6B → CoreML

CoreML conversion of [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) for on-device forced alignment on Apple platforms.

## Model Overview

Qwen3-ForcedAligner is a **non-autoregressive (NAR)** forced alignment model that takes audio + text and outputs per-word timestamps. It uses the same `Qwen3ASRForConditionalGeneration` architecture as Qwen3-ASR but runs inference differently: a single prefill pass instead of autoregressive decode.

- **Parameters:** 0.6B
- **Languages:** 11 (Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish)
- **Max audio:** 5 minutes
- **Timestamp resolution:** 80ms segments
- **License:** Apache 2.0

## Architecture

Two audio encoder approaches are available. The inference script auto-detects which
to use based on which `.mlpackage` files are present.

### Split Encoder (higher accuracy)

```
Qwen3ASRForConditionalGeneration
 └── thinker
     ├── audio_tower (24-layer Transformer, 1024 dim)
     │     ├── conv frontend → forced_aligner_audio_conv.mlpackage
     │     └── transformer + projection → forced_aligner_audio_transformer.mlpackage
     ├── model (28-layer Qwen3 decoder, 1024 dim) → forced_aligner_decoder_prefill.mlpackage
     │     └── embed_tokens → forced_aligner_embedding.mlpackage
     └── lm_head (1024 → 5000) → forced_aligner_lm_head.mlpackage
```

The audio encoder is split into two CoreML models to preserve cross-chunk attention.
Conv runs per-chunk, then all conv outputs are concatenated and passed through the
transformer in a single call with full bidirectional attention across all frames.
This matches the native PyTorch behavior and gives the best accuracy (4.4ms AAS).
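
The chunk → conv → concatenate → pad flow can be sketched as follows. This is a NumPy-only shape sketch, not the actual inference code: `fake_conv` stands in for `forced_aligner_audio_conv.mlpackage`, and the `ceil(n/8)` output-frame count for a short final chunk is an assumption extrapolated from the 100 → 13 figure above.

```python
import numpy as np

CHUNK_FRAMES = 100   # mel frames per window
PAD_LEN = 256        # fixed transformer input length

def fake_conv(chunk):
    # Stand-in for forced_aligner_audio_conv.mlpackage:
    # [1, 128, n] mel -> [1, ceil(n/8), 1024] features (8x downsampling).
    n_out = -(-chunk.shape[2] // 8)  # ceil division
    return np.zeros((1, n_out, 1024), dtype=np.float32)

def encode_split(mel):
    """mel: [1, 128, T] -> padded features [1, 256, 1024] plus true frame count."""
    chunks = [mel[:, :, i:i + CHUNK_FRAMES] for i in range(0, mel.shape[2], CHUNK_FRAMES)]
    feats = np.concatenate([fake_conv(c) for c in chunks], axis=1)
    n_frames = feats.shape[1]
    padded = np.zeros((1, PAD_LEN, 1024), dtype=np.float32)
    padded[:, :n_frames] = feats
    # `padded` then goes through forced_aligner_audio_transformer.mlpackage
    # in ONE call, and its output is trimmed back to n_frames.
    return padded, n_frames

# 250 mel frames -> chunks of 100/100/50 -> 13 + 13 + 7 = 33 conv frames
padded, n = encode_split(np.zeros((1, 128, 250), dtype=np.float32))
```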

### Monolithic Encoder (faster, simpler)

```
Qwen3ASRForConditionalGeneration
 └── thinker
     ├── audio_tower (24-layer Transformer, 1024 dim) → forced_aligner_audio_encoder.mlpackage
     ├── model (28-layer Qwen3 decoder, 1024 dim) → forced_aligner_decoder_prefill.mlpackage
     │     └── embed_tokens → forced_aligner_embedding.mlpackage
     └── lm_head (1024 → 5000) → forced_aligner_lm_head.mlpackage
```

The entire audio encoder (conv + transformer + projection) is a single CoreML model
that processes each 100-frame mel chunk independently. This is faster (one model call
per chunk, no concatenation step) but each chunk's 13 output frames only see their own
context; there is no cross-chunk attention. This reduces accuracy (20.7ms AAS) but may be
acceptable depending on the use case.

### Which to use?

| | Split Encoder | Monolithic Encoder |
|---|---|---|
| Models | `audio_conv` + `audio_transformer` | `audio_encoder` |
| Size | ~1.1GB combined | ~604MB |
| AAS (mean boundary error) | **4.4 ms** | 20.7 ms |
| % within 20ms | **95.4%** | 90.7% |
| Cross-chunk attention | Yes | No |
| Model calls (audio) | N conv + 1 transformer | N encoder |
| Best for | Accuracy-critical alignment | Latency-sensitive / real-time |

The inference script (`run_coreml_inference.py`) checks for `audio_conv` + `audio_transformer`
first. If found, it uses the split approach. Otherwise it falls back to the monolithic
`audio_encoder` if present.

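The detection order described above amounts to a simple filesystem check. A minimal sketch (illustrative, not the actual `run_coreml_inference.py` code):

```python
from pathlib import Path

def pick_encoder(model_dir: str) -> str:
    """Return which encoder approach the given directory supports,
    preferring the split encoder when both are present."""
    d = Path(model_dir)
    if ((d / "forced_aligner_audio_conv.mlpackage").exists()
            and (d / "forced_aligner_audio_transformer.mlpackage").exists()):
        return "split"
    if (d / "forced_aligner_audio_encoder.mlpackage").exists():
        return "monolithic"
    raise FileNotFoundError(f"no audio encoder .mlpackage found in {model_dir}")
```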
### Key Differences from Qwen3-ASR-0.6B

| | ASR-0.6B | ForcedAligner-0.6B |
|---|---|---|
| Audio encoder layers | 18 | **24** |
| Audio encoder dim | 896 | **1024** |
| Audio encoder heads | 14 | **16** |
| Vocab size | 151,936 | **152,064** |
| RoPE | standard | **interleaved mrope** |
| Inference | autoregressive | **NAR (single prefill)** |
| Output | text tokens | **ms timestamps** |

## Input/Output Shapes

### Split Encoder

#### Audio Conv (per-chunk)
```
Input:  mel_input [1, 128, 100] float32 (128 mel bins, 100 frames = 1 window)
Output: conv_features [1, 13, 1024] float32 (13 frames after 8x conv downsampling)
```

#### Audio Transformer (all chunks concatenated)
```
Input:  features [1, 256, 1024] float32 (padded concatenated conv features)
Output: audio_embeddings [1, 256, 1024] float32 (trim to actual frame count)
```

### Monolithic Encoder

#### Audio Encoder (per-chunk)
```
Input:  mel_input [1, 128, 100] float32 (128 mel bins, 100 frames = 1 window)
Output: audio_embeddings [1, 13, 1024] float32 (13 frames, trim for short last chunk)
```

### Token Embedding
```
Input:  input_ids [1, seq_len] int32 (seq_len ∈ [1, 1024])
Output: embeddings [1, seq_len, 1024] float32
```

### Decoder Prefill (NAR)
```
Input:  hidden_states [1, 1024, 1024] float32 (full sequence)
        position_cos [1, 1024, 128] float32 (RoPE cos)
        position_sin [1, 1024, 128] float32 (RoPE sin)
Output: output_hidden [1, 1024, 1024] float32
```

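For shape orientation, a standard (non-interleaved) RoPE table of the size the decoder expects can be built as below. This is illustrative only: the model actually uses interleaved mrope, so the frequency layout differs, and the base `theta=10000.0` is an assumption.

```python
import numpy as np

def rope_tables(seq_len=1024, head_dim=128, theta=10000.0):
    # Standard RoPE cos/sin tables shaped [1, seq_len, head_dim].
    inv_freq = 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)  # [64]
    pos = np.arange(seq_len, dtype=np.float64)                        # [1024]
    freqs = np.outer(pos, inv_freq)                                   # [1024, 64]
    emb = np.concatenate([freqs, freqs], axis=-1)                     # [1024, 128]
    cos = np.cos(emb)[None].astype(np.float32)                        # [1, 1024, 128]
    sin = np.sin(emb)[None].astype(np.float32)
    return cos, sin

cos, sin = rope_tables()
```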
### LM Head
```
Input:  hidden_states [1, seq_len, 1024] float32 (seq_len ∈ [1, 1024])
Output: logits [1, seq_len, 5000] float32 (raw timestamp values, NOT vocab tokens)
```

> **Note:** The LM head output dim is 5000 (not vocab_size 152064). The ForcedAligner
> predicts raw timestamp values via argmax, where each value × 80ms = absolute time.
> 5000 × 80ms = 400s, covering up to ~6.7 minutes of audio.

## Inference Pipeline

Steps 1-3 differ depending on encoder approach. Steps 4-11 are shared.

### Split Encoder (steps 1-3)
```
1. Audio → Whisper mel spectrogram → [1, 128, T]
2. Chunk mel into 100-frame windows → Audio Conv (per-chunk) → conv features
3. Concatenate all conv features → pad to 256 → Audio Transformer → audio embeddings
```

### Monolithic Encoder (steps 1-3)
```
1. Audio → Whisper mel spectrogram → [1, 128, T]
2. Chunk mel into 100-frame windows → Audio Encoder (per-chunk) → embeddings
3. Concatenate per-chunk embeddings (trim last chunk to actual frames)
```

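The monolithic per-chunk loop, including the short-last-chunk trim, can be sketched as follows. The encoder is passed in as a callable stand-in for the `.mlpackage`, and the `ceil(n/8)` trim rule for a padded final window is an assumption.

```python
import numpy as np

def encode_monolithic(mel, encoder):
    """mel: [1, 128, T]; encoder: callable [1, 128, 100] -> [1, 13, 1024].
    Pads the last window up to the fixed 100-frame input shape, then trims
    its output to the frames corresponding to real audio."""
    outs = []
    for i in range(0, mel.shape[2], 100):
        chunk = mel[:, :, i:i + 100]
        n = chunk.shape[2]
        if n < 100:  # short final window: pad to the fixed input shape
            chunk = np.pad(chunk, ((0, 0), (0, 0), (0, 100 - n)))
        out = encoder(chunk)          # [1, 13, 1024]
        keep = -(-n // 8)             # assumed ceil(n/8) real output frames
        outs.append(out[:, :keep])
    return np.concatenate(outs, axis=1)
```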
### Shared (steps 4-11)
```
4. Tokenize text with <timestamp> delimiters between words
5. Build input_ids: <audio_start> <audio_pad>... <audio_end> word1 <ts><ts> word2 <ts><ts> ...
6. Embed: audio embeddings + text token embeddings → concatenated sequence
7. Compute MRoPE cos/sin → Decoder prefill (single pass) → hidden states
8. LM head → logits
9. argmax at timestamp_token_id positions → raw timestamp values
10. Fix monotonicity (LIS algorithm) → final timestamps
11. Scale: ms = raw_value * 80 (timestamp_segment_time)
```

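Steps 9-11 can be sketched in Python. The LIS selection is what the pipeline names; how values *outside* the kept subsequence are repaired (here, snapped to the previous kept value) is an assumption, not necessarily the repo's exact rule.

```python
import numpy as np

TIMESTAMP_ID = 151705
SEGMENT_MS = 80

def fix_monotonic(vals):
    """Keep the longest non-decreasing subsequence (O(n^2) LIS for clarity)
    and snap the remaining values to the previous kept value."""
    n = len(vals)
    if n == 0:
        return []
    best, prev = [1] * n, [-1] * n
    for i in range(n):
        for j in range(i):
            if vals[j] <= vals[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    i = max(range(n), key=lambda k: best[k])
    keep = set()
    while i != -1:
        keep.add(i)
        i = prev[i]
    out, last = [], vals[min(keep)]
    for i, v in enumerate(vals):
        if i in keep:
            last = v
        out.append(last)
    return out

def decode_timestamps(logits, input_ids):
    """logits: [seq, 5000] array; returns one ms value per <timestamp> slot."""
    raw = [int(np.argmax(logits[i])) for i, t in enumerate(input_ids) if t == TIMESTAMP_ID]
    return [v * SEGMENT_MS for v in fix_monotonic(raw)]
```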
## Conversion

```bash
# Install dependencies
uv pip install torch coremltools transformers typer soundfile

# Clone Qwen3-ASR source (required for model classes)
git clone https://github.com/QwenLM/Qwen3-ASR.git /path/to/qwen3-asr

# Convert split encoder (default; higher accuracy)
uv run python convert-coreml.py

# Convert monolithic encoder (faster)
uv run python convert-coreml.py --components audio_encoder embedding decoder_prefill lm_head

# Convert all components (both encoder approaches)
uv run python convert-coreml.py --components audio_conv audio_transformer audio_encoder embedding decoder_prefill lm_head
```

## Benchmarking

```bash
# Generate PyTorch reference timestamps from cached test-clean
uv run python compare-models.py --num-files 10 --output results/pytorch_reference.json

# Single file mode
uv run python compare-models.py --audio-file audio.wav --text "hello world" --language English
```

### Parity Metrics (3 LibriSpeech test-clean samples, 54 word boundaries)

#### Split Encoder

| Metric | Value | Notes |
|--------|-------|-------|
| AAS (mean boundary error) | 4.4 ms | lower is better |
| Max boundary error | 160 ms | single position, 2 segments |
| % within 20ms | 95.4% | |
| % within 80ms (1 segment) | 99.1% | 80ms = 1 timestamp segment |
| % within 160ms (2 segments) | 100.0% | |
| PyTorch latency (avg) | ~4736 ms | CPU, includes first-run warmup |
| CoreML latency (avg) | ~2781 ms | ALL compute units |

Per-sample results:
- Long (28 words): 1.4ms AAS, 98.2% within 20ms
- Short (8 words): 10.0ms AAS, 87.5% within 20ms
- Medium (18 words): 6.7ms AAS, 94.4% within 20ms

#### Monolithic Encoder

| Metric | Value | Notes |
|--------|-------|-------|
| AAS (mean boundary error) | 20.7 ms | ~5x worse than split |
| % within 20ms | 90.7% | |
| % within 80ms (1 segment) | 92.6% | |
| % within 160ms (2 segments) | 96.3% | |

The accuracy gap is caused by each chunk's 13 frames only attending to themselves
in the transformer, missing cross-chunk context that the native PyTorch encoder provides.

## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|audio_start\|>` | 151669 | Start of audio embeddings |
| `<\|audio_end\|>` | 151670 | End of audio embeddings |
| `<\|audio_pad\|>` | 151676 | Audio embedding placeholder |
| `<timestamp>` | 151705 | Timestamp prediction position |

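Using the IDs above, step 5 of the inference pipeline (building `input_ids`) can be sketched as follows. This follows the step-5 layout literally (each word followed by a `<timestamp>` pair); real word tokenization and any chat-template wrapping are omitted.

```python
AUDIO_START, AUDIO_END, AUDIO_PAD, TIMESTAMP = 151669, 151670, 151676, 151705

def build_input_ids(n_audio_frames, word_token_ids):
    """n_audio_frames: audio embedding count (one <audio_pad> per frame).
    word_token_ids: per-word token-id lists from the tokenizer (stubbed here)."""
    ids = [AUDIO_START] + [AUDIO_PAD] * n_audio_frames + [AUDIO_END]
    for word in word_token_ids:
        ids += word + [TIMESTAMP, TIMESTAMP]  # one start/end slot pair per word
    return ids
```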
## LM Head Architecture

The ForcedAligner's LM head is **not** the same as the ASR model's:

| | ASR LM Head | ForcedAligner LM Head |
|---|---|---|
| Output dim | 151,936 (vocab tokens) | **5,000** (raw timestamp values) |
| Purpose | Next-token prediction | Timestamp regression via argmax |
| Decoding | argmax → token ID → text | argmax → raw_value × 80ms → time |

The embedding table is still 152,064 tokens (shared architecture), but the LM head
projects to 5,000 outputs, enough for timestamps up to 400 seconds at 80ms resolution.

## Known Issues

See [problems_encountered.md](./problems_encountered.md) for a detailed conversion journal.

## References

- **Model:** [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B)
- **Paper:** [arXiv:2601.21337](https://arxiv.org/abs/2601.21337)
- **Source:** [QwenLM/Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR)
- **Community request:** [FluidAudio#49](https://github.com/FluidInference/FluidAudio/issues/49)