Commit 00ce4c6

feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference (#21)
* feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference

  Add CoreML conversion pipeline for Qwen3-ForcedAligner-0.6B, a non-autoregressive
  forced alignment model that produces per-word timestamps from audio + text.

  The pipeline splits the model into 5 CoreML components:
  - Audio conv frontend (per-chunk mel → conv features)
  - Audio transformer (cross-chunk bidirectional attention + projection)
  - Token embedding (vocab → hidden states)
  - Decoder prefill (28-layer Qwen3 decoder, single NAR pass)
  - LM head (hidden states → 5000 timestamp bins)

  Key design decisions:
  - Audio encoder split into conv + transformer to preserve cross-chunk attention
    (monolithic per-chunk approach had 20.7ms AAS vs 4.4ms split)
  - MRoPE cos/sin computed outside the model for flexibility
  - Last mel chunk trimmed after conv to remove padding artifacts
  - Decoder and LM head use FLOAT32 precision to avoid FP16 overflow

  Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 boundaries):
  - AAS: 4.4ms, within 20ms: 95.4%, within 80ms: 99.1%

* docs: document both monolithic and split encoder approaches

  The inference script supports two audio encoder paths with auto-detection.
  Split encoder (audio_conv + audio_transformer) preserves cross-chunk attention
  for 4.4ms AAS. Monolithic encoder (audio_encoder) is faster but lacks
  cross-chunk attention (20.7ms AAS). Added comparison table and updated
  architecture, I/O shapes, inference pipeline, conversion, and parity sections.

* docs: add Swift/CoreML integration bugs for ForcedAligner

  Document 5 bugs encountered during FluidAudio Swift integration: MLMultiArray
  stride issues, encoder 3D shape, Slaney vs HTK mel, STFT center padding, and
  MRoPE position clamping.
1 parent 4811d06 commit 00ce4c6

File tree: 9 files changed (+4119, -0 lines)
# Qwen3-ForcedAligner-0.6B → CoreML

CoreML conversion of [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B) for on-device forced alignment on Apple platforms.

## Model Overview

Qwen3-ForcedAligner is a **non-autoregressive (NAR)** forced alignment model that takes audio + text and outputs per-word timestamps. It uses the same `Qwen3ASRForConditionalGeneration` architecture as Qwen3-ASR but runs inference differently — a single prefill pass instead of autoregressive decoding.

- **Parameters:** 0.6B
- **Languages:** 11 (Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish)
- **Max audio:** 5 minutes
- **Timestamp resolution:** 80ms segments
- **License:** Apache 2.0
## Architecture

Two audio encoder approaches are available. The inference script auto-detects which
to use based on which `.mlpackage` files are present.

### Split Encoder (higher accuracy)

```
Qwen3ASRForConditionalGeneration
└── thinker
    ├── audio_tower (24-layer Transformer, 1024 dim)
    │   ├── conv frontend → forced_aligner_audio_conv.mlpackage
    │   └── transformer + projection → forced_aligner_audio_transformer.mlpackage
    ├── model (28-layer Qwen3 decoder, 1024 dim) → forced_aligner_decoder_prefill.mlpackage
    │   └── embed_tokens → forced_aligner_embedding.mlpackage
    └── lm_head (1024 → 5000) → forced_aligner_lm_head.mlpackage
```

The audio encoder is split into two CoreML models to preserve cross-chunk attention.
Conv runs per-chunk, then all conv outputs are concatenated and passed through the
transformer in a single call with full bidirectional attention across all frames.
This matches the native PyTorch behavior and gives the best accuracy (4.4ms AAS).
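The chunk-then-concatenate flow can be sketched in a few lines of NumPy. Here `conv_model` and `transformer_model` are stand-ins for the two compiled `.mlpackage` calls, and the constants mirror the shapes documented below (100 mel frames per window, 13 conv frames per chunk, a fixed 256-frame transformer input); the last-chunk trimming that the real pipeline performs after conv is omitted for brevity:

```python
import numpy as np

CHUNK_FRAMES = 100   # mel frames per window
CONV_FRAMES = 13     # frames produced per chunk after 8x conv downsampling
MAX_FRAMES = 256     # fixed transformer input length

def run_split_encoder(mel, conv_model, transformer_model):
    """Per-chunk conv, then a single transformer call over all features."""
    chunks = []
    for start in range(0, mel.shape[1], CHUNK_FRAMES):
        window = mel[:, start:start + CHUNK_FRAMES]
        if window.shape[1] < CHUNK_FRAMES:  # pad the short last window
            window = np.pad(window, ((0, 0), (0, CHUNK_FRAMES - window.shape[1])))
        chunks.append(conv_model(window[None]))      # each [1, 13, 1024]
    feats = np.concatenate(chunks, axis=1)           # [1, n_chunks*13, 1024]
    n_frames = feats.shape[1]
    # Pad to the fixed 256-frame input; trim the padding off the output.
    feats = np.pad(feats, ((0, 0), (0, MAX_FRAMES - n_frames), (0, 0)))
    return transformer_model(feats)[:, :n_frames]    # [1, n_frames, 1024]
```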
### Monolithic Encoder (faster, simpler)

```
Qwen3ASRForConditionalGeneration
└── thinker
    ├── audio_tower (24-layer Transformer, 1024 dim) → forced_aligner_audio_encoder.mlpackage
    ├── model (28-layer Qwen3 decoder, 1024 dim) → forced_aligner_decoder_prefill.mlpackage
    │   └── embed_tokens → forced_aligner_embedding.mlpackage
    └── lm_head (1024 → 5000) → forced_aligner_lm_head.mlpackage
```

The entire audio encoder (conv + transformer + projection) is a single CoreML model
that processes each 100-frame mel chunk independently. This is faster (one model call
per chunk, no concatenation step) but each chunk's 13 output frames only see their own
context — no cross-chunk attention. This reduces accuracy (20.7ms AAS) but may be
acceptable depending on the use case.
### Which to use?

| | Split Encoder | Monolithic Encoder |
|---|---|---|
| Models | `audio_conv` + `audio_transformer` | `audio_encoder` |
| Size | ~1.1GB combined | ~604MB |
| AAS (mean boundary error) | **4.4 ms** | 20.7 ms |
| % within 20ms | **95.4%** | 90.7% |
| Cross-chunk attention | Yes | No |
| Model calls (audio) | N conv + 1 transformer | N encoder |
| Best for | Accuracy-critical alignment | Latency-sensitive / real-time |

The inference script (`run_coreml_inference.py`) checks for `audio_conv` + `audio_transformer`
first. If found, it uses the split approach. Otherwise it falls back to the monolithic
`audio_encoder` if present.
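The auto-detection amounts to a file-existence check on the package names listed above. A minimal sketch (the function name `select_encoder` is hypothetical, not the script's actual API):

```python
from pathlib import Path

def select_encoder(model_dir):
    """Prefer the split encoder when both of its packages exist;
    otherwise fall back to the monolithic encoder."""
    d = Path(model_dir)
    conv = d / "forced_aligner_audio_conv.mlpackage"
    transformer = d / "forced_aligner_audio_transformer.mlpackage"
    monolithic = d / "forced_aligner_audio_encoder.mlpackage"
    if conv.exists() and transformer.exists():
        return "split"
    if monolithic.exists():
        return "monolithic"
    raise FileNotFoundError("no audio encoder .mlpackage found")
```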
### Key Differences from Qwen3-ASR-0.6B

| | ASR-0.6B | ForcedAligner-0.6B |
|---|---|---|
| Audio encoder layers | 18 | **24** |
| Audio encoder dim | 896 | **1024** |
| Audio encoder heads | 14 | **16** |
| Vocab size | 151,936 | **152,064** |
| RoPE | standard | **interleaved MRoPE** |
| Inference | autoregressive | **NAR (single prefill)** |
| Output | text tokens | **ms timestamps** |
## Input/Output Shapes

### Split Encoder

#### Audio Conv (per-chunk)
```
Input:  mel_input     [1, 128, 100]  float32  (128 mel bins, 100 frames = 1 window)
Output: conv_features [1, 13, 1024]  float32  (13 frames after 8x conv downsampling)
```

#### Audio Transformer (all chunks concatenated)
```
Input:  features         [1, 256, 1024]  float32  (padded concatenated conv features)
Output: audio_embeddings [1, 256, 1024]  float32  (trim to actual frame count)
```
### Monolithic Encoder

#### Audio Encoder (per-chunk)
```
Input:  mel_input        [1, 128, 100]  float32  (128 mel bins, 100 frames = 1 window)
Output: audio_embeddings [1, 13, 1024]  float32  (13 frames, trim for short last chunk)
```

### Token Embedding
```
Input:  input_ids  [1, seq_len]        int32    (seq_len ∈ [1, 1024])
Output: embeddings [1, seq_len, 1024]  float32
```
### Decoder Prefill (NAR)
```
Input:  hidden_states [1, 1024, 1024]  float32  (full sequence)
        position_cos  [1, 1024, 128]   float32  (RoPE cos)
        position_sin  [1, 1024, 128]   float32  (RoPE sin)
Output: output_hidden [1, 1024, 1024]  float32
```
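Because the decoder takes `position_cos`/`position_sin` as explicit inputs, the host computes them before the prefill call. Below is a plain-RoPE sketch of those tables; the actual pipeline uses Qwen3's interleaved MRoPE, which arranges the frequency dimensions differently, so treat this only as a shape and format illustration:

```python
import numpy as np

def rope_cos_sin(seq_len=1024, head_dim=128, base=10000.0):
    """Build [1, seq_len, head_dim] cos/sin tables for plain RoPE."""
    # One frequency per rotation pair, duplicated to fill head_dim.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)   # [head_dim/2]
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]          # [seq_len, head_dim/2]
    angles = np.concatenate([angles, angles], axis=-1)                # [seq_len, head_dim]
    return (np.cos(angles)[None].astype(np.float32),
            np.sin(angles)[None].astype(np.float32))
```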
### LM Head
```
Input:  hidden_states [1, seq_len, 1024]  float32  (seq_len ∈ [1, 1024])
Output: logits        [1, seq_len, 5000]  float32  (raw timestamp values, NOT vocab tokens)
```

> **Note:** The LM head output dim is 5000 (not vocab_size 152,064). The ForcedAligner
> predicts raw timestamp values via argmax, where each value × 80ms = absolute time.
> 5000 × 80ms = 400s, covering up to ~6.7 minutes of audio.
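Turning logits into times is then a per-position argmax plus the 80ms scale described in the note above. A minimal sketch, where `timestamp_positions` are the indices of the `<timestamp>` tokens in the sequence:

```python
import numpy as np

SEGMENT_MS = 80  # each of the 5000 LM-head bins is one 80ms segment

def decode_timestamps(logits, timestamp_positions):
    """argmax over the 5000 bins at each <timestamp> position,
    then scale the bin index to milliseconds."""
    bins = logits[0, timestamp_positions].argmax(axis=-1)  # raw timestamp values
    return bins * SEGMENT_MS
```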
## Inference Pipeline

Steps 1-3 differ depending on the encoder approach. Steps 4-11 are shared.

### Split Encoder (steps 1-3)
```
1. Audio → Whisper mel spectrogram → [1, 128, T]
2. Chunk mel into 100-frame windows → Audio Conv (per-chunk) → conv features
3. Concatenate all conv features → pad to 256 → Audio Transformer → audio embeddings
```
### Monolithic Encoder (steps 1-3)
```
1. Audio → Whisper mel spectrogram → [1, 128, T]
2. Chunk mel into 100-frame windows → Audio Encoder (per-chunk) → embeddings
3. Concatenate per-chunk embeddings (trim last chunk to actual frames)
```
### Shared (steps 4-11)
```
4.  Tokenize text with <timestamp> delimiters between words
5.  Build input_ids: <audio_start> <audio_pad>... <audio_end> word1 <ts><ts> word2 <ts><ts> ...
6.  Embed: audio embeddings + text token embeddings → concatenated sequence
7.  Compute MRoPE cos/sin → Decoder prefill (single pass) → hidden states
8.  LM head → logits
9.  argmax at timestamp_token_id positions → raw timestamp values
10. Fix monotonicity (LIS algorithm) → final timestamps
11. Scale: ms = raw_value * 80 (timestamp_segment_time)
```
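Step 10's monotonicity fix can be sketched as: keep the longest non-decreasing subsequence of the raw values and overwrite the outliers with the nearest kept value to their left. The actual script may repair outliers differently (e.g. by interpolation); this O(n²) DP version just shows the idea:

```python
def enforce_monotonic(values):
    """LIS-based monotonicity repair for a list of raw timestamp values."""
    n = len(values)
    if n == 0:
        return []
    best = [1] * n    # length of longest non-decreasing run ending at i
    prev = [-1] * n   # backpointer for reconstruction
    for i in range(n):
        for j in range(i):
            if values[j] <= values[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    end = max(range(n), key=lambda i: best[i])
    keep = set()
    while end != -1:
        keep.add(end)
        end = prev[end]
    # Outliers inherit the closest kept value to their left.
    out = list(values)
    last = values[min(keep)]
    for i in range(n):
        if i in keep:
            last = values[i]
        else:
            out[i] = last
    return out
```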
## Conversion

```bash
# Install dependencies
uv pip install torch coremltools transformers typer soundfile

# Clone Qwen3-ASR source (required for model classes)
git clone https://github.com/QwenLM/Qwen3-ASR.git /path/to/qwen3-asr

# Convert split encoder (default — higher accuracy)
uv run python convert-coreml.py

# Convert monolithic encoder (faster)
uv run python convert-coreml.py --components audio_encoder embedding decoder_prefill lm_head

# Convert all components (both encoder approaches)
uv run python convert-coreml.py --components audio_conv audio_transformer audio_encoder embedding decoder_prefill lm_head
```
## Benchmarking

```bash
# Generate PyTorch reference timestamps from cached test-clean
uv run python compare-models.py --num-files 10 --output results/pytorch_reference.json

# Single file mode
uv run python compare-models.py --audio-file audio.wav --text "hello world" --language English
```
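AAS here is reported as the mean boundary error between CoreML and PyTorch word boundaries. A sketch of the metric as assumed (the function name `aas_ms` is hypothetical; the actual script may weight or filter boundaries differently):

```python
def aas_ms(pred_ms, ref_ms):
    """Mean absolute boundary error in milliseconds between two
    equal-length lists of word-boundary timestamps."""
    assert len(pred_ms) == len(ref_ms) and len(pred_ms) > 0
    return sum(abs(p - r) for p, r in zip(pred_ms, ref_ms)) / len(pred_ms)
```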
### Parity Metrics (3 LibriSpeech test-clean samples, 54 word boundaries)

#### Split Encoder

| Metric | Value | Notes |
|--------|-------|-------|
| AAS (mean boundary error) | 4.4 ms | lower is better |
| Max boundary error | 160 ms | single position, 2 segments |
| % within 20ms | 95.4% | |
| % within 80ms (1 segment) | 99.1% | 80ms = 1 timestamp segment |
| % within 160ms (2 segments) | 100.0% | |
| PyTorch latency (avg) | ~4736 ms | CPU, includes first-run warmup |
| CoreML latency (avg) | ~2781 ms | ALL compute units |

Per-sample results:
- Long (28 words): 1.4ms AAS, 98.2% within 20ms
- Short (8 words): 10.0ms AAS, 87.5% within 20ms
- Medium (18 words): 6.7ms AAS, 94.4% within 20ms
#### Monolithic Encoder

| Metric | Value | Notes |
|--------|-------|-------|
| AAS (mean boundary error) | 20.7 ms | ~5x worse than split |
| % within 20ms | 90.7% | |
| % within 80ms (1 segment) | 92.6% | |
| % within 160ms (2 segments) | 96.3% | |

The accuracy gap is caused by each chunk's 13 frames only attending to themselves
in the transformer, missing the cross-chunk context that the native PyTorch encoder provides.
## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|audio_start\|>` | 151669 | Start of audio embeddings |
| `<\|audio_end\|>` | 151670 | End of audio embeddings |
| `<\|audio_pad\|>` | 151676 | Audio embedding placeholder |
| `<timestamp>` | 151705 | Timestamp prediction position |
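Using the IDs above, the step-5 sequence from the inference pipeline can be sketched as follows. Chat-template wrapping and the real tokenizer are omitted; two `<timestamp>` slots follow each word as shown in step 5 (reading them as the word's start and end time is an assumption):

```python
AUDIO_START, AUDIO_END, AUDIO_PAD, TIMESTAMP = 151669, 151670, 151676, 151705

def build_input_ids(n_audio_frames, word_token_ids):
    """One <|audio_pad|> per audio frame, then each word's tokens
    followed by two <timestamp> slots; also returns the slot indices."""
    ids = [AUDIO_START] + [AUDIO_PAD] * n_audio_frames + [AUDIO_END]
    ts_positions = []
    for word in word_token_ids:
        ids.extend(word)
        ts_positions.append(len(ids))   # first <timestamp> slot
        ids.append(TIMESTAMP)
        ts_positions.append(len(ids))   # second <timestamp> slot
        ids.append(TIMESTAMP)
    return ids, ts_positions
```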
## LM Head Architecture

The ForcedAligner's LM head is **not** the same as the ASR model's:

| | ASR LM Head | ForcedAligner LM Head |
|---|---|---|
| Output dim | 151,936 (vocab tokens) | **5,000** (raw timestamp values) |
| Purpose | Next-token prediction | Timestamp regression via argmax |
| Decoding | argmax → token ID → text | argmax → raw_value × 80ms → time |

The embedding table is still 152,064 tokens (shared architecture), but the LM head
projects to 5,000 outputs — enough for timestamps up to 400 seconds at 80ms resolution.
## Known Issues

See [problems_encountered.md](./problems_encountered.md) for a detailed conversion journal.

## References

- **Model:** [Qwen/Qwen3-ForcedAligner-0.6B](https://huggingface.co/Qwen/Qwen3-ForcedAligner-0.6B)
- **Paper:** [arXiv:2601.21337](https://arxiv.org/abs/2601.21337)
- **Source:** [QwenLM/Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR)
- **Community request:** [FluidAudio#49](https://github.com/FluidInference/FluidAudio/issues/49)