feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference#21
feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference#21Alex-Wengg merged 3 commits intomainfrom
Conversation
Add CoreML conversion pipeline for Qwen3-ForcedAligner-0.6B, a non-autoregressive forced alignment model that produces per-word timestamps from audio + text. The pipeline splits the model into 5 CoreML components: - Audio conv frontend (per-chunk mel → conv features) - Audio transformer (cross-chunk bidirectional attention + projection) - Token embedding (vocab → hidden states) - Decoder prefill (28-layer Qwen3 decoder, single NAR pass) - LM head (hidden states → 5000 timestamp bins) Key design decisions: - Audio encoder split into conv + transformer to preserve cross-chunk attention (monolithic per-chunk approach had 20.7ms AAS vs 4.4ms split) - MRoPE cos/sin computed outside the model for flexibility - Last mel chunk trimmed after conv to remove padding artifacts - Decoder and LM head use FLOAT32 precision to avoid FP16 overflow Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 boundaries): - AAS: 4.4ms, within 20ms: 95.4%, within 80ms: 99.1%
The inference script supports two audio encoder paths with auto-detection. Split encoder (audio_conv + audio_transformer) preserves cross-chunk attention for 4.4ms AAS. Monolithic encoder (audio_encoder) is faster but lacks cross-chunk attention (20.7ms AAS). Added comparison table and updated architecture, I/O shapes, inference pipeline, conversion, and parity sections.
Document 5 bugs encountered during FluidAudio Swift integration: MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel, STFT center padding, and MRoPE position clamping.
26e11bd to
f03d418
Compare
| flac_map[f.name] = str(f) | ||
|
|
||
| all_errors = [] | ||
| for sample in ref_data[:num_files]: |
There was a problem hiding this comment.
🔴 Dict slicing TypeError in compare command — ref_data is a dict, not a list
ref_data is loaded via json.load(f) which returns a dict (with keys model_id, language, num_samples, samples), but ref_data[:num_files] on line 739 attempts to slice it, which raises TypeError: unhashable type: 'slice'. The compare-models.py producer (compare-models.py:214-218) wraps samples under a "samples" key, so this should be ref_data["samples"][:num_files].
| for sample in ref_data[:num_files]: | |
| for sample in ref_data["samples"][:num_files]: |
Was this helpful? React with 👍 or 👎 to provide feedback.
| else: | ||
| typer.echo(f" WARNING: Cannot find {audio_path}, skipping") | ||
| continue | ||
| text = sample["text"] |
There was a problem hiding this comment.
🔴 KeyError on sample["text"] — producer writes "transcript" key
The reference JSON generated by compare-models.py:223 uses the key "transcript" for the sample's text, but run_coreml_inference.py:749 reads sample["text"], which raises KeyError: 'text'.
| text = sample["text"] | |
| text = sample["transcript"] |
Was this helpful? React with 👍 or 👎 to provide feedback.
| start_err = abs(ref["start_ms"] - hyp.start_ms) | ||
| end_err = abs(ref["end_ms"] - hyp.end_ms) |
There was a problem hiding this comment.
🔴 KeyError on ref["start_ms"]/ref["end_ms"] — producer writes "start_time_ms"/"end_time_ms"
The reference JSON alignment entries (produced by compare-models.py:228-229) use keys "start_time_ms" and "end_time_ms", but run_coreml_inference.py:766-767 and run_coreml_inference.py:786-788 read ref["start_ms"] and ref["end_ms"], which raises KeyError. All four occurrences need to use the correct key names.
Prompt for agents
Fix the key mismatch in models/stt/qwen3-forced-aligner-0.6b/coreml/run_coreml_inference.py. The reference JSON uses keys "start_time_ms" and "end_time_ms" (as written by compare-models.py lines 228-229), but the compare command reads "start_ms" and "end_ms". Fix all four occurrences:
Line 766: change ref["start_ms"] to ref["start_time_ms"]
Line 767: change ref["end_ms"] to ref["end_time_ms"]
Line 786: change ref['start_ms'] to ref['start_time_ms'] and ref['end_ms'] to ref['end_time_ms']
Line 788: change ref['start_ms'] to ref['start_time_ms'] and ref['end_ms'] to ref['end_time_ms']
Was this helpful? React with 👍 or 👎 to provide feedback.
| no_optimize: bool = typer.Option( | ||
| False, "--no-optimize", | ||
| help="Skip MIL optimization passes.", | ||
| ), |
There was a problem hiding this comment.
🟡 --no-optimize CLI flag is accepted but silently ignored
The no_optimize parameter is declared as a CLI option at convert-coreml.py:636 but is never passed to any _coreml_convert() call in the convert function body. The _coreml_convert function (individual_components.py:386) supports the parameter and uses it to skip MIL optimization passes. The sister qwen3-asr conversion script (convert-qwen3-asr.py:891) correctly threads no_optimize through, but this file does not. A user passing --no-optimize would expect optimization to be skipped, but it silently has no effect.
Prompt for agents
In models/stt/qwen3-forced-aligner-0.6b/coreml/convert-coreml.py, the no_optimize CLI parameter (lines 636-639) is declared but never passed to any _coreml_convert() call. Pass no_optimize=no_optimize to all _coreml_convert() calls in the convert functions: convert_audio_encoder (line 231), convert_audio_conv (line 284), convert_audio_transformer (line 344), convert_embedding (line 396), convert_lm_head (line 453), and convert_decoder_prefill (line 533). See the qwen3-asr convert-qwen3-asr.py:891 for the correct pattern.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Files
convert-coreml.pyindividual_components.pyrun_coreml_inference.pycompare-models.pypyproject.toml/uv.lockREADME.mdproblems_encountered.mdCoreML Components (5 models)
[1, 128, 100]mel[1, 13, 1024]conv features[1, 256, 1024]features[1, 256, 1024]embeddings[1, seq, int32][1, seq, 1024][1, 1024, 1024]+ RoPE[1, 1024, 1024][1, seq, 1024][1, seq, 5000]timestampsParity vs PyTorch (3 LibriSpeech test-clean samples, 54 word boundaries)
Test plan
cd models/stt/qwen3-forced-aligner-0.6b/coreml && uv syncuv run python convert-coreml.py— convert all 5 componentsuv run python compare-models.py --num-files 3— generate PyTorch referenceuv run python run_coreml_inference.py compare— verify CoreML vs PyTorch parity🤖 Generated with Claude Code