
feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference#21

Merged
Alex-Wengg merged 3 commits into main from feat/qwen3-forced-aligner-coreml on Mar 21, 2026

Conversation

@Alex-Wengg Alex-Wengg commented Feb 15, 2026

Summary

  • Add CoreML conversion pipeline and inference script for Qwen3-ForcedAligner-0.6B, a non-autoregressive forced alignment model that produces per-word timestamps from audio + text
  • Split audio encoder into conv frontend + transformer to preserve cross-chunk bidirectional attention (5x improvement in alignment accuracy vs monolithic per-chunk approach)
  • Includes conversion scripts, end-to-end inference pipeline, PyTorch comparison tooling, and Python environment (pyproject.toml + uv.lock)

Files

| File | Purpose |
| --- | --- |
| `convert-coreml.py` | CLI to export all 5 CoreML components from HuggingFace weights |
| `individual_components.py` | PyTorch wrapper modules for tracing (conv, transformer, decoder, etc.) |
| `run_coreml_inference.py` | End-to-end CoreML inference + parity comparison against PyTorch |
| `compare-models.py` | Generates PyTorch reference timestamps from LibriSpeech |
| `pyproject.toml` / `uv.lock` | Python dependencies |
| `README.md` | Architecture docs, I/O shapes, inference pipeline |
| `problems_encountered.md` | Conversion journal with all issues and fixes |

CoreML Components (5 models)

| Component | Input | Output | Precision |
| --- | --- | --- | --- |
| Audio Conv | `[1, 128, 100]` mel | `[1, 13, 1024]` conv features | FP16 |
| Audio Transformer | `[1, 256, 1024]` features | `[1, 256, 1024]` embeddings | FP32 |
| Token Embedding | `[1, seq]` int32 | `[1, seq, 1024]` | FP16 |
| Decoder Prefill | `[1, 1024, 1024]` + RoPE | `[1, 1024, 1024]` | FP32 |
| LM Head | `[1, seq, 1024]` | `[1, seq, 5000]` timestamps | FP32 |
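Assuming the shapes in the table above, the split encoder can be driven by chunking the mel spectrogram, running the conv frontend per chunk, and concatenating before the transformer so attention spans all chunks. This is a minimal numpy sketch: `audio_conv_stub` is a hypothetical stand-in for the converted Audio Conv model, and the real pipeline also trims the last chunk's padded conv frames.

```python
import numpy as np

CHUNK_MEL = 100   # mel frames per Audio Conv chunk (from the I/O table)
CONV_OUT = 13     # conv feature frames produced per chunk
HIDDEN = 1024

def audio_conv_stub(mel_chunk):
    # Stand-in for the CoreML Audio Conv component ([1, 128, 100] -> [1, 13, 1024]);
    # the real pipeline would call the converted .mlpackage here.
    assert mel_chunk.shape == (1, 128, CHUNK_MEL)
    return np.zeros((1, CONV_OUT, HIDDEN), dtype=np.float32)

def encode_chunked(mel):
    # mel: [1, 128, T]. Pad to a whole number of chunks, run the conv frontend
    # per chunk, then concatenate so the transformer attends across all chunks
    # in one pass (the cross-chunk attention the split design preserves).
    T = mel.shape[2]
    n_chunks = -(-T // CHUNK_MEL)  # ceiling division
    padded = np.zeros((1, 128, n_chunks * CHUNK_MEL), dtype=mel.dtype)
    padded[:, :, :T] = mel
    feats = [audio_conv_stub(padded[:, :, i * CHUNK_MEL:(i + 1) * CHUNK_MEL])
             for i in range(n_chunks)]
    return np.concatenate(feats, axis=1)  # [1, 13 * n_chunks, 1024]
```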

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 word boundaries)

| Metric | Value |
| --- | --- |
| AAS (mean boundary error) | 4.4 ms |
| Max boundary error | 160 ms |
| % within 20 ms | 95.4% |
| % within 80 ms (1 segment) | 99.1% |
| % within 160 ms (2 segments) | 100.0% |
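The table's summary statistics can be reproduced with a small helper over per-boundary errors; treating AAS as the mean absolute boundary error is an assumption consistent with the label above, not a definition taken from the scripts.

```python
import numpy as np

def boundary_metrics(ref_ms, hyp_ms):
    # Per-boundary absolute timing errors between reference and hypothesis,
    # summarized the same way as the parity table.
    errs = np.abs(np.asarray(ref_ms, dtype=float) - np.asarray(hyp_ms, dtype=float))
    return {
        "aas_ms": float(errs.mean()),
        "max_ms": float(errs.max()),
        "pct_within_20ms": 100.0 * float((errs <= 20).mean()),
        "pct_within_80ms": 100.0 * float((errs <= 80).mean()),
        "pct_within_160ms": 100.0 * float((errs <= 160).mean()),
    }
```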

Test plan

  • `cd models/stt/qwen3-forced-aligner-0.6b/coreml && uv sync`
  • `uv run python convert-coreml.py` — convert all 5 components
  • `uv run python compare-models.py --num-files 3` — generate the PyTorch reference
  • `uv run python run_coreml_inference.py compare` — verify CoreML vs PyTorch parity
  • Verify AAS < 5 ms and > 95% of boundaries within 20 ms

🤖 Generated with Claude Code



@Alex-Wengg Alex-Wengg marked this pull request as draft March 15, 2026 15:17
@Alex-Wengg Alex-Wengg closed this Mar 16, 2026
@Alex-Wengg Alex-Wengg reopened this Mar 21, 2026
@Alex-Wengg Alex-Wengg marked this pull request as ready for review March 21, 2026 00:41
Add CoreML conversion pipeline for Qwen3-ForcedAligner-0.6B, a non-autoregressive
forced alignment model that produces per-word timestamps from audio + text.

The pipeline splits the model into 5 CoreML components:
- Audio conv frontend (per-chunk mel → conv features)
- Audio transformer (cross-chunk bidirectional attention + projection)
- Token embedding (vocab → hidden states)
- Decoder prefill (28-layer Qwen3 decoder, single NAR pass)
- LM head (hidden states → 5000 timestamp bins)

Key design decisions:
- Audio encoder split into conv + transformer to preserve cross-chunk
  attention (monolithic per-chunk approach had 20.7ms AAS vs 4.4ms split)
- MRoPE cos/sin computed outside the model for flexibility
- Last mel chunk trimmed after conv to remove padding artifacts
- Decoder and LM head use FLOAT32 precision to avoid FP16 overflow
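Computing the rotary cos/sin tables host-side is straightforward; this sketch assumes a standard RoPE formulation, and `head_dim` and `base` are illustrative values, not numbers read from the actual Qwen3 config.

```python
import numpy as np

def rope_cos_sin(positions, head_dim=128, base=1_000_000.0):
    # Host-side rotary tables, fed to the decoder as plain inputs so the
    # CoreML graph never has to materialize position-dependent constants.
    # head_dim and base are assumptions, not values from the model config.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.asarray(positions, dtype=np.float64), inv_freq)
    return np.cos(angles).astype(np.float32), np.sin(angles).astype(np.float32)
```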

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 boundaries):
- AAS: 4.4 ms, within 20 ms: 95.4%, within 80 ms: 99.1%

The inference script supports two audio encoder paths with auto-detection.
Split encoder (audio_conv + audio_transformer) preserves cross-chunk attention
for 4.4ms AAS. Monolithic encoder (audio_encoder) is faster but lacks
cross-chunk attention (20.7ms AAS). Added comparison table and updated
architecture, I/O shapes, inference pipeline, conversion, and parity sections.

Document 5 bugs encountered during FluidAudio Swift integration:
MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel,
STFT center padding, and MRoPE position clamping.
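One of the documented bugs, Slaney vs HTK mel, comes from two incompatible mel-scale conventions: a filterbank built on one scale silently mismatches features produced with the other. A minimal sketch of both conversions, following librosa's formulas as an assumption about the variants involved:

```python
import numpy as np

def hz_to_mel_htk(f_hz):
    # HTK mel scale: 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def hz_to_mel_slaney(f_hz):
    # Slaney mel scale: linear below 1 kHz, logarithmic above.
    f = np.asarray(f_hz, dtype=float)
    linear = f / (200.0 / 3.0)
    log_step = np.log(6.4) / 27.0
    log_region = 15.0 + np.log(np.maximum(f, 1e-12) / 1000.0) / log_step
    return np.where(f >= 1000.0, log_region, linear)
```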
@Alex-Wengg Alex-Wengg force-pushed the feat/qwen3-forced-aligner-coreml branch from 26e11bd to f03d418 Compare March 21, 2026 00:42
@Alex-Wengg Alex-Wengg merged commit 00ce4c6 into main Mar 21, 2026
@Alex-Wengg Alex-Wengg deleted the feat/qwen3-forced-aligner-coreml branch March 21, 2026 00:43

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 4 potential issues.

View 6 additional findings in Devin Review.


```python
flac_map[f.name] = str(f)

all_errors = []
for sample in ref_data[:num_files]:
```

🔴 Dict slicing TypeError in compare command — ref_data is a dict, not a list

ref_data is loaded via json.load(f) which returns a dict (with keys model_id, language, num_samples, samples), but ref_data[:num_files] on line 739 attempts to slice it, which raises TypeError: unhashable type: 'slice'. The compare-models.py producer (compare-models.py:214-218) wraps samples under a "samples" key, so this should be ref_data["samples"][:num_files].

Suggested change

```diff
-for sample in ref_data[:num_files]:
+for sample in ref_data["samples"][:num_files]:
```

```python
else:
    typer.echo(f" WARNING: Cannot find {audio_path}, skipping")
    continue
text = sample["text"]
```

🔴 KeyError on sample["text"] — producer writes "transcript" key

The reference JSON generated by compare-models.py:223 uses the key "transcript" for the sample's text, but run_coreml_inference.py:749 reads sample["text"], which raises KeyError: 'text'.

Suggested change

```diff
-text = sample["text"]
+text = sample["transcript"]
```

Comment on lines +766 to +767
```python
start_err = abs(ref["start_ms"] - hyp.start_ms)
end_err = abs(ref["end_ms"] - hyp.end_ms)
```

🔴 KeyError on ref["start_ms"]/ref["end_ms"] — producer writes "start_time_ms"/"end_time_ms"

The reference JSON alignment entries (produced by compare-models.py:228-229) use keys "start_time_ms" and "end_time_ms", but run_coreml_inference.py:766-767 and run_coreml_inference.py:786-788 read ref["start_ms"] and ref["end_ms"], which raises KeyError. All four occurrences need to use the correct key names.

Prompt for agents
Fix the key mismatch in models/stt/qwen3-forced-aligner-0.6b/coreml/run_coreml_inference.py. The reference JSON uses keys "start_time_ms" and "end_time_ms" (as written by compare-models.py lines 228-229), but the compare command reads "start_ms" and "end_ms". Fix all four occurrences:

Line 766: change ref["start_ms"] to ref["start_time_ms"]
Line 767: change ref["end_ms"] to ref["end_time_ms"]
Line 786: change ref['start_ms'] to ref['start_time_ms'] and ref['end_ms'] to ref['end_time_ms']
Line 788: change ref['start_ms'] to ref['start_time_ms'] and ref['end_ms'] to ref['end_time_ms']

Comment on lines +636 to +639
```python
no_optimize: bool = typer.Option(
    False, "--no-optimize",
    help="Skip MIL optimization passes.",
),
```

🟡 --no-optimize CLI flag is accepted but silently ignored

The no_optimize parameter is declared as a CLI option at convert-coreml.py:636 but is never passed to any _coreml_convert() call in the convert function body. The _coreml_convert function (individual_components.py:386) supports the parameter and uses it to skip MIL optimization passes. The sister qwen3-asr conversion script (convert-qwen3-asr.py:891) correctly threads no_optimize through, but this file does not. A user passing --no-optimize would expect optimization to be skipped, but it silently has no effect.

Prompt for agents
In models/stt/qwen3-forced-aligner-0.6b/coreml/convert-coreml.py, the no_optimize CLI parameter (lines 636-639) is declared but never passed to any _coreml_convert() call. Pass no_optimize=no_optimize to all _coreml_convert() calls in the convert functions: convert_audio_encoder (line 231), convert_audio_conv (line 284), convert_audio_transformer (line 344), convert_embedding (line 396), convert_lm_head (line 453), and convert_decoder_prefill (line 533). See the qwen3-asr convert-qwen3-asr.py:891 for the correct pattern.
