
feat: add Qwen3-ForcedAligner-0.6B CoreML conversion and inference#21

Merged
Alex-Wengg merged 3 commits into main from feat/qwen3-forced-aligner-coreml on Mar 21, 2026

Conversation

@Alex-Wengg Alex-Wengg commented Feb 15, 2026

Summary

  • Add CoreML conversion pipeline and inference script for Qwen3-ForcedAligner-0.6B, a non-autoregressive forced alignment model that produces per-word timestamps from audio + text
  • Split audio encoder into conv frontend + transformer to preserve cross-chunk bidirectional attention (5x improvement in alignment accuracy vs monolithic per-chunk approach)
  • Includes conversion scripts, end-to-end inference pipeline, PyTorch comparison tooling, and Python environment (pyproject.toml + uv.lock)

Files

| File | Purpose |
| --- | --- |
| `convert-coreml.py` | CLI to export all 5 CoreML components from HuggingFace weights |
| `individual_components.py` | PyTorch wrapper modules for tracing (conv, transformer, decoder, etc.) |
| `run_coreml_inference.py` | End-to-end CoreML inference + parity comparison against PyTorch |
| `compare-models.py` | Generates PyTorch reference timestamps from LibriSpeech |
| `pyproject.toml` / `uv.lock` | Python dependencies |
| `README.md` | Architecture docs, I/O shapes, inference pipeline |
| `problems_encountered.md` | Conversion journal with all issues and fixes |

CoreML Components (5 models)

| Component | Input | Output | Precision |
| --- | --- | --- | --- |
| Audio Conv | `[1, 128, 100]` mel | `[1, 13, 1024]` conv features | FP16 |
| Audio Transformer | `[1, 256, 1024]` features | `[1, 256, 1024]` embeddings | FP32 |
| Token Embedding | `[1, seq]` int32 | `[1, seq, 1024]` | FP16 |
| Decoder Prefill | `[1, 1024, 1024]` + RoPE | `[1, 1024, 1024]` | FP32 |
| LM Head | `[1, seq, 1024]` | `[1, seq, 5000]` timestamps | FP32 |
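Assuming the shapes in the table above, the split encoder can be driven by chunking the mel spectrogram, running the conv frontend per chunk, and concatenating before the transformer so attention spans all chunks. This is a minimal numpy sketch: `audio_conv_stub` is a hypothetical stand-in for the converted Audio Conv model, and the real pipeline also trims the last chunk's padded conv frames.

```python
import numpy as np

CHUNK_MEL = 100   # mel frames per Audio Conv chunk (from the I/O table)
CONV_OUT = 13     # conv feature frames produced per chunk
HIDDEN = 1024

def audio_conv_stub(mel_chunk):
    # Stand-in for the CoreML Audio Conv component ([1, 128, 100] -> [1, 13, 1024]);
    # the real pipeline would call the converted .mlpackage here.
    assert mel_chunk.shape == (1, 128, CHUNK_MEL)
    return np.zeros((1, CONV_OUT, HIDDEN), dtype=np.float32)

def encode_chunked(mel):
    # mel: [1, 128, T]. Pad to a whole number of chunks, run the conv frontend
    # per chunk, then concatenate so the transformer attends across all chunks
    # in one pass (the cross-chunk attention the split design preserves).
    T = mel.shape[2]
    n_chunks = -(-T // CHUNK_MEL)  # ceiling division
    padded = np.zeros((1, 128, n_chunks * CHUNK_MEL), dtype=mel.dtype)
    padded[:, :, :T] = mel
    feats = [audio_conv_stub(padded[:, :, i * CHUNK_MEL:(i + 1) * CHUNK_MEL])
             for i in range(n_chunks)]
    return np.concatenate(feats, axis=1)  # [1, 13 * n_chunks, 1024]
```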

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 word boundaries)

| Metric | Value |
| --- | --- |
| AAS (mean boundary error) | 4.4 ms |
| Max boundary error | 160 ms |
| % within 20 ms | 95.4% |
| % within 80 ms (1 segment) | 99.1% |
| % within 160 ms (2 segments) | 100.0% |
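The table's summary statistics can be reproduced with a small helper over per-boundary errors; treating AAS as the mean absolute boundary error is an assumption consistent with the label above, not a definition taken from the scripts.

```python
import numpy as np

def boundary_metrics(ref_ms, hyp_ms):
    # Per-boundary absolute timing errors between reference and hypothesis,
    # summarized the same way as the parity table.
    errs = np.abs(np.asarray(ref_ms, dtype=float) - np.asarray(hyp_ms, dtype=float))
    return {
        "aas_ms": float(errs.mean()),
        "max_ms": float(errs.max()),
        "pct_within_20ms": 100.0 * float((errs <= 20).mean()),
        "pct_within_80ms": 100.0 * float((errs <= 80).mean()),
        "pct_within_160ms": 100.0 * float((errs <= 160).mean()),
    }
```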

Test plan

  • `cd models/stt/qwen3-forced-aligner-0.6b/coreml && uv sync`
  • `uv run python convert-coreml.py` — convert all 5 components
  • `uv run python compare-models.py --num-files 3` — generate the PyTorch reference
  • `uv run python run_coreml_inference.py compare` — verify CoreML vs PyTorch parity
  • Verify AAS < 5 ms and > 95% of boundaries within 20 ms

🤖 Generated with Claude Code



@Alex-Wengg Alex-Wengg marked this pull request as draft March 15, 2026 15:17
@Alex-Wengg Alex-Wengg closed this Mar 16, 2026
@Alex-Wengg Alex-Wengg reopened this Mar 21, 2026
@Alex-Wengg Alex-Wengg marked this pull request as ready for review March 21, 2026 00:41
Add CoreML conversion pipeline for Qwen3-ForcedAligner-0.6B, a non-autoregressive
forced alignment model that produces per-word timestamps from audio + text.

The pipeline splits the model into 5 CoreML components:
- Audio conv frontend (per-chunk mel → conv features)
- Audio transformer (cross-chunk bidirectional attention + projection)
- Token embedding (vocab → hidden states)
- Decoder prefill (28-layer Qwen3 decoder, single NAR pass)
- LM head (hidden states → 5000 timestamp bins)

Key design decisions:
- Audio encoder split into conv + transformer to preserve cross-chunk
  attention (monolithic per-chunk approach had 20.7ms AAS vs 4.4ms split)
- MRoPE cos/sin computed outside the model for flexibility
- Last mel chunk trimmed after conv to remove padding artifacts
- Decoder and LM head use FLOAT32 precision to avoid FP16 overflow
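Computing the rotary cos/sin tables host-side is straightforward; this sketch assumes a standard RoPE formulation, and `head_dim` and `base` are illustrative values, not numbers read from the actual Qwen3 config.

```python
import numpy as np

def rope_cos_sin(positions, head_dim=128, base=1_000_000.0):
    # Host-side rotary tables, fed to the decoder as plain inputs so the
    # CoreML graph never has to materialize position-dependent constants.
    # head_dim and base are assumptions, not values from the model config.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.asarray(positions, dtype=np.float64), inv_freq)
    return np.cos(angles).astype(np.float32), np.sin(angles).astype(np.float32)
```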

Parity vs PyTorch (3 LibriSpeech test-clean samples, 54 boundaries):
- AAS: 4.4 ms, within 20 ms: 95.4%, within 80 ms: 99.1%

The inference script supports two audio encoder paths with auto-detection.
Split encoder (audio_conv + audio_transformer) preserves cross-chunk attention
for 4.4ms AAS. Monolithic encoder (audio_encoder) is faster but lacks
cross-chunk attention (20.7ms AAS). Added comparison table and updated
architecture, I/O shapes, inference pipeline, conversion, and parity sections.

Document 5 bugs encountered during FluidAudio Swift integration:
MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel,
STFT center padding, and MRoPE position clamping.
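One of the documented bugs, Slaney vs HTK mel, comes from two incompatible mel-scale conventions: a filterbank built on one scale silently mismatches features produced with the other. A minimal sketch of both conversions, following librosa's formulas as an assumption about the variants involved:

```python
import numpy as np

def hz_to_mel_htk(f_hz):
    # HTK mel scale: 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def hz_to_mel_slaney(f_hz):
    # Slaney mel scale: linear below 1 kHz, logarithmic above.
    f = np.asarray(f_hz, dtype=float)
    linear = f / (200.0 / 3.0)
    log_step = np.log(6.4) / 27.0
    log_region = 15.0 + np.log(np.maximum(f, 1e-12) / 1000.0) / log_step
    return np.where(f >= 1000.0, log_region, linear)
```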
@Alex-Wengg Alex-Wengg force-pushed the feat/qwen3-forced-aligner-coreml branch from 26e11bd to f03d418 Compare March 21, 2026 00:42
@Alex-Wengg Alex-Wengg merged commit 00ce4c6 into main Mar 21, 2026
@Alex-Wengg Alex-Wengg deleted the feat/qwen3-forced-aligner-coreml branch March 21, 2026 00:43

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 4 potential issues.

View 6 additional findings in Devin Review.


```python
flac_map[f.name] = str(f)

all_errors = []
for sample in ref_data[:num_files]:
```

🔴 Dict slicing TypeError in compare command — ref_data is a dict, not a list

ref_data is loaded via json.load(f) which returns a dict (with keys model_id, language, num_samples, samples), but ref_data[:num_files] on line 739 attempts to slice it, which raises TypeError: unhashable type: 'slice'. The compare-models.py producer (compare-models.py:214-218) wraps samples under a "samples" key, so this should be ref_data["samples"][:num_files].

Suggested change

```diff
-for sample in ref_data[:num_files]:
+for sample in ref_data["samples"][:num_files]:
```

```python
else:
    typer.echo(f" WARNING: Cannot find {audio_path}, skipping")
    continue
text = sample["text"]
```

🔴 KeyError on sample["text"] — producer writes "transcript" key

The reference JSON generated by compare-models.py:223 uses the key "transcript" for the sample's text, but run_coreml_inference.py:749 reads sample["text"], which raises KeyError: 'text'.

Suggested change

```diff
-text = sample["text"]
+text = sample["transcript"]
```

Comment on lines +766 to +767
```python
start_err = abs(ref["start_ms"] - hyp.start_ms)
end_err = abs(ref["end_ms"] - hyp.end_ms)
```

🔴 KeyError on ref["start_ms"]/ref["end_ms"] — producer writes "start_time_ms"/"end_time_ms"

The reference JSON alignment entries (produced by compare-models.py:228-229) use keys "start_time_ms" and "end_time_ms", but run_coreml_inference.py:766-767 and run_coreml_inference.py:786-788 read ref["start_ms"] and ref["end_ms"], which raises KeyError. All four occurrences need to use the correct key names.

Prompt for agents
Fix the key mismatch in models/stt/qwen3-forced-aligner-0.6b/coreml/run_coreml_inference.py. The reference JSON uses keys "start_time_ms" and "end_time_ms" (as written by compare-models.py lines 228-229), but the compare command reads "start_ms" and "end_ms". Fix all four occurrences:

Line 766: change ref["start_ms"] to ref["start_time_ms"]
Line 767: change ref["end_ms"] to ref["end_time_ms"]
Line 786: change ref['start_ms'] to ref['start_time_ms'] and ref['end_ms'] to ref['end_time_ms']
Line 788: change ref['start_ms'] to ref['start_time_ms'] and ref['end_ms'] to ref['end_time_ms']

Comment on lines +636 to +639
```python
no_optimize: bool = typer.Option(
    False, "--no-optimize",
    help="Skip MIL optimization passes.",
),
```

🟡 --no-optimize CLI flag is accepted but silently ignored

The no_optimize parameter is declared as a CLI option at convert-coreml.py:636 but is never passed to any _coreml_convert() call in the convert function body. The _coreml_convert function (individual_components.py:386) supports the parameter and uses it to skip MIL optimization passes. The sister qwen3-asr conversion script (convert-qwen3-asr.py:891) correctly threads no_optimize through, but this file does not. A user passing --no-optimize would expect optimization to be skipped, but it silently has no effect.

Prompt for agents
In models/stt/qwen3-forced-aligner-0.6b/coreml/convert-coreml.py, the no_optimize CLI parameter (lines 636-639) is declared but never passed to any _coreml_convert() call. Pass no_optimize=no_optimize to all _coreml_convert() calls in the convert functions: convert_audio_encoder (line 231), convert_audio_conv (line 284), convert_audio_transformer (line 344), convert_embedding (line 396), convert_lm_head (line 453), and convert_decoder_prefill (line 533). See the qwen3-asr convert-qwen3-asr.py:891 for the correct pattern.
