Add Qwen2.5-Omni-7B full Neuron speech pipeline (Thinker+Talker+Token2Wav, TP=4) #122

Closed

whn09 wants to merge 25 commits into aws-neuron:main from whn09:feature/qwen25-omni-support

Conversation


@whn09 whn09 commented Apr 10, 2026

Description

Add NxDI model adapter for Qwen2.5-Omni-7B with full multimodal support and end-to-end Neuron speech synthesis: text generation, image understanding, audio understanding, and text-to-speech — all on Neuron.

All components run on Neuron at TP=4: Thinker (28 heads → 7/rank), Vision (16 → 4/rank), Audio (20 → 5/rank), Talker (12 heads → 3/rank), Token2Wav DiT (22 blocks, traced). This enables running on trn2.xlarge (4 NeuronCores) as well as larger instances.

Full Neuron Speech Pipeline

The complete text-to-speech pipeline runs entirely on Neuron hardware:

Component                            | Time  | Notes
Thinker (7B, Neuron TP=4)            | 0.3s  | 24 text tokens
Talker (690M, Neuron TP=4)           | 2.1s  | 454 codec tokens, per-step thinker injection
Token2Wav (Neuron DiT + CPU BigVGAN) | 9.9s  | 22 DiT blocks × 10 ODE steps
Total                                | ~15s  | 9.1s audio, RTF ~1.7x

Compared to 267.9s on CPU (~18x overall speedup).
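Taking the ~15 s total and the 9.1 s of generated audio at face value, the reported ratios are easy to verify (a quick sanity calculation, not a measurement):

    neuron_total_s, cpu_total_s, audio_s = 15.0, 267.9, 9.1
    print(f"overall speedup ≈ {cpu_total_s / neuron_total_s:.1f}x")  # ≈ 17.9x, i.e. ~18x
    print(f"RTF ≈ {neuron_total_s / audio_s:.2f}")                   # ≈ 1.65, i.e. roughly 1.7x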

Key Technical Contributions

  1. Talker per-step thinker state injection: At each autoregressive step, the corresponding thinker_reply_part[step] embedding is added to the codec token embedding via the vision_embeddings mechanism, matching HF's generation behavior (see the sketch after this list).

  2. Fused embedding: Collapsed embed_tokens (8448→3584) + thinker_to_talker_proj (3584→896) into a single 8448→896 lookup, eliminating the projection layer during inference.

  3. Vision embeddings auto-padding: Compiled Neuron models require fixed bucket shapes. Vision embeddings are auto-padded to max_context_length in set_vision_embeddings() for bucket compatibility.

  4. Token2Wav CPU fallback: Moved mel_len overflow check before input_embed (which doubles batch for classifier-free guidance) to prevent batch dimension mismatch on CPU fallback.

  5. model_base extension: Added apply_vision_during_token_gen flag to allow vision_embeddings during token generation (not just context encoding), enabling the Talker's per-step injection pattern.
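
The sketch below shows how items 1-3 could fit together in a single Talker generation step. It is illustrative only: the function and tensor names (fused_embed, thinker_reply_proj, codec_token) and shapes are taken from the description above, not from the actual NxDI module interfaces.

    import torch
    import torch.nn.functional as F

    # Shapes from the PR description: codec vocab 8448, thinker hidden 3584, talker hidden 896.
    def build_fused_embedding(embed_tokens: torch.Tensor,      # (8448, 3584)
                              proj_weight: torch.Tensor        # (896, 3584), nn.Linear convention
                              ) -> torch.Tensor:
        # Item 2: collapse embed_tokens + thinker_to_talker_proj into one 8448x896 lookup.
        # The projection bias is deliberately excluded; it is applied once through the
        # projected thinker reply states, so per-token lookups do not double-count it.
        return embed_tokens @ proj_weight.T                     # (8448, 896)

    def talker_step_embedding(fused_embed: torch.Tensor,        # (8448, 896)
                              codec_token: torch.Tensor,        # (batch,) int64
                              thinker_reply_proj: torch.Tensor, # (reply_len, 896), bias applied
                              step: int) -> torch.Tensor:
        # Item 1: per-step thinker state injection. When the reply is exhausted, the last
        # element keeps being repeated instead of falling back to zeros.
        idx = min(step, thinker_reply_proj.shape[0] - 1)
        return fused_embed[codec_token] + thinker_reply_proj[idx]

    def pad_vision_embeddings(vision_embeddings: torch.Tensor,  # (batch, seq, hidden)
                              max_context_length: int) -> torch.Tensor:
        # Item 3: zero-pad along the sequence dimension so the tensor matches the fixed
        # bucket shape the compiled Neuron graph expects.
        pad = max_context_length - vision_embeddings.shape[1]
        return F.pad(vision_embeddings, (0, 0, 0, pad))

The bias handling here mirrors the later fix in the commit history: HF computes proj(E + reply) = W@(E + reply) + bias, so folding the bias into the fused lookup as well would apply it twice.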

Model Information

Model Name: Qwen2.5-Omni-7B

Model Architecture: Multimodal encoder-decoder (Thinker + Vision + Audio + Talker + Token2Wav)

Purpose: Text generation, image-to-text, audio-to-text, text-to-speech (full omni-modal)

Architecture

Component       | Runtime           | TP  | Key dims
Thinker (text)  | Neuron            | 4   | hidden=3584, heads=28, kv_heads=4, layers=28
Vision encoder  | Neuron            | 4   | embed=1280, heads=16, depth=32, SwiGLU
Audio encoder   | CPU+Neuron        | 4   | d_model=1280, heads=20, layers=32, chunked attn
Talker          | Neuron            | 4   | hidden=896, heads=12, kv_heads=4, layers=24, vocab=8448, fused embed
Token2Wav       | CPU+Neuron (fp32) | N/A | DiT: dim=1024, 22 blocks (Neuron); BigVGAN: 6 upsample stages (CPU)
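
As a small illustration of why TP=4 works where the earlier TP=32 layout did not (20 audio heads and 12 talker heads are not divisible by 32), a divisibility check over the head counts in the table; the helper itself is hypothetical, not part of the PR:

    TP_DEGREE = 4
    ATTENTION_HEADS = {"thinker": 28, "vision": 16, "audio": 20, "talker": 12}

    for name, heads in ATTENTION_HEADS.items():
        assert heads % TP_DEGREE == 0, f"{name}: {heads} heads not divisible by TP={TP_DEGREE}"
        print(f"{name}: {heads // TP_DEGREE} heads per rank")
    # thinker: 7, vision: 4, audio: 5, talker: 3 heads per rank — matching the config test below.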

Checklist

Required Components

  • Accuracy Test — All 7 tests pass on trn2.48xlarge with real Qwen2.5-Omni-7B weights
  • README.md — Full multimodal documentation with architecture details, usage examples, vLLM serving instructions, and performance benchmarks
  • Source Code (src/)
    • modeling_qwen25_omni.py — Text-only + multimodal orchestration, config handling, state dict conversion
    • modeling_qwen25_omni_vision.py — Vision encoder (SwiGLU, RMSNorm, separate QKV, PatchMerger)
    • modeling_qwen25_omni_audio.py — Audio encoder (hybrid CPU+Neuron, chunked attention)
    • modeling_qwen25_omni_talker.py — Talker (Neuron, fused embed, per-step thinker injection, auto-padding)
    • modeling_qwen25_omni_token2wav.py — Token2Wav (Neuron DiT + CPU BigVGAN, CPU fallback)

Optional Components

  • Unit Tests — Integration tests on Trn2 with real weights
  • End-to-end speech pipeline — Verified to produce real human speech audio

Folder Structure

src/neuronx_distributed_inference/models/qwen25_omni/
  __init__.py
  modeling_qwen25_omni.py          # Text-only + multimodal orchestration
  modeling_qwen25_omni_vision.py   # Vision encoder (Neuron, TP=4)
  modeling_qwen25_omni_audio.py    # Audio encoder (CPU+Neuron hybrid, TP=4)
  modeling_qwen25_omni_talker.py   # Talker (Neuron, TP=4, per-step injection)
  modeling_qwen25_omni_token2wav.py # Token2Wav (Neuron DiT + CPU BigVGAN)

contrib/models/Qwen2.5-Omni-7B/
  README.md                        # Full multimodal documentation
  test/integration/test_model.py   # 6 integration tests

perf_test/
  3_bench_qwen25_omni_7b.sh       # vLLM benchmark (BS=1/4, TP=4)
  apply_vllm_neuron_patch_qwen25omni.py  # vllm-neuron patch script

Testing

How did you test this change?

All 7 tests pass on trn2.48xlarge with Qwen2.5-Omni-7B weights (Neuron SDK 2.23, PyTorch 2.9):

  1. Imports: All 7 module groups import correctly
  2. Config: TP=4 head divisibility verified (Thinker 7Q/1KV, Audio 5, Vision 4 per rank)
  3. State dict: All 2448 keys convert correctly (text=339, audio=489, vision=518, talker=293, token2wav=809)
  4. Audio CPU: Frontend + postprocessor latency 20-34ms (1-30s audio)
  5. Talker Neuron: Compile + load + generate codec tokens on Neuron, TPOT 4.0ms
  6. Text generation: Compile + load + generate on Neuron, TPOT ~11-13ms, correct outputs
  7. Full speech pipeline: Thinker → Talker → Token2Wav all on Neuron, producing real speech audio

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.23
  • Instance Type(s): Trn2 (trn2.48xlarge), should work on trn2.xlarge (4 cores)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM (text-only)
  • vLLM-neuron patch included in perf_test/apply_vllm_neuron_patch_qwen25omni.py

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

whn09 and others added 5 commits April 10, 2026 16:38
The Thinker's text backbone is architecturally identical to Qwen2.5,
so we reuse the Qwen2 NxDI implementation with state-dict remapping
to handle the nested weight prefixes (thinker.model.* -> *).

Non-text components (Talker, Token2Wav, audio/vision encoders) are
filtered out during weight loading for text-only inference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When loading Qwen2.5-Omni from a saved compiled model path, thinker_config
is a plain dict (from JSON) rather than a SimpleNamespace. Convert it before
attribute access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PretrainedConfig defaults tie_word_embeddings=True, which overwrites the
correct False value from thinker_config.text_config. This caused lm_head
weights to be replaced by embed_tokens, producing only special tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With tie_word_embeddings correctly set to False from text_config, this
method is never called, so the override is dead code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…token2wav)

Extends the text-only Thinker support with all multimodal components:

- Vision encoder (Neuron): SwiGLU MLP, RMSNorm, separate QKV projections,
  PatchMerger with spatial downsampling. Compiled and runs on Neuron cores.

- Audio encoder (CPU): Whisper-style Conv1d frontend, 32 transformer layers
  with chunked attention (n_window=100), AvgPool1d downsampling. Runs on CPU
  due to 20 attention heads not being TP-divisible by 32.

- Talker (CPU): Small Qwen2 decoder (24 layers, hidden=896, 12 heads, 4
  kv_heads) that converts Thinker hidden states into codec tokens. Wraps HF's
  Qwen2_5OmniTalkerForConditionalGeneration for autoregressive generation.

- Token2Wav (CPU, float32): DiT model (22 blocks) with ECAPA-TDNN speaker
  encoder for mel spectrogram generation, plus BigVGAN vocoder for waveform
  synthesis. Wraps HF's Qwen2_5OmniToken2WavModel.

- Multimodal config: Extracts text/vision/audio/talker/token2wav configs from
  the nested HF config structure.

- State dict conversion: Full 2448-key pipeline handling all component prefixes
  (thinker.model.*, thinker.visual.*, thinker.audio_tower.*, talker.*, token2wav.*).

All components tested on Trn2 with real Qwen2.5-Omni-7B weights:
  Text: 339 keys, Vision: 518, Audio: 489, Talker: 293, Token2Wav: 809

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@whn09 whn09 force-pushed the feature/qwen25-omni-support branch from 569f752 to 8a91d55 on April 10, 2026 08:42
@whn09 whn09 changed the title Add Qwen2.5-Omni-7B text-only (Thinker) inference support Add Qwen2.5-Omni-7B full multimodal inference support Apr 10, 2026
whn09 and others added 2 commits April 10, 2026 16:59
…ructions

Replace auto-generated README with comprehensive documentation covering all
5 components (Thinker, Vision, Audio, Talker, Token2Wav), vLLM serving setup,
performance benchmarks, and architecture details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eneration

- Migrate all Neuron components from TP=32 to TP=4 (Thinker 7Q/1KV per rank,
  Vision 4/rank, Audio 5/rank)
- Refactor audio encoder to hybrid CPU+Neuron: Conv1d frontend + chunking on
  CPU, 32 transformer layers on Neuron with TP=4, AvgPool + projection on CPU
- Add compile_audio_encoder() and load_audio_encoder() orchestration methods
- Implement multimodal forward() combining vision + audio embeddings via
  scatter_by_index_put
- Document Talker CPU rationale (non-standard head_dim, 3D mRoPE, per-step
  thinker injection, small model)
- Update benchmark script and README for TP=4 configs
- All 6 tests pass on trn2.48xlarge: imports, config, state_dict (2448 keys),
  audio CPU (20ms/1s), talker (293 keys), text generation (TPOT ~11-13ms)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@whn09 whn09 changed the title Add Qwen2.5-Omni-7B full multimodal inference support Add Qwen2.5-Omni-7B multimodal support (TP=4, verified on Trn2) Apr 10, 2026
whn09 and others added 3 commits April 10, 2026 18:31
Rewrite test_model.py with 8 working tests:
1. Import validation (all 7 module groups)
2. Config TP=4 head divisibility check
3. State dict conversion (2448 keys, all components)
4. Audio encoder CPU frontend/postprocessor (1s-30s synthetic audio)
5. Talker CPU model (weight loading + codec tokens)
6. Text-only Thinker compile + load + generate (3 prompts, chat template)
7. Image understanding preprocessing (Qwen official demo.jpeg)
8. Audio understanding preprocessing (Qwen official cough.wav)

Supports --test, --quick flags, and env var overrides for paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Supports 4 modes: text-only, image+text, audio+text, full multimodal.
Uses Qwen official test assets (demo.jpeg, cough.wav).

Usage:
  python3 examples/generate_qwen25_omni.py --mode text   # text only
  python3 examples/generate_qwen25_omni.py --mode image  # image understanding
  python3 examples/generate_qwen25_omni.py --mode audio  # audio understanding
  python3 examples/generate_qwen25_omni.py --mode full   # all modalities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass model_path and compiled_path as function parameters instead of
using global statement after referencing module-level constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 and others added 4 commits April 13, 2026 10:26
The upstream merge added tensor_capture_hook to hf_adapter's
prepare_inputs_for_generation but didn't extract it from kwargs,
and didn't add it to NeuronBaseForCausalLM.forward() signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement Neuron/Trainium support for the Talker (690M, 24 layers) and
Token2Wav DiT (85M, 22 blocks) components, enabling full speech synthesis
on Neuron hardware alongside the existing Thinker.

Talker (modeling_qwen25_omni_talker.py):
- NeuronQwen25OmniTalkerForCausalLM with NeuronBaseForCausalLM + ImageToTextModelWrapper
- Explicit head_dim=128 (not hidden_size/num_heads=74.67), TP=4 recommended
- Fused embedding: embed_tokens(8448,3584) + proj(3584,896) → (8448,896)
- mRoPE support with mrope_section=[16,24,24] for 3D position_ids
- ThinkerToTalkerProjection for CPU-side context encoding (3584→896)
- State dict conversion with QKV fusion, codec_head→lm_head mapping

Token2Wav (modeling_qwen25_omni_token2wav.py):
- NeuronQwen25OmniToken2WavWithNeuronDiT compiles DiT via torch_neuronx.trace()
- ODE solver loop + BigVGAN vocoder stay on CPU
- Monkeypatches DiT forward to redirect to Neuron during inference

Orchestration (modeling_qwen25_omni.py):
- enable_talker/compile_talker/load_talker for CPU and Neuron modes
- enable_token2wav/compile_token2wav_dit/load_token2wav_dit
- Factory methods for Neuron talker, config, projection, and token2wav classes

Tests:
- test_talker_neuron.py: 9 unit tests with auto-mock import hook for CPU testing
- test_e2e_qwen25_omni.py: 4 end-to-end tests (text/image/audio→text, text→speech)
  validated on trn2.48xlarge with Qwen2.5-Omni-7B

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
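
A minimal sketch of the trace-and-redirect pattern this commit describes. The module and example-input shapes are placeholders; the real wrapper in modeling_qwen25_omni_token2wav.py also handles classifier-free-guidance batching, masks, and the ODE loop.

    import torch
    import torch_neuronx

    def compile_and_redirect_dit(dit: torch.nn.Module, example_inputs: tuple) -> torch.nn.Module:
        # Compile the DiT core to a Neuron graph; the ODE solver loop and the BigVGAN
        # vocoder around it stay on CPU.
        traced = torch_neuronx.trace(dit, example_inputs)
        # Monkeypatch: later calls to dit(...) during inference run the traced Neuron graph.
        dit.forward = lambda *args: traced(*args)
        return dit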
Measured per-component CPU timing: Talker 103s (41%), Token2Wav 118s (47%),
Thinker 31s (12%) for 14.1s audio output (17.9x RTF on trn2.48xlarge).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@whn09 whn09 marked this pull request as draft April 14, 2026 02:34
whn09 and others added 3 commits April 14, 2026 21:39
- Fix TalkerRotaryEmbedding to expand 2D position_ids to 3D for mRoPE
  (NxDI traces with 2D, mRoPE needs 3D; same approach as Qwen3-VL)
- Override _get_model_outputs to pass 24 positional args matching
  ImageToTextModelWrapper (vision_embeddings at index 22)
- Add set_vision_embeddings() for thinker state injection before generate()
- Fix Token2Wav DiT tracing: correct forward signature, _DiTTraceWrapper
  for keyword args, proper example input dimensions
- Update README with Neuron vs CPU benchmark results:
  Thinker 64.7x, Talker 49.1x (TPOT=4.0ms), projected 1.9x total

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
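
A sketch of the 2D→3D expansion behind the TalkerRotaryEmbedding fix above, assuming the three mRoPE components share the same index when no spatial grid is involved (the same trick referenced for Qwen3-VL); the helper name is illustrative:

    import torch

    def expand_position_ids_for_mrope(position_ids: torch.Tensor) -> torch.Tensor:
        # NxDI traces with 2D position_ids (batch, seq); mRoPE with mrope_section=[16, 24, 24]
        # expects a 3D tensor (3, batch, seq), one row per rotary section.
        if position_ids.dim() == 2:
            position_ids = position_ids.unsqueeze(0).expand(3, -1, -1)
        return position_ids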
Split architecture: CPU preprocessing (ECAPA-TDNN, codec embed, input embed,
rotary, block_diff) + Neuron transformer core (22 blocks + norm + proj).

XLA tracing fixes:
- DiTAttention in-place slice assignment → torch.cat reassembly
- SDPA dispatch → explicit matmul attention
- Boolean mask → float additive mask (0.0/-1e4)
- DiTDecoderLayer inlined to pass pre-computed attention mask

Benchmark (trn2.48xlarge, 300 tokens / 6s audio):
- DiT core: 60ms Neuron vs 593ms CPU = 9.8x per forward
- DiT 10 ODE steps: 3.8s vs 24.1s = 6.3x
- Token2Wav e2e: 5.2s vs 13.7s = 2.7x (BigVGAN stays CPU)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
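
Hedged sketches of two of the XLA-tracing rewrites listed above; the shapes and names are illustrative, and the actual DiTAttention additionally applies the per-block look_backward/look_ahead windows:

    import torch

    def to_additive_mask(bool_mask: torch.Tensor) -> torch.Tensor:
        # Boolean mask (True = attend) -> float additive mask (0.0 / -1e4), which traces
        # more cleanly under XLA than boolean masking.
        return (~bool_mask).to(torch.float32) * -1e4

    def explicit_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           additive_mask: torch.Tensor) -> torch.Tensor:
        # Replace the SDPA dispatch with explicit matmul attention so the traced graph
        # contains plain matmul/softmax ops.
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale + additive_mask
        return torch.softmax(scores, dim=-1) @ v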
…2Wav CPU fallback

- Talker: Add per-step thinker state injection during token generation via
  vision_embeddings, matching HF's behavior of adding thinker_reply_part[step]
  to codec token embedding at each autoregressive step. Fused embedding
  collapses embed_tokens (8448→3584) + thinker_to_talker_proj (3584→896)
  into single 8448→896 lookup. Auto-pad vision_embeddings to
  max_context_length for compiled Neuron bucket compatibility.
- Token2Wav: Move mel_len overflow check before input_embed (which doubles
  batch for classifier-free guidance) to prevent batch dimension mismatch
  on CPU fallback.
- model_base: Allow vision_embeddings during token generation when model
  sets apply_vision_during_token_gen=True.
- README: Add full Neuron pipeline performance results (RTF ~1.7x for 9.1s
  audio).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@whn09 whn09 changed the title Add Qwen2.5-Omni-7B multimodal support (TP=4, verified on Trn2) Add Qwen2.5-Omni-7B full Neuron speech pipeline (Thinker+Talker+Token2Wav, TP=4) Apr 15, 2026
whn09 and others added 6 commits April 15, 2026 08:27
During token generation, fused_embed(token) + proj(reply) was computing
W@E + bias + W@reply + bias = W@(E+reply) + 2*bias, while HF computes
proj(E + reply) = W@(E+reply) + bias. The extra bias caused audio noise.

Fix: exclude projection bias from fused embedding weights. The bias is
applied once via the projected thinker reply states during generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HF's Talker keeps repeating the last thinker_reply_part element when
exhausted (typically the projected pad_embed). Our implementation was
passing zeros, which removed the text-guidance signal and caused the
model to lose track of when to stop generating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DiT's 22 transformer blocks have 3 distinct attention patterns
(look_backward/look_ahead), but we were using block 0's mask for all
blocks. Now _NeuronDiTCore accepts 3 masks (local, backward, ahead)
and selects per-block via static index, matching HF's behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
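
A sketch of the per-block mask selection, assuming three precomputed additive masks and a static Python-side block→pattern mapping so the choice is resolved at trace time rather than inside the graph; the function and argument names are illustrative:

    from typing import Sequence
    import torch

    def run_dit_blocks(hidden: torch.Tensor,
                       blocks: Sequence[torch.nn.Module],
                       masks: Sequence[torch.Tensor],       # [local, backward, ahead]
                       pattern_for_block: Sequence[int]) -> torch.Tensor:
        # pattern_for_block maps each of the 22 blocks to one of the three masks. Because
        # the index is plain Python, no data-dependent control flow ends up in the trace.
        for block, pattern in zip(blocks, pattern_for_block):
            hidden = block(hidden, attention_mask=masks[pattern])
        return hidden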
One-command script that runs the full Neuron speech pipeline:
Thinker -> Talker -> Token2Wav. Auto-compiles all components on
first run (~30 min), subsequent runs load from cache (~15s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
--compile mode compiles all 3 components (Thinker, Talker, DiT).
Default mode loads compiled artifacts and runs the pipeline,
erroring early if not compiled. No more mixed compile-and-run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each Neuron component (Thinker, Talker, DiT) is loaded once and
runs N iterations within the same process. Reports per-run times
and averages. Pipeline RTF is computed from component averages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 and others added 2 commits April 15, 2026 15:41
Each subprocess now reports model load time separately from inference
avg. Summary clearly distinguishes one-time load cost vs per-inference
latency. --num-runs N loads model once and runs N inferences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 (Author) commented Apr 22, 2026

Closing and superseding this PR with a clean zero-invasion rewrite.

The original submission modified files under src/neuronx_distributed_inference/ (model registrations in utils/constants.py and inference_demo.py, plus small hooks in models/model_base.py and utils/hf_adapter.py), which is not friendly for a contrib merge. The new PR refactors Qwen2.5-Omni-7B to a 100% zero-invasion contrib layout:

  • All model code lives under contrib/models/Qwen2.5-Omni-7B/{src,examples,test,perf_test}/
  • git diff upstream/main..HEAD -- src/ is empty
  • The Talker's previous need for an apply_vision_during_token_gen gate in model_base.py is handled by a get_model_output override on the contrib Talker subclass
  • One small upstream bug in hf_adapter.py (undefined tensor_capture_hook reference) is worked around by an in-contrib runtime shim, and will be fixed via a separate 1-line upstream PR

New PR: will link once opened.
