
Contrib: Add Qwen3.6-27B (post-training update of Qwen3.5-27B) #140

Open

jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3.6-27b

Conversation

@jimburtoft (Contributor)

Summary

  • Adds an NxDI contrib implementation of Qwen3.6-27B, a 27B-parameter dense model with a hybrid DeltaNet + GQA attention architecture
  • Qwen3.6-27B is a post-training update of Qwen3.5-27B (PR #128, "Contrib: Add Qwen3.5-27B with hybrid DeltaNet + GQA architecture") with an identical architecture (qwen3_5 model_type); only the weights differ, bringing improved agentic coding and thinking preservation
  • Same NxDI implementation as Qwen3.5-27B, with updated documentation, Qwen3.6-27B benchmarks, quality validation, and cross-references between the two contribs

Relationship to PR #128 (Qwen3.5-27B)

This contrib uses the same Qwen35* classes and modeling_qwen35*.py filenames as the Qwen3.5-27B contrib (PR #128). The code is identical: both models share the qwen3_5 model_type. Only the HuggingFace model ID and weights differ.
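A minimal sketch of what "only the model ID differs" means in practice (the local checkpoint paths below are assumptions for illustration):

```python
# Illustrative only: the modeling code is shared, so switching models is a
# one-line checkpoint change. Paths to the downloaded configs are assumptions.
import json

with open("Qwen3.5-27B/config.json") as f:
    cfg_35 = json.load(f)
with open("Qwen3.6-27B/config.json") as f:
    cfg_36 = json.load(f)

# Same model_type, so the same Qwen35* classes load either checkpoint;
# only the weights (and the HF model ID) differ.
assert cfg_35["model_type"] == cfg_36["model_type"] == "qwen3_5"
```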

Config Compatibility

Qwen3.6-27B adds output_gate_type="swish" to text_config. Investigation confirmed this field is completely unused by HF transformers (zero references across v4.57.6, v5.6.0, and GitHub main) and by this NxDI code. No code changes required.
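A hedged way to reproduce that check locally (the config path is an assumption; the grep mirrors the "zero references" search described above):

```python
# Sketch: the new field is present in the checkpoint config but read by
# nothing in this contrib's source. Paths are assumptions.
import json
import subprocess

with open("Qwen3.6-27B/config.json") as f:
    raw = json.load(f)

# Field exists in the Qwen3.6-27B config...
assert raw["text_config"]["output_gate_type"] == "swish"

# ...and nothing in the contrib source references it.
hits = subprocess.run(
    ["grep", "-r", "output_gate_type", "contrib/models/Qwen3.6-27B/src"],
    capture_output=True, text=True,
)
assert hits.stdout == ""  # zero references, so no code changes are needed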

Test Results

Unit Tests (42/42 PASS, CPU only)

| Module | Tests |
| --- | --- |
| test_config.py | 26/26 |
| test_weight_conversion.py | 16/16 |
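To reproduce the unit-test run on a CPU-only host (standard pytest invocation; the path follows the file layout below):

```python
# Run the 42 CPU-only unit tests; no Neuron device is required.
import pytest

pytest.main(["contrib/models/Qwen3.6-27B/test/unit", "-v"])
```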

Architecture-level tests produce results identical to Qwen3.5-27B.

Quality Validation (7/7 PASS, trn2.3xlarge, TP=4, SDK 2.29)

| Test | Result |
| --- | --- |
| Speed of light | PASS |
| 17 * 23 = 391 | PASS |
| 60 mph * 2.5 h = 150 miles | PASS |
| is_prime function | PASS |
| French translation | PASS |
| Capital of Japan | PASS |
| sqrt(144) = 12 | PASS |
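The prompts map directly to substring checks; a minimal, hypothetical harness in that spirit (the prompt wordings, expected strings, and the `generate_fn` callable are assumptions, not the contrib's actual validation code):

```python
# Hypothetical quality harness: a prompt passes if the reply contains the
# expected substring. generate_fn is any prompt -> text callable (e.g. a
# compiled NxDI model's generate loop).
QUALITY_CHECKS = {
    "What is 17 * 23?": "391",
    "Traveling at 60 mph for 2.5 hours covers how many miles?": "150",
    "What is the capital of Japan?": "Tokyo",
    "What is sqrt(144)?": "12",
}

def run_quality_checks(generate_fn):
    return {
        prompt: "PASS" if expected in generate_fn(prompt) else "FAIL"
        for prompt, expected in QUALITY_CHECKS.items()
    }
```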

Performance (trn2.3xlarge, TP=4, LNC=2, BF16, SDK 2.29)

| Metric | Qwen3.6-27B | Qwen3.5-27B | Delta |
| --- | --- | --- | --- |
| TPOT (P50) | 54.2 ms | 53 ms | +2.3% |
| Throughput | 18.5 tok/s | 18.9 tok/s | -2.1% |
| TTFT (P50) | 306 ms | 576 ms | * |

* TTFT difference is due to the compilation config (256-token vs. 128-token bucket), not the model; architectural performance is equivalent.
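For reference, the table's metrics follow the usual serving-latency definitions (a sketch; the actual benchmark script may compute them differently):

```python
# Standard serving-latency definitions behind TTFT / TPOT / throughput.
def ttft_ms(first_token_s: float, start_s: float) -> float:
    """Time to first token: prefill latency, in ms."""
    return (first_token_s - start_s) * 1000.0

def tpot_ms(total_s: float, ttft_s: float, n_tokens: int) -> float:
    """Time per output token: decode time over tokens after the first."""
    return (total_s - ttft_s) / (n_tokens - 1) * 1000.0

# Sanity check against the table: 54.2 ms/token is about 18.5 tok/s decode.
assert abs(1000.0 / 54.2 - 18.45) < 0.1
```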

Files (15 files, ~6600 lines)

```
contrib/models/Qwen3.6-27B/
├── README.md
├── src/
│   ├── __init__.py
│   ├── modeling_qwen35.py              (text decoder)
│   ├── modeling_qwen35_vision.py       (vision encoder)
│   ├── modeling_qwen35_vl.py           (VL pipeline)
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py             (recurrent kernel)
│       ├── nki_deltanet_chunked.py     (per-chunk kernel)
│       └── nki_deltanet_fused.py       (fused chunked kernel)
└── test/
    ├── unit/
    │   ├── test_config.py              (26 tests)
    │   └── test_weight_conversion.py   (16 tests)
    └── integration/
        └── test_model.py               (8 tests)
```

Checklist

  • Contrib-only (no changes to NxDI src/)
  • Unit tests (42/42 pass)
  • Quality validation (7/7 pass on trn2.3xlarge, SDK 2.29)
  • Benchmarks (TPOT=54.2ms, 18.5 tok/s)
  • README with architecture details, benchmarks, cross-reference to Qwen3.5-27B, and config compatibility notes
  • Apache 2.0 license headers
  • SDK 2.29+ / NKI 0.3.0 required

