sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product by aicss-genai · Pull Request #22156 · ggml-org/llama.cpp

aicss-genai · 2026-04-20T07:09:16Z

Overview

Authors

Small perf cleanup for Q6_K MMVQ on Intel GPUs.

The Q6_K MMVQ dot product previously called dpct::vectorized_binary<sycl::char4>(..., dpct::sub_sat()) to compute a byte-wise saturated subtract on a packed int. On Intel targets this lowers to a scalarized sequence. This PR replaces it with a 4-byte SWAR trick, ((a | 0x80808080u) - b) ^ 0x80808080u, which is safe for the known input range, and with a straight-line two-lane scalar dot (dp4a × 2) instead of the QR6_K-iterated loop.

Applied in all three Q6_K MMVQ sites: the standalone vec_dot_q6_K_q8_1_impl_mmvq, the reorder_vec_dot_q_sycl<Q6_K> method and operator, and the top-level vec_dot_q6_K_q8_1.

Bitwise equivalent on inputs produced by quantization; no accuracy change. Measurable tg improvement on BMG for Q6_K-heavy models.

Additional information

Split from #22066 per reviewer request for independent review.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes. This work was partially produced with an agentic engineering approach: agents surface issues and explore experiments while engineers identify and reject candidates using domain knowledge. Human feedback involved.

Signed-off-by: Chun Tao <chun.tao@intel.com>

ggml-gh-bot · 2026-04-20T07:13:20Z

Hi @aicss-genai, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 8 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

NeoZhangJianyu

Could you share the test result of this PR to show the value?

Thank you!

malsbat · 2026-05-08T17:41:23Z

Here is a subset of test data showing the range of model-dependent improvements. The largest gains are in the tg rows for Qwen2.5-32B-Q6 (about 1.5x); pp throughput is unchanged.

Baseline commit is deab41e

Model	Task	Tokens	baseline (tok/s)	pr-7 (tok/s)	Speedup
Gemma-2-9B	pp	512	842.58 ±6.57	843.84 ±5.91	1.00x
Gemma-2-9B	pp	1024	871.54 ±0.40	869.58 ±0.31	1.00x
Gemma-2-9B	pp	2048	832.49 ±0.75	830.72 ±0.15	1.00x
Gemma-2-9B	pp	4096	711.62 ±0.18	713.11 ±0.36	1.00x
Gemma-2-9B	pp	8192	561.76 ±0.44	562.19 ±0.45	1.00x
Gemma-2-9B	tg	128	57.64 ±0.21	63.23 ±0.30	1.10x
Gemma-2-9B	tg	256	57.51 ±0.20	63.11 ±0.19	1.10x
Gemma-2-9B	tg	512	57.05 ±0.16	62.73 ±0.15	1.10x
Gemma-2-9B	tg	1024	54.91 ±0.10	60.07 ±0.16	1.09x
Llama-3.1-8B-Q8	pp	512	1017.45 ±4.55	1018.08 ±3.58	1.00x
Llama-3.1-8B-Q8	pp	1024	1078.25 ±0.61	1078.49 ±1.04	1.00x
Llama-3.1-8B-Q8	pp	2048	1023.51 ±0.38	1019.95 ±0.60	1.00x
Llama-3.1-8B-Q8	pp	4096	883.86 ±0.20	880.17 ±0.67	1.00x
Llama-3.1-8B-Q8	pp	8192	693.29 ±0.15	693.17 ±0.12	1.00x
Llama-3.1-8B-Q8	tg	128	55.16 ±0.02	55.18 ±0.02	1.00x
Llama-3.1-8B-Q8	tg	256	55.16 ±0.07	55.15 ±0.07	1.00x
Llama-3.1-8B-Q8	tg	512	55.11 ±0.04	55.10 ±0.04	1.00x
Llama-3.1-8B-Q8	tg	1024	54.46 ±0.06	54.47 ±0.05	1.00x
Qwen2.5-32B-Q6	pp	512	240.60 ±0.57	240.83 ±0.60	1.00x
Qwen2.5-32B-Q6	pp	1024	255.02 ±0.07	255.11 ±0.05	1.00x
Qwen2.5-32B-Q6	pp	2048	247.67 ±0.08	247.82 ±0.02	1.00x
Qwen2.5-32B-Q6	pp	4096	226.12 ±0.02	226.23 ±0.05	1.00x
Qwen2.5-32B-Q6	pp	8192	192.92 ±0.01	192.95 ±0.01	1.00x
Qwen2.5-32B-Q6	tg	128	12.20 ±0.02	18.43 ±0.02	1.51x
Qwen2.5-32B-Q6	tg	256	12.18 ±0.01	18.38 ±0.04	1.51x
Qwen2.5-32B-Q6	tg	512	12.17 ±0.01	18.29 ±0.03	1.50x
Qwen2.5-32B-Q6	tg	1024	12.06 ±0.01	18.06 ±0.03	1.50x
Qwen3.5-9B-Q4	pp	512	989.96 ±7.93	992.03 ±8.74	1.00x
Qwen3.5-9B-Q4	pp	1024	1030.30 ±0.49	1030.68 ±0.58	1.00x
Qwen3.5-9B-Q4	pp	2048	1049.97 ±1.12	1050.76 ±0.44	1.00x
Qwen3.5-9B-Q4	pp	4096	1009.27 ±0.37	1008.61 ±0.70	1.00x
Qwen3.5-9B-Q4	pp	8192	934.39 ±0.74	934.75 ±0.51	1.00x
Qwen3.5-9B-Q4	tg	128	57.87 ±0.20	62.08 ±0.13	1.07x
Qwen3.5-9B-Q4	tg	256	57.72 ±0.07	62.00 ±0.04	1.07x
Qwen3.5-9B-Q4	tg	512	57.54 ±0.07	61.94 ±0.05	1.08x
Qwen3.5-9B-Q4	tg	1024	57.45 ±0.02	61.87 ±0.05	1.08x

arthw

Good job!

Thank you very much!

ctao456 and others added 2 commits April 19, 2026 23:39

sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product

4f611ae

Signed-off-by: Chun Tao <chun.tao@intel.com>

Merge branch 'ggml-org:master' into aicss-genai/sycl-bmg-upstream-pr-7

8a94a86

aicss-genai requested a review from a team as a code owner April 20, 2026 07:09

github-actions Bot added the ggml and SYCL labels Apr 20, 2026

NeoZhangJianyu reviewed Apr 22, 2026

arthw approved these changes May 9, 2026

arthw added the merge ready label May 16, 2026

ggerganov merged commit 0caf2a1 into ggml-org:master May 18, 2026
82 of 87 checks passed

kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request May 19, 2026 (2ef2cdd)

xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 19, 2026 (b747367)

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026 (b57bca8)

ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request May 19, 2026 (16c30d7)

fhnmor21 pushed a commit to fhnmor21/llama-cpp-turboquant that referenced this pull request May 19, 2026 (33bee59)

dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request May 21, 2026 (3ab6c8d)

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026 (e4165c9)

srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026 (3bdb7dc)

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026 (b4c07c4)
