sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product #22156

Merged
ggerganov merged 2 commits into ggml-org:master from aicss-genai:aicss-genai/sycl-bmg-upstream-pr-7 on May 18, 2026

@aicss-genai
Contributor

Overview

Authors

Small perf cleanup for Q6_K MMVQ on Intel GPUs.

The Q6_K MMVQ dot product previously called dpct::vectorized_binary<sycl::char4>(..., dpct::sub_sat()) to compute a byte-wise saturated subtract on a packed int. On Intel targets this lowers to a scalarized sequence. This PR replaces it with a 4-byte SWAR trick, ((a | 0x80808080u) - b) ^ 0x80808080u, which is safe under the known-range inputs, and with a straight-line two-lane scalar dot (dp4a × 2) instead of the QR6_K-iterated loop.
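The two pieces described above can be pictured with a small host-side C++ sketch. This is not the kernel code from the PR: `swar_sub_bytes`, `as_i8`, `dp4a_ref`, `dot_two_lanes` and `check_example` are hypothetical names, `dp4a_ref` is a portable stand-in for a hardware dp4a, and the per-lane scale layout (`sc`, `d8`, `d`) is an assumption made only for illustration. It also assumes that "known-range inputs" means every byte of the quant word is in 0..63 and the subtrahend is 0x20 in every byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name not from the PR): per-byte a - b on four packed
// bytes. Setting bit 7 of every byte of `a` makes each lane at least 128, so
// subtracting a byte of `b` that is at most 127 never borrows from the next
// lane. The final XOR flips bit 7 back, leaving (a - b) mod 256 per byte.
// Assumes every byte of `a` and every byte of `b` is below 128.
constexpr uint32_t swar_sub_bytes(uint32_t a, uint32_t b) {
    return ((a | 0x80808080u) - b) ^ 0x80808080u;
}

// Interpret the low 8 bits of v as a signed byte.
constexpr int32_t as_i8(uint32_t v) {
    return static_cast<int32_t>(v & 0xFFu) - ((v & 0x80u) ? 256 : 0);
}

// Portable stand-in for dp4a: four signed 8-bit products accumulated
// into a 32-bit integer.
constexpr int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        acc += as_i8(a >> (8 * i)) * as_i8(b >> (8 * i));
    }
    return acc;
}

// Straight-line two-lane dot: two packed subtracts, two dp4a calls, no loop.
// The scale arguments are an assumed layout, not the kernel's signature.
inline float dot_two_lanes(const uint32_t q[2], const uint32_t u[2],
                           const int8_t sc[2], const float d8[2], float d) {
    const int32_t s0 = dp4a_ref(swar_sub_bytes(q[0], 0x20202020u), u[0], 0) * sc[0];
    const int32_t s1 = dp4a_ref(swar_sub_bytes(q[1], 0x20202020u), u[1], 0) * sc[1];
    return d * (d8[0] * static_cast<float>(s0) + d8[1] * static_cast<float>(s1));
}

// Bytes 0x3F, 0x00, 0x1F, 0x20 minus 0x20 give 0x1F, 0xE0 (-32), 0xFF (-1), 0x00.
static_assert(swar_sub_bytes(0x3F001F20u, 0x20202020u) == 0x1FE0FF00u,
              "per-byte subtract without cross-lane borrow");

// Worked example: lane words decode to {1,0,0,0} and {0,0,0,-1}.
inline void check_example() {
    const uint32_t q[2]  = {0x20202021u, 0x1F202020u};
    const uint32_t u[2]  = {0x00000003u, 0x05000000u};
    const int8_t   sc[2] = {2, 1};
    const float    d8[2] = {1.0f, 2.0f};
    // 0.5 * (1.0 * (1*3*2) + 2.0 * (-1*5*1)) = -2.0
    assert(dot_two_lanes(q, u, sc, d8, 0.5f) == -2.0f);
}
```

Calling `check_example()` exercises both lanes once. On a real device the `dp4a_ref` loop would be replaced by whatever dp4a primitive the backend provides.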

Applied in all three Q6_K MMVQ sites: the standalone vec_dot_q6_K_q8_1_impl_mmvq, the reorder_vec_dot_q_sycl<Q6_K> method and operator, and the top-level vec_dot_q6_K_q8_1.

Bitwise equivalent on inputs produced by quantization; no accuracy change. Measurable tg improvement on BMG for Q6_K-heavy models.
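The equivalence claim can be checked on the host with a brute-force comparison. The sketch below is an illustration, not the test used for this PR: all function names are hypothetical, and it assumes the quantized bytes lie in 0..63 with a constant 0x20 subtracted from each byte. Under that assumption the signed difference stays within -32..31, so saturation never triggers and the plain SWAR subtract should match the saturating one bit for bit.

```cpp
#include <cassert>
#include <cstdint>

// SWAR form quoted in the description (helper name is hypothetical).
constexpr uint32_t swar_sub_bytes(uint32_t a, uint32_t b) {
    return ((a | 0x80808080u) - b) ^ 0x80808080u;
}

// Reference: signed saturating subtract, one byte at a time.
constexpr uint32_t ref_sub_sat_bytes(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        const int x = static_cast<int>((a >> (8 * i)) & 0xFFu);
        const int y = static_cast<int>((b >> (8 * i)) & 0xFFu);
        int d = (x >= 128 ? x - 256 : x) - (y >= 128 ? y - 256 : y);
        if (d > 127)  d = 127;
        if (d < -128) d = -128;
        out |= (static_cast<uint32_t>(d) & 0xFFu) << (8 * i);
    }
    return out;
}

// Compare both forms for all 64^4 packed inputs whose bytes lie in 0..63,
// against the constant offset 0x20 in every byte (assumed input range).
inline bool swar_matches_on_quant_range() {
    const uint32_t offset = 0x20202020u;
    for (uint32_t n = 0; n < (1u << 24); ++n) {
        // Spread the four 6-bit fields of n into the four bytes of a.
        const uint32_t a = (n & 0x3Fu)
                         | (((n >> 6)  & 0x3Fu) << 8)
                         | (((n >> 12) & 0x3Fu) << 16)
                         | (((n >> 18) & 0x3Fu) << 24);
        if (swar_sub_bytes(a, offset) != ref_sub_sat_bytes(a, offset)) {
            return false;
        }
    }
    return true;
}

// Outside the assumed range the two forms differ: a low byte of 0x80 (-128)
// saturates to 0x80 in the reference, while the SWAR form wraps to 0xE0.
static_assert(ref_sub_sat_bytes(0x00000080u, 0x20202020u) == 0xE0E0E080u, "saturates");
static_assert(swar_sub_bytes(0x00000080u, 0x20202020u) == 0xE0E0E0E0u, "wraps");

inline void check_all() {
    assert(swar_matches_on_quant_range());
}
```

The two `static_assert` lines show why the range condition matters: the SWAR form is a wrapping subtract, so it only stands in for `sub_sat` when saturation cannot occur.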

Additional information

Split from #22066 per reviewer request for independent review.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. This work was partially produced with an agentic engineering approach: agents surface issues and explore experiments while engineers identify and reject candidates using domain knowledge. Human feedback involved.

@aicss-genai aicss-genai requested a review from a team as a code owner April 20, 2026 07:09
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 20, 2026

Hi @aicss-genai, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 8 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions Bot added the ggml and SYCL labels Apr 20, 2026
Contributor

@NeoZhangJianyu NeoZhangJianyu left a comment

Could you share test results for this PR to show its benefit?

Thank you!

@malsbat
Contributor

malsbat commented May 8, 2026

Here is a subset of test data showing the range of model-dependent improvements. See bold lines for greatest improvement below.

Baseline commit is deab41e

| Model | Task | Tokens | baseline (tok/s) | pr-7 (tok/s) | Speedup |
|---|---|---|---|---|---|
| Gemma-2-9B | pp | 512 | 842.58 ±6.57 | 843.84 ±5.91 | 1.00x |
| Gemma-2-9B | pp | 1024 | 871.54 ±0.40 | 869.58 ±0.31 | 1.00x |
| Gemma-2-9B | pp | 2048 | 832.49 ±0.75 | 830.72 ±0.15 | 1.00x |
| Gemma-2-9B | pp | 4096 | 711.62 ±0.18 | 713.11 ±0.36 | 1.00x |
| Gemma-2-9B | pp | 8192 | 561.76 ±0.44 | 562.19 ±0.45 | 1.00x |
| Gemma-2-9B | tg | 128 | 57.64 ±0.21 | 63.23 ±0.30 | 1.10x |
| Gemma-2-9B | tg | 256 | 57.51 ±0.20 | 63.11 ±0.19 | 1.10x |
| Gemma-2-9B | tg | 512 | 57.05 ±0.16 | 62.73 ±0.15 | 1.10x |
| Gemma-2-9B | tg | 1024 | 54.91 ±0.10 | 60.07 ±0.16 | 1.09x |
| Llama-3.1-8B-Q8 | pp | 512 | 1017.45 ±4.55 | 1018.08 ±3.58 | 1.00x |
| Llama-3.1-8B-Q8 | pp | 1024 | 1078.25 ±0.61 | 1078.49 ±1.04 | 1.00x |
| Llama-3.1-8B-Q8 | pp | 2048 | 1023.51 ±0.38 | 1019.95 ±0.60 | 1.00x |
| Llama-3.1-8B-Q8 | pp | 4096 | 883.86 ±0.20 | 880.17 ±0.67 | 1.00x |
| Llama-3.1-8B-Q8 | pp | 8192 | 693.29 ±0.15 | 693.17 ±0.12 | 1.00x |
| Llama-3.1-8B-Q8 | tg | 128 | 55.16 ±0.02 | 55.18 ±0.02 | 1.00x |
| Llama-3.1-8B-Q8 | tg | 256 | 55.16 ±0.07 | 55.15 ±0.07 | 1.00x |
| Llama-3.1-8B-Q8 | tg | 512 | 55.11 ±0.04 | 55.10 ±0.04 | 1.00x |
| Llama-3.1-8B-Q8 | tg | 1024 | 54.46 ±0.06 | 54.47 ±0.05 | 1.00x |
| Qwen2.5-32B-Q6 | pp | 512 | 240.60 ±0.57 | 240.83 ±0.60 | 1.00x |
| Qwen2.5-32B-Q6 | pp | 1024 | 255.02 ±0.07 | 255.11 ±0.05 | 1.00x |
| Qwen2.5-32B-Q6 | pp | 2048 | 247.67 ±0.08 | 247.82 ±0.02 | 1.00x |
| Qwen2.5-32B-Q6 | pp | 4096 | 226.12 ±0.02 | 226.23 ±0.05 | 1.00x |
| Qwen2.5-32B-Q6 | pp | 8192 | 192.92 ±0.01 | 192.95 ±0.01 | 1.00x |
| Qwen2.5-32B-Q6 | tg | 128 | 12.20 ±0.02 | 18.43 ±0.02 | 1.51x |
| Qwen2.5-32B-Q6 | tg | 256 | 12.18 ±0.01 | 18.38 ±0.04 | 1.51x |
| Qwen2.5-32B-Q6 | tg | 512 | 12.17 ±0.01 | 18.29 ±0.03 | 1.50x |
| Qwen2.5-32B-Q6 | tg | 1024 | 12.06 ±0.01 | 18.06 ±0.03 | 1.50x |
| Qwen3.5-9B-Q4 | pp | 512 | 989.96 ±7.93 | 992.03 ±8.74 | 1.00x |
| Qwen3.5-9B-Q4 | pp | 1024 | 1030.30 ±0.49 | 1030.68 ±0.58 | 1.00x |
| Qwen3.5-9B-Q4 | pp | 2048 | 1049.97 ±1.12 | 1050.76 ±0.44 | 1.00x |
| Qwen3.5-9B-Q4 | pp | 4096 | 1009.27 ±0.37 | 1008.61 ±0.70 | 1.00x |
| Qwen3.5-9B-Q4 | pp | 8192 | 934.39 ±0.74 | 934.75 ±0.51 | 1.00x |
| Qwen3.5-9B-Q4 | tg | 128 | 57.87 ±0.20 | 62.08 ±0.13 | 1.07x |
| Qwen3.5-9B-Q4 | tg | 256 | 57.72 ±0.07 | 62.00 ±0.04 | 1.07x |
| Qwen3.5-9B-Q4 | tg | 512 | 57.54 ±0.07 | 61.94 ±0.05 | 1.08x |
| Qwen3.5-9B-Q4 | tg | 1024 | 57.45 ±0.02 | 61.87 ±0.05 | 1.08x |

Contributor

@arthw arthw left a comment

Good job!

Thank you very much!

@arthw arthw added the merge ready label May 16, 2026
@ggerganov ggerganov merged commit 0caf2a1 into ggml-org:master May 18, 2026
82 of 87 checks passed
kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request May 19, 2026
)

Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>

Labels

  • ggml: changes relating to the ggml tensor library for machine learning
  • merge ready: A maintainer can use this label to indicate that they consider the changes final and ready to merge.
  • SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language

6 participants