
Conversation

@virajwad

Summary
In this PR we modify the m_warptile configurations to increase prompt processing performance for Intel Xe2+ GPUs that support the coopmat extension.

Changes

  1. Increase WGSIZE 4x
  2. Decrease the number of workgroups dispatched for the MM shaders by 2x in the x-dim and 2x in the y-dim (4x total decrease; see the dispatch sketch after this list)
  3. Increase BM, BN each by 2x
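
A rough sketch of the dispatch arithmetic behind items 2 and 3 (illustrative only; ceil_div / num_workgroups below are not functions from the backend):

    #include <cstdint>

    // Each workgroup produces a BM x BN tile of the output, so the dispatch grid
    // is roughly ceil(M/BM) x ceil(N/BN). Doubling BM and BN from 64 to 128 halves
    // the workgroup count in each dimension (the 4x decrease in item 2), while the
    // 4x larger WGSIZE keeps the total invocation count comparable.
    static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

    static uint32_t num_workgroups(uint32_t M, uint32_t N, uint32_t BM, uint32_t BN) {
        return ceil_div(M, BM) * ceil_div(N, BN);
    }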

Accuracy Check
Basic testing with llama-cli across the models that show the perf jump (also trying different prompt sizes) - model output looks correct / reasonable.

Unit tests Check
Checked on a system with an Arc B580 + Intel integrated graphics. All unit tests pass.

[screenshot: unit test results]

Performance Results
The command run was llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128
The eval token-gen results don't change and weren't expected to; only prompt processing does :) The numbers below show prompt processing in tok/s for the Arc B580 and the Lunar Lake Series 2 IGPU.

[screenshots: prompt processing tok/s for Arc B580 and Lunar Lake Series 2 IGPU]

PR Status
Ready for Review

virajwad requested a review from 0cc4m as a code owner on December 18, 2025.
github-actions bot added the Vulkan and ggml labels on December 18, 2025.
if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
if (device->coopmat_support && device->architecture == INTEL_XE2) {
// Xe2/Xe3 with coopmat enabled - warptile performance tuning
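// Per the Changes list above: WGSIZE = 512 and BM = BN = 128 map to the first three
// fields here; the remaining fields keep the existing medium-tile values (field order
// assumed, not spelled out in this thread).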
m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
@jeffbolznv
Collaborator

I wonder if this should actually be the large tile size?

Also, a quick Google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each, which seems very low. But I've never worked on this hardware, so I'm just speculating.

@virajwad
Author
virajwad commented Dec 18, 2025

Hi @jeffbolznv, sure, I can look at re-enabling the large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change.

Are you computing (64 * 1024) / 512 invocations = 128 bytes per invocation, with the assumption of a 4-byte-wide register (to get 32 registers per invocation)?

@jeffbolznv
Collaborator

Yes, that's the calculation I did.

@virajwad
Author

Thanks Jeff, for the Xe architecture each register in the GRF is 32 bytes wide. But I need to look into the register situation a bit deeper.

@jeffbolznv
Collaborator

CC @mmerecki in case this makes sense to also enable for Linux.

m_align = 64;
s_align = 32;

if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
@netrunnereve
Collaborator

Ah it's good to see more tunes show up 😃. Please move your tune to line 2845 so that they're all placed together.

@virajwad
Author

Hi @netrunnereve, thanks! I can do that, but the wg_denoms change also needs to overwrite the default and be 'paired' with this tuning config to pass the unit tests. Do you want me to make two separate 'if' statements with the same conditions?

@netrunnereve
Collaborator

I would just move this section above line 2841.

        l_mmq_wg_denoms = l_wg_denoms = {128, 128, 1 };
        m_mmq_wg_denoms = m_wg_denoms = { 64,  64, 1 };
        s_mmq_wg_denoms = s_wg_denoms = { 32,  32, 1 };
        l_align = 128;
        m_align =  64;
        s_align =  32;

@virajwad
Author

Made the change
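
For orientation, the paired tune roughly looks like this (a sketch assembled from the snippets quoted above, not the exact diff; the wg_denoms values are an assumption matching BM = BN = 128, and the extra Windows-driver check from the earlier snippet is omitted for brevity):

    if ((device->vendor_id == VK_VENDOR_ID_INTEL) && device->coopmat_support
            && device->architecture == INTEL_XE2) {
        // Xe2/Xe3 with coopmat enabled - warptile performance tuning
        m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
        // The dispatch denominators have to be 'paired' with the larger 128x128
        // workgroup tile, otherwise the grid no longer matches the shader and the
        // unit tests fail.
        m_mmq_wg_denoms = m_wg_denoms = { 128, 128, 1 };
    }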

@netrunnereve
Collaborator

Oh and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after tuning.

@virajwad
Author

virajwad commented Dec 18, 2025

> Oh and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after tuning.

Thanks! I checked on the Arc B580; everything saw a good improvement (except for the type_a=bf16 test, which had the same perf).

[screenshot: test-backend-ops MUL_MAT perf results on Arc B580]

My IGPU had the same perf on all quants, as it doesn't support coopmat.

@virajwad
Author

virajwad commented Dec 19, 2025

> CC @mmerecki in case this makes sense to also enable for Linux.

OK, I also tested Ubuntu 24.04 with Lunar Lake, and it does show a perf improvement.
The command run was llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128, and below are the prompt processing numbers in tok/s.

[screenshot: prompt processing tok/s on Ubuntu 24.04 with Lunar Lake]

Also checked test-backend-ops -o MUL_MAT,MUL_MAT_ID; all tests pass on Ubuntu as well.

[screenshot: test-backend-ops MUL_MAT / MUL_MAT_ID results on Ubuntu]

So I removed the Windows driver check; it should work for both Windows and Linux now.

@virajwad
Author

@jeffbolznv
I tried moving tuning config from m_warptile -> l_warptile and re-enabled l_warptile for Intel on line 4879. Tested on B580, the perf is the same across the models and different quants, except for bf16 quant which shows a lot worse performance

[screenshot: perf with the tune moved to l_warptile on B580]

However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config causing significant perf loss on all quants

[screenshot: integrated graphics perf with l_warptile re-enabled]

Since this change is specific to Xe2+, I propose we keep at m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at possibly re-enabling l_warptile separately

@virajwad
Author

@jeffbolznv @netrunnereve @0cc4m
PR is ready for second round of review :)

@jeffbolznv
Collaborator

OK, since the large tile size is disabled on Intel, I guess it's fine to use medium as the large size. It's just a bit confusing.

@0cc4m
Collaborator

0cc4m commented Dec 20, 2025

Thank you, this looks quite interesting! I'll review it soon. Can you explain how you got to this tile configuration? Trial and error, or by using some architecture/driver knowledge?

@0cc4m
Collaborator

0cc4m commented Dec 21, 2025

> @jeffbolznv I tried moving tuning config from m_warptile -> l_warptile and re-enabled l_warptile for Intel on line 4879. Tested on B580, the perf is the same across the models and different quants, except for bf16 quant which shows a lot worse performance
>
> However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config causing significant perf loss on all quants
>
> Since this change is specific to Xe2+, I propose we keep at m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at possibly re-enabling l_warptile separately

The large warptile is disabled for a reason on Intel, yes. But since your new configuration matches the denoms of it, it does make sense to use it. Why not just set the large tile size to your configuration for Intel, and only enable it on Intel if coopmat is available? That would make the change a little clearer.

@virajwad
Author

> Thank you, this looks quite interesting! I'll review it soon. Can you explain how you got to this tile configuration? Trial and error, or by using some architecture/driver knowledge?

It was probably 70-80% trial and error :) There aren't a lot of warptile configurations that will make all unit tests pass. I had to check what was indexing out of bounds (or indexing too short) in the kernel when changing certain values.

@virajwad
Author

> @jeffbolznv I tried moving tuning config from m_warptile -> l_warptile and re-enabled l_warptile for Intel on line 4879. Tested on B580, the perf is the same across the models and different quants, except for bf16 quant which shows a lot worse performance
> However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config causing significant perf loss on all quants
> Since this change is specific to Xe2+, I propose we keep at m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at possibly re-enabling l_warptile separately
>
> The large warptile is disabled for a reason on Intel, yes. But since your new configuration matches the denoms of it, it does make sense to use it. Why not just set the large tile size to your configuration for Intel, and only enable it on Intel if coopmat is available? That would make the change a little clearer.

So are you thinking something like this?

  1. Re-enable l_warptile for Intel on line 4879 only if both XE2+ architecture and coopmat support is present
  2. Change my tuning config from m_warptile --> l_warptile

There's another PR to enable coopmat support for the Xe architecture, but it also picks a different warptile tuning configuration, so I wanted to be more cautious and restrict this to Xe2+ for now.

@0cc4m
Collaborator

0cc4m commented Dec 24, 2025

> So are you thinking something like this?
>
> 1. Re-enable l_warptile for Intel on line 4879 only if both XE2+ architecture and coopmat support is present
>
> 2. Change my tuning config from m_warptile --> l_warptile

Yes, exactly.

> There's another PR to enable coopmat support for the Xe architecture, but it also picks a different warptile tuning configuration, so I wanted to be more cautious and restrict this to Xe2+ for now.

I hadn't seen that PR yet, but since it's still WIP, I wouldn't hold this one back just to stay out of its way. In the end both have to work together anyway.

@virajwad
Author

virajwad commented Dec 29, 2025

Hi @0cc4m @jeffbolznv, I made the change to re-enable l_warptile for Intel under certain conditions (coopmat=true and Xe2+ architecture) and changed my tuning to use l_warptile instead of m_warptile. I spot-checked the MUL_MAT and MUL_MAT_ID unit tests and they all pass; perf numbers still align with my data for both the B580 (Xe2) and the IGPU (non-Xe2).
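
Roughly, the shape of that change (a sketch, not the exact diff; the condition and field names follow the snippets quoted earlier):

    // The Xe2 tune now lives in the large warptile set; l_wg_denoms already
    // defaults to { 128, 128, 1 }, so no denom override is needed here.
    if ((device->vendor_id == VK_VENDOR_ID_INTEL) && device->coopmat_support
            && device->architecture == INTEL_XE2) {
        l_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
    }
    // ...and around line 4879 the large tile size, normally disabled on Intel, is
    // only re-enabled when this same Xe2 + coopmat condition holds, so the
    // integrated GPU keeps falling back to the medium tile.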

My only question is that I'm not sure why this change causes a regression in MUL_MAT specifically for bf16 as the type of source a.
(The left side shows my warptile tuning when using m_warptile; the right side shows it moved to l_warptile with l_warptile re-enabled.)

[screenshot: bf16 MUL_MAT perf, m_warptile tune vs l_warptile tune]

All other MUL_MAT / MUL_MAT_ID quant types show a perf improvement; is there something specific about the bf16 mulmat that could cause a regression like this?

@jeffbolznv
Collaborator

My guess is that coopmat_bf16_support is not supported on this device and this change enables the large tile size for bf16 (set around line 3576) that has not been tuned for this device.

@virajwad
Author

Thanks @jeffbolznv! You were absolutely right: the device doesn't support coopmat_bf16_support, and I hadn't noticed the l/m/s warptile configs around line 3576 for tuning bf16 mulmats. I added a tune for the bf16 l_warptile, which fixes the regression.

When running test-backend-ops.exe perf -o MUL_MAT,MUL_MAT_ID -p "type_a=bf16", the regression is fixed and perf is improved a bit compared to the original m_warptile (before this PR).

[screenshot: bf16 MUL_MAT / MUL_MAT_ID perf after the bf16 l_warptile tune]

When running test-backend-ops perf -o MUL_MAT -p "n=512", all quants, including bf16, now show a perf improvement.

[screenshot: MUL_MAT n=512 perf across quant types]

All MUL_MAT, MUL_MAT_ID unit tests still functionally pass.

I spot-checked perf on the Llama 3.2 1B BF16 model; it was the same before and after the PR.

And finally, I made sure the above BF16 model's accuracy looks fine through llama-cli.

[screenshot: llama-cli output for the BF16 model]

On a separate note, we may want to add more bf16 unit tests for checking accuracy. I don't think there are enough: when I tried a few different warptile tunings, all the unit tests passed, yet the model accuracy through llama-cli at higher prompt sizes was broken. If anyone else modifies the bf16 warptile tuning, it would be good to catch problems early.

@jeffbolznv @0cc4m Thanks for your help with this PR - I think it is ready for a 2nd look again :)

@virajwad
Author

virajwad commented Dec 30, 2025

PR is ready, but a question for @jeffbolznv:

The mul_mm.comp shader uses separate 1D/2D mappings within the shader and in the dispatch. WGSIZE is the only ">1" value defined, on the x-dim, so it's a single value (like 512). When dispatch_pipeline is called for the shader, it uses the wg_denoms values (defined on the x-dim and y-dim) to calculate the number of workgroups to dispatch on x and y. I found this interesting at first, since Vulkan shader examples online typically divide total_work_x / wg_size_x and total_work_y / wg_size_y. Here WGSIZE=512 (on x) is defined and mapped separately, and the dispatch is calculated from the separate wg_denoms = 128x128 values instead of wg_size_x and wg_size_y, so the mapping and the values aren't the same.

Then the mul_mm shader calculates NUM_WARPS = WGSIZE / WARP_SIZE. Say WGSIZE=512 and WARP=32; then NUM_WARPS=16. Since BM*BN is the tile size that a workgroup operates on, and WM*WN is the tile size that a subgroup/warp operates on, the number of warps contained in a workgroup also needs to satisfy (BM*BN)/(WM*WN) = NUM_WARPS in 2D terms.

Due to these constraints, there aren't many warptile configs that let the shader do the matrix multiplication correctly. For example, if I wanted to reduce WM (32->16) and WN (32->16), NUM_WARPS would still be calculated as WGSIZE=512 / WARP=32 = 16 warps, but the 2D mapping needs (BM*BN)/(WM*WN) = (128*128)/(16*16) = 64 warps. So the warptile config is incorrect and causes unit tests / accuracy to fail. If the number of warps were mapped in a 2D way, with the workgroup size being 2D, could more configs be supported?
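
To make the constraint concrete, here is an illustrative check (not code from the shader):

    #include <cstdint>

    // The 1D workgroup size must supply exactly the subgroups that the 2D tiling
    // needs: (BM / WM) * (BN / WN) subgroups cover the BM x BN workgroup tile.
    constexpr uint32_t WGSIZE = 512, WARP_SIZE = 32;
    constexpr uint32_t BM = 128, BN = 128, WM = 32, WN = 32;
    static_assert(WGSIZE / WARP_SIZE == (BM / WM) * (BN / WN),
                  "warptile config does not tile the workgroup exactly");
    // Shrinking WM and WN to 16 would need (128/16) * (128/16) = 64 subgroups,
    // but WGSIZE / WARP_SIZE stays at 16, so that config is invalid.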

Was the shader designed intentionally this way? Any thoughts on this?

@jeffbolznv
Collaborator

The change looks reasonable to me. I agree there aren't enough backend tests to hit the various tile sizes.

@0cc4m knows the mul_mm shader better than I do; I've never fully understood all the tiling parameters.
