
Conversation

@virajwad

Summary
In this PR we modify the m_warptile configurations to increase prompt processing performance for Intel Xe2+ GPUs that support the coopmat extension.

Changes

  1. Increase WGSIZE 4x
  2. Decrease the number of workgroups dispatched for the MM shaders by 2x in the x-dim and 2x in the y-dim (4x total decrease; see the dispatch sketch after this list)
  3. Increase BM, BN each by 2x
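
A rough sketch of the dispatch arithmetic behind items 2 and 3 (illustrative only; ceil_div / num_workgroups below are not functions from the backend):

    #include <cstdint>

    // Each workgroup produces a BM x BN tile of the output, so the dispatch grid
    // is roughly ceil(M/BM) x ceil(N/BN). Doubling BM and BN from 64 to 128 halves
    // the workgroup count in each dimension (the 4x decrease in item 2), while the
    // 4x larger WGSIZE keeps the total invocation count comparable.
    static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

    static uint32_t num_workgroups(uint32_t M, uint32_t N, uint32_t BM, uint32_t BN) {
        return ceil_div(M, BM) * ceil_div(N, BN);
    }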

Accuracy Check
Basic testing with llama-cli across the models that show the perf jump (also trying different prompt sizes) - model output looks correct / reasonable.

Unit tests Check
Checked on a system with an Arc B580 + Intel integrated graphics. All unit tests pass.

[screenshot: unit test results]

Performance Results
The command run was llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128
The eval token-gen results don't change and weren't expected to; only prompt processing does :) The numbers below show prompt processing in tok/s for the Arc B580 and the Lunar Lake Series 2 IGPU.

[screenshots: prompt processing tok/s for Arc B580 and Lunar Lake Series 2 IGPU]

PR Status
Ready for Review

virajwad requested a review from 0cc4m as a code owner on December 18, 2025.
github-actions bot added the Vulkan and ggml labels on December 18, 2025.
if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
if (device->coopmat_support && device->architecture == INTEL_XE2) {
// Xe2/Xe3 with coopmat enabled - warptile performance tuning
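// Per the Changes list above: WGSIZE = 512 and BM = BN = 128 map to the first three
// fields here; the remaining fields keep the existing medium-tile values (field order
// assumed, not spelled out in this thread).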
m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
@jeffbolznv
Collaborator

I wonder if this should actually be the large tile size?

Also, a quick Google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each, which seems very low. But I've never worked on this hardware, so I'm just speculating.

@virajwad
Author
virajwad commented Dec 18, 2025

Hi @jeffbolznv, sure, I can look at re-enabling the large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change.

Are you computing (64 * 1024) / 512 invocations = 128 bytes per invocation, with the assumption of a 4-byte-wide register (to get 32 registers per invocation)?

@jeffbolznv
Collaborator

Yes, that's the calculation I did.

@virajwad
Author

Thanks Jeff, for the Xe architecture each register in the GRF is 32 bytes wide. But I need to look into the register situation a bit deeper.

@jeffbolznv
Collaborator

CC @mmerecki in case this makes sense to also enable for Linux.

m_align = 64;
s_align = 32;

if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
@netrunnereve
Collaborator

Ah it's good to see more tunes show up 😃. Please move your tune to line 2845 so that they're all placed together.

@virajwad
Author

Hi @netrunnereve, thanks! I can do that, but the wg_denoms change also needs to overwrite the default and be 'paired' with this tuning config to pass the unit tests. Do you want me to make two separate 'if' statements with the same conditions?

@netrunnereve
Collaborator

I would just move this section above line 2841.

        l_mmq_wg_denoms = l_wg_denoms = {128, 128, 1 };
        m_mmq_wg_denoms = m_wg_denoms = { 64,  64, 1 };
        s_mmq_wg_denoms = s_wg_denoms = { 32,  32, 1 };
        l_align = 128;
        m_align =  64;
        s_align =  32;

@virajwad
Author

Made the change
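
For orientation, the paired tune roughly looks like this (a sketch assembled from the snippets quoted above, not the exact diff; the wg_denoms values are an assumption matching BM = BN = 128, and the extra Windows-driver check from the earlier snippet is omitted for brevity):

    if ((device->vendor_id == VK_VENDOR_ID_INTEL) && device->coopmat_support
            && device->architecture == INTEL_XE2) {
        // Xe2/Xe3 with coopmat enabled - warptile performance tuning
        m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
        // The dispatch denominators have to be 'paired' with the larger 128x128
        // workgroup tile, otherwise the grid no longer matches the shader and the
        // unit tests fail.
        m_mmq_wg_denoms = m_wg_denoms = { 128, 128, 1 };
    }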

@netrunnereve
Collaborator

Oh and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after tuning.

@virajwad
Author

virajwad commented Dec 18, 2025

> Oh and please run a before and after ./bin/test-backend-ops perf -o MUL_MAT -p "n=512" to make sure that all quants actually run faster after tuning.

Thanks! I checked on the Arc B580; everything saw a good improvement (except for the type_a=bf16 test, which had the same perf).

[screenshot: test-backend-ops MUL_MAT perf results on Arc B580]

My IGPU had the same perf on all quants, as it doesn't support coopmat.

@virajwad
Author

virajwad commented Dec 19, 2025

> CC @mmerecki in case this makes sense to also enable for Linux.

OK, I also tested Ubuntu 24.04 with Lunar Lake, and it does show a perf improvement.
The command run was llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128, and below are the prompt processing numbers in tok/s.

[screenshot: prompt processing tok/s on Ubuntu 24.04 with Lunar Lake]

Also checked test-backend-ops -o MUL_MAT,MUL_MAT_ID; all tests pass on Ubuntu as well.

[screenshot: test-backend-ops MUL_MAT / MUL_MAT_ID results on Ubuntu]

So I removed the Windows driver check; it should work for both Windows and Linux now.

@virajwad
Author

@jeffbolznv
I tried moving tuning config from m_warptile -> l_warptile and re-enabled l_warptile for Intel on line 4879. Tested on B580, the perf is the same across the models and different quants, except for bf16 quant which shows a lot worse performance

[screenshot: perf with the tune moved to l_warptile on B580]

However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config causing significant perf loss on all quants

[screenshot: integrated graphics perf with l_warptile re-enabled]

Since this change is specific to Xe2+, I propose we keep at m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at possibly re-enabling l_warptile separately

@virajwad
Author

@jeffbolznv @netrunnereve @0cc4m
PR is ready for second round of review :)

@jeffbolznv
Collaborator

OK, since the large tile size is disabled on Intel, I guess it's fine to use medium as the large size. It's just a bit confusing.

@0cc4m
Collaborator

0cc4m commented Dec 20, 2025

Thank you, this looks quite interesting! I'll review it soon. Can you explain how you got to this tile configuration? Trial and error, or by using some architecture/driver knowledge?

@0cc4m
Collaborator

0cc4m commented Dec 21, 2025

> @jeffbolznv I tried moving tuning config from m_warptile -> l_warptile and re-enabled l_warptile for Intel on line 4879. Tested on B580, the perf is the same across the models and different quants, except for bf16 quant which shows a lot worse performance
>
> However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config causing significant perf loss on all quants
>
> Since this change is specific to Xe2+, I propose we keep at m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at possibly re-enabling l_warptile separately

The large warptile is disabled for a reason on Intel, yes. But since your new configuration matches the denoms of it, it does make sense to use it. Why not just set the large tile size to your configuration for Intel, and only enable it on Intel if coopmat is available? That would make the change a little clearer.

@virajwad
Author

> Thank you, this looks quite interesting! I'll review it soon. Can you explain how you got to this tile configuration? Trial and error, or by using some architecture/driver knowledge?

It was probably 70-80% trial and error :) There aren't a lot of warptile configurations that will make all unit tests pass. I had to check what was indexing out of bounds (or indexing too short) in the kernel when changing certain values.

@virajwad
Author

> @jeffbolznv I tried moving tuning config from m_warptile -> l_warptile and re-enabled l_warptile for Intel on line 4879. Tested on B580, the perf is the same across the models and different quants, except for bf16 quant which shows a lot worse performance
> However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config causing significant perf loss on all quants
> Since this change is specific to Xe2+, I propose we keep at m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at possibly re-enabling l_warptile separately
>
> The large warptile is disabled for a reason on Intel, yes. But since your new configuration matches the denoms of it, it does make sense to use it. Why not just set the large tile size to your configuration for Intel, and only enable it on Intel if coopmat is available? That would make the change a little clearer.

So are you thinking something like this?

  1. Re-enable l_warptile for Intel on line 4879 only if both XE2+ architecture and coopmat support is present
  2. Change my tuning config from m_warptile --> l_warptile

There's another PR to enable coopmat support for the Xe architecture, but it also picks a different warptile tuning configuration, so I wanted to be more cautious and restrict this to Xe2+ for now.

@0cc4m
Collaborator

0cc4m commented Dec 24, 2025

> So are you thinking something like this?
>
> 1. Re-enable l_warptile for Intel on line 4879 only if both XE2+ architecture and coopmat support is present
>
> 2. Change my tuning config from m_warptile --> l_warptile

Yes, exactly.

> There's another PR to enable coopmat support for the Xe architecture, but it also picks a different warptile tuning configuration, so I wanted to be more cautious and restrict this to Xe2+ for now.

I hadn't seen that PR yet, but since it's still WIP, I wouldn't hold this one back just to stay out of its way. In the end both have to work together anyway.

@virajwad
Author

virajwad commented Dec 29, 2025

Hi @0cc4m @jeffbolznv, I made the change to re-enable l_warptile for Intel under certain conditions (coopmat=true and Xe2+ architecture) and changed my tuning to use l_warptile instead of m_warptile. I spot-checked the MUL_MAT and MUL_MAT_ID unit tests and they all pass; perf numbers still align with my data for both the B580 (Xe2) and the IGPU (non-Xe2).
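
Roughly, the shape of that change (a sketch, not the exact diff; the condition and field names follow the snippets quoted earlier):

    // The Xe2 tune now lives in the large warptile set; l_wg_denoms already
    // defaults to { 128, 128, 1 }, so no denom override is needed here.
    if ((device->vendor_id == VK_VENDOR_ID_INTEL) && device->coopmat_support
            && device->architecture == INTEL_XE2) {
        l_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
    }
    // ...and around line 4879 the large tile size, normally disabled on Intel, is
    // only re-enabled when this same Xe2 + coopmat condition holds, so the
    // integrated GPU keeps falling back to the medium tile.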

My only question is that I'm not sure why this change causes a regression in MUL_MAT specifically for bf16 as the type of source a.
(The left side shows my warptile tuning when using m_warptile; the right side shows it moved to l_warptile with l_warptile re-enabled.)

[screenshot: bf16 MUL_MAT perf, m_warptile tune vs l_warptile tune]

All other MUL_MAT / MUL_MAT_ID quant types show a perf improvement; is there something specific about the bf16 mulmat that could cause a regression like this?

@jeffbolznv
Collaborator

My guess is that coopmat_bf16_support is not supported on this device and this change enables the large tile size for bf16 (set around line 3576) that has not been tuned for this device.

@virajwad
Author

Thanks @jeffbolznv! You were absolutely right: the device doesn't support coopmat_bf16_support, and I hadn't noticed the l/m/s warptile configs around line 3576 for tuning bf16 mulmats. I added a tune for the bf16 l_warptile, which fixes the regression.

When running test-backend-ops.exe perf -o MUL_MAT,MUL_MAT_ID -p "type_a=bf16", the regression is fixed and perf is improved a bit compared to the original m_warptile (before this PR).

[screenshot: bf16 MUL_MAT / MUL_MAT_ID perf after the bf16 l_warptile tune]

When running test-backend-ops perf -o MUL_MAT -p "n=512", all quants, including bf16, now show a perf improvement.

[screenshot: MUL_MAT n=512 perf across quant types]

All MUL_MAT, MUL_MAT_ID unit tests still functionally pass.

I spot-checked perf on the Llama 3.2 1B BF16 model; it was the same before and after the PR.

And finally, I made sure the above BF16 model's accuracy looks fine through llama-cli.

[screenshot: llama-cli output for the BF16 model]

On a separate note, we may want to add more bf16 unit tests for checking accuracy. I don't think there are enough: when I tried a few different warptile tunings, all the unit tests passed, yet the model accuracy through llama-cli at higher prompt sizes was broken. If anyone else modifies the bf16 warptile tuning, it would be good to catch problems early.

@jeffbolznv @0cc4m Thanks for your help with this PR - I think it is ready for a 2nd look again :)

@virajwad
Author

virajwad commented Dec 30, 2025

PR is ready, but a question for @jeffbolznv:

The mul_mm.comp shader uses separate 1D/2D mappings within the shader and in the dispatch. WGSIZE is the only ">1" value defined, on the x-dim, so it's a single value (like 512). When dispatch_pipeline is called for the shader, it uses the wg_denoms values (defined on the x-dim and y-dim) to calculate the number of workgroups to dispatch on x and y. I found this interesting at first, since Vulkan shader examples online typically divide total_work_x / wg_size_x and total_work_y / wg_size_y. Here WGSIZE=512 (on x) is defined and mapped separately, and the dispatch is calculated from the separate wg_denoms = 128x128 values instead of wg_size_x and wg_size_y, so the mapping and the values aren't the same.

Then the mul_mm shader calculates NUM_WARPS = WGSIZE / WARP_SIZE. Say WGSIZE=512 and WARP=32; then NUM_WARPS=16. Since BM*BN is the tile size that a workgroup operates on, and WM*WN is the tile size that a subgroup/warp operates on, the number of warps contained in a workgroup also needs to satisfy (BM*BN)/(WM*WN) = NUM_WARPS in 2D terms.

Due to these constraints, there aren't many warptile configs that let the shader do the matrix multiplication correctly. For example, if I wanted to reduce WM (32->16) and WN (32->16), NUM_WARPS would still be calculated as WGSIZE=512 / WARP=32 = 16 warps, but the 2D mapping needs (BM*BN)/(WM*WN) = (128*128)/(16*16) = 64 warps. So the warptile config is incorrect and causes unit tests / accuracy to fail. If the number of warps were mapped in a 2D way, with the workgroup size being 2D, could more configs be supported?
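
To make the constraint concrete, here is an illustrative check (not code from the shader):

    #include <cstdint>

    // The 1D workgroup size must supply exactly the subgroups that the 2D tiling
    // needs: (BM / WM) * (BN / WN) subgroups cover the BM x BN workgroup tile.
    constexpr uint32_t WGSIZE = 512, WARP_SIZE = 32;
    constexpr uint32_t BM = 128, BN = 128, WM = 32, WN = 32;
    static_assert(WGSIZE / WARP_SIZE == (BM / WM) * (BN / WN),
                  "warptile config does not tile the workgroup exactly");
    // Shrinking WM and WN to 16 would need (128/16) * (128/16) = 64 subgroups,
    // but WGSIZE / WARP_SIZE stays at 16, so that config is invalid.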

Was the shader designed intentionally this way? Any thoughts on this?

@jeffbolznv
Collaborator

The change looks reasonable to me. I agree there aren't enough backend tests to hit the various tile sizes.

@0cc4m knows the mul_mm shader better than I do; I've never fully understood all the tiling parameters.
