vulkan: Warptile tuning for Intel Xe2/Xe3 #18178
base: master
Conversation
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
    if (device->coopmat_support && device->architecture == INTEL_XE2) {
        // Xe2/Xe3 with coopmat enabled - warptile performance tuning
        m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
I wonder if this should actually be the large tile size?
Also, a quick google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each which seems very low. But I've never worked on this hardware so I'm just speculating.
Hi @jeffbolznv Sure, I can look at re-enabling the large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change.
Are you computing (64 * 1024) / 512 invocations = 128 bytes per invocation, and assuming 4-byte-wide registers (to get 32 registers per invocation)?
Yes, that's the calculation I did.
Thanks Jeff, for the Xe architecture each register in the GRF is 32 bytes wide. But I need to look into the register situation a bit deeper.
CC @mmerecki in case this makes sense to also enable for Linux.
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
    m_align = 64;
    s_align = 32;

if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
Ah it's good to see more tunes show up 😃. Please move your tune to line 2845 so that they're all placed together.
Hi @netrunnereve thanks! I can do that but the wg_denoms change also needs to overwrite the default and be 'paired' with this tuning config to pass unit tests. Do you want me to make two separate 'if' statements with the same conditions?
I would just move this section above line 2841.
l_mmq_wg_denoms = l_wg_denoms = { 128, 128, 1 };
m_mmq_wg_denoms = m_wg_denoms = {  64,  64, 1 };
s_mmq_wg_denoms = s_wg_denoms = {  32,  32, 1 };
l_align = 128;
m_align = 64;
s_align = 32;
Made the change
Oh and please run a before and after comparison.
Ok I tested Ubuntu 24.04 as well with Lunar Lake, and it does show a perf improvement.
Also checked test-backend-ops -o MUL_MAT,MUL_MAT_ID. All pass on Ubuntu as well.
So I removed the Windows driver check. It should work for both Windows and Linux now.
@jeffbolznv
However, by re-enabling l_warptile for Intel, our integrated graphics ends up switching to the standard l_warptile config, causing a significant perf loss on all quants.
Since this change is specific to Xe2+, I propose we keep m_warptile for this PR and not re-enable l_warptile across all Intel GPUs. I would like to look at re-enabling l_warptile separately.
@jeffbolznv @netrunnereve @0cc4m |
OK, since the large tile size is disabled on Intel I guess it's fine to use medium as the large size. It's just a bit confusing.
Thank you, this looks quite interesting! I'll review it soon. Can you explain how you got to this tile configuration? Trial and error, or by using some architecture/driver knowledge?
The large warptile is disabled for a reason on Intel, yes. But since your new configuration matches the denoms of it, it does make sense to use it. Why not just set the large tile size to your configuration for Intel, and only enable it on Intel if coopmat is available? That would make the change a little clearer.
It was probably 70-80% trial and error :) There aren't a lot of warptile configurations that will let all unit tests pass. I had to check what was indexing out of bounds (or indexing too short) in the kernel when changing certain values.
So are you thinking something like this?
There's another PR to enable coopmat support for the Xe architecture, but it also picks a different warptile tuning configuration, so I wanted to be more cautious and restrict this to Xe2+ for now.
Yes, exactly.
I hadn't seen that PR yet, but since it's still WIP, I wouldn't hold this one back so as not to get in the way. In the end both have to work together anyway.
Hi @0cc4m @jeffbolznv I made the change to re-enable l_warptile for Intel under certain conditions (coopmat=true and Xe2+ architecture) and changed my tuning to use l_warptile instead of m_warptile. Spot-checked the MUL_MAT and MUL_MAT_ID unit tests and they all pass; perf numbers still align with my data for both the B580 (Xe2) and the iGPU (non-Xe2). My only question is that I'm not sure why this change causes a regression in MUL_MAT specifically for the bf16 type in source a.
All other MUL_MAT / MUL_MAT_ID quant types show a perf improvement. Is there something specific to bf16 mulmat that could cause a regression like this?
My guess is that coopmat_bf16_support is not supported on this device and this change enables the large tile size for bf16 (set around line 3576) that has not been tuned for this device.
Thanks @jeffbolznv! You were absolutely right, the device didn't support it.
All MUL_MAT, MUL_MAT_ID unit tests still pass functionally. I spot-checked perf on a Llama 3.2 1B BF16 model; it was the same before and after the PR. Finally, I made sure the above BF16 model's accuracy looks fine through llama-cli.
On a separate note, we may want to add more bf16 unit tests for checking accuracy. I don't think there are enough: when I tried a few different warptile tunings, all the unit tests passed, but the model accuracy through llama-cli at higher prompt sizes was broken. If anyone else modifies the bf16 warptile tuning, it would be good to catch problems early. @jeffbolznv @0cc4m Thanks for your help with this PR - I think it is ready for a second look :)
PR ready - but a question @jeffbolznv: the mul_mm shader calculates its warp layout from the warptile parameters. Due to these constraints, there aren't a lot of warptile configs that will allow the shader to correctly do matrix multiplication. For example, if I wanted to reduce WM (32->16) and WN (32->16), then NUM_WARPS would still be calculated as WGSIZE=512 / WARP=32 = 16 warps, but the 2D mapping of those warps no longer matches. Was the shader designed intentionally this way? Any thoughts on this?
The change looks reasonable to me. I agree there aren't enough backend tests to hit the various tile sizes. @0cc4m knows the mul_mm shader better than I do; I've never fully understood all the tiling parameters.









Summary
In this PR we modify the m_warptile configuration to increase prompt processing performance on Intel Xe2+ GPUs that support the coopmat extension.
Changes
Accuracy Check
Basic testing with llama-cli across the models that show the perf jump (and trying different prompt sizes) - model output looks correct / reasonable.
Unit Tests Check
Checked on a system with an Arc B580 + Intel integrated graphics. All unit tests pass.
Performance Results
Command ran is `llama-bench.exe -m ..\<model> -p 512 -r 5 -n 128`. The eval token gen results don't change and weren't expected to, only prompt processing :) The numbers below show prompt processing in tok/s for the Arc B580 and Lunar Lake Series 2 iGPU.
PR Status
Ready for Review