ggml-cuda: Repost of 21896: Blackwell native NVFP4 support by michaelw9999 · Pull Request #22196 · ggml-org/llama.cpp

michaelw9999 · 2026-04-21T04:33:30Z

This is a restored clone of PR #21896, "ggml-cuda: Blackwell native NVFP4 support".
Unfortunately, that PR was closed by a rebase error and cannot be reopened.
The exact commits are here as they were before. Sorry about the mix-up!

…ead of block_nvfp4, removed UE4M3 max cap check, merged use_native_mxfp4/nvfp4 into use_native_fp4, merged quantize_mmq_nvfp4/mxfp4/cuda to quantize_mmq_fp4_Cuda, merged mma/mxfp4/nvfp4 into one templated mma_block_scaled_fp4

anskumar01 · 2026-04-22T16:41:49Z

@am17an, @JohannesGaessler, the original PR #21896 was approved before it was closed by mistake. Can this PR be merged now in its current form, while the design discussion for the remaining gaps continues and those gaps are implemented separately?

am17an · 2026-04-23T05:49:31Z

I'm okay merging this however I respect @ORippler's (and I guess on the whole Nvidia side?) reservations. So we should come up with a plan to fix this before merging.

JohannesGaessler · 2026-04-25T08:42:37Z

@am17an sorry, which reservations are you talking about?

am17an · 2026-04-25T08:45:48Z

This one #21896 (comment)

michaelw9999 · 2026-04-25T09:43:03Z

I do agree we would benefit from some fixes here with regard to the tensor scale incorporation.
How, and with what implementation exactly, is still TBD, @ORippler.

Right now with this PR on Qwen3.5-4B :
Mean PPL(Q) : 11.658689 ± 0.371901.

Details

====== Perplexity statistics ======
Mean PPL(Q)                   :  11.658689 ±   0.371901
Mean PPL(base)                :  10.812440 ±   0.339678
Cor(ln(PPL(Q)), ln(PPL(base))):  98.55%
Mean ln(PPL(Q)/PPL(base))     :   0.075354 ±   0.005421
Mean PPL(Q)/PPL(base)         :   1.078266 ±   0.005845
Mean PPL(Q)-PPL(base)         :   0.846249 ±   0.068650

====== KL divergence statistics ======
Mean    KLD:   0.092041 ±   0.002901
Maximum KLD:  11.383351

With the input scale linked up via build_lora, it is 11.599.

When both weight/input scale are factored directly into ggml_mul_mat (the details of where and in what spot don't seem to matter) it's down to:
Mean PPL(Q) : 11.557739 ± 0.366391

Details

====== Perplexity statistics ======
Mean PPL(Q)                   :  11.557739 ±   0.366391
Mean PPL(base)                :  10.812440 ±   0.339678
Cor(ln(PPL(Q)), ln(PPL(base))):  98.51%
Mean ln(PPL(Q)/PPL(base))     :   0.066658 ±   0.005458
Mean PPL(Q)/PPL(base)         :   1.068930 ±   0.005834
Mean PPL(Q)-PPL(base)         :   0.745298 ±   0.066528

====== KL divergence statistics ======
Mean    KLD:   0.091930 ±   0.002889
Maximum KLD:  13.888078

So we have a difference of about 0.1 for this one particular model. The difference is not as large on the larger Nemotron 30B MoE. I just started playing around with Qwen3.6-27B dense to run back-to-back experiments and see whether there are any differences.
On the current generic NVFP4 x Q8 path, which gives 11.40, changing the weight scale inclusion point gives 11.39, so that does not seem worth the effort for nvfp4_q8. Whether the bigger 11.65 to 11.55 difference changes real-world model quality, I do not know, and I'm not sure what the gap is on other models; I imagine it will be more significant for smaller models.
It also helped on an experimental all-NVFP4 x NVFP4 MMVQ kernel.
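The derived rows in the two statistics blocks above can be cross-checked from the mean perplexities alone. The following is a minimal sketch of that arithmetic; `PplSummary` and `summarize_ppl` are hypothetical names introduced here for illustration, and this is not a claim about how the perplexity tool computes its report internally.

```cpp
#include <cmath>

// Derived statistics for a quantized model (Q) against a baseline.
struct PplSummary {
    double ratio;      // PPL(Q) / PPL(base)
    double log_ratio;  // ln(PPL(Q) / PPL(base))
    double diff;       // PPL(Q) - PPL(base)
};

// Hypothetical helper: recompute the derived rows from the two mean values.
inline PplSummary summarize_ppl(double ppl_q, double ppl_base) {
    const double ratio = ppl_q / ppl_base;
    return {ratio, std::log(ratio), ppl_q - ppl_base};
}
```

Calling `summarize_ppl(11.658689, 10.812440)` reproduces the first block's ratio, log ratio, and difference up to the rounding of the printed values; the second block can be checked the same way with 11.557739.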

ORippler · 2026-04-28T16:34:10Z

> I'm okay merging this however I respect @ORippler's (and I guess on the whole Nvidia side?) reservations. So we should come up with a plan to fix this before merging.

Sorry for the radio silence. From our side, proceeding with the split-responsibility approach (the ggml_cgraph constructor is responsible for multiplying per-tensor weight scales onto ops that consume NVFP4 tensors) is fine for the time being. The only part currently missing is the actual tensor core (TC) acceleration for Blackwell (BW) GPUs, so it doesn't make sense to stop here. We will monitor NVFP4 and non-NVFP4 quants for a set of models we are interested in to ensure quality stays as expected within llama.cpp for the time being.

> I do agree we would benefit from some fixes here with regard to the tensor scale incorporation.
> How, and with what implementation exactly, is still TBD, @ORippler.

Regarding optimizations for quantizing incoming activations from F32->A4 (both perf and quality-wise), we feel these can be addressed in separate follow-up PRs.

I will do another round of quality/perf evaluations on DGX Spark and get back to you once I have data available.
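To make the two placements of the per-tensor weight scale concrete, here is a minimal sketch on a single output element. The function names are hypothetical and the weights are treated as already block-dequantized floats; it does not model FP4 storage, block scales, or activation quantization, and it is not the ggml implementation.

```cpp
#include <cstddef>
#include <vector>

// Dot product of one weight row with an activation vector.
inline float dot(const std::vector<float>& w, const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) acc += w[i] * x[i];
    return acc;
}

// Split responsibility: the matmul ignores the per-tensor scale and the
// graph builder multiplies the result afterwards.
inline float scale_after_matmul(const std::vector<float>& w,
                                const std::vector<float>& x,
                                float tensor_scale) {
    return tensor_scale * dot(w, x);
}

// Fused variant: the per-tensor scale is folded into each weight before
// the multiply-accumulate.
inline float scale_inside_matmul(const std::vector<float>& w,
                                 const std::vector<float>& x,
                                 float tensor_scale) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) acc += (tensor_scale * w[i]) * x[i];
    return acc;
}
```

In exact arithmetic both functions return the same value, so any measured difference between the placements has to come from rounding or quantization steps that this sketch leaves out.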

am17an · 2026-04-28T17:24:00Z

@ORippler then let's merge this when tests are green

ORippler · 2026-04-30T15:06:35Z

> I will do another round of quality/perf evaluations on DGX Spark and get back to you once I have data available.

FWIW, here are some numbers for Nemotron 3 Super 120B on Spark (NVFP4 ckpt from here, and Q4_K ckpt from here):

Quality:

Q4_K (W4Q8):
Final estimate: PPL = 4.6237 +/- 0.02802

NVFP4 master (W4Q8)
NVFP4 (W4Q8 Int fallback path)
Final estimate: PPL = 4.6283 +/- 0.02814

NVFP4 branch (W4Q4)
NVFP4 (W4A4 TC path)
Final estimate: PPL = 4.6577 +/- 0.02838

I see no issues with PPL for the fallback path, though quantizing activations to 4-bit undeniably hurts quality (this is in line with the Qwen3.6 analysis at https://www.reddit.com/r/LocalLLaMA/comments/1svq8lm/qwen3635ba3b_klds_ints_and_nvfps/?show=original).
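As a rough illustration of why 4-bit activations lose information, the sketch below quantizes one block of floats to E2M1 magnitudes with a shared scale of amax / 6 and dequantizes it again. It is an assumption-laden toy: the block scale is kept as a plain float rather than rounded to FP8, ties go to the smaller magnitude instead of following hardware rounding, and all names are hypothetical. It is not the kernel from this PR.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kBlock = 16;  // elements sharing one scale (NVFP4 block size)

// Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
constexpr std::array<float, 8> kE2M1 = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round v to the nearest E2M1 magnitude, keeping its sign.
// Ties resolve to the smaller magnitude in this sketch.
inline float round_e2m1(float v) {
    const float a = std::fabs(v);
    float best = kE2M1[0];
    for (float c : kE2M1) {
        if (std::fabs(a - c) < std::fabs(a - best)) best = c;
    }
    return std::copysign(best, v);
}

// Quantize and immediately dequantize one block ("fake quantization").
inline std::array<float, kBlock> fake_quant_block(const std::array<float, kBlock>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    std::array<float, kBlock> out{};
    if (amax == 0.0f) return out;     // all-zero block stays zero
    const float scale = amax / 6.0f;  // map the largest magnitude onto 6, the E2M1 maximum
    for (std::size_t i = 0; i < kBlock; ++i) {
        out[i] = round_e2m1(x[i] / scale) * scale;
    }
    return out;
}
```

With a block whose largest magnitude is 6, the scale is 1 and an input of 2.4 comes back as 2.0, which shows how coarse the grid is away from the block maximum.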

Perf numbers (omitting Q4_K, as a lot of the NVFP4 ckpt is in FP8, which we convert to F32 instead of failing the conversion in our HF converter script):

(base) osimons@spark-9c20:~/llama.cpp$ ./build_50494a/bin/llama-bench -m /gguf/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4_nvfp4.gguf -dio 1 -fa 1 -p 2048
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122570 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122570 MiB
| model                          |       size |     params | backend    | ngl | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | --------------: | -------------------: |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |          pp2048 |        506.21 ± 1.31 |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |           tg128 |         12.01 ± 0.01 |

(base) osimons@spark-9c20:~/llama.cpp_nvfp4$ ./build_nvfp4/bin/llama-bench -m /gguf/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4_nvfp4.gguf -dio 1 -fa 1 -p 2048
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122570 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122570 MiB
| model                          |       size |     params | backend    | ngl | fa | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | --------------: | -------------------: |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |          pp2048 |        611.30 ± 1.30 |
| nemotron_h_moe 120B.A12B NVFP4 |  73.96 GiB |   120.67 B | CUDA       |  99 |  1 |   1 |           tg128 |         11.91 ± 0.01 |

We will focus on quality and perf next, likely taking a look at the quantize kernel, as that does take time (~8% in some nsys traces we took in the past).
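For reading the two llama-bench tables above, the relative change between the two builds can be computed from the t/s columns. This is a small sketch of that arithmetic; `relative_change_percent` is a hypothetical helper, not part of llama-bench.

```cpp
// Percentage change of `candidate` relative to `baseline` (both in tokens/s).
inline double relative_change_percent(double baseline, double candidate) {
    return (candidate / baseline - 1.0) * 100.0;
}
```

Applied to the tables, pp2048 goes from 506.21 to 611.30 t/s, roughly +20.8%, while tg128 goes from 12.01 to 11.91 t/s, roughly -0.8%.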

michaelw9999 · 2026-04-30T17:21:46Z

@ORippler I will quantize this model with my modified llama-quantizer, which does more scale search, and try to upload it to HF if you want to compare. I have not tried to run models this large yet, as I only have a 5090 with 32 GB, so it may be difficult for me to run. On smaller models so far, it gives better PPL and KLD than models converted with the HF script.

Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michaelw9999 and others added 23 commits April 19, 2026 16:50

Blackwell NVFP4 MMQ Kernel

a081845

Removed whitespace

9fb7e84

Added FP8 Max definition and description

0bcf7b2

Fixed 'f' typo

4625a7c

Removed whitespace from comment

3ea6b59

Guard Blackwell NVFP4 quantizer for Blackwell only

db5957e

Merged vec_dot_fp4_fp4_mma together

83b412f

Updated block_fp4_mmq packing comment

78596bf

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

Added assert for QK_K == 8 * QK_MXFP4 in mul_mat_q

a68327c

Removed extra space typo

6e31a22

Changed NVFP4 quant assert and using get_int_b4

58e277e

Removed bool has_ids template from quantize

0e2c794

Updated block_fp4_mmq packing comment

72fc017

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

Added ue4m3 bounds check for testscale

7fcc8c0

Removed whitespace on line 52 of mmq.cuh

7c73198

Fixed MMQ_ITER_K_FP4 returning on non-FP4 models when running on Blackwell

6b26a1c

Change GGML_ASSERT to static_assert

e34b6ff

Co-authored-by: Oliver Simons <osimons@nvidia.com>

Whitespace fixes

02df263

Change amax_raw mul 1/6 to: / 6

9204590

Co-authored-by: Oliver Simons <osimons@nvidia.com>

Hoisted kbx0 and kbx out of the loop

667cc38

Update ggml/src/ggml-cuda/mmq.cuh

553c3a8

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

Add endif blackwell mma comment

0d9e045

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
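Two of the commits above concern a compile-time check that `QK_K == 8 * QK_MXFP4`. The sketch below shows what such a `static_assert` looks like in isolation. The constant values are assumptions chosen to mirror ggml's block-size macros, and `fp4_blocks_per_superblock` is a hypothetical helper; the actual check lives in the PR's `mul_mat_q` code and may differ.

```cpp
// Assumed values for illustration; in ggml these are preprocessor macros.
constexpr int QK_K     = 256;  // elements per K-quant super-block (assumed)
constexpr int QK_MXFP4 = 32;   // elements per MXFP4 block (assumed)

// A mismatch is rejected at compile time instead of aborting at run time,
// which is the practical difference from a GGML_ASSERT-style runtime check.
static_assert(QK_K == 8 * QK_MXFP4, "FP4 tiling assumes 8 MXFP4 blocks per QK_K elements");

// Hypothetical helper: number of FP4 blocks covering one QK_K-sized chunk.
constexpr int fp4_blocks_per_superblock() {
    return QK_K / QK_MXFP4;
}
```

Because the condition only involves compile-time constants, checking it with `static_assert` costs nothing at run time.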

michaelw9999 requested review from a team and ggerganov as code owners April 21, 2026 04:33

github-actions bot added the testing (everything test related), Nvidia GPU (issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 21, 2026

michaelw9999 mentioned this pull request Apr 21, 2026

ggml-cuda: Blackwell native NVFP4 support #21896

Closed

michaelw9999 marked this pull request as draft April 27, 2026 03:48

am17an approved these changes Apr 28, 2026

JohannesGaessler approved these changes Apr 28, 2026

michaelw9999 marked this pull request as ready for review April 28, 2026 19:31

am17an merged commit fc2b005 into ggml-org:master Apr 28, 2026
46 of 52 checks passed

cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 29, 2026 (249b83a)

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026 (59b97d2)

samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026 (dbdb18c)

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026 (ef97d11)

meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026 (b81eac5)

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026 (8ff7d05)

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026 (3ca871b)

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026 (9f7b391)

am17an merged 23 commits into ggml-org:master from michaelw9999:nvfp4-blackwell
