ggml-cuda: Repost of 21896: Blackwell native NVFP4 support #22196
…ead of block_nvfp4, removed UE4M3 max cap check, merged use_native_mxfp4/nvfp4 into use_native_fp4, merged quantize_mmq_nvfp4/mxfp4/cuda to quantize_mmq_fp4_Cuda, merged mma/mxfp4/nvfp4 into one templated mma_block_scaled_fp4
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@am17an, @JohannesGaessler: the original PR #21896 was approved before it got closed by mistake. Can this PR be merged now in its current form, while design discussion for the remaining gaps continues and the fixes are implemented separately?
I'm okay with merging this; however, I respect @ORippler's (and I guess on the whole Nvidia side's?) reservations. So we should come up with a plan to fix this before merging.
@am17an sorry, which reservations are you talking about?
This one: #21896 (comment)
I do agree we would benefit from some fixes here with regard to the tensor scale incorporation. Right now with this PR on Qwen3.5-4B: [Details] With input scale linked up via build_lora, When both weight/input scale are factored directly into [Details] So we have a 0.1 difference for this one particular model. It is not as large a difference on the larger Nemotron 30B MoE. I just started playing around with Qwen3.6-27B dense to experiment back to back and see if there are any differences.
Sorry for the radio silence. From our side, proceeding with the split-responsibility (
Regarding optimizations for quantizing incoming activations from F32->A4 (both perf and quality-wise), we feel these can be addressed in separate follow-up PRs. I will do another round of quality/perf evaluations on DGX Spark and get back to you once I have data available.
@ORippler then let's merge this when tests are green |
FWIW, here are some numbers for Nemotron 3 Super 120B on Spark (NVFP4 ckpt from here, and Q4_K ckpt from here): Quality: See no issues with PPL for the fallback path, though quantizing activations to 4-bit undeniably hurts quality (this is in line with the analysis for Qwen3.6 at https://www.reddit.com/r/LocalLLaMA/comments/1svq8lm/qwen3635ba3b_klds_ints_and_nvfps/?show=original). Perf numbers (omitting Q4_K, as a lot of the NVFP4 ckpt is in FP8, which we convert to F32 instead of failing the conversion in our hf converter script): We will focus on quality and perf next, likely taking a look at the quantize kernel, as that does take time (~8% in some nsys traces we took in the past).
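For readers following the quality discussion, here is a minimal sketch of why 4-bit activation quantization loses information. It assumes the usual NVFP4 layout (blocks of 16 elements, E2M1 element values, one scale per block) and is not the kernel from this PR. The names `fake_quantize_block` and `rms_error` are hypothetical, and the block scale is kept as a plain float rather than being rounded to FP8 (UE4M3) as real NVFP4 does.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign bit is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 elements


def fake_quantize_block(block):
    """Quantize one block to E2M1 and dequantize it again.

    Simplification (assumption): the block scale stays a Python float;
    real NVFP4 stores it as FP8, which adds a second rounding step.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / E2M1_VALUES[-1]  # map the largest magnitude to 6.0
    dequantized = []
    for v in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda q: abs(abs(v) / scale - q))
        dequantized.append(math.copysign(mag, v) * scale)
    return scale, dequantized


def rms_error(block):
    """Root-mean-square error introduced by the round trip."""
    _, deq = fake_quantize_block(block)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block))


if __name__ == "__main__":
    activations = [6.0, -3.0, 1.0, 0.5, 4.4, -2.6, 0.2] + [0.0] * 9
    assert len(activations) == BLOCK_SIZE
    print(fake_quantize_block(activations))
    print(rms_error(activations))
```

Values that already sit on the E2M1 grid survive the round trip unchanged, while anything in between (such as 4.4 when the block maximum is 6.0) snaps to a neighbour. The coarse spacing at the top of the range (2, 3, 4, 6) is where most of the error comes from.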
@ORippler I will quantize this model with my modified llama-quantizer which does more scale search and try to upload it to hf, if you want to compare. I have not tried to run models this large yet as I only have a 5090/32gb, so it may be difficult for me to run; on smaller models thus far, it has better ppl and kld than those converted with the hf script. |
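The comment above does not describe how the modified quantizer searches for scales, so the following is only a generic illustration of the idea, not that tool's method. It tries a handful of candidate block scales around the default `amax / 6` and keeps the one with the lowest squared error. `FACTORS`, `squared_error`, and `search_scale` are hypothetical names, and the candidate grid is an arbitrary assumption.

```python
# Magnitudes representable by FP4 E2M1 (the sign bit is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Assumed candidate multipliers around the default scale; 1.00 keeps the
# default in the running, so the search can never do worse than it.
FACTORS = (0.80, 0.85, 0.90, 0.95, 1.00, 1.05, 1.10, 1.15, 1.20)


def squared_error(block, scale):
    """Sum of squared round-trip errors for one block at a given scale."""
    err = 0.0
    for v in block:
        # Nearest representable magnitude; values above 6 * scale clamp to 6.
        mag = min(E2M1_VALUES, key=lambda q: abs(abs(v) / scale - q))
        err += (abs(v) - mag * scale) ** 2
    return err


def search_scale(block):
    """Return (scale, error) for the best candidate scale of this block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, 0.0
    base = amax / E2M1_VALUES[-1]
    candidates = ((base * f, squared_error(block, base * f)) for f in FACTORS)
    return min(candidates, key=lambda c: c[1])


if __name__ == "__main__":
    block = [5.0] * 15 + [6.0]
    print("default:", squared_error(block, 6.0 / 6.0))
    print("searched:", search_scale(block))
```

In this example the default scale places every 5.0 halfway between the grid points 4 and 6, whereas a slightly smaller scale lets the 5.0 values land close to a grid point at the cost of clipping the single 6.0.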
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This is a restored clone of PR #21896, "ggml-cuda: Blackwell native NVFP4 support".
Unfortunately, it was closed because of a rebase error and cannot be reopened.
The exact commits are here as they were before. Sorry about this mixup!