[FIX] Always support TP > 4 for FP4 Gemm#17300
Conversation
Summary of Changes (Gemini Code Assist): This pull request resolves an issue that prevented FP4 quantized models from initializing with high tensor parallelism (TP >= 4). It does so by implementing a padding strategy for both weights and activations, ensuring that matrix dimensions conform to the alignment requirements of the FlashInfer-CUTLASS FP4 GEMM kernels and thereby improving the compatibility of FP4 quantization across parallel configurations.
Code Review
This pull request introduces padding for FP4 quantized weights and activations to ensure they meet the alignment requirements of CUTLASS/FlashInfer kernels, specifically for tensor parallelism (TP) levels of 4 or greater. The changes include adding helper functions for padding and slicing tensors, and integrating this logic into the ModelOptFp4LinearMethod. The implementation correctly stores the original tensor dimensions, applies padding during weight processing and activation quantization, and slices the output to remove padding after the GEMM operation. My review focuses on the correctness and efficiency of the new padding logic. I've suggested a minor optimization to combine padding operations for better performance.
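The pad-then-slice scheme the review describes can be sketched as follows. This is a minimal illustration with hypothetical helper names (`pad_last_dim`, `slice_output`), not the PR's actual code:

```python
import torch
import torch.nn.functional as F

def pad_last_dim(t: torch.Tensor, multiple: int) -> torch.Tensor:
    """Zero-pad the last dimension of `t` up to the next multiple of `multiple`."""
    pad = (-t.shape[-1]) % multiple
    return F.pad(t, (0, pad)) if pad else t

def slice_output(out: torch.Tensor, orig_cols: int) -> torch.Tensor:
    """Drop the padded columns after the GEMM so callers see the original shape."""
    return out[..., :orig_cols]
```

The key invariant is that padding is invisible to callers: the original dimension is stored before padding, and the GEMM output is sliced back to it afterwards.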
@danielafrimi Have you tried fp4 gemm implementations other than cutlass?
/tag-and-rerun-ci
layer.weights_padding_cols = 0
return
# Pad weights for CUTLASS/FlashInfer kernel alignment (K and N divisible by 32)
QQ: if this comment is accurate, we also need to pad it for the trtllm backend? should we also do it under the block above
Hi @danielafrimi, can you fix the merge conflicts?
Signed-off-by: root <dafrimi@nvidia.com>
@b8zhong @Fridge003
To match the scale and weight N-dims we pad N to a multiple of 128, and for the K-dim (so that (K / 16) % 4 == 0) we pad the scales and weights accordingly, which forces us to pad the activation as well. BTW, for
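The dimension math described in this comment can be sketched as follows, assuming the stated constraints: N padded to a multiple of 128, and K padded so that (K / 16) % 4 == 0, which is equivalent to K being divisible by 64. The helper name `padded_dims` is illustrative:

```python
def padded_dims(n: int, k: int) -> tuple[int, int]:
    # N: round up to a multiple of 128 so weight and scale N-dims stay matched.
    # K: (K / 16) % 4 == 0  <=>  K % 64 == 0.
    return n + (-n) % 128, k + (-k) % 64
```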
/tag-and-rerun-ci
Summary
This PR enables FP4 (NVFP4) quantization to work with TP >= 4.
Background
Previously, FP4 quantized models would fail to initialize with TP=4/8 due to kernel alignment requirements: the FlashInfer-CUTLASS FP4 GEMM kernels require both the N and K dimensions to be divisible by 32.
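A trivial check for that alignment requirement (illustrative only; `needs_padding` is a hypothetical name):

```python
def needs_padding(dim: int, align: int = 32) -> bool:
    # FlashInfer-CUTLASS FP4 GEMM requires the N and K dims divisible by 32;
    # sharding a dimension across TP ranks can break this.
    return dim % align != 0
```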
Changes
This PR makes the following changes to support TP >= 4: the weights are padded to meet the kernel alignment requirements, and in cases where the K dim of the weights is padded, the activation is padded to match in the forward pass.
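The forward-pass handling could look roughly like this. All names here are hypothetical, and `gemm_fn` stands in for the quantized FP4 GEMM call:

```python
import torch
import torch.nn.functional as F

def forward_with_padding(x, gemm_fn, orig_out_features: int, k_padding: int):
    # Pad the activation's K dim to match the padded weight K dim.
    if k_padding > 0:
        x = F.pad(x, (0, k_padding))
    out = gemm_fn(x)
    # Slice off the padded N columns so downstream layers see original shapes.
    return out[..., :orig_out_features]
```

Note that padding the activation costs an extra copy per forward pass, which is the trade-off for supporting shard sizes that the kernel cannot otherwise handle.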