Active project: Ampere sm_86 INT4/INT8 packing, quantization, and BF16-output GEMM for NVIDIA Ampere Tensor Cores.
This checkout is being migrated from the original Blackwell QuTLASS MXFP/NVFP codebase into an Ampere-focused CUDA extension. New work should target explicit integer quantization contracts: packed INT4 data, INT8 data, scale and zero-point handling, and integer accumulation paths feeding BF16 results.
Primary active API surface:
- pack_int4
- quantize_int8
- matmul_int4_bf16_tn
- matmul_int8_bf16_tn
The intended active architecture target is sm_86 first, with sm_80 compatibility where practical. MXFP, NVFP, FP4, FP8, and Blackwell-specific CUTLASS paths are retained only as legacy/source material unless a task explicitly asks for them.
- Active Ampere Scope
- Getting Started
- Usage
- Validation
- Legacy/Source Material: MXFP, NVFP, and Blackwell
- Citation
The active implementation direction is an INT4/INT8 CUDA extension for Ampere:
- INT4 packing via pack_int4
- INT8 quantization via quantize_int8
- BF16-output Tensor Core GEMM via matmul_int4_bf16_tn
- BF16-output Tensor Core GEMM via matmul_int8_bf16_tn
- Explicit scale, zero-point, layout, and accumulation behavior for Ampere kernels
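To make the packing contract in the list above concrete, here is a minimal CPU-side sketch of two-per-byte INT4 packing. The names pack_int4_reference and unpack_int4_reference are hypothetical, and the sketch assumes signed values in [-8, 7] with the even-indexed value in the low nibble; the layout that qutlass.pack_int4 actually uses may differ and should be taken from the bindings and tests.

```python
def pack_int4_reference(values):
    """Pack signed 4-bit integers (each in [-8, 7]) two per byte.

    Assumed layout (not confirmed by the extension): the even-indexed
    value goes in the low nibble, the odd-indexed value in the high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of INT4 values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("INT4 values must lie in [-8, 7]")
        # Masking with 0xF yields the two's-complement nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4_reference(packed):
    """Inverse of pack_int4_reference: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 encode -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


weights = [-8, 7, 0, -1, 3, -4]
packed = pack_int4_reference(weights)
assert unpack_int4_reference(packed) == weights
```

A round-trip check like the final assertion is a cheap way to pin down nibble order and sign handling before comparing against the CUDA kernel.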
Ampere kernel work should avoid Blackwell-only assumptions such as sm_100, sm_120, MXFP block-scale layouts as the public contract, or FP8 scale dtypes as the active interface.
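One way to catch a stray Blackwell target early is to inspect the architecture list before building. The sketch below is only an illustration: non_ampere_archs and ALLOWED_ARCHS are hypothetical names, and it handles only numeric entries such as 8.6 or 8.6+PTX, not named architectures.

```python
import os

# sm_80 and sm_86, matching the active scope; adjust if the scope changes.
ALLOWED_ARCHS = {"8.0", "8.6"}


def non_ampere_archs(arch_list):
    """Return entries of a TORCH_CUDA_ARCH_LIST-style string outside ALLOWED_ARCHS.

    Entries may be separated by ';' or spaces and may carry a '+PTX' suffix.
    Named architectures (for example "Ampere") are not interpreted here and
    would be reported as unexpected.
    """
    entries = arch_list.replace(" ", ";").split(";")
    archs = [entry.split("+")[0] for entry in entries if entry]
    return [arch for arch in archs if arch not in ALLOWED_ARCHS]


# Falls back to the Ampere list used in this README when the variable is unset.
unexpected = non_ampere_archs(os.environ.get("TORCH_CUDA_ARCH_LIST", "8.0;8.6"))
if unexpected:
    print("Non-Ampere targets requested:", ", ".join(unexpected))
```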
- NVIDIA Ampere GPU, with sm_86 as the primary target
- CUDA 12.x toolchain compatible with the local PyTorch build
- PyTorch extension build environment
Install Python requirements:
pip install -r requirements.txt
Install the extension in editable mode for Ampere:
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip install --no-build-isolation -e .
The active public names are expected to be:
import qutlass
packed_w = qutlass.pack_int4(w_int)
acts_i8, act_scale, act_zero = qutlass.quantize_int8(acts)
out_i4 = qutlass.matmul_int4_bf16_tn(acts_bf16, packed_w, scales, zeros)
out_i8 = qutlass.matmul_int8_bf16_tn(a_int8, b_int8, a_scale, b_scale)
Exact argument order and tensor layout requirements should be kept in sync with the bindings and tests as the migration lands.
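While the bindings settle, a pure-Python reference can serve as an oracle for the quantize-then-matmul flow. The sketch below is not the extension's implementation. It assumes per-tensor affine quantization, assumes that "TN" means both operands are stored K-major (A as M x K, B as N x K), and takes zero-points as extra arguments that the matmul_int8_bf16_tn call shown above does not have. All function names ending in _reference, and round_to_bf16, are hypothetical.

```python
import struct


def quantize_int8_reference(matrix):
    """Per-tensor affine quantization: q = clamp(round(x / scale) + zero, -128, 127)."""
    flat = [x for row in matrix for x in row]
    lo = min(min(flat), 0.0)  # include 0.0 so zero stays representable
    hi = max(max(flat), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero = round(-128 - lo / scale)
    quantized = [
        [max(-128, min(127, round(x / scale) + zero)) for x in row]
        for row in matrix
    ]
    return quantized, scale, zero


def round_to_bf16(x):
    """Round a finite float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the 16 dropped bits, nudged by the lowest kept bit for ties.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def matmul_int8_bf16_tn_reference(a_q, b_q, a_scale, b_scale, a_zero=0, b_zero=0):
    """out[m][n] = bf16(a_scale * b_scale * sum_k (a[m][k] - a_zero) * (b[n][k] - b_zero))."""
    out = []
    for a_row in a_q:
        out_row = []
        for b_row in b_q:
            # Exact integer accumulation (Python ints stand in for INT32 here).
            acc = sum((a - a_zero) * (b - b_zero) for a, b in zip(a_row, b_row))
            out_row.append(round_to_bf16(acc * a_scale * b_scale))
        out.append(out_row)
    return out


acts = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]   # M x K
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]   # N x K
a_q, a_scale, a_zero = quantize_int8_reference(acts)
b_q, b_scale, b_zero = quantize_int8_reference(weights)
out = matmul_int8_bf16_tn_reference(a_q, b_q, a_scale, b_scale, a_zero, b_zero)
```

Subtracting the zero-points inside the accumulation keeps the reference easy to read; a GPU kernel would likely handle them differently, so treat this only as a numerical oracle with a quantization-sized tolerance.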
Use focused INT tests and Ampere build flags while this migration is in progress:
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip install --no-build-isolation -e .
python -m pytest tests -q
CUDA-dependent tests should skip clearly when CUDA is unavailable. CPU-only import checks do not validate extension behavior.
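A skip guard along the following lines keeps the reason visible in the test report. This is a sketch rather than the repository's test code: cuda_is_available and TestAmpereDevice are hypothetical names, and the single test only checks the device's compute capability instead of calling any qutlass kernel. pytest collects unittest.TestCase classes, so the command above would pick it up.

```python
import unittest


def cuda_is_available():
    """Return True only if PyTorch imports and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


@unittest.skipUnless(
    cuda_is_available(),
    "CUDA not available: skipping Ampere INT4/INT8 kernel tests",
)
class TestAmpereDevice(unittest.TestCase):
    def test_device_is_ampere_or_newer(self):
        import torch

        # Compute capability (8, 0) corresponds to sm_80, (8, 6) to sm_86.
        capability = torch.cuda.get_device_capability()
        self.assertGreaterEqual(capability, (8, 0))
```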
The original QuTLASS README content below describes the Blackwell-focused MXFP/NVFP project that this checkout is being migrated away from. Treat this section as historical reference and source material for migration only. It is not the active project contract for new Ampere INT4/INT8 work.
QuTLASS was a CUTLASS-powered quantized BLAS library for low-bit deep learning on NVIDIA Blackwell GPUs.
It introduced narrow-precision microscaling routines tailored for quantized LLM inference and training on NVIDIA Blackwell GPUs.
The Blackwell architecture supports native matrix multiplication with microscaling, using scale factors in the form:
The scale factors are applied along the inner dimension (gs=32). For an
- FlashInfer backend support for B200 GPUs
- Quantization-aware training via MXFP types
- Quartet clipping mask computation integrated in quantization routines
- Prototype backward kernels for MXFP4 (sm_120) and MXFP8 (sm_100)
- CUTLASS MXFP8 backward GEMM kernels in TN and NN layouts
- Transformers QAT integration
- Nanochat-QAT integration
- Support for sm_100 GPUs, including NVIDIA B200
- NVFP4 microscaling with W4A4 quantization support
- Online rotations with fused transform, quantization, and scale computation
- Runtime-loaded rotation matrices
- CUTLASS-backed NVFP4:NVFP4 matmul with block-scale reordering
- Abs-max quantization
- Multiple rotation sizes for MXFP4 and NVFP4
- vLLM integration
- MXFP4 microscaling support
- Weight and activation quantization (W4A4)
- Online transforms, quantization, and scale computation
- Microscaling group-size-compatible transformations
- Quartet and abs-max quantization schemes
- CUTLASS-backed MXFP4:MXFP4 matmul with block-scale reordering
- Small-batch prototype MXFP4 kernel without reordering
- Transformers integration
Legacy MXFP4 correctness tests were run with:
python tests/mxfp4_test.py
Legacy MXFP4 benchmarks were run with:
python benchmarks/bench_mxfp4.py
The legacy fused quantization kernel was exposed as qutlass.fusedQuantizeMx(a, h, method), where method was Literal["quest", "abs_max"]. It returned FP4 (e2m1) quantized data and FP8 (e8m0) scaling factors.
The legacy matmul path used qutlass.matmul_mxf4_bf16_tn(aq, bq, a_sf, b_sf, alpha) with scale factors rearranged by qutlass.to_blocked into the cuBLAS block-scaled swizzle format. A custom prototype path, qutlass.matmul_ada_mxf4_bf16_tn(...), avoided that reordering for small batch sizes. NVFP4 was treated as functionally equivalent aside from naming.
The original benchmark figures and claims measured MXFP4/NVFP4 behavior on Blackwell and RTX 5090-class targets. Keep these assets as historical references while replacing benchmark labels and commands with INT4/INT8 equivalents.
In the original README, microbenchmarks showed MXFP4 performance across batch sizes, including ideal matrix multiplication and full-pipeline measurements with Hadamard rotation, data quantization, scale computation, and block-scale reordering.
Original end-to-end inference notes described MXFP4 speedups over PyTorch BF16 in Transformers for 8B and 14B models. Original training notes described MXFP4:MXFP8 QAT on Llama-3.1-8B.
For historical recipes related to MXFP and NVFP formats, the original project referenced FP-Quant, nanochat-qat, and related Transformers integrations.
@misc{qutlass2025,
title={QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning},
author={Roberto L. Castro and Dan Alistarh},
year={2025},
publisher = {GitHub},
howpublished = {\url{https://github.com/IST-DASLab/qutlass}},
}