QEP is a meta-algorithm that improves any layer-wise quantization method by compensating for the error that propagates from previously quantized layers to subsequent ones.

!!! abstract "Reference"
    Yamato Arai and Yuma Ichikawa, "Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization," NeurIPS 2025. OpenReview | Original implementation

Standard layer-wise PTQ quantizes each layer independently using the original input activations. However, after quantizing layer \(l\), the input to layer \(l+1\) is no longer the original activation; it is the output of the quantized layer \(l\), which contains quantization error. This accumulated error degrades quantization quality, especially at low bit-widths.
QEP addresses this by adjusting the weights of each layer before quantization to account for the activation error introduced by previously quantized layers.
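
The accumulation described above can be seen in a toy setting. The sketch below is an illustration only, not OneComp code: it assumes a stack of purely linear layers and a simple round-to-nearest quantizer (`quantize_rtn` is a hypothetical helper), and it measures how far the input of each layer drifts from the full-precision input.

```python
import numpy as np

def quantize_rtn(w, bits=3):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16))  # calibration inputs, one sample per row
weights = [rng.standard_normal((16, 16)) / 4 for _ in range(4)]

x_fp, x_q = x, x  # activations in the original and in the quantized model
input_errors = []
for w in weights:
    # Relative mismatch between the inputs this layer sees in the two models.
    input_errors.append(np.linalg.norm(x_fp - x_q) / np.linalg.norm(x_fp))
    x_fp = x_fp @ w.T
    x_q = x_q @ quantize_rtn(w).T

print([round(float(e), 3) for e in input_errors])
```

Only the first layer sees exactly the original input; every later layer receives an input that already carries error from the quantized layers before it.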
For a layer with weight \(W\) (shape \(d_{\text{out}} \times d_{\text{in}}\)), original input activations \(X\), and quantized-model input activations \(\hat{X}\) (one sample per row):

- Compute the activation difference: \(\Delta = X - \hat{X}\)
- Compute the cross-term: \(\Delta^T \hat{X}\)
- Solve for a weight correction \(\Delta W\) via the Hessian:

    \[ \Delta W = \alpha \cdot W \, (\Delta^T \hat{X}) \, H^{-1} \]

    where \(H = \hat{X}^T \hat{X}\) is the Hessian matrix and \(\alpha\) is the correction strength (`perccorr`). The factor \(W\) gives \(\Delta W\) the same shape as \(W\), so the two can be added.

- Quantize the adjusted weight \(W + \Delta W\) using the base quantizer (e.g., GPTQ).
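
The first three steps can be sketched in NumPy. This is a minimal illustration under stated assumptions, not OneComp's implementation: `qep_correct` is a hypothetical name, rows of `X` and `X_hat` are samples, the cross-term is left-multiplied by `W` so that the correction has the shape of `W`, and `percdamp` is assumed to act as GPTQ-style damping (a fraction of the mean diagonal added to \(H\)). The final quantization step is left to the base quantizer and is omitted.

```python
import numpy as np

def qep_correct(W, X, X_hat, perccorr=0.5, percdamp=0.01):
    """Return W + dW for one linear layer; W has shape (d_out, d_in)."""
    delta = X - X_hat        # activation difference
    cross = delta.T @ X_hat  # cross-term, shape (d_in, d_in)
    H = X_hat.T @ X_hat      # Hessian, shape (d_in, d_in)
    # Assumed damping rule: add percdamp * mean(diag(H)) to the diagonal.
    H = H + percdamp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    # dW = perccorr * (W @ cross) @ H^{-1}. H is symmetric, so solving
    # H Z = (W @ cross)^T and transposing avoids forming the inverse.
    dW = perccorr * np.linalg.solve(H, (W @ cross).T).T
    return W + dW

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))                 # (d_out, d_in)
X = rng.standard_normal((256, 16))               # original inputs
X_hat = X + 0.1 * rng.standard_normal(X.shape)   # inputs in the quantized model

target = X @ W.T  # what the original model would output
err_plain = np.linalg.norm(target - X_hat @ W.T)
err_qep = np.linalg.norm(target - X_hat @ qep_correct(W, X, X_hat).T)
print(err_plain, err_qep)
```

With the corrected weight, the layer's output on the perturbed input \(\hat{X}\) should land closer to the original output than with the uncorrected weight, which is the mismatch QEP aims to reduce before the base quantizer runs.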
OneComp provides two QEP implementations, selected by the `QEPConfig.general` parameter.

Architecture-aware (`general=False`, the default):

- Exploits the structure of transformer blocks (e.g., QKV layers sharing the same input)
- Groups layers that share input activations for efficient Hessian computation
- Processes one transformer block at a time to minimize GPU memory usage
- Recommended for Llama-like architectures

Generic (`general=True`):

- Architecture-independent implementation
- Captures input activations for each layer individually
- Works with any model architecture
- Higher memory consumption and more forward passes than the architecture-aware implementation
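
To illustrate why grouping helps, the sketch below groups the linear layers of one Llama-style decoder block by the tensor that feeds them. The `layer_inputs` map and its source labels are hypothetical and written by hand for illustration; how OneComp actually discovers shared inputs is not shown here.

```python
from collections import defaultdict

# Hypothetical map: linear layer name -> label of the tensor that feeds it.
layer_inputs = {
    "self_attn.q_proj": "input_layernorm",
    "self_attn.k_proj": "input_layernorm",
    "self_attn.v_proj": "input_layernorm",
    "self_attn.o_proj": "attention_output",
    "mlp.gate_proj": "post_attention_layernorm",
    "mlp.up_proj": "post_attention_layernorm",
    "mlp.down_proj": "mlp_activation",
}

groups = defaultdict(list)
for layer, source in layer_inputs.items():
    groups[source].append(layer)

# Layers in the same group share input activations, so a single Hessian
# can serve the whole group instead of one per layer.
print(len(layer_inputs), "layers ->", len(groups), "Hessians")
```

In this toy block, seven layers need only four Hessians, because the Q, K, and V projections share one input and the gate and up projections share another.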
Basic usage with the default QEP settings:

```python
from onecomp import ModelConfig, Runner, GPTQ

model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")
gptq = GPTQ(wbits=3)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
)
runner.run()
```

To customize QEP, pass a `QEPConfig`:

```python
from onecomp import QEPConfig

# Reuses model_config and gptq from the example above.
qep_config = QEPConfig(
    general=False,     # Architecture-aware (default)
    percdamp=0.01,     # Hessian damping
    perccorr=0.5,      # Correction strength
    device="cuda:0",   # GPU for QEP computation
    exclude_layer_keywords=["mlp.down_proj"],
)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()
```

To use the generic implementation, set `general=True`:

```python
qep_config = QEPConfig(general=True)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()
```

| Parameter | Type | Description | Default |
|---|---|---|---|
| `general` | `bool` | Use generic (architecture-independent) QEP | `False` |
| `percdamp` | `float` | Damping percentage for Hessian regularization | `0.01` |
| `perccorr` | `float` | Correction strength (0 = no correction, 1 = full) | `0.5` |
| `device` | `str` | GPU device for QEP computation | `"cuda:0"` |
| `exclude_layer_keywords` | `list[str]` | Layer keywords excluded from error propagation | `["mlp.down_proj"]` |

!!! note
    The default `exclude_layer_keywords` is designed for Llama-like architectures. You may need
    to adjust this for other model families.
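
When adapting `exclude_layer_keywords` to another model family, it can help to check which layer names a keyword list would catch. The sketch below assumes plain substring matching against fully qualified layer names; that rule and the `is_excluded` helper are assumptions for illustration, and OneComp's actual matching may differ.

```python
def is_excluded(layer_name, exclude_layer_keywords):
    """True if any keyword occurs in the layer name (assumed matching rule)."""
    return any(keyword in layer_name for keyword in exclude_layer_keywords)

# Example names in the style of a Llama checkpoint.
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]
keywords = ["mlp.down_proj"]

corrected = [name for name in layer_names if not is_excluded(name, keywords)]
print(corrected)
```

For a real model, the candidate names can be listed with PyTorch's `named_modules()` and passed through the same check before choosing keywords.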