QEP (Quantization Error Propagation)

QEP is a meta-algorithm that improves any layer-wise quantization method by compensating for the error that propagates from previously quantized layers to subsequent ones.

!!! abstract "Reference"
    Yamato Arai and Yuma Ichikawa, "Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization," NeurIPS 2025. OpenReview | Original implementation

Motivation

Standard layer-wise post-training quantization (PTQ) quantizes each layer independently using the original input activations. However, after quantizing layer \(l\), the input to layer \(l+1\) is no longer the original activation: it is the output of the quantized layer \(l\), which contains quantization error. This error accumulates across layers and degrades quantization quality, especially at low bit-widths.

How QEP Works

QEP addresses this by adjusting the weights of each layer before quantization to account for the activation error introduced by previously quantized layers.

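For a layer with weight \(W\), original input activations \(X\), and quantized-model input activations \(\hat{X}\) (one row per calibration sample):

1. Compute the activation difference: \(\Delta = X - \hat{X}\)
2. Compute the cross-term: \(\Delta^T \hat{X}\)
3. Solve for a weight correction \(\Delta W\) via the Hessian:

    \[ \Delta W = \alpha \cdot W \cdot (\Delta^T \hat{X}) \cdot H^{-1} \]

    where \(H = \hat{X}^T \hat{X}\) is the Hessian matrix and \(\alpha\) is the correction strength (`perccorr`). The factor \(W\) gives \(\Delta W\) the same shape as \(W\); with \(\alpha = 1\), \(W + \Delta W\) is the least-squares solution of \(\min_{W'} \lVert X W^T - \hat{X} W'^T \rVert_F^2\).

4. Quantize the adjusted weight \(W + \Delta W\) using the base quantizer (e.g., GPTQ).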
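The correction step can be sketched in NumPy as follows. This is a minimal illustration, not OneComp's implementation: the function name `qep_adjust_weight` is hypothetical, activations are assumed to hold one row per calibration sample, the cross-term is left-multiplied by the weight so that the correction has the same shape as the weight, and the damping rule (adding `percdamp` times the mean Hessian diagonal) is borrowed from GPTQ as an assumption.

```python
import numpy as np


def qep_adjust_weight(W, X, X_hat, perccorr=0.5, percdamp=0.01):
    """Return W + dW with dW = perccorr * W @ (Delta^T X_hat) @ H^{-1}.

    W:     (d_out, d_in) weight, applied as y = x @ W.T
    X:     (n, d_in) inputs seen by the original model
    X_hat: (n, d_in) inputs seen by the partially quantized model
    """
    delta = X - X_hat              # step 1: activation difference
    cross = delta.T @ X_hat        # step 2: cross-term, shape (d_in, d_in)
    H = X_hat.T @ X_hat            # Hessian of the layer-wise objective
    # Assumed GPTQ-style damping: add percdamp * mean(diag(H)) to the diagonal.
    H = H + percdamp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    # Step 3: solve dW @ H = W @ cross instead of forming H^{-1} (H is symmetric).
    dW = perccorr * np.linalg.solve(H, (W @ cross).T).T
    return W + dW


rng = np.random.default_rng(0)
n, d_in, d_out = 256, 8, 4
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n, d_in))
X_hat = X + 0.1 * rng.normal(size=(n, d_in))  # simulated upstream quantization error

target = X @ W.T  # output of the original layer on the original input
err_plain = np.linalg.norm(target - X_hat @ W.T)
W_adj = qep_adjust_weight(W, X, X_hat, perccorr=1.0, percdamp=0.0)
err_qep = np.linalg.norm(target - X_hat @ W_adj.T)
print(f"output error without correction: {err_plain:.4f}")
print(f"output error with full correction: {err_qep:.4f}")
```

In this sketch the adjusted weight would then be handed to the base quantizer (step 4). Setting `perccorr=0.0` returns the weight unchanged, and values between 0 and 1 move only part of the way toward the least-squares solution.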

Two Implementations

OneComp provides two QEP implementations, controlled by the `QEPConfig.general` parameter:

Architecture-aware (default, `general=False`)

- Exploits the structure of transformer blocks (e.g., QKV layers sharing the same input)
- Groups layers that share input activations for efficient Hessian computation
- Processes one transformer block at a time to minimize GPU memory usage
- Recommended for Llama-like architectures

Generic (`general=True`)

- Architecture-independent implementation
- Captures input activations for each layer individually
- Works with any model architecture
- Higher memory consumption and more forward passes
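The difference between the two modes can be illustrated with a small NumPy sketch. It is not OneComp's code: the `layer_inputs` mapping, the input names `attn_in` and `mlp_in`, and the `hessian` helper are hypothetical and exist only to show that the Hessian depends on the input alone, so layers sharing an input can share one Hessian.

```python
import numpy as np


def hessian(X_hat):
    """Hessian of the layer-wise objective; it depends only on the layer input."""
    return X_hat.T @ X_hat


# Hypothetical mapping from Llama-style layer names to the tensor they consume.
layer_inputs = {
    "self_attn.q_proj": "attn_in",
    "self_attn.k_proj": "attn_in",
    "self_attn.v_proj": "attn_in",
    "mlp.gate_proj": "mlp_in",
    "mlp.up_proj": "mlp_in",
}

rng = np.random.default_rng(0)
activations = {
    "attn_in": rng.normal(size=(64, 8)),
    "mlp_in": rng.normal(size=(64, 8)),
}

# Generic style: capture the input and build a Hessian for every layer separately.
per_layer = {name: hessian(activations[src]) for name, src in layer_inputs.items()}

# Architecture-aware style: build one Hessian per distinct input and share it.
per_group = {src: hessian(x) for src, x in activations.items()}
shared = {name: per_group[src] for name, src in layer_inputs.items()}

print(f"Hessians computed per layer: {len(per_layer)}")
print(f"Hessians computed per group: {len(per_group)}")
```

Both approaches yield identical Hessians for every layer; the grouped version simply does less work, which is the saving the architecture-aware mode relies on.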

Usage

Basic QEP

from onecomp import ModelConfig, Runner, GPTQ

model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")
gptq = GPTQ(wbits=3)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
)
runner.run()

Custom QEP Configuration

from onecomp import QEPConfig, Runner

# Reuses model_config and gptq from the Basic QEP example.

qep_config = QEPConfig(
    general=False,              # Architecture-aware (default)
    percdamp=0.01,              # Hessian damping
    perccorr=0.5,               # Correction strength
    device="cuda:0",            # GPU for QEP computation
    exclude_layer_keywords=["mlp.down_proj"],
)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()

Generic QEP (for non-Llama architectures)

from onecomp import QEPConfig, Runner

# Reuses model_config and gptq from the Basic QEP example.
qep_config = QEPConfig(general=True)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()

Parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `general` | `bool` | Use generic (architecture-independent) QEP | `False` |
| `percdamp` | `float` | Damping percentage for Hessian regularization | `0.01` |
| `perccorr` | `float` | Correction strength (0 = no correction, 1 = full) | `0.5` |
| `device` | `str` | GPU device for QEP computation | `"cuda:0"` |
| `exclude_layer_keywords` | `list[str]` | Layer keywords excluded from error propagation | `["mlp.down_proj"]` |

!!! note
    The default `exclude_layer_keywords` is designed for Llama-like architectures. You may need to adjust this for other model families.
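To check which layers a given keyword list would skip in another model family, a sketch like the one below can help. It assumes that a layer is excluded when any keyword occurs as a substring of its name; that matching rule, the `is_excluded` helper, and the example layer names are assumptions for illustration and are not taken from OneComp.

```python
def is_excluded(layer_name, exclude_layer_keywords):
    """Assumed rule: exclude a layer if any keyword is a substring of its name."""
    return any(keyword in layer_name for keyword in exclude_layer_keywords)


# Example Llama-style layer names (illustrative only).
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

exclude_layer_keywords = ["mlp.down_proj"]
corrected = [n for n in layer_names if not is_excluded(n, exclude_layer_keywords)]
skipped = [n for n in layer_names if is_excluded(n, exclude_layer_keywords)]
print("error propagation applied to:", corrected)
print("error propagation skipped for:", skipped)
```

For a model whose down-projection layer uses a different name, the default keyword would match nothing under this rule, which is why the list may need adjusting.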
