x86: runtime AVX512-VNNI tier for the int8 DNN GEMV (+ multi-accumulator cgemv) by czoli1976 · Pull Request #484 · xiph/opus

czoli1976 · 2026-06-16T19:59:39Z

Problem

The VNNI single-instruction int8 dot product (_mm256_dpbusds_epi32 → vpdpbusds) already exists in dnn/vec_avx.h, but it is compile-time gated behind #if defined(__AVXVNNI__) || defined(__AVX512VNNI__) with no runtime-dispatched tier. A normal distro build (-mavx2) running on a VNNI CPU therefore executes the 3-instruction AVX2 emulation (vpmaddubsw+vpmaddwd+vpaddd), never vpdpbusds. This is the x86 mirror of the Arm DOTPROD tier.
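To make the difference between the two paths concrete, here is a minimal scalar sketch of a single 32-bit lane, assuming the usual instruction semantics (pair sums saturated to int16 by vpmaddubsw, one final int32 saturation in vpdpbusds). The function names are illustrative and are not taken from dnn/vec_avx.h.

```c
#include <stdint.h>

/* Clamp to the int16 range, as vpmaddubsw does for each pair sum. */
static int32_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return v;
}

/* Clamp to the int32 range, as the saturating VNNI form does. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* One 32-bit lane of vpdpbusds: acc + four u8*s8 products, saturated once. */
static int32_t lane_vnni(int32_t acc, const uint8_t a[4], const int8_t b[4])
{
    int64_t sum = acc;
    for (int i = 0; i < 4; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sat32(sum);
}

/* The same lane via vpmaddubsw + vpmaddwd (by 1) + vpaddd. */
static int32_t lane_emulated(int32_t acc, const uint8_t a[4], const int8_t b[4])
{
    int32_t p01 = sat16((int32_t)a[0] * b[0] + (int32_t)a[1] * b[1]);
    int32_t p23 = sat16((int32_t)a[2] * b[2] + (int32_t)a[3] * b[3]);
    /* vpaddd wraps; add in unsigned to avoid signed overflow in C. */
    return (int32_t)((uint32_t)acc + (uint32_t)(p01 + p23));
}
```

For moderate inputs both functions agree. They only part ways when a pair sum leaves the int16 range (for example two products of 255 × 127) or when the running total leaves the int32 range.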

What this PR does

1. New RTCD tier above AVX2 (index 5) for the int8 GEMV (compute_linear with quantized weights):

CPU detection in celt/x86/x86cpu.c (CPUID.(7,0) EBX[16]=AVX512F, EBX[31]=AVX512VL, ECX[11]=AVX512_VNNI), plus an OSXSAVE/XGETBV check for OS AVX-512 state.
dnn/x86/nnet_avx512vnni.c (RTCD_ARCH avx512vnni), compute_linear_avx512vnni, MAY_HAVE_AVX512VNNI, dispatch in dnn/x86/dnn_x86.h + x86_dnn_map.c. Only compute_linear uses vpdpbusd; activation/conv2d reuse the AVX2 variants at index 5.
Index 5 filled in every x86 _IMPL table (celt/silk) so it is never NULL on a VNNI CPU.
Build wiring for all three systems (autotools, meson, cmake).
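The detection rule listed above can be sketched as a pure function over the raw register values. This is an illustration under stated assumptions, not the code in celt/x86/x86cpu.c: the struct and function names are hypothetical, and the XCR0 mask 0xE6 (SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM state) is the conventional AVX-512 OS-support check rather than something the PR text spells out.

```c
#include <stdint.h>

/* Hypothetical container for the CPUID/XGETBV values the check needs. */
struct x86_regs {
    uint32_t leaf1_ecx;  /* CPUID.(EAX=1):ECX */
    uint32_t leaf7_ebx;  /* CPUID.(EAX=7,ECX=0):EBX */
    uint32_t leaf7_ecx;  /* CPUID.(EAX=7,ECX=0):ECX */
    uint64_t xcr0;       /* XGETBV with ECX=0; meaningful only if OSXSAVE is set */
};

static int bit(uint64_t v, int n)
{
    return (int)((v >> n) & 1u);
}

/* Returns 1 only if the CPU and the OS both support the AVX512-VNNI tier. */
static int has_avx512vnni(const struct x86_regs *r)
{
    /* OSXSAVE (CPUID.1:ECX[27]) says XGETBV is usable and XCR0 is valid. */
    if (!bit(r->leaf1_ecx, 27))
        return 0;
    /* XCR0 bits 1,2 (SSE, AVX) and 5,6,7 (opmask, ZMM_Hi256, Hi16_ZMM). */
    if ((r->xcr0 & 0xE6u) != 0xE6u)
        return 0;
    /* EBX[16]=AVX512F, EBX[31]=AVX512VL, ECX[11]=AVX512_VNNI. */
    return bit(r->leaf7_ebx, 16) && bit(r->leaf7_ebx, 31) && bit(r->leaf7_ecx, 11);
}
```

With GCC or Clang the CPUID values could be filled in through `__get_cpuid_count` from `<cpuid.h>`. Keeping the decision logic separate from the register reads makes it testable on any host.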

The target CPU here has AVX512-VNNI but not the VEX-encoded AVX-VNNI (a -mavxvnni kernel is an illegal instruction on it). Because opus's RTCD ladder is strictly sequential (and FUZZING randomly downgrades within it), a tier can only carry one encoding, so this tier targets AVX512-VNNI: EVEX-encoded vpdpbusds, kept 256-bit wide (-mprefer-vector-width=256 — the win is the instruction, not wider vectors).
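The sequential ladder can be pictured as one function-pointer table per kernel, indexed by arch level. The sketch below is hypothetical (enum names, table names and the stub kernels are invented for illustration and do not reproduce the opus tables). It shows why every index must be populated: a random downgrade may land on any level at or below the detected one, and a kernel without a VNNI variant simply repeats its AVX2 entry at index 5.

```c
#include <stddef.h>

/* Hypothetical arch levels; index 5 is the new tier. */
enum { ARCH_C, ARCH_SSE, ARCH_SSE2, ARCH_SSE4_1, ARCH_AVX2, ARCH_AVX512VNNI, ARCH_COUNT };

/* Stub kernels that just report which tier ran. */
static int linear_c(void)          { return ARCH_C; }
static int linear_sse2(void)       { return ARCH_SSE2; }
static int linear_sse4_1(void)     { return ARCH_SSE4_1; }
static int linear_avx2(void)       { return ARCH_AVX2; }
static int linear_avx512vnni(void) { return ARCH_AVX512VNNI; }

static int activation_c(void)      { return ARCH_C; }
static int activation_sse2(void)   { return ARCH_SSE2; }
static int activation_sse4_1(void) { return ARCH_SSE4_1; }
static int activation_avx2(void)   { return ARCH_AVX2; }

/* The GEMV gets a dedicated entry at index 5. */
static int (*const LINEAR_IMPL[ARCH_COUNT])(void) = {
    linear_c, linear_c, linear_sse2, linear_sse4_1, linear_avx2, linear_avx512vnni
};

/* Kernels without a VNNI variant reuse AVX2 at index 5, so it is never NULL. */
static int (*const ACTIVATION_IMPL[ARCH_COUNT])(void) = {
    activation_c, activation_c, activation_sse2, activation_sse4_1,
    activation_avx2, activation_avx2
};

static int dispatch_linear(int arch)     { return LINEAR_IMPL[arch](); }
static int dispatch_activation(int arch) { return ACTIVATION_IMPL[arch](); }
```

Because a table holds exactly one pointer per level, a single level cannot serve both the EVEX and the VEX encoding of the instruction, which is the constraint the paragraph above describes.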

2. Multi-accumulator cgemv8x4 / sparse_cgemv8x4 (the change that makes the tier worthwhile): the unroll-by-4 fed all four vpdpbusds into a single accumulator, serializing them on that register. Since EVEX vpdpbusds has ~5-cycle latency, that recurrence made the naive VNNI tier slower than the AVX2 emulation (which keeps its multiplies off the accumulator's critical path). Splitting across four independent accumulators fixes it and also speeds up the existing AVX2/SSE paths a few percent.
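The restructuring can be illustrated with a scalar model of one output row. This is a sketch only: it stands in for the SIMD kernels rather than reproducing them, the function names are invented, it assumes the column count is a multiple of 16, and it uses wrapping arithmetic (the no-saturation case).

```c
#include <stdint.h>
#include <stddef.h>

/* Wrapping sum of four u8*s8 products added to acc (one dot-product step). */
static uint32_t dot4(uint32_t acc, const uint8_t *x, const int8_t *w)
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)((int32_t)x[i] * (int32_t)w[i]);
    return acc;
}

/* Unroll-by-4 into ONE accumulator: each step waits for the previous one. */
static int32_t row_single_acc(const uint8_t *x, const int8_t *w, size_t n)
{
    uint32_t acc = 0;
    for (size_t j = 0; j < n; j += 16) {
        acc = dot4(acc, x + j,      w + j);
        acc = dot4(acc, x + j + 4,  w + j + 4);
        acc = dot4(acc, x + j + 8,  w + j + 8);
        acc = dot4(acc, x + j + 12, w + j + 12);
    }
    return (int32_t)acc;
}

/* Unroll-by-4 into FOUR accumulators: the four chains are independent. */
static int32_t row_multi_acc(const uint8_t *x, const int8_t *w, size_t n)
{
    uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (size_t j = 0; j < n; j += 16) {
        a0 = dot4(a0, x + j,      w + j);
        a1 = dot4(a1, x + j + 4,  w + j + 4);
        a2 = dot4(a2, x + j + 8,  w + j + 8);
        a3 = dot4(a3, x + j + 12, w + j + 12);
    }
    /* Merge once at the end instead of on every step. */
    return (int32_t)(a0 + a1 + a2 + a3);
}
```

In the single-accumulator form the loop-carried dependency is four steps long per iteration. In the multi-accumulator form it is one step long per chain, so with a ~5-cycle instruction latency several dot products can be in flight at once.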

Bit-exactness

Verified byte-identical DRED encode+decode output between the VNNI tier, the AVX2 tier, and the pre-restructure code on real model weights. The AVX2/SSE emulation accumulates with wrapping 32-bit adds (exactly associative); on VNNI the regrouping is exact whenever the per-output sum stays within int32, which holds for the quantized weights the models use. meson test (with -Dcheck-asm=true -Dassertions=true) passes; the regroup only diverges in a synthetic full-int8-range stress test, where VNNI actually matches the exact C reference better than the saturating AVX2 emulation.
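The associativity argument can be checked with a small scalar experiment. The sketch below uses invented helper names and deliberately synthetic partial sums (far larger than any real int8 dot product) to compare a single running accumulator with two interleaved accumulators under wrapping and under saturating addition.

```c
#include <stdint.h>

typedef int32_t (*add_fn)(int32_t, int32_t);

/* Wrapping 32-bit add (vpaddd-style), done in unsigned to stay defined in C. */
static int32_t wrap_add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Saturating 32-bit add (the behaviour of the saturating VNNI form). */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* One accumulator, strictly in order. */
static int32_t sum_single(add_fn add, const int32_t *p, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = add(acc, p[i]);
    return acc;
}

/* Two interleaved accumulators merged at the end (n assumed even). */
static int32_t sum_two_acc(add_fn add, const int32_t *p, int n)
{
    int32_t a0 = 0, a1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        a0 = add(a0, p[i]);
        a1 = add(a1, p[i + 1]);
    }
    return add(a0, a1);
}
```

With wrapping adds both groupings always agree. With saturating adds they agree as long as no intermediate value clips, and a sequence such as `{INT32_MAX, 10, -10, 0}` is enough to make them differ.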

Performance — honest scope

Measured on an AVX512-VNNI Xeon (compute_linear int8 GEMV, best of 5):

shape	AVX2 (restructured)	VNNI	naive VNNI (1 acc)
512×1536 (DRED GRU)	18.0 µs	17.0 µs	24.8 µs
1024×1024	25.3 µs	23.1 µs	41.0 µs

On this µarch the VNNI tier is ~1.0–1.09× vs AVX2 — a small win, not the ~2–3× the Arm DOTPROD analogy suggested. Reasons: this batch-1 recurrent GEMV is partly load-bound, and on this µarch vpdpbusds throughput ≈ the AVX2 emulation's once latency is hidden. End-to-end the DNN is a fraction of a frame, so wall-clock is within noise on typical clips. It may show larger gains on Ice Lake / Sapphire Rapids (higher VNNI throughput, no AVX-512 downclock) or on VEX AVX-VNNI parts (Alder/Raptor Lake), which weren't available to test. Frequency downclock was measured and ruled out on this VM.

Scope: only helps DNN-enabled builds (DRED/OSCE/Deep-PLC, off by default) on VNNI CPUs. The multi-accumulator restructure benefits all x86 SIMD tiers regardless of VNNI.

Verification

opus_select_arch() → 5; DNN_COMPUTE_LINEAR_IMPL[5] → compute_linear_avx512vnni.
objdump: tier object emits EVEX vpdpbusds (no zmm); AVX2 object still has vpmaddubsw and no vpdpbusd.
Builds + DNN tests pass under meson, autotools, and cmake.

Related PRs (same contribution batch)

silk: Arm NEON for the float SILK analysis path (inner_product, warped autocorrelation, energy) #481 — silk/float: Arm Neon tier (inner_product, warped_autocorrelation, energy)
cmake: build the Arm DOTPROD DNN kernels (vdotq_s32) #482 — cmake: wire ARM DOTPROD DNN tier (build-only fix)
silk: Add Arm NEON silk_VAD_GetSA_Q8 #483 — silk: Arm Neon VAD_GetSA_Q8

Add an RTCD arch level above AVX2 (index 5) so CPUs with AVX512-VNNI run the EVEX-encoded 256-bit vpdpbusd int8 dot product in the DNN int8 GEMV (compute_linear with quantized weights) instead of the AVX2 vpmaddubsw+vpmaddwd+vpaddd emulation. Mirrors the Arm DOTPROD tier.

This box has AVX512-VNNI but not the VEX-encoded AVX-VNNI, and opus's strictly-sequential RTCD ladder (plus the FUZZING downgrade) can only carry one encoding per tier, so the tier targets AVX512-VNNI: compiled -mavx512vnni -mavx512vl (kept 256-bit via -mprefer-vector-width=256, the win is the instruction not wider vectors) and gated on AVX512F+AVX512VL+AVX512_VNNI with an OSXSAVE/XGETBV check for OS AVX-512 state. Detection: CPUID.(7,0) EBX[16]/EBX[31]/ECX[11].

- celt/x86/x86cpu.{c,h}: detect AVX512-VNNI, MAY_HAVE_AVX512VNNI macro, new arch level in opus_select_arch.
- dnn/x86: nnet_avx512vnni.c (RTCD_ARCH avx512vnni), declare/dispatch compute_linear_avx512vnni; only compute_linear uses vpdpbusd, so activation/conv2d reuse the AVX2 variants at index 5.
- Fill index 5 in every x86 _IMPL table (celt/silk) so it is never NULL.
- Wire detection + per-source flags in autotools, meson and cmake.

Verified on this CPU: opus_select_arch()=5, DNN_COMPUTE_LINEAR_IMPL[5] dispatches to compute_linear_avx512vnni, object emits EVEX vpdpbusds (no zmm), all three build systems compile it, and DRED encode+decode output is byte-identical to the AVX2 tier (bit-exact).

https://claude.ai/code/session_01H3ZadW9kpYqiMEGxhtpV2j

The unroll-by-4 in cgemv8x4 / sparse_cgemv8x4 fed all four vpdpbusds into a single accumulator, serializing them on that register. On the VNNI tier the EVEX vpdpbusds has ~5-cycle latency, so this single-accumulator recurrence made the VNNI kernel latency-bound and actually slower than the AVX2 vpmaddubsw+vpmaddwd+vpaddd emulation (which keeps its multiplies off the accumulator's critical path). Split the unrolled body across four independent accumulators and sum them once per 8-row block, keeping several int8 dot products in flight.

Bit-exact: the AVX2/SSE emulation accumulates with wrapping 32-bit adds (exactly associative), and on VNNI the regrouping is exact whenever the per-output sum stays within int32 - which holds for the quantized weights used by the models. DRED encode+decode output is byte-identical before and after, and the check-asm + assertions test suite passes.

Measured on an AVX512-VNNI CPU (compute_linear int8 GEMV, best of 5): the VNNI tier goes from ~0.8x (a regression) to ~1.0-1.09x vs AVX2, e.g. 512x1536 24.8->17.0 us and 1024x1024 41.0->22.9 us; the AVX2 tier also speeds up a few percent. End-to-end the DNN is a fraction of the frame, so the wall-clock change is within noise on typical clips.

https://claude.ai/code/session_01H3ZadW9kpYqiMEGxhtpV2j

jmvalin · 2026-06-16T20:40:51Z

Looks like AI slop for admittedly no real performance gain

czoli1976 · 2026-06-16T20:46:40Z

Very small indeed. Best regards, Ckristian Zoli


czoli1976 · 2026-06-17T07:28:22Z

Fair, @jmvalin

A bit of context: this was tested in a container on Intel Cascade Lake (CL), which is old and notoriously suffers from downclocking when using AVX. It is hard to get an AVX gain on such silicon, as it is usually slower, but my kernel is 1.0–1.09×, which is already a win on this generation. A later generation, e.g. Ice Lake-SP or Sapphire Rapids, would not have the downclock issue, and the same goes for a modern desktop CPU like Alder Lake. Moreover, newer silicon has more lanes and a bigger L1 to bank more.

However, given that the kernel, especially on CL, is load-port-bound (rather than compute-bound), I now have a version with tiling: it tiles several 8-row blocks together and broadcasts each activation only once.
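The tiling idea can be sketched in scalar form. This is an illustration, not the tiled kernel itself: the function name, the row-major weight layout and the per-column broadcast granularity are assumptions, and `rows` must be a multiple of `rows_per_pass`. The function returns how many activation loads ("broadcasts") it performed so that the saving is visible.

```c
#include <stdint.h>

/* GEMV y = W*x processed rows_per_pass rows at a time.
 * rows_per_pass = 8 models one 8-row block per pass (untiled);
 * rows_per_pass = 16 or more models several 8-row blocks tiled together.
 * Returns the number of activation loads performed. */
static long gemv_blocked(int32_t *y, const int8_t *w, const uint8_t *x,
                         int rows, int cols, int rows_per_pass)
{
    long broadcasts = 0;
    for (int r0 = 0; r0 < rows; r0 += rows_per_pass) {
        for (int r = r0; r < r0 + rows_per_pass; r++)
            y[r] = 0;
        for (int j = 0; j < cols; j++) {
            /* Load the activation once, then reuse it for every row in the pass. */
            int32_t xj = x[j];
            broadcasts++;
            for (int r = r0; r < r0 + rows_per_pass; r++)
                y[r] += (int32_t)w[r * cols + j] * xj;
        }
    }
    return broadcasts;
}
```

Doubling the number of rows handled per pass halves the activation loads while the weight traffic and the results stay the same, which is the kind of relief a load-port-bound kernel can benefit from.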

This is on the same Cascade Lake container.

Again, expect more on newer silicon.

I understand if you think 2% E2E is small, but given how SIMD-optimised the code already is, 2% sounds quite significant to me, especially considering that this micro-optimisation targets ~5% of encode (~15% of the DNN).

Would you like to see and bench the improved version? Especially if you have an Alder Lake or newer desktop.

PS: the tiling benefits the AVX path as well, and E2E it translates to 2%.

czoli1976 · 2026-06-17T08:24:25Z

And here are the results for a DRED decode sweep over packet loss in 5% steps.

This is still on Cascade Lake, so newer silicon will be better.

czoli1976 · 2026-06-17T10:21:29Z

Now benched on Emerald Rapids (EMR) as well.

If you are puzzled as to why EMR seems slower...

The left panel is VNNI ÷ AVX2 (pure compute throughput — EMR's 2 VNNI units win).
The right panel is (tiled+VNNI) ÷ (untiled main), and that e2e number is governed by how big a slice of the frame the GEMV is, by Amdahl:
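As a minimal sketch of that relation, Amdahl's law in its standard form gives the end-to-end speedup as S = 1 / ((1 - f) + f / s), where f is the fraction of frame time spent in the GEMV and s is the kernel speedup. The symbol names and the helper below are illustrative and do not come from the PR.

```c
/* Amdahl's law: overall speedup when a fraction f of the run time
 * (0 <= f <= 1) is accelerated by a factor s (s > 0). */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For a fixed kernel speedup s, a smaller f always yields a smaller end-to-end gain, so a machine on which the GEMV occupies less of the frame shows a smaller end-to-end percentage even when its kernel is faster.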

So the mechanism is self-consistent: EMR's whole non-DNN pipeline is faster (newer core, DDR5, 2 VNNI units, higher IPC), so the GEMV is a smaller fraction of EMR's frame — and the same kernel win moves a smaller fraction less. "Faster kernel" (EMR) and "bigger e2e %" (Cascade, because it's more GEMV-bound) are simply not the same quantity.

Hope it helps

claude added 2 commits June 16, 2026 19:34

jmvalin closed this Jun 16, 2026

czoli1976 wants to merge 2 commits into xiph:main from czoli1976:claude/stoic-babbage-hurt97
