x86: runtime AVX512-VNNI tier for the int8 DNN GEMV (+ multi-accumulator cgemv)#484
x86: runtime AVX512-VNNI tier for the int8 DNN GEMV (+ multi-accumulator cgemv)#484czoli1976 wants to merge 2 commits into
Conversation
Add an RTCD arch level above AVX2 (index 5) so CPUs with AVX512-VNNI
run the EVEX-encoded 256-bit vpdpbusd int8 dot product in the DNN
int8 GEMV (compute_linear with quantized weights) instead of the AVX2
vpmaddubsw+vpmaddwd+vpaddd emulation. Mirrors the Arm DOTPROD tier.
This box has AVX512-VNNI but not the VEX-encoded AVX-VNNI, and opus's
strictly-sequential RTCD ladder (plus the FUZZING downgrade) can only
carry one encoding per tier, so the tier targets AVX512-VNNI: compiled
-mavx512vnni -mavx512vl (kept 256-bit via -mprefer-vector-width=256,
the win is the instruction not wider vectors) and gated on
AVX512F+AVX512VL+AVX512_VNNI with an OSXSAVE/XGETBV check for OS
AVX-512 state. Detection: CPUID.(7,0) EBX[16]/EBX[31]/ECX[11].
- celt/x86/x86cpu.{c,h}: detect AVX512-VNNI, MAY_HAVE_AVX512VNNI macro,
new arch level in opus_select_arch.
- dnn/x86: nnet_avx512vnni.c (RTCD_ARCH avx512vnni), declare/dispatch
compute_linear_avx512vnni; only compute_linear uses vpdpbusd, so
activation/conv2d reuse the AVX2 variants at index 5.
- Fill index 5 in every x86 _IMPL table (celt/silk) so it is never NULL.
- Wire detection + per-source flags in autotools, meson and cmake.
Verified on this CPU: opus_select_arch()=5, DNN_COMPUTE_LINEAR_IMPL[5]
dispatches to compute_linear_avx512vnni, object emits EVEX vpdpbusds
(no zmm), all three build systems compile it, and DRED encode+decode
output is byte-identical to the AVX2 tier (bit-exact).
https://claude.ai/code/session_01H3ZadW9kpYqiMEGxhtpV2j
The unroll-by-4 in cgemv8x4 / sparse_cgemv8x4 fed all four vpdpbusds into a single accumulator, serializing them on that register. On the VNNI tier the EVEX vpdpbusds has ~5-cycle latency, so this single- accumulator recurrence made the VNNI kernel latency-bound and actually slower than the AVX2 vpmaddubsw+vpmaddwd+vpaddd emulation (which keeps its multiplies off the accumulator's critical path). Split the unrolled body across four independent accumulators and sum them once per 8-row block, keeping several int8 dot products in flight. Bit-exact: the AVX2/SSE emulation accumulates with wrapping 32-bit adds (exactly associative), and on VNNI the regrouping is exact whenever the per-output sum stays within int32 - which holds for the quantized weights used by the models. DRED encode+decode output is byte-identical before and after, and the check-asm + assertions test suite passes. Measured on an AVX512-VNNI CPU (compute_linear int8 GEMV, best of 5): the VNNI tier goes from ~0.8x (a regression) to ~1.0-1.09x vs AVX2, e.g. 512x1536 24.8->17.0 us and 1024x1024 41.0->22.9 us; the AVX2 tier also speeds up a few percent. End-to-end the DNN is a fraction of the frame, so the wall-clock change is within noise on typical clips. https://claude.ai/code/session_01H3ZadW9kpYqiMEGxhtpV2j
|
Looks like AI slop for admittedly no real performance gain |
|
Very small indeed
Best Regards
Ckristian Zoli
…On Tue, 16 Jun 2026 at 21:41 Jean-Marc Valin ***@***.***> wrote:
*jmvalin* left a comment (xiph/opus#484)
<#484 (comment)>
Looks like AI slop for admittedly no real performance gain
—
Reply to this email directly, view it on GitHub
<#484?email_source=notifications&email_token=APL2Z6XPJ5BBDGGQMZEEO3D5AGWGVA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINZSGM2DIMBXGIZ2M4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KYZTPN52GK4S7MNWGSY3L#issuecomment-4723440723>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/APL2Z6RQDY53ZXWPIPVYM3D5AGWGVAVCNFSNUABEKJSXA33TNF2G64TZHMZDGMBZG4YTOMZ3JFZXG5LFHM2DMNZXGQ3TENZXG6QXMAQ>
.
Triage notifications, keep track of coding agent tasks and review pull
requests on the go with GitHub Mobile for iOS
<https://github.com/notifications/mobile/ios/APL2Z6WSHT6IREUW652P5UT5AGWGVA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINZSGM2DIMBXGIZ2M4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KUZTPN52GK4S7NFXXG>
and Android
<https://github.com/notifications/mobile/android/APL2Z6UNCDNHYY5MEPU543D5AGWGVA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINZSGM2DIMBXGIZ2M4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2K4ZTPN52GK4S7MFXGI4TPNFSA>.
Download it today!
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
Fair, @jmvalin A bit of context: this was tested on a container with Intel Cascade Lake(CL), which is old and notoriously suffers from downclock when using AVX. It is hard to get an AVX gain on such silicon as it is usually slower but my kernel is x1.0-x1.09, which is already a win on this generation. A later generation, e.g. Ice Lake-SP or a Sapphire Rapids would not have the downclock issue and the same goes for a modern Desktop CPU like AlderLake. Moreover newer Silicon has more lanes and bigger L1 to bank more. However, given the kernel, especially on CL, is load-port-bound (vs compute-bound) I now have a version with tiling, It tiles several 8-row blocks together and broadcasts each activation only once. This on the same Cascade Lake container.
Again, can expect more on newer silicon. I understand if you think 2% E2E is small but given how SIMD optimised the code is already 2% sounds quite significant to me, especially considering that this micro optimisation is for ~5% of encode (~15% of the DNN). Would you like to see and bench the improved version? Especially if you had an Alder Lake or newer Desktop. PS: the Tiling benefits the AVX as well and for E2E translates to 2%. |




Problem
The VNNI single-instruction int8 dot product (
_mm256_dpbusds_epi32→vpdpbusd) already exists indnn/vec_avx.h, but it is compile-time gated behind#if defined(__AVXVNNI__) || defined(__AVX512VNNI__)with no runtime-dispatched tier. A normal distro build (-mavx2) running on a VNNI CPU therefore executes the 3-instruction AVX2 emulation (vpmaddubsw+vpmaddwd+vpaddd), nevervpdpbusd. This is the x86 mirror of the Arm DOTPROD tier.What this PR does
1. New RTCD tier above AVX2 (index 5) for the int8 GEMV (
compute_linearwith quantized weights):celt/x86/x86cpu.c(CPUID.(7,0) EBX[16]=AVX512F, EBX[31]=AVX512VL, ECX[11]=AVX512_VNNI), plus anOSXSAVE/XGETBVcheck for OS AVX-512 state.dnn/x86/nnet_avx512vnni.c(RTCD_ARCH avx512vnni),compute_linear_avx512vnni,MAY_HAVE_AVX512VNNI, dispatch indnn/x86/dnn_x86.h+x86_dnn_map.c. Onlycompute_linearusesvpdpbusd; activation/conv2d reuse the AVX2 variants at index 5._IMPLtable (celt/silk) so it is never NULL on a VNNI CPU.The target CPU here has AVX512-VNNI but not the VEX-encoded AVX-VNNI (a
-mavxvnnikernel is an illegal instruction on it). Because opus's RTCD ladder is strictly sequential (andFUZZINGrandomly downgrades within it), a tier can only carry one encoding, so this tier targets AVX512-VNNI: EVEX-encodedvpdpbusds, kept 256-bit wide (-mprefer-vector-width=256— the win is the instruction, not wider vectors).2. Multi-accumulator
cgemv8x4/sparse_cgemv8x4(the change that makes the tier worthwhile): the unroll-by-4 fed all fourvpdpbusdsinto a single accumulator, serializing them on that register. Since EVEXvpdpbusdshas ~5-cycle latency, that recurrence made the naive VNNI tier slower than the AVX2 emulation (which keeps its multiplies off the accumulator's critical path). Splitting across four independent accumulators fixes it and also speeds up the existing AVX2/SSE paths a few percent.Bit-exactness
Verified byte-identical DRED encode+decode output between the VNNI tier, the AVX2 tier, and the pre-restructure code on real model weights. The AVX2/SSE emulation accumulates with wrapping 32-bit adds (exactly associative); on VNNI the regrouping is exact whenever the per-output sum stays within int32, which holds for the quantized weights the models use.
meson test(with-Dcheck-asm=true -Dassertions=true) passes; the regroup only diverges in a synthetic full-int8-range stress test, where VNNI actually matches the exact C reference better than the saturating AVX2 emulation.Performance — honest scope
Measured on an AVX512-VNNI Xeon (
compute_linearint8 GEMV, best of 5):On this µarch the VNNI tier is ~1.0–1.09× vs AVX2 — a small win, not the ~2–3× the Arm DOTPROD analogy suggested. Reasons: this batch-1 recurrent GEMV is partly load-bound, and on this µarch
vpdpbusdsthroughput ≈ the AVX2 emulation's once latency is hidden. End-to-end the DNN is a fraction of a frame, so wall-clock is within noise on typical clips. It may show larger gains on Ice Lake / Sapphire Rapids (higher VNNI throughput, no AVX-512 downclock) or on VEX AVX-VNNI parts (Alder/Raptor Lake), which weren't available to test. Frequency downclock was measured and ruled out on this VM.Scope: only helps DNN-enabled builds (DRED/OSCE/Deep-PLC, off by default) on VNNI CPUs. The multi-accumulator restructure benefits all x86 SIMD tiers regardless of VNNI.
Verification
opus_select_arch()→ 5;DNN_COMPUTE_LINEAR_IMPL[5]→compute_linear_avx512vnni.vpdpbusds(nozmm); AVX2 object still hasvpmaddubswand novpdpbusd.Related PRs (same contribution batch)