silk: Add Arm NEON silk_VAD_GetSA_Q8 by czoli1976 · Pull Request #483 · xiph/opus

czoli1976 · 2026-06-16T18:25:39Z

Summary

silk_VAD_GetSA_Q8 had an x86 SSE4.1 implementation but no Arm one, even though it runs on every SILK/hybrid frame in the default (float) build. This adds a NEON version, mirroring the SSE4.1 one.

What's vectorised

Only the per-subframe energy sum of squares, (X[i] >> 3)^2 accumulated in int32, is vectorised: 8 samples per iteration via vshrq_n_s16 plus a pair of vmlal_s16 (low/high halves), with a horizontal vaddvq_s32 reduction and a scalar tail for the remaining samples. Everything else (analysis filterbank, noise estimation, SNR/tilt) is the same code as the C reference, as in the SSE4.1 version.
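For readers who want to see the shape of the loop described above, here is a minimal sketch, not the code from this PR. The function names, the `HAVE_NEON_SKETCH` macro and the `SKETCH_MAX_LEN` limit are hypothetical; the limit of 80 is taken from the subband lengths quoted in the description. On targets without AArch64 NEON the "NEON" function simply falls back to the scalar loop so the file still compiles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1   /* hypothetical macro, for this sketch only */
#else
#define HAVE_NEON_SKETCH 0
#endif

#define SKETCH_MAX_LEN 80    /* assumption: longest subband length quoted in the PR */

/* Scalar reference: sum of (x[i] >> 3)^2 accumulated in 32 bits.
 * Relies on >> being an arithmetic shift for negative values,
 * which holds on the compilers Opus targets. */
static int32_t energy_scalar(const int16_t *x, int n)
{
    int32_t sum = 0;
    int i;
    assert(n >= 0 && n <= SKETCH_MAX_LEN);
    for (i = 0; i < n; i++) {
        int32_t s = x[i] >> 3;
        sum += s * s;
    }
    return sum;
}

/* Same sum, 8 samples per iteration where NEON is available. */
static int32_t energy_neon(const int16_t *x, int n)
{
#if HAVE_NEON_SKETCH
    int32x4_t acc = vdupq_n_s32(0);
    int32_t sum;
    int i = 0;
    assert(n >= 0 && n <= SKETCH_MAX_LEN);
    for (; i + 8 <= n; i += 8) {
        /* Load 8 samples and shift each right by 3 (arithmetic). */
        int16x8_t v = vshrq_n_s16(vld1q_s16(x + i), 3);
        /* Widening multiply-accumulate: low 4 lanes, then high 4 lanes. */
        acc = vmlal_s16(acc, vget_low_s16(v), vget_low_s16(v));
        acc = vmlal_s16(acc, vget_high_s16(v), vget_high_s16(v));
    }
    sum = vaddvq_s32(acc);           /* horizontal add of the 4 lanes */
    for (; i < n; i++) {             /* scalar tail for n % 8 samples */
        int32_t s = x[i] >> 3;
        sum += s * s;
    }
    return sum;
#else
    return energy_scalar(x, n);      /* no NEON: behave like the reference */
#endif
}
```

Because integer addition is associative and nothing overflows at these lengths, the lane-wise accumulation order does not change the result, which is why the two functions can be compared with a plain equality check.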

Bit-exact with silk_VAD_GetSA_Q8_c (exact integer sum of squares, no overflow), validated by the existing OPUS_CHECK_ASM full-encoder-state memcmp.
As on x86, silk_VAD_GetNoiseLevels becomes exported (instead of static inline in VAD.c) when NEON is enabled, so the kernel can call it.
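The "no overflow" claim can be checked with a short bound. This is a sketch that assumes each accumulation covers at most $N = 80$ samples, the longest subband length quoted in this description; it does not cover longer accumulations.

$$
-2^{12} \le (X[i] \gg 3) \le 2^{12} - 1
\;\Rightarrow\;
(X[i] \gg 3)^2 \le 2^{24},
\qquad
\sum_{i=0}^{N-1} (X[i] \gg 3)^2 \le 80 \cdot 2^{24} = 1\,342\,177\,280 < 2^{31} - 1 .
$$

Under that assumption the int32 sum is exact in any accumulation order, so a lane-wise NEON sum and the sequential C sum agree bit for bit.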

Dispatch / wiring

Uses the existing OVERRIDE_silk_VAD_GetSA_Q8 hook: a new silk/arm/VAD_arm.h provides the PRESUME (direct call) and RTCD (SILK_VAD_GETSA_Q8_IMPL table in arm_silk_map.c) dispatch, mirroring silk/x86/main_sse.h. The source is added to the common SILK_SOURCES_ARM_NEON_INTR group, which is already wired in autotools / CMake / Meson — so no build-system changes are needed.
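The difference between the two dispatch modes can be illustrated with a toy example. This is a hedged sketch of the general pattern only: every identifier below (`sumsq_c`, `sumsq_neon_standin`, `SKETCH_PRESUME_NEON`, `SKETCH_ARCH_*`, `SUMSQ_IMPL`) is hypothetical and is not the Opus macro or table name, and the "NEON" function is a stand-in that just counts its calls.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t (*sumsq_fn)(const int16_t *x, int n);

/* Portable reference kernel. */
static int32_t sumsq_c(const int16_t *x, int n)
{
    int32_t sum = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++) {
        int32_t s = x[i] >> 3;
        sum += s * s;
    }
    return sum;
}

/* Stand-in for an optimised kernel; counts calls so dispatch is observable. */
static int standin_calls = 0;
static int32_t sumsq_neon_standin(const int16_t *x, int n)
{
    standin_calls++;
    return sumsq_c(x, n);
}

/* Hypothetical arch levels; the mask keeps any index inside the table. */
enum { SKETCH_ARCH_C = 0, SKETCH_ARCH_NEON = 3, SKETCH_ARCH_MASK = 3 };

#if defined(SKETCH_PRESUME_NEON)
/* "Presume" mode: the build guarantees NEON, so call the kernel directly. */
#define sumsq_dispatch(x, n, arch) ((void)(arch), sumsq_neon_standin(x, n))
#else
/* Runtime CPU detection mode: one table entry per arch level. */
static const sumsq_fn SUMSQ_IMPL[SKETCH_ARCH_MASK + 1] = {
    sumsq_c,            /* plain C */
    sumsq_c,            /* levels below NEON reuse the C kernel */
    sumsq_c,
    sumsq_neon_standin  /* NEON and above */
};
#define sumsq_dispatch(x, n, arch) \
    ((*SUMSQ_IMPL[(arch) & SKETCH_ARCH_MASK])(x, n))
#endif
```

Callers use `sumsq_dispatch(x, n, arch)` in both modes; only the macro expansion changes, so call sites need no per-platform code.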

Numbers (Apple M4)

Microbench of the vectorised loop at the real subband lengths (10–80 samples): ~1.1–1.7× over scalar.
End-to-end within run-to-run noise (VAD is a small per-frame cost), bitstream unchanged (bit-exact).
Full meson test suite passes.

This closes the last SILK-side parity gap (x86 has it, Arm doesn't) in code that runs in the default build.

silk_VAD_GetSA_Q8 had an x86 SSE4.1 implementation but no Arm one, and it runs on every SILK/hybrid frame in the default (float) build.

Add a NEON version mirroring the SSE4.1 one: it vectorises the per-subframe energy sum-of-squares ((X[i] >> 3)^2 accumulated in int32), 8 samples per iteration via vshrq_n_s16 + paired vmlal_s16, with a scalar tail. Bit-exact with the C reference (exact integer sum, no overflow), validated by the existing OPUS_CHECK_ASM full-state memcmp.

As on x86, silk_VAD_GetNoiseLevels is exported (rather than static inline in VAD.c) when NEON is enabled so the kernel can call it. Dispatched via the existing OVERRIDE_silk_VAD_GetSA_Q8 hook (PRESUME + an RTCD table in arm_silk_map.c); the source goes in the common SILK_SOURCES_ARM_NEON_INTR group, already wired in autotools/CMake/Meson.

Microbench on Apple M4 (the real subband lengths, 10-80): ~1.1-1.7x over scalar; E2E within run-to-run noise (VAD is a small per-frame cost). Full meson test suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

czoli1976 mentioned this pull request Jun 16, 2026

x86: runtime AVX512-VNNI tier for the int8 DNN GEMV (+ multi-accumulator cgemv) #484

Closed

czoli1976 wants to merge 1 commit into xiph:main from czoli1976:silk-arm-vad
