silk: Add Arm NEON silk_VAD_GetSA_Q8 #483

Open
czoli1976 wants to merge 1 commit into xiph:main from czoli1976:silk-arm-vad

@czoli1976

Summary

silk_VAD_GetSA_Q8 had an x86 SSE4.1 implementation but no Arm one, even though it runs on every SILK/hybrid frame in the default (float) build. This adds a NEON version, mirroring the SSE4.1 one.

What's vectorised

The per-subframe energy sum of squares, (X[i] >> 3)^2 accumulated in int32: 8 samples per iteration via vshrq_n_s16 + paired vmlal_s16 (low/high), with a scalar tail and a horizontal vaddvq_s32. Everything else (analysis filterbank, noise estimation, SNR/tilt) is left identical to the C reference, as in the SSE4.1 version.

  • Bit-exact with silk_VAD_GetSA_Q8_c (exact integer sum of squares, no overflow), validated by the existing OPUS_CHECK_ASM full-encoder-state memcmp.
  • As on x86, silk_VAD_GetNoiseLevels becomes exported (instead of static inline in VAD.c) when NEON is enabled, so the kernel can call it.
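
For readers unfamiliar with the intrinsics named above, the loop shape can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names `energy_scalar` and `energy_neon_sketch` are hypothetical, and on targets without AArch64 NEON the sketch simply falls through to the scalar loop. The no-overflow claim follows from the input range: for int16 input, |X[i] >> 3| <= 4096, so each square is at most 2^24, and 80 terms (the longest length quoted in this PR) sum to at most 80 * 2^24 = 1,342,177,280, below 2^31 - 1.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference: sum of (x[i] >> 3)^2 accumulated in int32.
 * Assumes >> on a negative value is an arithmetic shift, as it is on
 * the compilers Opus targets. */
static int32_t energy_scalar(const int16_t *x, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t s = x[i] >> 3;
        sum += s * s;
    }
    return sum;
}

/* Hypothetical sketch of the vectorised loop: 8 samples per iteration,
 * widening multiply-accumulate on the low and high halves, then a
 * horizontal add and a scalar tail for the remaining 0..7 samples. */
static int32_t energy_neon_sketch(const int16_t *x, size_t n)
{
    size_t i = 0;
    int32_t sum = 0;
#if defined(__ARM_NEON) && defined(__aarch64__)
    int32x4_t acc = vdupq_n_s32(0);
    for (; i + 8 <= n; i += 8) {
        /* Arithmetic shift right by 3 on eight int16 lanes. */
        int16x8_t v = vshrq_n_s16(vld1q_s16(x + i), 3);
        /* acc += v * v, widened to int32, four lanes at a time. */
        acc = vmlal_s16(acc, vget_low_s16(v), vget_low_s16(v));
        acc = vmlal_s16(acc, vget_high_s16(v), vget_high_s16(v));
    }
    sum = vaddvq_s32(acc); /* horizontal add (AArch64 only) */
#endif
    for (; i < n; i++) {   /* scalar tail, or the whole loop without NEON */
        int32_t s = x[i] >> 3;
        sum += s * s;
    }
    return sum;
}
```

Because both paths compute the same exact integer sum, the result does not depend on the order of accumulation, which is what makes a bit-exact comparison against the C reference possible.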

Dispatch / wiring

Uses the existing OVERRIDE_silk_VAD_GetSA_Q8 hook: a new silk/arm/VAD_arm.h provides the PRESUME (direct call) and RTCD (SILK_VAD_GETSA_Q8_IMPL table in arm_silk_map.c) dispatch, mirroring silk/x86/main_sse.h. The source is added to the common SILK_SOURCES_ARM_NEON_INTR group, which is already wired in autotools / CMake / Meson — so no build-system changes are needed.
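
As a rough picture of the two dispatch modes mentioned above, here is a self-contained toy. Every name in it (`TOY_PRESUME_NEON`, `TOY_ARCHMASK`, `TOY_VAD_IMPL`, `toy_vad`) is hypothetical and stands in for the real Opus macros and tables, whose exact definitions are not reproduced here. With the presume macro defined, the call compiles to a direct call to the NEON function; otherwise it indexes a table by the runtime-detected arch level.

```c
#include <stdint.h>

/* Hypothetical: records which implementation ran (0 = C, 1 = NEON). */
static int toy_last_impl = -1;

static int toy_vad_c(const int16_t *x, int n)
{
    (void)x;
    toy_last_impl = 0;
    return n;
}

static int toy_vad_neon(const int16_t *x, int n)
{
    (void)x;
    toy_last_impl = 1;
    return n;
}

#define TOY_ARCHMASK 3 /* hypothetical: four arch levels, 0..3 */

#if defined(TOY_PRESUME_NEON)
/* "PRESUME" mode: NEON is guaranteed at build time, so call it directly
 * and ignore the arch argument. */
#define toy_vad(x, n, arch) ((void)(arch), toy_vad_neon((x), (n)))
#else
/* "RTCD" mode: one table entry per arch level; only the highest level
 * in this toy maps to the NEON implementation. */
static int (*const TOY_VAD_IMPL[TOY_ARCHMASK + 1])(const int16_t *, int) = {
    toy_vad_c, toy_vad_c, toy_vad_c, toy_vad_neon
};
#define toy_vad(x, n, arch) ((*TOY_VAD_IMPL[(arch) & TOY_ARCHMASK])((x), (n)))
#endif
```

Masking the index with the arch mask keeps the table lookup in bounds even if the caller passes an unexpected arch value.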

Numbers (Apple M4)

  • Microbench of the vectorised loop at the real subband lengths (10–80 samples): ~1.1–1.7× over scalar.
  • End-to-end within run-to-run noise (VAD is a small per-frame cost), bitstream unchanged (bit-exact).
  • Full meson test suite passes.

This closes the last of the SILK-side parity gaps (x86 has it, Arm doesn't) in code that runs in the default build.

silk_VAD_GetSA_Q8 had an x86 SSE4.1 implementation but no Arm one, and it
runs on every SILK/hybrid frame in the default (float) build. Add a NEON
version mirroring the SSE4.1 one: it vectorises the per-subframe energy
sum-of-squares ((X[i] >> 3)^2 accumulated in int32), 8 samples per iteration
via vshrq_n_s16 + paired vmlal_s16, with a scalar tail. Bit-exact with the C
reference (exact integer sum, no overflow), validated by the existing
OPUS_CHECK_ASM full-state memcmp.

As on x86, silk_VAD_GetNoiseLevels is exported (rather than static inline in
VAD.c) when NEON is enabled so the kernel can call it. Dispatched via the
existing OVERRIDE_silk_VAD_GetSA_Q8 hook (PRESUME + an RTCD table in
arm_silk_map.c); the source goes in the common SILK_SOURCES_ARM_NEON_INTR
group, already wired in autotools/CMake/Meson.

Microbench on Apple M4 (the real subband lengths, 10-80): ~1.1-1.7x over
scalar; E2E within run-to-run noise (VAD is a small per-frame cost). Full
meson test suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>