RISC-V: Add optimized decompression path #236

Closed
zhanchangbao-sanechips wants to merge 1 commit into google:main from zhanchangbao-sanechips:add_rvvopt

@zhanchangbao-sanechips

Summary

This PR adds a RISC-V-optimized decompression path for the branchless inner loop in Snappy.
RISC-V lacks conditional-move (cmov) instructions, making the existing x86-optimized AdvanceToNextTagX86Optimized suboptimal on RISC-V platforms.
We introduce AdvanceToNextTagRISCVOptimized with a branch structure similar to ARM, and share ARM's ExtractOffset / Load16 strategy.

Motivation

The decompression loop (DecompressBranchless) is the hottest path for Snappy's RawUncompress, Uncompress, and validation APIs.

  • On x86_64, the loop uses cmov and volatile loads to minimize latency.
  • On ARM64, a simpler branch-based approach works better because the conditional select-increment instruction (csinc) removes several register moves.
  • RISC-V falls through to the x86 path by default, which forces the compiler to emulate cmov with extra register moves and branches.

This patch gives RISC-V its own optimized path.
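
To make the contrast concrete, here is a minimal sketch of the two styles of tag advance described above. It is not copied from snappy.cc: the function names `AdvanceSelectSketch` and `AdvanceBranchSketch` are hypothetical, the sketch assumes the input pointer sits one byte past the current tag, and it ignores long-literal encodings and bounds checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustrations only; neither function is taken from snappy.cc.
// Assumed convention: *ip_p points one byte past the current tag byte.

// Select style: load both candidate next tags, then pick one. This suits
// targets with a conditional-move instruction. Both loads always execute,
// so the buffer must have readable slack past either candidate position.
inline size_t AdvanceSelectSketch(const uint8_t** ip_p, size_t* tag) {
  assert(ip_p != nullptr && tag != nullptr);
  const uint8_t* ip = *ip_p;
  const size_t tag_type = *tag & 3;            // low two bits select the element type
  const size_t literal_len = (*tag >> 2) + 1;  // meaningful for short literals only
  const bool is_literal = (tag_type == 0);
  const size_t tag_literal = ip[literal_len];  // next tag if this was a literal
  const size_t tag_copy = ip[tag_type];        // next tag if this was a copy
  *tag = is_literal ? tag_literal : tag_copy;
  *ip_p = is_literal ? ip + literal_len + 1 : ip + tag_type + 1;
  return tag_type;
}

// Branch style: one compare-and-branch, then a single load on each path.
inline size_t AdvanceBranchSketch(const uint8_t** ip_p, size_t* tag) {
  assert(ip_p != nullptr && tag != nullptr);
  const uint8_t* ip = *ip_p;
  const size_t tag_type = *tag & 3;
  if (tag_type == 0) {
    const size_t literal_len = (*tag >> 2) + 1;
    *tag = ip[literal_len];
    *ip_p = ip + literal_len + 1;
  } else {
    *tag = ip[tag_type];
    *ip_p = ip + tag_type + 1;
  }
  return tag_type;
}
```

Both functions compute the same result; they differ only in whether the choice is expressed as a data-dependent select or as a branch, which is the trade-off this PR is about.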

Changes Made

  • Added AdvanceToNextTagRISCVOptimized() in snappy.cc
  • Updated ExtractOffset() to include defined(__riscv) alongside defined(__aarch64__)
  • Updated DecompressBranchless() to use the new RISC-V path with Load16
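
As a rough illustration of the `ExtractOffset()` change, the sketch below shows how a preprocessor guard could route RISC-V to the same shift-and-mask form as ARM64. The helper name `ExtractOffsetSketch`, the packed constant, and the per-tag-type masks (0, 0xFF, 0xFFFF, 0) are assumptions made for this example; the actual code in snappy.cc may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed masks per tag type: 0 -> no offset bytes, 1 -> one byte,
// 2 -> two bytes, 3 -> masked to zero (treated as handled elsewhere).
// Packed 16 bits per tag type, tag type 0 in the lowest 16 bits.
constexpr uint64_t kMasksPackedSketch = 0x0000FFFF00FF0000ull;

// Hypothetical helper, not the function from snappy.cc.
inline uint32_t ExtractOffsetSketch(uint32_t val, size_t tag_type) {
  assert(tag_type < 4);
#if defined(__aarch64__) || defined(__riscv)
  // Shift-and-mask form: register arithmetic only, no table load.
  return val & static_cast<uint32_t>(
                   (kMasksPackedSketch >> (tag_type * 16)) & 0xFFFF);
#else
  // Table form, used as the generic fallback in this sketch.
  static constexpr uint32_t kMasks[4] = {0, 0xFF, 0xFFFF, 0};
  return val & kMasks[tag_type];
#endif
}
```

Either branch of the guard returns the same value for a given input, so the guard only changes how the mask is produced on each target.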

Implementation Details

  • No new dependencies: the optimization is purely scalar, guarded by #if defined(__riscv)
  • No API changes: fully backward compatible
  • Non-RISC-V platforms: zero impact — all changes are behind preprocessor conditionals

Performance Results

Test Environment

  • Hardware: Banana Pi K1 (SpacemiT X60)
  • CPU: 8-core X60 @ 1.6GHz
  • Compiler: Clang 17+ / GCC 13+ with -march=rv64gcv

BM_UFlat (Decompression) – Core Improvement

| Benchmark | Before (ns) | After (ns) | Speedup | Bandwidth Gain |
| --- | --- | --- | --- | --- |
| html/1 | 228,447 | 185,498 | 1.232 | +23.2% |
| html/2 | 206,562 | 168,327 | 1.227 | +22.7% |
| urls/1 | 2,703,524 | 2,221,052 | 1.217 | +21.7% |
| urls/2 | 2,500,196 | 2,057,324 | 1.215 | +21.5% |
| html4/1 | 930,183 | 762,913 | 1.219 | +21.9% |
| txt1/1 | 1,008,827 | 817,822 | 1.234 | +23.4% |
| txt2/1 | 889,770 | 724,470 | 1.228 | +22.8% |
| txt3/1 | 2,686,800 | 2,183,861 | 1.230 | +23.0% |
| txt4/1 | 3,750,967 | 3,052,873 | 1.229 | +22.9% |
| pb/1 | 201,982 | 164,958 | 1.224 | +22.4% |
| gaviota/1 | 997,243 | 822,349 | 1.213 | +21.3% |
| Medley | 13,674,217 | 11,192,662 | 1.222 | +22.2% |
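
The Speedup and Bandwidth Gain columns are consistent with being derived from the two timing columns as before/after and (before/after - 1) × 100. That derivation is inferred from the numbers rather than stated in the PR, and the helper names below are hypothetical.

```cpp
#include <cassert>

// Speedup as the ratio of run times: values above 1.0 mean "after" is faster.
inline double Speedup(double before_ns, double after_ns) {
  assert(after_ns > 0.0);
  return before_ns / after_ns;
}

// Throughput gain in percent for a fixed input size, derived from the speedup.
inline double BandwidthGainPercent(double before_ns, double after_ns) {
  return (Speedup(before_ns, after_ns) - 1.0) * 100.0;
}
```

For the html/1 row, `Speedup(228447, 185498)` is about 1.232 and `BandwidthGainPercent(228447, 185498)` is about 23.2, matching the table.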

BM_UValidate (Validation) – Consistent Gains

| Benchmark | Before (ns) | After (ns) | Speedup |
| --- | --- | --- | --- |
| html/1 | 141,792 | 116,440 | 1.218 |
| pdf/1 | 13,438 | 11,055 | 1.215 |
| txt1/1 | 631,535 | 516,461 | 1.223 |
| txt4/1 | 2,312,771 | 1,891,391 | 1.223 |
| pb/1 | 123,329 | 101,392 | 1.216 |
| gaviota/1 | 597,290 | 487,891 | 1.224 |
| Medley | 8,432,222 | 6,930,131 | 1.217 |

Binary Data – No Regression

| Benchmark | Before (ns) | After (ns) | Speedup | Note |
| --- | --- | --- | --- | --- |
| jpg/1 | 27,303 | 27,755 | 0.984 | -1.6% (within measurement noise) |
| jpg/2 | 26,663 | 26,757 | 0.996 | -0.4% (within measurement noise) |
| jpg_200/1 | 1,138 | 1,130 | 1.007 | +0.7% |
| pdf/1 | 38,860 | 36,969 | 1.051 | +5.1% |
| pdf/2 | 84,765 | 75,830 | 1.118 | +11.8% |

Other Operations

| Operation | Result | Assessment |
| --- | --- | --- |
| BM_UFlatSink | +22~23% on text | Consistent with UFlat |
| BM_ZFlat | ±2% | No impact on compression |
| BM_UIOVecSource | <3% variance | No regression |
| BM_UIOVecSink | <2% variance | No regression |

Test Repeatability

Three independent runs confirm stable and reproducible results.
All text workloads improve by a consistent 21~23%; the jpg workloads vary by less than 2% (within measurement noise).

Compatibility & Portability

| Platform | Behavior |
| --- | --- |
| RISC-V (__riscv defined) | Uses new optimized path |
| Non-RISC-V (x86_64, ARM64) | Completely unaffected; code is behind #if defined(__riscv) |

Testing

  • snappy_unittest passes all tests
  • snappy_benchmark verified on RISC-V hardware (Banana Pi K1)
  • No regressions on existing platforms (CI verified)

Checklist

  • Code follows project’s C++ style
  • Comments added for non-obvious logic
  • Performance data included with multiple test runs
  • Full backward compatibility maintained
  • No breaking changes to API or behavior
  • All existing unit tests pass

Screenshots

Unit Tests - All Pass
[screenshot]

Benchmark - Before Optimization
[screenshot]

Benchmark - After Optimization
[screenshot]

@google-cla

google-cla Bot commented Apr 28, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

    RISC-V lacks conditional-move (cmov) instructions, making the x86
    cmov-based path suboptimal. Use a branch-based approach similar to
    ARM64, and adopt the same ExtractOffset / Load16 strategy.

    Benchmarks on RV64 show:
    - UFlat/UValidate: +22~26% on text workloads
    - UFlatSink: +22~23%
    - Binary data (jpg/pdf): no regression
    - Compression (ZFlat): unchanged

@danilak-G

Collaborator

This is a copy of #234

@danilak-G danilak-G closed this May 9, 2026