Java Quad Search – A Faster Alternative to Binary Search

An SIMD-powered search algorithm in Java that beats binary search for larger arrays.

Inspired by Lemire's blog post: https://lemire.me/blog/2026/04/27/you-can-beat-the-binary-search/

Thanks to Daniel Lemire for this interesting idea.

Motivation

Binary search is theoretically optimal. Its runtime complexity is O(log n). What it means by example:

Minimal comparisons, only 30 operations for 1 billion entries.

So, we can think of this; is there still any room for improvement? Probably, yes. Because:

Modern CPUs are not comparison-bound anymore.
Memory hierarchy and branch prediction dominate performance.

In this experiment, we take Arrays.binarySearch in Java as reference. We tried two alternative implementations:

QuadSearch: Quaternary narrowing – predefined blocks are grouped into four regions until max. 3 blocks left and then continue with normal sequential search
SIMDQuadSearch: Quaternary narrowing – the same process above – and then SIMD powered sequential search

In both implementations, we keep the Arrays.binarySearch semantics the same:

If the search value is found, we return the index which could be [0, n) when array size is n
If the search value is not found, we return the insertion point as (-insertion-point - 1)

Benchmark

In the benchmarks, we use different sizes of arrays; 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, and 1m (from 2^12 to 2^20 — 4096 to 1,048,576)

In our randomly generated test data, the count of arrays decreases proportionally while the size increases, like in the pattern below:

| Count | Array Size |
|-------|------------|
| 256k  | 4k         |
| 128k  | 8k         |
| 64k   | 16k        |
| 32k   | 32k        |
| 16k   | 64k        |
| 8k    | 128k       |
| 4k    | 256k       |
| 2k    | 512k       |
| 1k    | 1m         |

We took this approach because it is aligned with the real world use-cases. Usually, you have many but small-sized arrays, like in Vector databases, or you have few but larger arrays, such as per-column sorted data structures in OLAP/columnar databases and large inverted index posting lists in search engines.

Additionally, we run two different benchmarks to test cold and warm cache behavour:

Cold cache behavour: Search an array once, continue with the next one
Warm cache behavour: Search an array 100 times before going on with the next one

All the benchmarks we run in the latest Java 25 version (by the time released in sdkman) on both Oracle JDK and GraalVM vendors.

TL;DR: Main Observation

A cache-friendly hybrid quaternary search with SIMD instructions can outperform Arrays.binarySearch by:

up to 25% on Oracle JDK, up to 16% on GraalVM on Apple M4 ARM64 (Neon 128-bit) hardware
up to 24% on Oracle JDK, up to 29% on GraalVM on x86-64 AMD (AVX2 256-bit) hardware
p99 values are better than default binary search almost in all test cases

You can check the results in here: charts

Alternatively, the JMH results you can see over here: docs/results/

Why Classic Binary Search is not Hardware-Friendly

Binary Search Access Pattern

To understand better why Binary Search is not Hardware-Friendly, let's have a look at Binary search's access pattern. Binary search works fundemantally based on accessing mid-points which consequently leads to:

random memory jumps
poor cache locality
difficult branch prediction

As we notice, each access jumps to a distant location, so very little (or none) of the data stays in the cache.

Therefore, cache locality is almost never preserved.

Theory vs Real Hardware

In theory, binary search is very optimal that makes fewer comparisons all the time. Considering an array of 1 billion entries, it's still only 30 operations (2^30 = 1,073,741,824 ~= 1B) to scan the entire array.

However, fewer comparisons do not necessarily mean lower latency. Modern CPUs prefer:

sequential memory access
predictable control flow
cache locality

over minimal comparison count.

Why Small Linear Scans Are Surprisingly Fast

Locality Wins

In modern CPUs scanning 16–32 contiguous integers is extremely cache-friendly. There are many advantages of contiguous traversal, like:

sequential loads
hardware prefetch efficiency
minimal branch entropy
high Out-of-Order utilization

Thus, sometimes a tiny linear scan can surprisingly outperform logarithmic search locally.

The Hybrid Quaternary Search Design

Core Idea

We split the search into two phases:

Quaternary narrowing
Local block scan (via SIMD)

Architecture:

Large array
↓
Quaternary narrowing
↓
1 to 3 small blocks remain
↓
Linear block scan with SIMD

Quaternary Narrowing

Quaternary narrowing reduces the search space by probing three evenly distributed pivot points instead of a single midpoint as in classic binary search. On each iteration, the algorithm divides the remaining range into four regions and quickly narrows the candidate space down to one quarter of its previous size.

Once only one to three small blocks remain, the search switches to a localized linear or SIMD-assisted scan. This hybrid approach trades a few additional comparisons for significantly better cache locality, more predictable memory access patterns, and reduced branch entropy on modern CPUs.

Implementing SIMD algorithms in Java

In Java by using Project Panama's Vector API, we can tell the compiler to use vector instructions and get the performance benefit of SIMD on supported CPU architectures. Thus, we can read and compare more than 8 bytes, like 16 or 32 bytes depending on the underlying hardware.

Hence, the last remaining blocks (max. three blocks each has a size of 16) can be read by using Vector API:

/**
 * Search inside one 16-element block.
 * <p>
 * Returns:
 * found index
 * OR -(insertionPoint + 1)
 */
private static int blockSearch(int[] array, int base, int target) {
    final int step = SPECIES.length();
    for (int offset = 0; offset < BLOCK_SIZE; offset += step) {
        final int indexBase = base + offset;
        final IntVector values = IntVector.fromArray(SPECIES, array, indexBase);

        final VectorMask<Integer> geMask = values.compare(VectorOperators.GE, target);
        final long bits = geMask.toLong();

        if (bits != 0) {
            final int lane = Long.numberOfTrailingZeros(bits);
            final int index = indexBase + lane;
            final int value = array[index];

            return (value == target) ? index : -(index + 1);
        }
    }

    return -((base + BLOCK_SIZE) + 1);
}

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
assets		assets
docs		docs
gradle/wrapper		gradle/wrapper
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.gradle		build.gradle
gradlew		gradlew
gradlew.bat		gradlew.bat
run-benchmarks.sh		run-benchmarks.sh
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Java Quad Search – A Faster Alternative to Binary Search

Motivation

Benchmark

TL;DR: Main Observation

Why Classic Binary Search is not Hardware-Friendly

Binary Search Access Pattern

Theory vs Real Hardware

Why Small Linear Scans Are Surprisingly Fast

Locality Wins

The Hybrid Quaternary Search Design

Core Idea

Quaternary Narrowing

Implementing SIMD algorithms in Java

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Java Quad Search – A Faster Alternative to Binary Search

Motivation

Benchmark

TL;DR: Main Observation

Why Classic Binary Search is not Hardware-Friendly

Binary Search Access Pattern

Theory vs Real Hardware

Why Small Linear Scans Are Surprisingly Fast

Locality Wins

The Hybrid Quaternary Search Design

Core Idea

Quaternary Narrowing

Implementing SIMD algorithms in Java

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages