We can likely get better throughput by making the main loop branchless, which lets the compiler auto-vectorize it, as [DuckDB's ALP implementation](https://github.com/duckdb/duckdb/blob/v1.1.1/src/include/duckdb/storage/compression/alp/algorithm/alp.hpp#L302-L329) does.
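For illustration, here is a rough sketch of what the branchless shape could look like. It is not DuckDB's code or ours: the function names are hypothetical, and a single `scale` factor stands in for the exponent/factor pair ALP actually uses. The idea is to encode every value unconditionally in one pass, then collect exception positions in a second pass where the index is always written and only the cursor advance depends on the comparison.

```rust
/// Hypothetical branchy baseline: the `if` inside the loop body makes the
/// control flow data-dependent, which tends to block auto-vectorization.
fn encode_branchy(values: &[f64], scale: f64) -> (Vec<i64>, Vec<usize>) {
    let mut encoded = Vec::with_capacity(values.len());
    let mut exceptions = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        let e = (v * scale).round() as i64;
        if e as f64 / scale != v {
            exceptions.push(i);
        }
        encoded.push(e);
    }
    (encoded, exceptions)
}

/// Hypothetical branchless variant (illustrative only, simplified scaling).
fn encode_branchless(values: &[f64], scale: f64) -> (Vec<i64>, Vec<usize>) {
    let n = values.len();

    // Pass 1: encode every value unconditionally. There is no
    // data-dependent branch here, so the compiler is free to vectorize.
    let mut encoded = vec![0i64; n];
    for (out, &v) in encoded.iter_mut().zip(values) {
        *out = (v * scale).round() as i64;
    }

    // Pass 2: record positions that do not round-trip. The candidate index
    // is always stored; the comparison result (0 or 1) advances the cursor.
    // `count <= i < n` holds throughout, so the write stays in bounds.
    let mut exceptions = vec![0usize; n];
    let mut count = 0usize;
    for i in 0..n {
        let decoded = encoded[i] as f64 / scale;
        exceptions[count] = i;
        count += (decoded != values[i]) as usize;
    }
    exceptions.truncate(count);

    (encoded, exceptions)
}
```

Both variants should return the same result for the same input. Whether the compiler actually vectorizes the first pass (and whether the remaining bounds checks in the second pass matter) would need to be confirmed by benchmarking or inspecting the generated assembly.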