From 6f068fce5d693869fdc40a71212fbdf99e4e3e87 Mon Sep 17 00:00:00 2001
From: "Chen, Junjie"
Date: Wed, 3 Oct 2018 14:16:57 +0800
Subject: [PATCH 1/2] PARQUET-41: Add Bloom filter

---
 BloomFilter.md                 | 134 +++++++++++++++++++++++++++++++++
 src/main/thrift/parquet.thrift |  37 +++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 BloomFilter.md

diff --git a/BloomFilter.md b/BloomFilter.md
new file mode 100644
index 000000000..01f177cba
--- /dev/null
+++ b/BloomFilter.md
@@ -0,0 +1,134 @@
+
+
+Parquet Bloom Filter
+===
+### Problem statement
+In current format, statistic filter and dictionary filter are used for predicate pushdown. Statistic
+filter use min/max to filter out values not in range while it can not filter out value within range
+but not in set. Dictionary filter can effectively filter out value not in set but it maybe not
+enabled since dictionary encoding can be fall back to plain encoding when the overhead threshold
+is reached. Therefore, when performing predicate push down against a column with large cardinality,
+there is no effective filter with a high probability.
+
+A Bloom filter[1] is a compact data structure to indicate whether an element is a member of a set.
+It maintains a bitset initially sets to 0. Once an element is added to the set, it sets several
+related bits in bitset to 1. One can query element by checking all of the related bits value.
+If all of related bits are set to 1, it means this element is possibly exist in set, otherwise means
+the element is definitely not in set. Since the size of Bloom filter is compact and can be controlled
+through false positive rate, we can use it as an alternative filter to cover the case of large
+cardinality column.
+
+### Goal
+* Add a Bloom filter utility which can be used in project.
+
+* Implement row group filter base on Bloom Filter. In particular, selective queries with predicate
+read Bloom filter data and evaluate predicate to determine whether to skip row group or not.
+
+* No additional I/O overhead when executing queries on other columns without Bloom filter enabled or
+non selective queries.
+
+### Technical Approach
+The Bloom filter in Parquet is implemented using blocked Bloom filter algorithm from Putze et al.'s
+"Cache-, Hash- and Space-Efficient Bloom filters"[2]. Instead of setting bits by calculating index
+with different hash functions in standard Bloom filter, the blocked Bloom filter uses a single hash
+function to choose a precomputed pattern from a table (called a block or a tiny Bloom filter) of
+random k-bit pattern of width w bytes. In many cases, the table fits into a single cache line or
+smaller, and the related operation can take advantage of SIMD instructions. In this implementation,
+we use a 32-byte table and 8-bit pattern. More specifically, it will set 8 bits in a 32-byte block,
+one bit in each 32-bit word.
+
+#### Algorithm
+In this blocked Bloom filter implementation, the algorithm use higher 32 bits from hash value in
+little endian order as index to select a block from bitset. The lower 32 bits of hash value along
+with eight SALT values are used to compute bit pattern to set bits. Multiply-shift[3] schema is used
+to construct the bit pattern as shown in following:
+
+```c
+// 8 SALT values used to compute bit pattern
+static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U,
+  0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
+
+// key: the lower 32 bits of hash result
+// mask: the output bit pattern for a tiny Bloom filter
+void Mask(uint32_t key, uint32_t mask[8]) {
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = key * SALT[i];
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = mask[i] >> 27;
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = 0x1U << mask[i];
+  }
+}
+
+```
+
+#### Hash Function
+The hash function used in this implementation is MurmurHash3[4] created by Austin Appleby, it
+yields a 32-bit or 128-bit value. When producing 128-bit values, the x86 platform and x64 platform
+yield different values as the optimization consideration. Here we use least significant 64 bits
+value from the little endian result of 128-bit version on x64 platform.
+
+
+#### Build a Bloom filter
+To build a blocked Bloom filter, it needs to specify the size of Bloom filter bitset. The optimal
+size of a Bloom filter can be calculated according to the number of column distinct values in a
+row group and an expected false positive probability value. The formula is shown as:
+
+```c
+// m: the size of blocked Bloom filter bitset
+// n: the number of distinct values of the column in a row group
+// p: the expected false positive probability value
+  m = -8 * n / log(1 - pow(p, 1.0 / 8));
+```
+
+#### File Format
+This implementation stores the Bloom filter data of column at the beginning of its column chunk
+in the row group. The column chunk metadata contains the Bloom filter offset.
+
+```
+struct ColumnMetaData {
+  ...
+  /** Byte offset from beginning of file **/
+  14: optional i64 bloom_filter_offset;
+}
+```
+The Bloom filter is stored with a header and followed bitset. The header is defined as below:
+```
+struct BloomFilterHeader {
+
+  /** The size of bitset in bytes, must be a power of 2 and larger than 32**/
+  1: required i32 numBytes;
+
+  /** The algorithm for setting bits. **/
+  2: required BloomFilterAlgorithm bloomFilterAlgorithm;
+
+  /** The hash function used for bloom filter. **/
+  3: required BloomFilterHash bloomFilterHash;
+}
+```
+### Reference
+1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
+2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
+3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
+4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
+
+
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 6c9011b9a..378aa4733 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -475,6 +475,7 @@ enum PageType {
   INDEX_PAGE = 1;
   DICTIONARY_PAGE = 2;
   DATA_PAGE_V2 = 3;
+  BLOOM_FILTER_PAGE = 4;
 }
 
 /**
@@ -554,6 +555,38 @@ struct DataPageHeaderV2 {
   8: optional Statistics statistics;
 }
 
+/** Block-based algorithm type annotation. **/
+struct SplitBlockAlgorithm {}
+/** The algorithm used in Bloom filter. **/
+union BloomFilterAlgorithm {
+  /** Block-based Bloom filter. **/
+  1: SplitBlockAlgorithm BLOCK;
+}
+/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher
+ * repo by Austin Appleby.
+ **/
+struct Murmur3 {}
+/**
+ * The hash function used in Bloom filter. This function takes the hash of a column value
+ * using plain encoding.
+ **/
+union BloomFilterHash {
+  /** Murmur3 Hash Strategy. **/
+  1: Murmur3 MURMUR3;
+}
+/**
+ * Bloom filter header is stored at beginning of Bloom filter data of each column
+ * and followed by its bitset.
+ **/
+struct BloomFilterPageHeader {
+  /** The size of bitset in bytes **/
+  1: required i32 numBytes;
+  /** The algorithm for setting bits. **/
+  2: required BloomFilterAlgorithm algorithm;
+  /** The hash function used for Bloom filter. **/
+  3: required BloomFilterHash hash;
+}
+
 struct PageHeader {
   /** the type of the page: indicates which of the *_header fields is set **/
   1: required PageType type
@@ -574,6 +607,7 @@ struct PageHeader {
   6: optional IndexPageHeader index_page_header;
   7: optional DictionaryPageHeader dictionary_page_header;
   8: optional DataPageHeaderV2 data_page_header_v2;
+  9: optional BloomFilterPageHeader bloom_filter_page_header;
 }
 
 /**
@@ -660,6 +694,9 @@ struct ColumnMetaData {
    * This information can be used to determine if all data pages are
    * dictionary encoded for example **/
   13: optional list<PageEncodingStats> encoding_stats;
+
+  /** Byte offset from beginning of file to Bloom filter data. **/
+  14: optional i64 bloom_filter_offset;
 }
 
 struct ColumnChunk {
From 910c6cbc7a319146957c8fe4823bf1e2fb9e1790 Mon Sep 17 00:00:00 2001
From: Jim Apple
Date: Sun, 7 Oct 2018 13:54:15 -0700
Subject: [PATCH 2/2] Grammar and structure tweaking for Bloom filter prose.

---
 BloomFilter.md | 132 ++++++++++++++++++++++---------------------------
 1 file changed, 59 insertions(+), 73 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index 01f177cba..be27aefb3 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -16,54 +16,59 @@
   - specific language governing permissions and limitations
   - under the License.
   -->
-
+
 Parquet Bloom Filter
 ===
 ### Problem statement
-In current format, statistic filter and dictionary filter are used for predicate pushdown. Statistic
-filter use min/max to filter out values not in range while it can not filter out value within range
-but not in set. Dictionary filter can effectively filter out value not in set but it maybe not
-enabled since dictionary encoding can be fall back to plain encoding when the overhead threshold
-is reached. Therefore, when performing predicate push down against a column with large cardinality,
-there is no effective filter with a high probability.
-
-A Bloom filter[1] is a compact data structure to indicate whether an element is a member of a set.
-It maintains a bitset initially sets to 0. Once an element is added to the set, it sets several
-related bits in bitset to 1. One can query element by checking all of the related bits value.
-If all of related bits are set to 1, it means this element is possibly exist in set, otherwise means
-the element is definitely not in set. Since the size of Bloom filter is compact and can be controlled
-through false positive rate, we can use it as an alternative filter to cover the case of large
-cardinality column.
+In their current format, column statistics and dictionaries can be used for predicate
+pushdown. Statistics include minimum and maximum value, which can be used to filter out
+values not in the range. Dictionaries are more specific, and readers can filter out values
+that are between min and max but not in the dictionary. However, when there are too many
+distinct values, writers sometimes choose not to add dictionaries because of the extra
+space they occupy. This leaves columns with large cardinalities and widely separated min
+and max without support for predicate pushdown.
+
+A Bloom filter[1] is a compact data structure that overapproximates a set. It can respond
+to membership queries with either "definitely no" or "probably yes", where the probability
+of false positives is configured when the filter is initialized. Bloom filters do not have
+false negatives.
+
+Because Bloom filters are small compared to dictionaries, they can be used for predicate
+pushdown even in columns with high cardinality and when space is at a premium.
 
 ### Goal
-* Add a Bloom filter utility which can be used in project.
-
-* Implement row group filter base on Bloom Filter. In particular, selective queries with predicate
-read Bloom filter data and evaluate predicate to determine whether to skip row group or not.
+* Enable predicate pushdown for high-cardinality columns while using less space than
+  dictionaries.
 
-* No additional I/O overhead when executing queries on other columns without Bloom filter enabled or
-non selective queries.
+* Induce no additional I/O overhead when executing queries on columns without Bloom
+  filters attached or when executing non-selective queries.
 
 ### Technical Approach
-The Bloom filter in Parquet is implemented using blocked Bloom filter algorithm from Putze et al.'s
-"Cache-, Hash- and Space-Efficient Bloom filters"[2]. Instead of setting bits by calculating index
-with different hash functions in standard Bloom filter, the blocked Bloom filter uses a single hash
-function to choose a precomputed pattern from a table (called a block or a tiny Bloom filter) of
-random k-bit pattern of width w bytes. In many cases, the table fits into a single cache line or
-smaller, and the related operation can take advantage of SIMD instructions. In this implementation,
-we use a 32-byte table and 8-bit pattern. More specifically, it will set 8 bits in a 32-byte block,
-one bit in each 32-bit word.
+The initial Bloom filter algorithm in Parquet is implemented using a combination of two
+Bloom filter techniques.
+
+First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and
+Space-Efficient Bloom filters"[2] is used. This divides a filter into many tiny Bloom
+filters, each one of which is called a "block". In Parquet's initial implementation, each
+block is 256 bits. When inserting or finding a value, part of the hash of that value is
+used to index into the array of blocks and pick a single one. This single block is then
+used for the remaining part of the operation.
+
+Second, within each block, this implementation uses the folklore split Bloom filter
+technique, as described in section 2.1 of "Network Applications of Bloom Filters: A
+Survey"[5]. This divides the 256 bits in each block up into eight contiguous 32-bit lanes
+and sets or checks one bit in each lane.
 
 #### Algorithm
-In this blocked Bloom filter implementation, the algorithm use higher 32 bits from hash value in
-little endian order as index to select a block from bitset. The lower 32 bits of hash value along
-with eight SALT values are used to compute bit pattern to set bits. Multiply-shift[3] schema is used
-to construct the bit pattern as shown in following:
+In the initial algorithm, the most significant 32 bits from the hash value are used as the
+index to select a block from bitset. The lower 32 bits of the hash value, along with eight
+constant salt values, are used to compute the bit to set in each lane of the block. The
+salt and lower 32 bits are combined using the multiply-shift[3] hash function:
 
 ```c
 // 8 SALT values used to compute bit pattern
-static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U,
-  0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
+static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
+  0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
 
 // key: the lower 32 bits of hash result
 // mask: the output bit pattern for a tiny Bloom filter
@@ -75,60 +80,41 @@ void Mask(uint32_t key, uint32_t mask[8]) {
     mask[i] = mask[i] >> 27;
   }
   for (int i = 0; i < 8; ++i) {
-    mask[i] = 0x1U << mask[i];
+    mask[i] = UINT32_C(1) << mask[i];
   }
 }
 
 ```
 
 #### Hash Function
-The hash function used in this implementation is MurmurHash3[4] created by Austin Appleby, it
-yields a 32-bit or 128-bit value. When producing 128-bit values, the x86 platform and x64 platform
-yield different values as the optimization consideration. Here we use least significant 64 bits
-value from the little endian result of 128-bit version on x64 platform.
-
+The function used to hash values in the initial implementation is MurmurHash3[4], using
+the least-significant 64 bits of the 128-bit version of the function on the x86-64
+platform. Note that the function produces different values on different architectures, so
+implementors must be careful to use the version specific to x86-64. That function can be
+emulated on different platforms without difficulty.
 
 #### Build a Bloom filter
-To build a blocked Bloom filter, it needs to specify the size of Bloom filter bitset. The optimal
-size of a Bloom filter can be calculated according to the number of column distinct values in a
-row group and an expected false positive probability value. The formula is shown as:
+The fact that exactly eight bits are checked during each lookup means that these filters
+are most space efficient when used with an expected false positive rate of about
+0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted
+into the filter.
+
+To calculate the size the filter should be for another false positive rate `p`, use the
+following formula. The output is in bits per distinct element:
 
 ```c
-// m: the size of blocked Bloom filter bitset
-// n: the number of distinct values of the column in a row group
-// p: the expected false positive probability value
-  m = -8 * n / log(1 - pow(p, 1.0 / 8));
+-8 / log(1 - pow(p, 1.0 / 8));
 ```
 
 #### File Format
-This implementation stores the Bloom filter data of column at the beginning of its column chunk
-in the row group. The column chunk metadata contains the Bloom filter offset.
-
-```
-struct ColumnMetaData {
-  ...
-  /** Byte offset from beginning of file **/
-  14: optional i64 bloom_filter_offset;
-}
-```
-The Bloom filter is stored with a header and followed bitset. The header is defined as below:
-```
-struct BloomFilterHeader {
-
-  /** The size of bitset in bytes, must be a power of 2 and larger than 32**/
-  1: required i32 numBytes;
-
-  /** The algorithm for setting bits. **/
-  2: required BloomFilterAlgorithm bloomFilterAlgorithm;
+The Bloom filter data of a column is stored at the beginning of its column chunk in the
+row group. The column chunk metadata contains the Bloom filter offset. The Bloom filter is
+stored with a header containing the size of the filter in bytes, the algorithm, and the
+hash function.
 
-  /** The hash function used for bloom filter. **/
-  3: required BloomFilterHash bloomFilterHash;
-}
-```
 ### Reference
 1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
 2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
 3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
 4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
-
-
+5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
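
The "Algorithm" section of BloomFilter.md gives only the `Mask` routine. As a rough illustration of how block selection, insertion, and lookup might fit around it, here is a minimal sketch. It is not part of either patch: `BlockOffset`, `BloomInsert`, and `BloomFind` are hypothetical names, the 64-bit `hash` argument is assumed to be the value described under "Hash Function", and reducing the upper 32 bits with a bit mask assumes the number of blocks is a power of two (the first patch's header comment requires a power-of-two size in bytes, but the reduction step itself is not spelled out).

```c
#include <stddef.h>
#include <stdint.h>

// Salt constants copied from the Mask() listing in BloomFilter.md.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                                 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Same computation as Mask() in the patch: multiply-shift picks one bit
// (position 0..31) in each of the eight 32-bit lanes.
static void Mask(uint32_t key, uint32_t mask[8]) {
  for (int i = 0; i < 8; ++i) {
    mask[i] = UINT32_C(1) << ((key * SALT[i]) >> 27);
  }
}

// Hypothetical helper: word offset of the block chosen by the upper 32 hash bits.
// ASSUMPTION: num_blocks is a power of two, so a bit mask is a valid reduction.
static size_t BlockOffset(uint64_t hash, uint32_t num_blocks) {
  uint32_t index = (uint32_t)(hash >> 32) & (num_blocks - 1);
  return (size_t)index * 8;  // each 256-bit block is eight 32-bit words
}

// Hypothetical insert: set one bit per lane in the selected block.
static void BloomInsert(uint32_t *bitset, uint32_t num_blocks, uint64_t hash) {
  uint32_t mask[8];
  Mask((uint32_t)hash, mask);  // lower 32 bits drive the bit pattern
  uint32_t *block = bitset + BlockOffset(hash, num_blocks);
  for (int i = 0; i < 8; ++i) {
    block[i] |= mask[i];
  }
}

// Hypothetical lookup: returns 0 for "definitely no", 1 for "probably yes".
static int BloomFind(const uint32_t *bitset, uint32_t num_blocks, uint64_t hash) {
  uint32_t mask[8];
  Mask((uint32_t)hash, mask);
  const uint32_t *block = bitset + BlockOffset(hash, num_blocks);
  for (int i = 0; i < 8; ++i) {
    if ((block[i] & mask[i]) == 0) {
      return 0;
    }
  }
  return 1;
}
```

A caller would allocate a zeroed array of `8 * num_blocks` words, call `BloomInsert` once per distinct column value hash while writing, and call `BloomFind` at read time to decide whether a row group can be skipped. Touching a single 32-byte block per operation is what keeps each insert or lookup within one cache line.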