From 6f068fce5d693869fdc40a71212fbdf99e4e3e87 Mon Sep 17 00:00:00 2001
From: "Chen, Junjie" <cjjnjust@gmail.com>
Date: Wed, 3 Oct 2018 14:16:57 +0800
Subject: [PATCH 1/2] PARQUET-41: Add Bloom filter

---
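Note: a minimal sketch of how the pieces described in BloomFilter.md could fit
together when inserting and looking up a 64-bit hash. SALT and Mask() follow the
document. BlockIndex, BlockInsert and BlockFind are hypothetical names, and
reducing the upper 32 bits to a block index with a power-of-two bit mask is an
assumption, not something the patch specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// SALT values copied from BloomFilter.md.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                                 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// One bit per 32-bit lane, computed as in Mask() from BloomFilter.md.
static void Mask(uint32_t key, uint32_t mask[8]) {
  for (int i = 0; i < 8; ++i) {
    mask[i] = UINT32_C(1) << ((key * SALT[i]) >> 27);
  }
}

// ASSUMPTION: num_blocks is a power of two, so the upper 32 bits of the hash
// can be reduced to a block index with a bit mask.
static uint32_t BlockIndex(uint64_t hash, uint32_t num_blocks) {
  assert(num_blocks != 0 && (num_blocks & (num_blocks - 1)) == 0);
  return (uint32_t)(hash >> 32) & (num_blocks - 1);
}

// Hypothetical helper: set the eight bits for `hash` in its 256-bit block.
static void BlockInsert(uint32_t *bitset, uint32_t num_blocks, uint64_t hash) {
  uint32_t mask[8];
  Mask((uint32_t)hash, mask);  // lower 32 bits choose the bit in each lane
  uint32_t *block = bitset + 8 * (size_t)BlockIndex(hash, num_blocks);
  for (int i = 0; i < 8; ++i) {
    block[i] |= mask[i];
  }
}

// Hypothetical helper: true means "probably present", false means
// "definitely absent".
static bool BlockFind(const uint32_t *bitset, uint32_t num_blocks, uint64_t hash) {
  uint32_t mask[8];
  Mask((uint32_t)hash, mask);
  const uint32_t *block = bitset + 8 * (size_t)BlockIndex(hash, num_blocks);
  for (int i = 0; i < 8; ++i) {
    if ((block[i] & mask[i]) == 0) {
      return false;
    }
  }
  return true;
}
```

With a zeroed bitset of four blocks (32 words), BlockFind returns false for any
hash until BlockInsert has been called with it, after which it returns true.
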
 BloomFilter.md                 | 134 +++++++++++++++++++++++++++++++++
 src/main/thrift/parquet.thrift |  37 +++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 BloomFilter.md

diff --git a/BloomFilter.md b/BloomFilter.md
new file mode 100644
index 000000000..01f177cba
--- /dev/null
+++ b/BloomFilter.md
@@ -0,0 +1,134 @@
+ <!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+  
+Parquet Bloom Filter
+===
+### Problem statement
+In current format, statistic filter and dictionary filter are used for predicate pushdown. Statistic
+filter use min/max to filter out values not in range while it can not filter out value within range
+but not in set. Dictionary filter can effectively filter out value not in set but it maybe not
+enabled since dictionary encoding can be fall back to plain encoding when the overhead threshold
+is reached. Therefore, when performing predicate push down against a column with large cardinality,
+there is no effective filter with a high probability.
+
+A Bloom filter[1] is a compact data structure to indicate whether an element is a member of a set.
+It maintains a bitset initially sets to 0. Once an element is added to the set, it sets several
+related bits in bitset to 1. One can query element by checking all of the related bits value.
+If all of related bits are set to 1, it means this element is possibly exist in set, otherwise means
+the element is definitely not in set. Since the size of Bloom filter is compact and can be controlled
+through false positive rate, we can use it as an alternative filter to cover the case of large
+cardinality column.
+
+### Goal
+* Add a Bloom filter utility which can be used in project.
+ 
+* Implement row group filter base on Bloom Filter. In particular, selective queries with predicate
+read Bloom filter data and evaluate predicate to determine whether to skip row group or not.
+
+* No additional I/O overhead when executing queries on other columns without Bloom filter enabled or
+non selective queries.
+
+### Technical Approach
+The Bloom filter in Parquet is implemented using blocked Bloom filter algorithm from Putze et al.'s
+"Cache-, Hash- and Space-Efficient Bloom filters"[2]. Instead of setting bits by calculating index
+with different hash functions in standard Bloom filter, the blocked Bloom filter uses a single hash
+function to choose a precomputed pattern from a table (called a block or a tiny Bloom filter) of
+random k-bit pattern of width w bytes. In many cases, the table fits into a single cache line or
+smaller, and the related operation can take advantage of SIMD instructions. In this implementation,
+we use a 32-byte table and 8-bit pattern. More specifically, it will set 8 bits in a 32-byte block,
+one bit in each 32-bit word.
+
+#### Algorithm
+In this blocked Bloom filter implementation, the algorithm use higher 32 bits from hash value in
+little endian order as index to select a block from bitset. The lower 32 bits of hash value along
+with eight SALT values are used to compute bit pattern to set bits. Multiply-shift[3] schema is used
+to construct the bit pattern as shown in following:
+
+```c
+// 8 SALT values used to compute bit pattern
+static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U,
+ 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
+
+// key: the lower 32 bits of hash result
+// mask: the output bit pattern for a tiny Bloom filter
+void Mask(uint32_t key, uint32_t mask[8]) {
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = key * SALT[i];
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = mask[i] >> 27;
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = 0x1U << mask[i];
+  }
+}
+
+```
+
+#### Hash Function
+The hash function used in this implementation is MurmurHash3[4] created by Austin Appleby, it
+yields a 32-bit or 128-bit value. When producing 128-bit values, the x86 platform and x64 platform
+yield different values as the optimization consideration. Here we use least significant 64 bits
+value from the little endian result of 128-bit version on x64 platform.
+
+
+#### Build a Bloom filter
+To build a blocked Bloom filter, it needs to specify the size of Bloom filter bitset. The optimal
+size of a Bloom filter can be calculated according to the number of column distinct values in a
+row group and an expected false positive probability value. The formula is shown as:
+
+```c
+// m: the size of blocked Bloom filter bitset
+// n: the number of distinct values of the column in a row group
+// p: the expected false positive probability value
+		m = -8 * n / log(1 - pow(p, 1.0 / 8));
+```
+
+#### File Format
+This implementation stores the Bloom filter data of column at the beginning of its column chunk
+in the row group. The column chunk metadata contains the Bloom filter offset.
+
+```
+struct ColumnMetaData {
+  ...
+  /** Byte offset from beginning of file **/
+  14: optional i64 bloom_filter_offset;
+}
+```
+The Bloom filter is stored with a header and followed bitset. The header is defined as below:
+```
+struct BloomFilterHeader {
+
+  /** The size of bitset in bytes, must be a  power of 2 and larger than 32**/
+  1: required i32 numBytes;
+
+  /** The algorithm for setting bits. **/
+  2: required BloomFilterAlgorithm bloomFilterAlgorithm;
+
+  /** The hash function used for bloom filter. **/
+  3: required BloomFilterHash bloomFilterHash;
+}
+```
+### Reference
+1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
+2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
+3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
+4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
+
+
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 6c9011b9a..378aa4733 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -475,6 +475,7 @@ enum PageType {
   INDEX_PAGE = 1;
   DICTIONARY_PAGE = 2;
   DATA_PAGE_V2 = 3;
+  BLOOM_FILTER_PAGE = 4;
 }
 
 /**
@@ -554,6 +555,38 @@ struct DataPageHeaderV2 {
   8: optional Statistics statistics;
 }
 
+/** Block-based algorithm type annotation. **/
+struct SplitBlockAlgorithm {}
+/** The algorithm used in Bloom filter. **/
+union BloomFilterAlgorithm {
+  /** Block-based Bloom filter. **/
+  1: SplitBlockAlgorithm BLOCK;
+}
+/** Hash strategy type annotation. It uses MurmurHash3_x64_128 from the original SMHasher
+ * repo by Austin Appleby.
+ **/
+struct Murmur3 {}
+/**
+ * The hash function used in Bloom filter. This function takes the hash of a column value
+ * using plain encoding.
+ **/
+union BloomFilterHash {
+  /** Murmur3 Hash Strategy. **/
+  1: Murmur3 MURMUR3;
+}
+/**
+  * The Bloom filter header is stored at the beginning of the Bloom filter data of each
+  * column and is followed by its bitset.
+  **/
+struct BloomFilterPageHeader {
+  /** The size of the bitset in bytes. **/
+  1: required i32 numBytes;
+  /** The algorithm for setting bits. **/
+  2: required BloomFilterAlgorithm algorithm;
+  /** The hash function used for Bloom filter. **/
+  3: required BloomFilterHash hash;
+}
+
 struct PageHeader {
   /** the type of the page: indicates which of the *_header fields is set **/
   1: required PageType type
@@ -574,6 +607,7 @@ struct PageHeader {
   6: optional IndexPageHeader index_page_header;
   7: optional DictionaryPageHeader dictionary_page_header;
   8: optional DataPageHeaderV2 data_page_header_v2;
+  9: optional BloomFilterPageHeader bloom_filter_page_header;
 }
 
 /**
@@ -660,6 +694,9 @@ struct ColumnMetaData {
    * This information can be used to determine if all data pages are
    * dictionary encoded for example **/
   13: optional list<PageEncodingStats> encoding_stats;
+
+  /** Byte offset from beginning of file to Bloom filter data. **/
+  14: optional i64 bloom_filter_offset;
 }
 
 struct ColumnChunk {

From 910c6cbc7a319146957c8fe4823bf1e2fb9e1790 Mon Sep 17 00:00:00 2001
From: Jim Apple <jbapple-parquet@apache.org>
Date: Sun, 7 Oct 2018 13:54:15 -0700
Subject: [PATCH 2/2] Grammar and structure tweaking for Bloom filter prose.

---
 BloomFilter.md | 132 ++++++++++++++++++++++---------------------------
 1 file changed, 59 insertions(+), 73 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index 01f177cba..be27aefb3 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -16,54 +16,59 @@
   - specific language governing permissions and limitations
   - under the License.
   -->
-  
+
 Parquet Bloom Filter
 ===
 ### Problem statement
-In current format, statistic filter and dictionary filter are used for predicate pushdown. Statistic
-filter use min/max to filter out values not in range while it can not filter out value within range
-but not in set. Dictionary filter can effectively filter out value not in set but it maybe not
-enabled since dictionary encoding can be fall back to plain encoding when the overhead threshold
-is reached. Therefore, when performing predicate push down against a column with large cardinality,
-there is no effective filter with a high probability.
-
-A Bloom filter[1] is a compact data structure to indicate whether an element is a member of a set.
-It maintains a bitset initially sets to 0. Once an element is added to the set, it sets several
-related bits in bitset to 1. One can query element by checking all of the related bits value.
-If all of related bits are set to 1, it means this element is possibly exist in set, otherwise means
-the element is definitely not in set. Since the size of Bloom filter is compact and can be controlled
-through false positive rate, we can use it as an alternative filter to cover the case of large
-cardinality column.
+In the current format, column statistics and dictionaries can be used for predicate
+pushdown. Statistics include minimum and maximum values, which can be used to filter out
+values not in the range. Dictionaries are more specific, and readers can filter out values
+that are between min and max but not in the dictionary. However, when there are too many
+distinct values, writers sometimes choose not to add dictionaries because of the extra
+space they occupy. This leaves columns with large cardinalities and widely separated min
+and max without support for predicate pushdown.
+
+A Bloom filter[1] is a compact data structure that overapproximates a set. It can respond
+to membership queries with either "definitely no" or "probably yes", where the probability
+of false positives is configured when the filter is initialized. Bloom filters do not have
+false negatives.
+
+Because Bloom filters are small compared to dictionaries, they can be used for predicate
+pushdown even in columns with high cardinality and when space is at a premium.
 
 ### Goal
-* Add a Bloom filter utility which can be used in project.
- 
-* Implement row group filter base on Bloom Filter. In particular, selective queries with predicate
-read Bloom filter data and evaluate predicate to determine whether to skip row group or not.
+* Enable predicate pushdown for high-cardinality columns while using less space than
+  dictionaries.
 
-* No additional I/O overhead when executing queries on other columns without Bloom filter enabled or
-non selective queries.
+* Induce no additional I/O overhead when executing queries on columns without Bloom
+  filters attached or when executing non-selective queries.
 
 ### Technical Approach
-The Bloom filter in Parquet is implemented using blocked Bloom filter algorithm from Putze et al.'s
-"Cache-, Hash- and Space-Efficient Bloom filters"[2]. Instead of setting bits by calculating index
-with different hash functions in standard Bloom filter, the blocked Bloom filter uses a single hash
-function to choose a precomputed pattern from a table (called a block or a tiny Bloom filter) of
-random k-bit pattern of width w bytes. In many cases, the table fits into a single cache line or
-smaller, and the related operation can take advantage of SIMD instructions. In this implementation,
-we use a 32-byte table and 8-bit pattern. More specifically, it will set 8 bits in a 32-byte block,
-one bit in each 32-bit word.
+The initial Bloom filter algorithm in Parquet is implemented using a combination of two
+Bloom filter techniques.
+
+First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and
+Space-Efficient Bloom Filters"[2] is used. This divides a filter into many tiny Bloom
+filters, each one of which is called a "block". In Parquet's initial implementation, each
+block is 256 bits. When inserting or finding a value, part of the hash of that value is
+used to index into the array of blocks and pick a single one. This single block is then
+used for the remaining part of the operation.
+
+Second, within each block, this implementation uses the folklore split Bloom filter
+technique, as described in section 2.1 of "Network Applications of Bloom Filters: A
+Survey"[5]. This divides the 256 bits in each block up into eight contiguous 32-bit lanes
+and sets or checks one bit in each lane.
 
 #### Algorithm
-In this blocked Bloom filter implementation, the algorithm use higher 32 bits from hash value in
-little endian order as index to select a block from bitset. The lower 32 bits of hash value along
-with eight SALT values are used to compute bit pattern to set bits. Multiply-shift[3] schema is used
-to construct the bit pattern as shown in following:
+In the initial algorithm, the most significant 32 bits of the hash value are used as the
+index to select a block from the bitset. The least significant 32 bits, along with eight
+constant salt values, are used to compute the bit to set in each lane of the block. Each
+salt is combined with those 32 bits using the multiply-shift[3] hash function:
 
 ```c
 // 8 SALT values used to compute bit pattern
-static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, 0x705495c7U,
- 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
+static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
+  0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
 
 // key: the lower 32 bits of hash result
 // mask: the output bit pattern for a tiny Bloom filter
@@ -75,60 +80,41 @@ void Mask(uint32_t key, uint32_t mask[8]) {
     mask[i] = mask[i] >> 27;
   }
   for (int i = 0; i < 8; ++i) {
-    mask[i] = 0x1U << mask[i];
+    mask[i] = UINT32_C(1) << mask[i];
   }
 }
 
 ```
 
 #### Hash Function
-The hash function used in this implementation is MurmurHash3[4] created by Austin Appleby, it
-yields a 32-bit or 128-bit value. When producing 128-bit values, the x86 platform and x64 platform
-yield different values as the optimization consideration. Here we use least significant 64 bits
-value from the little endian result of 128-bit version on x64 platform.
-
+The function used to hash values in the initial implementation is MurmurHash3[4], using
+the least-significant 64 bits of the 128-bit version of the function on the x86-64
+platform. Note that the function produces different values on different architectures, so
+implementors must be careful to use the version specific to x86-64. That function can be
+emulated on different platforms without difficulty.
 
 #### Build a Bloom filter
-To build a blocked Bloom filter, it needs to specify the size of Bloom filter bitset. The optimal
-size of a Bloom filter can be calculated according to the number of column distinct values in a
-row group and an expected false positive probability value. The formula is shown as:
+The fact that exactly eight bits are checked during each lookup means that these filters
+are most space efficient when used with an expected false positive rate of about
+0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted
+into the filter.
+
+To calculate the size the filter should be for another false positive rate `p`, use the
+following formula. The output is in bits per distinct element:
 
 ```c
-// m: the size of blocked Bloom filter bitset
-// n: the number of distinct values of the column in a row group
-// p: the expected false positive probability value
-		m = -8 * n / log(1 - pow(p, 1.0 / 8));
+-8 / log(1 - pow(p, 1.0 / 8));
 ```
 
 #### File Format
-This implementation stores the Bloom filter data of column at the beginning of its column chunk
-in the row group. The column chunk metadata contains the Bloom filter offset.
-
-```
-struct ColumnMetaData {
-  ...
-  /** Byte offset from beginning of file **/
-  14: optional i64 bloom_filter_offset;
-}
-```
-The Bloom filter is stored with a header and followed bitset. The header is defined as below:
-```
-struct BloomFilterHeader {
-
-  /** The size of bitset in bytes, must be a  power of 2 and larger than 32**/
-  1: required i32 numBytes;
-
-  /** The algorithm for setting bits. **/
-  2: required BloomFilterAlgorithm bloomFilterAlgorithm;
+The Bloom filter data of a column is stored at the beginning of its column chunk in the
+row group. The column chunk metadata contains the Bloom filter offset. The Bloom filter is
+stored with a header containing the size of the filter in bytes, the algorithm, and the
+hash function.
 
-  /** The hash function used for bloom filter. **/
-  3: required BloomFilterHash bloomFilterHash;
-}
-```
 ### Reference
 1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
 2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
 3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
 4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
-
-
+5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)