Commit d5ed6b9

adriangb, BugenZhao, XiangpengHao, Rachelint, and Jesse-Bakker authored
Add ThriftMetadataWriter for writing Parquet metadata (#6197)
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
* fix example tests
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
---------
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043)
* deprecate auto copy, ask explicit reference
* update comments
* make cargo doc happy
* Make display of interval types more pretty (#6006)
* improve dispaly for interval.
* update test in pretty, and fix display problem.
* tmp
* fix tests in arrow-cast.
* fix tests in pretty.
* fix style.
* Update snafu (#5930)
* Update Parquet thrift generated structures (#6045)
* update to latest thrift (as of 11 Jul 2024) from parquet-format
* pass None for optional size statistics
* escape HTML tags
* don't need to escape brackets in arrays
* Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933)
This reverts commit 22e0b44.
* Revert "Update snafu (#5930)" (#6069)
This reverts commit 756b1fb.
* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)
* Update pyo3 requirement from 0.21.1 to 0.22.1
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)
---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
* refactor: remove deprecated `FromPyArrow::from_pyarrow`
"GIL Refs" are being phased out.
* chore: update `pyo3` in integration tests
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* remove repeated codes to make the codes more concise. (#6080)
* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)
* update to latest thrift (as of 11 Jul 2024) from parquet-format
* pass None for optional size statistics
* escape HTML tags
* don't need to escape brackets in arrays
* add support for unencoded_byte_array_data_bytes
* add comments
* change sig of ColumnMetrics::update_variable_length_bytes()
* rename ParquetOffsetIndex to OffsetSizeIndex
* rename some functions
* suggestion from review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* add Default trait to ColumnMetrics as suggested in review
* rename OffsetSizeIndex to OffsetIndexMetaData
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)
---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)
* deprecate read_page_locations
* add to_thrift() to OffsetIndexMetaData
* Update parquet/src/column/writer/mod.rs
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
* Upgrade protobuf definitions to flightsql 17.0 (#6133)
* Update FlightSql.proto to version 17.0
Adds new message CommandStatementIngest and removes `experimental` from other messages.
* Regenerate flight sql protocol
This upgrades the file to version 17.0 of the protobuf definition.
* Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata`
* fix loading in test by etseidl
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
* add rough equivalence test
* one more check
* make clippy happy
* separate tests that require arrow into a separate module
* add histograms to to_thrift()
---------
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Douglas Anderson <djanderson@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@live.com>
1 parent a235b9b commit d5ed6b9

3 files changed: +574 -98 lines changed

parquet/src/file/page_index/index.rs

Lines changed: 47 additions & 0 deletions
@@ -225,6 +225,53 @@ impl<T: ParquetValueType> NativeIndex<T> {
             boundary_order: index.boundary_order,
         })
     }
+
+    pub(crate) fn to_thrift(&self) -> ColumnIndex {
+        let min_values = self
+            .indexes
+            .iter()
+            .map(|x| x.min_bytes().map(|x| x.to_vec()))
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);
+
+        let max_values = self
+            .indexes
+            .iter()
+            .map(|x| x.max_bytes().map(|x| x.to_vec()))
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![vec![]; self.indexes.len()]);
+
+        let null_counts = self
+            .indexes
+            .iter()
+            .map(|x| x.null_count())
+            .collect::<Option<Vec<_>>>();
+
+        // Concatenate page histograms into a single Option<Vec>
+        let repetition_level_histograms = self
+            .indexes
+            .iter()
+            .map(|x| x.repetition_level_histogram().map(|v| v.values()))
+            .collect::<Option<Vec<&[i64]>>>()
+            .map(|hists| hists.concat());
+
+        let definition_level_histograms = self
+            .indexes
+            .iter()
+            .map(|x| x.definition_level_histogram().map(|v| v.values()))
+            .collect::<Option<Vec<&[i64]>>>()
+            .map(|hists| hists.concat());
+
+        ColumnIndex::new(
+            self.indexes.iter().map(|x| x.min().is_none()).collect(),
+            min_values,
+            max_values,
+            self.boundary_order,
+            null_counts,
+            repetition_level_histograms,
+            definition_level_histograms,
+        )
+    }
 }

 #[cfg(test)]
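
The `to_thrift` method above leans on collecting an iterator of `Option` values into `Option<Vec<_>>`, which yields `None` as soon as any page lacks a value. The sketch below isolates that pattern using only the standard library. `PageStats` and the four helper functions are hypothetical stand-ins invented for illustration; they are not types or functions from the `parquet` crate, and the real `NativeIndex` carries more fields.

```rust
/// Hypothetical, simplified stand-in for one per-page index entry.
/// Not a type from the `parquet` crate.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    pub min: Option<Vec<u8>>,
    pub null_count: Option<i64>,
    pub histogram: Option<Vec<i64>>,
}

/// All-or-nothing collection: if every page has a min, return them all;
/// if any page is missing one, fall back to one empty byte string per page.
pub fn min_values(pages: &[PageStats]) -> Vec<Vec<u8>> {
    pages
        .iter()
        .map(|p| p.min.clone())
        .collect::<Option<Vec<_>>>()
        .unwrap_or_else(|| vec![vec![]; pages.len()])
}

/// Stays `None` when any page has no null count.
pub fn null_counts(pages: &[PageStats]) -> Option<Vec<i64>> {
    pages.iter().map(|p| p.null_count).collect()
}

/// Flattens per-page histograms into a single vector, or `None` when
/// any page has no histogram.
pub fn concat_histograms(pages: &[PageStats]) -> Option<Vec<i64>> {
    pages
        .iter()
        .map(|p| p.histogram.as_deref())
        .collect::<Option<Vec<&[i64]>>>()
        .map(|hists| hists.concat())
}

/// One flag per page: true when the page has no min value.
pub fn null_pages(pages: &[PageStats]) -> Vec<bool> {
    pages.iter().map(|p| p.min.is_none()).collect()
}
```

One consequence visible in this sketch, and in the diff as written: a single page without a min value replaces the whole `min_values` list with empty byte strings, while `null_pages` still records which individual pages were missing a value.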
