Skip to content

chore: update datafusion and related crates#1635

Merged
nitisht merged 1 commit into
parseablehq:mainfrom
nikhilsinhaparseable:df-update
May 1, 2026
Merged

chore: update datafusion and related crates#1635
nitisht merged 1 commit into
parseablehq:mainfrom
nikhilsinhaparseable:df-update

Conversation

@nikhilsinhaparseable
Copy link
Copy Markdown
Contributor

@nikhilsinhaparseable nikhilsinhaparseable commented May 1, 2026

this is to fix a bug reported in DF 51.0.0
apache/datafusion#19086

this impacts approx percentile queries approx_percentile_cont(..., 0.999) triggers the panic

Summary by CodeRabbit

Release Notes

  • Chores

    • Upgraded Arrow to 58.0.0, DataFusion to 53.0.0, parquet to 58.0.0, and object_store to 0.13.1.
  • Refactor

    • Updated parquet writer row group configuration and storage layer APIs.

this is to fix a bug reported in DF 51.0.0
apache/datafusion#19086

this impacts approx percentile queries
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 1, 2026

Walkthrough

The PR upgrades Arrow/DataFusion dependencies (arrow to 58.0.0, datafusion to 53.0.0, object_store to 0.13.1) and adapts the codebase to breaking API changes, including new trait imports, updated parquet writer APIs, refactored metrics instrumentation, and adjusted statistics configuration.

Changes

Cohort / File(s) Summary
Dependency Management
Cargo.toml
Upgraded arrow/datafusion ecosystem: arrow, arrow-*, parquet to 58.0.0; datafusion to 53.0.0; object_store to 0.13.1.
ObjectStore Trait Updates
src/hottier.rs, src/storage/azure_blob.rs, src/storage/gcs.rs, src/storage/s3.rs
Added ObjectStoreExt import to enable extension trait methods for ObjectStore clients across storage implementations.
Parquet & Query Configuration
src/parseable/streams.rs, src/query/stream_schema_provider.rs
Updated parquet writer to use set_max_row_group_row_count() instead of set_max_row_group_size(); refactored ParquetSource initialization to accept explicit schema and adjusted FileScanConfigBuilder API; extended ColumnStatistics with byte_size precision tracking.
Metrics Instrumentation Refactoring
src/storage/metrics_layer.rs
Removed instrumentation for legacy ObjectStore methods (put, get, head, delete, copy, rename); updated to options-based API (copy_opts, rename_opts); adjusted remaining instrumentation labels (GET_OPTS → GET, PUT_OPTS → PUT); updated imports for CopyOptions/RenameOptions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • parseablehq/parseable#1221: Modifies stream_schema_provider.rs for per-column statistics configuration and coordinates Arrow/DataFusion version updates.

Suggested labels

for next release

Suggested reviewers

  • nitisht
  • de-sh

Poem

🐰 Dependencies dance, Arrow takes flight,
ObjectStore traits now shimmer so bright,
Row groups reshape with newfound precision,
Metrics refactored with careful revision,
Upgrade complete—let parseable ignite! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description explains the bug fix motivation but lacks the structured sections and testing/documentation checkboxes required by the repository template. Restructure the description to include the Description section, rationale, key changes summary, and complete the testing/documentation checklist from the template.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main objective of the PR, which is updating DataFusion and related dependency crates to fix a bug.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 4/5 reviews remaining, refill in 12 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
Cargo.toml (1)

12-29: ⚡ Quick win

Add a regression test for the reported approx_percentile_cont(..., 0.999) panic.

This PR’s user-visible fix is entirely dependency-driven, so without a targeted query test there’s nothing here that proves the panic is gone or keeps it from coming back on the next crate bump. A small end-to-end query test around the exact percentile value from the issue would make this upgrade much safer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Cargo.toml` around lines 12 - 29, Add an end-to-end regression test that runs
a SQL query invoking approx_percentile_cont(..., 0.999) to ensure the panic no
longer occurs; create a new test (e.g., tests/regress_percentile.rs or in an
existing query integration test module) that builds a small table of numeric
values, executes a query using approx_percentile_cont(column, 0.999), asserts
the query completes without panicking and verifies the returned percentile value
(or at least that the result is Some/NotNull); ensure the test runs with the
current dependency set (DataFusion/arrow/parquet) and is included in CI so
future crate bumps will exercise approx_percentile_cont at the 0.999 edge case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@Cargo.toml`:
- Around line 12-29: Add an end-to-end regression test that runs a SQL query
invoking approx_percentile_cont(..., 0.999) to ensure the panic no longer
occurs; create a new test (e.g., tests/regress_percentile.rs or in an existing
query integration test module) that builds a small table of numeric values,
executes a query using approx_percentile_cont(column, 0.999), asserts the query
completes without panicking and verifies the returned percentile value (or at
least that the result is Some/NotNull); ensure the test runs with the current
dependency set (DataFusion/arrow/parquet) and is included in CI so future crate
bumps will exercise approx_percentile_cont at the 0.999 edge case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4edfca8c-fb35-4775-a4b4-2a04936c28e8

📥 Commits

Reviewing files that changed from the base of the PR and between ebc48eb and ec9da08.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml
  • src/hottier.rs
  • src/parseable/streams.rs
  • src/query/stream_schema_provider.rs
  • src/storage/azure_blob.rs
  • src/storage/gcs.rs
  • src/storage/metrics_layer.rs
  • src/storage/s3.rs

@nitisht nitisht merged commit 001215b into parseablehq:main May 1, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants