DataFusion 52 release post #135
Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com>
Thanks @mbutrovich -- any additional context / suggestions you have on the sort merge join improvement would be most appreciated
(this is on my list, but I am struggling to find time to finish it -- hopefully after CIDR / Thursday)
…sion-site into site/datafusion_52
| --- | ||
| layout: post | ||
| title: Apache DataFusion 52.0.0 Released | ||
| date: 2026-01-08 |
According to:
- https://lists.apache.org/thread/gt29yg6wxzx82s87drwq1xb06yhs16y6
- https://crates.io/crates/datafusion/52.0.0
| date: 2026-01-08 | |
| date: 2026-01-12 |
Thanks -- I think in the past we have dated the blog posts based on when the post was released rather than when the software was 🤔
| changes is available in the [changelog]. Thanks to the [121 contributors] for | ||
| making this release possible. | ||
|
|
||
| TODO: confirm the release date for 52.0.0 and update the front matter if needed. |
| TODO: confirm the release date for 52.0.0 and update the front matter if needed. |
| TODO: confirm the release date for 52.0.0 and update the front matter if needed. | ||
|
|
||
| [DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0 | ||
| [DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/ |
TODO
| explained in the [Extending SQL in DataFusion Blog]. With this new API, you can | ||
| customize DataFusion to support almost any SQL syntax, such as the following | ||
| (which are not supported by default): |
I feel that this is slightly misleading: it reads as if the RelationPlanner is what now allows extending expressions and types (and relations). Maybe something like:
In addition to the existing expression and types extension points, this new API now allows extending FROM clauses, leading DataFusion to support almost any SQL syntax, such as the following (which are not supported by default):
But reworded to be less of a run-on sentence...
| [Apache Comet]: https://datafusion.apache.org/comet/ | ||
| [mbutrovich]: https://github.com/mbutrovich | ||
|
|
||
| ### Rewritten merge join |
This section title looks very similar to the previous one. The start of the first sentence is also identical. Maybe a title that differentiates this section more from the previous one (e.g. "Optimised Output Handling of Merge Join") would be clearer.
| Starting in DataFusion 51, filtering information from `HashJoinExec` is passed | ||
| dynamically to scans, as explained in the [Dynamic Filtering Blog] using a | ||
| technique referred to as [Sideways Information Passing] in Database research | ||
| literature. The initial implementation passed min/max values for the join keys. | ||
| DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the | ||
| build size is small such as when the join is very selective. The `IN` list is | ||
| pushed down to the probe side scan and is used to prune files, row groups, and | ||
| individual rows. Thanks to [adriangb] for implementing this feature, with | ||
| reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. |
We also push down references to the hash table itself when InList is too big.
The main advantage of InList (which I think should be mentioned here) is that it can participate in statistics pruning.
| Starting in DataFusion 51, filtering information from `HashJoinExec` is passed | |
| dynamically to scans, as explained in the [Dynamic Filtering Blog] using a | |
| technique referred to as [Sideways Information Passing] in Database research | |
| literature. The initial implementation passed min/max values for the join keys. | |
| DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the | |
| build size is small, such as when the join is very selective, or a reference to the build side hash map when the build side is larger. | |
| These new expressions are pushed down to the probe side scan and are used to prune files, row groups, and | |
| individual rows. | |
| When the build side is small enough (<= 20 rows, configurable), the pushed down filters can even participate in statistics pruning, which avoids reading the join keys from row groups that cannot match. | |
| Thanks to [adriangb] for implementing this feature, with | |
| reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich]. |
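To make the suggested wording concrete, here is a minimal Python sketch of the idea. It is not DataFusion code: the names `build_dynamic_filter`, `row_group_may_match`, `filter_rows`, `RowGroupStats`, and `MAX_IN_LIST_ROWS` are all hypothetical, the threshold of 20 simply mirrors the "<= 20 rows" remark above, and the min/max bounds from DataFusion 51 are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical threshold, mirroring the "<= 20 rows, configurable" remark.
MAX_IN_LIST_ROWS = 20


@dataclass
class RowGroupStats:
    """Min/max statistics for the join key column of one row group."""
    min_key: int
    max_key: int


def build_dynamic_filter(build_keys, max_in_list_rows=MAX_IN_LIST_ROWS):
    """Choose the filter to push to the probe side scan.

    Small build sides become an explicit IN list; larger ones become an
    opaque membership test (standing in for a hash table reference).
    """
    keys = list(build_keys)
    if len(keys) <= max_in_list_rows:
        return ("in_list", sorted(set(keys)))
    return ("hash_lookup", set(keys))


def row_group_may_match(kind, payload, stats):
    """Statistics pruning: can this row group contain a matching key?"""
    if kind == "in_list":
        # An IN list can be compared against min/max statistics.
        return any(stats.min_key <= key <= stats.max_key for key in payload)
    # An opaque lookup cannot be checked against statistics in this sketch,
    # so the row group is conservatively kept.
    return True


def filter_rows(payload, probe_keys):
    """Row-level filtering: keep only probe keys present on the build side."""
    members = set(payload)
    return [key for key in probe_keys if key in members]
```

In this sketch only the `in_list` form can skip a row group without reading its join keys, which is the advantage of the `IN` list called out in the comment above. Both forms still filter individual rows.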
nuno-faria
left a comment
I was looking at the changelog and this PR caught my attention: apache/datafusion#18644. Maybe it could be worth a mention as well.
|
|
||
| This release also includes several additional caching improvements. | ||
|
|
||
| A new statistics cache for Parquet Metadata avoids repeatedly (re)calculating |
nit: maybe "Parquet Metadata" -> "File Metadata"? Since there is also a separate cache for the Parquet metadata itself.
- 52.0.0 release datafusion#19691
- 52.0.0 (Dec 2025 / Jan 2026) datafusion#18566

This is a draft of the DataFusion 52 release post
See rendered preview: https://datafusion.staged.apache.org/blog/2026/01/08/datafusion-52.0.0/
This was initially created using coded. Commands below
Details
We are going to write a blog post for the DataFusion 52.0.0 release
We need to cover the major features in this release. If you are unsure of any content, please leave a "TODO" note in the text and we can fill it in
later.
Please start with a copy of the previous post as a starting point: content/blog/2025-11-25-datafusion-51.0.0.md and update as needed.
The changelog is here: https://github.com/xudong963/arrow-datafusion/blob/update_version/dev/changelog/52.0.0.md
The list of major features can be found in apache/datafusion#18566 under the section "Features to mention in the blog
(if they make it)". Only include the ones that made it into the release, with a checkmark.
Please
example where possible.