Conversation

@alamb
Contributor

@alamb alamb commented Jan 8, 2026

This is a draft of the DataFusion 52 release post

See rendered preview: https://datafusion.staged.apache.org/blog/2026/01/08/datafusion-52.0.0/

This was initially created using coded. Commands below

Details

We are going to write a blog post for the DataFusion 52.0.0 release

We need to cover the major features in this release. If you are unsure of any content, please leave a "TODO" note in the text and we can fill it in
later.

Please start with a copy of the previous post as a starting point: content/blog/2025-11-25-datafusion-51.0.0.md and update as needed.

The changelog is here: https://github.com/xudong963/arrow-datafusion/blob/update_version/dev/changelog/52.0.0.md

The list of major features can be found in apache/datafusion#18566 under the section "Features to mention in the blog
(if they make it)". Only include the ones that made it into the release, with a checkmark.

Please

  • write a blog post
  • leave a section for performance chart which we can fill in later
  • include a section for each major feature, summarizing what it is and why it is important, and the related PRs. Please try to include a diagram or
    example where possible.

Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com>
@alamb
Contributor Author

alamb commented Jan 20, 2026

Thanks @mbutrovich -- any additional context / suggestions you have on the sort merge join improvement would be most appreciated

@alamb
Contributor Author

alamb commented Jan 20, 2026

(this is on my list, but I am struggling to find time to finish it -- hopefully after CIDR / Thursday)

@alamb alamb changed the title WIP: DataFusion 52 release post DataFusion 52 release post Jan 23, 2026
@alamb
Contributor Author

alamb commented Jan 23, 2026

FYI @2010YOUY01 @BlakeOrth @Dandandan @Jefffrey @LiaCastaneda @NGA-TRAN @Tim-53 @Yuvraj-cyborg @adriangb @alamb @alchemist51 @asolimando @bharath-techie @comphead @corasaurus-hex
@ethan-tyler @feniljain @gabotechs @geoffreyclaude @jdcasale @jizezhang @kosiew @martin-g @mbutrovich @milenkovicm @nuno-faria @pepijnve @rluvaton @theirix @timsaucer @zhuqi-lucas and @xudong963 as you are mentioned in this post

---
layout: post
title: Apache DataFusion 52.0.0 Released
date: 2026-01-08
Member

Contributor Author

Thanks -- I think in the past we have dated the blog posts based on when the post was released rather than when the software was 🤔

changes is available in the [changelog]. Thanks to the [121 contributors] for
making this release possible.

TODO: confirm the release date for 52.0.0 and update the front matter if needed.
Member

Suggested change
TODO: confirm the release date for 52.0.0 and update the front matter if needed.

[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0
[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/
Member

TODO

Comment on lines +228 to +230
explained in the [Extending SQL in DataFusion Blog]. With this new API, you can
customize DataFusion to support almost any SQL syntax, such as the following
(which are not supported by default):
Contributor

@geoffreyclaude geoffreyclaude Jan 23, 2026

I feel that this is slightly misleading: it reads as if the RelationPlanner is what now allows extending expressions and types (and relations). Maybe something like:

In addition to the existing expression and types extension points, this new API now allows extending FROM clauses, leading DataFusion to support almost any SQL syntax, such as the following (which are not supported by default):

But reworded to be less of a run-on sentence...

[Apache Comet]: https://datafusion.apache.org/comet/
[mbutrovich]: https://github.com/mbutrovich

### Rewritten merge join
Contributor

This section title looks very similar to the previous one. The start of the first sentence is also identical. Maybe a title that differentiates this section more from the previous one (e.g. "Optimised Output Handling of Merge Join") would be clearer.

Comment on lines +178 to +186
Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
dynamically to scans, as explained in the [Dynamic Filtering Blog] using a
technique referred to as [Sideways Information Passing] in Database research
literature. The initial implementation passed min/max values for the join keys.
DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the
build size is small such as when the join is very selective. The `IN` list is
pushed down to the probe side scan and is used to prune files, row groups, and
individual rows. Thanks to [adriangb] for implementing this feature, with
reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
Contributor

We also push down references to the hash table itself when InList is too big.

The main advantage of InList (which I think should be mentioned here) is that it can participate in statistics pruning.

Suggested change
Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
dynamically to scans, as explained in the [Dynamic Filtering Blog], using a
technique referred to as [Sideways Information Passing] in database research
literature. The initial implementation passed min/max values for the join keys.
DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the
build size is small, such as when the join is very selective, or a reference to the build side hash map when the build side is larger.
These new expressions are pushed down to the probe side scan and are used to prune files, row groups, and
individual rows.
When the build side is small enough (<=20 rows, configurable), the pushed down filters can even participate in statistics pruning, which avoids reading the join keys from row groups that cannot match.
Thanks to [adriangb] for implementing this feature, with
reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
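
To illustrate why an `IN` list can prune more than min/max bounds alone, here is a rough Python sketch of statistics pruning over row groups. It is not DataFusion code: `RowGroupStats`, `prune_with_min_max`, and `prune_with_in_list` are hypothetical names, and the row group ranges and build-side keys are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics for the join key in one row group (hypothetical type)."""
    name: str
    min_key: int
    max_key: int


def prune_with_min_max(groups, build_keys):
    """Keep row groups whose key range overlaps [min(build_keys), max(build_keys)]."""
    lo, hi = min(build_keys), max(build_keys)
    return [g.name for g in groups if g.max_key >= lo and g.min_key <= hi]


def prune_with_in_list(groups, build_keys):
    """Keep row groups whose key range contains at least one build-side key."""
    return [
        g.name
        for g in groups
        if any(g.min_key <= k <= g.max_key for k in build_keys)
    ]


# Illustrative data: four row groups and a very selective build side.
groups = [
    RowGroupStats("rg0", 0, 99),
    RowGroupStats("rg1", 100, 199),
    RowGroupStats("rg2", 200, 299),
    RowGroupStats("rg3", 300, 399),
]
build_keys = [5, 350]

min_max_kept = prune_with_min_max(groups, build_keys)
in_list_kept = prune_with_in_list(groups, build_keys)
print(min_max_kept)  # ['rg0', 'rg1', 'rg2', 'rg3']
print(in_list_kept)  # ['rg0', 'rg3']
```

With only the bounds 5 and 350, every row group overlaps the range and nothing is skipped. With the individual keys, the two middle row groups can be skipped without reading the join key column.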

Contributor

@nuno-faria nuno-faria left a comment

I was looking at the changelog and this PR caught my attention: apache/datafusion#18644. Maybe it could be worth a mention as well.


This release also includes several additional caching improvements.

A new statistics cache for Parquet Metadata avoids repeatedly (re)calculating
Contributor

nit: maybe "Parquet Metadata" -> "File Metadata"? Since there is also a separate cache for the Parquet metadata itself.

Development

Successfully merging this pull request may close these issues.

Blog post for the DataFusion 52.0.0 release

7 participants