DataFusion 52 release post #135

alamb · 2026-01-08T22:47:33Z

This is a draft of the DataFusion 52 release post

See rendered preview: https://datafusion.staged.apache.org/blog/2026/01/08/datafusion-52.0.0/

This was initially created using codex. Commands below

Details

We are going to write a blog post for the DataFusion 52.0.0 release

We need to cover the major features in this release. If you are unsure of any content, please leave a "TODO" note in the text and we can fill it in
later.

Please start with a copy of the previous post as a starting point: content/blog/2025-11-25-datafusion-51.0.0.md and update as needed.

The changelog is here: https://github.com/xudong963/arrow-datafusion/blob/update_version/dev/changelog/52.0.0.md

The list of major features can be found in apache/datafusion#18566 under the section "Features to mention in the blog
(if they make it)". Only include the ones that made it into the release, with a checkmark.

Please:

- write a blog post
- leave a section for performance chart which we can fill in later
- include a section for each major feature, summarizing what it is and why it is important, and the related PRs. Please try to include a diagram or
  example where possible.

content/blog/2026-01-08-datafusion-52.0.0.md

Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com>

alamb · 2026-01-20T17:25:57Z

Thanks @mbutrovich -- any additional context / suggestions you have on the sort merge join improvement would be most appreciated

alamb · 2026-01-20T17:26:20Z

(this is on my list, but I am struggling to find time to finish it -- hopefully after CIDR / Thursday)

alamb · 2026-01-23T00:50:59Z

martin-g · 2026-01-23T06:38:05Z

content/blog/2026-01-08-datafusion-52.0.0.md

+---
+layout: post
+title: Apache DataFusion 52.0.0 Released
+date: 2026-01-08


According to:

https://lists.apache.org/thread/gt29yg6wxzx82s87drwq1xb06yhs16y6

https://crates.io/crates/datafusion/52.0.0

Suggested change

-date: 2026-01-08
+date: 2026-01-12

Thanks -- I think in the past we have dated the blog posts based on when the post was released rather than when the software was 🤔

martin-g · 2026-01-23T06:38:36Z

content/blog/2026-01-08-datafusion-52.0.0.md

+changes is available in the [changelog]. Thanks to the [121 contributors] for
+making this release possible.
+
+TODO: confirm the release date for 52.0.0 and update the front matter if needed.


Suggested change

-TODO: confirm the release date for 52.0.0 and update the front matter if needed.

martin-g · 2026-01-23T06:39:02Z

content/blog/2026-01-08-datafusion-52.0.0.md

+TODO: confirm the release date for 52.0.0 and update the front matter if needed.
+
+[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0
+[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/


geoffreyclaude · 2026-01-23T07:05:21Z

content/blog/2026-01-08-datafusion-52.0.0.md

+explained in the [Extending SQL in DataFusion Blog]. With this new API, you can
+customize DataFusion to support almost any SQL syntax, such as the following
+(which are not supported by default):


I feel that this is slightly misleading: it reads as if the RelationPlanner is what now allows extending expressions and types (and relations). Maybe something like:

In addition to the existing expression and types extension points, this new API now allows extending FROM clauses, leading DataFusion to support almost any SQL syntax, such as the following (which are not supported by default):

But reworded to be less of a run-on sentence...

pepijnve · 2026-01-23T12:09:07Z

content/blog/2026-01-08-datafusion-52.0.0.md

+[Apache Comet]: https://datafusion.apache.org/comet/
+[mbutrovich]: https://github.com/mbutrovich
+
+### Rewritten merge join


This section title looks very similar to the previous one. The start of the first sentence is also identical. Maybe a title that differentiates this section more from the previous one (e.g. "Optimised Output Handling of Merge Join") would be clearer.

adriangb · 2026-01-23T17:00:13Z

content/blog/2026-01-08-datafusion-52.0.0.md

+Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
+dynamically to scans, as explained in the [Dynamic Filtering Blog] using a
+technique referred to as [Sideways Information Passing] in Database research
+literature. The initial implementation passed min/max values for the join keys.
+DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the
+build size is small such as when the join is very selective. The `IN` list is
+pushed down to the probe side scan and is used to prune files, row groups, and
+individual rows.  Thanks to [adriangb] for implementing this feature, with
+reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich].


We also push down references to the hash table itself when InList is too big.

The main advantage of InList (which I think should be mentioned here) is that it can participate in statistics pruning.

Suggested change

-Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
-dynamically to scans, as explained in the [Dynamic Filtering Blog] using a
-technique referred to as [Sideways Information Passing] in Database research
-literature. The initial implementation passed min/max values for the join keys.
-DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the
-build size is small such as when the join is very selective. The `IN` list is
-pushed down to the probe side scan and is used to prune files, row groups, and
-individual rows. Thanks to [adriangb] for implementing this feature, with
-reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
+Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
+dynamically to scans, as explained in the [Dynamic Filtering Blog] using a
+technique referred to as [Sideways Information Passing] in Database research
+literature. The initial implementation passed min/max values for the join keys.
+DataFusion 52 extends the optimization ([#17171] / [#18393]) to use an `IN` list when the
+build size is small, such as when the join is very selective, or a reference to the
+build side hash map when the build side is larger.
+These new expressions are pushed down to the probe side scan and are used to prune files, row groups, and
+individual rows.
+When the build side is small enough (<=20 rows, but configurable) the pushed down filters can even
+participate in statistics pruning to avoid reading the join keys from row groups that will not match.
+Thanks to [adriangb] for implementing this feature, with
+reviews from [LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
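
To illustrate why an `IN` list can take part in statistics pruning where a single min/max range cannot help much, here is a minimal sketch in plain Python. It is not DataFusion's implementation: `RowGroupStats`, `may_match_range`, and `may_match_in_list` are hypothetical names invented for this example, and the row group layout and keys are made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for per-row-group min/max statistics of the join key."""
    min_key: int
    max_key: int


def may_match_range(stats, lo, hi):
    """Min/max filter: keep the row group if its key range overlaps [lo, hi]."""
    return stats.max_key >= lo and stats.min_key <= hi


def may_match_in_list(stats, keys):
    """IN-list filter: keep the row group only if some build-side key falls in its range."""
    return any(stats.min_key <= k <= stats.max_key for k in keys)


# The build side of the join produced only three distinct keys (assumed data).
build_keys = [5, 500, 995]
lo, hi = min(build_keys), max(build_keys)

# Ten probe-side row groups, each covering 100 consecutive key values.
row_groups = [RowGroupStats(i * 100, i * 100 + 99) for i in range(10)]

kept_by_range = [rg for rg in row_groups if may_match_range(rg, lo, hi)]
kept_by_in_list = [rg for rg in row_groups if may_match_in_list(rg, build_keys)]

print(len(kept_by_range), len(kept_by_in_list))  # prints "10 3"
```

In this toy setup the min/max filter keeps all ten row groups because the build keys span the whole range, while the `IN` list keeps only the three row groups that can contain a matching key.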

nuno-faria

I was looking at the changelog and this PR caught my attention: apache/datafusion#18644. Maybe it could be worth a mention as well.

nuno-faria · 2026-01-23T21:17:58Z

content/blog/2026-01-08-datafusion-52.0.0.md

+
+This release also includes several additional caching improvements.
+
+A new statistics cache for Parquet Metadata avoids repeatedly (re)calculating


nit: maybe "Parquet Metadata" -> "File Metadata"? Since there is also a separate cache for the Parquet metadata itself.

Initial draft (coded with codex)

68abcfb

alamb mentioned this pull request Jan 8, 2026

WIP: DataFusion 52 release post #134

Closed

updates

1686945

mbutrovich reviewed Jan 12, 2026

content/blog/2026-01-08-datafusion-52.0.0.md (outdated)

alamb added 3 commits January 18, 2026 21:54

Merge remote-tracking branch 'apache/main' into site/datafusion_52

63b8e12

updates

3c7dd6a

Update sql planning

2cc1f59

mbutrovich reviewed Jan 20, 2026

content/blog/2026-01-08-datafusion-52.0.0.md (outdated)

mbutrovich reviewed Jan 20, 2026

content/blog/2026-01-08-datafusion-52.0.0.md (outdated)

mbutrovich reviewed Jan 20, 2026

content/blog/2026-01-08-datafusion-52.0.0.md (outdated)

Apply suggestions from code review

ccc5d42

Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com>

alamb added 7 commits January 22, 2026 18:45

Updates

81954c5

acknowledgments

781cd62

update

790b658

updates

d63a4de

update

d38d99f

typos

b47c50d

refine

1f5b91e

alamb changed the title ~~WIP: DataFusion 52 release post~~ DataFusion 52 release post Jan 23, 2026

alamb added 2 commits January 22, 2026 19:42

Merge branch 'site/datafusion_52' of https://github.com/apache/datafusion-site into site/datafusion_52

34cea38

clean

2823de5

martin-g reviewed Jan 23, 2026

geoffreyclaude reviewed Jan 23, 2026

pepijnve reviewed Jan 23, 2026

adriangb reviewed Jan 23, 2026

nuno-faria approved these changes Jan 23, 2026



