Skip to content

[GLUTEN-6650] Remove bloomfilter from partition filter#6652

Closed
WangGuangxin wants to merge 1 commit into
apache:mainfrom
WangGuangxin:fix_bloomfilter
Closed

[GLUTEN-6650] Remove bloomfilter from partition filter#6652
WangGuangxin wants to merge 1 commit into
apache:mainfrom
WangGuangxin:fix_bloomfilter

Conversation

@WangGuangxin

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

If BloomFilterMightContains is pushed down to partition filter, it's executed in Spark driver using codegen. But since the bloom filter contructed by native are different with Spark, a NPE will throw in this case

(Fixes: #6650)

How was this patch tested?

UT

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@github-actions

Copy link
Copy Markdown

#6650

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI

partitionFilters.filter(isDynamicPruningFilter)
partitionFilters
.filter(isDynamicPruningFilter)
.filterNot(isBloomFilterMightContain)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do bloom filters still take effect in Gluten if we do this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't task effect in this case, and partition filter will not re-evalute in following FilterExec or other operator.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the PR completely disables runtime bloom-filter in Gluten? Am I missing something?

@WangGuangxin WangGuangxin Aug 1, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

completely

@zhztheplayer Only disable the case when bloomfilter is applied on hive table's partition column ( join key is hive table's partition column).

In most case, bloom filter is applied on data column, which is not affected by this PR.

@zhztheplayer

zhztheplayer commented Aug 13, 2024

Copy link
Copy Markdown
Member

But since the bloom filter contructed by native are different with Spark, a NPE will throw in this case

I remember we had made Velox bloom-filter functions runnable in vanilla Spark operators so there shouldn't have compatibility issues. #5435. Am I missing something? Or there is a bug?

@WangGuangxin

Copy link
Copy Markdown
Contributor Author

But since the bloom filter contructed by native are different with Spark, a NPE will throw in this case

I remember we had made Velox bloom-filter functions runnable in vanilla Spark operators so there shouldn't have compatibility issues. #5435. Am I missing something? Or there is a bug?

@zhztheplayer Got it. Let me check.

@WangGuangxin

Copy link
Copy Markdown
Contributor Author

But since the bloom filter contructed by native are different with Spark, a NPE will throw in this case

I remember we had made Velox bloom-filter functions runnable in vanilla Spark operators so there shouldn't have compatibility issues. #5435. Am I missing something? Or there is a bug?

@zhztheplayer Got it. Let me check.

@zhztheplayer I digged into it, the main reason is that in this case, its a partition filter and the bloom filter is evaluted in driver side, so we need to change VeloxBloomFilterMightContain's codegen from

val bf = ctx.addMutableState(className, "bloomFilter")
    ctx.addPartitionInitializationStatement(s"$bf = $className.readFrom($bfData);")

to

val bf = ctx.addMutableState(className, "bloomFilter", bf => s"$bf = $className.readFrom($bfData);")

Otherwise, the bf object is not initialized when executed in driver and a NPE is thrown.

But after this change, the VeloxBloomFilter still cannot evaluted on driver side, since most of JNI and Resource Management will check whether we are in a spark task or not.
such as

Runtimes.contextInstance

and

Spillable reservation listener must be used in a Spark task.

@github-actions

Copy link
Copy Markdown

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions Bot added the stale stale label Sep 29, 2024
@github-actions

Copy link
Copy Markdown

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

@zhztheplayer

Copy link
Copy Markdown
Member

Sorry for missing the latest message and I am now revisiting. Thanks! @WangGuangxin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CORE] Remove bloom filter from partition filter since the BloomFilter result format is different with spark

2 participants