Skip to content

[SPARK-32346][SQL] Support filters pushdown in Avro datasource#29145

Closed
MaxGekk wants to merge 23 commits into
apache:masterfrom
MaxGekk:avro-filters-pushdown
Closed

[SPARK-32346][SQL] Support filters pushdown in Avro datasource#29145
MaxGekk wants to merge 23 commits into
apache:masterfrom
MaxGekk:avro-filters-pushdown

Conversation

@MaxGekk

@MaxGekk MaxGekk commented Jul 17, 2020

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

In the PR, I propose to support pushed down filters in Avro datasource V1 and V2.

  1. Added new SQL config spark.sql.avro.filterPushdown.enabled to control filters pushdown to Avro datasource. It is on by default.
  2. Renamed CSVFilters to OrderedFilters.
  3. OrderedFilters is used in AvroFileFormat (DSv1) and in AvroPartitionReaderFactory (DSv2)
  4. Modified AvroDeserializer to return None from the deserialize method when pushdown filters return false.

Why are the changes needed?

The changes improve performance on synthetic benchmarks up to 2 times on JDK 11:

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                        9614           9669          54          0.1        9614.1       1.0X
pushdown disabled                                 10077          10141          66          0.1       10077.2       1.0X
w/ filters                                         4681           4713          29          0.2        4681.5       2.1X

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added UT to AvroCatalystDataConversionSuite and AvroSuite
  • Re-running AvroReadBenchmark using Amazon EC2:
Item Description
Region us-west-2 (Oregon)
Instance r3.xlarge (spot instance)
AMI ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1)
Java OpenJDK8/11 installed bysudo add-apt-repository ppa:openjdk-r/ppa & sudo apt install openjdk-11-jdk

and ./dev/run-benchmarks:

#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
  ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])

@SparkQA

SparkQA commented Jul 17, 2020

Copy link
Copy Markdown

Test build #126053 has finished for PR 29145 at commit c010a4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2020

Copy link
Copy Markdown

Test build #126063 has finished for PR 29145 at commit ab17bd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2020

Copy link
Copy Markdown

Test build #126072 has finished for PR 29145 at commit 0ea0ef7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2020

Copy link
Copy Markdown

Test build #126074 has finished for PR 29145 at commit 5c4c177.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2020

Copy link
Copy Markdown

Test build #126114 has finished for PR 29145 at commit 72022bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2020

Copy link
Copy Markdown

Test build #126115 has finished for PR 29145 at commit 072eab0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-32346][SQL] Support filters pushdown in Avro datasource [SPARK-32346][SQL] Support filters pushdown in Avro datasource Jul 19, 2020
@SparkQA

SparkQA commented Jul 19, 2020

Copy link
Copy Markdown

Test build #126128 has finished for PR 29145 at commit 668e497.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk

MaxGekk commented Jul 19, 2020

Copy link
Copy Markdown
Member Author

@gengliangwang @dongjoon-hyun @HyukjinKwon @cloud-fan Please, take a look at this PR.

@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you for pinging me, @MaxGekk .

@MaxGekk

MaxGekk commented Jul 20, 2020

Copy link
Copy Markdown
Member Author

@dongjoon-hyun I am looking forward to review comments from you.

@SparkQA

SparkQA commented Jul 21, 2020

Copy link
Copy Markdown

Test build #126199 has finished for PR 29145 at commit 74f8f1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala Outdated
Comment thread external/avro/benchmarks/AvroReadBenchmark-jdk11-results.txt
@SparkQA

SparkQA commented Jul 21, 2020

Copy link
Copy Markdown

Test build #126255 has finished for PR 29145 at commit 438e634.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader

@HyukjinKwon

Copy link
Copy Markdown
Member

retest this please

@SparkQA

SparkQA commented Jul 22, 2020

Copy link
Copy Markdown

Test build #126303 has finished for PR 29145 at commit 438e634.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader

@MaxGekk

MaxGekk commented Jul 22, 2020

Copy link
Copy Markdown
Member Author

jenkins, retest this, please

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/OrderedFilters.scala Outdated
Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/OrderedFilters.scala Outdated
@SparkQA

SparkQA commented Jul 22, 2020

Copy link
Copy Markdown

Test build #126310 has finished for PR 29145 at commit 438e634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader

@SparkQA

SparkQA commented Jul 22, 2020

Copy link
Copy Markdown

Test build #126339 has finished for PR 29145 at commit f6b2a9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Jul 24, 2020
…checking out

### What changes were proposed in this pull request?
Refactoring of `JsonFilters`:
- Add an assert to the `skipRow` method to check the input `index`
- Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`.

### Why are the changes needed?
1. The assert should catch incorrect usage of `JsonFilters`
2. The config checking out of `JsonFilters` makes it consistent with `OrderedFilters` (see #29145).
3. `JsonFilters` can be used by other datasource in the future and don't depend from the JSON configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
$ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*"
```

Closes #29206 from MaxGekk/json-filters-pushdown-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@MaxGekk

MaxGekk commented Jul 27, 2020

Copy link
Copy Markdown
Member Author

@cloud-fan Please, review this PR.

@MaxGekk

MaxGekk commented Jul 29, 2020

Copy link
Copy Markdown
Member Author

@gengliangwang Are you ok with this PR?

@gengliangwang gengliangwang left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one comment

@gengliangwang

Copy link
Copy Markdown
Member

Thanks, merging to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants