
[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package #33350

Closed
sunchao wants to merge 3 commits into apache:master from sunchao:SPARK-36136-partitions-suite

Conversation

sunchao commented Jul 15, 2021
Member

What changes were proposed in this pull request?

Move both PruneFileSourcePartitionsSuite and PrunePartitionSuiteBase to the package org.apache.spark.sql.execution.datasources. Some refactoring was needed to enable this.

Why are the changes needed?

Currently both PruneFileSourcePartitionsSuite and PrunePartitionSuiteBase are in the package org.apache.spark.sql.hive.execution, which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into org.apache.spark.sql.execution.datasources, the same package that contains the rule PruneFileSourcePartitions.

Does this PR introduce any user-facing change?

No, it's just test refactoring.

How was this patch tested?

Using existing tests:

build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"

and

build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"

github-actions bot added the SQL label Jul 15, 2021

SparkQA commented Jul 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45554/

SparkQA commented Jul 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45554/

SparkQA commented Jul 15, 2021

Test build #141039 has finished for PR 33350 at commit f782ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession
  • abstract class PrunePartitionSuiteBase extends StatisticsCollectionTestBase
  • class PruneHiveTablePartitionsSuite extends PrunePartitionSuiteBase with TestHiveSingleton

Member

Since this claims to be a simple move of classes, shall we preserve i instead of introducing a new column name, id?

Member Author

Thanks. I'm not sure whether it's worth doing so because we changed how the test table is created by using the DataFrame API spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("test"), which creates id column by default. The id here is also consistent with the rest of the tests in this file as well as other tests which use the same API to create tables.

Member

I meant we should recover it from id to i together~

Member

Anyway, if the title and scope are broadened, I'm okay with id, too.

Member Author

Thanks! I'll keep as it is then :-)

dongjoon-hyun left a comment
Member

It would be great if we change the title. Moving is misleading for this PR content. This PR is a kind of generalization or refactoring.

@dongjoon-hyun
Member

BTW, I agree with the idea.

sunchao commented Jul 15, 2021
Member Author

Thanks @dongjoon-hyun for reviewing. On the title, do you have any suggestion? this PR does move these files from one package to another package so I'm thinking at least it expressed that part OK. How about something like "Refactor PruneFileSourcePartitionsSuite and move it to the correct package"?

@dongjoon-hyun
Member

Ya, Refactoring ... sounds more accurate. However, the correct package should not be there. The previous one also can be considered correct.

sunchao changed the title from "[SPARK-36136][SQL][TEST] Move PruneFileSourcePartitionsSuite to org.apache.spark.sql.execution.datasources" to "[SPARK-36136][SQL][TEST] Refactor PruneFileSourcePartitionsSuite etc to a different package" on Jul 15, 2021

sunchao commented Jul 16, 2021
Member Author

Thanks @dongjoon-hyun . I've changed the title to "Refactor PruneFileSourcePartitionsSuite etc to a different package" - let me know if this works for you :)

HyukjinKwon changed the title from "[SPARK-36136][SQL][TEST] Refactor PruneFileSourcePartitionsSuite etc to a different package" to "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package" on Jul 16, 2021

dongjoon-hyun Jul 16, 2021
Member

As you know, saveAsTable is different from STORED AS parquet. The original test coverage seems to be coupled with convertMetastoreParquet, but this one looks different. Are we losing the existing test coverage?

scala> spark.range(10).selectExpr("id", "id % 3 as p").write.partitionBy("p").saveAsTable("t1")

scala> sql("DESCRIBE TABLE EXTENDED t1").show()
...
|            Provider|             parquet|       |
...
scala> sql("CREATE TABLE t2(a int) STORED AS parquet").show()
scala> sql("DESCRIBE TABLE EXTENDED t2").show()
...
|            Provider|                hive|       |
...

Member

This specific test coverage should remain in the hive module.

Member Author

Hmm what is convertMetastoreParquet? I couldn't find it anywhere.

Regarding the test, I think it is still covered (I've debugged the test and made sure it is still going through the related code paths in PruneFileSourcePartitions). Much has changed since 2016 though: the test (added in #15569) was originally designed to make sure that LogicalRelation.expectedOutputAttributes was correctly populated in the class. The expectedOutputAttributes, however, was later replaced by directly passing output in LogicalRelation (in #17552), which I think further prevented the issue from happening.

Member

I mean spark.sql.hive.convertMetastoreParquet. Here is the document.

Member

FYI, CREATE TABLE ... USING PARQUET (Spark syntax) and CREATE TABLE ... STORED AS PARQUET (Hive syntax) generate different tables in Apache Spark.

dongjoon-hyun Jul 16, 2021
Member

For Hive tables generated by STORED AS syntax, Spark converts them to data source tables on the fly because spark.sql.hive.convertMetastoreParquet is true by default. It's the same for ORC. For ORC, we have spark.sql.hive.convertMetastoreOrc.

Member Author

Thanks @dongjoon-hyun . I found it now. However I'm not sure whether this matters for the test though: what it does is just 1) register table metadata in the catalog, 2) create a LogicalRelation wrapping a HadoopFsRelation which has the data and partition schema from the step 1), and 3) feed it into the rule PruneFileSourcePartitions and see if the LogicalRelation's expectedOutputAttributes is properly set. Seems this is irrelevant to what SerDe it is using?

@dongjoon-hyun
Member

Could you resolve the conflicts, @sunchao ?

Also, cc @cloud-fan , @maropu , @viirya .

Comment on lines 47 to 54

Member

Do we know why it used an external table before? Is it related to the test coverage here?

Member Author

I explained this in the other thread and I don't think this is related to the test coverage here. Let me know if you think otherwise @viirya @cloud-fan .

Member

Yea, I saw the comment. It makes sense, and that's also what I read from the test. Just wondering why it used an external table originally.

sunchao Jul 19, 2021
Member Author

I'm not sure either. IMO the EXTERNAL keyword doesn't matter here. I've run the test with and without it and the outcome is the same.

@cloud-fan
Contributor

Do we still have the test coverage for partition pruning with hive tables?

sunchao commented Jul 19, 2021
Member Author

Do we still have the test coverage for partition pruning with hive tables?

@cloud-fan you mean PruneHiveTablePartitionsSuite? yes it is untouched by this PR.

sunchao force-pushed the SPARK-36136-partitions-suite branch from f782ce7 to 33c38a4 on July 19, 2021 18:48

SparkQA commented Jul 19, 2021

Test build #141273 has finished for PR 33350 at commit 33c38a4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with SharedSparkSession
  • abstract class PrunePartitionSuiteBase extends StatisticsCollectionTestBase
  • class PruneHiveTablePartitionsSuite extends PrunePartitionSuiteBase with TestHiveSingleton

SparkQA commented Jul 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45787/

SparkQA commented Jul 19, 2021

Test build #141276 has finished for PR 33350 at commit 1316644.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45790/

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45790/

SparkQA commented Jul 19, 2021

Test build #141282 has finished for PR 33350 at commit c98975c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45796/

SparkQA commented Jul 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45796/

sunchao force-pushed the SPARK-36136-partitions-suite branch from c98975c to 7473aea on July 20, 2021 03:32

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/

SparkQA commented Jul 20, 2021

Test build #141293 has finished for PR 33350 at commit 7473aea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan left a comment
Contributor

LGTM. Can you rebase, since the last test run was quite a long time ago?

sunchao force-pushed the SPARK-36136-partitions-suite branch from 7473aea to d6fa7a5 on July 26, 2021 17:09

sunchao commented Jul 26, 2021
Member Author

Thanks @cloud-fan ! just rebased.

SparkQA commented Jul 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46167/

SparkQA commented Jul 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46167/

SparkQA commented Jul 26, 2021

Test build #141651 has finished for PR 33350 at commit d6fa7a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya commented Jul 26, 2021
Member

Thanks. Merging to master/3.2. Although this is not a bug fix, it is test-only, and I think it is better to keep master and 3.2 consistent so that backporting changes is easier.

Feel free to revert it from 3.2 if you prefer to have it only in master. Thanks.

viirya pushed a commit that referenced this pull request Jul 26, 2021
… to a different package

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 634f96d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
viirya closed this in 634f96d on Jul 26, 2021

sunchao commented Jul 26, 2021
Member Author

Thanks @viirya ! Yes I agree we should backport and keep master & 3.2 consistent.

sunchao deleted the SPARK-36136-partitions-suite branch on July 26, 2021 20:30

@venkata91
Contributor

@viirya @sunchao It seems the refactor caused a couple of tests in PruneFileSourcePartitionsSuite to fail. The switch from Hive tables to data source tables appears to be causing issues.

I think the test "SPARK-35985 push filters for empty read schema" is failing because it returns all the files under the partition with DSV2.

The test "SPARK-36128: spark.sql.hive.metastorePartitionPruning should work for file data sources" checks HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount, which is now 0 since the tests were moved. Please take a look.

viirya commented Jul 27, 2021
Member

Hm? This passed Jenkins, and GA also passed. Where do you see the failed tests?

@cloud-fan
Contributor

seems they are failing in 3.2 branch

@cloud-fan
Contributor

hmm, they are also failing in master: #33498

sunchao commented Jul 27, 2021
Member Author

Sorry. Let me check the failed tests.

viirya commented Jul 27, 2021
Member

We can revert it if investigating will take some time.

viirya commented Jul 27, 2021
Member

Is it flaky, or did other merged PRs change the result?

viirya commented Jul 27, 2021
Member

Created #33533 to revert it first. @cloud-fan @sunchao @dongjoon-hyun

sunchao commented Jul 27, 2021
Member Author

We can revert it first. The test failures are related. Not sure why they weren't detected by the CI previously though.

@LuciferYang
Contributor

@sunchao I also hit this problem

@LuciferYang
Contributor

For SPARK-35985 push filters for empty read schema

The old case writes one result file for each partition, but the new case writes two result files for each partition. I guess some test configurations have changed.

LuciferYang commented Jul 27, 2021
Contributor

@sunchao
I found that the old case used spark.master local[1], but SharedSparkSession creates TestSparkSession with local[2] by default. So we should override the createSparkSession method in the new PruneFileSourcePartitionsSuite file to create a TestSparkSession with local[1], so that "SPARK-35985 push filters for empty read schema" passes.
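
To illustrate why the number of local cores could change the file count, here is a minimal pure-Scala sketch (it does not use Spark). It rests on two simplifying assumptions: the 10 rows from range(10) are split into contiguous, roughly equal slices, one per task, and each task writes one file for every distinct partition value p = id % 3 it holds. The names FilesPerPartitionSketch, slices, and filesPerPartition are hypothetical and exist only for this illustration.

```scala
object FilesPerPartitionSketch {
  // Assumption: ids 0 until n are split into `tasks` contiguous slices.
  def slices(n: Int, tasks: Int): Seq[Range] =
    (0 until tasks).map(i => (i * n / tasks) until ((i + 1) * n / tasks))

  // Assumption: each task emits one file per distinct partition value it sees.
  // Returns partition value -> number of files written for it.
  def filesPerPartition(n: Int, tasks: Int, buckets: Int): Map[Int, Int] =
    slices(n, tasks)
      .flatMap(slice => slice.map(_ % buckets).distinct)
      .groupBy(identity)
      .map { case (p, hits) => p -> hits.size }

  def main(args: Array[String]): Unit = {
    // One task, as with local[1].
    println(filesPerPartition(10, tasks = 1, buckets = 3))
    // Two tasks, as with local[2].
    println(filesPerPartition(10, tasks = 2, buckets = 3))
  }
}
```

Under this model, a single task yields one file for each of the three partition values, while two tasks yield two files each, which is consistent with the one-versus-two files observation above.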

@LuciferYang
Contributor

It seems that SPARK-36128: spark.sql.hive.metastorePartitionPruning should work for file data sources should not be placed in sql/core module.

sunchao commented Jul 27, 2021
Member Author

Thanks @LuciferYang ! yes I found that we could either use local[1] or coalesce(1) to fix the first test case. For the second, it relies on HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED which is no longer available since it switched to use InMemoryCatalog. I need to find a new way to write the test.
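
A rough sketch of the coalesce(1) option, assuming Spark SQL is on the classpath. It is not the fix that was eventually applied. To stay free of catalog state it writes with parquet(path) into a temporary directory instead of using saveAsTable, and the names CoalesceBeforeWriteSketch and partFiles are hypothetical.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CoalesceBeforeWriteSketch {
  // Count data files (names starting with "part-") in each partition directory.
  def partFiles(root: File): Map[String, Int] =
    root.listFiles().filter(_.isDirectory).map { dir =>
      dir.getName -> dir.listFiles().count(_.getName.startsWith("part-"))
    }.toMap

  def main(args: Array[String]): Unit = {
    // local[2] mirrors the default that SharedSparkSession is said to use above.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-sketch")
      .getOrCreate()
    val base = Files.createTempDirectory("prune-sketch").toFile
    val df = spark.range(10).selectExpr("id", "id % 3 as p")

    // Written as-is: the number of files per partition depends on the task count.
    df.write.partitionBy("p").parquet(new File(base, "default").getPath)
    // Coalesced to one partition first, so a single task writes every file.
    df.coalesce(1).write.partitionBy("p").parquet(new File(base, "coalesced").getPath)

    println(partFiles(new File(base, "default")))
    println(partFiles(new File(base, "coalesced")))
    spark.stop()
  }
}
```

With coalesce(1) a single task performs the write, so each p=... directory should contain one data file regardless of the master setting. That makes the test independent of local[1] versus local[2] without overriding createSparkSession.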
