Skip to content

[SPARK-2393][SQL] Cost estimation optimization framework for Catalyst logical plans & sample usage.#1238

Closed
concretevitamin wants to merge 27 commits into
apache:masterfrom
concretevitamin:estimates
Closed

[SPARK-2393][SQL] Cost estimation optimization framework for Catalyst logical plans & sample usage.#1238
concretevitamin wants to merge 27 commits into
apache:masterfrom
concretevitamin:estimates

Conversation

@concretevitamin

Copy link
Copy Markdown
Contributor

The idea is that every Catalyst logical plan gets hold of a Statistics class, the usage of which provides useful estimations on various statistics. See the implementations of MetastoreRelation.

This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).

Finally, there are a couple minor accompanying changes including:

  • Remove the not-in-use BaseRelation.
  • Make SparkLogicalPlan take a SQLContext in the second param list.

@AmplabJenkins

Copy link
Copy Markdown

Merged build triggered.

@AmplabJenkins

Copy link
Copy Markdown

Merged build started.

@AmplabJenkins

Copy link
Copy Markdown

Merged build triggered.

@AmplabJenkins

Copy link
Copy Markdown

Merged build started.

@AmplabJenkins

Copy link
Copy Markdown

Merged build finished.

@AmplabJenkins

Copy link
Copy Markdown

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16174/

@AmplabJenkins

Copy link
Copy Markdown

Merged build finished.

@AmplabJenkins

Copy link
Copy Markdown

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16173/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove broadcastTables.contains(b.tableName) and then your if can fit in a single line, and you don't need to indent the next line ...

@AmplabJenkins

Copy link
Copy Markdown

Merged build triggered.

@AmplabJenkins

Copy link
Copy Markdown

Merged build started.

@AmplabJenkins

Copy link
Copy Markdown

Merged build triggered.

@AmplabJenkins

Copy link
Copy Markdown

Merged build started.

@AmplabJenkins

Copy link
Copy Markdown

Merged build finished.

@AmplabJenkins

Copy link
Copy Markdown

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16316/

@concretevitamin

Copy link
Copy Markdown
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Copy link
Copy Markdown

Merged build triggered.

@AmplabJenkins

Copy link
Copy Markdown

Merged build started.

@AmplabJenkins

Copy link
Copy Markdown

Merged build finished.

@AmplabJenkins

Copy link
Copy Markdown

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16319/

@AmplabJenkins

Copy link
Copy Markdown

Merged build finished.

@AmplabJenkins

Copy link
Copy Markdown

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16317/

@AmplabJenkins

Copy link
Copy Markdown

Merged build triggered.

@AmplabJenkins

Copy link
Copy Markdown

Merged build started.

@concretevitamin

Copy link
Copy Markdown
Contributor Author

Rebased & addressed review comments.

@concretevitamin

Copy link
Copy Markdown
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented Jul 29, 2014

Copy link
Copy Markdown

QA tests have started for PR 1238. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17369/consoleFull

@SparkQA

SparkQA commented Jul 29, 2014

Copy link
Copy Markdown

QA results for PR 1238:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17369/consoleFull

@marmbrus

Copy link
Copy Markdown
Contributor

Thanks! I've merged this into master.

@asfgit asfgit closed this in c7db274 Jul 29, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
… logical plans & sample usage.

The idea is that every Catalyst logical plan gets hold of a Statistics class, the usage of which provides useful estimations on various statistics. See the implementations of `MetastoreRelation`.

This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).

Finally, there are a couple minor accompanying changes including:
- Remove the not-in-use `BaseRelation`.
- Make SparkLogicalPlan take a `SQLContext` in the second param list.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes apache#1238 from concretevitamin/estimates and squashes the following commits:

329071d [Zongheng Yang] Address review comments; turn config name from string to field in SQLConf.
8663e84 [Zongheng Yang] Use BigInt for stat; for logical leaves, by default throw an exception.
2f2fb89 [Zongheng Yang] Fix statistics for SparkLogicalPlan.
9951305 [Zongheng Yang] Remove childrenStats.
16fc60a [Zongheng Yang] Avoid calling statistics on plans if auto join conversion is disabled.
8bd2816 [Zongheng Yang] Add a note on performance of statistics.
6e594b8 [Zongheng Yang] Get size info from metastore for MetastoreRelation.
01b7a3e [Zongheng Yang] Update scaladoc for a field and move it to @param section.
549061c [Zongheng Yang] Remove numTuples in Statistics for now.
729a8e2 [Zongheng Yang] Update docs to be more explicit.
573e644 [Zongheng Yang] Remove singleton SQLConf and move back `settings` to the trait.
2d99eb5 [Zongheng Yang] {Cleanup, use synchronized in, enrich} StatisticsSuite.
ca5b825 [Zongheng Yang] Inject SQLContext into SparkLogicalPlan, removing SQLConf mixin from it.
43d38a6 [Zongheng Yang] Revert optimization for BroadcastNestedLoopJoin (this fixes tests).
0ef9e5b [Zongheng Yang] Use multiplication instead of sum for default estimates.
4ef0d26 [Zongheng Yang] Make Statistics a case class.
3ba8f3e [Zongheng Yang] Add comment.
e5bcf5b [Zongheng Yang] Fix optimization conditions & update scala docs to explain.
7d9216a [Zongheng Yang] Apply estimation to planning ShuffleHashJoin & BroadcastNestedLoopJoin.
73cde01 [Zongheng Yang] Move SQLConf back. Assign default sizeInBytes to SparkLogicalPlan.
73412be [Zongheng Yang] Move SQLConf to Catalyst & add default val for sizeInBytes.
7a60ab7 [Zongheng Yang] s/Estimates/Statistics, s/cardinality/numTuples.
de3ae13 [Zongheng Yang] Add parquetAfter() properly in test.
dcff9bd [Zongheng Yang] Cleanups.
84301a4 [Zongheng Yang] Refactors.
5bf5586 [Zongheng Yang] Typo.
56a8e6e [Zongheng Yang] Prototype impl of estimations for Catalyst logical plans.
wangyum pushed a commit that referenced this pull request May 26, 2023
* [CARMEL-6439] Add configuration to enable log column lineage

* remove log

* useless condition

* skip empty lineage
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants