Skip to content

[SPARK-19426][SQL] Custom coalescer for Dataset#18861

Closed
maropu wants to merge 5 commits into
apache:masterfrom
maropu:pr16766
Closed

[SPARK-19426][SQL] Custom coalescer for Dataset#18861
maropu wants to merge 5 commits into
apache:masterfrom
maropu:pr16766

Conversation

@maropu

@maropu maropu commented Aug 6, 2017

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This pr added a new API for coalesce in Dataset; users can specify the custom coalescer which reduces an input Dataset into fewer partitions. This coalescer implementation is the same with the one in RDD#coalesce added in #11865 (SPARK-14042).

This is the rework of #16766.

How was this patch tested?

Added tests in DatasetSuite.

@SparkQA

SparkQA commented Aug 6, 2017

Copy link
Copy Markdown

Test build #80306 has finished for PR 18861 at commit bb7f9b0.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
  • case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])

@SparkQA

SparkQA commented Aug 6, 2017

Copy link
Copy Markdown

Test build #80307 has finished for PR 18861 at commit 253bdcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
  • case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])

@maropu

maropu commented Aug 7, 2017

Copy link
Copy Markdown
Member Author

@gatorsmile ok

* the current partitioning is).
*/
case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the parm description of coalescer? also update function descriptions? Thanks~!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

index += 1
} else {
updateGroups
updateGroups()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the above changes are not related to this PR, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I just left the changes of the original author (probably refactoring stuffs?) ..., better remove this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine about this, but it might confuse the others. Maybe just remove them in this PR? You can submit a separate PR later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll drop these from this pr.

* Returns a new RDD that has at most `numPartitions` partitions. This behavior can be modified by
* supplying a `PartitionCoalescer` to control the behavior of the partitioning.
*/
case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding new logical nodes also needs the updates in multiple different components. (e.g., Optimizer).

Is that possible to reuse the existing node Repartition?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I think so. I'll try and plz give me days to do so.

@SparkQA

SparkQA commented Aug 7, 2017

Copy link
Copy Markdown

Test build #80321 has finished for PR 18861 at commit c0306d3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2017

Copy link
Copy Markdown

Test build #80320 has finished for PR 18861 at commit 413b0eb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2017

Copy link
Copy Markdown

Test build #80327 has finished for PR 18861 at commit 83ac85f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Repartition(

@SparkQA

SparkQA commented Aug 7, 2017

Copy link
Copy Markdown

Test build #80328 has finished for PR 18861 at commit 3b4c679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Repartition(

@maropu

maropu commented Aug 7, 2017

Copy link
Copy Markdown
Member Author

@gatorsmile ok, could you check again?

child: LogicalPlan,
coalescer: Option[PartitionCoalescer] = None)
extends RepartitionOperation {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new require here?

require(!shuffle || coalescer.isEmpty, "Custom coalescer is not allowed for repartition(shuffle=true)")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

* @group typedrel
* @since 2.3.0
*/
def coalesce(numPartitions: Int, userDefinedCoalescer: Option[PartitionCoalescer])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def coalesce(
    numPartitions: Int,
    userDefinedCoalescer: Option[PartitionCoalescer]): Dataset[T] = withTypedPlan {

case (false, true) => if (r.numPartitions >= child.numPartitions) child else r
case _ => r.copy(child = child.child)
}
case r @ Repartition(_, _, child: RepartitionOperation, None) =>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, my bad. Fixed.

@SparkQA

SparkQA commented Aug 8, 2017

Copy link
Copy Markdown

Test build #80373 has finished for PR 18861 at commit d7392c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu

maropu commented Aug 8, 2017

Copy link
Copy Markdown
Member Author

@gatorsmile ok, fixed.

@maropu

maropu commented Aug 10, 2017

Copy link
Copy Markdown
Member Author

@gatorsmile ping

@@ -596,14 +596,17 @@ object CollapseProject extends Rule[LogicalPlan] {
object CollapseRepartition extends Rule[LogicalPlan] {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add new test cases to CollapseRepartitionSuite for the changes in this rule.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@jiangxb1987

Copy link
Copy Markdown
Contributor

@maropu any update on this issue?

@maropu

maropu commented Oct 3, 2017

Copy link
Copy Markdown
Member Author

@gatorsmile better to close this pr and the jira?

@gatorsmile

Copy link
Copy Markdown
Member

Yeah, maybe close it first. Thanks!

@maropu

maropu commented Oct 4, 2017

Copy link
Copy Markdown
Member Author

ok, thanks!

@maropu maropu closed this Oct 4, 2017
@SubhamSinghal

Copy link
Copy Markdown

I am willing to contribute to the missing pieces in this PR, but need guidance on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants