[SPARK-13136][SQL] Create a dedicated Broadcast exchange operator #11083
hvanhovell wants to merge 26 commits into
Conversation
|
Test build #50775 has finished for PR 11083 at commit
|
# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala
|
Retest this please |
|
Test build #50879 has finished for PR 11083 at commit
|
|
Test build #50881 has finished for PR 11083 at commit
|
|
Retest this please |
|
Test build #50883 has finished for PR 11083 at commit
|
|
This one is ready for review. |
|
Test build #50897 has finished for PR 11083 at commit
|
|
retest this please |
|
Test build #50898 has finished for PR 11083 at commit
|
|
retest this please |
|
Test build #50900 has finished for PR 11083 at commit
|
|
retest this please |
case class Broadcast(
    f: Iterable[InternalRow] => Any,
    child: SparkPlan)
  extends UnaryNode with CodegenSupport {
Since we do include this in the generated code of BroadcastHashJoin, I think it's better not to implement CodegenSupport; then we don't need the special case in CollapseCodegenStages.
|
Test build #50928 has finished for PR 11083 at commit
|
|
Test build #50942 has finished for PR 11083 at commit
|
# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
|
Test build #51039 has finished for PR 11083 at commit
|
|
@yhuai if you have some time this wk, can you review this? |
/**
 * Represents data where tuples are broadcasted to every node. It is quite common that the
 * entire set of tuples is transformed into different data structure.
 */
I'm thinking it may be better to just declare that we want a hashed broadcast distribution, rather than take a closure. Taking a closure is a problem because it won't work if we want to whole-stage codegen the building of the hash table, or if we want to change the internal engine to a push-based model.
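The contrast raised in this comment can be sketched as follows. This is a toy model, not code from the PR: `Row`, `ClosureBroadcast`, `Mode`, `Identity`, `Hashed`, `build` and `canReuse` are hypothetical names chosen for illustration, and `Row` merely stands in for `InternalRow`. The sketch only aims to show why a declared mode is easier for a planner to reason about than an opaque function.

```scala
object BroadcastModeSketch {
  // Stand-in for Spark's InternalRow; an assumption made for this sketch only.
  type Row = Seq[Any]

  // Closure-based variant: the planner sees only an opaque function and
  // cannot tell what structure it builds.
  final case class ClosureBroadcast(f: Iterable[Row] => Any)

  // Declarative variant: the requested shape is plain data.
  sealed trait Mode
  case object Identity extends Mode
  final case class Hashed(keyOrdinals: Seq[Int]) extends Mode

  // Because a Mode is data, a single interpreter can decide how to build
  // the broadcast value (and could later be swapped for generated code).
  def build(mode: Mode, rows: Iterable[Row]): Any = mode match {
    case Identity => rows.toVector
    case Hashed(ordinals) =>
      // Group rows by the values found at the key ordinals.
      rows.groupBy(row => ordinals.map(i => row(i)))
  }

  // Two declared modes can be compared structurally, which is what makes
  // reuse decisions possible; closures offer no such comparison.
  def canReuse(requested: Mode, available: Mode): Boolean =
    requested == available
}
```

With a declared mode, the component that performs the broadcast owns the construction logic, so the join operator no longer needs to ship a function to it.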
|
Retest this please |
|
Test build #51418 has finished for PR 11083 at commit
|
# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala
#   sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala
|
Test build #51596 has finished for PR 11083 at commit
|
 */
case class LeftSemiJoinBNL(
-    streamed: SparkPlan, broadcast: SparkPlan, condition: Option[Expression])
+    left: SparkPlan, right: SparkPlan, condition: Option[Expression])
Why did you make this change (streamed -> left, broadcast -> right)? This makes the variable names more confusing.
Yeah, I'll revert that.
|
I'm going to review this more carefully tonight. |
|
@hvanhovell when you get a chance, please update the description if it merits any change. |
|
Test build #51605 has finished for PR 11083 at commit
|
/**
 * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are
 * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).
 */
I'd move this and IdentityBroadcastMode into a new file.
|
This looks pretty good actually. |
|
@rxin I agree that this is stretching the definitions of both |
|
Test build #51635 has finished for PR 11083 at commit
|
|
Test build #51637 has finished for PR 11083 at commit
|
|
Thanks. I'm going to merge this. |
Quite a few Spark SQL join operators broadcast one side of the join to all nodes. There are a few problems with this:
This PR defines both a BroadcastDistribution and a BroadcastPartitioning; these contain a BroadcastMode. The BroadcastMode defines the way in which we transform the Array of InternalRows into an index. We currently support the following BroadcastModes: … Set. … HashedRelation, and broadcasts this index.
To match this distribution we implement a BroadcastExchange operator which will perform the broadcast for us, and have EnsureRequirements plan this operator. The old Exchange operator has been renamed to ShuffleExchange in order to clearly separate shuffled and broadcast exchanges. Finally, the classes in Exchange.scala have been moved to a dedicated package.
cc @rxin @davies
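The planning step described here (a join declares the broadcast it requires, and a rule inserts an exchange when the child does not already provide it) can be sketched with a toy plan tree. Everything below is a simplified assumption for illustration: the `Plan` hierarchy, the `broadcastAs` field and this `ensureRequirements` function are hypothetical and are not Spark's actual classes or signatures.

```scala
object EnsureBroadcastSketch {
  // Hypothetical stand-ins for the modes named in the description.
  sealed trait BroadcastMode
  case object IdentityMode extends BroadcastMode
  final case class HashedMode(keys: Seq[String]) extends BroadcastMode

  // Toy physical plan; each node reports the mode it is already broadcast in.
  sealed trait Plan { def broadcastAs: Option[BroadcastMode] }
  final case class Scan(table: String) extends Plan {
    val broadcastAs: Option[BroadcastMode] = None
  }
  final case class BroadcastExchange(mode: BroadcastMode, child: Plan) extends Plan {
    val broadcastAs: Option[BroadcastMode] = Some(mode)
  }
  final case class BroadcastJoin(streamed: Plan, broadcast: Plan, required: BroadcastMode)
      extends Plan {
    val broadcastAs: Option[BroadcastMode] = None
  }

  // Wrap the broadcast side in an exchange only when its current mode does
  // not already satisfy what the join requires.
  def ensureRequirements(plan: Plan): Plan = plan match {
    case BroadcastJoin(streamed, broadcast, required) =>
      val child = ensureRequirements(broadcast)
      val satisfied =
        if (child.broadcastAs.contains(required)) child
        else BroadcastExchange(required, child)
      BroadcastJoin(ensureRequirements(streamed), satisfied, required)
    case BroadcastExchange(mode, child) =>
      BroadcastExchange(mode, ensureRequirements(child))
    case scan: Scan => scan
  }
}
```

In this sketch, running the rule a second time leaves the plan unchanged, because the inserted exchange already reports the required mode.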