-
Notifications
You must be signed in to change notification settings - Fork 29.3k
[SPARK-41398][SQL] Relax constraints on Storage-Partitioned Join when partition keys after runtime filtering do not match #38924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -81,18 +81,21 @@ case class BatchScanExec( | |
|
|
||
| val newRows = new InternalRowSet(p.expressions.map(_.dataType)) | ||
| newRows ++= newPartitions.map(_.asInstanceOf[HasPartitionKey].partitionKey()) | ||
| val oldRows = p.partitionValuesOpt.get | ||
|
|
||
| if (oldRows.size != newRows.size) { | ||
| throw new SparkException("Data source must have preserved the original partitioning " + | ||
| "during runtime filtering: the number of unique partition values obtained " + | ||
| s"through HasPartitionKey changed: before ${oldRows.size}, after ${newRows.size}") | ||
| val oldRows = p.partitionValuesOpt.get.toSet | ||
| // We require the new number of partition keys to be equal or less than the old number | ||
| // of partition keys here. In the case of less than, empty partitions will be added for | ||
| // those missing keys that are not present in the new input partitions. | ||
| if (oldRows.size < newRows.size) { | ||
| throw new SparkException("During runtime filtering, data source must either report " + | ||
| "the same number of partition keys, or a subset of partition keys from the " + | ||
| s"original. Before: ${oldRows.size} partition keys. After: ${newRows.size} " + | ||
| "partition keys") | ||
| } | ||
|
|
||
| if (!oldRows.forall(newRows.contains)) { | ||
| throw new SparkException("Data source must have preserved the original partitioning " + | ||
| "during runtime filtering: the number of unique partition values obtained " + | ||
| s"through HasPartitionKey remain the same but do not exactly match") | ||
| if (!newRows.forall(oldRows.contains)) { | ||
| throw new SparkException("During runtime filtering, data source must not report new " + | ||
| "partition keys that are not present in the original partitioning.") | ||
| } | ||
|
|
||
| groupPartitions(newPartitions).get.map(_._2) | ||
|
|
@@ -114,8 +117,21 @@ case class BatchScanExec( | |
| // return an empty RDD with 1 partition if dynamic filtering removed the only split | ||
| sparkContext.parallelize(Array.empty[InternalRow], 1) | ||
| } else { | ||
| var finalPartitions = filteredPartitions | ||
|
|
||
| outputPartitioning match { | ||
| case p: KeyGroupedPartitioning => | ||
| val partitionMapping = finalPartitions.map(s => | ||
| s.head.asInstanceOf[HasPartitionKey].partitionKey() -> s).toMap | ||
| finalPartitions = p.partitionValuesOpt.get.map { partKey => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we move this logic to
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm any obvious advantage of this? It looks the same to me 😅
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's better to centralize the related code. This empty filling is quite related to the
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was hoping to re-use the logic in #38950 too, which is not related to runtime filtering. If moving this close to the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah then it makes sense to keep it here. |
||
| // Use empty partition for those partition keys that are not present | ||
| partitionMapping.getOrElse(partKey, Seq.empty) | ||
| } | ||
| case _ => | ||
| } | ||
|
|
||
| new DataSourceRDD( | ||
| sparkContext, filteredPartitions, readerFactory, supportsColumnar, customMetrics) | ||
| sparkContext, finalPartitions, readerFactory, supportsColumnar, customMetrics) | ||
| } | ||
| postDriverMetrics() | ||
| rdd | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we use
InternalRowSet?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops you're right! I forgot about
InternalRowSet. Let me create a follow-up.