-
Notifications
You must be signed in to change notification settings - Fork 29.3k
[SPARK-56395][SQL] Add NEAREST BY top-K ranking join (catalyst-side) #55629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
2472cf4
bd534c4
b2e11ee
808afff
8719eb1
d1f1378
b692b26
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -203,6 +203,33 @@ private[sql] object QueryParsingErrors extends DataTypeErrorsBase { | |||||
| ctx) | ||||||
| } | ||||||
|
|
||||||
| def nearestByJoinWithLateralUnsupportedError(ctx: ParserRuleContext): Throwable = { | ||||||
| new ParseException( | ||||||
| errorClass = "UNSUPPORTED_FEATURE.LATERAL_JOIN_NEAREST_BY", | ||||||
| messageParameters = Map.empty, | ||||||
| ctx) | ||||||
| } | ||||||
|
|
||||||
| def unsupportedNearestByJoinTypeError( | ||||||
| ctx: ParserRuleContext, | ||||||
| joinType: String, | ||||||
| supported: String): Throwable = { | ||||||
| new ParseException( | ||||||
| errorClass = "NEAREST_BY_JOIN.UNSUPPORTED_JOIN_TYPE", | ||||||
| messageParameters = Map("joinType" -> toSQLStmt(joinType), "supported" -> supported), | ||||||
| ctx) | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Use
Suggested change
(Will need a
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. QueryParsingErrors lives in sql/api, which can't depend on sql/catalyst (where NearestByJoinType is defined) I Now i have the constant that is passed through as a param from AstBuilder rather than referenced directly here. Hope it is okay ?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you move |
||||||
| } | ||||||
|
|
||||||
| def nearestByJoinNumResultsOutOfRangeError( | ||||||
| ctx: ParserRuleContext, | ||||||
| numResults: String, | ||||||
| max: Int): Throwable = { | ||||||
| new ParseException( | ||||||
| errorClass = "NEAREST_BY_JOIN.NUM_RESULTS_OUT_OF_RANGE", | ||||||
| messageParameters = Map("numResults" -> numResults, "min" -> "1", "max" -> max.toString), | ||||||
| ctx) | ||||||
| } | ||||||
|
|
||||||
| def repetitiveWindowDefinitionError(name: String, ctx: WindowClauseContext): Throwable = { | ||||||
| new ParseException( | ||||||
| errorClass = "INVALID_SQL_SYNTAX.REPETITIVE_WINDOW_DEFINITION", | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -657,6 +657,34 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString | |
| messageParameters = Map.empty) | ||
| } | ||
|
|
||
| // Reject streaming inputs early. The optimizer rewrite is built around an | ||
| // unconditioned cross-product fed into a global `Aggregate` keyed by a per-row | ||
| // identifier (`__qid`). That shape doesn't compose cleanly with structured-streaming | ||
| // semantics: a stateful aggregate keyed by a freshly-generated identifier accumulates | ||
| // state indefinitely (every batch creates new keys, old keys never match again) and a | ||
| // cross-product against a streaming right side has no bounded state model today. | ||
| // Failing at analysis time is clearer than letting either fail at runtime. Streaming | ||
| // support is tracked as a follow-up; resolving it likely comes from a different | ||
| // grouping strategy or a dedicated physical operator. | ||
| case j: NearestByJoin if j.isStreaming => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the streaming guard: the current comment frames the issue as "MID is per-batch only", but MID itself is fine within a batch - the real blocker is that the rewrite uses a global Aggregate keyed by The MID is just an implementation detail (we only need a per-row group key), so streaming support doesn't have to wait on a streaming-aware MID. A few directions for the follow-up:
Happy to leave the guard in this PR; just suggesting we update the comment to reflect the actual reason so future-us isn't misled.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @zhidongqu-db Thanks for the suggestion. I would try to address this in a folllow-up. |
||
| j.failAnalysis( | ||
| errorClass = "NEAREST_BY_JOIN.STREAMING_NOT_SUPPORTED", | ||
| messageParameters = Map.empty) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removing the
The user wrote The new test golden output ( Could this be gated here, alongside case j: NearestByJoin if !conf.crossJoinEnabled =>
j.failAnalysis(
errorClass = "NEAREST_BY_JOIN.CROSS_JOIN_NOT_ENABLED",
messageParameters = Map.empty)plus a new sub-condition explaining that NEAREST BY is implemented as a bounded cross-product and pointing the user at
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can have a follow-up on this
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @gengliangwang Thanks !! Will follow-up on this. |
||
|
|
||
| case j @ NearestByJoin(_, _, _, _, _, rankingExpression, _) | ||
| if !RowOrdering.isOrderable(rankingExpression.dataType) => | ||
| j.failAnalysis( | ||
| errorClass = "NEAREST_BY_JOIN.NON_ORDERABLE_RANKING_EXPRESSION", | ||
| messageParameters = Map( | ||
| "expression" -> toSQLExpr(rankingExpression), | ||
| "type" -> toSQLType(rankingExpression.dataType))) | ||
|
|
||
| case j @ NearestByJoin(_, _, _, false, _, rankingExpression, _) | ||
| if !rankingExpression.deterministic => | ||
| j.failAnalysis( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we have to fail this case?
I view them as
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess this depends on how we define EXACT semantic here. We explicitly mentioned in the SPIP that EXACT with non-deterministic ordering expr should fail. The intention was to have the EXACT keyword express the semantic of deterministic ordering given a deterministic input and scoring expr. If the scoring expr is not deterministic in the first place - e.g. LLM generated scores, the query would fail and user should use APPROX where the keyword explicitly does not imply deterministic results |
||
| errorClass = "NEAREST_BY_JOIN.EXACT_WITH_NONDETERMINISTIC_EXPRESSION", | ||
| messageParameters = Map("expression" -> toSQLExpr(rankingExpression))) | ||
|
|
||
| case a: Aggregate => | ||
| a.groupingExpressions.foreach( | ||
| expression => | ||
|
|
@@ -949,6 +977,17 @@ trait CheckAnalysis extends LookupCatalog with QueryErrorsBase with PlanToString | |
| context = j.origin.getQueryContext, | ||
| summary = j.origin.context.summary) | ||
|
|
||
| case j: NearestByJoin if !j.duplicateResolved => | ||
| val conflictingAttributes = | ||
| j.left.outputSet.intersect(j.right.outputSet).map(toSQLExpr(_)).mkString(", ") | ||
| throw SparkException.internalError( | ||
| msg = s""" | ||
| |Failure when resolving conflicting references in ${j.nodeName}: | ||
| |${planToString(plan)} | ||
| |Conflicting attributes: $conflictingAttributes.""".stripMargin, | ||
| context = j.origin.getQueryContext, | ||
| summary = j.origin.context.summary) | ||
|
|
||
| // TODO: although map type is not orderable, technically map type should be able to be | ||
| // used in equality comparison, remove this type check once we support it. | ||
| case o if mapColumnInSetOperation(o).isDefined => | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This covers explicit LATERAL JOIN. Do we care about lateral column alias usage for queries over the results of the nearest-neighbor join as well, or is that orthogonal?