[SPARK-48845][SQL] GenericUDF catch exceptions from children#47268
[SPARK-48845][SQL] GenericUDF catch exceptions from children#47268jackylee-ch wants to merge 12 commits into
Conversation
f097982 to
eb2b0a2
Compare
…xception_from_child_func
|
cc @cloud-fan @yaooqinn @panbingkun FYI |
|
Can we explain the execution path before Spark 3.5? Is this PR trying to follow the previous execution path? |
For Spark 3.3, the |
|
Is it possible to combine the code generation and non-code generation code? |
AFAIK, it's hard to set lazy function to DeferredObject. I have tried with lambda functions and inner classes to put Any better ideas? |
| def setArg(index: Int, arg: Any): Unit = | ||
| deferredObjects(index).asInstanceOf[DeferredObjectAdapter].set(arg) | ||
|
|
||
| def setFuncArg(index: Int, arg: () => Any): Unit = |
There was a problem hiding this comment.
It seems that setFuncArg and setException can be merged?
There was a problem hiding this comment.
It is a little ugly that we use inner class in codegen. If we use setFuncArg, the codegen for exceptions would be follow code.
s"""
|try {
| ${eval.code}
| if (${eval.isNull}) {
| $refEvaluator.setArg($i, null);
| } else {
| $refEvaluator.setArg($i, ${eval.value});
| }
|} catch (Exception exp) {
| final exception = exp;
| $refEvaluator.setFundArg($i, new scala.Function0<Object>() {
| public Object apply() {
| throw exception;
| }
| });
|}
|""".stripMargin
| | $refEvaluator.setArg($i, null); | ||
| | } else { | ||
| | $refEvaluator.setArg($i, ${eval.value}); | ||
| | } |
There was a problem hiding this comment.
- It seems that
catch (...)is only related to${eval.code} - Why is it
Exceptioninstead ofThrowable?
There was a problem hiding this comment.
It seems that catch (...) is only related to ${eval.code}
Yes, we should catch the exceptions from child's executions.
Why is it Exception instead of Throwable?
The Exception can catch most of throwable problems. It's ok to be change to Throwable.
|
Will it cover |
Yes, it cover the |
| |} else { | ||
| | $refEvaluator.setArg($i, ${eval.value}); | ||
| |try { | ||
| | ${eval.code} |
There was a problem hiding this comment.
This is slightly different from the interpreted path. In codegen we still eagerly execute the function inputs, but delay the throw of errors. I understand that it's hard to be truly lazy in codegen, can we update the interpreted path instead to make it the same as codegen?
There was a problem hiding this comment.
Sure and done. I have updated the code for non-codegen mode and the description for this pr.
| test("SPARK-48845: GenericUDF catch exceptions from child UDFs") { | ||
| withTable("test_catch_exception") { | ||
| Seq("9", "9-1").toDF("a").write.saveAsTable("test_catch_exception") | ||
| sql("create temporary function udf_exception as " + |
There was a problem hiding this comment.
let's wrap with withUserDefinedFunction to clean up the functions at the end.
|
Regarding the failure of GA, I guess you need to rebase master. 😄 |
…_exception_from_child_func
|
@panbingkun any other comments? |
No, go ahead. |
…ThrowException.java
…ThrowException.java
|
|
Oh my bad @LuciferYang |
### What changes were proposed in this pull request? This pr is trying to fix the syntax issues with GenericUDF since 3.5.0. The problem arose from DeferredObject currently passing a value instead of a function, which prevented users from catching exceptions in GenericUDF, resulting in semantic differences. Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch the exception and return a null value. However, currently, any exception encountered by udf_exception will cause the program to fail. ``` select udf_catch_exception(udf_exception(col1)) from table ``` ### Why are the changes needed? For before Spark 3.5, we directly made the GenericUDF's DeferredObject lazy and evaluated the children in `function.evaluate(deferredObjects)`. Now, we would run the children's code first. If an exception is thrown, we would make it lazy to GenericUDF's DeferredObject. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly added UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47268 from jackylee-ch/generic_udf_catch_exception_from_child_func. Lead-authored-by: jackylee-ch <lijunqing@baidu.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 236d957) Signed-off-by: Kent Yao <yao@apache.org>
|
Thank you all. Merged to master and branch-3.5 for the coming 3.5.2 |
|
Thanks for your review. @LuciferYang @yaooqinn @cloud-fan @panbingkun |
### What changes were proposed in this pull request? This pr is trying to fix the syntax issues with GenericUDF since 3.5.0. The problem arose from DeferredObject currently passing a value instead of a function, which prevented users from catching exceptions in GenericUDF, resulting in semantic differences. Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch the exception and return a null value. However, currently, any exception encountered by udf_exception will cause the program to fail. ``` select udf_catch_exception(udf_exception(col1)) from table ``` ### Why are the changes needed? For before Spark 3.5, we directly made the GenericUDF's DeferredObject lazy and evaluated the children in `function.evaluate(deferredObjects)`. Now, we would run the children's code first. If an exception is thrown, we would make it lazy to GenericUDF's DeferredObject. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly added UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47268 from jackylee-ch/generic_udf_catch_exception_from_child_func. Lead-authored-by: jackylee-ch <lijunqing@baidu.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request? This pr is trying to fix the syntax issues with GenericUDF since 3.5.0. The problem arose from DeferredObject currently passing a value instead of a function, which prevented users from catching exceptions in GenericUDF, resulting in semantic differences. Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch the exception and return a null value. However, currently, any exception encountered by udf_exception will cause the program to fail. ``` select udf_catch_exception(udf_exception(col1)) from table ``` ### Why are the changes needed? For before Spark 3.5, we directly made the GenericUDF's DeferredObject lazy and evaluated the children in `function.evaluate(deferredObjects)`. Now, we would run the children's code first. If an exception is thrown, we would make it lazy to GenericUDF's DeferredObject. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly added UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47268 from jackylee-ch/generic_udf_catch_exception_from_child_func. Lead-authored-by: jackylee-ch <lijunqing@baidu.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 236d957) Signed-off-by: Kent Yao <yao@apache.org>
What changes were proposed in this pull request?
This pr is trying to fix the syntax issues with GenericUDF since 3.5.0. The problem arose from DeferredObject currently passing a value instead of a function, which prevented users from catching exceptions in GenericUDF, resulting in semantic differences.
Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch the exception and return a null value. However, currently, any exception encountered by udf_exception will cause the program to fail.
Why are the changes needed?
For before Spark 3.5, we directly made the GenericUDF's DeferredObject lazy and evaluated the children in
function.evaluate(deferredObjects).Now, we would run the children's code first. If an exception is thrown, we would make it lazy to GenericUDF's DeferredObject.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Newly added UT.
Was this patch authored or co-authored using generative AI tooling?
No.