[SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks by xiaochen-zhou · Pull Request #40707 · apache/spark

xiaochen-zhou · 2023-04-07T17:33:11Z

What changes were proposed in this pull request?

This PR updates the task retry logic so that a task is not retried when its exception carries an error class that indicates a user error.

Why are the changes needed?

As discussed in #40655 (comment), tasks that fail because of exceptions generated by AssertNotNull should not be retried.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This PR comes with tests.

mridulm · 2023-04-08T07:07:09Z

+      messageParameters = Map(
+        "field" -> errMsg
+      ),
+      cause = new NullPointerException)


I have not followed the error class changes closely, but this is counterintuitive: why are we not passing the actual exception here instead of creating a dummy exception?

Thank you very much for your review. I have modified the code; could you re-review it when you are free and leave some comments?
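The review point above can be illustrated with a minimal Scala sketch. It assumes a hypothetical ClassifiedException and a made-up error class name EXAMPLE_NULL_FIELD; neither is Spark's actual API. The idea is only that the exception that was really thrown is passed along as the cause, so its message and stack trace survive, instead of attaching a freshly constructed NullPointerException.

```scala
object CauseSketch {
  // Hypothetical stand-in for an exception that carries an error class.
  // This is NOT Spark's SparkException; names and fields are assumptions.
  final class ClassifiedException(
      val errorClass: String,
      val messageParameters: Map[String, String],
      cause: Throwable)
    extends RuntimeException(s"[$errorClass] ${messageParameters.mkString(", ")}", cause)

  // Checks a value for null and wraps the real NullPointerException as the cause.
  def requireNotNull(value: AnyRef, field: String): AnyRef =
    try {
      java.util.Objects.requireNonNull(value, s"null value in field $field")
    } catch {
      case npe: NullPointerException =>
        // Pass the caught exception through rather than `new NullPointerException`.
        throw new ClassifiedException("EXAMPLE_NULL_FIELD", Map("field" -> field), npe)
    }
}
```

With this shape, a caller that inspects getCause sees the original failure rather than a placeholder.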

HyukjinKwon · 2023-04-10T02:31:59Z

cc @Ngone51 @jiangxb1987 too FYI

aokolnychyi · 2023-04-10T18:03:02Z

+      "User exception: <msg>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3044" : {


Shall we define a proper error class? There is NOT_NULL_CONSTRAINT_VIOLATION that seems relevant. It is currently used to validate NOT NULL constraints for array elements and map values.

aokolnychyi · 2023-04-10T18:11:23Z

            info.id, taskSet.id, tid, ef.description))
          return
        }
+        if (ef.className == classOf[SparkUserException].getName) {


I wonder whether this logic can be generic and apply to all exceptions that extend SparkThrowable and have an error class defined?

cc @cloud-fan @gengliangwang @HyukjinKwon

+1, we should skip retry if the exception is SparkThrowable and the error class is present.

One thing we should think about is how to differentiate user-facing errors from user-triggered errors. We may still need to retry on user-facing errors, e.g. a file read error, which can be transient.

One idea is to have a special prefix for error classes that should still trigger retry, such as file read error and OOM, which shouldn't be many.

Or we can create a base trait for transient errors.
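A minimal Scala sketch of the base-trait idea, under stated assumptions: ClassifiedError, TransientError, UserError, and FileReadError are hypothetical stand-ins and are not Spark classes. Failures that carry an error class are not retried unless they also mix in the marker trait, while failures without an error class keep the existing retry behavior.

```scala
object TransientTraitSketch {
  // Hypothetical stand-in for "a throwable that may carry an error class".
  trait ClassifiedError { def errorClass: Option[String] }

  // Hypothetical marker trait for errors that may succeed on a retry.
  trait TransientError

  final class UserError(val errorClass: Option[String], msg: String)
    extends RuntimeException(msg) with ClassifiedError

  final class FileReadError(val errorClass: Option[String], msg: String)
    extends RuntimeException(msg) with ClassifiedError with TransientError

  // Decide whether a failed task attempt should be retried.
  def shouldRetry(t: Throwable): Boolean = t match {
    case _: TransientError => true                          // explicitly marked as retryable
    case c: ClassifiedError if c.errorClass.isDefined => false // classified, not transient
    case _ => true                                          // unclassified: keep retrying
  }
}
```

Because the marker is a trait rather than a class, it can be mixed into an exception that already has a superclass.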

Thank you very much for the review, and sorry for the late response. I will try to modify the code according to your suggestions, @aokolnychyi @cloud-fan.

aokolnychyi · 2023-04-10T18:16:07Z

+/**
+ * User error exception thrown from Spark with an error class.
+ */
+private[spark] class SparkUserException(


Question: Are we sure a custom exception is needed for this case? Is there any existing exception we can reuse with NPE as cause?

If we want to have a brand new exception, what about SparkNotNullConstraintViolationException to be more specific? I guess it will depend on whether we want to skip retries only for this exception type as opposed to all Spark exceptions with known error codes.

It's too complicated to use both error class and exception type to differentiate errors. I think in principle we should always use SparkException with different error classes, except for some places that need to be compatible with old code.

Thank you very much for the review. I will try to change the code according to this idea.

Thank you very much for the review. My understanding is that we want to skip the retry logic for all user-triggered errors, not only NPEs, so I defined a new exception, SparkUserException.

There can be thousands of user-triggered errors, while the transient errors are likely to number fewer than 10. I think it's better to define a new exception for transient errors.

I see, thank you. I will try to change the code according to this idea.

During implementation, I found that if we adopt the idea of defining a new exception, the exception type associated with some error classes may have to change. For example, the exception type for _UNABLE_TO_ACQUIRE_MEMORY would change from SparkOutOfMemoryError to SparkTransientError, but SparkOutOfMemoryError is caught by type in many places, and SparkOutOfMemoryError cannot extend SparkTransientError. For example:

  } catch (SparkOutOfMemoryError e) {
    // should have trigger spilling
    if (!inMemSorter.hasSpaceForAnotherRecord()) {
      logger.error("Unable to grow the pointer array");
      throw e;
    }

So I think having a special prefix may be the better idea. I am not sure whether this is right; I hope you can leave some comments when you have time. @cloud-fan

xiaochen-zhou · 2023-04-17T13:30:36Z

@cloud-fan suggested two ways to differentiate user-facing errors from user-triggered errors: a special prefix, or a base trait for transient errors. While implementing this, I concluded that the special prefix is the better option.
I defined a new SparkThrowable#isTransientError method, modeled on SparkThrowable#isInternalError, and the retry logic is skipped or kept based on the return value of SparkThrowable#isTransientError.

  def isInternalError(errorClass: String): Boolean = {
    errorClass == "INTERNAL_ERROR"
  }

  def isTransientError(errorClass: String): Boolean = {
    errorClass.startsWith("TRANSIENT")
  }

Could you re-review the code when you are free and leave some comments? @cloud-fan @aokolnychyi
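To illustrate how the prefix check could feed the retry decision, here is a small Scala sketch. Only isTransientError mirrors the snippet above; shouldRetry and the Option-based error class are assumptions made for illustration, not the code in this PR.

```scala
object PrefixRetrySketch {
  // Same rule as the snippet above: transient error classes share a name prefix.
  def isTransientError(errorClass: String): Boolean =
    errorClass.startsWith("TRANSIENT")

  // Hypothetical decision helper. `None` models an exception without an error class,
  // which keeps the existing behavior of being retried.
  def shouldRetry(errorClass: Option[String]): Boolean = errorClass match {
    case Some(ec) => isTransientError(ec)
    case None => true
  }
}
```

One consequence of this design is that the retry behavior becomes part of the error class name, which is visible in user-facing error messages.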

aokolnychyi · 2023-04-18T06:19:02Z

    "sqlState" : "22003"
  },
-  "INVALID_BUCKET_FILE" : {
+  "TRANSIENT_INVALID_BUCKET_FILE" : {


Optional: Instead of changing the error class name (since the error class is shown in the error message), I'd consider adding a field that indicates whether the error is transient (i.e. should be retried), similar to the sqlState field we have today. We would need more feedback from folks who worked on the error framework.

I am now changing the code based on this idea, and the next step may be to seek help from people who worked on the error framework.

cloud-fan · 2023-04-18T08:53:31Z

OK, I'm on the fence now. On one hand, the number of transient errors should be much smaller than the number of user-triggered errors, so it's better to find out these transient errors and mark them. On the other hand, not retrying the task can be a regression that leads to job failure, so we should make sure we only skip task retry when the error is definitely user-triggered.

To be conservative, now I'm leaning towards picking some errors and marking them as "can skip task retry". I like the idea from @aokolnychyi that we can add a JSON field for it.

mridulm · 2023-04-18T09:07:37Z

I would be on the conservative side and skip retry only when we are absolutely certain that a retry will not help.
SerDe failures, for example, are good candidates (and are already handled), along with similar cases. Note that for non-deterministic tasks, a retry can succeed even though an earlier attempt failed with a user exception.
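The non-determinism concern can be made concrete with a toy Scala sketch. Everything here is a simulation under stated assumptions: runWithRetries and flakyTask are hypothetical helpers, and the "different input on a later attempt" is faked by branching on the attempt number. It is not how Spark schedules tasks.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Try}

object NonDeterministicRetrySketch {
  // Runs `task` with the attempt number, retrying on any failure up to maxAttempts.
  @tailrec
  def runWithRetries[A](maxAttempts: Int, attempt: Int = 1)(task: Int => A): Try[A] =
    Try(task(attempt)) match {
      case Failure(_) if attempt < maxAttempts =>
        runWithRetries(maxAttempts, attempt + 1)(task)
      case result => result
    }

  // Simulated non-deterministic task: the first attempt happens to see a zero
  // divisor (a "user" error), while later attempts see different data.
  def flakyTask(attempt: Int): Int = {
    val divisor = if (attempt == 1) 0 else 2
    10 / divisor // throws ArithmeticException when divisor is 0
  }
}
```

With a single attempt the simulated job fails on the division error, while allowing retries lets it succeed, which is the regression risk raised above.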

aokolnychyi · 2023-04-18T18:45:16Z

+1 for being safe

xiaochen-zhou · 2023-04-19T00:10:39Z

I'm trying to change the code now

xiaochen-zhou · 2023-04-19T00:45:18Z

I see, I'm trying to change the code now

xiaochen-zhou · 2023-05-06T00:07:35Z

I have modified the code. Could you re-review it when you are free and leave some comments? @cloud-fan @aokolnychyi @mridulm

xiaochen-zhou · 2023-05-07T18:11:30Z

I modified the code according to the suggestions from @cloud-fan and @aokolnychyi.
I added the isTransient attribute to some error classes, such as:

AMBIGUOUS_LATERAL_COLUMN_ALIAS
CANNOT_PARSE_DECIMAL
DATATYPE_MISMATCH
DIVIDE_BY_ZERO
_LEGACY_ERROR_TEMP_3043(npe)

  "CANNOT_PARSE_DECIMAL" : {
    "message" : [
      "Cannot parse decimal."
    ],
    "sqlState" : "22018",
    "isTransient" : false
  },

When these errors occur, the retry logic is skipped.

  if (!ef.isTransient) {
    // The error class marks this failure as non-transient, so abort instead of retrying.
    logError(s"$task has a non-transient exception: ${ef.description}; not retrying")
    sched.dagScheduler.taskEnded(tasks(index), reason, null, accumUpdates, metricPeaks, info)
    abort(s"$task has a non-transient exception: ${ef.description}", ef.exception)
    return
  }

I hope you can leave some comments when you have time, @cloud-fan @aokolnychyi @mridulm. Thanks a lot.
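As a rough Scala sketch of how a per-error-class isTransient flag could drive this decision, assuming a hypothetical in-memory registry in place of the real error-class JSON: ErrorInfo, registry, Decision, and decide are illustrative names, and the default of isTransient = true for unlisted classes is an assumption chosen to match the conservative stance discussed earlier, not a confirmed detail of this PR.

```scala
object IsTransientSketch {
  // Minimal stand-in for one error-class entry; only the flag used here.
  // Assumption: entries that do not set the flag default to transient (retry).
  final case class ErrorInfo(isTransient: Boolean = true)

  // Hypothetical registry; in the PR this information lives in the error-class JSON.
  val registry: Map[String, ErrorInfo] = Map(
    "CANNOT_PARSE_DECIMAL" -> ErrorInfo(isTransient = false),
    "DIVIDE_BY_ZERO" -> ErrorInfo(isTransient = false),
    "INVALID_BUCKET_FILE" -> ErrorInfo()
  )

  sealed trait Decision
  case object Retry extends Decision
  case object AbortTaskSet extends Decision

  // Unknown error classes and exceptions without an error class are retried.
  def decide(errorClass: Option[String]): Decision = {
    val transient = errorClass.flatMap(registry.get).forall(_.isTransient)
    if (transient) Retry else AbortTaskSet
  }
}
```

Only error classes explicitly marked with isTransient = false lead to an abort, so adding a new error class never silently disables retries.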

gabry-lab · 2023-05-14T03:25:21Z

useful feature, any updates here?

xiaochen-zhou · 2023-05-15T00:27:32Z

I modified the code according to the suggestions from @cloud-fan, @aokolnychyi, and @mridulm. The next step may be to seek help from people who worked on the error framework. Could you suggest what to do next?

github-actions · 2023-08-24T00:16:35Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks

6552f19

github-actions Bot added CORE SQL labels Apr 7, 2023

update test case

554de2b

mridulm reviewed Apr 8, 2023

update error_class

2bc025e

xiaochen-zhou added 2 commits April 10, 2023 22:35

update test case

b0e477f

update test case

be2b7f4

aokolnychyi reviewed Apr 10, 2023

xiaochen-zhou added 2 commits April 17, 2023 00:42

add isTransient()

ac130dc

add isTransientError

1977035

xiaochen-zhou added 6 commits April 17, 2023 21:57

scala style

7ef342d

scala style

f18ec1a

assertNotNullException

ec0a92c

TRANSIENT_LEGACY_ERROR_TEMP_2107

25d5062

style

62f0f72

ProblemFilters.exclude[DirectMissingMethodProblem]

844f919

github-actions Bot added the BUILD label Apr 17, 2023

xiaochen-zhou added 3 commits April 18, 2023 07:29

ProblemFilters.exclude[DirectMissingMethodProblem]

5ca6fce

ProblemFilters.exclude[DirectMissingMethodProblem]

2b1feb1

test

7dc25ee

aokolnychyi reviewed Apr 18, 2023

github-actions Bot added the CONNECT label May 4, 2023

xiaochen-zhou added 5 commits May 5, 2023 09:01

error-classes add field "isTransient"

ebbf711

error-classes add field "isTransient"

457f46c

error-classes add field "isTransient"

3819959

Merge branch 'master' of https://github.com/clownxc/spark into clownx…

0590166

…c-SPARK-43033 # Conflicts: # project/MimaExcludes.scala

MimaExcludes

8be1596

github-actions Bot removed the BUILD label May 5, 2023

xiaochen-zhou added 2 commits May 5, 2023 22:14

style

d2fa8bf

update

fe96c22

github-actions Bot added the BUILD label May 5, 2023

update pyspark test

4193866

github-actions Bot added the PYTHON label May 5, 2023

xiaochen-zhou added 3 commits May 6, 2023 01:39

update test

73d7c26

ProblemFilters.exclude

c568607

format import

aca109d

xiaochen-zhou added 3 commits May 6, 2023 21:58

ProblemFilters.exclude

6dfdbfd

Avoid task retries due to AssertNotNull checks

ed7db55

update istranisent default value

332936c

github-actions Bot removed the PYTHON label May 7, 2023

add test non-transient errors lead to task set abortion

8c6cdf5

update test

bec61bf

xiaochen-zhou requested review from aokolnychyi, cloud-fan and mridulm May 9, 2023 08:01

github-actions Bot added the Stale label Aug 24, 2023

github-actions Bot closed this Aug 25, 2023
