[SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks #40707

Closed
xiaochen-zhou wants to merge 34 commits into apache:master from xiaochen-zhou:clownxc-SPARK-43033

@xiaochen-zhou

Contributor

What changes were proposed in this pull request?

This PR updates the task retry logic so that a task is not retried when the exception carries an error class that indicates a user error.

Why are the changes needed?

As discussed in #40655 (comment), tasks that fail because of exceptions generated by AssertNotNull should not be retried.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This PR comes with tests.

messageParameters = Map(
"field" -> errMsg
),
cause = new NullPointerException)

Contributor

I have not followed the error class changes closely, but this is counterintuitive: why are we not passing the actual exception here instead of creating a dummy exception?

Contributor Author

> I have not followed the error class changes closely, but this is counterintuitive: why are we not passing the actual exception here instead of creating a dummy exception?

Thank you very much for your review. I have modified the code; could you re-review it when you are free and leave some comments?

@HyukjinKwon

Member

cc @Ngone51 @jiangxb1987 too FYI

"User exception: <msg>"
]
},
"_LEGACY_ERROR_TEMP_3044" : {

Contributor

Shall we define a proper error class? There is NOT_NULL_CONSTRAINT_VIOLATION that seems relevant. It is currently used to validate NOT NULL constraints for array elements and map values.

info.id, taskSet.id, tid, ef.description))
return
}
if (ef.className == classOf[SparkUserException].getName) {

Contributor

I wonder whether this logic can be generic and apply to all exceptions that extend SparkThrowable and have an error class defined?

cc @cloud-fan @gengliangwang @HyukjinKwon

Contributor

+1, we should skip the retry if the exception is a SparkThrowable and the error class is present.

Contributor

One thing we should think about is how to differentiate user-facing errors from user-triggered errors. We may still need to retry on user-facing errors, e.g. a file read error, which can be transient.

One idea is to have a special prefix for error classes that should still trigger a retry, such as file read errors and OOM; there shouldn't be many of them.

Contributor

Or we can create a base trait for transient errors.
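
A minimal sketch of the base-trait idea, with hypothetical names throughout: `TransientError` is an illustrative marker trait and `ClassifiedError` is a simplified stand-in for SparkThrowable; neither is Spark's actual API. It shows one way the retry decision could treat the marker as an override on top of the "skip the retry when an error class is present" rule.

```scala
// Hypothetical marker for errors that are worth retrying; not part of Spark.
trait TransientError

// Simplified stand-in for SparkThrowable: exposes an optional error class.
trait ClassifiedError {
  def errorClass: Option[String]
}

// A user-triggered error: has an error class and no transient marker.
class NotNullViolation(field: String)
  extends RuntimeException(s"NULL value in non-nullable field $field")
  with ClassifiedError {
  override def errorClass: Option[String] = Some("NOT_NULL_CONSTRAINT_VIOLATION")
}

// A user-facing but possibly transient error: has an error class and the marker.
// The error class name below is made up for illustration.
class FileReadFailure(path: String)
  extends RuntimeException(s"Failed to read $path")
  with ClassifiedError with TransientError {
  override def errorClass: Option[String] = Some("HYPOTHETICAL_FILE_READ_ERROR")
}

object RetryPolicy {
  def shouldRetry(t: Throwable): Boolean = t match {
    case _: TransientError => true                              // explicitly marked as retryable
    case e: ClassifiedError if e.errorClass.isDefined => false  // classified, assumed user-triggered
    case _ => true                                              // unclassified: keep the existing retry behavior
  }
}
```

With these definitions, `RetryPolicy.shouldRetry(new NotNullViolation("id"))` is false, while a `FileReadFailure` or any unclassified exception is still retried. Because the marker is a trait, it can be mixed into a class regardless of that class's superclass.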

Contributor Author

Thank you very much for the review, and sorry for the late response. I will try to modify the code according to your suggestions, @aokolnychyi @cloud-fan.

/**
* User error exception thrown from Spark with an error class.
*/
private[spark] class SparkUserException(

@aokolnychyi aokolnychyi Apr 10, 2023

Contributor

Question: Are we sure a custom exception is needed for this case? Is there any existing exception we can reuse with NPE as cause?

If we want to have a brand new exception, what about SparkNotNullConstraintViolationException to be more specific? I guess it will depend whether we want to skip retries only for this exception type as opposed to all Spark exceptions with known error codes.

Contributor

It's too complicated to use both error class and exception type to differentiate errors. I think in principle we should always use SparkException with different error classes, except for some places that need to be compatible with old code.

@xiaochen-zhou xiaochen-zhou Apr 11, 2023

Contributor Author

> It's too complicated to use both error class and exception type to differentiate errors. I think in principle we should always use SparkException with different error classes, except for some places that need to be compatible with old code.

Thank you very much for the review. I will try to change the code according to this idea.

@xiaochen-zhou xiaochen-zhou Apr 11, 2023

Contributor Author

> Question: Are we sure a custom exception is needed for this case? Is there any existing exception we can reuse with NPE as cause?
>
> If we want to have a brand new exception, what about SparkNotNullConstraintViolationException to be more specific? I guess it will depend whether we want to skip retries only for this exception type as opposed to all Spark exceptions with known error codes.

Thank you very much for the review. My understanding is that we want to skip the retry logic for user-triggered errors in general, not only NPEs, so I defined a new exception, SparkUserException.

Contributor

There can be thousands of user-triggered errors while the transient errors are likely to be less than 10. I think it's better to define a new exception for transient errors.

@xiaochen-zhou xiaochen-zhou Apr 11, 2023

Contributor Author

> There can be thousands of user-triggered errors while the transient errors are likely to be less than 10. I think it's better to define a new exception for transient errors.

I see, thank you. I will try to change the code according to this idea.

@xiaochen-zhou xiaochen-zhou Apr 17, 2023

Contributor Author

> There can be thousands of user-triggered errors while the transient errors are likely to be less than 10. I think it's better to define a new exception for transient errors.

While implementing this, I found that if we define a new exception, the exception type used for some error classes may have to change. For example, for _UNABLE_TO_ACQUIRE_MEMORY the exception type would change from SparkOutOfMemoryError to SparkTransientError, but we need to use SparkOutOfMemoryError in many places (and SparkOutOfMemoryError cannot extend SparkTransientError):

    } catch (SparkOutOfMemoryError e) {
      // should have triggered spilling
      if (!inMemSorter.hasSpaceForAnotherRecord()) {
        logger.error("Unable to grow the pointer array");
        throw e;
      }

So I think having a special prefix may be the better idea. I am not sure whether this is right; I hope you can leave some comments when you have time. @cloud-fan

@xiaochen-zhou

xiaochen-zhou commented Apr 17, 2023

Contributor Author

@cloud-fan suggested two ways to differentiate user-facing errors from user-triggered errors: a special prefix, or a base trait for transient errors. While implementing this, I concluded that a special prefix is the better option.
I defined a new SparkThrowable#isTransientError method, modeled on SparkThrowable#isInternalError, and the retry logic is skipped or kept based on its return value.

  def isInternalError(errorClass: String): Boolean = {
    errorClass == "INTERNAL_ERROR"
  }

  def isTransientError(errorClass: String): Boolean = {
    errorClass.startsWith("TRANSIENT")
  }

Could you re-review the code when you are free and leave some comments? @cloud-fan @aokolnychyi

@github-actions github-actions Bot added the BUILD label Apr 17, 2023
"sqlState" : "22003"
},
- "INVALID_BUCKET_FILE" : {
+ "TRANSIENT_INVALID_BUCKET_FILE" : {

Contributor

Optional: Instead of changing the error class name (since that error class is shown in the error message), I'd consider adding a field to indicate whether the error is transient (i.e. should be retried), similar to the sqlState field we have today. We would need more feedback from folks who worked on the error framework.

Contributor Author

> Optional: Instead of changing the error class name (since that error class is shown in the error message), I'd consider adding a field to indicate whether the error is transient (i.e. should be retried), similar to the sqlState field we have today. We would need more feedback from folks who worked on the error framework.

I am now changing the code based on this idea, and the next step may be to seek help from people who worked on the error framework.

@cloud-fan

Contributor

OK, I'm on the fence now. On one hand, the number of transient errors should be much smaller than the number of user-triggered errors, so it's better to find out these transient errors and mark them. On the other hand, not retrying the task can be a regression that leads to job failure, so we should make sure we only skip task retry when the error is definitely user-triggered.

To be conservative, now I'm leaning towards picking some errors and marking them as "can skip task retry". I like the idea from @aokolnychyi that we can add a JSON field for it.
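
The conservative variant could look roughly like the sketch below. It assumes a hypothetical boolean field (called `skipTaskRetry` here) on each error-class entry; the field name, the `ErrorInfo` type, and the abbreviated messages are illustrative and not Spark's actual error framework. Only errors that are explicitly marked skip the retry; everything else, including unknown error classes, keeps the existing behavior.

```scala
// Hypothetical, simplified view of one entry in the error-class JSON.
// `skipTaskRetry` is an assumed field name; it defaults to false (keep retrying).
final case class ErrorInfo(
    message: String,
    sqlState: Option[String] = None,
    skipTaskRetry: Boolean = false)

object ErrorRegistry {
  // Illustrative entries only; messages are abbreviated.
  private val entries: Map[String, ErrorInfo] = Map(
    "DIVIDE_BY_ZERO" -> ErrorInfo("Division by zero.", Some("22012"), skipTaskRetry = true),
    "INVALID_BUCKET_FILE" -> ErrorInfo("Invalid bucket file.")
  )

  // True only when the error class is known AND explicitly opted in.
  def canSkipRetry(errorClass: Option[String]): Boolean =
    errorClass.flatMap(entries.get).exists(_.skipTaskRetry)
}
```

Defaulting the flag to false means a missing or forgotten annotation can only cause an unnecessary retry, never a skipped one, which matches the concern about regressions.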

@mridulm

mridulm commented Apr 18, 2023

Contributor

I would be on the conservative side and skip the retry only when we are absolutely certain that a retry will not help.
SerDe failures, for example, are good candidates (which are already handled), and similar ... Note that in non-deterministic tasks, a retry can succeed where an earlier attempt failed with a user exception.
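
To illustrate the non-deterministic case, here is a small self-contained simulation. `readValue`, `runTask`, and `runWithRetries` are hypothetical helpers, not Spark APIs; the attempt number stands in for whatever makes the input differ between attempts.

```scala
import scala.util.Try

object NonDeterministicTask {
  // Stand-in for a non-deterministic source: the value seen depends on the attempt.
  def readValue(attempt: Int): Option[String] =
    if (attempt == 0) None else Some("row")

  // Fails the way a NOT NULL check would when the value is missing.
  def runTask(attempt: Int): String =
    readValue(attempt).getOrElse(
      throw new NullPointerException("null value in a non-nullable field"))

  // Runs the task up to maxAttempts times and returns the last outcome.
  def runWithRetries(maxAttempts: Int): Try[String] = {
    var attempt = 0
    var result: Try[String] = Try(runTask(attempt))
    while (result.isFailure && attempt + 1 < maxAttempts) {
      attempt += 1
      result = Try(runTask(attempt))
    }
    result
  }
}
```

In this simulation the first attempt fails with a NullPointerException and the second succeeds, so skipping the retry would turn a recoverable task into a job failure.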

@aokolnychyi

Contributor

+1 for being safe

@xiaochen-zhou

Contributor Author

> OK, I'm on the fence now. On one hand, the number of transient errors should be much smaller than the number of user-triggered errors, so it's better to find out these transient errors and mark them. On the other hand, not retrying the task can be a regression that leads to job failure, so we should make sure we only skip task retry when the error is definitely user-triggered.
>
> To be conservative, now I'm leaning towards picking some errors and marking them as "can skip task retry". I like the idea from @aokolnychyi that we can add a JSON field for it.

I'm trying to change the code now.

@xiaochen-zhou

Contributor Author

> I would be on the conservative side and skip the retry only when we are absolutely certain that a retry will not help. SerDe failures, for example, are good candidates (which are already handled), and similar ... Note that in non-deterministic tasks, a retry can succeed where an earlier attempt failed with a user exception.

I see. I'm trying to change the code now.

@github-actions github-actions Bot removed the BUILD label May 5, 2023
@github-actions github-actions Bot added the BUILD label May 5, 2023
@github-actions github-actions Bot added the PYTHON label May 5, 2023
@xiaochen-zhou

Contributor Author

I have modified the code. Could you re-review it when you are free and leave some comments? @cloud-fan @aokolnychyi @mridulm

@github-actions github-actions Bot removed the PYTHON label May 7, 2023
@xiaochen-zhou

xiaochen-zhou commented May 7, 2023

Contributor Author

Following the suggestions from @cloud-fan and @aokolnychyi, I modified the code.
I added the isTransient attribute to some error classes, such as:

  • AMBIGUOUS_LATERAL_COLUMN_ALIAS
  • CANNOT_PARSE_DECIMAL
  • DATATYPE_MISMATCH
  • DIVIDE_BY_ZERO
  • _LEGACY_ERROR_TEMP_3043 (NPE)

For example:

  "CANNOT_PARSE_DECIMAL" : {
    "message" : [
      "Cannot parse decimal."
    ],
    "sqlState" : "22018",
    "isTransient" : false
  },

When these errors occur, the retry logic is skipped.

    if (!ef.isTransient) {
      // The exception has an error class marked as non-transient, so do not retry.
      logError(s"$task has a non-transient exception: ${ef.description}; not retrying")
      sched.dagScheduler.taskEnded(tasks(index), reason, null, accumUpdates, metricPeaks, info)
      abort(s"$task has a non-transient exception: ${ef.description}", ef.exception)
      return
    }

I hope you can leave some comments when you have time, @cloud-fan @aokolnychyi @mridulm. Thanks a lot.

@gabry-lab

Member

useful feature, any updates here?

@xiaochen-zhou

Contributor Author

> useful feature, any updates here?

I modified the code according to the suggestions from @cloud-fan, @aokolnychyi, and @mridulm. The next step may be to seek help from people who worked on the error framework. Can you give some suggestions on the next steps?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions Bot added the Stale label Aug 24, 2023
@github-actions github-actions Bot closed this Aug 25, 2023
6 participants