revamp was it killed externally question #40141
Changes from all commits
408f34a
22b6976
22afde3
4491a72
e849b5d
07fa7f6
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -313,9 +313,14 @@ def _manage_executor_state( |
| | | and ti.state in self.STATES_COUNT_AS_RUNNING |
| | | ): |
| | | msg = ( |
| | | - f"Executor reports task instance {ti} finished ({state}) although the task says its " |
| | | - f"{ti.state}. Was the task killed externally? Info: {info}" |
| | | + f"The executor reported that the task instance {ti} finished with state {state}, " |
| | | + f"but the task instance's state attribute is {ti.state}. " |
| | | + "This indicates that the task was marked failed by something other than the scheduler. " |
Member
I am wondering if instead of saying "...something other than the scheduler," it should be phrased as "something other than the worker/pod that is actually running the task," or something similar. The scheduler is still an external component and is responsible for failing tasks that are timed out while being stuck in the queue or detecting and killing zombie tasks. In fact, it is the scheduler that marks them as failed, no?
Member
I think you are right that it is not precise. In fact, it's not necessarily the scheduler or executor that marks the task as "failed". It can also be marked as failed from the UI, directly in the DB, and that does not involve the scheduler or executor at all; the process that monitors task execution will then see it and stop the process. So "externally" means that the process that runs the task either failed or has been killed by something else (but NOT by setting the state of the task in the DB). So I think explaining that in more detail would be useful (but by pointing to a description of what happened, not by trying to fit it into a long sentence where we try to squeeze in all possible reasons). My initial idea was to write a short "this is how monitoring for task state works" and a short description of what kind of "external" factors can kill the task:
Etc. Explaining all those reasons would not fit into a simple error message, but a somewhat generic description, pointing (via URL) to a detailed description in our documentation, would be a great resource, both for users who are "proficient enough" to follow up on these clues and also for ... us. If any of the committers/triage people attempt to help less-proficient users who will not follow (or even click) that URL, then when such a user copies and pastes the error message, the triage team member WILL follow it and learn about it, even if they did not know how it works.
| | | + "The task might have been marked failed by a user, by the task_queued_timeout configuration, " |
| | | + "or it might have been killed by something else." |
| | | ) |
| | | + if info is not None: |
| | | + msg += f" Extra info: {info}" |
| | | self.log.error(msg) |
| | | ti.handle_failure(error=msg) |
| | | continue |
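For readers who want to try the new message outside the scheduler, here is a minimal standalone sketch of the message construction from the diff above. The function name `build_state_mismatch_message` and its plain-string parameters are hypothetical and exist only for illustration; in the actual change the message is built inline in `_manage_executor_state` and passed to `self.log.error` and `ti.handle_failure`.

```python
from typing import Optional


def build_state_mismatch_message(
    ti: str,
    state: str,
    ti_state: str,
    info: Optional[str] = None,
) -> str:
    """Build the error text used when the executor and the task instance disagree.

    Hypothetical helper: ``ti`` stands in for the task instance's string form,
    ``state`` for the state reported by the executor, and ``ti_state`` for the
    task instance's own ``state`` attribute.
    """
    msg = (
        f"The executor reported that the task instance {ti} finished with state {state}, "
        f"but the task instance's state attribute is {ti_state}. "
        "This indicates that the task was marked failed by something other than the scheduler. "
        "The task might have been marked failed by a user, by the task_queued_timeout configuration, "
        "or it might have been killed by something else."
    )
    # As in the diff, the extra info is appended only when the executor supplied some.
    if info is not None:
        msg += f" Extra info: {info}"
    return msg


if __name__ == "__main__":
    print(build_state_mismatch_message("<TaskInstance: example>", "failed", "queued"))
    print(build_state_mismatch_message("<TaskInstance: example>", "failed", "queued", info="example info"))
```

The sketch keeps the wording from the diff as it stands at this point in the review, so it also reproduces the phrasing that the comments in this thread are questioning.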
It might not be marked failed though? Should you be using ti.state here?