revamp was it killed externally question#40141
Conversation
603e259 to
22b6976
Compare
|
Here again, I'd say, we should aim for giving a lot more self-help options for the users. Users will not understand what it means and effectively they will come-back to our issues/discussions when they see such message. Which - if they do - it does not matter if it says 'killed externally' or 'changed state' because we will know where in the code the message is generated. So if I'd like to improve this one, rather than only stating what happened, I'd link to a documentation page (troubleshooting) where users (but also other committers) could read more about the are that is affected here and what could be possible reasons for such problems. Not only that would be more helpful but it could be a good start to add more possible reasons as we find them. For example if we find a reason for similar issue which was unexpected, it makes it easy to just add a new "possible cause and remediations" to the list of those issues. Over time it might become helpful also for non-committers - and users helping each other. Even if somoene will not follow such link or not understand consequences, another person looking at the discussion or slack, might actually do it (we've seen it multiple times in the past that when we had "common" problems and documented them, over time more and more people looked at the documentation and helped each other pointing to the docs. |
a15c7d6 to
4491a72
Compare
|
@potiuk what do you think now? |
| f"{ti.state}. Was the task killed externally? Info: {info}" | ||
| f"The executor reported that the task instance {ti} finished with state {state}, " | ||
| f"but the task instance's state attribute is {ti.state}. " | ||
| "This indicates that the task was marked failed by something other than the scheduler. " |
There was a problem hiding this comment.
It might not be marked failed though? Should you be using ti.state here?
LGTM. With the small NIT by @jedcunningham |
| f"{ti.state}. Was the task killed externally? Info: {info}" | ||
| f"The executor reported that the task instance {ti} finished with state {state}, " | ||
| f"but the task instance's state attribute is {ti.state}. " | ||
| "This indicates that the task was marked failed by something other than the scheduler. " |
There was a problem hiding this comment.
I am wondering if instead of saying ".....something other than the scheduler," it should be phrased as "something other than the worker/pod that is actually running the task," or something similar. The scheduler is still an external component and is responsible for failing tasks that are timed out while being stuck in the queue or detecting and killing zombie tasks. In fact, it is the scheduler that marks them as failed, no?
There was a problem hiding this comment.
I think you are right that it is not precise.
In fact it's not necessarily scheduler or executor that marks task as "failed". It can also be marked as failed from the UI - directly in the DB and that does not involve scheduler or executor at all, then the process that monitors task execution will see it and stop the process. so the "externally" means that the process that run the task either failed or has been killed by something elase (but NOT by setting the state of the task in the DB). So I think maybe explaining that somehow in more detail would be useful (but rather pointing to the description of what happened not by trying to fit it into a long sentence where we try to squeeze all possible reason).
My initial idea was to write a short "this is how monitoring for task state works" and short descirption of possible reasons what kind of "External" factors can kill the task:
- tasks failing usually with low-level C-library errors (SIGSEGV, BAD ACCESS etc.)
- external process sending signal to the task (for example deployment manager - k8s etc. - deciding to free or move the execution to another machine by killing tasks running on it).
- OOM where deployment has not enough memory to share with the task
- some external scripts/ admins sending signal to the task
- hardware error (fault memory/disk)
Etc.
Explaining all those reasons would not fit into a simple error message, but a bit generic description, pointing (via URL) to a detailed description in our documentation would be a great resource - both for users, who are "profficient enough" to follow up with these clues but also by .... us ... if we consider that any of the commiiters/triage people will attempt to help less-profficient users who will not follow (or even not click) that URL, when they copy&paste such error message, the triage team member WILL follow and learn about it - even if they did not know how it work.
The "Was it killed externally?" message isn't particularly helpful. I think the log message should provide all the available context and nothing more.