Update standard_task_runner.py #39543
This can be caused by any number of things. Suggesting OOM can send a user down a troubleshooting path that is not helpful.
Maybe instead of removing it, describe what kinds of things can cause this and how they could be analysed. |
|
Yep. I think this is an opportunity to send a user on a GOOD path (or at least explain various reasons and possibly send the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever (which means they will come and open an issue because they will have no idea what to do). The OOM hint could at least get the user to go in A direction where they start looking at possible reasons, and maybe they could find something. |
I don't have a good recommendation. A more likely path for OOM (at least in my experience) is the task becoming a zombie rather than Airflow failing the task in a more typical manner. I've seen this error message somewhat frequently, but I don't know that I've ever seen it as a result of OOM. |
|
@amoghrajesh had a good response to a user in Slack:
Perhaps it could recommend that users add an |
|
@RNHTTR good recommendation. What if we document that better, or do something even better. |
SIGKILL will ever trigger the I think the right approach is to explain more about what happens; the current description is rather vague. Here, that means saying that the task process was killed externally by -9, and listing possible reasons why it might happen. OOM is one of the reasons, but there are others: for example, when a machine/pod is evicted, -9 might be sent to all processes that are not responsive to other attempts to kill them. I think it would be great to get a little more description of all that and give the user some direction to look in. Usually it's a signal sent by the deployment (K8S), but there might be other reasons. I think the Airflow standard task runner heartbeat might also SIGKILL such a process if it becomes unresponsive (and likely there is another log written somewhere in this case); it would be worth checking. So, just a few things listed here as possible reasons (making sure the list is open-ended) could be useful. Maybe we should even have a section in our FAQ, "why my task can get sig-killed", with a bit more description there. |
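To make the suggestion above concrete, here is a minimal sketch of an open-ended log message for a task process killed by -9. It assumes a POSIX system, where a negative return code `-N` conventionally means the process was terminated by signal `N`. The names `describe_return_code` and `POSSIBLE_SIGKILL_CAUSES` are hypothetical and are not part of Airflow; the listed causes are only the ones mentioned in this thread.

```python
import signal

# Hypothetical, deliberately open-ended list of causes taken from this discussion.
POSSIBLE_SIGKILL_CAUSES = (
    "the process was killed by the out-of-memory (OOM) killer",
    "the machine or pod was evicted and its processes were force-killed",
    "the process became unresponsive and was killed after other attempts failed",
)


def describe_return_code(return_code: int) -> str:
    """Build a log message that says WHAT happened and hints at WHY.

    Assumption: a negative return code -N means "terminated by signal N",
    which is the POSIX convention also used by subprocess.Popen.returncode.
    """
    if return_code >= 0:
        return f"Task process exited with return code {return_code}."

    signum = -return_code
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"signal {signum}"

    message = f"Task process was terminated by {name} (return code {return_code})."
    if signum == signal.SIGKILL:
        causes = "; ".join(POSSIBLE_SIGKILL_CAUSES)
        message += f" Possible causes include, but are not limited to: {causes}."
    return message


if __name__ == "__main__":
    print(describe_return_code(-9))
```

A real message would probably also link to a troubleshooting or FAQ page, so the list can grow over time without changing the log text.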
|
Thanks for the clarification @potiuk
|
|
@potiuk I thought about this a little more. I think we should keep a note about OOMKill for -9 (and maybe add a blurb about other things that could cause -9, like you suggest), but we should log something different for when the return code is |
I think having any space (in our docs) where we can direct the user (via a link) to look for the problem is good. Even if it is incomplete but says "these can be the reasons, but there are more", that is way better than anything that gives the user no clue whatsoever. We can add more there over time even if the initial assessment is not complete: every single time we discuss with a user and find another reason, we can update that documentation and make it better. If another committer looks at it and has no clue, they can also learn from that information. That's why it should provide context and say where the error might be generated from. I think just providing a log explaining WHAT happened, without the context of WHY it happened and HOW they can fix the problem, will inevitably lead to users asking in issues or discussions what to do. And our goal should be: a) either they find a possible cause by following the docs |