Update standard_task_runner.py by RNHTTR · Pull Request #39543 · apache/airflow

RNHTTR · 2024-05-10T12:33:58Z

This can be caused by any number of things. Suggesting OOM can send a user down a troubleshooting path that is not helpful.

Taragolis · 2024-05-10T12:37:40Z

This can be caused by any number of things.

Maybe instead of remove describe that kind of things and how it could be analysed.

potiuk · 2024-05-10T13:13:09Z

Yep. I think this is is an oportunity to send a user on a GOOD path (or at least explain various reasons and possibly send the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever (which will mean they will come an open issue because they will have no idea what to do). The OOM at least could get the user go to A direction where they started looking at possible reasons and maybe they could find something.

RNHTTR · 2024-05-16T17:51:53Z

Yep. I think this is is an oportunity to send a user on a GOOD path (or at least explain various reasons and possibly send the user to some troubleshooting documentation). Otherwise this new message gives absolutely no clue whatsoever (which will mean they will come an open issue because they will have no idea what to do). The OOM at least could get the user go to A direction where they started looking at possible reasons and maybe they could find something.

I don't have a good recommendation. A more likely path for OOM (at least in my experience) is the task becoming a zombie rather than Airflow failing the task in a more typical manner. I've seen this error message somewhat frequently, but I don't know that I've ever seen it as a result of OOM.

RNHTTR · 2024-05-16T20:08:53Z

@amoghrajesh had a good response to a user in Slack:

-9 is a SIGKILL.
You can handle using using the on_kill callback in your Operator.

Perhaps it could recommend that users add an on_kill operator?

amoghrajesh · 2024-05-17T04:01:39Z

@RNHTTR good recommendation. What if we document that better, or do something even better.
In the on_kill callback for the baseoperator, we can add enough information for the users to send them in various debugging paths. TLDR; Add all the possible causes we can think of there. Since on_kill callback will only be called in case of, well kill.

    def on_kill(self):
        self.log.info("SIGKILL was called. It could be because of: a)...b)....")

potiuk · 2024-05-19T15:35:37Z

@RNHTTR good recommendation. What if we document that better, or do something even better. In the on_kill callback for the baseoperator, we can add enough information for the users to send them in various debugging paths. TLDR; Add all the possible causes we can think of there. Since on_kill callback will only be called in case of, well kill.
    def on_kill(self):
        self.log.info("SIGKILL was called. It could be because of: a)...b)....")

SIGKILL will ever trigger the on_kill. The -9 signal is not possible to handle really in "on_kill" - this is why we are guessing here why processes were killed. The "on_kill" method name has really no relation to SIGKILL (-9) - it's called when the task was stopped more gracefully rather by -9.

I think the right approach is to explain more what happens - current description is rather vague. Here that the task process was killed externally by -9, and have possible reasons why it might happen. OOM is one of the reasons, but there are other reasons - for example when machine/pod is evicted, -9 might be sent to all the processes when they are not responsive to other attempts to kill. I think it would be great maybe to get a little more description on all that and give the user some direction to look for - usually it's a signal sent by the deployment (K8S) but likely there might be other reasons - I think also Airflow standard task runner heartbeat might actually sigkill such process if it becomes unresponsive (and likely there is another log written in this case somewhere) - it would be worth to check it. So, just a few things listed here as possible reasons (and making sure it is open-ended) could be useful. Maybe even somewhere in our FAQ we should have a section "why my task can get sig-killed" and do a bit more description there.

amoghrajesh · 2024-05-20T04:17:33Z

Thanks for the clarification @potiuk
I was under the impression that -9 is caught in on_kill. I agree with your suggestions and we should mention some of these to provide a debugging direction. It can have things like:

The task might have been killed due to OOM
The task might have been killed due to a bad underlying host
The task could have been killed by your deployment manager due to x, y reasons.

RNHTTR · 2024-06-09T17:23:19Z

@potiuk I thought about this a little more. I think we should keep a note about OOMKill for -9 (and maybe add a blurb about other things that could cause -9 like you suggest), but we should replace log something different for when the return code is None. In this case, we should simply indicate that the task was killed for some unknown reason. I think that just assuming -9 is misleading, and causes more confusion. What do you think?

potiuk · 2024-06-09T20:47:08Z

@potiuk I thought about this a little more. I think we should keep a note about OOMKill for -9 (and maybe add a blurb about other things that could cause -9 like you suggest), but we should replace log something different for when the return code is None. In this case, we should simply indicate that the task was killed for some unknown reason. I think that just assuming -9 is misleading, and causes more confusion. What do you think?

I think anything where we have a space (in our docs) where we can direct user (via link) where they look for a problem is good. Even if it is incomplete but says "those can be the reasons by there are more" is way better than anything that gives the user no clue whatsoever. We can add more stuff there over time even if initial assesment is not complete, every single time when we discuss with user and find another reason we can update that documentation and make it better. If another committer looks at it and they have no clue, they can also learn from that information - that's why it should provide context and where the error might be generated from. I think just providing the log with explaining WHAT happened without telling context WHY it happend and HOW they can fix the problem will inevitably lead to the users asking on the issues or discussions what to do. And our goal should be:

a) either they find a possible cause by following the docs
b) or when they post issue or discussion, other users will follow the docs and advise them
c) a manintainer finds the root cause by investigation and not only tell it to the user but also update the documentation to include that reason

RNHTTR · 2024-06-19T15:37:13Z

Based on the feedback here and the feedback in #40141 , I'm going to add a troubleshooting obscure task failures docs page and link out to that page for this log message as well as the one in #40141.

See #40334

Update standard_task_runner.py

b4ef119

This can be caused by any number of things. Suggesting OOM can send a user down a troubleshooting path that is not helpful.

boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label May 10, 2024

Merge branch 'main' into patch-8

3c96fbd

RNHTTR mentioned this pull request Jun 19, 2024

revamp was it killed externally question #40141

Closed

RNHTTR closed this Jun 19, 2024

RNHTTR mentioned this pull request Jun 19, 2024

revamp some confusing log messages #40334

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update standard_task_runner.py#39543

Update standard_task_runner.py#39543
RNHTTR wants to merge 2 commits into
apache:mainfrom
RNHTTR:patch-8

RNHTTR commented May 10, 2024

Uh oh!

Taragolis commented May 10, 2024

Uh oh!

potiuk commented May 10, 2024

Uh oh!

RNHTTR commented May 16, 2024

Uh oh!

RNHTTR commented May 16, 2024

Uh oh!

amoghrajesh commented May 17, 2024

Uh oh!

potiuk commented May 19, 2024

Uh oh!

amoghrajesh commented May 20, 2024

Uh oh!

RNHTTR commented Jun 9, 2024

Uh oh!

potiuk commented Jun 9, 2024

Uh oh!

RNHTTR commented Jun 19, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

RNHTTR commented May 10, 2024

Uh oh!

Taragolis commented May 10, 2024

Uh oh!

potiuk commented May 10, 2024

Uh oh!

RNHTTR commented May 16, 2024

Uh oh!

RNHTTR commented May 16, 2024

Uh oh!

amoghrajesh commented May 17, 2024

Uh oh!

potiuk commented May 19, 2024

Uh oh!

amoghrajesh commented May 20, 2024

Uh oh!

RNHTTR commented Jun 9, 2024

Uh oh!

potiuk commented Jun 9, 2024

Uh oh!

RNHTTR commented Jun 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

RNHTTR commented Jun 19, 2024 •

edited

Loading