Skip to content

revamp was it killed externally question#40141

Closed
RNHTTR wants to merge 6 commits into
apache:mainfrom
RNHTTR:was-it-killed-externally
Closed

revamp was it killed externally question#40141
RNHTTR wants to merge 6 commits into
apache:mainfrom
RNHTTR:was-it-killed-externally

Conversation

@RNHTTR

@RNHTTR RNHTTR commented Jun 8, 2024

Copy link
Copy Markdown
Contributor

The "Was it killed externally?" message isn't particularly helpful. I think the log message should provide all the available context and nothing more.

@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Jun 8, 2024
@RNHTTR RNHTTR marked this pull request as ready for review June 8, 2024 14:14
@RNHTTR RNHTTR requested review from XD-DENG, ashb and kaxil as code owners June 8, 2024 14:14
@RNHTTR RNHTTR marked this pull request as draft June 8, 2024 15:08
@RNHTTR RNHTTR force-pushed the was-it-killed-externally branch from 603e259 to 22b6976 Compare June 8, 2024 15:16
@RNHTTR RNHTTR marked this pull request as ready for review June 9, 2024 13:50
@potiuk

potiuk commented Jun 9, 2024

Copy link
Copy Markdown
Member

Here again, I'd say, we should aim for giving a lot more self-help options for the users. Users will not understand what it means and effectively they will come-back to our issues/discussions when they see such message. Which - if they do - it does not matter if it says 'killed externally' or 'changed state' because we will know where in the code the message is generated.

So if I'd like to improve this one, rather than only stating what happened, I'd link to a documentation page (troubleshooting) where users (but also other committers) could read more about the are that is affected here and what could be possible reasons for such problems. Not only that would be more helpful but it could be a good start to add more possible reasons as we find them. For example if we find a reason for similar issue which was unexpected, it makes it easy to just add a new "possible cause and remediations" to the list of those issues.

Over time it might become helpful also for non-committers - and users helping each other. Even if somoene will not follow such link or not understand consequences, another person looking at the discussion or slack, might actually do it (we've seen it multiple times in the past that when we had "common" problems and documented them, over time more and more people looked at the documentation and helped each other pointing to the docs.

@RNHTTR RNHTTR force-pushed the was-it-killed-externally branch from a15c7d6 to 4491a72 Compare June 9, 2024 17:16
@RNHTTR

RNHTTR commented Jun 9, 2024

Copy link
Copy Markdown
Contributor Author

@potiuk what do you think now?

f"{ti.state}. Was the task killed externally? Info: {info}"
f"The executor reported that the task instance {ti} finished with state {state}, "
f"but the task instance's state attribute is {ti.state}. "
"This indicates that the task was marked failed by something other than the scheduler. "

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be marked failed though? Should you be using ti.state here?

@potiuk

potiuk commented Jun 11, 2024

Copy link
Copy Markdown
Member

@potiuk what do you think now?

LGTM. With the small NIT by @jedcunningham

@eladkal eladkal added this to the Airflow 2.9.3 milestone Jun 16, 2024
f"{ti.state}. Was the task killed externally? Info: {info}"
f"The executor reported that the task instance {ti} finished with state {state}, "
f"but the task instance's state attribute is {ti.state}. "
"This indicates that the task was marked failed by something other than the scheduler. "

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if instead of saying ".....something other than the scheduler," it should be phrased as "something other than the worker/pod that is actually running the task," or something similar. The scheduler is still an external component and is responsible for failing tasks that are timed out while being stuck in the queue or detecting and killing zombie tasks. In fact, it is the scheduler that marks them as failed, no?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right that it is not precise.

In fact it's not necessarily scheduler or executor that marks task as "failed". It can also be marked as failed from the UI - directly in the DB and that does not involve scheduler or executor at all, then the process that monitors task execution will see it and stop the process. so the "externally" means that the process that run the task either failed or has been killed by something elase (but NOT by setting the state of the task in the DB). So I think maybe explaining that somehow in more detail would be useful (but rather pointing to the description of what happened not by trying to fit it into a long sentence where we try to squeeze all possible reason).

My initial idea was to write a short "this is how monitoring for task state works" and short descirption of possible reasons what kind of "External" factors can kill the task:

  • tasks failing usually with low-level C-library errors (SIGSEGV, BAD ACCESS etc.)
  • external process sending signal to the task (for example deployment manager - k8s etc. - deciding to free or move the execution to another machine by killing tasks running on it).
  • OOM where deployment has not enough memory to share with the task
  • some external scripts/ admins sending signal to the task
  • hardware error (fault memory/disk)

Etc.

Explaining all those reasons would not fit into a simple error message, but a bit generic description, pointing (via URL) to a detailed description in our documentation would be a great resource - both for users, who are "profficient enough" to follow up with these clues but also by .... us ... if we consider that any of the commiiters/triage people will attempt to help less-profficient users who will not follow (or even not click) that URL, when they copy&paste such error message, the triage team member WILL follow and learn about it - even if they did not know how it work.

@RNHTTR

RNHTTR commented Jun 19, 2024

Copy link
Copy Markdown
Contributor Author

Based on the feedback here and the feedback in #39543, I'm going to add a troubleshooting obscure task failures docs page and link out to that page for this log message as well as the one in #39543.

See #40334

@RNHTTR RNHTTR closed this Jun 19, 2024
@dosubot dosubot Bot removed this from the Airflow 2.9.3 milestone Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants