Skip to content

Deadlock when a flaky task is stolen #6689

Description

@crusaderky
  1. Task x is running on worker A
  2. x is stolen to worker B
  3. x completes successfully on worker B
  4. y, which depends on x, is scheduled on worker A
  5. x eventually completes on worker A, raising an exception. To reiterate: x is a task that does not behave deterministically, and/or changes behaviour depending on which worker it's executed on.

Expected behaviour

Either of these is acceptable:

  • the entire computation falls over, re-raising the exception on the client
  • worker A logs the incident, without informing the scheduler, and fetches the task from worker B

Actual behaviour

Cluster deadlock; corrupted worker state

Reproducer

def test_worker_state_executing_failure_to_fetch(ws_with_running_task):
    ws = ws_with_running_task
    ws2 = "127.0.0.1:2"
    instructions = ws.handle_stimulus(
        FreeKeysEvent(keys=["x"], stimulus_id="s1"),
        ComputeTaskEvent.dummy(key="y", who_has={"x": [ws2]}, stimulus_id="s2"),
        ExecuteFailureEvent(
            key="x",
            start=0.0,
            stop=1.0,
            exception=Serialize(Exception()),
            traceback=None,
            exception_text="",
            traceback_text="",
            stimulus_id="s3",
        ),
    )
    assert instructions == [
        GatherDep(worker=ws2, to_gather={"x"}, total_nbytes=1, stimulus_id="s3")
    ]
    assert ws.tasks["x"].state == "flight"

Output:

Expected :[GatherDep(stimulus_id='s3', worker='127.0.0.1:2', to_gather={'x'}, total_nbytes=1)]
Actual   :[TaskErredMsg(stimulus_id='s3', key='x', ...)]

            for ts_wait in ts.waiting_for_data:
                assert ts_wait.key in self.tasks
>               assert (
                    ts_wait.state in READY | {"executing", "flight", "fetch", "missing"}
                    or ts_wait in self.missing_dep_flight
                    or ts_wait.who_has.issubset(self.in_flight_workers)
                ), (ts, ts_wait, self.story(ts), self.story(ts_wait))
E               AssertionError: (<TaskState 'y' waiting>, <TaskState 'x' error>, [('y', 'compute-task', 'released', 's2', 1657213114.0749152), ('y', 'released', 'waiting', 'waiting', {'x': 'fetch'}, 's2', 1657213114.0749364)], [('x', 'compute-task', 'released', 'compute', 1657213114.0746293), ('x', 'released', 'waiting', 'waiting', {'x': 'constrained'}, 'compute', 1657213114.0746458), ('x', 'waiting', 'constrained', 'constrained', {'x': 'executing'}, 'compute', 1657213114.0746567), ('x', 'constrained', 'executing', 'executing', {}, 'compute', 1657213114.074662), ('free-keys', ['x'], 's1', 1657213114.0748878), ('x', 'executing', 'released', 'cancelled', {}, 's1', 1657213114.0748944), ('x', 'ensure-task-exists', 'cancelled', 's2', 1657213114.0749264), ('x', 'cancelled', 'fetch', 'resumed', {}, 's2', 1657213114.0749395), ('x', 'resumed', 'error', 'error', {}, 's3', 1657213114.0749602)])

../worker_state_machine.py:3081: AssertionError

Metadata

Metadata

Assignees

Labels

deadlockThe cluster appears to not make any progress

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions