Race condition between gather() and AMM #7982

Description

@crusaderky

The who_has mapping used by gather can fall out of sync.
This can happen, for example:

  1. when the client is gathering intermediate keys that are dependencies of other keys and are therefore replicated on additional workers while Scheduler.gather is running. Afterwards, AMM ReduceReplicas deletes one or more of the excess replicas, which may be the ones listed in the who_has mapping.
  2. when a dynamic cluster is shrinking, which causes AMM RetireWorkers to create replicas elsewhere and delete the old ones.

When this happens, if a remote worker is missing any key, gather completely removes the worker from its own internal copy of who_has. It does not refresh who_has from the scheduler.

Once the local copy of who_has owned by gather runs out of workers for a key, the key is marked as missing and gather fails, even though the key may still exist on a worker that the stale mapping never listed.
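
The failure mode can be reproduced with a toy model. This is only a sketch of the behaviour described above, not the actual distributed code: `gather_with_stale_who_has`, `who_has_snapshot` and `worker_data` are hypothetical names, and workers are plain dicts rather than network endpoints.

```python
def gather_with_stale_who_has(keys, who_has_snapshot, worker_data):
    """Toy model of gather working from a snapshot of who_has.

    who_has_snapshot: {key: [worker, ...]} as captured when gather started.
    worker_data: {worker: {key: value}}, the live state after AMM moved replicas.
    Returns (results, missing_keys).
    """
    # Private copy owned by gather; it is never refreshed from the scheduler.
    local = {k: list(ws) for k, ws in who_has_snapshot.items()}
    results, missing = {}, []
    pending = list(keys)
    while pending:
        requests = {}
        for key in pending:
            if not local.get(key):
                # No known worker left for this key: give up on it.
                missing.append(key)
                continue
            requests.setdefault(local[key][0], []).append(key)
        pending = []
        for worker, wanted in requests.items():
            data = worker_data.get(worker, {})
            got = {k: data[k] for k in wanted if k in data}
            results.update(got)
            if len(got) < len(wanted):
                # The worker lacked at least one key: drop it for ALL keys.
                for holders in local.values():
                    if worker in holders:
                        holders.remove(worker)
                pending.extend(k for k in wanted if k not in got)
    return results, missing


# Snapshot taken before AMM acted: both keys are believed to be on w1.
snapshot = {"x": ["w1"], "y": ["w1"]}
# Live state: AMM replicated x to w2 and then deleted the copy on w1.
live = {"w1": {"y": 2}, "w2": {"x": 1}}

results, missing = gather_with_stale_who_has(["x", "y"], snapshot, live)
print(results, missing)
```

In this model `x` ends up in `missing` although `w2` holds it, because the private copy is never re-read from the authoritative mapping once `w1` has been dropped.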

Labels: bug