The who_has mapping used by gather can fall out of sync.
This can happen, for example:
- when the client is gathering intermediate keys that are dependencies of other keys, and because of this they are replicated onto additional workers while Scheduler.gather is running. Afterwards, AMM ReduceReplicas deletes one or more of the excess replicas, which may be the ones listed in the who_has mapping.
- when a dynamic cluster is shrinking, which in turn causes AMM RetireWorkers to create replicas elsewhere and delete the old ones.
When this happens, if a remote worker is missing any key, gather completely removes the worker from its own internal copy of who_has. It does not refresh who_has from the scheduler.
Once the local copy of who_has owned by gather runs out of workers for a key, the key is marked as missing and gather fails.
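
To make the failure mode concrete, here is a toy model of the behaviour described above. It is only a sketch and not the actual distributed code: the function name `gather_with_stale_who_has`, the `worker_data` dict standing in for remote workers, and the deterministic "first worker in the list" choice are all assumptions made for illustration.

```python
from collections import defaultdict


def gather_with_stale_who_has(who_has, worker_data):
    """Toy model of the failure mode; NOT the real distributed implementation.

    who_has:     key -> list of worker addresses (snapshot, never refreshed)
    worker_data: worker address -> {key: value} that the worker really holds
    Returns (results, missing_keys).
    """
    # Private copy of the snapshot. It is never refreshed from the scheduler.
    local = {key: list(workers) for key, workers in who_has.items()}
    results = {}
    missing = set()

    while len(results) + len(missing) < len(local):
        # Pick one candidate worker per outstanding key
        # (hypothetical policy: always the first one, for determinism).
        requests = defaultdict(list)
        for key, workers in local.items():
            if key in results or key in missing:
                continue
            if not workers:
                # No candidates left in the stale copy: give up on this key.
                missing.add(key)
                continue
            requests[workers[0]].append(key)

        for worker, keys in requests.items():
            held = worker_data.get(worker, {})
            found = {k: held[k] for k in keys if k in held}
            results.update(found)
            if len(found) < len(keys):
                # The worker lacked at least one key: drop it for EVERY key.
                for workers in local.values():
                    if worker in workers:
                        workers.remove(worker)
    return results, missing


# Snapshot says both keys live on worker "a". Meanwhile "x" was replicated
# to worker "b" and its replica on "a" was deleted (e.g. by the AMM).
who_has = {"x": ["a"], "y": ["a"]}
worker_data = {"a": {"y": 2}, "b": {"x": 1, "y": 2}}
results, missing = gather_with_stale_who_has(who_has, worker_data)
print(results, missing)
```

In this model "x" ends up marked as missing even though worker "b" holds it, because the stale copy never learns about "b".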