psutil causes Nanny to crash #6089

Description

@ungarj

What happened:

We occasionally get the following exceptions from our workers, which eventually cause the whole process to stall:

FileNotFoundError: [Errno 2] No such file or directory: '/proc/12/statm'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/usr/local/lib/python3.8/site-packages/distributed/system_monitor.py", line 121, in update
    read_bytes_disk = (disk_ioc.read_bytes - last_disk.read_bytes) / (
AttributeError: 'NoneType' object has no attribute 'read_bytes'

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 441, in wrapper
    ret = self._cache[fun]
AttributeError: 'Process' object has no attribute '_cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/usr/local/lib/python3.8/site-packages/distributed/worker_memory.py", line 322, in memory_monitor
    memory = proc.memory_info().rss
  File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 444, in wrapper
    return fun(self)
  File "/usr/local/lib/python3.8/site-packages/psutil/__init__.py", line 1061, in memory_info
    return self._proc.memory_info()
  File "/usr/local/lib/python3.8/site-packages/psutil/_pslinux.py", line 1661, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/psutil/_pslinux.py", line 1895, in memory_info
    with open_binary("%s/%s/statm" % (self._procfs_path, self.pid)) as f:
  File "/usr/local/lib/python3.8/site-packages/psutil/_common.py", line 711, in open_binary
    return open(fname, "rb", **kwargs)

If I understand the code correctly the following is happening:

When the SystemMonitor is initialized, psutil.disk_io_counters() returns a named tuple:

https://github.com/dask/distributed/blob/2022.04.0/distributed/system_monitor.py#L39

so the internal self._collect_disk_io_counters flag is not set to False. Later, when called within the update method, psutil.disk_io_counters() returns None instead of the expected named tuple:

https://github.com/dask/distributed/blob/2022.04.0/distributed/system_monitor.py#L115

This raises the AttributeError shown above and causes the Nanny to crash.

This seems to be a psutil issue in the first place, but I think the SystemMonitor could be more resilient when it happens.

What you expected to happen:

SystemMonitor should not raise an exception if psutil.disk_io_counters() returns None:

disk_ioc = psutil.disk_io_counters()
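
A rough sketch of the kind of guard I have in mind, written as a standalone helper so it runs without dask or psutil. The names DiskIO and disk_rates are hypothetical and not part of distributed; DiskIO only stands in for the named tuple that psutil.disk_io_counters() returns. The actual change in SystemMonitor.update might look different.

```python
from collections import namedtuple
from typing import Optional, Tuple

# Hypothetical stand-in for the named tuple returned by
# psutil.disk_io_counters(); only the two fields used here are modelled.
DiskIO = namedtuple("DiskIO", ["read_bytes", "write_bytes"])


def disk_rates(
    current: Optional[DiskIO], last: Optional[DiskIO], duration: float
) -> Tuple[float, float, Optional[DiskIO]]:
    """Return (read bytes/s, write bytes/s, sample to remember for next time).

    If either sample is None, or no time has elapsed, report zero rates
    instead of raising, and keep the most recent valid sample.
    """
    if current is None:
        # psutil returned nothing this round: keep the previous sample.
        return 0.0, 0.0, last
    if last is None or duration <= 0:
        # No usable baseline yet: start from the current sample.
        return 0.0, 0.0, current
    read = (current.read_bytes - last.read_bytes) / duration
    write = (current.write_bytes - last.write_bytes) / duration
    return read, write, current


if __name__ == "__main__":
    previous = DiskIO(read_bytes=100, write_bytes=200)
    latest = DiskIO(read_bytes=300, write_bytes=600)
    print(disk_rates(latest, previous, 2.0))
    print(disk_rates(None, previous, 2.0))
```

With a guard like this, a None result would only produce a zero reading for that interval rather than an AttributeError in the periodic callback.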

Minimal Complete Verifiable Example:

# Put your MCVE code here

Anything else we need to know?:

Should I prepare a PR with the suggested changes?

Environment:

  • Dask version: 2022.4.0
  • Python version: 3.8
  • Operating System: Debian GNU/Linux 10 (buster)
  • Install method (conda, pip, source): pip
Cluster Dump State:
