Skip to content

Set nanny environment variables in config#5098

Merged
jrbourbeau merged 5 commits into
dask:mainfrom
mrocklin:nanny-environ-config
Jul 26, 2021
Merged

Set nanny environment variables in config#5098
jrbourbeau merged 5 commits into
dask:mainfrom
mrocklin:nanny-environ-config

Conversation

@mrocklin

Copy link
Copy Markdown
Member

With the nanny we have the ability to set environment variables in dask workers.
This makes this configurable with the dask.config system.

Somewhat more controversially, this also sets a few common defaults like
MKL_NUM_THREADS, OMP_NUM_THREADS, and MALLOC_TRIM_THRESHOLD

These may have surprising effects, however I suspect that setting
these defaults will help far more people than it would harm.

cc @jacobtomlinson @crusaderky @fjetter and @jcrist who I think are the most likely to be concerned by the default

With the nanny we have the ability to set environment variables in dask workers.
This makes this configurable with the dask.config system.

Somewhat more controversially, this also sets a few common defaults like
MKL_NUM_THREADS, OMP_NUM_THREADS, and MALLOC_TRIM_THRESHOLD

These *may* have surprising effects, however I suspect that setting
these defaults will help far more people than it would harm.

@fjetter fjetter left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These may have surprising effects, however I suspect that setting
these defaults will help far more people than it would harm.

I'm on board. You're probably right.


What's your reasoning for setting the num threads arguments? That might upset some folks and I don't see as big of a win for them as for the malloc setting.

preload: [] # Run custom modules with Nanny
preload-argv: [] # See https://docs.dask.org/en/latest/setup/custom-startup.html
environ:
MALLOC_TRIM_THRESHOLD_: 65536

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MALLOC_TRIM_THRESHOLD_: 65536
# Linux only; inconsequential on Windows, MacOS, or when using alternative memory managers.
# See https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os
MALLOC_TRIM_THRESHOLD_: 65536

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically we've avoided putting lots of comments in the main config file, relying instead on the description in the schema file (which gets put into docs). I'm also OK being silent about this because it doesn't have any negative effect.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind this value for MALLOC_TRIM_THRESHOLD_?

Comment thread distributed/nanny.py Outdated

self.Worker = Worker if worker_class is None else worker_class
self.env = env or {}
self.env = env or dask.config.get("distributed.nanny.environ")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.env = env or dask.config.get("distributed.nanny.environ")
self.env = dask.config.get("distributed.nanny.environ")
self.env.update(env or {})

Also I could not find anything that passes the env parameter to the Nanny constructor. Maybe we could deprecate it? Unless some dask users actually create Nanny objects by hand?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to leave the deprecation question to other work if that's ok. It seems harmless to leave here for now and there are other things that seem higher priority.

Comment thread distributed/tests/test_nanny.py Outdated
mrocklin and others added 2 commits July 21, 2021 06:50

@jrbourbeau jrbourbeau left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to hold off on merging this until after the release this Friday. That will give us a bit more time to hear from folks using the main branch of distributed

preload: [] # Run custom modules with Nanny
preload-argv: [] # See https://docs.dask.org/en/latest/setup/custom-startup.html
environ:
MALLOC_TRIM_THRESHOLD_: 65536

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind this value for MALLOC_TRIM_THRESHOLD_?

@mrocklin

Copy link
Copy Markdown
Member Author

What's the reasoning behind this value for MALLOC_TRIM_THRESHOLD_?

@crusaderky suggested it, and I trust @crusaderky :)

@mrocklin

Copy link
Copy Markdown
Member Author

What's your reasoning for setting the num threads arguments? That might upset some folks and I don't see as big of a win for them as for the malloc setting.

Anyone using blas/lapack today essentially needs to set these in order to avoid massively oversubscribing their machines. OpenBLAS in particular will often just shut down the program if you try to use more than twice the number of threads on the system.

@jacobtomlinson jacobtomlinson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally sounds good to me.

My only concern is that this feels a little magic and I think it could be easy for a user to not understand where an environment variable is coming from. It would be easy to set something in your YAML and forget about it.

For example if I set:

distributed:
  nanny:
    environ:
      FOO: "bar"

and then run

$FOO=baz dask-worker tcp://localhost:8786

The environment variable on the worker will be bar. This doesn't feel intuitive.

Comment thread distributed/nanny.py Outdated
@crusaderky

crusaderky commented Jul 22, 2021

Copy link
Copy Markdown
Collaborator

What's the reasoning behind this value for MALLOC_TRIM_THRESHOLD_?

@crusaderky suggested it, and I trust @crusaderky :)

0 or 65536 is pretty much the same for CPython and CPython extenions like numpy, because PyMalloc invokes the underlying malloc() in large-ish chunks.
However, C/C++ libraries that invoke malloc directly - which are the whole reason why the glibc implements its own user-space memory manager - would risk getting severely slowed down by setting the trim to 0.

@mrocklin mrocklin force-pushed the nanny-environ-config branch from a886c06 to 62d3805 Compare July 22, 2021 14:19
@mrocklin

Copy link
Copy Markdown
Member Author

@jrbourbeau are we ok to merge this in?

@jrbourbeau jrbourbeau left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrocklin. It looks like there's no objections to setting these default environment variables. Will merge once CI finishes

@jacobtomlinson, to your point about environment variable priority, Matt included an update where environment variables in os.environ will take priority over those in the distributed.nanny.environ configuration setting. I think this addresses the concerns you raised, but if not I'm happy to follow up in a separate PR

@jrbourbeau jrbourbeau merged commit b37ac9d into dask:main Jul 26, 2021
fjetter added a commit that referenced this pull request Oct 23, 2023
In #5098 we set a malloc trim threshold by default to more aggressively control memory trimming.
also related #7177

At the same time, we included these default settings but didn't have incredibly solid arguments for it. It's been a long standing best practice when using dask to disable this nested parallelism.

We haven't received a lot of user feedback about this. However, we had some internal reports of users who were struggling with this because this was quite unexpected behavior for them and non-trivial to debug for the ordinary end user.

In apache/arrow#38389 (comment) this also suggests to negatively impact read performance of parquet tables.

We should consider removing this again
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants