Set nanny environment variables in config#5098
Conversation
With the nanny we have the ability to set environment variables in dask workers. This makes this configurable with the dask.config system. Somewhat more controversially, this also sets a few common defaults like MKL_NUM_THREADS, OMP_NUM_THREADS, and MALLOC_TRIM_THRESHOLD These *may* have surprising effects, however I suspect that setting these defaults will help far more people than it would harm.
fjetter
left a comment
There was a problem hiding this comment.
These may have surprising effects, however I suspect that setting
these defaults will help far more people than it would harm.
I'm on board. You're probably right.
What's your reasoning for setting the num threads arguments? That might upset some folks and I don't see as big of a win for them as for the malloc setting.
| preload: [] # Run custom modules with Nanny | ||
| preload-argv: [] # See https://docs.dask.org/en/latest/setup/custom-startup.html | ||
| environ: | ||
| MALLOC_TRIM_THRESHOLD_: 65536 |
There was a problem hiding this comment.
| MALLOC_TRIM_THRESHOLD_: 65536 | |
| # Linux only; inconsequential on Windows, MacOS, or when using alternative memory managers. | |
| # See https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os | |
| MALLOC_TRIM_THRESHOLD_: 65536 |
There was a problem hiding this comment.
Historically we've avoided putting lots of comments in the main config file, relying instead on the description in the schema file (which gets put into docs). I'm also OK being silent about this because it doesn't have any negative effect.
There was a problem hiding this comment.
What's the reasoning behind this value for MALLOC_TRIM_THRESHOLD_?
|
|
||
| self.Worker = Worker if worker_class is None else worker_class | ||
| self.env = env or {} | ||
| self.env = env or dask.config.get("distributed.nanny.environ") |
There was a problem hiding this comment.
| self.env = env or dask.config.get("distributed.nanny.environ") | |
| self.env = dask.config.get("distributed.nanny.environ") | |
| self.env.update(env or {}) |
Also I could not find anything that passes the env parameter to the Nanny constructor. Maybe we could deprecate it? Unless some dask users actually create Nanny objects by hand?
There was a problem hiding this comment.
I'm inclined to leave the deprecation question to other work if that's ok. It seems harmless to leave here for now and there are other things that seem higher priority.
Co-authored-by: crusaderky <crusaderky@gmail.com>
jrbourbeau
left a comment
There was a problem hiding this comment.
I'd like to hold off on merging this until after the release this Friday. That will give us a bit more time to hear from folks using the main branch of distributed
| preload: [] # Run custom modules with Nanny | ||
| preload-argv: [] # See https://docs.dask.org/en/latest/setup/custom-startup.html | ||
| environ: | ||
| MALLOC_TRIM_THRESHOLD_: 65536 |
There was a problem hiding this comment.
What's the reasoning behind this value for MALLOC_TRIM_THRESHOLD_?
@crusaderky suggested it, and I trust @crusaderky :) |
Anyone using blas/lapack today essentially needs to set these in order to avoid massively oversubscribing their machines. OpenBLAS in particular will often just shut down the program if you try to use more than twice the number of threads on the system. |
jacobtomlinson
left a comment
There was a problem hiding this comment.
This generally sounds good to me.
My only concern is that this feels a little magic and I think it could be easy for a user to not understand where an environment variable is coming from. It would be easy to set something in your YAML and forget about it.
For example if I set:
distributed:
nanny:
environ:
FOO: "bar"and then run
$FOO=baz dask-worker tcp://localhost:8786
The environment variable on the worker will be bar. This doesn't feel intuitive.
0 or 65536 is pretty much the same for CPython and CPython extenions like numpy, because PyMalloc invokes the underlying malloc() in large-ish chunks. |
a886c06 to
62d3805
Compare
|
@jrbourbeau are we ok to merge this in? |
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @mrocklin. It looks like there's no objections to setting these default environment variables. Will merge once CI finishes
@jacobtomlinson, to your point about environment variable priority, Matt included an update where environment variables in os.environ will take priority over those in the distributed.nanny.environ configuration setting. I think this addresses the concerns you raised, but if not I'm happy to follow up in a separate PR
In #5098 we set a malloc trim threshold by default to more aggressively control memory trimming. also related #7177 At the same time, we included these default settings but didn't have incredibly solid arguments for it. It's been a long standing best practice when using dask to disable this nested parallelism. We haven't received a lot of user feedback about this. However, we had some internal reports of users who were struggling with this because this was quite unexpected behavior for them and non-trivial to debug for the ordinary end user. In apache/arrow#38389 (comment) this also suggests to negatively impact read performance of parquet tables. We should consider removing this again
With the nanny we have the ability to set environment variables in dask workers.
This makes this configurable with the dask.config system.
Somewhat more controversially, this also sets a few common defaults like
MKL_NUM_THREADS, OMP_NUM_THREADS, and MALLOC_TRIM_THRESHOLD
These may have surprising effects, however I suspect that setting
these defaults will help far more people than it would harm.
cc @jacobtomlinson @crusaderky @fjetter and @jcrist who I think are the most likely to be concerned by the default