Centralize configuration#3432
Conversation
martindurant
left a comment
There was a problem hiding this comment.
Several minor comments here.
| from .compatibility import FileExistsError | ||
|
|
||
|
|
||
| config_paths = [ |
There was a problem hiding this comment.
These appear windows-unfriendly
There was a problem hiding this comment.
Yeah, I'm not yet sure what the windows equivalent is for /etc/. This is still a TODO. Wisdom welcome here.
| ] | ||
|
|
||
| if 'DASK_CONFIG' in os.environ: | ||
| config_paths.append(os.environ['DASK_CONFIG']) |
There was a problem hiding this comment.
Should be override, i.e., in position 0?
-edit-
I see below that later wins. Perhaps a comment here, then. Is it a surprise when providing config explicitly like this, that other values are also read, if not overridden?
There was a problem hiding this comment.
I get the sense that it's nicer to stack things. This allows, for example, administrators to specify certain low-level configuration values while also letting users add their own. I think that this is the best approach, but would welcome alternatives.
| return result | ||
|
|
||
|
|
||
| def collect_yaml(paths=config_paths): |
There was a problem hiding this comment.
OK to use mutable default?
There was a problem hiding this comment.
We don't mutate it or pass it on, so I think so. I've been burned by this before though. If you see a possible concern please let me know.
| for path in paths: | ||
| if os.path.exists(path): | ||
| if os.path.isdir(path): | ||
| file_paths.extend(sorted([ |
There was a problem hiding this comment.
Looks like a job for glob?
There was a problem hiding this comment.
I'm not sure how best to replace this code with glob. It seems more complex than glob's common use.
There was a problem hiding this comment.
Perhaps:
file_paths.extend(sorted(chain(*map(glob, ['*.json', '*.yaml', '*.yml']))))
There was a problem hiding this comment.
It's also not clear to me what benefit we get from supporting multiple configuration formats. If yaml is the common one, then we already have to rely on a third-party library.
| # Parse yaml files | ||
| for path in file_paths: | ||
| with open(path) as f: | ||
| data = yaml.load(f.read() or {}) |
There was a problem hiding this comment.
yaml.load(f.read()) or {} - but probably should have try/except (with logger warnings), so syntax errors in the files don't break config.
There was a problem hiding this comment.
Indeed. Thanks for the {} fixs .
I'm inclined to fail hard on syntax errors in yaml files. This is something that users opt-in to, so I think it's ok to give them immediate hard feedback.
| import shutil | ||
| if not os.path.exists(os.path.dirname(destination)): | ||
| try: | ||
| os.mkdir(os.path.dirname(destination)) |
There was a problem hiding this comment.
Maybe prettier to use our own makedirs(path, exists_ok=True), since FileExistsError is a port anyway in py2.
| tmp = '%s.tmp.%d' % (destination, os.getpid()) | ||
| shutil.copy(source, tmp) | ||
| try: | ||
| os.rename(tmp, destination) |
There was a problem hiding this comment.
Is this necessarily atomic? May need a lock (a file in the directory, in this case) to be certain.
There was a problem hiding this comment.
It seems to work well in practice. It's infrequently used (often just on first access of a library). I'm inclined to avoid the complications that come along with file-based locks if we can get away with it (which so far, we have).
|
|
||
|
|
||
| @contextmanager | ||
| def set_config(arg=None, **kwargs): |
There was a problem hiding this comment.
I assume this is also thread-unsafe - is it only ever called from the client?
There was a problem hiding this comment.
No expectations are made about calling this. Also this isn't just for the distributed scheduler, this is for everything. Locks might be appropriate, I'm not sure. In practice it hasn't been used that much outside of tests.
| try: | ||
| yield | ||
| finally: | ||
| for key in kwargs: |
There was a problem hiding this comment.
or config.clear() and for key in old.
There was a problem hiding this comment.
Maybe. Currently we don't copy the entire old config, only the changed keys. I'd be happy either way. We'll need to improve this over time anyway to handle nested keys.
| old = {} | ||
| for key in kwargs: | ||
| if key in config: | ||
| old[key] = config[key] |
dask/dask#3432 centralizes configuration in dask/dask. This commit updates dask/distributed for these changes. Notably we centralize all default values in distributed/config.yaml which is now un-commented. We merge it into the global configuration by default on startup.
dask/dask#3432 centralizes configuration in dask/dask. This commit updates dask/distributed for these changes. Notably we centralize all default values in distributed/config.yaml which is now un-commented. We merge it into the global configuration by default on startup.
This builds on dask/dask#3432 to centralize configuration within Dask. This adds a new jobqueue.yaml config file that allows users to specify configuration in a file and have that be used across all of their use of the dask-jobqueue project. This is also something that might allow IT to help standardize configuration options for a given cluster.
|
Here are two RFC PRs that show how the proposed system would work under downstream projects: |
|
I'm curious to know whether appdirs has been considered for setting the config directory? |
|
… On Tue, May 1, 2018 at 10:33 AM, Jacob Tomlinson ***@***.***> wrote:
I'm curious to know whether appdirs
<https://github.com/ActiveState/appdirs> has been considered for setting
the config directory?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3432 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AASszOZkp76AOGwyon3FTbx9whZ5dyBXks5tuHIzgaJpZM4TgaWM>
.
|
|
Interesting! So what I gathered from @minrk's comment was that consistency between Mac and Linux wins out over being good platform citizens but for things other than config |
|
@shoyer you might also find this interesting. I'd welcome any feedback. |
|
To me this feels good enough to merge. I'm going to wait until the release, but I'd like to merge this soon afterwards. Critical feedback on this PR would be very welcome |
| '/etc/dask', | ||
| os.path.join(sys.prefix, 'etc', 'dask'), | ||
| os.path.join(os.path.expanduser('~'), '.config', 'dask'), | ||
| os.path.join(os.path.expanduser('~'), '.dask') |
There was a problem hiding this comment.
Should ~/.dask not be before ~/.config/dask? IIUC the former is deprecated so I would expect that it does not win against the latter.
There was a problem hiding this comment.
For backwards compatibility I think we need to leave this as-is. We might choose to raise a deprecation warning though if there is any non-trivial configuration in ~/.dask/. Or we might choose to just leave it here. It doesn't do much concrete harm, except by spreading around the places to look for config.
There was a problem hiding this comment.
It doesn't do much concrete harm, except by spreading around the places to look for config.
I guess the use case I had in mind was:
- you have an old config in
~/.daskthat you forgot even existed - you are trying to configure something in
~/.config/daskbecause you have heard that this is the modern way of doing things. This does not have any effect because your config is overriden by the one in~/.dask
Maybe it is unlikely to happen because people do not tweak dask configuration files very often.
There was a problem hiding this comment.
I suggest that we keep ~/.dask/ higher priority, but warn if that directory has any configuration in it. This will keep existing deployments functional, but will also inform them that something needs to change.
There was a problem hiding this comment.
Can we have a "list configurations" function? It should give the locations searched, and if anything was found in each, with timestamps.
There was a problem hiding this comment.
That's possible, yes, although this is also trivial enough to do that I'm somewhat inclined not to maintain that functionality within Dask, but instead make it clear where the information is and the format that it is in.
| locations: | ||
|
|
||
| 1. The ``~/.config/dask`` directory in the user's home directory | ||
| 2. The ``{sys.executable}/etc/dask`` directory local to the Python executable |
There was a problem hiding this comment.
Maybe sys.prefix rather than sys.executable?
|
|
||
| *Note: for historical reasons we also look in the ``~/.dask`` directory for | ||
| config files. This is deprecated and will soon be removed.* | ||
|
|
There was a problem hiding this comment.
Maybe it would be worth showing a snippet from yaml file. You could use scheduler.work_stealing and scheduler.allowed_failures, i.e. the same example as environment variables below or something more complete as https://distributed.readthedocs.io/en/latest/configuration.html#user-wide-configuration, not sure.
There was a problem hiding this comment.
Good idea, I've added an example yaml file snippet
jcrist
left a comment
There was a problem hiding this comment.
My main issue with this is the use of a global configuration object. I have a slight preference for moving the global config state and functions into a class, and initing an instance of that class on load. This would consolidate the implementation into a single object, which would be clearer to me what's global and what's not. It also would allow for changing the configuration files after init and reloading them.
This is similar to the hadoop ecosystem, where services subclass Configuration, and applications create an instance of Configuration to use throughout. A super simple version of this can be seen in python here
| for path in paths: | ||
| if os.path.exists(path): | ||
| if os.path.isdir(path): | ||
| file_paths.extend(sorted([ |
There was a problem hiding this comment.
Perhaps:
file_paths.extend(sorted(chain(*map(glob, ['*.json', '*.yaml', '*.yml']))))
| for path in paths: | ||
| if os.path.exists(path): | ||
| if os.path.isdir(path): | ||
| file_paths.extend(sorted([ |
There was a problem hiding this comment.
It's also not clear to me what benefit we get from supporting multiple configuration formats. If yaml is the common one, then we already have to rely on a third-party library.
| configs = [] | ||
|
|
||
| try: | ||
| import yaml |
There was a problem hiding this comment.
This logic feels a bit off to me, as functions rely on yaml being defined globally, but the import is hidden in the middle of the file. I'd prefer to see something like:
try:
import yaml
except ImportError:
yaml = None
...
if yaml is None:
configs.extend(collect_yaml())
OK, I'm hearing two possibly separate (although possibly not) ideas here:
GlobalWith the current implementation subprojects could use their own config dicts (all functions in So yes, by convention I'm proposing a single unified configuration space for all Dask projects. I believe that in the end this is simpler. Sub-projects can rely on namespaces within the configuration (see dask/dask-jobqueue#48 for an example) or can mix if they like. I believe that this will be both simpler and more convenient overall, but that's only based on the experience that I've had so far with a limited number of systems. @jcrist I would be curious to learn more about your experiences dealing with multiple separate configs if you have time. For context, a single global namespaced config seemed natural to me after dealing with Kubernetes specs and Helm charts, which are probably a strong influence in my design here. So the current approach is global by convention, but not by requirement. Combine configuration and functions into an objectYeah, I can see how that would make sense. This is the kind of situation for which objects were designed. I typically have an aversion to this. I really like that someone can take their Dask config and directly dump it to a file to give to someone else without reading documentation on how Dask Config objects work. With things being global the Other
I think that this is a good idea, and that we can achieve it regardless of which path we choose above |
Because yaml and yml are both somewhat common we have to have a check here. Yaml is a superset of JSON, so my thinking here was "why not". A use case would be that someone sends a machine-generated configuration file over some json-serialized system and dumps it down. |
This doesn't take into account cases where people point us to |
|
@jcrist I believe that the recent commits may address some of your concerns about making it easier to refresh configuration. I would also value feedback on the Downstream Libraries section of the documentation |
|
There is still some work to do here in replacing I would like to wait on that work for another PR that will follow soon on this one. @jcrist any objections? |
|
I would like to merge this soon. I will continue to be work on this topic for the next few work-days and so will be responsive to concerns if they arise. I'm inclined to merge this tomorrow morning (US Eastern time) if no one has any objections and no other questions arise. |
This looks through a list of standard locations for config files, parses them, and then adds them to a global dask.config.config dictionary. See dask#3367
These make it clear how to update configuration other than at import time.
This caused them to be dumped to sphinx output
dask/dask#3432 centralizes configuration in dask/dask. This commit updates dask/distributed for these changes. Notably we centralize all default values in distributed/config.yaml which is now un-commented. We merge it into the global configuration by default on startup.
This is a large overhaul of our configuration. It depends on dask/dask#3432 which centralizes configuration code in dask/dask. In particular this does the following: 1. The config dict moves from distributed.config to dask.config.config 2. We now depend on the config values in distributed/distributed.yaml being in the config, allowing us to drop default values that were previously sprinkled around the codebase 3. We change our configuration keys to be more hierarchical, so rather than `{'scheduler-wait-time': ...}` we get values more like `{'scheduler': {'wait-time': ...}}`. 4. We establish a mechanism to migrate old config names to new ones.
This builds on dask/dask#3432 to centralize configuration within Dask. This adds a new jobqueue.yaml config file that allows users to specify configuration in a file and have that be used across all of their use of the dask-jobqueue project. This is also something that might allow IT to help standardize configuration options for a given cluster.
This creates a dask.config module to serve as central repository for all configuration in Dask and Dask sub-projects. It includes utilities to find config files (in yaml format), parse environment variables, establish standardized locations for those files, and migrate configurations over time.
Fixes #3367
flake8 daskdocs/source/changelog.rstfor all changesand one of the
docs/source/*-api.rstfiles for new API