Skip to content

Centralize configuration#3432

Merged
mrocklin merged 26 commits into
dask:masterfrom
mrocklin:config
May 5, 2018
Merged

Centralize configuration#3432
mrocklin merged 26 commits into
dask:masterfrom
mrocklin:config

Conversation

@mrocklin

@mrocklin mrocklin commented Apr 23, 2018

Copy link
Copy Markdown
Member

This creates a dask.config module to serve as central repository for all configuration in Dask and Dask sub-projects. It includes utilities to find config files (in yaml format), parse environment variables, establish standardized locations for those files, and migrate configurations over time.

Fixes #3367

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

@martindurant martindurant left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several minor comments here.

Comment thread dask/config.py Outdated
from .compatibility import FileExistsError


config_paths = [

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These appear windows-unfriendly

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not yet sure what the windows equivalent is for /etc/. This is still a TODO. Wisdom welcome here.

Comment thread dask/config.py Outdated
]

if 'DASK_CONFIG' in os.environ:
config_paths.append(os.environ['DASK_CONFIG'])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be override, i.e., in position 0?

-edit-
I see below that later wins. Perhaps a comment here, then. Is it a surprise when providing config explicitly like this, that other values are also read, if not overridden?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the sense that it's nicer to stack things. This allows, for example, administrators to specify certain low-level configuration values while also letting users add their own. I think that this is the best approach, but would welcome alternatives.

Comment thread dask/config.py Outdated
return result


def collect_yaml(paths=config_paths):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to use mutable default?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't mutate it or pass it on, so I think so. I've been burned by this before though. If you see a possible concern please let me know.

Comment thread dask/config.py
for path in paths:
if os.path.exists(path):
if os.path.isdir(path):
file_paths.extend(sorted([

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a job for glob?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how best to replace this code with glob. It seems more complex than glob's common use.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps:

file_paths.extend(sorted(chain(*map(glob, ['*.json', '*.yaml', '*.yml']))))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also not clear to me what benefit we get from supporting multiple configuration formats. If yaml is the common one, then we already have to rely on a third-party library.

Comment thread dask/config.py Outdated
# Parse yaml files
for path in file_paths:
with open(path) as f:
data = yaml.load(f.read() or {})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yaml.load(f.read()) or {} - but probably should have try/except (with logger warnings), so syntax errors in the files don't break config.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Thanks for the {} fixs .

I'm inclined to fail hard on syntax errors in yaml files. This is something that users opt-in to, so I think it's ok to give them immediate hard feedback.

Comment thread dask/config.py Outdated
import shutil
if not os.path.exists(os.path.dirname(destination)):
try:
os.mkdir(os.path.dirname(destination))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe prettier to use our own makedirs(path, exists_ok=True), since FileExistsError is a port anyway in py2.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it a shot

Comment thread dask/config.py
tmp = '%s.tmp.%d' % (destination, os.getpid())
shutil.copy(source, tmp)
try:
os.rename(tmp, destination)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessarily atomic? May need a lock (a file in the directory, in this case) to be certain.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work well in practice. It's infrequently used (often just on first access of a library). I'm inclined to avoid the complications that come along with file-based locks if we can get away with it (which so far, we have).

Comment thread dask/config.py Outdated


@contextmanager
def set_config(arg=None, **kwargs):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is also thread-unsafe - is it only ever called from the client?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No expectations are made about calling this. Also this isn't just for the distributed scheduler, this is for everything. Locks might be appropriate, I'm not sure. In practice it hasn't been used that much outside of tests.

Comment thread dask/config.py Outdated
try:
yield
finally:
for key in kwargs:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or config.clear() and for key in old.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. Currently we don't copy the entire old config, only the changed keys. I'd be happy either way. We'll need to improve this over time anyway to handle nested keys.

Comment thread dask/config.py Outdated
old = {}
for key in kwargs:
if key in config:
old[key] = config[key]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deep-copy?

@mrocklin mrocklin mentioned this pull request Apr 25, 2018
mrocklin added a commit to mrocklin/distributed that referenced this pull request Apr 30, 2018
dask/dask#3432 centralizes configuration in
dask/dask.

This commit updates dask/distributed for these changes.
Notably we centralize all default values in distributed/config.yaml
which is now un-commented.  We merge it into the global configuration by
default on startup.
mrocklin added a commit to mrocklin/distributed that referenced this pull request Apr 30, 2018
dask/dask#3432 centralizes configuration in
dask/dask.

This commit updates dask/distributed for these changes.
Notably we centralize all default values in distributed/config.yaml
which is now un-commented.  We merge it into the global configuration by
default on startup.
mrocklin added a commit to mrocklin/dask-jobqueue that referenced this pull request Apr 30, 2018
This builds on dask/dask#3432 to centralize
configuration within Dask.

This adds a new jobqueue.yaml config file that allows users to specify
configuration in a file and have that be used across all of their use of the
dask-jobqueue project.  This is also something that might allow IT to help
standardize configuration options for a given cluster.
@mrocklin

Copy link
Copy Markdown
Member Author

Here are two RFC PRs that show how the proposed system would work under downstream projects:

  1. Add configuration file dask-jobqueue#48
  2. Overhaul configuration distributed#1948

@mrocklin mrocklin changed the title [WIP] Add standard configuration file handling Add standard configuration file handling Apr 30, 2018
@mrocklin mrocklin changed the title Add standard configuration file handling Centralized configuration May 1, 2018
@jacobtomlinson

Copy link
Copy Markdown
Member

I'm curious to know whether appdirs has been considered for setting the config directory?

@mrocklin

mrocklin commented May 1, 2018 via email

Copy link
Copy Markdown
Member Author

@jacobtomlinson

Copy link
Copy Markdown
Member

Interesting! So what I gathered from @minrk's comment was that consistency between Mac and Linux wins out over being good platform citizens but for things other than config appdirs is a good way to go?

@mrocklin

mrocklin commented May 2, 2018

Copy link
Copy Markdown
Member Author

@shoyer you might also find this interesting. I'd welcome any feedback.

@mrocklin

mrocklin commented May 2, 2018

Copy link
Copy Markdown
Member Author

To me this feels good enough to merge. I'm going to wait until the release, but I'd like to merge this soon afterwards. Critical feedback on this PR would be very welcome

Comment thread dask/config.py
'/etc/dask',
os.path.join(sys.prefix, 'etc', 'dask'),
os.path.join(os.path.expanduser('~'), '.config', 'dask'),
os.path.join(os.path.expanduser('~'), '.dask')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ~/.dask not be before ~/.config/dask? IIUC the former is deprecated so I would expect that it does not win against the latter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backwards compatibility I think we need to leave this as-is. We might choose to raise a deprecation warning though if there is any non-trivial configuration in ~/.dask/. Or we might choose to just leave it here. It doesn't do much concrete harm, except by spreading around the places to look for config.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't do much concrete harm, except by spreading around the places to look for config.

I guess the use case I had in mind was:

  • you have an old config in ~/.dask that you forgot even existed
  • you are trying to configure something in ~/.config/dask because you have heard that this is the modern way of doing things. This does not have any effect because your config is overriden by the one in ~/.dask

Maybe it is unlikely to happen because people do not tweak dask configuration files very often.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we keep ~/.dask/ higher priority, but warn if that directory has any configuration in it. This will keep existing deployments functional, but will also inform them that something needs to change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a "list configurations" function? It should give the locations searched, and if anything was found in each, with timestamps.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's possible, yes, although this is also trivial enough to do that I'm somewhat inclined not to maintain that functionality within Dask, but instead make it clear where the information is and the format that it is in.

Comment thread docs/source/configuration.rst Outdated
locations:

1. The ``~/.config/dask`` directory in the user's home directory
2. The ``{sys.executable}/etc/dask`` directory local to the Python executable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe sys.prefix rather than sys.executable?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. Fixed!


*Note: for historical reasons we also look in the ``~/.dask`` directory for
config files. This is deprecated and will soon be removed.*

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be worth showing a snippet from yaml file. You could use scheduler.work_stealing and scheduler.allowed_failures, i.e. the same example as environment variables below or something more complete as https://distributed.readthedocs.io/en/latest/configuration.html#user-wide-configuration, not sure.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I've added an example yaml file snippet

@jcrist jcrist left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main issue with this is the use of a global configuration object. I have a slight preference for moving the global config state and functions into a class, and initing an instance of that class on load. This would consolidate the implementation into a single object, which would be clearer to me what's global and what's not. It also would allow for changing the configuration files after init and reloading them.

This is similar to the hadoop ecosystem, where services subclass Configuration, and applications create an instance of Configuration to use throughout. A super simple version of this can be seen in python here

Comment thread dask/config.py
for path in paths:
if os.path.exists(path):
if os.path.isdir(path):
file_paths.extend(sorted([

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps:

file_paths.extend(sorted(chain(*map(glob, ['*.json', '*.yaml', '*.yml']))))

Comment thread dask/config.py
for path in paths:
if os.path.exists(path):
if os.path.isdir(path):
file_paths.extend(sorted([

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also not clear to me what benefit we get from supporting multiple configuration formats. If yaml is the common one, then we already have to rely on a third-party library.

Comment thread dask/config.py Outdated
configs = []

try:
import yaml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic feels a bit off to me, as functions rely on yaml being defined globally, but the import is hidden in the middle of the file. I'd prefer to see something like:

try:
    import yaml
except ImportError:
    yaml = None

...

if yaml is None:
    configs.extend(collect_yaml())

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@mrocklin

mrocklin commented May 3, 2018

Copy link
Copy Markdown
Member Author

My main issue with this is the use of a global configuration object. I have a slight preference for moving the global config state and functions into a class, and initing an instance of that class on load. This would consolidate the implementation into a single object, which would be clearer to me what's global and what's not. It also would allow for changing the configuration files after init and reloading them.

OK, I'm hearing two possibly separate (although possibly not) ideas here:

  1. Configurations shouldn't be global
  2. We should combine the config dict and the collect/get/set methods into an object

Global

With the current implementation subprojects could use their own config dicts (all functions in dask.config take in an optional config= parameter) but I agree that the design wasn't intended for this multiple separate configurations.

So yes, by convention I'm proposing a single unified configuration space for all Dask projects. I believe that in the end this is simpler. Sub-projects can rely on namespaces within the configuration (see dask/dask-jobqueue#48 for an example) or can mix if they like. I believe that this will be both simpler and more convenient overall, but that's only based on the experience that I've had so far with a limited number of systems.

@jcrist I would be curious to learn more about your experiences dealing with multiple separate configs if you have time.

For context, a single global namespaced config seemed natural to me after dealing with Kubernetes specs and Helm charts, which are probably a strong influence in my design here.

So the current approach is global by convention, but not by requirement.

Combine configuration and functions into an object

Yeah, I can see how that would make sense. This is the kind of situation for which objects were designed. I typically have an aversion to this. I really like that someone can take their Dask config and directly dump it to a file to give to someone else without reading documentation on how Dask Config objects work. With things being global the dask.config module ends up being the organizational structure, which I think works decently.

Other

It also would allow for changing the configuration files after init and reloading them

I think that this is a good idea, and that we can achieve it regardless of which path we choose above

@mrocklin

mrocklin commented May 3, 2018

Copy link
Copy Markdown
Member Author

It's also not clear to me what benefit we get from supporting multiple configuration formats. If yaml is the common one, then we already have to rely on a third-party library.

Because yaml and yml are both somewhat common we have to have a check here. Yaml is a superset of JSON, so my thinking here was "why not". A use case would be that someone sends a machine-generated configuration file over some json-serialized system and dumps it down.

@mrocklin

mrocklin commented May 3, 2018

Copy link
Copy Markdown
Member Author

file_paths.extend(sorted(chain(map(glob, ['.json', '.yaml', '.yml']))))

This doesn't take into account cases where people point us to specific yaml files with DASK_CONFIG. The if-else branch in there makes it slightly trickier to use something like glob. The current code seems readable enough to me (though I have writer's bias) that the for loops don't seem like a big deal.

@mrocklin

mrocklin commented May 3, 2018

Copy link
Copy Markdown
Member Author

@jcrist I believe that the recent commits may address some of your concerns about making it easier to refresh configuration. I would also value feedback on the Downstream Libraries section of the documentation

@mrocklin

mrocklin commented May 3, 2018

Copy link
Copy Markdown
Member Author

There is still some work to do here in replacing dask.context.set_options and dask.context._global. This turns out to be slightly involved (dask/dask uses this namespace somewhat more freely than is typical in dask.config) but seems doable.

I would like to wait on that work for another PR that will follow soon on this one. @jcrist any objections?

@mrocklin

mrocklin commented May 3, 2018

Copy link
Copy Markdown
Member Author

I would like to merge this soon. I will continue to be work on this topic for the next few work-days and so will be responsive to concerns if they arise.

I'm inclined to merge this tomorrow morning (US Eastern time) if no one has any objections and no other questions arise.

mrocklin added 3 commits May 4, 2018 08:50
This looks through a list of standard locations for config files, parses them,
and then adds them to a global dask.config.config dictionary.

See dask#3367
@mrocklin mrocklin changed the title Centralized configuration Centralize configuration May 5, 2018
@mrocklin mrocklin merged commit 5c2787f into dask:master May 5, 2018
@mrocklin mrocklin deleted the config branch May 5, 2018 23:09
mrocklin added a commit to mrocklin/distributed that referenced this pull request May 5, 2018
dask/dask#3432 centralizes configuration in
dask/dask.

This commit updates dask/distributed for these changes.
Notably we centralize all default values in distributed/config.yaml
which is now un-commented.  We merge it into the global configuration by
default on startup.
mrocklin added a commit to dask/distributed that referenced this pull request May 6, 2018
This is a large overhaul of our configuration. 

It depends on dask/dask#3432 which centralizes 
configuration code in dask/dask.

In particular this does the following:

1.  The config dict moves from distributed.config to dask.config.config
2.  We now depend on the config values in distributed/distributed.yaml being in the config, allowing us to drop default values that were previously sprinkled around the codebase
3.  We change our configuration keys to be more hierarchical, so rather than `{'scheduler-wait-time': ...}` we get values more like `{'scheduler': {'wait-time': ...}}`. 
4.  We establish a mechanism to migrate old config names to new ones.
mrocklin added a commit to mrocklin/dask-jobqueue that referenced this pull request May 16, 2018
This builds on dask/dask#3432 to centralize
configuration within Dask.

This adds a new jobqueue.yaml config file that allows users to specify
configuration in a file and have that be used across all of their use of the
dask-jobqueue project.  This is also something that might allow IT to help
standardize configuration options for a given cluster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants