Add pathlib.Path support to open_(mf)dataset #1514
|
Are you sure |
I missed this, apologies. Can we add |
shoyer
left a comment
This is great for a start. For consistency, it would be nice for this to also work with open_dataarray, Dataset.to_netcdf, DataArray.to_netcdf and save_mfdataset.
| if isinstance(paths, GeneratorType): | ||
| paths = list(paths) | ||
| if isinstance(paths[0], Path): | ||
| paths = sorted(str(p) for p in paths) |
I don't think we should sort here. We don't do this for lists of strings because (I think) the order in which paths are provided can affect the order of data in the resulting Dataset.
|
|
||
| if isinstance(paths, basestring): | ||
| paths = sorted(glob(paths)) | ||
|
|
Rather than explicitly checking for GeneratorType above, let's add an else clause here:
    else:
        # also converts iterables of Path objects
        paths = [str(p) if isinstance(p, Path) else p
                 for p in paths]
| # handle output of pathlib.Path.glob() | ||
| if isinstance(paths, GeneratorType): | ||
| paths = list(paths) | ||
| if isinstance(paths[0], Path): |
I don't like only checking the first element here. There is lots of overhead for opening a file, so I don't think it should be a problem to check isinstance on every element of the paths argument.
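The three review points above (no sorting of explicit lists, an else clause instead of a GeneratorType check, and an isinstance check on every element) can be combined into one small helper. This is only a sketch: the name `normalize_paths` is hypothetical, and it uses Python 3's `str` where the diff uses `basestring`.

```python
from glob import glob
from pathlib import Path


def normalize_paths(paths):
    """Hypothetical helper: return a list of string paths."""
    if isinstance(paths, str):
        # A glob pattern: the order returned by glob is arbitrary, so sort it.
        return sorted(glob(paths))
    # Any other iterable (list, tuple, generator): keep the caller's order
    # and convert every Path element, not only the first one.
    return [str(p) if isinstance(p, Path) else p for p in paths]
```

Because the list comprehension consumes any iterable, the output of `Path.glob()` is handled without a dedicated generator check, and mixed lists of strings and Path objects are accepted.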
| raise | ||
|
|
||
|
|
||
| @contextlib.contextmanager |
Rather than repeating everything from create_tmp_file, can you simply wrap the output from create_tmp_file in Path?
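A minimal sketch of that wrapping approach follows. The `create_tmp_file` shown here is a hypothetical stand-in, since the real helper lives in the xarray test suite and may differ in signature and cleanup behaviour; only the wrapper pattern is the point.

```python
import contextlib
import os
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def create_tmp_file(suffix=".nc"):
    """Hypothetical stand-in for the test helper: yields a str path."""
    tmp_dir = tempfile.mkdtemp()
    try:
        yield os.path.join(tmp_dir, "temp" + suffix)
    finally:
        # Remove the directory and anything the test wrote into it.
        shutil.rmtree(tmp_dir, ignore_errors=True)


@contextlib.contextmanager
def create_tmp_file_pathlib(**kwargs):
    # Reuse create_tmp_file and change only the type of the yielded value.
    with create_tmp_file(**kwargs) as path:
        yield Path(path)
```

The wrapper inherits the setup and cleanup logic, so nothing from the original helper has to be repeated.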
| autoclose=self.autoclose) as actual: | ||
| self.assertEqual(actual.foo.variable.data.chunks, | ||
| ((3, 2, 3, 2),)) | ||
| for _create_tmp_file in [create_tmp_file, create_tmp_file_pathlib]: |
Can you make a separate test_open_mfdataset_path test rather than squeezing this into test_open_mfdataset?
Some copy/paste is OK, but it doesn't need all the checks we have here around chunks and dask arrays. A simple self.assertDatasetAllClose(original, actual) would suffice.
|
Thanks for the review @shoyer ! On my machine, I've already started adapting the other backends ( |
* Added show_commit_url to asv.conf This should setup the proper links from the published output to the commit on Github. FYI the benchmarks should be running stably now, and posted to http://pandas.pydata.org/speed/xarray. http://pandas.pydata.org/speed/xarray/regressions.xml has an RSS feed to the regressions. * Update asv.conf.json
* Clarify in docs that inferring DataArray dimensions is deprecated * Fix DataArray docstring * Clarify DataArray coords documentation
This follows <jazzband/pathlib2#8 (comment)> who argues for sticking to pathlib2.
|
|
||
| INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0'] | ||
| TESTS_REQUIRE = ['pytest >= 2.7.1'] | ||
| if sys.version_info < (3, 0): |
I've added a check for Python < 3 to setup.py. This seems to work. I'm not sure if this really is an accepted way of creating conditional requirements.
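For illustration, the version check described here might look roughly like the sketch below. The helper name `install_requires` is hypothetical, and the exact code in the PR's setup.py is not shown in this thread.

```python
import sys

INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0']


def install_requires(version_info=sys.version_info):
    """Hypothetical helper: add the pathlib2 backport only on Python 2."""
    requires = list(INSTALL_REQUIRES)
    if version_info < (3, 0):
        requires.append('pathlib2')
    return requires
```

An alternative that avoids runtime logic in setup.py is a PEP 508 environment marker, i.e. a requirement string such as `'pathlib2; python_version < "3"'`; whether that suits this project is a separate question.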
| from glob import glob | ||
| from io import BytesIO | ||
| from numbers import Number | ||
| try: |
This prefers pathlib2 and only falls back to pathlib if necessary. I'm following jazzband/pathlib2#8 (comment) here although preferring the stdlib module would feel better.
What do you think?
I agree, I would prefer stdlib module.
Tests are already done on my personal Travis Account (skipped intermediate commits there).
Actually true only for To me, aae32a8 looks like pathlib support is now present wherever it makes sense. |
|
One more thing:
Currently, I've set |
|
@willirath take a look at what we do for handling the optional dask dependency: xarray/xarray/core/pycompat.py Lines 55 to 60 in bcd6081 |
Ah, that's nice! And then I remove the |
Yes, |
jhamman
left a comment
looks pretty good, just a few minor tweaks from my perspective.
| from pathlib import Path | ||
| except ImportError: | ||
| from pathlib2 import Path | ||
|
|
let's move this to tests/__init__.py. We'll also want a requires_pathlib defined there.
I added the requires_pathlib decorator but couldn't remove this nested import block from test_backends.py, because Path() is explicitly used to set up the pathlib tests.
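A `requires_pathlib` decorator of the kind discussed here could be built on `unittest.skipUnless`. This is a sketch under that assumption; the flag name `has_pathlib` and the example test class are hypothetical and not taken from the PR.

```python
import unittest

try:
    from pathlib import Path  # stdlib on Python 3.4+
    has_pathlib = True
except ImportError:
    try:
        from pathlib2 import Path  # backport for Python 2
        has_pathlib = True
    except ImportError:
        has_pathlib = False

# Skips the decorated test when neither module can be imported.
requires_pathlib = unittest.skipUnless(
    has_pathlib, reason='requires pathlib / pathlib2')


class ExamplePathTest(unittest.TestCase):
    @requires_pathlib
    def test_str_conversion(self):
        # Path is only referenced inside the guarded test body.
        self.assertEqual(str(Path('file.nc')), 'file.nc')
```

This also shows why the test module still needs its own import of `Path`: the decorator controls whether the test runs, but the test body itself constructs Path objects.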
| with self.assertRaisesRegexp(ValueError, 'same length'): | ||
| save_mfdataset([ds, ds], ['only one path']) | ||
|
|
||
| def test_save_mfdataset_pathlib_roundtrip(self): |
add @requires_pathlib decorator
| with open_dataarray(tmp, drop_variables=['y']) as loaded: | ||
| self.assertDataArrayIdentical(expected, loaded) | ||
|
|
||
| def test_dataarray_to_netcdf_no_name_pathlib(self): |
add @requires_pathlib decorator
| from pathlib2 import Path | ||
| path_type = (Path, ) | ||
| except ImportError as e: | ||
| path_type = () |
this looks right to me but we should put it in pycompat.py
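The `path_type` tuple works as an optional-dependency guard because `isinstance(obj, ())` is always False. The sketch below illustrates that; it tries the stdlib module first, as the reviewers preferred, whereas the diff above imports pathlib2. The helper `to_str_if_path` is hypothetical.

```python
try:
    from pathlib import Path
    path_type = (Path,)
except ImportError:
    try:
        from pathlib2 import Path
        path_type = (Path,)
    except ImportError:
        # Neither module is available: an empty tuple matches nothing.
        path_type = ()


def to_str_if_path(obj):
    """Hypothetical helper: convert Path objects to str, pass others through."""
    return str(obj) if isinstance(obj, path_type) else obj
```

Call sites can then use `isinstance(x, path_type)` unconditionally, without checking whether a pathlib implementation is installed.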
| "install_timeout": 600, | ||
|
|
||
| // the base URL to show a commit for the project. | ||
| // "show_commit_url": "http://github.com/owner/project/commit/", |
I think this is already on master, can we roll this back?
I rebased onto master at some point. Reverted it now.
| (only netCDF3 supported). | ||
| filename_or_obj : str, Path, file or xarray.backends.*DataStore | ||
| Strings and Path objects are interpreted as a path to a netCDF file | ||
| or an OpenDAP URL and opened with python-netCDF4, unless the filename |
|
|
|
AppVeyor test failed with HTTP time-outs when trying to get |
| Enhancements | ||
| ~~~~~~~~~~~~ | ||
|
|
||
| - Support for `pathlib.Path` objects added to |
nit: we typically cite the issue number (e.g. :issue:`799`). Would be nice to include here.
I just pushed a commit to add this
| if isinstance(paths, basestring): | ||
| paths = sorted(glob(paths)) | ||
| else: | ||
| paths = [str(p) if isinstance(p, path_type) else p for p in paths] |
You may have already discussed this with @shoyer but can you remind me why we're not sorting in the same way we do for the glob path above? I guess we're assuming all the paths are expanded already?
We sort after glob() since the iteration order is arbitrary. But we don't sort in general, since the order of the provided filenames might be intentional.
Unfortunately, there isn't any way to detect a generator created by pathlib's glob() method, since it's just a Python generator.
| <xarray.Dataset> | ||
| [...] | ||
|
|
||
| In [6]: all_files = data_dir.glob("dta_for_month_*.nc") |
I removed this example from What's New for two reasons:
- The section was getting a little longer than essential
- It's actually a bit of an anti-pattern, since the order of paths matters to xarray but is arbitrary from glob. The right way to write this is sorted(data_dir.glob(...)).
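The recommended pattern can be tried without xarray. The sketch below creates empty placeholder files in a temporary directory (the file names follow the What's New example, the rest is an assumption for illustration) and sorts the glob results before use.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Create empty placeholder files in a deliberately scrambled order.
    for month in ("03", "01", "02"):
        (data_dir / "dta_for_month_{}.nc".format(month)).touch()

    # Path.glob() yields matches lazily and in arbitrary order; sorted()
    # materializes them into a list with a deterministic order.
    all_files = sorted(data_dir.glob("dta_for_month_*.nc"))
    names = [p.name for p in all_files]
```

The sorted list could then be passed on to a function such as open_mfdataset, where the order of the paths matters.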
|
Thanks @shoyer!
I like the shorter example. It shows the essence of why pathlib is such a
nice thing.
Is there anything more to do in this PR?
On 31 August 2017 17:48:24, Stephan Hoyer <notifications@github.com> wrote:
… shoyer approved this pull request.
> @@ -21,9 +21,38 @@ v0.9.7 (unreleased)
Enhancements
~~~~~~~~~~~~
+- Support for `pathlib.Path` objects added to
I just pushed a commit to add this
> + .. ipython::
+ :verbatim:
+ In [1]: import xarray as xr
+
+ In [2]: from pathlib import Path # In Python 2, use pathlib2!
+
+ In [3]: data_dir = Path("data/")
+
+ In [4]: one_file = data_dir / "dta_for_month_01.nc"
+
+ In [5]: print(xr.open_dataset(one_file))
+ Out[5]:
+ <xarray.Dataset>
+ [...]
+
+ In [6]: all_files = data_dir.glob("dta_for_month_*.nc")
I removed this example from What's New for two reasons:
1. The section was getting a little longer than essential
2. It's actually a bit of an anti-pattern, since the order of paths matters
to xarray but is arbitrary from glob. The right way to write this is
`sorted(data_dir.glob(...))`.
|
|
CI failures are all the issue with dask distributed (#1540), so I'm going ahead and merging. Thanks @willirath ! |
git diff upstream/master | flake8 --diff
whats-new.rst for all changes and api.rst for new API

This is meant to eventually make xarray.open_dataset and xarray.open_mfdataset work with pathlib.Path objects. I think this can be achieved as follows:

- In xarray.open_dataset, cast any pathlib.Path object to string
- In xarray.open_mfdataset, make sure to handle generators. This is necessary, because pathlib.Path('some-path').glob() returns generators.

Currently, tests with Python 2 are failing, because there is no explicit pathlib dependency yet.

With Python 3, everything seems to work. I am not happy with the tests I've added so far, though.
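To see why generators need explicit handling, consider the small sketch below. It uses a plain generator expression as a stand-in for the lazy result of a glob call (an assumption made so the example needs no files on disk): such an object cannot be indexed and can only be consumed once.

```python
from pathlib import Path

# Stand-in for a lazily produced sequence of paths.
gen = (Path(name) for name in ["a.nc", "b.nc"])

try:
    gen[0]  # generators do not support indexing
    subscript_failed = False
except TypeError:
    subscript_failed = True

# Materializing the generator once gives a reusable list.
paths = [str(p) for p in gen]

# A second pass over the same generator yields nothing.
leftover = list(gen)
```

Any code that inspects `paths[0]` or iterates over the argument twice therefore has to convert the input to a list first.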