Add `pathlib.Path` support to `open_(mf)dataset` by willirath · Pull Request #1514 · pydata/xarray

willirath · 2017-08-21T18:21:34Z

Closes Support for pathlib.Path #799
Tests added / passed
Passes git diff upstream/master | flake8 --diff
Fully documented, including whats-new.rst for all changes and api.rst for new API

This is meant to eventually make xarray.open_dataset and xarray.open_mfdataset work with pathlib.Path objects. I think this can be achieved as follows:

In xarray.open_dataset, cast any pathlib.Path object to string
In xarray.open_mfdataset, make sure to handle generators. This is necessary, because pathlib.Path('some-path').glob() returns generators.
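The two steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in the PR; the helper names `as_path_string` and `materialize` are hypothetical, and the sketch assumes Python 3, where `pathlib` is part of the standard library.

```python
from pathlib import Path


def as_path_string(filename_or_obj):
    """Cast Path objects to str; pass anything else through unchanged."""
    if isinstance(filename_or_obj, Path):
        return str(filename_or_obj)
    return filename_or_obj


def materialize(paths):
    """Turn any iterable of paths (e.g. the lazy result of Path.glob) into a list."""
    return [as_path_string(p) for p in paths]


# A generator expression stands in for the lazy result of Path(...).glob(...).
lazy = (p for p in [Path("a.nc"), Path("b.nc")])
print(materialize(lazy))
```

Materializing the iterable first is what makes later indexing or repeated iteration safe, since a lazy iterator can only be consumed once.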

Currently, tests with Python 2 are failing, because there is no explicit pathlib dependency yet.

With Python 3, everything seems to work. I am not happy with the tests I've added so far, though.

max-sixty · 2017-08-21T20:27:58Z

Are you sure pathlib exists in py2? I had thought you needed to install pathlib2 (I may be wrong on the specifics)

max-sixty · 2017-08-21T20:30:48Z

Currently, tests with Python 2 are failing, because there is no explicit pathlib dependency yet.

I missed this, apologies.

Can we add pathlib2 as an optional dependency and handle the case where it's not installed?

shoyer

This is great for a start. For consistency, it would be nice for this to also work with open_dataarray, Dataset.to_netcdf, DataArray.to_netcdf and save_mfdataset.

shoyer · 2017-08-22T20:26:42Z

+    if isinstance(paths, GeneratorType):
+        paths = list(paths)
+    if isinstance(paths[0], Path):
+        paths = sorted(str(p) for p in paths)


I don't think we should sort here. We don't do this for lists of strings because (I think) the order in which paths are provided can affect the order of data in the resulting Dataset.

shoyer · 2017-08-22T20:29:17Z

+
    if isinstance(paths, basestring):
        paths = sorted(glob(paths))
+


Rather than explicitly checking for GeneratorType above, let's add an else clause here:

    else:
        # also converts iterables of Path objects
        paths = [str(p) if isinstance(p, Path) else p for p in paths]
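Put together, the suggested branch might look like the sketch below. The function name `expand_paths` is hypothetical, and the sketch checks against `str` because it assumes Python 3 (the PR itself uses `basestring` for Python 2 compatibility).

```python
from glob import glob
from pathlib import Path


def expand_paths(paths):
    """Return a list of path strings for a glob pattern or an iterable of paths."""
    if isinstance(paths, str):  # the PR uses basestring to cover Python 2 as well
        # glob order is arbitrary, so sort for reproducibility
        return sorted(glob(paths))
    # any other iterable: keep the caller's order, convert Path objects to str
    return [str(p) if isinstance(p, Path) else p for p in paths]


# The caller's order is preserved, even though it is not alphabetical.
ordered = expand_paths([Path("2017_b.nc"), "2017_a.nc"])
```

Because the `else` branch iterates over `paths`, it handles lists and generators alike, so no separate `GeneratorType` check is needed.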

shoyer · 2017-08-22T20:30:18Z

+    # handle output of pathlib.Path.glob()
+    if isinstance(paths, GeneratorType):
+        paths = list(paths)
+    if isinstance(paths[0], Path):


I don't like only checking the first element here. There is lots of overhead for opening a file, so I don't think it should be a problem to check isinstance on every element of the paths argument.
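A small comparison may make the concern concrete. Both function names are hypothetical; the first mimics the first-element check from the diff, the second checks every element as suggested.

```python
from pathlib import Path


def convert_first_only(paths):
    """Mimics the approach under review: inspect only paths[0]."""
    if isinstance(paths[0], Path):
        return [str(p) for p in paths]
    return paths


def convert_each(paths):
    """Check every element, as suggested in the review."""
    return [str(p) if isinstance(p, Path) else p for p in paths]


mixed = ["a.nc", Path("b.nc")]
first_only = convert_first_only(mixed)  # the Path in second position is left as-is
each = convert_each(mixed)              # every element ends up as a string
```

With a mixed list, the first-element check silently leaves later `Path` objects unconverted, whereas the per-element check costs only one `isinstance` call per file.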

shoyer · 2017-08-25T05:26:00Z

                raise


+@contextlib.contextmanager


Rather than repeating everything from create_tmp_file, can you simply wrap the output from create_tmp_file in Path?
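One way to do this is sketched below. The `create_tmp_file` shown here is only a simplified stand-in for xarray's test helper (its real signature and cleanup logic are not reproduced); the point is the thin wrapper that yields a `Path`.

```python
import contextlib
import os
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def create_tmp_file(suffix=".nc"):
    """Simplified stand-in for the existing test helper (assumption)."""
    temp_dir = tempfile.mkdtemp()
    try:
        yield os.path.join(temp_dir, "temp" + suffix)
    finally:
        shutil.rmtree(temp_dir)


@contextlib.contextmanager
def create_tmp_file_pathlib(suffix=".nc"):
    """Reuse create_tmp_file and only wrap its output in Path."""
    with create_tmp_file(suffix) as path:
        yield Path(path)
```

The wrapper inherits the cleanup behaviour of the original helper, so nothing has to be duplicated.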

shoyer · 2017-08-25T05:29:17Z

-                                    autoclose=self.autoclose) as actual:
-                    self.assertEqual(actual.foo.variable.data.chunks,
-                                     ((3, 2, 3, 2),))
+        for _create_tmp_file in [create_tmp_file, create_tmp_file_pathlib]:


Can you make a separate test_open_mfdataset_path test rather than squeezing this into test_open_mfdataset?

Some copy/paste is OK, but it doesn't need all the checks we have here around chunks and dask arrays. A simple self.assertDatasetAllClose(original, actual) would suffice.

willirath · 2017-08-25T11:59:27Z

Thanks for the review @shoyer !

On my machine, I've already started adapting the other backends (to_netcdf et al.) to support pathlib as well. I'll include it here.

* Added show_commit_url to asv.conf This should setup the proper links from the published output to the commit on Github. FYI the benchmarks should be running stably now, and posted to http://pandas.pydata.org/speed/xarray. http://pandas.pydata.org/speed/xarray/regressions.xml has an RSS feed to the regressions. * Update asv.conf.json

* Clarify in docs that inferring DataArray dimensions is deprecated * Fix DataArray docstring * Clarify DataArray coords documentation

This follows <jazzband/pathlib2#8 (comment)> who argues for sticking to pathlib2.

willirath · 2017-08-25T15:40:18Z


 INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0']
 TESTS_REQUIRE = ['pytest >= 2.7.1']
+if sys.version_info < (3, 0):


I've added a check for Python < 3 to setup.py. This seems to work. I'm not sure if this really is an accepted way of creating conditional requirements.
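For illustration, a version check of this kind could be written as below, next to an alternative that uses a PEP 508 environment marker. The function `requirements_for` and the constant `INSTALL_REQUIRES_WITH_MARKER` are hypothetical names, and marker support in `install_requires` depends on the setuptools and pip versions in use.

```python
import sys

INSTALL_REQUIRES = ['numpy >= 1.7', 'pandas >= 0.15.0']


def requirements_for(version_info, base=INSTALL_REQUIRES):
    """Append the pathlib2 backport only for Python 2 interpreters."""
    reqs = list(base)
    if version_info < (3, 0):
        reqs.append('pathlib2')
    return reqs


# What setup.py would pass to setup(install_requires=...) on this interpreter.
current = requirements_for(sys.version_info)

# Alternative (assumption: a setuptools/pip recent enough to understand markers):
INSTALL_REQUIRES_WITH_MARKER = INSTALL_REQUIRES + ['pathlib2; python_version < "3.0"']
```

Later in this thread the requirement is dropped again in favour of treating pathlib as optional.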

willirath · 2017-08-25T15:43:45Z

 from glob import glob
 from io import BytesIO
 from numbers import Number
+try:


This prefers pathlib2 and only falls back to pathlib if necessary. I'm following jazzband/pathlib2#8 (comment) here although preferring the stdlib module would feel better.

What do you think?

I agree, I would prefer stdlib module.

willirath · 2017-08-25T15:56:20Z

Tests added / passed

Tests are already done on my personal Travis Account (skipped intermediate commits there).

Passes git diff upstream/master | flake8 --diff

Actually true only for git diff upstream/master xarray/ | flake8 --diff. But I guess that's what should pass the linter?

To me, aae32a8 looks like pathlib support is now present wherever it makes sense.

willirath · 2017-08-25T16:12:23Z

One more thing:

Can we add pathlib2 as an optional dependency and handle the case where it's not installed?

Currently, I've set pathlib2 as a mandatory requirement for Python 2. How would we make pathlib optional? Test for pathlib upon import, and wrap all the if isinstance(path, Path): path = str(path) checks in a function that just passes the path through unmodified if pathlib is not present?

shoyer · 2017-08-25T16:23:28Z

@willirath take a look at what we do for handling the optional dask dependency:

xarray/xarray/core/pycompat.py

Lines 55 to 60 in bcd6081

    
    try:
        # solely for isinstance checks
        import dask.array
        dask_array_type = (dask.array.Array,)
    except ImportError:  # pragma: no cover
        dask_array_type = ()
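Applied to pathlib, the same pattern might look like the following sketch. The name `path_type` appears later in this PR; the helper `path_to_str` is hypothetical. The key point is that `isinstance(x, ())` is always `False`, so the check degrades gracefully when neither module is installed.

```python
try:
    from pathlib import Path  # standard library on Python 3
    path_type = (Path,)
except ImportError:
    try:
        from pathlib2 import Path  # optional backport on Python 2
        path_type = (Path,)
    except ImportError:
        # solely for isinstance checks: an empty tuple never matches
        path_type = ()


def path_to_str(path):
    """Convert Path objects to str; leave everything else untouched."""
    return str(path) if isinstance(path, path_type) else path
```

Callers can use `path_to_str` unconditionally, which avoids scattering import guards across the backends.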

willirath · 2017-08-25T16:28:16Z

take a look at what we do for handling the optional dask dependency

Ah, that's nice!

And then I remove the pathlib2 dependency I've introduced in setup.py again?

shoyer · 2017-08-25T16:31:20Z

And then I remove the pathlib2 dependency I've introduced in setup.py again?

Yes, pathlib2 should not be a required dependency.

jhamman

looks pretty good, just a few minor tweaks from my perspective.

jhamman · 2017-08-25T21:06:21Z

+        from pathlib import Path
+    except ImportError:
+        from pathlib2 import Path
+


let's move this to tests/__init__.py. We'll also want a requires_pathlib defined there.

I added the requires_pathlib decorator but couldn't remove this nested import block from test_backends.py, because Path() is explicitly used to set up the pathlib tests.
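A skip decorator of this kind could be sketched as below, using only `unittest`. The flag `has_pathlib` and the example test class are assumptions for illustration and may differ from what xarray's `tests/__init__.py` actually defines.

```python
import unittest

try:
    from pathlib import Path
    has_pathlib = True
except ImportError:
    try:
        from pathlib2 import Path
        has_pathlib = True
    except ImportError:
        has_pathlib = False


def requires_pathlib(test):
    """Skip the decorated test when neither pathlib nor pathlib2 is importable."""
    if has_pathlib:
        return test
    return unittest.skip('requires pathlib / pathlib2')(test)


class ExampleTest(unittest.TestCase):
    @requires_pathlib
    def test_path_to_string(self):
        # Path is only referenced inside the guarded test body
        self.assertEqual(str(Path("a.nc")), "a.nc")
```

Since `Path` is used inside the test bodies, the import itself still has to live in the test module, which matches the remark above.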

jhamman · 2017-08-25T21:07:06Z

        with self.assertRaisesRegexp(ValueError, 'same length'):
            save_mfdataset([ds, ds], ['only one path'])

+    def test_save_mfdataset_pathlib_roundtrip(self):


add @requires_pathlib decorator

jhamman · 2017-08-25T21:07:21Z

            with open_dataarray(tmp, drop_variables=['y']) as loaded:
                self.assertDataArrayIdentical(expected, loaded)
+
+    def test_dataarray_to_netcdf_no_name_pathlib(self):


add @requires_pathlib decorator

jhamman · 2017-08-25T21:11:24Z

+        from pathlib2 import Path
+    path_type = (Path, )
+except ImportError as e:
+    path_type = ()


this looks right to me but we should put it in pycompat.py

jhamman · 2017-08-25T21:13:44Z

    "install_timeout": 600,

    // the base URL to show a commit for the project.
-    // "show_commit_url": "http://github.com/owner/project/commit/",


I think this is already on master, can we roll this back?

I've rebased onto master at some point. Reverted it now.

shoyer · 2017-08-26T18:21:23Z

 from glob import glob
 from io import BytesIO
 from numbers import Number
+try:


I agree, I would prefer stdlib module.

shoyer · 2017-08-26T18:21:35Z

-        (only netCDF3 supported).
+    filename_or_obj : str, Path, file or xarray.backends.*DataStore
+        Strings and Path objects are interpreted as a path to a netCDF file
+        oran OpenDAP URL and opened with python-netCDF4, unless the filename


oran -> or an

This reverts commit 02023ed.

This reverts commit 4276bb8.

willirath · 2017-08-26T19:47:40Z

flake8 now fails because of the unused pathlib in xarray/tests/__init__.py.

willirath · 2017-08-27T09:35:40Z

AppVeyor test failed with HTTP time-outs when trying to get repodata.json. Can somebody trigger the build again?

jhamman

I think another commit or @shoyer can trigger another build on Appveyor.

jhamman · 2017-08-28T02:15:28Z

 Enhancements
 ~~~~~~~~~~~~

+- Support for `pathlib.Path` objects added to


nit: we typically cite the issue number (e.g. :issue:`799`). Would be nice to include here.

I just pushed a commit to add this

jhamman · 2017-08-28T02:19:01Z

    if isinstance(paths, basestring):
        paths = sorted(glob(paths))
+    else:
+        paths = [str(p) if isinstance(p, path_type) else p for p in paths]


You may have already discussed this with @shoyer but can you remind me why we're not sorting in the same way we do for the glob path above? I guess we're assuming all the paths are expanded already?

We sort after glob() since the iteration order is arbitrary. But we don't sort in general, since the order of the provided filenames might be intentional.

Unfortunately, there isn't any way to detect a generator created by pathlib's glob() method, since it's just a Python generator.

shoyer · 2017-08-31T15:45:12Z

 Enhancements
 ~~~~~~~~~~~~

+- Support for `pathlib.Path` objects added to


I just pushed a commit to add this

shoyer · 2017-08-31T15:48:07Z

+    <xarray.Dataset>
+    [...]
+
+    In [6]: all_files = data_dir.glob("dta_for_month_*.nc")


I removed this example from What's New for two reasons:

The section was getting a little longer than essential

It's actually a bit of an anti-pattern, since the order of paths matters to xarray but is arbitrary from glob. The right way to write this is sorted(data_dir.glob(...)).
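The recommended pattern can be tried without any netCDF files, as in this sketch. The helper `sorted_matches` is a hypothetical name, and the file names simply reuse the ones from the removed example.

```python
import tempfile
from pathlib import Path


def sorted_matches(data_dir, pattern):
    """Glob for pattern under data_dir and return the matches in sorted order."""
    return sorted(Path(data_dir).glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # create the files in a deliberately unsorted order
    for month in ("03", "01", "02"):
        (Path(tmp) / "dta_for_month_{}.nc".format(month)).touch()
    names = [p.name for p in sorted_matches(tmp, "dta_for_month_*.nc")]
```

The sorted list could then be passed to `xr.open_mfdataset`, so that the order of the files no longer depends on the order in which the file system happens to list them.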

willirath · 2017-09-01T10:35:41Z

Thanks @shoyer! I like the shorter example. It shows the essence of why pathlib is such a nice thing. Is there anything more to do in this PR?

On 31 August 2017 at 17:48:24, Stephan Hoyer <notifications@github.com> wrote:

shoyer approved this pull request.

    @@ -21,9 +21,38 @@ v0.9.7 (unreleased)
     Enhancements
     ~~~~~~~~~~~~
    +- Support for `pathlib.Path` objects added to

I just pushed a commit to add this

    +  .. ipython::
    +    :verbatim:
    +    In [1]: import xarray as xr
    +
    +    In [2]: from pathlib import Path # In Python 2, use pathlib2!
    +
    +    In [3]: data_dir = Path("data/")
    +
    +    In [4]: one_file = data_dir / "dta_for_month_01.nc"
    +
    +    In [5]: print(xr.open_dataset(one_file))
    +    Out[5]:
    +    <xarray.Dataset>
    +    [...]
    +
    +    In [6]: all_files = data_dir.glob("dta_for_month_*.nc")

I removed this example from What's New for two reasons:

1. The section was getting a little longer than essential
2. It's actually a bit of an anti-pattern, since the order of paths matters to xarray but is arbitrary from glob. The right way to write this is `sorted(data_dir.glob(...))`.

shoyer · 2017-09-01T15:31:59Z

CI failures are all the issue with dask distributed (#1540), so I'm going ahead and merging.

Thanks @willirath !

willirath added 2 commits August 21, 2017 19:47

Add pathlib support

cb55c45

Loop over tmpfile functions

f9922d6

jhamman added topic-backends enhancement labels Aug 23, 2017

shoyer reviewed Aug 25, 2017

TomAugspurger and others added 6 commits August 25, 2017 15:58

Small documentation fixes (#1516)

4276bb8

* Clarify in docs that inferring DataArray dimensions is deprecated * Fix DataArray docstring * Clarify DataArray coords documentation

Condense pathlib handling for open_mf_dataset

812a483

Add and test pathlib support for backends

47be4b7

Add pathlib2 for python < 3

aac0760

Use pathlib backport if available.

3ca8c9e

This follows <jazzband/pathlib2#8 (comment)> who argues for sticking to pathlib2.

willirath commented Aug 25, 2017

Use pathlib w DataArray.to_netcdf

aae32a8

willirath added 7 commits August 25, 2017 18:46

Handle case of completely missing pathlib

2cc69f4

Remove pathlib requirement

3033433

Drop pathlib from minimal test env

c8722db

Add what's-new entry on pathlib support

aeed776

Prefer stdlib pathlib

137dff2

Suppress ImportError's for pathlib

b55b013

Acutally get suppress function

422615f

willirath changed the title ~~WIP: Add pathlib.Path support to open_(mf)dataset~~ Add pathlib.Path support to open_(mf)dataset Aug 25, 2017

jhamman requested changes Aug 25, 2017

Add decorator for tests requiring pathlib(2)

8c9ee31

shoyer reviewed Aug 26, 2017

willirath added 5 commits August 26, 2017 21:29

Move path_type to central submodule

f3dbf4b

Remove unnecessary parens

efdc883

Revert "Added show_commit_url to asv.conf (#1515)"

999d21d

This reverts commit 02023ed.

Revert "Small documentation fixes (#1516)"

04216f1

This reverts commit 4276bb8.

Fix typo in docstring and fallback-module name

ce156a8

jhamman approved these changes Aug 28, 2017

Tweak what's new for pathlib support

b22a389

shoyer approved these changes Aug 31, 2017

Merge branch 'master' into 799-add-pathlib-support-2

791ba5b

shoyer merged commit 4a15cfa into pydata:master Sep 1, 2017

jhamman mentioned this pull request Sep 1, 2017

v0.10 Release #1535

Closed

13 tasks

