Multi-index indexing by benbovy · Pull Request #802 · pydata/xarray

benbovy · 2016-03-24T14:39:38Z

Follows #767.

This is incomplete (it still needs some tests and documentation updates), but it is working for both Dataset and DataArray objects. I also don't know if it is fully compatible with lazy indexing (Dask).

Using the example from #767:

In [4]: da.sel(band_wavenumber={'band': 'foo'})
Out[4]:
<xarray.DataArray (wavenumber: 2)>
array([ 0.00017,  0.00014])
Coordinates:
  * wavenumber  (wavenumber) float64 4.05e+03 4.05e+03

As shown in this example, similarily to pandas, it automatically renames the dimension and assigns a new coordinate when the selection doesn't return a pd.MultiIndex (here it returns a pd.FloatIndex).

In some cases this behavior may be unwanted (??), so I added a drop_level keyword argument (if False it keeps the multi-index and doesn't change the dimension/coordinate names):

In [5]: da.sel(band_wavenumber={'band': 'foo'}, drop_level=False)
Out[5]:
<xarray.DataArray (band_wavenumber: 2)>
array([ 0.00017,  0.00014])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('foo', 4050.2) ('foo', 4050.3)

Note that it also works with DataArray.loc, but (for now) in that case it always returns the multi-index:

In [6]: da.loc[{'band_wavenumber': {'band': 'foo'}}]
Out[6]:
<xarray.DataArray (band_wavenumber: 2)>
array([ 0.00017,  0.00014])
Coordinates:
  * band_wavenumber  (band_wavenumber) object ('foo', 4050.2) ('foo', 4050.3)

This is however inconsistent with Dataset.sel and Dataset.loc that both apply drop_level=True by default, due to their different implementation. Two solutions: (1) make DataArray.loc apply drop_level by default, or (2) use drop_level=False by default everywhere.

shoyer · 2016-03-25T03:53:53Z

This looks very nice! I would probably opt for making drop_level=True the default for .loc -- it's hard to see when that wouldn't be desirable. In fact, I'm not even sure that it's worth having the option to set drop_level=False.

benbovy · 2016-03-25T14:17:29Z

I followed your suggestions.

Two more comments (not critical issues I think) :

Multi-index level drop only works when a dict is provided for a dimension, i.e., da.sel(band_wavenumber='foo') still returns the full pandas.MultiIndex. This would require for indexing.convert_label_indexer to return an updated index also when label is not dict-like, which is not that straightforward I think.
There is a potential conflict when providing a dictionary to DataArray.loc, it may be interpreted either as a mapping of dimensions / labels or a mapping of index levels / labels for the 1st dimension. For now, the second option is not handled, e.g., da.loc[{'band': 'foo'}] raises a KeyError.

In summary, da.loc[{'band_wavenumber': {'band': 'foo'}}] returns a DataArray with updated index and coordinate / dimension name, da.loc['foo'] returns a DataArray with the full multi-index, and da.loc[{'band': 'foo'}] raises a KeyError. Three different results for the same operation, although I think that the first one is more explicit and better.

shoyer · 2016-03-25T15:51:02Z

Multi-index level drop only works when a dict is provided for a dimension, i.e., da.sel(band_wavenumber='foo') still returns the full pandas.MultiIndex. This would require for indexing.convert_label_indexer to return an updated index also when label is not dict-like, which is not that straightforward I think.

Indeed. This would require an another data structure somewhere keeping track of level names -- and ideally also ensuring that they are always unique (like dimensions). This seems fine to me for now.

There is a potential conflict when providing a dictionary to DataArray.loc, it may be interpreted either as a mapping of dimensions / labels or a mapping of index levels / labels for the 1st dimension. For now, the second option is not handled, e.g., da.loc[{'band': 'foo'}] raises a KeyError.

I agree -- better to require the user to be explicit. I also don't see many use cases for specifying the coordinate value and level name but not the dimension name. What happens if you type da.loc[{'band': 'foo'}, :] or da.loc[{'band': 'foo'}, ...]? In principle these are not ambiguous, but I could only guess at what happens without looking at the code.

benbovy · 2016-03-25T17:10:40Z

After refactoring a bit, da.loc[{'band': 'foo'}, :] and da.loc[{'band': 'foo'}, ...] now work as expected.

da.loc[{'band': 'foo'}] still raises a KeyError. Although it is possible to catch this exception, it still remains ambiguous so it is probably better to do nothing.

shoyer · 2016-03-25T17:30:33Z

Dictionaries not hashable, so we might be able to detect this case by
calling hash() and raising a KeyError with alternative suggestions. Just an
idea -- not sure if there's a clean way to implement that.

On Fri, Mar 25, 2016 at 10:10 AM, Benoit Bovy notifications@github.com
wrote:

After refactoring a bit, da.loc[{'band': 'foo'}, :] and da.loc[{'band':
'foo'}, ...] now work as expected.

da.loc[{'band': 'foo'}] still raises a KeyError. Although it is possible
to catch this exception, it still remains ambiguous so it is probably
better to do nothing.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#802 (comment)

benbovy · 2016-03-25T17:56:51Z

Unless you see any other issues, I think that this feature doesn't need more development for now.

I'll be back next week to finish this PR (write some tests and doc).

tippetts · 2016-04-06T14:45:17Z

This will be a great feature. I for one am really looking forward to using it.

Will this work also allow saving to/reading from hdf5 and netcdf files with a MultiIndex? If not, can you give a sketch outline of the approach you (Stephan or Benoit) would take? I assume it would involve saving the information about the MultiIndex structure in some transformed way that fits into an hdf5 file, then reconstructing it on the read. I might need to hack together something for that before MultiIndex serialization makes it into xarray, but I'd like to make sure I don't veer too far off from the real solution that will ultimately come out.

benbovy · 2016-04-06T16:12:44Z

I hope to have some time next week to work again on this PR.

@tippetts You can see in #719 a few comments about saving/reading xarray data objects with multi-index to/from netCDF. I also looking forward to see this feature implemented - actually I need it for another project that uses xarray - so maybe I'll find some time in the next couple of weeks to start a new PR on this.

benbovy · 2016-05-30T19:30:03Z

I finally managed to add some tests and docs.

Two more comments:

Index replacement: if the multi-index has unnamed levels, the chosen name for the replaced index (if any) is <old_dim>_unnamed_level (old_dim is the name of the replaced multi-index dimension).
Not sure whether or not it is really necessary to track the index level number. Single indexes returned by get_loc_level don't keep that information. Anyway, it is always better to set explicit names for the index levels.
Indexing on multi-index now also works with nested tuples (multi-index panel to xrarray dataarray index selection #850).

@shoyer I think that it is ready for review. I successfully run the tests on my local branch. Currently, CI tests seem broken for some reason I don't know.

shoyer · 2016-06-01T18:13:28Z

+
+   da_midx.sel(x=(list('ab'), [0]))
+
+Indexing with dictionaries uses the ``MultiIndex.get_loc_level`` pandas method


This is an implementation detail -- probably best to leave it out of the public docs.

shoyer · 2016-06-07T16:26:29Z

+        for k, v in iteritems(self._variables):
+            if k in indexes.keys():
+                idx = indexes[k]
+                variables[k] = Coordinate(idx.name, idx)


If idx.name != k above, then this could be constructing an invalid dataset.

I think we should create Coordinate(k, idx) and then remap back to the original names below, if necessary.

shoyer · 2016-06-07T17:05:11Z

Just went through and gave another full review -- this is looking quite nice, just a few more things to clean up!

benbovy · 2016-06-10T16:52:44Z

Thanks for your second review @shoyer ! It actually helped me to better understand the indexing logic of pandas (never too late)!

I made some updates according to your comments.

I think we're getting closer to a working feature!

shoyer · 2016-06-11T01:40:47Z

-However, the alternate ``from_series`` constructor will automatically unpack
-any hierarchical indexes it encounters by expanding the series into a
-multi-dimensional array, as described in :doc:`pandas`.
+Xarray supports labeling coordinate values with a :py:class:`pandas.MultiIndex`.


It might make sense simply to drop this paragraph instead -- do we really need to explicitly call out MultiIndex if it's supported?

No I don't think we need it. However, it might be good to put a sentence somewhere in the docs to recommend users to set names for multi-index levels before creating data arrays or datasets. What do you think?

shoyer · 2016-06-11T01:57:10Z

I did a little bit of testing. Here is one case I found where things don't work as I expected:

In [32]: idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]], names=['lev0', 'lev1'])

In [33]: data = xr.DataArray(range(6), [('x', idx)])

In [34]: data.sel(x=('a',))
Out[34]:
<xarray.DataArray (lev1: 3)>
array([0, 1, 2])
Coordinates:
  * lev1     (lev1) int64 0 1 2

In [35]: data.sel(x='a')
Out[35]:
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x        (x) object ('a', 0) ('a', 1) ('a', 2)

I would expect an index drop in the last case, too. I guess we need to check for scalars.

shoyer · 2016-06-11T01:59:30Z

+
    else:
        label = _asarray_tuplesafe(label)
        if label.ndim == 0:


this is where scalars end up -- probably need to add a clause here to handle MultiIndex

tippetts · 2016-07-18T22:50:50Z

Mind if I ask if this will get merged into master? It looks like a lot of work went into the pull request, and the discussion + passed checks lead me to believe it could be close to going in. Is there anything a third party can do to push it across the finish line?

shoyer · 2016-07-19T01:15:34Z

@tippetts Thanks for checking in. I actually hadn't noticed the last round of changes here, but I think this is good to go in now! Big thanks to @benbovy!!!

benbovy · 2016-07-19T10:48:56Z

Thanks again @shoyer for your reviews and comments!

Merging this PR broke the tests added in #881. I fixed that in #903.

shoyer mentioned this pull request May 20, 2016

multi-index panel to xrarray dataarray index selection #850

Closed

benbovy force-pushed the multi-index_select branch from ae12850 to 93ef7d2 Compare May 30, 2016 15:43

shoyer reviewed Jun 1, 2016
View reviewed changes

Benoit Bovy added 14 commits June 4, 2016 18:16

Use dict-like for indexing on dims with multiindex

33ee76f

renamed drop_levels to drop_level

6695e85

fixed existing test failures

647967a

removed drop_level option

de03fc9

multi-index level drop for DataArray.loc

f88cd42

mindex level drop for DataArray.loc with Ellipsis

8b17a79

fix unnamed indexes

881da7f

added tests for indexing and selection

b786793

added documentation (indexing)

3b6d36d

allow multi-index indexing with nested tuples

0a5a7cf

updated what's new

8a9d488

updated doc

fd26dce

set better default names for multi-index levels

dabb2ce

refactored and fixed dim name / coord replacement

5f7d670

benbovy force-pushed the multi-index_select branch from e0569dc to 5f7d670 Compare June 4, 2016 16:28

shoyer reviewed Jun 7, 2016
View reviewed changes

Benoit Bovy added 6 commits June 8, 2016 15:30

re-written _replace_indexes

e900e9a

avoid creating temp dataset for dataarray sel/loc

d73cff5

clean up internal function is_nested_tuple

31b2c50

better handling of multi-index level drop

6e17a85

more global handling of unnamed multi-index levels

934beef

updated doc

03c21bd

typos and missing details (docstrings, doc)

030ee25

shoyer reviewed Jun 11, 2016
View reviewed changes

handle multi-index level drop for scalar labels

712497c

shoyer merged commit 7a9e84b into pydata:master Jul 19, 2016

benbovy deleted the multi-index_select branch July 19, 2016 10:46

shoyer mentioned this pull request Aug 18, 2016

Indexing with alignment and broadcasting #974

Closed

benbovy mentioned this pull request Sep 14, 2016

MultiIndex and data selection #767

Closed


		da_midx.sel(x=(list('ab'), [0]))

		Indexing with dictionaries uses the ``MultiIndex.get_loc_level`` pandas method

Uh oh!

Uh oh!

Conversation

benbovy commented Mar 24, 2016

Uh oh!

shoyer commented Mar 25, 2016

Uh oh!

benbovy commented Mar 25, 2016

Uh oh!

shoyer commented Mar 25, 2016

Uh oh!

benbovy commented Mar 25, 2016

Uh oh!

shoyer commented Mar 25, 2016

Uh oh!

benbovy commented Mar 25, 2016

Uh oh!

tippetts commented Apr 6, 2016

Uh oh!

benbovy commented Apr 6, 2016

Uh oh!

benbovy commented May 30, 2016

Uh oh!

shoyer Jun 1, 2016

Choose a reason for hiding this comment

Uh oh!

shoyer Jun 7, 2016

Choose a reason for hiding this comment

Uh oh!

shoyer commented Jun 7, 2016

Uh oh!

benbovy commented Jun 10, 2016

Uh oh!

shoyer Jun 11, 2016

Choose a reason for hiding this comment

Uh oh!

benbovy Jun 11, 2016

Choose a reason for hiding this comment

Uh oh!

shoyer commented Jun 11, 2016

Uh oh!

shoyer Jun 11, 2016

Choose a reason for hiding this comment

Uh oh!

tippetts commented Jul 18, 2016

Uh oh!

shoyer commented Jul 19, 2016

Uh oh!

benbovy commented Jul 19, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants