Multi-index indexing #802
|
This looks very nice! I would probably opt for making |
|
I followed your suggestions. Two more comments (not critical issues I think) :
In summary, |
Indeed. This would require another data structure somewhere keeping track of level names -- and ideally also ensuring that they are always unique (like dimensions). This seems fine to me for now.
I agree -- better to require the user to be explicit. I also don't see many use cases for specifying the coordinate value and level name but not the dimension name. What happens if you type |
|
After refactoring a bit,
|
|
Dictionaries are not hashable, so we might be able to detect this case by
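The comment above is cut off, so the following is only a guess at the kind of check it has in mind: a minimal sketch that tests hashability with the built-in `hash()`. The helper name `is_hashable` is hypothetical and is not taken from the PR.

```python
def is_hashable(label):
    """Return True if `label` can be hashed (hypothetical helper)."""
    try:
        hash(label)
    except TypeError:
        # dicts (and lists) raise TypeError here
        return False
    return True


tuple_label = ("a", 0)       # a full multi-index key
dict_label = {"one": "a"}    # a level-name -> value mapping
list_label = ["a", "b"]      # a list of labels

print(is_hashable(tuple_label), is_hashable(dict_label), is_hashable(list_label))
```

Note that lists are unhashable too, so a hashability test alone does not single out dictionaries; an explicit `isinstance(label, dict)` check would still be needed to tell the two apart.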
|
|
Unless you see any other issues, I think that this feature doesn't need more development for now. I'll be back next week to finish this PR (write some tests and doc). |
|
This will be a great feature. I for one am really looking forward to using it. Will this work also allow saving to/reading from hdf5 and netcdf files with a MultiIndex? If not, can you give a sketch outline of the approach you (Stephan or Benoit) would take? I assume it would involve saving the information about the MultiIndex structure in some transformed way that fits into an hdf5 file, then reconstructing it on the read. I might need to hack together something for that before MultiIndex serialization makes it into xarray, but I'd like to make sure I don't veer too far off from the real solution that will ultimately come out. |
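As a rough illustration of the "transformed way" described in the question above, here is a pandas-only sketch that flattens a MultiIndex into one plain array per level and rebuilds it afterwards. It is not the approach xarray settled on; the helper names `encode_multiindex` and `decode_multiindex` are hypothetical, and the sketch assumes all levels have unique, non-`None` names.

```python
import numpy as np
import pandas as pd


def encode_multiindex(midx):
    """Flatten a MultiIndex into level names plus one array per level."""
    names = list(midx.names)
    arrays = {name: np.asarray(midx.get_level_values(name)) for name in names}
    return names, arrays


def decode_multiindex(names, arrays):
    """Rebuild the MultiIndex from the flattened per-level arrays."""
    return pd.MultiIndex.from_arrays([arrays[n] for n in names], names=names)


midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["one", "two"])
names, arrays = encode_multiindex(midx)       # plain arrays could be written to a file
restored = decode_multiindex(names, arrays)   # and read back into a MultiIndex
print(restored.equals(midx))
```

The per-level arrays are ordinary 1-D arrays of equal length, which is the kind of data a netCDF or HDF5 variable can hold; the list of level names would have to be stored alongside them, for example as an attribute.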
|
I hope to have some time next week to work again on this PR. @tippetts You can see in #719 a few comments about saving/reading xarray data objects with multi-index to/from netCDF. I am also looking forward to seeing this feature implemented - actually I need it for another project that uses xarray - so maybe I'll find some time in the next couple of weeks to start a new PR on this. |
ae12850 to 93ef7d2
|
I finally managed to add some tests and docs. Two more comments:
@shoyer I think that it is ready for review. I successfully ran the tests on my local branch. Currently, the CI tests seem to be broken for a reason I don't know. |
|
|
||
| da_midx.sel(x=(list('ab'), [0])) | ||
|
|
||
| Indexing with dictionaries uses the ``MultiIndex.get_loc_level`` pandas method |
This is an implementation detail -- probably best to leave it out of the public docs.
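For readers unfamiliar with the pandas method mentioned in the quoted docs line, here is a small pandas-only sketch of what `MultiIndex.get_loc_level` returns. The index and level names (`one`, `two`) are made up for illustration; how the PR wires this into `.sel` is not shown here.

```python
import pandas as pd

midx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1]], names=["one", "two"]
)

# Select everything where level "one" equals "a".
# Returns an indexer into `midx` and the index left after dropping that level.
indexer, new_index = midx.get_loc_level("a", level="one")

selected = midx[indexer]   # the matching entries of the original MultiIndex
print(selected)
print(new_index)           # a flat index over the remaining level "two"
```

The second return value is what makes it possible to replace the multi-index coordinate with a simpler one after a partial selection.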
e0569dc to 5f7d670
    for k, v in iteritems(self._variables):
        if k in indexes.keys():
            idx = indexes[k]
            variables[k] = Coordinate(idx.name, idx)
If idx.name != k above, then this could be constructing an invalid dataset.
I think we should create Coordinate(k, idx) and then remap back to the original names below, if necessary.
|
Just went through and gave another full review -- this is looking quite nice, just a few more things to clean up! |
|
Thanks for your second review @shoyer ! It actually helped me to better understand the indexing logic of pandas (never too late)! I made some updates according to your comments. I think we're getting closer to a working feature! |
| However, the alternate ``from_series`` constructor will automatically unpack | ||
| any hierarchical indexes it encounters by expanding the series into a | ||
| multi-dimensional array, as described in :doc:`pandas`. | ||
| Xarray supports labeling coordinate values with a :py:class:`pandas.MultiIndex`. |
It might make sense simply to drop this paragraph instead -- do we really need to explicitly call out MultiIndex if it's supported?
No, I don't think we need it. However, it might be good to put a sentence somewhere in the docs recommending that users set names for multi-index levels before creating data arrays or datasets. What do you think?
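A minimal pandas-only sketch of the recommendation above, naming the levels of a MultiIndex before it is handed to anything else. The level names `one` and `two` are placeholders.

```python
import pandas as pd

# A MultiIndex built without names has None for every level name.
unnamed = pd.MultiIndex.from_product([["a", "b"], [0, 1]])

# set_names returns a new index with the given level names.
named = unnamed.set_names(["one", "two"])

print(list(unnamed.names))
print(list(named.names))
```

Names can also be passed directly at construction time through the `names` argument of `from_product`, `from_arrays`, or `from_tuples`.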
|
I did a little bit of testing. Here is one case I found where things don't work as I expected: I would expect an index drop in the last case, too. I guess we need to check for scalars. |
|
|
||
    else:
        label = _asarray_tuplesafe(label)
        if label.ndim == 0:
This is where scalars end up -- we probably need to add a clause here to handle MultiIndex.
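A sketch of what such a clause might look like, under several assumptions: `np.asarray` stands in for the internal `_asarray_tuplesafe`, the function name `locate_scalar` is hypothetical, and a scalar label on a MultiIndex is taken to refer to the first level. This is not the code that was merged.

```python
import numpy as np
import pandas as pd


def locate_scalar(index, label):
    """Return (indexer, new_index) for a scalar label (hypothetical helper)."""
    label = np.asarray(label)  # stand-in for _asarray_tuplesafe
    if label.ndim != 0:
        raise ValueError("expected a scalar label")
    key = label.item()
    if isinstance(index, pd.MultiIndex):
        # A scalar matches several rows: select on the first level and
        # return the index that remains once that level is dropped.
        return index.get_loc_level(key, level=0)
    # Flat index: a plain positional lookup, no replacement index needed.
    return index.get_loc(key), None


flat = pd.Index(["x", "y", "z"])
midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["one", "two"])

flat_pos, flat_new = locate_scalar(flat, "y")
multi_indexer, multi_new = locate_scalar(midx, "a")
print(flat_pos, flat_new)
print(multi_indexer, multi_new)
```

Returning the remaining index alongside the indexer gives the caller what it needs to drop the selected level, which is the behavior expected in the test case reported above.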
|
Mind if I ask if this will get merged into master? It looks like a lot of work went into the pull request, and the discussion + passed checks lead me to believe it could be close to going in. Is there anything a third party can do to push it across the finish line? |
Follows #767.
This is incomplete (it still needs some tests and documentation updates), but it is working for both Dataset and DataArray objects. I also don't know if it is fully compatible with lazy indexing (Dask).

Using the example from #767:

As shown in this example, similarly to pandas, it automatically renames the dimension and assigns a new coordinate when the selection doesn't return a pd.MultiIndex (here it returns a pd.FloatIndex).

In some cases this behavior may be unwanted (??), so I added a drop_level keyword argument (if False, it keeps the multi-index and doesn't change the dimension/coordinate names):

Note that it also works with DataArray.loc, but (for now) in that case it always returns the multi-index:

This is however inconsistent with Dataset.sel and Dataset.loc, which both apply drop_level=True by default, due to their different implementation. Two solutions: (1) make DataArray.loc apply drop_level by default, or (2) use drop_level=False by default everywhere.
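The original examples from the description are not reproduced here. As an analogy only, the pandas behavior that `drop_level` mirrors can be seen with `Series.xs`, which takes a keyword of the same name; the data and level names below are invented for illustration.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["one", "two"])
s = pd.Series(np.arange(4), index=midx)

# Default: the selected level is dropped, leaving a flat index over "two".
dropped = s.xs("a", level="one")

# drop_level=False: the result keeps the full MultiIndex.
kept = s.xs("a", level="one", drop_level=False)

print(dropped.index)
print(kept.index)
```

Both calls select the same two values; only the index attached to the result differs, which is the choice the two proposed solutions above are about.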