Skip to content

Rework DataArray internals#648

Merged
shoyer merged 1 commit into
pydata:masterfrom
shoyer:DataArray-internals
Dec 4, 2015
Merged

Rework DataArray internals#648
shoyer merged 1 commit into
pydata:masterfrom
shoyer:DataArray-internals

Conversation

@shoyer

@shoyer shoyer commented Nov 9, 2015

Copy link
Copy Markdown
Member

Fixes #367
Fixes #634
Fixes #649

The internal data model used by DataArray has been rewritten to fix several outstanding issues (#367, #634 and this stackoverflow report). Namely, if a DataArray has the same name as one of its coordinates, the array and the coordinate no longer share the same data.

This means that creating a DataArray with the same name as one of its dimensions no longer automatically uses that array to label the corresponding coordinate. You will now need to provide coordinate labels explicitly. Here's the old behavior:

In [2]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[2]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 4 5 6

and the new behavior (compare the values of the x coordinate):

In [2]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[2]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 0 1 2

It's also no longer possible to convert a DataArray to a Dataset with DataArray.to_dataset if it is unnamed. This will now raise ValueError. If the array is unnamed, you need to supply the name argument.

@jhamman

jhamman commented Nov 10, 2015

Copy link
Copy Markdown
Member

@shoyer - I read this mainly trying to get a better idea of the internal DataArray data model. The code itself looks great. My main two comments on the refactor are:

  1. The use of the _ThisArray singleton seems a bit clunky to me. This is my first time interacting with this approach so maybe it is the best way, but I don't know much about it.
  2. This all seems to be complicated by the DataArrays's use of the Dataset internally to store coordinates. Not for this PR, but I wonder if there is a way to clear up the lines between the two objects.

All in all, impressive work.

@shoyer

shoyer commented Nov 10, 2015

Copy link
Copy Markdown
Member Author

Indeed, I wonder if it would make sense to decouple DataArray from Dataset by storing the state on two (protected) attributes:

  • variable: the variable for this array
  • coords: an ordered dict of coordinates

The main downside is that we add a bit more redundant code (e.g., to loop over all variables in .sel). But on the plus side our data model is much more similar to the public API, which would probably make things easier to understand, especially for new contributors -- they don't have to learn Dataset to learn DataArray.

@max-sixty

Copy link
Copy Markdown
Collaborator

Indeed, I wonder if it would make sense to decouple DataArray from Dataset by storing the state on two (protected) attributes:

As a newbie, 👍. I took some time to figure out why a DataArray contained a Dataset.

The main downside is that we add a bit more redundant code (e.g., to loop over all variables in .sel).

Low confidence, but you could have a common ancestor (XRayContainer, etc) which contained the similar code or a ._stuff attribute that's iterated over

Comment thread xray/core/dataarray.py Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but name = name or self.name might be clearer / more pythonic than the repeated if name is None:

@jhamman

jhamman commented Nov 10, 2015

Copy link
Copy Markdown
Member

@shoyer - do you have a feel for how difficult it would be to go the variable/coords route?

@shoyer

shoyer commented Nov 10, 2015

Copy link
Copy Markdown
Member Author

@shoyer - do you have a feel for how difficult it would be to go the variable/coords route?

Hmm. Might not be so bad now that I've already gone through the trouble of thinking what these new tests should look like. I'll give it a shot tonight and see how it goes...

@shoyer

shoyer commented Nov 10, 2015

Copy link
Copy Markdown
Member Author

I realize now that changing the internal representation for DataArray doesn't mean we need to rewrite how every routine works. We can still convert dataarrays to a dataset when convenient -- it just means we'll need to use a method to do so instead of modifying ._dataset. For example, we currently have:

    def copy(self, deep=True):
        ds = self._dataset.copy(deep=deep)
        return self._with_replaced_dataset(ds)

and instead we could simply write:

    def copy(self, deep=True):
        ds = self._to_temp_dataset().copy(deep=deep)
        return self._new_from_temp_dataset(ds)

However, going forward it will give us more flexibility for how to write DataArray methods. For example, it might actually be clearer to write:

    def copy(self, deep=True):
        variable = self.variable.copy(deep=deep)
        coords = OrderedDict((k, v.copy(deep=deep))
                             for k, v in self._coords.items())
        return type(self)(variable, coords, name=name, fastpath=True)

@shoyer

shoyer commented Nov 11, 2015

Copy link
Copy Markdown
Member Author

OK, latest commit changes DataArray's internals to rely on _variable and _coords instead of _dataset. There's still a bit more _to_temp_dataset() than ideal, but hopefully the data model makes more sense now...

@shoyer shoyer force-pushed the DataArray-internals branch 2 times, most recently from 0aeea33 to edea054 Compare November 12, 2015 06:46
@shoyer shoyer force-pushed the DataArray-internals branch 2 times, most recently from 0e9b656 to 96eeb13 Compare November 30, 2015 08:34
@shoyer

shoyer commented Nov 30, 2015

Copy link
Copy Markdown
Member Author

This is ready for review if anyone wants to take another look.

Comment thread xray/core/dataarray.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's going on here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- needed explanation. Let me know if the comments I added help.

@jhamman

jhamman commented Dec 2, 2015

Copy link
Copy Markdown
Member

@shoyer - I don't have any more inline comments. There is one failing test and there are merge conflicts, once those are addressed, I'll take one more brief look.

@shoyer shoyer force-pushed the DataArray-internals branch from 5325ff9 to c00a72b Compare December 4, 2015 01:22
Fixes GH367
Fixes GH634

The internal data model used by :py:class:`~xray.DataArray` has been
rewritten to fix several outstanding issues (:issue:`367`, :issue:`634`,
`this stackoverflow report`_). Internally, ``DataArray`` is now implemented
in terms of ``._variable`` and ``._coords`` attributes instead of holding
variables in a ``Dataset`` object.
@shoyer shoyer force-pushed the DataArray-internals branch from 24b90c3 to f368046 Compare December 4, 2015 01:37
@shoyer

shoyer commented Dec 4, 2015

Copy link
Copy Markdown
Member Author

Rebased and tests are passing.

@jhamman

jhamman commented Dec 4, 2015

Copy link
Copy Markdown
Member

lgtm, go ahead and merge.

shoyer added a commit that referenced this pull request Dec 4, 2015
@shoyer shoyer merged commit 2c6a366 into pydata:master Dec 4, 2015
@shoyer shoyer deleted the DataArray-internals branch December 4, 2015 20:40
@max-sixty

Copy link
Copy Markdown
Collaborator

👏
I'll try and get my PR in this weekend for the pandas wrapping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

error when using broadcast_arrays with coordinates Unexpected behavior by diff when applied to coordinate DataArray ds['time.time'] is broken

3 participants