Tweak to to_dask_dataframe() #1667

Merged
jhamman merged 6 commits into pydata:master from shoyer:dask_dataframe_tweak on Oct 31, 2017
shoyer commented Oct 28, 2017

Member

Follow on to #1489.

  • Add a dim_order argument

  • Always write columns for each dimension

  • Docstring to NumPy format

  • Tests added / passed

  • Passes git diff upstream/master **/*py | flake8 --diff

  • Fully documented, including whats-new.rst for all changes and api.rst for new API

CC @jmunroe @jhamman

shoyer requested a review from jhamman October 28, 2017 17:35
Comment thread xarray/core/dataset.py

# ensure all variables have the same chunking structure
if v.chunks != chunks:
    v = v.chunk(chunks)

Member Author

@jmunroe was there a reason why you didn't just chunk everything here?

Contributor

On the assumption (probably mistaken) that there was a cost to calling .chunk(chunks) on a variable that already had that chunking structure. If that assumption was not correct, then, yes, everything could just be chunked.

Member Author

Rechunk is nearly free if chunks are unchanged -- it actually returns the same dask array object.
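A quick way to check this claim, as a sketch using `dask.array` directly rather than an xarray variable (the array shape and chunk sizes are arbitrary). The identity check reflects the behavior described above; it is printed rather than assumed, since it could differ between dask versions.

```python
import dask.array as da

# Arbitrary example array with 2x3 chunks.
x = da.ones((4, 6), chunks=(2, 3))

# Rechunking to the chunks the array already has.
same = x.rechunk(x.chunks)

# Rechunking to a genuinely different layout builds a new array.
different = x.rechunk((4, 2))

print(same is x)          # whether the original object was returned
print(same.chunks)
print(different.chunks)
```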

shoyer commented Oct 30, 2017

Member Author

@jhamman can you please take a look here when you have the chance?

jhamman left a comment

Member

a few minor comments

Comment thread xarray/core/dataset.py
if isinstance(v, xr.IndexVariable):
    v = v.to_base_variable()
if dim_order is None:
    dim_order = list(self.dims)

Member

I can't seem to remember but is this always a sorted tuple/dict?

Member Author

For Dataset, it's a SortedKeysDict (i.e., the dimensions in alphabetical order).
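To make the consequence for the default `dim_order` concrete, here is a standard-library sketch of a mapping that iterates its keys in sorted order. The class below is a hypothetical stand-in written for this illustration, not xarray's actual `SortedKeysDict`, and the dimension names and sizes are invented.

```python
from collections.abc import Mapping


class SortedKeys(Mapping):
    """Hypothetical read-only mapping that iterates keys in sorted order."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        # Sorting here is what makes list(mapping) alphabetical.
        return iter(sorted(self._data))

    def __len__(self):
        return len(self._data)


# Dimensions inserted in a non-alphabetical order.
dims = SortedKeys({"y": 3, "x": 2, "time": 4})

# Mirrors `dim_order = list(self.dims)` from the snippet under review.
dim_order = list(dims)
print(dim_order)  # ['time', 'x', 'y']
```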

Comment thread xarray/core/dataset.py Outdated
    var = self.variables[name]
except KeyError:
    # dimension without a matching coordinate
    values = np.arange(self.dims[name], dtype=np.int64)

Member

Can we initialize this as a dask array to avoid creating the array when it will not be used?

Member Author

Yes, good idea, will do
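A sketch of what the lazy version could look like, assuming `dask.array.arange` with a single chunk; the `size` value stands in for `self.dims[name]` and is hypothetical. This is one possible form of the suggestion, not necessarily the code that was merged.

```python
import dask.array as da
import numpy as np

size = 5  # hypothetical stand-in for self.dims[name]

# Lazy equivalent of np.arange(size, dtype=np.int64): no values are
# materialized until something calls .compute().
values = da.arange(size, chunks=size, dtype=np.int64)

print(values)            # shows the lazy dask array, not its contents
print(values.compute())  # materializes the integers 0..size-1
```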

Comment thread xarray/tests/test_dask.py
ds['y'] = ('y', list('abc'))

expected = ds.compute().to_dataframe()
actual = ds.to_dask_dataframe(set_index=True)

Member

Rather than using xfail above, use raises_regex to make sure we raise the error on the correct line.

Member Author

My reasoning on using xfail was that it makes this test more robust. If/when dask implements MultiIndex, we'll just get an unexpectedly passing xfail rather than a failing test. NotImplementedError seems specific enough (unlike, e.g., ValueError) that I'm not concerned about grepping for the exact error message.
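The two options being weighed can be sketched with plain pytest. `build_multiindex_frame` and its error message are hypothetical stand-ins for the `to_dask_dataframe(set_index=True)` call, and `pytest.raises(..., match=...)` is used here in place of xarray's `raises_regex` test helper.

```python
import pytest


def build_multiindex_frame():
    """Hypothetical stand-in for a call that is not supported yet."""
    raise NotImplementedError("dask does not support multi-indexed dataframes")


# Option 1: xfail restricted to one exception type. If the feature is
# implemented later, the test is reported as an unexpected pass instead
# of failing the suite.
@pytest.mark.xfail(raises=NotImplementedError)
def test_multiindex_xfail():
    build_multiindex_frame()


# Option 2: assert the exception and its message explicitly. This pins
# down where the error comes from, but starts failing as soon as the
# feature is implemented.
def test_multiindex_raises():
    with pytest.raises(NotImplementedError, match="multi-index"):
        build_multiindex_frame()
```

Option 1 trades precision for robustness, which is the argument made above; option 2 is stricter about which error is raised.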

jhamman merged commit 7e9193c into pydata:master Oct 31, 2017
shoyer deleted the dask_dataframe_tweak branch October 31, 2017 18:21