Tweak to to_dask_dataframe() #1667
- Add a `dim_order` argument
- Always write columns for each dimension
- Docstring to NumPy format
    # ensure all variables have the same chunking structure
    if v.chunks != chunks:
        v = v.chunk(chunks)
@jmunroe was there a reason why you didn't just chunk everything here?
On the assumption (probably mistaken) that there was a cost to calling .chunk(chunks) on a variable that already had that chunking structure. If that assumption was not correct, then, yes, everything could just be chunked.
Rechunk is nearly free if chunks are unchanged -- it actually returns the same dask array object.
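The point about rechunking can be checked with a small sketch. It uses `dask.array` directly rather than the xarray `Variable.chunk` wrapper discussed above, and the array shape and chunk sizes are arbitrary values chosen for illustration. Whether the very same object comes back is an implementation detail of dask, so the identity check is printed rather than relied upon.

```python
import dask.array as da

# A small array split into 3x2 blocks (sizes are illustrative only).
x = da.ones((6, 4), chunks=(3, 2))

# Ask for the chunking the array already has.
same = x.rechunk((3, 2))

# Ask for a genuinely different chunking (one block for the whole array).
different = x.rechunk((6, 4))

print(same is x)       # identity check for the unchanged-chunks case
print(different is x)  # a new array is built when the chunks change
print(same.chunks, different.chunks)
```

If the first check prints `True`, dropping the `if v.chunks != chunks` guard adds no graph tasks for variables that are already chunked as requested.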
@jhamman can you please take a look here when you have the chance?
    if isinstance(v, xr.IndexVariable):
        v = v.to_base_variable()
    if dim_order is None:
        dim_order = list(self.dims)
I can't seem to remember, but is this always a sorted tuple/dict?
For Dataset, it's a SortedKeysDict (i.e., the dimensions in alphabetical order).
        var = self.variables[name]
    except KeyError:
        # dimension without a matching coordinate
        values = np.arange(self.dims[name], dtype=np.int64)
Can we initialize this as a dask array, to avoid creating the array when it will not be used?
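A rough sketch of that suggestion, using `dask.array.arange` in place of `np.arange`. The length `n` and the chunk size are placeholder values, not taken from the PR; in the method itself they would come from the dimension size and the dataset's chunking.

```python
import numpy as np
import dask.array as da

n = 1_000_000  # placeholder for self.dims[name]

# Eager version: allocates all n integers immediately.
eager = np.arange(n, dtype=np.int64)

# Lazy version: only records how to build each block; nothing is
# allocated until a result is actually computed.
lazy = da.arange(n, chunks=100_000, dtype=np.int64)

print(lazy)                           # shows shape, dtype, chunk layout
print(int(lazy[:5].sum().compute()))  # computes only what is needed
```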
    ds['y'] = ('y', list('abc'))

    expected = ds.compute().to_dataframe()
    actual = ds.to_dask_dataframe(set_index=True)
Rather than using xfail above, use raises_regex to make sure we raise the error on the correct line.
My reasoning for using xfail was that it makes this test more robust. If/when dask implements MultiIndex, we'll just get an unexpected pass (XPASS) rather than a failing test. NotImplementedError seems specific enough (unlike, e.g., ValueError) that I'm not concerned about grepping for the exact error message.
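The two styles being compared look roughly like this. The function `to_multiindexed_frame` is a hypothetical stand-in for the call that hits the unimplemented feature, and plain `pytest.raises(..., match=...)` is used here in place of xarray's `raises_regex` test helper.

```python
import pytest


def to_multiindexed_frame():
    """Hypothetical stand-in for code that hits an unimplemented feature."""
    raise NotImplementedError("MultiIndex is not supported")


# Style 1: mark the whole test as an expected failure. Once the feature
# exists, the test is reported as an unexpected pass instead of failing.
@pytest.mark.xfail(raises=NotImplementedError)
def test_with_xfail():
    to_multiindexed_frame()


# Style 2: assert that one specific call raises, with a message check.
# Once the feature exists, this test fails and must be updated.
def test_with_raises():
    with pytest.raises(NotImplementedError, match="MultiIndex"):
        to_multiindexed_frame()
```

The trade-off is the one described above: the first style tolerates the feature arriving later, while the second pins the error to a single line.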
Follow on to #1489.
- Tests added / passed
- Passes `git diff upstream/master **/*py | flake8 --diff`
- Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

CC @jmunroe @jhamman