ARROW-881: [Python] Reconstruct Pandas DataFrame indexes using metadata #612

Closed
cpcloud wants to merge 10 commits into apache:master from cpcloud:ARROW-881

Conversation

cpcloud commented Apr 29, 2017
Contributor

Comment thread cpp/src/arrow/ipc/metadata.cc Outdated

Contributor Author

I'm going to revert these.

Comment thread cpp/src/arrow/ipc/metadata.cc Outdated

Contributor Author

This too.

Comment thread cpp/src/arrow/type.h Outdated

Contributor Author

This is a rebase artifact.

Comment thread python/pyarrow/_parquet.pyx Outdated

Contributor Author

This needs a test

Member

is this cool now?

Contributor Author

Let me add a small test, doing it now.

Contributor Author

done

Comment thread python/pyarrow/tests/test_parquet.py Outdated

cpcloud commented Apr 29, 2017

Contributor Author

This limits us to MultiIndexes with <= 255 levels (because we're using string -> string for metadata). I think that's reasonable for now. We can always come up with a more complex encoding if we want to support more levels than that. I'd be surprised if this ever comes up in practice.

Comment thread python/pyarrow/__init__.py Outdated

Member

Maybe serialize_pandas and deserialize_pandas?

Contributor Author

Done

Comment thread python/pyarrow/_table.pyx Outdated

Member

This function is getting "chubby" enough that we should probably move it to a pandas utility module in pure Python.

Comment thread python/pyarrow/_table.pyx Outdated

Member

Same comment as above re: doing this in pure Python. It would also encourage adding appropriate public APIs to pyarrow.Table. We already have Table.remove_column, so it is probably better to use that if possible.

Comment thread python/pyarrow/ipc.py Outdated

Member

Docstrings

Member

This DEFAULT_INDEX_FIELD is a slight nuisance. Perhaps add an argument to from_pandas that controls whether to ingest the index (default could be True or False I guess)?

Comment thread python/pyarrow/tests/test_parquet.py Outdated

Member

I think <= 255 levels is OK. I would actually rather see this metadata stored as a JSON blob under a single pandas key, otherwise we are possibly muddying the metadata namespace.

metadata = {b'pandas': json.dumps(pandas_meta).encode('utf8')}
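
To make the single-key suggestion concrete, here is a minimal round-trip sketch using only the standard library. The contents of `pandas_meta` (an `index_columns` list and the column name in it) are assumptions for illustration; the thread does not spell out the actual schema.

```python
import json

# Hypothetical pandas-specific metadata; the real layout is not shown in this thread.
pandas_meta = {"index_columns": ["__index_level_0__"]}

# Writing: everything pandas-related lives under one b'pandas' key,
# so no other keys are added to the metadata namespace.
metadata = {b"pandas": json.dumps(pandas_meta).encode("utf8")}

# Reading: decode the bytes to text first, then parse the JSON.
restored = json.loads(metadata[b"pandas"].decode("utf8"))
```

Because the value is a JSON document rather than a flat string -> string mapping, nested structures such as lists of index column names survive the round trip without a custom encoding.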

wesm commented Apr 29, 2017
Member

PARQUET-595 is merged

cpcloud commented May 14, 2017
Contributor Author

@wesm This is ready for another round of review when you get a chance.

wesm commented May 14, 2017
Member

OK, taking a look now. Minor rebase conflict from #679

cpcloud commented May 14, 2017
Contributor Author

Fixed the conflict and addressed the {,RecordBatch}File{Reader,Writer} change.

wesm left a comment

Member

Overall this looks fine, and it will be very nice to have! I would say we should start factoring out code from pyarrow.lib that doesn't need to be cythonized, which will also make iterative development a little easier in some cases.

Comment thread cpp/src/arrow/type.h Outdated

Member

You should be able to call this and the one above in a const context. You'll have to mark name_to_index_ as mutable to make this work

Contributor Author

done

Comment thread python/pyarrow/_parquet.pyx Outdated

Member

is this cool now?

Comment thread python/pyarrow/array.pxi Outdated

Member

Maybe we should factor this out into a pandas_compat.py module, along with the rest of the stuff below

Contributor Author

This is pretty awkward to factor out because of the TimeUnit_* enum values. We'd have to make pandas_compat.pxi if we wanted to keep those available to Cython but not Python (which would seem to defeat part of the purpose of factoring out) or expose the enum values to Python. This doesn't seem worth it for something that will never be seen by a user. Still, if you feel strongly about it I can spend some more time on it.

Member

True true, no worries, this is fine as is.

Comment thread python/pyarrow/parquet.py Outdated

Member

I think we need an explicit read_pandas function in this class so that the user must express intent to use the additional pandas metadata

Member

I think this is the last thing in this patch. I would like to have the option to ignore the metadata and read the file as-is as an Arrow table (without having the index columns tacked on against my will). So we can either add a read_pandas method that enables the metadata wrangling logic, or an option to read that does the same thing.

Contributor Author

Yep, fully on board here. Just trying to iron out pandas_compat stuff, then moving on to this.

Member

👍
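
The opt-in behaviour discussed in this thread could look roughly like the sketch below. It is not the code from this patch: `columns_to_read`, its `use_pandas_metadata` flag, and the `index_columns` key are hypothetical names chosen for illustration. The point is only that index columns are appended when the caller explicitly asks for pandas handling, and the request is honoured as-is otherwise.

```python
import json


def columns_to_read(requested, file_metadata, use_pandas_metadata=False):
    """Return the column names to load (hypothetical helper).

    With use_pandas_metadata=False the request is returned unchanged, so the
    file can be read as a plain Arrow table. With True, any index columns
    recorded under the b'pandas' key are appended so the index can be rebuilt.
    """
    columns = list(requested)
    if not use_pandas_metadata:
        return columns

    blob = file_metadata.get(b"pandas")
    if blob is None:
        # No pandas metadata was written: nothing to add.
        return columns

    index_columns = json.loads(blob.decode("utf8")).get("index_columns", [])
    for name in index_columns:
        if name not in columns:
            columns.append(name)
    return columns


# Example metadata, assuming a single index column named "idx".
example_meta = {b"pandas": json.dumps({"index_columns": ["idx"]}).encode("utf8")}
plain = columns_to_read(["a"], example_meta)
with_index = columns_to_read(["a"], example_meta, use_pandas_metadata=True)
```

Here `plain` contains only the requested column, while `with_index` also carries the index column, which matches the intent that users must ask for the metadata-driven behaviour.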

Comment thread python/pyarrow/table.pxi Outdated

Member

This is a regression, since pandas is not a hard dependency.

Contributor Author

fixed

Comment thread python/pyarrow/table.pxi Outdated

Member

Move some of this code to pyarrow.pandas_compat?

Comment thread python/pyarrow/table.pxi Outdated

Member

Use pyarrow_wrap_table here

Contributor Author

done

Member

Maybe make check_index default to false?

Contributor Author

done

Comment thread python/pyarrow/tests/test_ipc.py Outdated

Member

Test a MultiIndex here?

What is the behavior when the columns are not strings?

cpcloud commented May 15, 2017

Contributor Author

This now raises a TypeError alerting the user to the fact that column names cannot be anything other than strings.

Contributor Author

Also added a multiindex test.
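
For readers following along, the kind of check described above (rejecting non-string column names with a TypeError) can be sketched in pure Python as follows. The function name and the error message are assumptions; the actual check in the patch is not shown in this thread.

```python
def check_column_names(names):
    """Raise TypeError if any column name is not a string (hypothetical helper)."""
    for name in names:
        if not isinstance(name, str):
            raise TypeError(
                "Column names must be strings, got {!r} of type {}".format(
                    name, type(name).__name__
                )
            )
    return list(names)


# String names pass through unchanged.
valid = check_column_names(["a", "b"])

# An integer column label, as pandas creates by default, is rejected.
try:
    check_column_names(["a", 0])
    rejected = False
except TypeError:
    rejected = True
```

Failing early with a clear message is friendlier than letting a non-string name reach the string -> string metadata layer and fail there.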

Comment thread python/pyarrow/table.pxi Outdated

Member

Document this extra parameter

Contributor Author

done

wesm commented May 15, 2017
Member

I think this and #602 are the last things I'd like to get in before cutting 0.4.0 (outside some clean up patches).

cpcloud commented May 15, 2017
Contributor Author

Sounds good!

Comment thread python/pyarrow/ipc.py Outdated

Member

Can you add nthreads=None here and pass it through to to_pandas (single-threaded by default)?

Contributor Author

Will do.

Contributor Author

done

wesm commented May 15, 2017
Member

Made a last comment on #612, but outside of that I think this is about good to go.

Comment thread python/pyarrow/parquet.py Outdated

Member

Need to call _get_column_indices on these?

Contributor Author

Ah crap. Yep. Will also add a test since this wasn't failing for me locally.

Contributor Author

done

wesm commented May 16, 2017
Member

Here's the appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.158

+1, thanks for doing this!

asfgit closed this in bed0197 on May 16, 2017
cpcloud deleted the ARROW-881 branch on May 16, 2017 18:54
jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Jun 3, 2017
cc @mrocklin

Author: Phillip Cloud <cpcloud@gmail.com>

Closes apache#612 from cpcloud/ARROW-881 and squashes the following commits:

4fa679d [Phillip Cloud] Add metadata test
60f71aa [Phillip Cloud] More doc
de616e8 [Phillip Cloud] Add doc
a42a084 [Phillip Cloud] Decode metadata to utf8 because JSON
2198dc5 [Phillip Cloud] Call column_name_idx on index_columns
32c5e64 [Phillip Cloud] Add test for read_pandas subset
2fa1f16 [Phillip Cloud] Do not write index_column metadata if not requested
21a8829 [Phillip Cloud] Add docs to pq.read_pandas
c35970c [Phillip Cloud] Add test for no index written and pq.read_pandas
59477b5 [Phillip Cloud] ARROW-881: [Python] Reconstruct Pandas DataFrame indexes using custom_metadata