Skip to content

ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer#8302

Closed
pitrou wants to merge 3 commits into
apache:masterfrom
pitrou:ARROW-10121-ipc-emit-dicts
Closed

ARROW-10121: [C++] Fix emission of new dictionaries in IPC writer#8302
pitrou wants to merge 3 commits into
apache:masterfrom
pitrou:ARROW-10121-ipc-emit-dicts

Conversation

@pitrou

@pitrou pitrou commented Sep 29, 2020

Copy link
Copy Markdown
Member

When a dictionary changes from the previous batch, it is emitted again in the IPC stream.
If this happens when writing the IPC file format, an error is returned.

@pitrou

pitrou commented Sep 29, 2020

Copy link
Copy Markdown
Member Author

@wesm This is a first draft. It is functional (i.e. fixes the underlying issue) but doesn't try to detect deltas.

@github-actions

Copy link
Copy Markdown

Comment thread cpp/src/arrow/ipc/writer.cc Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this will populate last_dictionaries_ entries even when the length of the dictionary is zero? I recall there was some discussion about empty dictionaries (and whether they need to be written at all) at the start of a stream

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we can skip zero-length dictionaries explicitly, but won't it confuse the IPC reader?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if skipped dictionaries will work, but they are permitted by the format, so it should probably be investigated in a follow up issue

Comment thread cpp/src/arrow/ipc/writer.cc Outdated
@pitrou pitrou force-pushed the ARROW-10121-ipc-emit-dicts branch from c414eec to cd759d0 Compare September 30, 2020 14:06
@pitrou

pitrou commented Sep 30, 2020

Copy link
Copy Markdown
Member Author

I still need to add tests for this (I'll do that once #8309 is merged). Detecting and writing delta dictionaries is not implemented, but that's a feature, not a bug, so it may also wait for another PR.
(also, it will have to be disabled by default, for compatibility with other IPC readers)

@pitrou pitrou force-pushed the ARROW-10121-ipc-emit-dicts branch from cd759d0 to 3e5e6df Compare October 1, 2020 14:51
@pitrou pitrou marked this pull request as ready for review October 1, 2020 14:51
@pitrou

pitrou commented Oct 5, 2020

Copy link
Copy Markdown
Member Author

@wesm Do you want to review this again? Otherwise perhaps @bkietz ?

pitrou added 3 commits October 5, 2020 11:29
When a dictionary changes from the previous batch, we should re-emit it.
Move tensor tests in separate file
@pitrou pitrou force-pushed the ARROW-10121-ipc-emit-dicts branch from 3e5e6df to 4e5dac1 Compare October 5, 2020 09:32
@wesm

wesm commented Oct 5, 2020

Copy link
Copy Markdown
Member

I can take a final look this morning.

Comment thread cpp/src/arrow/ipc/writer.cc Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if skipped dictionaries will work, but they are permitted by the format, so it should probably be investigated in a follow up issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants