Skip to content

fix: ExtDType equality#1961

Merged
danking merged 4 commits into
developfrom
adamg/fix-extdtype-eq
Jan 15, 2025
Merged

fix: ExtDType equality#1961
danking merged 4 commits into
developfrom
adamg/fix-extdtype-eq

Conversation

@AdamGS

@AdamGS AdamGS commented Jan 15, 2025

Copy link
Copy Markdown
Contributor

Trying to better "adhere" to the Arrow behavior.

@AdamGS AdamGS requested a review from danking January 15, 2025 16:07
Comment thread vortex-dtype/src/extension.rs Outdated
impl PartialEq for ExtDType {
fn eq(&self, other: &Self) -> bool {
self.id == other.id
self.id == other.id && self.storage_dtype == other.storage_dtype

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the very least, shouldn't we factor in the metadata? E.g. vortex.timestamp always stores i64, but the timezone is stored in the extmetadata. Should two timestamp types be equal if they have different timezones?

@AdamGS AdamGS requested a review from a10y January 15, 2025 16:24
@a10y

a10y commented Jan 15, 2025

Copy link
Copy Markdown
Contributor

Capturing some thoughts from Slack:

a good thing to think about is the existing temporal types, which are extension types.
currently we have 3 different extension types
vortex.time
vortex.date
vortex.timestamp
all 3 have metadata that makes the interpretation of the values logically different. For time/date, it's the time unit, and for timestamp, it's the timezone information.
what should a try_cast() operation do on a timestamp? should it shift the timezone? should it just assign new timezones to all of the scalars?
11:09
we differ from arrow's extension types b/c our extension types wrap over logical types not encodings. so you can still have multiple encodings of a vortex.timestamp . IMO the extension type ID + metadata should be the relevant things to check to determine if two ext-types are equal."
unfortunately, ExtMetadata's builtin Eq is not quite right, since it just checks byte equality and not semantic equality of two pieces of metadata (edited) 

There is a lot of logic that the extension type author needs to have control over, namely the semantics of how some compute operations (like CastFn) affect the type, and how to interpret the metadata of the type to ensure semantic equality.

We should find a way to make it possible for external type authors to plug-in these decisions. For now, it's probably sufficient to just compare ExtMetadatas byte-wise

@danking

danking commented Jan 15, 2025

Copy link
Copy Markdown
Contributor

For opaque extension types, Arrow-cpp defines equality in terms of name and storage type. It seems to ignore metadata.

Arrow-rs appears to take a strong stance against treating extension types any differently from their storage type: apache/arrow-rs#4472. I think Arrow-rs would put this information into the Schema metadata; however, I am not entirely certain of that.

@danking

danking commented Jan 15, 2025

Copy link
Copy Markdown
Contributor

PyArrow includes the storage dtype in its definition of equality. The docs actually highlight using different storage types with the same extension type as a feature, not a bug:

Further, note that while we registered the concrete type RationalType(pa.int32()), the same extension name ("my_package.rational") is used by RationalType(integer_type) for all Arrow integer types. As such, the above code also allows users to (de)serialize these data types:
[an example which creates a "big" rational type using the same extension name]

@danking danking merged commit 31d2bef into develop Jan 15, 2025
@danking danking deleted the adamg/fix-extdtype-eq branch January 15, 2025 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants