Skip to content

fix: Optimize some functions to rewrite dictionary-encoded strings#627

Merged
andygrove merged 16 commits into
apache:mainfrom
vaibhawvipul:issue-504
Jul 15, 2024
Merged

fix: Optimize some functions to rewrite dictionary-encoded strings#627
andygrove merged 16 commits into
apache:mainfrom
vaibhawvipul:issue-504

Conversation

@vaibhawvipul

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #504 .

Rationale for this change

Improves performance

What changes are included in this PR?

rewrite dictionary encoded strings

How are these changes tested?

All test cases pass.

@vaibhawvipul vaibhawvipul marked this pull request as ready for review July 4, 2024 13:43
@andygrove

Copy link
Copy Markdown
Member

Thanks @vaibhawvipul. This is looking good. There is also this section in cast.rs:

    fn cast_array(&self, array: ArrayRef) -> DataFusionResult<ArrayRef> {
        let to_type = &self.data_type;
        let array = array_with_timezone(array, self.timezone.clone(), Some(to_type))?;
        let from_type = array.data_type().clone();

        // unpack dictionary string arrays first
        // TODO: we are unpacking a dictionary-encoded array and then performing
        // the cast. We could potentially improve performance here by casting the
        // dictionary values directly without unpacking the array first, although this
        // would add more complexity to the code

@vaibhawvipul

Copy link
Copy Markdown
Contributor Author

Thanks @vaibhawvipul. This is looking good. There is also this section in cast.rs:

    fn cast_array(&self, array: ArrayRef) -> DataFusionResult<ArrayRef> {
        let to_type = &self.data_type;
        let array = array_with_timezone(array, self.timezone.clone(), Some(to_type))?;
        let from_type = array.data_type().clone();

        // unpack dictionary string arrays first
        // TODO: we are unpacking a dictionary-encoded array and then performing
        // the cast. We could potentially improve performance here by casting the
        // dictionary values directly without unpacking the array first, although this
        // would add more complexity to the code

This is done @andygrove . Thank you for pointing this out.

@andygrove

Copy link
Copy Markdown
Member

There is a test failure:

Cause: org.apache.comet.CometNativeException: Unsupported data type Dictionary(Int32, Binary) for function md5

@vaibhawvipul

Copy link
Copy Markdown
Contributor Author

There is a test failure:

Cause: org.apache.comet.CometNativeException: Unsupported data type Dictionary(Int32, Binary) for function md5

fixed

Comment thread native/core/src/execution/datafusion/expressions/cast.rs Outdated
Comment thread native/core/src/execution/datafusion/expressions/cast.rs Outdated
Comment thread native/core/src/execution/datafusion/expressions/cast.rs
@vaibhawvipul vaibhawvipul requested a review from andygrove July 12, 2024 05:42
);

let casted_result = match to_type {
DataType::Dictionary(_, _) => Arc::new(casted_dictionary.clone()),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we would currently hit the use case of casting to a dictionary type, or whether this is something that could help us in the future (it seems like we could take advantage of this when running natively even though Spark doesn't support keeping data dictionary-encoded after a cast, as far as I know, but maybe @viirya could confirm).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awaiting @viirya 's comment before making any change.

Comment thread native/core/src/execution/datafusion/expressions/cast.rs Outdated

@andygrove andygrove left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for refactoring the code @vaibhawvipul, it is looking much cleaner now. I would like to see the additional test that I mentioned before we merge this.

@andygrove

Copy link
Copy Markdown
Member

I think this is ready to merge, but I'd like to get a another review. Perhaps @parthchandra @huaxingao @kazuyukitanimura or @viirya could take a look

} else {
cast_array.append_null()
}
if to_type != &DataType::Date32 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a logic change here. Is this just refactoring?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, no logic change here.

@parthchandra parthchandra left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andygrove

Copy link
Copy Markdown
Member

Thanks @vaibhawvipul and thanks for the review @parthchandra

@andygrove andygrove merged commit 086513c into apache:main Jul 15, 2024
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…pache#627)

* dedup code

* transforming the dict directly

* code optimization for cast string to timestamp

* minor optimizations

* fmt fixes and casting to dict array without unpacking to array first

* bug fixes

* revert unrelated change

* Added test case and code refactor

* minor optimization

* minor optimization again

* convert the cast to array

* Revert "convert the cast to array"

This reverts commit 9270aed.

* bug fixes

* rename the test to cast_dict_to_timestamp arr
@vaibhawvipul vaibhawvipul deleted the issue-504 branch March 20, 2026 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: Optimize some functions to rewrite dictionary-encoded strings

3 participants