Skip to content

Fix Redundant ScalarValue Boxed Collection#2523

Merged
tustvold merged 9 commits into
apache:masterfrom
comphead:unbox_scalar_merged
May 17, 2022
Merged

Fix Redundant ScalarValue Boxed Collection#2523
tustvold merged 9 commits into
apache:masterfrom
comphead:unbox_scalar_merged

Conversation

@comphead

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #2449.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@comphead comphead closed this May 13, 2022
@comphead comphead reopened this May 13, 2022
@comphead comphead changed the title unbox scalars Fix Redundant ScalarValue Boxed Collection May 13, 2022
@comphead comphead marked this pull request as ready for review May 13, 2022 15:08
@alamb alamb mentioned this pull request May 13, 2022

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @comphead

They key test is that size_of_scalar is not changed

https://github.com/apache/arrow-datafusion/blob/41b4e491663029f653e491b110d0b5e74d08a0b6/datafusion/core/src/scalar.rs#L373-L383

However, when I investigated more it turns out that this test was broken (see #2531)

When I reenabled the test it shows that this PR makes the size of ScalarValue increase from 48 bytes to 64 bytes

cc @tustvold

Comment thread datafusion/common/src/scalar.rs Outdated
Comment thread datafusion/common/src/scalar.rs Outdated

@tustvold tustvold left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR removes a non-redundant Box<DataType>. Is it possible this is the cause of the size change?

Comment thread datafusion/common/src/scalar.rs Outdated
/// list of nested ScalarValue (boxed to reduce size_of(ScalarValue))
#[allow(clippy::box_collection)]
List(Option<Box<Vec<ScalarValue>>>, Box<DataType>),
List(Option<Vec<ScalarValue>>, DataType),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about unboxing DataType here

@tustvold

Copy link
Copy Markdown
Contributor

So I got this to not change the size by making

List(Option<Vec<ScalarValue>>, Box<DataType>),
Struct(Option<Vec<ScalarValue>>, Box<[Field]>),

@alamb What do you think about Box<[Field]>?

comphead and others added 2 commits May 15, 2022 01:21
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@comphead

comphead commented May 15, 2022

Copy link
Copy Markdown
Contributor Author

Thanks @alamb @tustvold for the feedback, please let me know the final decision on what exactly has to be boxed.

@alamb

alamb commented May 16, 2022

Copy link
Copy Markdown
Contributor

@alamb What do you think about Box<[Field]>?

It seems fine to me

@comphead

Copy link
Copy Markdown
Contributor Author

so we going with changes below?

List(Option<Vec<ScalarValue>>, Box<DataType>),
Struct(Option<Vec<ScalarValue>>, Box<Vec<Field>>),

@tustvold

Copy link
Copy Markdown
Contributor

If you prefer Box<Vec<Field>> over Box<[Field]>, I don't feel strongly. The latter avoids a pointer indirection, and associated memory allocation, but the former is perhaps more consistent with rest of the enumeration (which uses Vec despite not actually needing to).

@comphead

Copy link
Copy Markdown
Contributor Author

@tustvold I added boxing back.
For now its

Struct(Option<Vec<ScalarValue>>, Box<Vec<Field>>),

as its more consistent, you are right. If you feel we should go to Box<[Field]> let me know

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @comphead -- thank you very much for sticking with it.

I also merged this PR locally with apache/master to get #2531 and verified that the tests still pass 🎉

Comment thread datafusion/common/src/scalar.rs Outdated
for c in 0..columns.len() {
let column = columns.get_mut(c).unwrap();
column.push(values[c].clone());
for (i, v) in

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is a nicer implementation

@comphead comphead May 16, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clippy complained onfor c in 0..columns.len() :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the better that removing a layer of Boxing lets clippy fix things more easily 👍

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A perhaps more idiomatic way to write this would be something like

for (column, value) in columns.iter_mut().zip(values)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats more concise. Fixed in 2 places.

@alamb

alamb commented May 16, 2022

Copy link
Copy Markdown
Contributor

I'll wait until tomorrow to see if @tustvold has any more comments, and if not I'll merge it in

@tustvold tustvold left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than a minor nit and a questionable test change, this looks good to me 👍

Thank you, and apologies for the back and forth

Comment thread datafusion/common/src/scalar.rs Outdated
for c in 0..columns.len() {
let column = columns.get_mut(c).unwrap();
column.push(values[c].clone());
for (i, v) in

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A perhaps more idiomatic way to write this would be something like

for (column, value) in columns.iter_mut().zip(values)

Comment thread datafusion/core/src/scalar.rs Outdated
Some(Box::new(vec![Int64(Some(1)), Int64(Some(5))])),
Box::new(DataType::Int64),
Some(vec![Int64(Some(1)), Int64(Some(5))]),
Box::new(DataType::Int32),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change in type?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Thanks for finding that

@tustvold tustvold merged commit c924536 into apache:master May 17, 2022
@tustvold tustvold added the api change Changes the API exposed to users of the crate label May 17, 2022
@alamb

alamb commented May 17, 2022

Copy link
Copy Markdown
Contributor

image
👍

@comphead comphead deleted the unbox_scalar_merged branch September 8, 2022 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Redundant ScalarValue Boxed Collection

3 participants