chore: Prefer append_value_n over append_value#1868
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1868 +/- ##
=======================================
Coverage 85.89% 85.89%
=======================================
Files 137 137
Lines 40012 40012
Branches 40012 40012
=======================================
Hits 34368 34368
Misses 4155 4155
Partials 1489 1489 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| for (builder, value) in field_builders.zip(data.values()) { | ||
| value.append_to(builder, 1)?; | ||
| } | ||
| builder.append(true); |
There was a problem hiding this comment.
There isn't a corresponding bulk operation for this?
There was a problem hiding this comment.
Sadly no... I only see one for bulk-appending nulls:
https://docs.rs/arrow/latest/arrow/array/struct.StructBuilder.html#method.append_nulls
Under the hood, both append and append_null are delegating to the underlying NullBufferBuilder, and even the bulk-null is extending from an internally created vec![false; n], which doesn't seem especially fantastic either -- NullBufferBuilder::append_n_nulls would almost certainly be more efficient, and there's a matching NullBufferBuilder::append_n_non_nulls?
There was a problem hiding this comment.
Looks like it is missing from the struct_builder.rs: https://github.com/apache/arrow-rs/blob/2f40f78e4feae3aee261d9608cede9535e1429e0/arrow-array/src/builder/struct_builder.rs#L209-L214
There was a problem hiding this comment.
OK, maybe lets add a small TODO and consider a contribution upstream?
There was a problem hiding this comment.
| // StructBuilder and ListBuilder both provide it. | ||
| let builder = | ||
| builder_as!(array::MapBuilder<Box<dyn ArrayBuilder>, Box<dyn ArrayBuilder>>); | ||
| for _ in 0..num_rows { |
There was a problem hiding this comment.
I think for the null case we should be able to optimize this as well?
There was a problem hiding this comment.
I don't see any bulk-null appender nor any way to access the underlying null buffer builder?
There was a problem hiding this comment.
There was a problem hiding this comment.
Same comment on TODO.
There was a problem hiding this comment.
Or we can also try this with REE arrays, if it still isn't fast enough
There was a problem hiding this comment.
Raise a PR here:
emkornfield
left a comment
There was a problem hiding this comment.
A few small questions, otherwise generally seems reasonable. It might be good to have a small microbenchmark to demonstrate the speed difference.
scovich
left a comment
There was a problem hiding this comment.
LGTM, given the limitations of arrow-rs API.
| for (builder, value) in field_builders.zip(data.values()) { | ||
| value.append_to(builder, 1)?; | ||
| } | ||
| builder.append(true); |
There was a problem hiding this comment.
Sadly no... I only see one for bulk-appending nulls:
https://docs.rs/arrow/latest/arrow/array/struct.StructBuilder.html#method.append_nulls
Under the hood, both append and append_null are delegating to the underlying NullBufferBuilder, and even the bulk-null is extending from an internally created vec![false; n], which doesn't seem especially fantastic either -- NullBufferBuilder::append_n_nulls would almost certainly be more efficient, and there's a matching NullBufferBuilder::append_n_non_nulls?
| // StructBuilder and ListBuilder both provide it. | ||
| let builder = | ||
| builder_as!(array::MapBuilder<Box<dyn ArrayBuilder>, Box<dyn ArrayBuilder>>); | ||
| for _ in 0..num_rows { |
There was a problem hiding this comment.
I don't see any bulk-null appender nor any way to access the underlying null buffer builder?
| }}; | ||
| } | ||
|
|
||
| // Use append_value in a loop for builders without batch append (String, Binary) |
There was a problem hiding this comment.
I've created an upstream PR to add this to the GenericByteArray: apache/arrow-rs#9426
It will both harmonize the APIs between the builders and throw in a 10-20% speed improvement 🥳
8f0c9f2 to
c7d6416
Compare
|
Added TODOs and moving this forward since there is concensus. Let's pick up the remaining things in separate PRs. Thanks @scovich and @emkornfield for the prompt review 🚀 |
What changes are proposed in this pull request?
Instead of adding the value in the loop, we push the operation down to Arrow which is much faster.
How was this change tested?