Split out arrow-array crate (#2594) #2769

Merged
tustvold merged 6 commits into apache:master from tustvold:split-out-arrow-array
Sep 26, 2022

@tustvold tustvold commented Sep 22, 2022

Contributor

Draft as I wish to perform another pass, and double-check the benchmarks

Which issue does this PR close?

Part of #2594

Rationale for this change

Continues the process of splitting apart the crate, so that components can depend on just what they need, compilation parallelizes better, etc...

What changes are included in this PR?

Moves the array, array builders, and record batch definitions into a new arrow-array crate

Are there any user-facing changes?

The deprecated RecordBatch::concat is removed, otherwise there are no breaking changes 🎉

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Sep 22, 2022
Comment thread arrow/src/pyarrow.rs
}
}

impl<T> PyArrowConvert for T

Contributor Author

This blanket impl can no longer be implemented: the compiler errors, complaining that arrow_schema::DataType could be updated to implement Array + From<ArrayData>, which would then conflict with the impl PyArrowConvert for DataType.

Ultimately this impl is not hugely important, as it is just a case of using make_array and Array::data
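
The conflict described above can be sketched without arrow or pyo3. The following is a minimal illustration only: `Data`, `Kind`, `ArrayLike`, `Int64Like`, `Export` and `export_array` are hypothetical stand-ins for `ArrayData`, `DataType`, `Array`, a concrete array type, `PyArrowConvert` and the caller-side workaround, and are not the crate's actual definitions. The blanket impl is left commented out, and the sketch shows callers exporting an array through its data instead.

```rust
/// Hypothetical stand-in for `ArrayData`, the untyped representation shared by all arrays.
#[derive(Debug, Clone, PartialEq)]
pub struct Data(pub Vec<i64>);

/// Hypothetical stand-in for `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum Kind {
    Int64,
}

/// Hypothetical stand-in for the `Array` trait.
pub trait ArrayLike {
    fn data(&self) -> &Data;
}

/// Hypothetical stand-in for a concrete array type.
pub struct Int64Like {
    data: Data,
}

impl From<Data> for Int64Like {
    fn from(data: Data) -> Self {
        Self { data }
    }
}

impl ArrayLike for Int64Like {
    fn data(&self) -> &Data {
        &self.data
    }
}

/// Hypothetical stand-in for `PyArrowConvert` (export direction only).
pub trait Export {
    fn export(&self) -> String;
}

// Concrete impls for the shared representation and for the type enum.
impl Export for Data {
    fn export(&self) -> String {
        format!("data{:?}", self.0)
    }
}

impl Export for Kind {
    fn export(&self) -> String {
        format!("kind:{:?}", self)
    }
}

// The blanket impl below is the shape that stops working once `ArrayLike`,
// `Data` and `Kind` live in other crates: the compiler must assume an
// upstream crate could later implement `ArrayLike + From<Data>` for `Kind`,
// which would overlap with `impl Export for Kind`. Inside a single crate,
// as in this file, the compiler can see that no such impl exists.
//
// impl<T: ArrayLike + From<Data>> Export for T {
//     fn export(&self) -> String {
//         self.data().export()
//     }
// }

/// Without the blanket impl, callers export an array through its data.
pub fn export_array(array: &impl ArrayLike) -> String {
    array.data().export()
}
```

In this sketch `export_array(&array)` plays the role of writing `array.data().to_pyarrow(py)` at the call site rather than `array.to_pyarrow(py)`.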

@tustvold tustvold marked this pull request as ready for review September 23, 2022 17:01
@alamb alamb added the api-change Changes to the arrow API label Sep 24, 2022
@alamb

alamb commented Sep 24, 2022

Contributor

marking as api-change due to removal of RecordBatch::concat

@alamb alamb left a comment

Contributor

This is a pretty epic PR -- I went through it fairly carefully and it looks great to me

/// assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
/// assert_eq!(array.values(), &values);
/// ```
pub type Int8DictionaryArray = DictionaryArray<Int8Type>;

Contributor

these pub types are new, right?

Contributor Author

use std::any::Any;

///
/// # Example: Using `collect`

Contributor

💯 for adding basic doc examples to these typedefs

Comment thread arrow-array/src/cast.rs
assert!(!as_decimal_array(&array).is_empty());
let result_decimal = as_decimal_array(&array);
assert_eq!(result_decimal, &array);
}

Contributor

Nice

Comment thread arrow-array/src/lib.rs
use crate::builder::*;

#[test]
fn test_buffer_builder_availability() {

Contributor

I wonder if this is the kind of thing that should be in a `tests`-type integration test, to ensure that the types are pub and not pub(crate), for example
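
A rough, single-file sketch of the idea behind that suggestion follows. A real integration test would live under `tests/` and compile as a separate crate, so it could only name `pub` items. Here a nested module emulates the crate boundary. `fake_arrow_array`, `Int32BufferBuilder`, `InternalScratch` and `buffer_builder_is_reachable` are hypothetical names chosen for illustration and are not the crate's API.

```rust
// Hypothetical stand-in for a library crate such as arrow-array.
mod fake_arrow_array {
    pub mod builder {
        /// Publicly exported builder: visible to code outside the module.
        #[derive(Default)]
        pub struct Int32BufferBuilder {
            values: Vec<i32>,
        }

        impl Int32BufferBuilder {
            /// Appends one value to the builder.
            pub fn append(&mut self, v: i32) {
                self.values.push(v);
            }

            /// Number of values appended so far.
            pub fn len(&self) -> usize {
                self.values.len()
            }
        }

        // Private helper: naming it from outside `builder` is a compile
        // error, much as a `pub(crate)` item would be unreachable from an
        // integration test compiled as its own crate.
        #[allow(dead_code)]
        struct InternalScratch;
    }
}

/// Plays the role of the integration test: it compiles only if the builder
/// type is reachable through the public path.
pub fn buffer_builder_is_reachable() -> usize {
    use fake_arrow_array::builder::*;

    let mut builder = Int32BufferBuilder::default();
    builder.append(1);
    builder.append(2);
    // let _ = InternalScratch; // would not compile: the type is private
    builder.len()
}
```

The check is the compilation itself: if a type were accidentally narrowed from `pub`, the test file would stop building.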

fn schema(&self) -> SchemaRef;

/// Reads the next `RecordBatch`.
#[deprecated(

Contributor

🎉 This is another breaking API change (nice cleanup)


// export
array.to_pyarrow(py)
array.data().to_pyarrow(py)

Contributor

this seems a very reasonable change

(but is it also an API change?)

Contributor Author

Yes, sorry I made this change after I wrote the PR description

Comment thread arrow/src/lib.rs
fn schema(&self) -> SchemaRef;

/// Reads the next `RecordBatch`.
#[deprecated(

Contributor

Oh, whoops -- maybe we should remove this deprecated API (perhaps as a follow on PR)

@alamb alamb changed the title from "Split out arrow-array" to "Split out arrow-array crate" Sep 24, 2022
@alamb

alamb commented Sep 25, 2022

Contributor

FWIW it occurs to me we probably need to update the github workflow triggers to reflect this new code organization:

For example:
https://github.com/apache/arrow-rs/blob/master/.github/workflows/arrow.yml#L21-L29

Should probably include arrow-array and arrow-buffer

@tustvold

Contributor Author

Running the benchmarks, some of the faster benchmarks do show the odd ~10% regression, but we're talking 10s of microseconds here. I'm inclined to think this is not an issue, and if it transpires to be so, we can revisit those kernels.

@tustvold tustvold merged commit 06c204c into apache:master Sep 26, 2022
@ursabot

ursabot commented Sep 26, 2022

Benchmark runs are scheduled for baseline = 6bee576 and contender = 06c204c. 06c204c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@tustvold tustvold changed the title from "Split out arrow-array crate" to "Split out arrow-array crate (#2594)" Sep 26, 2022
Labels

api-change: Changes to the arrow API
arrow: Changes to the arrow crate

3 participants