
Add Schema::project and RecordBatch::project functions #1033

Merged
alamb merged 4 commits into apache:master from hntd187:schema_project on Dec 20, 2021

@hntd187

@hntd187 hntd187 commented Dec 12, 2021

Contributor

…eturning a new schema with those columns only

Which issue does this PR close?

Closes #1014.

Rationale for this change

See #1014: a lot of code can be simplified, and this also fixes silent bugs in metadata handling.

What changes are included in this PR?

Two methods, one on Schema and one on RecordBatch, that allow them to project onto a subset of their columns.

Are there any user-facing changes?

…eturning a new schema with those columns only
@github-actions github-actions Bot added the arrow Changes to the arrow crate label Dec 12, 2021

@alamb alamb left a comment

Contributor

Thank you @hntd187 ❤️

This is a great start

Comment thread arrow/src/datatypes/schema.rs Outdated
Comment on lines +94 to +98
    let mut new_fields = vec![];
    for i in indices {
        let f = self.fields[i].clone();
        new_fields.push(f);
    }

Contributor

I think as written:

  1. This will panic! if the index is not in bounds.
  2. It is not "idiomatic Rust style" (which to me means avoiding mut). This is far less important.

How about something such as (untested):

Suggested change
-    let mut new_fields = vec![];
-    for i in indices {
-        let f = self.fields[i].clone();
-        new_fields.push(f);
-    }
+    let new_fields = indices
+        .into_iter()
+        .map(|i| {
+            self.fields.get(i).cloned().ok_or_else(|| {
+                ArrowError::SchemaError(format!(
+                    "project index {} out of bounds, max field {}",
+                    i,
+                    self.fields().len()
+                ))
+            })
+        })
+        .collect::<Result<Vec<_>>>()?;

Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get to avoid fields[i], and then the somewhat confusing use of turbofish .collect::<Result<Vec<_>>>() -- it took me quite a while to get used to that pattern.
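The pattern described above (checked lookup with `get`, then collecting into a `Result`) can be sketched with the standard library alone. This is a minimal illustration, not the arrow code: `project_names` is a hypothetical helper, plain `String`s stand in for arrow's `Field`, and a `String` stands in for `ArrowError`.

```rust
/// Hypothetical helper: selects the entries of `fields` at `indices`.
/// `String` stands in for arrow's `Field`, and the error is a plain `String`.
fn project_names(fields: &[String], indices: &[usize]) -> Result<Vec<String>, String> {
    indices
        .iter()
        .map(|&i| {
            // `get` returns `None` for a bad index instead of panicking.
            fields.get(i).cloned().ok_or_else(|| {
                format!(
                    "project index {} out of bounds, max field {}",
                    i,
                    fields.len()
                )
            })
        })
        // Collecting into `Result<Vec<_>, _>` stops at the first `Err`
        // and returns it; otherwise it yields `Ok` with all the items.
        .collect::<Result<Vec<_>, _>>()
}
```

The turbofish only tells `collect` which container to build; when the return type already pins it down, as it does here, a plain `.collect()` would compile as well.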

Contributor Author

Yea, that seems good to me, the for loop was the first thing that popped into my head, but I can't think of any reason it's better than yours.

@alamb alamb Dec 14, 2021

Contributor

I think the for loop is what one would write in other languages like C/C++, Java, Go, etc. :) It is certainly what I was writing when I started learning Rust.

Then I realized that a big part of how Rust avoids bounds checks while still being safe is the use of the functional style.
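As a rough illustration of that point, here are two ways to sum a slice. Both functions are hypothetical examples and not part of this PR. The indexed loop performs a checked access on every `values[i]` (the optimizer can often remove the check, but that is not guaranteed), while the iterator version has no index that could be out of range.

```rust
/// Index-based loop: each `values[i]` is a bounds-checked access.
fn sum_indexed(values: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..values.len() {
        total += values[i];
    }
    total
}

/// Iterator version: the iterator only yields elements that exist,
/// so there is no index to get wrong and no `mut` accumulator.
fn sum_iter(values: &[i64]) -> i64 {
    values.iter().sum()
}
```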

        assert_eq!(projected.fields()[0].name(), "name");
        assert_eq!(projected.fields()[1].name(), "priority");
        assert_eq!(projected.metadata.get("meta").unwrap(), "data")
    }

Contributor

Related to the above -- I recommend a test for the case where an index is out of bounds -- like schema.project([2, 3])
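A sketch of what such a test could exercise, using a simplified stand-in rather than arrow's types. `MiniSchema` and `sample_schema` are hypothetical names; the sample has three fields, so index 3 is the out-of-bounds case.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for arrow's `Schema`.
#[derive(Debug, Clone, PartialEq)]
struct MiniSchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

impl MiniSchema {
    /// Keeps only the fields at `indices` and carries the metadata over.
    /// Returns `Err` instead of panicking when an index is out of bounds.
    fn project(&self, indices: &[usize]) -> Result<MiniSchema, String> {
        let fields = indices
            .iter()
            .map(|&i| {
                self.fields
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("project index {} out of bounds", i))
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(MiniSchema {
            fields,
            metadata: self.metadata.clone(),
        })
    }
}

/// Builds a three-field schema with one metadata entry.
fn sample_schema() -> MiniSchema {
    let mut metadata = HashMap::new();
    metadata.insert("meta".to_string(), "data".to_string());
    MiniSchema {
        fields: vec!["id".into(), "name".into(), "priority".into()],
        metadata,
    }
}
```

With this stand-in, `sample_schema().project(&[2, 3])` returns an `Err`, because index 2 is valid and index 3 is not, while `project(&[1, 2])` succeeds and keeps the metadata.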

Contributor Author

Sure, will do

Comment thread arrow/src/record_batch.rs Outdated


/// Projects the schema onto the specified columns
pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {

Contributor

The intent of this function was to project the RecordBatch rather than just the schema.

A signature like this:

Suggested change
-    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
+    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {

(so we would also have to project the columns as well as the schema)
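To make "project the columns as well as the schema" concrete, here is a minimal sketch on a stand-in type. `MiniBatch` and `out_of_bounds` are hypothetical, the column names play the role of the schema, and the indices are taken as a slice purely to keep the example short.

```rust
/// Hypothetical, simplified stand-in for arrow's `RecordBatch`:
/// `names` plays the role of the schema, `columns` holds the data.
#[derive(Debug, Clone, PartialEq)]
struct MiniBatch {
    names: Vec<String>,
    columns: Vec<Vec<i32>>,
}

/// Builds the error message for a bad projection index.
fn out_of_bounds(i: usize) -> String {
    format!("project index {} out of bounds", i)
}

impl MiniBatch {
    /// Applies the same indices to the names and to the columns so that
    /// the projected schema and the projected data stay aligned.
    fn project(&self, indices: &[usize]) -> Result<MiniBatch, String> {
        let names = indices
            .iter()
            .map(|&i| self.names.get(i).cloned().ok_or_else(|| out_of_bounds(i)))
            .collect::<Result<Vec<_>, _>>()?;
        let columns = indices
            .iter()
            .map(|&i| self.columns.get(i).cloned().ok_or_else(|| out_of_bounds(i)))
            .collect::<Result<Vec<_>, _>>()?;
        Ok(MiniBatch { names, columns })
    }
}
```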

Contributor Author

Ahh, I thought this part was a bit too easy, okay I'll update to reflect that.

@hntd187

hntd187 commented Dec 14, 2021

Contributor Author

@alamb I wanted to reuse the impl IntoIterator<Item=usize> argument for the schema projection as well, so I had to change it to impl IntoIterator<Item=usize> + Clone for RecordBatch. This doesn't seem immediately correct to me, since the two methods now take different arguments, but it works.

@alamb

alamb commented Dec 14, 2021

Contributor

@alamb I wanted to reuse the impl IntoIterator<Item=usize> argument for the schema projection as well, so I had to change it to impl IntoIterator<Item=usize> + Clone for RecordBatch. This doesn't seem immediately correct to me, since the two methods now take different arguments, but it works.

It looks like the new code may not yet have been pushed to GitHub.

@codecov-commenter

codecov-commenter commented Dec 14, 2021

Codecov Report

❌ Patch coverage is 89.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.25%. Comparing base (239cba1) to head (8ade651).

Files with missing lines        Patch %   Lines
arrow/src/record_batch.rs       80.00%    4 Missing ⚠️
arrow/src/datatypes/schema.rs   94.44%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1033      +/-   ##
==========================================
- Coverage   82.31%   82.25%   -0.07%     
==========================================
  Files         168      168              
  Lines       49031    49197     +166     
==========================================
+ Hits        40360    40465     +105     
- Misses       8671     8732      +61     


@alamb alamb left a comment

Contributor

Thanks for sticking with this @hntd187

Comment thread arrow/src/record_batch.rs

RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields)
}

Contributor

How about some tests?

Perhaps something like

    #[test]
    fn project() {
        let a: ArrayRef = Arc::new(Int32Array::from(vec![
            Some(1),
            None,
            Some(3),
        ]));
        let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
        let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));

        let record_batch = RecordBatch::try_from_iter(vec![("a", a.clone()), ("b", b.clone()), ("c", c.clone())])
            .expect("valid conversion");

        let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
            .expect("valid conversion");

        assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
    }

Comment thread arrow/src/record_batch.rs Outdated
        &self,
        indices: impl IntoIterator<Item = usize> + Clone,
    ) -> Result<RecordBatch> {
        let projected_schema = self.schema.project(indices.clone())?;

Contributor

I see now why you needed to make the iter Clone which is kind of annoying 🤔
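The reason for the extra bound can be shown in isolation: `into_iter` takes its receiver by value, so an `impl IntoIterator` argument can be walked only once unless it is cloned first. The function below is a hypothetical sketch, not code from this PR, and it silently skips out-of-range indices to stay short.

```rust
/// Hypothetical example: walks `indices` twice, which forces the
/// `Clone` bound on the generic parameter.
fn select_twice<I>(names: &[&str], values: &[i32], indices: I) -> (Vec<String>, Vec<i32>)
where
    I: IntoIterator<Item = usize> + Clone,
{
    // `into_iter` consumes its receiver, so the first pass runs on a clone.
    let picked_names: Vec<String> = indices
        .clone()
        .into_iter()
        .filter_map(|i| names.get(i).map(|n| n.to_string()))
        .collect();
    // The second pass consumes the original value.
    let picked_values: Vec<i32> = indices
        .into_iter()
        .filter_map(|i| values.get(i).copied())
        .collect();
    (picked_names, picked_values)
}
```

Removing the `+ Clone` bound and the `.clone()` call makes the second use of `indices` a use-after-move compile error.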

Comment thread arrow/src/datatypes/schema.rs Outdated

/// Returns a new schema with only the specified columns in the new schema
/// This carries metadata from the parent schema over as well
pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> {

Contributor

I know I did something different in the ticket, but I think this interface is kind of annoying.

Namely, I couldn't pass in &vec![1, 2]

   --> arrow/src/datatypes/schema.rs:405:40
    |
405 |         let projected: Schema = schema.project(&vec![0, 2]).unwrap();
    |                                        ^^^^^^^ expected `&{integer}`, found `usize`

What would you think about being less fancy and changing this (and RecordBatch) to something like:

    pub fn project(&self, indices: &[usize]) -> Result<Schema> {

Which would then avoid the need for the clone on RecordBatch::project as well
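A small sketch of why a borrowed slice is more convenient here. `pick` and `project_both` are hypothetical helpers over plain slices, not arrow APIs. A `&[usize]` parameter accepts `&vec![..]`, array references, and slices, and because it is only borrowed it can be reused for a second pass without cloning.

```rust
/// Hypothetical helper: selects the entries of `items` at `indices`.
fn pick<T: Clone>(items: &[T], indices: &[usize]) -> Result<Vec<T>, String> {
    indices
        .iter()
        .map(|&i| {
            items.get(i).cloned().ok_or_else(|| {
                format!("project index {} out of bounds, max field {}", i, items.len())
            })
        })
        .collect()
}

/// Uses the same borrowed `indices` twice; no `Clone` bound is needed.
fn project_both(
    names: &[&str],
    values: &[i32],
    indices: &[usize],
) -> Result<(Vec<String>, Vec<i32>), String> {
    let names: Vec<String> = pick(names, indices)?
        .into_iter()
        .map(String::from)
        .collect();
    let values = pick(values, indices)?;
    Ok((names, values))
}
```

A call such as `project_both(&names, &values, &vec![0, 2])` compiles because `&Vec<usize>` coerces to `&[usize]`, which is the call shape that the `impl IntoIterator<Item = usize>` signature rejected.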

@alamb alamb left a comment

Contributor

Looks good -- thank you @hntd187

@alamb

alamb commented Dec 20, 2021

Contributor

@hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

@alamb alamb changed the title Projection on Schema and RecordBatch Add Schema::project and RecordBatch::project functions Dec 20, 2021
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Dec 20, 2021
@alamb alamb merged commit f3e452c into apache:master Dec 20, 2021
@hntd187

hntd187 commented Dec 20, 2021

Contributor Author

@hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

oh thank you very much, I appreciate that !

alamb added a commit that referenced this pull request Dec 21, 2021
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

* Addressing PR updates and adding a test for out of range projection

* switch to &[usize]

* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit that referenced this pull request Dec 22, 2021
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

* Addressing PR updates and adding a test for out of range projection

* switch to &[usize]

* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Stephen Carman <hntd187@users.noreply.github.com>

Labels

arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Development

Successfully merging this pull request may close these issues.

Add Schema::project and RecordBatch project function to project / select a subset of columns

3 participants