Add Schema::project and RecordBatch::project functions by hntd187 · Pull Request #1033 · apache/arrow-rs

hntd187 · 2021-12-12T00:10:27Z

…eturning a new schema with those columns only

Which issue does this PR close?

Closes #1014.

Rationale for this change

See #1014. A lot of code can be simplified, and this change also fixes silent bugs in metadata handling.

What changes are included in this PR?

Two methods, one on Schema and one on RecordBatch, that project them onto a chosen subset of columns.

Are there any user-facing changes?

alamb

Thank you @hntd187 ❤️

This is a great start

alamb · 2021-12-13T21:22:58Z

+        let mut new_fields = vec![];
+        for i in indices {
+            let f = self.fields[i].clone();
+            new_fields.push(f);
+        }


I think as written, this code:

1. Will panic! if the index is not in bounds.
2. Is not "idiomatic Rust style" (which to me means avoiding mut). This is far less important.

How about something such as (untested):

Suggested change

-        let mut new_fields = vec![];
-        for i in indices {
-            let f = self.fields[i].clone();
-            new_fields.push(f);
-        }
+        let new_fields = indices
+            .into_iter()
+            .map(|i| {
+                // `get` returns None instead of panicking on a bad index
+                self.fields.get(i).cloned().ok_or_else(|| {
+                    ArrowError::SchemaError(format!(
+                        "project index {} out of bounds, max field {}",
+                        i,
+                        self.fields().len()
+                    ))
+                })
+            })
+            .collect::<Result<Vec<_>>>()?;

Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get to avoid fields[i], and then the somewhat confusing use of the turbofish in .collect::<Result<Vec<_>>>(). It took me quite a while to get used to that pattern.

Yea, that seems good to me, the for loop was the first thing that popped into my head, but I can't think of any reason it's better than yours.

I think the for loop is what one would write in other languages like C/C++, Java, Go, etc. :) It is certainly what I was writing when I started learning Rust.

Then I realized that a big part of how Rust avoids bounds checks while still being safe is the use of the functional style.
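For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of `get` combined with collecting into a `Result`. It uses only the standard library: `Field` and `project_fields` are hypothetical stand-ins rather than the arrow-rs types, the error is a plain `String` instead of `ArrowError`, and the "address" field name is made up for illustration.

```rust
/// Hypothetical stand-in for an Arrow field; only the name matters here.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

/// Select fields by index, returning an error instead of panicking
/// when an index is out of bounds.
fn project_fields(
    fields: &[Field],
    indices: impl IntoIterator<Item = usize>,
) -> Result<Vec<Field>, String> {
    indices
        .into_iter()
        .map(|i| {
            // `get` yields None for a bad index; `ok_or_else` turns that into an Err.
            fields.get(i).cloned().ok_or_else(|| {
                format!("project index {} out of bounds, max field {}", i, fields.len())
            })
        })
        // Collecting into Result<Vec<_>, _> stops at the first Err.
        .collect::<Result<Vec<_>, _>>()
}

fn main() {
    let fields = vec![
        Field { name: "name".to_string() },
        Field { name: "address".to_string() },
        Field { name: "priority".to_string() },
    ];

    let projected = project_fields(&fields, vec![0, 2]).unwrap();
    assert_eq!(projected[0].name, "name");
    assert_eq!(projected[1].name, "priority");

    // Index 3 does not exist, so the whole projection fails.
    assert!(project_fields(&fields, vec![2, 3]).is_err());
}
```

The turbofish tells `collect` to build a `Result` wrapping a `Vec`, so a single failing element short-circuits the whole collection.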

alamb · 2021-12-13T21:31:15Z

+        assert_eq!(projected.fields()[0].name(), "name");
+        assert_eq!(projected.fields()[1].name(), "priority");
+        assert_eq!(projected.metadata.get("meta").unwrap(), "data")
+    }


Related to the above: I recommend a test for the case where an index is out of bounds, such as schema.project([2, 3]).

Sure, will do
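A rough sketch of what such a test could check, using a simplified stand-in `Schema` (field names plus metadata) instead of the real arrow-rs type. The `sample_schema` helper, the "address" field, and the `String` error type are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a schema: field names plus key/value metadata.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

impl Schema {
    /// Keep only the fields at `indices`; metadata is carried over unchanged.
    fn project(&self, indices: &[usize]) -> Result<Schema, String> {
        let fields = indices
            .iter()
            .map(|&i| {
                self.fields.get(i).cloned().ok_or_else(|| {
                    format!(
                        "project index {} out of bounds, max field {}",
                        i,
                        self.fields.len()
                    )
                })
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Schema { fields, metadata: self.metadata.clone() })
    }
}

/// Hypothetical three-field schema with one metadata entry.
fn sample_schema() -> Schema {
    let mut metadata = HashMap::new();
    metadata.insert("meta".to_string(), "data".to_string());
    Schema {
        fields: vec![
            "name".to_string(),
            "address".to_string(),
            "priority".to_string(),
        ],
        metadata,
    }
}

fn main() {
    let schema = sample_schema();

    // In-bounds projection keeps the selected fields and the metadata.
    let projected = schema.project(&[0, 2]).unwrap();
    assert_eq!(projected.fields, vec!["name", "priority"]);
    assert_eq!(projected.metadata.get("meta").unwrap(), "data");

    // Index 3 is out of bounds for a three-field schema.
    let err = schema.project(&[2, 3]).unwrap_err();
    assert!(err.contains("out of bounds"));
}
```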

alamb · 2021-12-13T21:32:30Z


+
+    /// Projects the schema onto the specified columns
+    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {


The intent of this function was to project the RecordBatch rather than just the schema, with a signature like this:

Suggested change

-    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
+    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {

(so we would have to project the columns as well as the schema)

Ahh, I thought this part was a bit too easy, okay I'll update to reflect that.
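To illustrate what "project the columns as well as the schema" means, here is a minimal sketch with a toy `RecordBatch` whose schema is a list of column names and whose columns are plain `Vec<i32>` values. The types and the `pick` helper are hypothetical stand-ins, not the arrow-rs implementation.

```rust
/// Toy record batch: one name per column, and the column data itself.
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    schema: Vec<String>,
    columns: Vec<Vec<i32>>,
}

/// Hypothetical helper: clone the items at `indices`, failing on a bad index.
fn pick<T: Clone>(items: &[T], indices: &[usize]) -> Result<Vec<T>, String> {
    indices
        .iter()
        .map(|&i| {
            items.get(i).cloned().ok_or_else(|| {
                format!("project index {} out of bounds, max field {}", i, items.len())
            })
        })
        .collect()
}

impl RecordBatch {
    /// Project both the schema and the column data with the same indices,
    /// so the two stay aligned in the returned batch.
    fn project(&self, indices: &[usize]) -> Result<RecordBatch, String> {
        Ok(RecordBatch {
            schema: pick(&self.schema, indices)?,
            columns: pick(&self.columns, indices)?,
        })
    }
}

fn main() {
    let batch = RecordBatch {
        schema: vec!["a".to_string(), "b".to_string(), "c".to_string()],
        columns: vec![vec![1, 2, 3], vec![4, 5, 6], vec![7, 8, 9]],
    };

    let projected = batch.project(&[0, 2]).unwrap();
    assert_eq!(projected.schema, vec!["a", "c"]);
    assert_eq!(projected.columns, vec![vec![1, 2, 3], vec![7, 8, 9]]);
}
```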

hntd187 · 2021-12-14T17:37:30Z

@alamb I wanted to reuse impl IntoIterator<Item=usize> for the schema projection as well, so I had to change it to impl IntoIterator<Item=usize> + Clone for RecordBatch. This doesn't seem immediately correct to me, since the two functions now take different arguments, but it works.

alamb · 2021-12-14T20:59:31Z

It looks like the new code may not yet have been pushed to GitHub.

codecov-commenter · 2021-12-14T21:12:18Z

Codecov Report

❌ Patch coverage is 89.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.25%. Comparing base (239cba1) to head (8ade651).

Files with missing lines	Patch %	Lines
arrow/src/record_batch.rs	80.00%	4 Missing ⚠️
arrow/src/datatypes/schema.rs	94.44%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1033      +/-   ##
==========================================
- Coverage   82.31%   82.25%   -0.07%     
==========================================
  Files         168      168              
  Lines       49031    49197     +166     
==========================================
+ Hits        40360    40465     +105     
- Misses       8671     8732      +61

alamb

Thanks for sticking with this @hntd187

alamb · 2021-12-16T21:40:21Z

+
+        RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields)
+    }
+


How about some tests?

Perhaps something like

#[test]
fn project() {
    let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));
    let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
    let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));

    let record_batch = RecordBatch::try_from_iter(vec![
        ("a", a.clone()),
        ("b", b.clone()),
        ("c", c.clone()),
    ])
    .expect("valid conversion");

    let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
        .expect("valid conversion");

    assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
}

alamb · 2021-12-16T21:42:28Z

+        &self,
+        indices: impl IntoIterator<Item = usize> + Clone,
+    ) -> Result<RecordBatch> {
+        let projected_schema = self.schema.project(indices.clone())?;


I see now why you needed to make the iter Clone which is kind of annoying 🤔
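The reason for the `Clone` bound can be reproduced in a few lines: a value passed as `impl IntoIterator` is moved into the first call that consumes it, so using it a second time requires a copy. The sketch below uses hypothetical helpers (`pick`, `project_both`) and plain vectors, not the arrow-rs types.

```rust
/// Hypothetical helper: clone the items at the given indices,
/// returning None if any index is out of bounds.
fn pick<T: Clone>(items: &[T], indices: impl IntoIterator<Item = usize>) -> Option<Vec<T>> {
    indices.into_iter().map(|i| items.get(i).cloned()).collect()
}

/// Projects column names and column data with the same indices.
/// `indices` is consumed by the first `pick`, hence the `Clone` bound.
fn project_both<'a>(
    names: &[&'a str],
    columns: &[Vec<i32>],
    indices: impl IntoIterator<Item = usize> + Clone,
) -> Option<(Vec<&'a str>, Vec<Vec<i32>>)> {
    // Without `.clone()` here, the second use below would not compile
    // ("use of moved value").
    let projected_names = pick(names, indices.clone())?;
    let projected_columns = pick(columns, indices)?;
    Some((projected_names, projected_columns))
}

fn main() {
    let names = ["a", "b", "c"];
    let columns = [vec![1, 2], vec![3, 4], vec![5, 6]];

    let (n, c) = project_both(&names, &columns, vec![0, 2]).unwrap();
    assert_eq!(n, vec!["a", "c"]);
    assert_eq!(c, vec![vec![1, 2], vec![5, 6]]);

    // A range is also `IntoIterator<Item = usize> + Clone`.
    let (n, _) = project_both(&names, &columns, 0..2).unwrap();
    assert_eq!(n, vec!["a", "b"]);
}
```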

alamb · 2021-12-16T21:48:45Z


+    /// Returns a new schema with only the specified columns in the new schema
+    /// This carries metadata from the parent schema over as well
+    pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> {


I know I did something different in the ticket, but I think this interface is kind of annoying.

Namely, I couldn't pass in &vec![1, 2]

   --> arrow/src/datatypes/schema.rs:405:40
    |
405 |         let projected: Schema = schema.project(&vec![0, 2]).unwrap();
    |                                        ^^^^^^^ expected `&{integer}`, found `usize`

What would you think about being less fancy and changing this (and RecordBatch) to something like:

pub fn project(&self, indices: &[usize]) -> Result<Schema> {

Which would then avoid the need for the clone on RecordBatch::project as well
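As a rough illustration of the ergonomics being proposed, a `&[usize]` parameter accepts a borrowed `Vec`, a borrowed array, or an explicit slice, and the caller keeps ownership of the indices. `project_names` is a hypothetical function over plain string names, not the arrow-rs method.

```rust
/// Hypothetical projection over column names, taking the indices as a slice.
fn project_names(names: &[&str], indices: &[usize]) -> Result<Vec<String>, String> {
    indices
        .iter()
        .map(|&i| {
            names.get(i).map(|n| n.to_string()).ok_or_else(|| {
                format!("project index {} out of bounds, max field {}", i, names.len())
            })
        })
        .collect()
}

fn main() {
    let names = ["a", "b", "c"];
    let owned: Vec<usize> = vec![0, 2];
    let expected = vec!["a".to_string(), "c".to_string()];

    // A borrowed Vec coerces to a slice.
    assert_eq!(project_names(&names, &owned).unwrap(), expected);
    // So does a borrowed temporary Vec, the call that failed above.
    assert_eq!(project_names(&names, &vec![0, 2]).unwrap(), expected);
    // A borrowed array literal works too.
    assert_eq!(project_names(&names, &[0, 2]).unwrap(), expected);
    // And an explicit slice.
    assert_eq!(project_names(&names, &owned[..]).unwrap(), expected);

    // The indices were only borrowed, so they can be reused without cloning.
    assert_eq!(owned.len(), 2);
}
```

The trade-off is that callers holding some other iterator (a range, for example) need to collect it into a `Vec` first.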

alamb

Looks good -- thank you @hntd187

alamb · 2021-12-20T16:18:38Z

@hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

hntd187 · 2021-12-20T20:40:40Z

Oh, thank you very much, I appreciate that!

* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only
* Addressing PR updates and adding a test for out of range projection
* switch to &[usize]
* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Stephen Carman <hntd187@users.noreply.github.com>

Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

b896bdf

github-actions Bot added the arrow Changes to the arrow crate label Dec 12, 2021

alamb reviewed Dec 13, 2021

Addressing PR updates and adding a test for out of range projection

753d40f

alamb reviewed Dec 16, 2021

switch to &[usize]

824f8ff

alamb approved these changes Dec 20, 2021

fix: clippy and fmt

8ade651

alamb changed the title ~~Projection on Schema and RecordBatch~~ Add Schema::project and RecordBatch::project functions Dec 20, 2021

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Dec 20, 2021

alamb merged commit f3e452c into apache:master Dec 20, 2021

alamb mentioned this pull request Dec 20, 2021

Consolidate Projection for Schema and RecordBatch apache/datafusion#1425

Closed

alamb added the cherry-picked label Dec 21, 2021

alamb mentioned this pull request Dec 21, 2021

Cherry pick Add Schema::project and RecordBatch::project functions to active_release #1077

Merged

alamb mentioned this pull request Dec 22, 2021

Fix SortExec discards field metadata on the output schema apache/datafusion#1477

Merged

This was referenced Jan 21, 2022

Consolidate Schema and RecordBatch projection apache/datafusion#1638

Merged

Consolidate Schema and RecordBatch projection #1638 apache/datafusion#1646

Closed
