Add Schema::project and RecordBatch::project functions #1033
…eturning a new schema with those columns only
| let mut new_fields = vec![]; | ||
| for i in indices { | ||
| let f = self.fields[i].clone(); | ||
| new_fields.push(f); | ||
| } |
I think that, as written, this:
- will panic! if the index is not in bounds
- is not "idiomatic Rust style" (which to me means avoiding mut); this is far less important

How about something such as (untested):
| let mut new_fields = vec![]; | |
| for i in indices { | |
| let f = self.fields[i].clone(); | |
| new_fields.push(f); | |
| } | |
| let new_fields = indices | |
| .into_iter() | |
| .map(|i| { | |
| self.fields.get(i).cloned().ok_or_else(|| { | |
| ArrowError::SchemaError(format!( | |
| "project index {} out of bounds, max field {}", | |
| i, | |
| self.fields().len() | |
| )) | |
| }) | |
| }) | |
| .collect::<Result<Vec<_>>>()?; |
Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get to avoid fields[i], and then the somewhat confusing use of the turbofish .collect::<Result<Vec<_>>>(). It took me quite a while to get used to that pattern.
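For readers new to the pattern, here is a minimal standalone sketch of the same idea. It assumes hypothetical stand-ins (`Field`, `ProjectError`, `project_fields`) rather than the real arrow types, so it only illustrates how `get` plus a fallible `collect` replaces indexing:

```rust
// Hypothetical stand-ins for arrow's Field and ArrowError; not the real types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

#[derive(Debug, PartialEq)]
enum ProjectError {
    OutOfBounds { index: usize, len: usize },
}

/// Returns clones of the fields at `indices`, or an error for the first
/// index that is out of bounds.
fn project_fields(fields: &[Field], indices: &[usize]) -> Result<Vec<Field>, ProjectError> {
    indices
        .iter()
        .map(|&i| {
            // `get` returns Option<&Field> instead of panicking like `fields[i]`.
            fields.get(i).cloned().ok_or(ProjectError::OutOfBounds {
                index: i,
                len: fields.len(),
            })
        })
        // Collecting into Result<Vec<_>, _> short-circuits on the first Err.
        .collect::<Result<Vec<_>, _>>()
}
```

The turbofish only tells `collect` which container to build; the `Result` wrapper is what turns a sequence of per-item results into one overall result.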
Yea, that seems good to me, the for loop was the first thing that popped into my head, but I can't think of any reason it's better than yours.
I think the for loop is what one would write in other languages like C/C++, Java, Go, etc. :) It is certainly what I was writing when I started learning Rust.
Then I realized that a big part of how Rust avoids bounds checks while still being safe is the use of the functional style.
| assert_eq!(projected.fields()[0].name(), "name"); | ||
| assert_eq!(projected.fields()[1].name(), "priority"); | ||
| assert_eq!(projected.metadata.get("meta").unwrap(), "data") | ||
| } |
Related to above -- I recommend a test for handling if index is out of bounds -- like schema.project([2, 3])
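Such a test might look roughly like the sketch below. It assumes a simplified, hypothetical `Schema` (field names plus string metadata) and a `String` error instead of the real arrow types, so the names and error text are illustrative only:

```rust
use std::collections::HashMap;

// Hypothetical, simplified schema: field names plus string metadata.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

impl Schema {
    /// Keeps only the fields at `indices`; metadata is carried over unchanged.
    fn project(&self, indices: &[usize]) -> Result<Schema, String> {
        let fields = indices
            .iter()
            .map(|&i| {
                self.fields.get(i).cloned().ok_or_else(|| {
                    format!(
                        "project index {} out of bounds, max field {}",
                        i,
                        self.fields.len()
                    )
                })
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Schema {
            fields,
            metadata: self.metadata.clone(),
        })
    }
}

/// Builds a three-field schema with one metadata entry.
fn sample_schema() -> Schema {
    Schema {
        fields: vec!["id".into(), "name".into(), "priority".into()],
        metadata: HashMap::from([("meta".to_string(), "data".to_string())]),
    }
}

#[test]
fn project_out_of_bounds_is_an_error() {
    // Index 3 does not exist in a three-field schema, so this must not panic.
    assert!(sample_schema().project(&[2, 3]).is_err());
}
```

The point of the test is that an invalid index surfaces as an `Err` the caller can handle, rather than as a panic.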
| /// Projects the schema onto the specified columns | ||
| pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> { |
The intent of this field was to project the RecordBatch rather than just the schema:
A signature like this:
| pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> { | |
| pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> { |
(so we would also have to project the columns as well as the schema)
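To make the difference concrete, here is a rough sketch of projecting columns and schema together. `Batch` is a hypothetical stand-in in which a column is a plain `Vec<i32>` and the schema is a list of names; the real RecordBatch holds reference-counted arrays (`ArrayRef` is an `Arc<dyn Array>`), so cloning a column there is cheap:

```rust
// Hypothetical, simplified batch: one name and one Vec<i32> per column.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    schema: Vec<String>,
    columns: Vec<Vec<i32>>,
}

impl Batch {
    /// Projects both the schema and the columns onto `indices`.
    fn project(&self, indices: &[usize]) -> Result<Batch, String> {
        let picked = indices
            .iter()
            .map(|&i| match (self.schema.get(i), self.columns.get(i)) {
                // Keep the field name and its column together.
                (Some(name), Some(col)) => Ok((name.clone(), col.clone())),
                _ => Err(format!(
                    "project index {} out of bounds, max field {}",
                    i,
                    self.schema.len()
                )),
            })
            .collect::<Result<Vec<_>, String>>()?;

        // Split the (name, column) pairs back into two parallel vectors.
        let (schema, columns): (Vec<String>, Vec<Vec<i32>>) = picked.into_iter().unzip();
        Ok(Batch { schema, columns })
    }
}
```

Selecting name and column in one pass keeps the two in step, so the projected schema cannot drift out of sync with the projected data.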
Ahh, I thought this part was a bit too easy, okay I'll update to reflect that.
@alamb
It looks like the new code may not yet have been pushed to GitHub
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1033      +/-   ##
==========================================
- Coverage   82.31%   82.25%   -0.07%
==========================================
  Files         168      168
  Lines       49031    49197     +166
==========================================
+ Hits        40360    40465     +105
- Misses       8671     8732      +61
| RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields) | ||
| } | ||
How about some tests?
Perhaps something like
#[test]
fn project() {
    let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));
    let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
    let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));

    let record_batch = RecordBatch::try_from_iter(vec![
        ("a", a.clone()),
        ("b", b.clone()),
        ("c", c.clone()),
    ])
    .expect("valid conversion");

    let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
        .expect("valid conversion");

    assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
}
| &self, | ||
| indices: impl IntoIterator<Item = usize> + Clone, | ||
| ) -> Result<RecordBatch> { | ||
| let projected_schema = self.schema.project(indices.clone())?; |
I see now why you needed to make the iter Clone which is kind of annoying 🤔
| /// Returns a new schema with only the specified columns in the new schema | ||
| /// This carries metadata from the parent schema over as well | ||
| pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> { |
I know I did something different in the ticket, but I think this interface is kind of annoying.
Namely, I couldn't pass in &vec![1, 2]
--> arrow/src/datatypes/schema.rs:405:40
|
405 | let projected: Schema = schema.project(&vec![0, 2]).unwrap();
| ^^^^^^^ expected `&{integer}`, found `usize`
What would you think about being less fancy and changing this (and RecordBatch) to something like:
pub fn project(&self, indices: &[usize]) -> Result<Schema> {
Which would then avoid the need for the clone on RecordBatch::project as well.
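The trade-off between the two signatures can be sketched with a few hypothetical helper functions (`count_generic`, `count_slice`, `walk_twice`), which are illustrations only and not part of this PR:

```rust
// Hypothetical helpers contrasting the two signatures discussed above.

/// Generic version: accepts any iterator that yields owned `usize` values.
/// Passing `&vec![0, 2]` does not compile here, because iterating over a
/// `&Vec<usize>` yields `&usize` items rather than `usize`.
fn count_generic(indices: impl IntoIterator<Item = usize>) -> usize {
    indices.into_iter().count()
}

/// Slice version: accepts `&[usize]`, and so also `&Vec<usize>` and
/// `&[usize; N]` through the usual coercions.
fn count_slice(indices: &[usize]) -> usize {
    indices.len()
}

/// A slice can be iterated more than once without a `Clone` bound, for
/// example once for the schema and once for the columns.
fn walk_twice(indices: &[usize]) -> (usize, usize) {
    let first: usize = indices.iter().sum();
    let second: usize = indices.iter().sum();
    (first, second)
}
```

With the generic signature a caller holding a borrowed collection has to write something like `indices.iter().copied()`; with the slice signature `&vec![0, 2]` and `&[0, 2]` both work directly.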
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only
* Addressing PR updates and adding a test for out of range projection
* switch to &[usize]
* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Stephen Carman <hntd187@users.noreply.github.com>
Which issue does this PR close?
Closes #1014.
Rationale for this change
See #1014. A lot of code can be simplified, and this also fixes silent bugs in metadata handling.
What changes are included in this PR?
Two methods, Schema::project and RecordBatch::project, which project a schema or a record batch onto a subset of its columns.
Are there any user-facing changes?