ARROW-15641: [C++][Python] UDF Aggregate Function Implementation#14527
ARROW-15641: [C++][Python] UDF Aggregate Function Implementation#14527vibhatha wants to merge 13 commits into
Conversation
|
|
|
|
979b9f1 to
cef71af
Compare
|
cc @westonpace appreciate an initial review on the draft PR. |
westonpace
left a comment
There was a problem hiding this comment.
The shape of this looks correct. I believe you are on the right track. I think I would like to see at least one example of using a custom UDF in an exec plan. Also, similar to scalar UDFs, I think we will eventually want an example that uses something other than pyarrow (e.g. numpy) to do the computation (at least for the consume function)
| @property | ||
| def non_null(self): | ||
| return self._non_null | ||
|
|
||
| @non_null.setter | ||
| def non_null(self, value): | ||
| self._non_null = value |
There was a problem hiding this comment.
I'm not much of a python expert but getters and setters seem like overkill here. Are they needed?
| state = State(0) | ||
| return state |
There was a problem hiding this comment.
| state = State(0) | |
| return state | |
| return State(0) |
| def consume(ctx, x): | ||
| if isinstance(x, pa.Array): | ||
| non_null = pc.sum(pc.invert(pc.is_nan(x))).as_py() | ||
| elif isinstance(x, pa.Scalar): |
There was a problem hiding this comment.
Can a unary aggregate ever be called with a scalar?
| def finalize(ctx): | ||
| return pa.array([ctx.state.non_null]) | ||
|
|
||
| func_name = "simple_count" |
There was a problem hiding this comment.
Maybe valid_count or non_null_count?
| @property | ||
| def non_null(self): | ||
| return self._non_null | ||
|
|
||
| @non_null.setter | ||
| def non_null(self, value): | ||
| self._non_null = value | ||
|
|
||
| @property | ||
| def null(self): | ||
| return self._null | ||
|
|
||
| @null.setter | ||
| def null(self, value): | ||
| self._null = value |
There was a problem hiding this comment.
Same comment on getters and setters
| To define a varargs function, pass a callable that takes | ||
| varargs. The last in_type will be the type of all varargs | ||
| arguments. |
There was a problem hiding this comment.
How do varargs work? Do I define a *args?
| must be merged with. The current state can be retrieved from | ||
| the context object which can be acessed by `context.state`. | ||
| The state doesn't need to be set in the Python side and it is | ||
| autonomously handled in the C++ backend. The updated state must |
There was a problem hiding this comment.
I'm not sure I understand the sentence that starts with "The state doesn't need to be set..."
|
|
||
| This function returns the updated state after consuming the | ||
| received data. | ||
| merge_func: callable |
There was a problem hiding this comment.
The concept of a "merge function" is not going to be obvious to most users. It is very possible for an engine to be defined that does not have to worry about the concept of a merge. I think we need to describe some background here for why a merge is needed in the first place. Something like:
Aggregates may be calculated across many threads in parallel. Each thread will call the init function once to generate a state for that thread. Once all values have been consumed then the threads from each state will be merged together to get the final result state. The merge function should take two states and combine them.
| The first argument is the context argument of type | ||
| ScalarUdfContext. | ||
| Using the context argument the state can be extracted and return | ||
| type must be an array matching the `out_type`. |
There was a problem hiding this comment.
In your C++ example the return type is a scalar?
| A callable implementing the user-defined finalize function. | ||
| The first argument is the context argument of type | ||
| ScalarUdfContext. | ||
| Using the context argument the state can be extracted and return |
There was a problem hiding this comment.
"the state can be extracted" is not very obvious. Maybe something like:
The purpose of the finalize function is to transform the state (which is available in the context argument) into an array. This array will be the final result of the aggregation.
Initial Draft of Aggregate UDFs for Python