[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan #25989
HyukjinKwon wants to merge 1 commit into
@d80tb7 and @BryanCutler, although I don't think this approach is particularly better, I thought it would still be better to keep them separate. The code length is virtually the same. Do you like this approach?
Test build #111644 has finished for PR 25989 at commit
Hi @HyukjinKwon, it looks good to me. It's certainly no worse than what was there before, and it meets the requirement of keeping the R and Python code paths aligned.
Thanks, @d80tb7. Merged to master.
@HyukjinKwon I think I slightly prefer the way it was before, but I haven't thought too much about aligning with the R runner classes. I'm all for refactoring these to deduplicate and make them easier to manage, so if this is a step towards that, then it's fine. There are a couple of things I have in mind for a redesign, so it would be good if we could discuss them before jumping in.
I don't plan to redesign it right now; I just wanted both to stay similar, as they were before, for now. Sure, let's discuss when we do this. I think we might have to think about this soon, maybe after the Spark 3.0 release.
What changes were proposed in this pull request?
This PR proposes to avoid the abstract classes introduced in #24965 and to use a trait and an object instead.
abstract class BaseArrowPythonRunner -> trait PythonArrowOutput, to allow mix-in
Before:
After:
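As a rough illustration of the idea only (not the actual Spark code), the sketch below contrasts an intermediate abstract class with a mix-in trait. All names ending in `Sketch`, `Before`, or `After`, and the string-based "serialization", are hypothetical stand-ins chosen to keep the example self-contained.

```scala
// Hypothetical, simplified stand-ins for illustration only.

// Common runner base: writes each input, then reads the result back.
abstract class RunnerSketch[IN, OUT] {
  protected def writeInput(in: IN): String
  protected def readOutput(raw: String): OUT
  def compute(inputs: Iterator[IN]): Iterator[OUT] =
    inputs.map(in => readOutput(writeInput(in)))
}

// Before: an intermediate abstract class fixes the output side, so every
// runner that wants this output handling must inherit from it.
abstract class BaseArrowRunnerSketch[IN] extends RunnerSketch[IN, Int] {
  protected def readOutput(raw: String): Int = raw.length
}

class GroupedRunnerBefore extends BaseArrowRunnerSketch[Seq[Int]] {
  protected def writeInput(in: Seq[Int]): String = in.mkString(",")
}

// After: the output side is a trait that is mixed into a runner which
// extends the common base directly.
trait ArrowOutputSketch {
  protected def readOutput(raw: String): Int = raw.length
}

class GroupedRunnerAfter
    extends RunnerSketch[Seq[Int], Int] with ArrowOutputSketch {
  protected def writeInput(in: Seq[Int]): String = in.mkString(",")
}

object MixInDemo {
  def main(args: Array[String]): Unit = {
    val input = Seq(Seq(1, 2, 3), Seq(10))
    val before = new GroupedRunnerBefore().compute(input.iterator).toList
    val after = new GroupedRunnerAfter().compute(input.iterator).toList
    // Both variants behave the same; only the hierarchy differs.
    println(before == after)
  }
}
```

With the trait, a runner with a different input side can reuse the same output handling without sharing a parent class beyond the common base.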
abstract class BasePandasGroupExec -> object PandasGroupUtils, to decouple
Before:
After:
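Again as an illustration only, and not the actual Spark code, the following sketch shows the general shape of moving shared helpers from an abstract base class into a standalone object. `KeyedRow`, `BaseGroupExecSketch`, `GroupUtilsSketch`, and the `FlatMapGroups*` classes are hypothetical names.

```scala
// Hypothetical, simplified stand-ins for illustration only.
final case class KeyedRow(key: String, value: Int)

// Before: shared helpers live in an abstract base class, so each plan
// node must inherit from it to reuse them.
abstract class BaseGroupExecSketch {
  protected def groupByKey(rows: Seq[KeyedRow]): Map[String, Seq[Int]] =
    rows.groupBy(_.key).map { case (k, rs) => k -> rs.map(_.value) }
}

class FlatMapGroupsBefore extends BaseGroupExecSketch {
  def execute(rows: Seq[KeyedRow]): Map[String, Int] =
    groupByKey(rows).map { case (k, vs) => k -> vs.sum }
}

// After: the same helper is a plain function on an object, so the plan
// node no longer needs a dedicated parent class.
object GroupUtilsSketch {
  def groupByKey(rows: Seq[KeyedRow]): Map[String, Seq[Int]] =
    rows.groupBy(_.key).map { case (k, rs) => k -> rs.map(_.value) }
}

class FlatMapGroupsAfter {
  def execute(rows: Seq[KeyedRow]): Map[String, Int] =
    GroupUtilsSketch.groupByKey(rows).map { case (k, vs) => k -> vs.sum }
}

object DecoupleDemo {
  def main(args: Array[String]): Unit = {
    val rows = Seq(KeyedRow("a", 1), KeyedRow("b", 2), KeyedRow("a", 3))
    // Both variants compute the same per-key sums.
    println(new FlatMapGroupsBefore().execute(rows) ==
      new FlatMapGroupsAfter().execute(rows))
  }
}
```

The trade-off is the usual one: helpers on an object are called explicitly, which frees each plan node to choose its own parent and keeps the helpers reusable from unrelated classes.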
Why are the changes needed?
The problem is that the R code path is matched with the Python side:
Python:
R:
I would like to match the hierarchy and decouple the other parts for now if possible. Ideally we should deduplicate both code paths. The internal implementation is also intentionally similar.
The BasePandasGroupExec case is similar as well. R (with Arrow optimization, in particular) has some code duplicated with Pandas UDFs:
FlatMapGroupsInRWithArrowExec <> FlatMapGroupsInPandasExec
MapPartitionsInRWithArrowExec <> ArrowEvalPythonExec
In order to prepare for deduplication here as well, it might be better to avoid changing the hierarchy on the Python side alone.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Ran the existing tests locally. Jenkins tests should verify this too.