Skip to content

[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan#25989

Closed
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-29317
Closed

[SPARK-29317][SQL][PYTHON] Avoid inheritance hierarchy in pandas CoGroup arrow runner and its plan#25989
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-29317

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This PR proposes to avoid abstract classes introduced at #24965 but instead uses trait and object.

  • abstract class BaseArrowPythonRunner -> trait PythonArrowOutput to allow mix-in

    Before:

    BasePythonRunner
    ├── BaseArrowPythonRunner
    │   ├── ArrowPythonRunner
    │   └── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner
    

    After:

    └── BasePythonRunner
        ├── ArrowPythonRunner
        ├── CoGroupedArrowPythonRunner
        ├── PythonRunner
        └── PythonUDFRunner
    
  • abstract class BasePandasGroupExec -> object PandasGroupUtils to decouple

    Before:

    └── BasePandasGroupExec
        ├── FlatMapGroupsInPandasExec
        └── FlatMapCoGroupsInPandasExec
    

    After:

    ├── FlatMapGroupsInPandasExec
    └── FlatMapCoGroupsInPandasExec
    

Why are the changes needed?

The problem is that R code path is being matched with Python side:

Python:

└── BasePythonRunner
    ├── ArrowPythonRunner
    ├── CoGroupedArrowPythonRunner
    ├── PythonRunner
    └── PythonUDFRunner

R:

└── BaseRRunner
    ├── ArrowRRunner
    └── RRunner

I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. Internal implementation is also similar intentionally.

BasePandasGroupExec case is similar as well. R (with Arrow optimization, in particular) has some duplicated codes with Pandas UDFs.

FlatMapGroupsInRWithArrowExec <> FlatMapGroupsInPandasExec
MapPartitionsInRWithArrowExec <> ArrowEvalPythonExec

In order to prepare deduplication here as well, it might better avoid changing hierarchy alone in Python side.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Locally tested existing tests. Jenkins tests should verify this too.

@HyukjinKwon

HyukjinKwon commented Oct 1, 2019

Copy link
Copy Markdown
Member Author

@d80tb7 and @BryanCutler, although I don't think this way is particularly better, I thought it's anyway better to let them separate.

Code lengths are virtually same. Do you guys like this way?

@SparkQA

SparkQA commented Oct 1, 2019

Copy link
Copy Markdown

Test build #111644 has finished for PR 25989 at commit 868cda2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@d80tb7

d80tb7 commented Oct 3, 2019

Copy link
Copy Markdown
Contributor

Hi @HyukjinKwon

It looks good to me- it's certainly no worse than what was there before and it meets the requirement of keeping the R and Python code paths aligned.

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Thanks, @d80tb7

Merged to master.

@BryanCutler

Copy link
Copy Markdown
Member

@HyukjinKwon I think I slightly prefer the way it was before, but I haven't thought too much about aligning with R runner classes. I'm all for refactoring these to deduplicate and make it easier to manage, so if this is a step towards that, then it's fine. There are a couple things I have in mind for a redesign, so it would be good if we could discuss some before jumping in.

@HyukjinKwon

Copy link
Copy Markdown
Member Author

I dont plan to redesign it right now but wanted both just to be smilar as it was before for now. Sure, let's discuss when we do this.. I think we might have to think about this soon maybe after Spark 3.0 release.

@HyukjinKwon HyukjinKwon deleted the SPARK-29317 branch March 3, 2020 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants