[SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide #26110

Closed
d80tb7 wants to merge 6 commits into apache:master from d80tb7:SPARK-29126-cogroup-udf-usage-guide

Conversation

@d80tb7

@d80tb7 d80tb7 commented Oct 14, 2019

Contributor

This PR adds some extra documentation for the new Cogrouped map Pandas udfs. Specifically:

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112022 has finished for PR 26110 at commit 0ecba8a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112025 has finished for PR 26110 at commit 1802cbd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112029 has finished for PR 26110 at commit da4f00b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Member

retest this please

@HyukjinKwon

Member

From a cursory look, seems fine. cc @icexelloss, @BryanCutler, @viirya

Comment thread docs/sql-pyspark-pandas-with-arrow.md Outdated

### Cogrouped Map

CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to

Member

Is it CoGrouped or Cogrouped :-)?

Comment thread docs/sql-pyspark-pandas-with-arrow.md Outdated
on how to label columns when constructing a `pandas.DataFrame`.

Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of
memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size)

Member

typo: `for[maxRecordsPerBatch]` -> `for [maxRecordsPerBatch]`

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112538 has finished for PR 26110 at commit da4f00b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler left a comment

Member

Thanks for doing this, @d80tb7. I just had some minor comments, but it looks good overall.

Comment thread docs/sql-pyspark-pandas-with-arrow.md Outdated

### Cogrouped Map

CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to

Member

cogrouped a by -> cogrouped by

Comment thread docs/sql-pyspark-pandas-with-arrow.md Outdated
each cogroup. They are used with `groupBy().cogroup().apply()` which consists of the following steps:

* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
* Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple

Member

Duplicate "of" in "input of of".

Comment thread docs/sql-pyspark-pandas-with-arrow.md Outdated
* Shuffle the data such that the groups of each dataframe which share a key are cogrouped together.
* Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple
representing the key). The output of the function is a `pandas.DataFrame`.
* Combine the results into a new `DataFrame`.

Member

Maybe elaborate to explain that the results are the `pandas.DataFrame`s returned for all groups, which are combined into a new PySpark `DataFrame`.
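
For readers following this thread, the three steps quoted above (cogroup by key, apply a function per cogroup, combine the outputs) can be sketched in plain Python. This is only an emulation of the semantics on lists of dict rows, not the PySpark or pandas API: `cogroup_apply`, `summarize`, and the sample rows are hypothetical names made up for illustration. It also assumes the function is called for keys that appear on only one side, with an empty group for the other side.

```python
from collections import defaultdict


def cogroup_apply(left_rows, right_rows, key, func):
    """Emulate cogroup-then-apply on two lists of dict rows (not PySpark)."""
    # Step 1: bucket the rows of each input by key, standing in for the shuffle.
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for row in left_rows:
        left_groups[row[key]].append(row)
    for row in right_rows:
        right_groups[row[key]].append(row)

    # Step 2: call func once per key seen on either side.
    # Assumption: a side with no rows for that key is passed as an empty list.
    # Step 3: concatenate the per-cogroup outputs into a single result.
    combined = []
    for k in sorted(set(left_groups) | set(right_groups)):
        combined.extend(func(k, left_groups.get(k, []), right_groups.get(k, [])))
    return combined


def summarize(k, left, right):
    # Hypothetical user function: emits one output row per cogroup.
    return [{"id": k, "n_left": len(left), "total_right": sum(r["v"] for r in right)}]


left = [{"id": 1, "v": 10}, {"id": 1, "v": 20}, {"id": 2, "v": 30}]
right = [{"id": 1, "v": 5}, {"id": 3, "v": 7}]
result = cogroup_apply(left, right, "id", summarize)
```

In the real feature each group would arrive as a `pandas.DataFrame` and the function would return a `pandas.DataFrame`. The sketch also makes the memory note visible: each whole cogroup is materialized before `func` runs.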

Comment thread python/pyspark/sql/functions.py Outdated

6. COGROUPED_MAP

A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame`

Member

I think that instead of "two pandas.DataFrame", it would be better to show "(pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame".

Comment thread python/pyspark/sql/functions.py Outdated
6. COGROUPED_MAP

A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame`
The returnType should be a :class:`StructType` describing the schema of the returned

Member

returnType -> returnType

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112939 has finished for PR 26110 at commit 81713b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2019

Test build #112944 has finished for PR 26110 at commit f7b9b80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Member

Merged to master.
