[SPARK-29126][PYSPARK][DOC] Pandas Cogroup udf usage guide#26110
d80tb7 wants to merge 6 commits into
…9126-cogroup-udf-usage-guide # Conflicts: # python/pyspark/sql/cogroup.py
Test build #112022 has finished for PR 26110 at commit
Test build #112025 has finished for PR 26110 at commit
Test build #112029 has finished for PR 26110 at commit
retest this please
From a cursory look, seems fine. cc @icexelloss, @BryanCutler, @viirya
| ### Cogrouped Map | ||
| CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to |
Is it CoGrouped or Cogrouped :-)?
| on how to label columns when constructing a `pandas.DataFrame`. | ||
| Note that all data for a cogroup will be loaded into memory before the function is applied. This can lead to out of | ||
| memory exceptions, especially if the group sizes are skewed. The configuration for[maxRecordsPerBatch](#setting-arrow-batch-size) |
typo: for[maxRecordsPerBatch] -> for [maxRecordsPerBatch]
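To illustrate the memory note in the hunk above: because every row of a cogroup is loaded at once, it can help to look at per-key row counts on both sides before running the UDF. The following is a minimal pandas-only sketch. `cogroup_sizes` and the sample frames are hypothetical names made up for illustration and are not part of PySpark.

```python
import pandas as pd


def cogroup_sizes(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.Series:
    """Rows per key across both frames, largest first (hypothetical helper)."""
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # A key present on only one side contributes 0 rows from the other side.
    total = left_counts.add(right_counts, fill_value=0).astype(int)
    return total.sort_values(ascending=False)


# Made-up sample data: key 1 is over-represented compared to keys 2 and 3.
df1 = pd.DataFrame({"id": [1, 1, 1, 2], "v1": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"id": [1, 3], "v2": ["x", "y"]})

sizes = cogroup_sizes(df1, df2, "id")
print(sizes)
```

On a real cluster the same counts would presumably come from an aggregation on the Spark side rather than from pandas frames held locally; the sketch only shows which quantity to inspect.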
Test build #112538 has finished for PR 26110 at commit
BryanCutler left a comment
Thanks for doing this @d80tb7. I just had some minor comments, but it looks good overall.
| ### Cogrouped Map | ||
| CoGrouped map Pandas UDFs allow two DataFrames to be cogrouped a by a common key and then a python function applied to |
cogrouped a by -> cogrouped by
| each cogroup. They are used with `groupBy().cogroup().apply()` which consists of the following steps: | ||
| * Shuffle the data such that the groups of each dataframe which share a key are cogrouped together. | ||
| * Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple |
Duplicate "of" in "input of of".
| * Shuffle the data such that the groups of each dataframe which share a key are cogrouped together. | ||
| * Apply a function to each cogroup. The input of of the function is two `pandas.DataFrame` (with an optional Tuple | ||
| representing the key). The output of the function is a `pandas.DataFrame`. | ||
| * Combine the results into a new `DataFrame`. |
Maybe elaborate to explain that the results are the pandas.DataFrames from all groups, which are combined into a new pyspark.DataFrame.
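The three steps in the hunk above can be emulated locally with plain pandas, which may make the "combine" step the reviewer asks about more concrete. This is a sketch under stated assumptions, not the Spark implementation: `emulate_cogroup_apply` is a hypothetical helper, the sample data is made up, and passing an empty frame for a key missing on one side is an assumption of this emulation.

```python
import pandas as pd


def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Both inputs hold the rows of a single cogroup (one value of "id").
    return pd.merge_asof(left, right, on="time", by="id")


def emulate_cogroup_apply(df1, df2, key, func):
    """Local stand-in for the shuffle / apply / combine steps (hypothetical)."""
    # Step 1: bring together the groups of each frame that share a key.
    groups1 = {k: g for k, g in df1.groupby(key)}
    groups2 = {k: g for k, g in df2.groupby(key)}
    keys = sorted(set(groups1) | set(groups2))
    # Step 2: apply the function to each cogroup. A key missing on one
    # side gets an empty frame here (assumption of this emulation).
    results = [
        func(groups1.get(k, df1.iloc[0:0]), groups2.get(k, df2.iloc[0:0]))
        for k in keys
    ]
    # Step 3: combine the per-cogroup pandas.DataFrames into one result.
    return pd.concat(results, ignore_index=True)


df1 = pd.DataFrame({
    "time": [20000101, 20000101, 20000102, 20000102],
    "id": [1, 2, 1, 2],
    "v1": [1.0, 2.0, 3.0, 4.0],
})
df2 = pd.DataFrame({"time": [20000101, 20000102], "id": [1, 2], "v2": ["x", "y"]})

out = emulate_cogroup_apply(df1, df2, "id", asof_join)
print(out)

# With Spark, as described in this PR, the same function would be registered
# as a COGROUPED_MAP pandas_udf and wired up roughly like this (not run here):
#   df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join)
```

In Spark the final step yields a distributed `DataFrame` rather than a single pandas frame; `pd.concat` only stands in for that combination.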
| 6. COGROUPED_MAP | ||
| A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame` |
I think instead of "two pandas.DataFrame", better to show "(pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame"
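As a rough illustration of the suggested `(pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame` shape, here is a small pandas-only sketch of a function in the variant that also receives the key tuple, together with a check that its output columns line up with a declared return schema. `RETURN_SCHEMA`, `count_rows`, and `column_names` are hypothetical names chosen for this example. The naive DDL parsing is only for illustration and is not how Spark reads a schema.

```python
import pandas as pd

# Hypothetical return schema, written in the DDL string form Spark accepts.
RETURN_SCHEMA = "id long, n_left long, n_right long"


def count_rows(key, left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # key is a tuple of grouping values, e.g. (1,) for one grouping column.
    return pd.DataFrame(
        {"id": [key[0]], "n_left": [len(left)], "n_right": [len(right)]}
    )


def column_names(ddl: str) -> list:
    """Field names from a simple 'name type, name type' string (illustrative)."""
    return [field.split()[0] for field in ddl.split(",")]


# One made-up cogroup for key (1,).
left = pd.DataFrame({"id": [1, 1], "v1": [1.0, 3.0]})
right = pd.DataFrame({"id": [1], "v2": ["x"]})

result = count_rows((1,), left, right)
# The returned column labels should line up with the declared schema.
assert list(result.columns) == column_names(RETURN_SCHEMA)
print(result)
```

Labeling the output columns with the same names as the schema fields keeps the mapping between the returned `pandas.DataFrame` and the declared return type unambiguous.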
| 6. COGROUPED_MAP | ||
| A cogrouped map UDF defines transformation: two `pandas.DataFrame` -> `pandas.DataFrame` | ||
| The returnType should be a :class:`StructType` describing the schema of the returned |
…9126-cogroup-udf-usage-guide
Test build #112939 has finished for PR 26110 at commit
Test build #112944 has finished for PR 26110 at commit
Merged to master.
This PR adds some extra documentation for the new Cogrouped map Pandas udfs. Specifically:
COGROUPED_MAP Pandas udfs added in [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs #24981