[SPARK-22797][PySpark] Bucketizer support multi-column by zhengruifeng · Pull Request #19892 · apache/spark

zhengruifeng · 2017-12-05T10:08:59Z

What changes were proposed in this pull request?

Add multi-column support to Bucketizer on the Python side.
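For readers unfamiliar with the feature, here is a rough pure-Python sketch of what bucketizing several columns at once means: each column gets its own list of split points, and each value is mapped to the index of the bucket it falls into. This is not the PySpark implementation; the function names `bucketize` and `bucketize_columns` and the `handle_invalid` argument are hypothetical and only loosely mirror the `handleInvalid` param discussed in this PR (only the "keep" and "error" behaviours are sketched).

```python
import bisect
import math


def bucketize(value, splits, handle_invalid="error"):
    """Map one value to a bucket index, given strictly increasing splits.

    Buckets are [splits[i], splits[i+1]), except the last one, which also
    includes its upper bound. NaN goes to an extra bucket when
    handle_invalid == "keep"; otherwise it is an error.
    """
    n_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "keep":
            return float(n_buckets)  # special additional bucket for invalid values
        raise ValueError("NaN encountered and handle_invalid is not 'keep'")
    if value == splits[-1]:
        return float(n_buckets - 1)  # the last bucket is closed on the right
    idx = bisect.bisect_right(splits, value) - 1
    if idx < 0 or idx >= n_buckets:
        raise ValueError("value %r is outside the split range" % (value,))
    return float(idx)


def bucketize_columns(rows, splits_array, handle_invalid="error"):
    """Apply one list of splits per column to every row (hypothetical helper)."""
    return [
        [bucketize(v, s, handle_invalid) for v, s in zip(row, splits_array)]
        for row in rows
    ]


inf = float("inf")
splits_array = [[-inf, 0.5, 1.4, inf], [0.0, 10.0, 20.0]]
rows = [(0.1, 5.0), (1.5, float("nan"))]
bucketed = bucketize_columns(rows, splits_array, handle_invalid="keep")
```

The point of the sketch is only the shape of the API: one list of splits per input column, one output column per input column.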

How was this patch tested?

Existing tests and added tests.

SparkQA · 2017-12-05T10:13:53Z

Test build #84478 has finished for PR 19892 at commit 5ed91fd.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-05T10:29:43Z

Test build #84479 has finished for PR 19892 at commit 906a81d.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-05T11:06:51Z

Test build #84480 has finished for PR 19892 at commit 5efc94e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2017-12-06T01:39:14Z

This PR is currently blocked by #19894 (comment)

zhengruifeng · 2017-12-14T07:57:01Z

retest this please

SparkQA · 2017-12-14T08:05:01Z

Test build #84898 has finished for PR 19892 at commit 5efc94e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2017-12-14T08:06:43Z

retest this please

SparkQA · 2017-12-14T08:27:44Z

Test build #84900 has finished for PR 19892 at commit 5efc94e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2017-12-14T08:30:52Z

ping @holdenk, can you help review this?

viirya · 2017-12-18T07:29:08Z

Update this doc too?

viirya · 2017-12-18T07:49:17Z

toListFloat requires that each list entry be numeric (TypeConverters._is_numeric). Should toListListFloat have the same requirement?
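To make the question concrete, here is a minimal sketch of a nested converter that applies the same per-entry numeric check as a flat one. It is written in plain Python and is not the code from this PR; `is_numeric`, `to_list_float`, and `to_list_list_float` are hypothetical stand-ins, and the numeric check used here (int or float, excluding bool) is an assumption rather than the exact rule in `TypeConverters._is_numeric`.

```python
def is_numeric(value):
    # Assumption: treat int and float as numeric, but not bool.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def to_list_float(value):
    """Convert a list or tuple of numbers to a list of floats, or raise TypeError."""
    if isinstance(value, (list, tuple)) and all(is_numeric(v) for v in value):
        return [float(v) for v in value]
    raise TypeError("Could not convert %r to list of floats" % (value,))


def to_list_list_float(value):
    """Convert a list of lists of numbers to a list of lists of floats.

    Each inner list goes through to_list_float, so every entry is checked
    for being numeric, just as in the flat case.
    """
    if isinstance(value, (list, tuple)):
        return [to_list_float(inner) for inner in value]
    raise TypeError("Could not convert %r to list of list of floats" % (value,))


converted = to_list_list_float([[1, 2.5], (3,)])
```

Reusing the flat converter for each inner list keeps the two checks from drifting apart, which is one way to answer the question above.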

viirya · 2017-12-18T08:29:25Z

This needs an individual JIRA. @MLnick created SPARK-22797 for this. Please use it.

zhengruifeng · 2017-12-19T02:09:47Z

@viirya Thanks a lot for reviewing this! I will update the title to use the new ticket.

SparkQA · 2017-12-19T03:33:40Z

Test build #85087 has finished for PR 19892 at commit e1fb379.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-19T08:06:06Z

Note: there is work in progress to change this behavior to throw an exception instead of logging a warning. We should change this document later.

@holdenk @zhengruifeng this comment will need to be changed as per #19993 - but that has not been merged yet. I think #19993 will block 2.3 though, so we could preemptively change the doc here to match the Scala side in #19993 about throwing an exception.

viirya · 2017-12-19T08:13:54Z

One minor comment, otherwise LGTM.

zhengruifeng · 2017-12-29T06:23:46Z

ping @MLnick ?

SparkQA · 2018-01-12T02:15:26Z

Test build #86009 has finished for PR 19892 at commit 4b2fc6f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-01-16T08:16:35Z

Perhaps it would be cleaner to do a df.show() here? Likewise above for bucketed we could change that part of the doctest too.

MLnick · 2018-01-16T08:17:10Z

values1 & values2?

MLnick · 2018-01-16T08:21:28Z

We need a test case in ParamTypeConversionTests for this new method; see test_list_float for reference.

SparkQA · 2018-01-16T10:08:41Z

Test build #86168 has finished for PR 19892 at commit e869e75.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-16T10:13:44Z

Test build #86169 has finished for PR 19892 at commit 9f20f5c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2018-01-16T10:16:10Z

@MLnick Thanks for your review and suggestions. I have updated this PR.

MLnick

One minor comment on the updated doctest.

I don't think this will make it into 2.3, given the code freeze and that the branch has already been cut. In that case we will need to change the @since tags.

Pending the error throwing PR for Scala Bucketizer, we can update the doc here.

MLnick · 2018-01-16T10:41:53Z

-    >>> bucketed[3].buckets
-    2.0
+    ...     inputCol="values1", outputCol="buckets")
+    >>> bucketed = bucketizer.setHandleInvalid("keep").transform(df)


It may actually be neater to show only values1 and bucketed - so perhaps .transform(df.select('values1'))?

SparkQA · 2018-01-16T11:24:50Z

Test build #86170 has finished for PR 19892 at commit 734db50.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-01-19T04:08:08Z

I mean I think it might have a chance; generally speaking we've allowed outstanding PRs to be merged after the freeze. Since there are outstanding blockers on the branch preventing us from cutting RC2, maybe it's ok to move forward if we can do it quickly? Of course I defer to MLnick :)

MLnick · 2018-01-19T05:51:36Z

I’m generally ok with these small python api wrapper additions getting merged as long as the risk of breaking anything is low - and here it is since it’s just api parity

holdenk

LGTM and I'll merge today (Australia time) if there are no objections. (note: this means also waiting on @MLnick switching from -1 to approve).

MLnick · 2018-01-22T07:36:55Z

If it is going to get merged to branch-2.3 the since tags need to be 2.3.0 again

SparkQA · 2018-01-22T08:23:36Z

Test build #86464 has finished for PR 19892 at commit 014fb08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-01-22T08:36:21Z

@MLnick you ok with this then?

MLnick · 2018-01-22T10:44:36Z

@holdenk everything except my comment in #19892 (comment). I'd propose to just preemptively update the doc about an exception being thrown.

SparkQA · 2018-01-23T07:38:24Z

Test build #86519 has finished for PR 19892 at commit ad5d81d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2018-01-23T09:28:16Z

RC2 has been cut - @jkbradley do you see #19993 as a blocker? I think it should be merged for 2.3. And also there are QA JIRAs (sub-tasks of SPARK-23105) that are blockers that are not reflected in the list of blockers for 2.3 as they are not targeted.

MLnick · 2018-01-26T10:28:41Z

Merged to master / branch-2.3. Thanks!

## What changes were proposed in this pull request?

Bucketizer support multi-column in the python side

## How was this patch tested?

existing tests and added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19892 from zhengruifeng/20542_py.

(cherry picked from commit c22eaa9)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>

MLnick · 2018-01-26T21:53:14Z

I reverted this (see #20410 for details) - we can re-open it once that issue is solved.

viirya · 2018-04-19T10:43:50Z

Can this be re-opened now?

sherrysk8r · 2019-09-09T19:30:58Z

Has there been any update on re-opening this? And also adding multiple column support to QuantileDiscretizer?

zhengruifeng · 2019-09-16T06:51:50Z

I am sorry, I am afraid I cannot re-open this PR because I deleted it by mistake.
I have just created another PR, #25801, to continue.

viirya reviewed Dec 18, 2017

Comment thread python/pyspark/ml/feature.py Outdated

viirya reviewed Dec 18, 2017

zhengruifeng changed the title ~~[SPARK-20542][FollowUp][PySpark] Bucketizer support multi-column~~ [SPARK-22797][PySpark] Bucketizer support multi-column Dec 19, 2017

zhengruifeng force-pushed the 20542_py branch from 5efc94e to e1fb379 Compare December 19, 2017 03:08

viirya reviewed Dec 19, 2017

zhengruifeng force-pushed the 20542_py branch from e1fb379 to 4b2fc6f Compare January 12, 2018 01:49

MLnick suggested changes Jan 16, 2018

zhengruifeng added 7 commits January 16, 2018 17:18

create pr

1646620

update

51d5cfa

update style

39a888f

update style 2

6f82831

update doc

248954f

update toListListFloat

76de8e6

update tests

e869e75

zhengruifeng force-pushed the 20542_py branch from 4b2fc6f to e869e75 Compare January 16, 2018 09:46

MLnick suggested changes Jan 16, 2018

update tests and since

734db50

holdenk approved these changes Jan 19, 2018

revert to 2.3.0

014fb08

update doc

ad5d81d

asfgit closed this in c22eaa9 Jan 26, 2018

MLnick mentioned this pull request Jan 26, 2018

[SPARK-23234][ML][PYSPARK] Remove setting defaults on Java params #20410

Closed

BryanCutler mentioned this pull request Jan 26, 2018

[SPARK-23238][SQL] Externalize SQLConf configurations exposed in documentation #20403

Closed

viirya mentioned this pull request Apr 19, 2018

[SPARK-20542][ML][SQL] Add an API to Bucketizer that can bin multiple columns #17819

Closed

zhengruifeng mentioned this pull request Sep 16, 2019

[SPARK-22797][ML][PYTHON] Bucketizer support multi-column #25801

Closed
