Skip to content

[SPARK-22797][PySpark] Bucketizer support multi-column#19892

Closed
zhengruifeng wants to merge 11 commits into
apache:masterfrom
zhengruifeng:20542_py
Closed

[SPARK-22797][PySpark] Bucketizer support multi-column#19892
zhengruifeng wants to merge 11 commits into
apache:masterfrom
zhengruifeng:20542_py

Conversation

@zhengruifeng

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Bucketizer support multi-column in the python side

How was this patch tested?

existing tests and added tests

@SparkQA

SparkQA commented Dec 5, 2017

Copy link
Copy Markdown

Test build #84478 has finished for PR 19892 at commit 5ed91fd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2017

Copy link
Copy Markdown

Test build #84479 has finished for PR 19892 at commit 906a81d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2017

Copy link
Copy Markdown

Test build #84480 has finished for PR 19892 at commit 5efc94e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

This PR is currently blocked by #19894 (comment)

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 14, 2017

Copy link
Copy Markdown

Test build #84898 has finished for PR 19892 at commit 5efc94e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 14, 2017

Copy link
Copy Markdown

Test build #84900 has finished for PR 19892 at commit 5efc94e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

ping @holdenk , can you help reviewing this?

Comment thread python/pyspark/ml/feature.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this doc too?

Comment thread python/pyspark/ml/param/__init__.py Outdated

@viirya viirya Dec 18, 2017

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toListFloat requires each list entry is a numeric (TypeConverters._is_numeric). Should toListListFloat have such requirement?

@viirya

viirya commented Dec 18, 2017

Copy link
Copy Markdown
Member

This needs an individual JIRA. @MLnick created SPARK-22797 for this. Please use it.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

@viirya Thanks a lot for reviewing this! I will update the title to use the new ticket.

@zhengruifeng zhengruifeng changed the title [SPARK-20542][FollowUp][PySpark] Bucketizer support multi-column [SPARK-22797][PySpark] Bucketizer support multi-column Dec 19, 2017
@SparkQA

SparkQA commented Dec 19, 2017

Copy link
Copy Markdown

Test build #85087 has finished for PR 19892 at commit e1fb379.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread python/pyspark/ml/feature.py Outdated

@viirya viirya Dec 19, 2017

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: there is a work to change this behavior to throw an exception, instead of a log warning. We should change this document later.

@MLnick MLnick Jan 22, 2018

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk @zhengruifeng this comment will need to be changed as per #19993 - but that has not been merged yet. I think #19993 will block 2.3 though, so we could preemptively change the doc here to match the Scala side in #19993 about throwing an exception.

@viirya

viirya commented Dec 19, 2017

Copy link
Copy Markdown
Member

One minor comment, otherwise LGTM.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

ping @MLnick ?

@SparkQA

SparkQA commented Jan 12, 2018

Copy link
Copy Markdown

Test build #86009 has finished for PR 19892 at commit 4b2fc6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread python/pyspark/ml/feature.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be cleaner to do a df.show() here? Likewise above for bucketed we could change that part of the doctest too.

Comment thread python/pyspark/ml/feature.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values1 & values2?

Comment thread python/pyspark/ml/param/__init__.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test case in ParamTypeConversionTests for this new method; see test_list_float for reference.

@SparkQA

SparkQA commented Jan 16, 2018

Copy link
Copy Markdown

Test build #86168 has finished for PR 19892 at commit e869e75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2018

Copy link
Copy Markdown

Test build #86169 has finished for PR 19892 at commit 9f20f5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

@MLnick Thanks for your reviewing and suggestions. I have updated this PR

@MLnick MLnick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment on the updated doctest.

I don't think this will make it into 2.3 given the code freeze and branch has been cut already. In which case we will need to change the @since tags.

Pending the error throwing PR for Scala Bucketizer, we can update the doc here.

Comment thread python/pyspark/ml/feature.py Outdated
>>> bucketed[3].buckets
2.0
... inputCol="values1", outputCol="buckets")
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may actually be neater to show only values1 and bucketed - so perhaps .transform(df.select('values1'))?

@SparkQA

SparkQA commented Jan 16, 2018

Copy link
Copy Markdown

Test build #86170 has finished for PR 19892 at commit 734db50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk

holdenk commented Jan 19, 2018

Copy link
Copy Markdown
Contributor

I mean I think it might have a chance, generally speaking we've allowed outstanding PRs to be merged after the freeze. Since there are outstanding blockers on the branch preventing us from cutting RC2 maybe its ok to move forward if we can do it quickly? Of course I defer to MLNick :)

@MLnick

MLnick commented Jan 19, 2018 via email

Copy link
Copy Markdown
Contributor

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and I'll merge today (Australia time) if there are no objections. (note: this means also waiting on @MLnick switching from -1 to approve).

@MLnick

MLnick commented Jan 22, 2018

Copy link
Copy Markdown
Contributor

If it is going to get merged to branch-2.3 the since tags need to be 2.3.0 again

@SparkQA

SparkQA commented Jan 22, 2018

Copy link
Copy Markdown

Test build #86464 has finished for PR 19892 at commit 014fb08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk

holdenk commented Jan 22, 2018

Copy link
Copy Markdown
Contributor

@MLnick you ok with this then?

@MLnick

MLnick commented Jan 22, 2018

Copy link
Copy Markdown
Contributor

@holdenk everything except my comment in #19892 (comment). I'd propose to just preemptively update the doc about an exception being thrown.

@SparkQA

SparkQA commented Jan 23, 2018

Copy link
Copy Markdown

Test build #86519 has finished for PR 19892 at commit ad5d81d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick

MLnick commented Jan 23, 2018

Copy link
Copy Markdown
Contributor

RC2 has been cut - @jkbradley do you see #19993 as a blocker? I think it should be merged for 2.3. And also there are QA JIRAs (sub-tasks of SPARK-23105) that are blockers that are not reflected in the list of blockers for 2.3 as they are not targeted.

@MLnick

MLnick commented Jan 26, 2018

Copy link
Copy Markdown
Contributor

Merged to master / branch-2.3. Thanks!

asfgit pushed a commit that referenced this pull request Jan 26, 2018
## What changes were proposed in this pull request?
Bucketizer support multi-column in the python side

## How was this patch tested?
existing tests and added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19892 from zhengruifeng/20542_py.

(cherry picked from commit c22eaa9)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@asfgit asfgit closed this in c22eaa9 Jan 26, 2018
@MLnick

MLnick commented Jan 26, 2018

Copy link
Copy Markdown
Contributor

I reverted this (see #20410 for details) - we can re-open it once that issue is solved.

@viirya

viirya commented Apr 19, 2018

Copy link
Copy Markdown
Member

Can this be re-open now?

@sherrysk8r

Copy link
Copy Markdown

Has there been any update on re-opening this? And also adding multiple column support to QuantileDiscretizer?

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

I am sorry, I am afread I can not re-open this PR beacuse I deleted it by mistake.
I just create another PR #25801 to continue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants