Skip to content

[SPARK-15064][ML] Locale support in StopWordsRemover#12968

Closed
burakkose wants to merge 3 commits into
apache:masterfrom
burakkose:15064
Closed

[SPARK-15064][ML] Locale support in StopWordsRemover#12968
burakkose wants to merge 3 commits into
apache:masterfrom
burakkose:15064

Conversation

@burakkose

@burakkose burakkose commented May 6, 2016

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

  • add locale support for the case insensitive operation
  • add a new Python test
  • change filter to filterNot

How was this patch tested?

unit tests

* Default: English locale ("en")
* @group param
*/
val locale: Param[String] = new Param[String](this, "locale",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, shouldn't all this perhaps be linked to the stopwords set? if you loaded the French stopwords you'd want the French locale always?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but, How can we know that users loaded the French stopwords? User can load stopwords by
StopWordsRemover.loadDefaultStopWords("french")
and setting is
new StopWordsRemover().setStopWords(stopWords)
. Do you have any suggestion about that case?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For supported languages, we can know the appropriate locale and maintain an internal mapping. So "french" is known to map to Locale.FRENCH. For loading an arbitrary list, we don't know, but you could provide an overload where you provide a Locale.

@HyukjinKwon

HyukjinKwon commented May 7, 2016

Copy link
Copy Markdown
Member

(@burakkose I think the cc:@mengxr can be left in comments not in the PR description because I guess cc for someone to review may not be the part of the PR itself. @ will be removed by merging script anyway but cc: and mengxr in the PR summary. I am a bit careful about this because it seems (I was told) some committers agree with this and others do not. but strictly I think it is right.)

@burakkose

Copy link
Copy Markdown
Contributor Author

@HyukjinKwon, thank you for informing. Yes, you're right.

setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
/**
* Locale for doing a case sensitive comparison
* Default: English locale ("en")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we list what're the available options, or provide some reference here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done

@hhbyyh

hhbyyh commented May 9, 2016

Copy link
Copy Markdown
Contributor

Made a pass. That's all from me.

@hhbyyh

hhbyyh commented May 13, 2016

Copy link
Copy Markdown
Contributor

This is blocking user guide /examples update for 2.0.

burakkose added 2 commits May 13, 2016 09:28
# Conflicts:
#	mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
@burakkose

Copy link
Copy Markdown
Contributor Author

Can you specify the blocking?

} else {
// TODO: support user locale (SPARK-15064)
val toLower = (s: String) => if (s != null) s.toLowerCase else s
val loadedLocale = StopWordsRemover.loadLocale($(locale))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just new Locale($(locale))

@hhbyyh

hhbyyh commented May 23, 2016

Copy link
Copy Markdown
Contributor

I'm not sure if this will be shipped with Spark 2.0. If yes, we should update user guide accordingly.

/**
* Locale for doing a case sensitive comparison
* Default: English locale ("en")
* @see [[http://www.localeplanet.com/java/]]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please link to the official Java doc: https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html or the Locale class.

@holdenk

holdenk commented Oct 7, 2016

Copy link
Copy Markdown
Contributor

@burakkose is this something you are still working on? If so can you update it to master and look at @mengxr's comments - if not interested in working on it anymore no worries.

@AmplabJenkins

Copy link
Copy Markdown

Can one of the admins verify this patch?

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017
@srowen srowen mentioned this pull request Jan 1, 2017
@asfgit asfgit closed this in ba48812 Jan 2, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#12968
Closes apache#16215
Closes apache#16212
Closes apache#16086
Closes apache#15713
Closes apache#16413
Closes apache#16396

Author: Sean Owen <sowen@cloudera.com>

Closes apache#16447 from srowen/CloseStalePRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants