[SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initializaiton #8526

Closed
mengxr wants to merge 1 commit into apache:master from mengxr:SPARK-10354

@mengxr mengxr commented Aug 30, 2015

Contributor
  • do not cache first cost RDD
  • change following cost RDD cache level to MEMORY_AND_DISK
  • remove Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329

cc: @yu-iskw @hujiayin
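
As a rough illustration of the first two bullets, here is a minimal Python sketch. It is not the Spark code from this patch: plain lists stand in for RDDs, a single run is assumed, and every function name is hypothetical. Under those assumptions, the first cost array is just a constant (infinity per point), so it is trivial to regenerate, while each later array is derived from the previous one and the newly chosen centers, which is the kind of intermediate result one might want to persist.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def initial_costs(points):
    """First cost array: one infinity per point, trivially recomputable."""
    return [math.inf for _ in points]


def update_costs(points, prev_costs, new_centers):
    """Later cost arrays: min of the previous cost and the squared
    distance to the nearest newly chosen center (new_centers must be
    non-empty)."""
    return [
        min(prev, min(squared_distance(p, c) for c in new_centers))
        for p, prev in zip(points, prev_costs)
    ]


# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
costs = initial_costs(points)
costs = update_costs(points, costs, [(0.0, 0.0)])   # first round
costs = update_costs(points, costs, [(5.0, 4.0)])   # second round
```

Note that the costs are stored as bare floats rather than as wrapper objects, which loosely mirrors the third bullet.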

Contributor Author

not calling BLAS here because runs == 1 in most cases
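
A minimal sketch of the trade-off this comment describes, written in Python rather than the Scala of the patch. It assumes each per-point cost is an array of length `runs`; when `runs` is 1, a plain element-wise loop is enough to sum them and a vectorized library call buys little. The name `add_in_place` and the sample values are hypothetical.

```python
from functools import reduce


def add_in_place(acc, values):
    """Element-wise acc += values using a plain loop."""
    for r in range(len(acc)):
        acc[r] += values[r]
    return acc


runs = 1  # assumed typical case, per the comment above
per_point_costs = [[4.0], [1.0], [0.25]]  # hypothetical costs, one array per point

# Fold all per-point cost arrays into a single length-`runs` accumulator.
total = reduce(add_in_place, per_point_costs, [0.0] * runs)
```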

SparkQA commented Aug 30, 2015

Test build #41801 has finished for PR 8526 at commit 71db540.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaTrainValidationSplitExample
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode

hujy commented Aug 31, 2015

Contributor

For the data size below, the fix reduces RDD storage by around 50 GB, and performance improves. With this data size, the user needs more than 8 GB of memory to run k-means in Spark 1.5. The data size:
Number of clusters: 5
Sample dimensions: 20
Number of samples: 1200000000
Samples per input file: 40000000
K: 10
Convergence distance: 0.5
Max iterations: 10
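
For a sense of scale, a back-of-envelope calculation from the figures above. It assumes dense 8-byte doubles and `runs = 1`, and it ignores all JVM object and serialization overhead, so it gives raw lower bounds only and does not attempt to reproduce the 50 GB figure.

```python
num_samples = 1_200_000_000
dims = 20
samples_per_file = 40_000_000
bytes_per_double = 8   # assumption: dense double-precision features
runs = 1               # assumption: a single k-means run

num_files = num_samples // samples_per_file

# Raw payload of the feature data and of one per-point cost array.
raw_data_gb = num_samples * dims * bytes_per_double / 1e9
raw_cost_gb = num_samples * runs * bytes_per_double / 1e9

print(f"{num_files} input files")
print(f"raw feature data: {raw_data_gb:.1f} GB")
print(f"one raw cost array: {raw_cost_gb:.1f} GB")
```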

hujy commented Aug 31, 2015

Contributor

LGTM

asfgit pushed a commit that referenced this pull request Aug 31, 2015
…nitializaiton

* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329

cc: yu-iskw HuJiayin

Author: Xiangrui Meng <meng@databricks.com>

Closes #8526 from mengxr/SPARK-10354.

(cherry picked from commit f0f563a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
asfgit closed this in f0f563a Aug 31, 2015
mengxr changed the title from "[SPARK-100354] [MLLIB] fix some apparent memory issues in k-means|| initializaiton" to "[SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initializaiton" Aug 31, 2015

mengxr commented Aug 31, 2015

Contributor Author

@hujiayin Thanks for testing! Merged into master, branch-1.5, 1.4, and 1.3.