[SPARK-5955][MLLIB] add checkpointInterval to ALS#5076
Test build #28739 has finished for PR 5076 at commit
I've seen the first point before and thus I'm +1 for this change.
I kinda forget how checkpoint gets executed here. Is this count necessary? Or this is for caching?
Ah, for implicit preference, this is not necessary because we are computing YtY anyway.
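For anyone following the question above about whether the `count` is needed: `RDD.checkpoint()` only marks an RDD, and nothing is written until a job runs on it. Here is a minimal sketch of that behaviour, not the code in this PR. The object name, the helper `checkpointStates`, and the directory `/tmp/spark-checkpoints` are made up for illustration, and it assumes a local Spark installation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: names and paths here are hypothetical, not from ALS.scala.
object LazyCheckpointSketch {

  // Returns isCheckpointed as observed (before the action, after the action).
  def checkpointStates(sc: SparkContext): (Boolean, Boolean) = {
    // Hypothetical local directory; on a cluster this should be a reliable store such as HDFS.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Stand-in for a factor RDD; cached so the checkpoint job does not recompute it.
    val factors = sc.parallelize(1 to 1000).map(i => (i, i * 0.5)).cache()

    factors.checkpoint()                 // only marks the RDD for checkpointing
    val before = factors.isCheckpointed  // no job has run on the RDD yet

    factors.count()                      // an action: materializes the RDD and writes the checkpoint
    val after = factors.isCheckpointed

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyCheckpointSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = checkpointStates(sc)
      println(s"checkpointed before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

Any action on the marked RDD plays the role of `count()` here, which is why the implicit-preference path, where YtY is computed anyway, does not need an extra one.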
Test build #28801 has finished for PR 5076 at commit
test this please
Test build #28820 has finished for PR 5076 at commit
Test build #28903 has finished for PR 5076 at commit
test this please
Test build #28908 has finished for PR 5076 at commit
LGTM!
Thanks! Merged into master.
Add checkpointInterval to ALS to prevent:
1. StackOverflow exceptions caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some nodes fail.

srowen coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:

df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS

(cherry picked from commit 6b36470)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

Conflicts:
	mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
Merged this into branch-1.3 as well because this helps with scalability.
Hi guys, First of all, thank you for developing Spark and making it open source. I'm new to Spark and Scala, and I'm working on a project involving matrix factorization in Spark. I have a problem running ALS: it fails with a StackOverflowError, which, according to comments I found online, is caused by a long lineage chain. One suggestion is to use setCheckpointInterval so that the RDDs are checkpointed every 10-20 iterations, which prevents the error. I would like to ask for details on how to do checkpointing with ALS. I am using spark-kernel developed by IBM (https://github.com/ibm-et/spark-kernel) instead of spark-shell. Here are some of my specific questions regarding details on checkpoint:
Thanks a lot!
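A rough sketch of how checkpointing with ALS could look, in case it helps. It assumes the RDD-based `org.apache.spark.mllib.recommendation.ALS` in a Spark version that exposes `setCheckpointInterval` there (1.4 or later, as far as I know; this PR itself touches the `ml` ALS). The directory `/tmp/als-checkpoints`, the helper names, and the toy ratings are invented for illustration. In spark-kernel or spark-shell a `SparkContext` is normally already available as `sc`, so only the body of `train` would be needed there.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Illustrative sketch only: helper names, paths, and data are hypothetical.
object ALSCheckpointSketch {

  // Small synthetic ratings set so the example is self-contained.
  def toyRatings(sc: SparkContext): RDD[Rating] = {
    val data = for {
      user <- 0 until 20
      item <- 0 until 20
      if (user + item) % 3 != 0
    } yield Rating(user, item, ((user * item) % 5 + 1).toDouble)
    sc.parallelize(data)
  }

  def train(sc: SparkContext, ratings: RDD[Rating]): MatrixFactorizationModel = {
    // The checkpoint directory must be set on the SparkContext first;
    // without it the checkpoint interval is ignored.
    // Hypothetical local path; use a reliable store such as HDFS on a cluster.
    sc.setCheckpointDir("/tmp/als-checkpoints")

    new ALS()
      .setRank(5)
      .setIterations(30)
      .setLambda(0.01)
      .setCheckpointInterval(10) // checkpoint the factor RDDs every 10 iterations
      .run(ratings)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ALSCheckpointSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val model = train(sc, toyRatings(sc))
      println(s"trained model with rank ${model.rank}")
    } finally {
      sc.stop()
    }
  }
}
```

The interval trades checkpoint I/O against lineage length: a smaller value writes more often but keeps the lineage shorter, which is the part that matters for the StackOverflowError.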
Add checkpointInterval to ALS to prevent:
@srowen @coderxiang