Skip to content

[SPARK-18890][CORE] Move task serialization from the TaskSetManager to the CoarseGrainedSchedulerBackend#15505

Closed
witgo wants to merge 14 commits into
apache:masterfrom
witgo:SPARK-17931
Closed

[SPARK-18890][CORE] Move task serialization from the TaskSetManager to the CoarseGrainedSchedulerBackend#15505
witgo wants to merge 14 commits into
apache:masterfrom
witgo:SPARK-17931

Conversation

@witgo

@witgo witgo commented Oct 16, 2016

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Performance Testing:

The code:

val rdd = sc.parallelize(0 until 100).repartition(100000)
rdd.localCheckpoint().count()
rdd.sum()
(1 to 10).foreach{ i=>
  val serializeStart = System.currentTimeMillis()
  rdd.sum()
  val serializeFinish = System.currentTimeMillis()
  println(f"Test $i: ${(serializeFinish - serializeStart) / 1000D}%1.2f")
}

and spark-defaults.conf file:

spark.master                                      yarn-client
spark.executor.instances                          20
spark.driver.memory                               64g
spark.executor.memory                             30g
spark.executor.cores                              5
spark.default.parallelism                         100 
spark.sql.shuffle.partitions                      100
spark.serializer                                  org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize                        0
spark.ui.enabled                                  false 
spark.driver.extraJavaOptions                     -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M 
spark.executor.extraJavaOptions                   -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M 
spark.cleaner.referenceTracking.blocking          true
spark.cleaner.referenceTracking.blocking.shuffle  true

The test results are as follows:

SPARK-17931 db0ddce
9.427 s 9.566 s

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Oct 16, 2016

Copy link
Copy Markdown

Test build #67033 has finished for PR 15505 at commit b51d00c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 16, 2016

Copy link
Copy Markdown

Test build #67035 has finished for PR 15505 at commit 771949e.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo witgo changed the title [WIP][SPARK-17931]taskScheduler has some unneeded serialization [SPARK-17931]taskScheduler has some unneeded serialization Oct 17, 2016
@SparkQA

SparkQA commented Oct 17, 2016

Copy link
Copy Markdown

Test build #67062 has finished for PR 15505 at commit bee165a.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo witgo changed the title [SPARK-17931]taskScheduler has some unneeded serialization [SPARK-17931][CORE] taskScheduler has some unneeded serialization Oct 17, 2016
@SparkQA

SparkQA commented Oct 17, 2016

Copy link
Copy Markdown

Test build #67056 has finished for PR 15505 at commit 8a6062d.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy

wzhfy commented Oct 18, 2016

Copy link
Copy Markdown
Contributor

There are many unnecessary changes, can you recover them to minimize diff? That'll be easier for others to review. :)

@witgo

witgo commented Oct 18, 2016

Copy link
Copy Markdown
Contributor Author

@wzhfy
Ok, the code has been modified

@SparkQA

SparkQA commented Oct 18, 2016

Copy link
Copy Markdown

Test build #67109 has finished for PR 15505 at commit d956ff5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2016

Copy link
Copy Markdown

Test build #67110 has finished for PR 15505 at commit ca9da40.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2016

Copy link
Copy Markdown

Test build #67126 has finished for PR 15505 at commit 80eed8f.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2016

Copy link
Copy Markdown

Test build #67159 has finished for PR 15505 at commit 589f3bb.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo

witgo commented Oct 19, 2016

Copy link
Copy Markdown
Contributor Author

cc @rxin

@witgo witgo changed the title [SPARK-17931][CORE] taskScheduler has some unneeded serialization [WIP][SPARK-17931][CORE] taskScheduler has some unneeded serialization Oct 19, 2016
@SparkQA

SparkQA commented Oct 19, 2016

Copy link
Copy Markdown

Test build #67179 has finished for PR 15505 at commit 84488c4.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2016

Copy link
Copy Markdown

Test build #67184 has finished for PR 15505 at commit 8a6b37e.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2016

Copy link
Copy Markdown

Test build #67242 has finished for PR 15505 at commit 07b0581.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo witgo changed the title [WIP][SPARK-17931][CORE] taskScheduler has some unneeded serialization [SPARK-17931][CORE] taskScheduler has some unneeded serialization Nov 4, 2016
@SparkQA

SparkQA commented Nov 4, 2016

Copy link
Copy Markdown

Test build #68141 has finished for PR 15505 at commit ace0114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin

rxin commented Nov 4, 2016

Copy link
Copy Markdown
Contributor

This is pretty big and will miss 2.1.

But cc @kayousterhout / @squito / @JoshRosen

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, I will change it back.

@witgo witgo Nov 7, 2016

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wzhfy The source of this parameter is sc.addedFiles and sc.addedJars, Their types are mutable.Map[String, Long] , This change is reasonable.

@witgo

witgo commented Nov 21, 2016

Copy link
Copy Markdown
Contributor Author

ping @kayousterhout / @squito / @JoshRosen

@squito squito left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a very brief review -- the idea here makes a lot of sense. The only big problem I see with the patch now is the tests that have been eliminated, some of those we definitely need to bring back. But I need to do a longer pass to think about how the pieces fit together.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: double indent (4 spaces) for constructor params

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: each parameter on its own line (even the first one), and double indent all params

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't look like there is any replacement for this test, right? We certainly want to keep this check in some form.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
For rdd and closures, there are already related test cases, see DAGSchedulerSuite.scala#L506
For task, user can use a custom partition, and the partition instance may not be serializable. This has no test cases I will add one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case has been added.

@squito squito Dec 19, 2016

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand right, you are saying there are now test cases which cover all of the little individual pieces inside a Task, so we don't need an explicit test for the serializing the entire task.

I'd still prefer a test for serializing the entire task (to prevent future regressions), but I guess its hard to do b/c now the task actually gets serialized by the SchedulerBackends.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing on replacements for these tests. we definitely want to keep the test on not-serializability. The others should probably stay as well, unless there is some reason why its extremely hard to do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ShuffleMapTask and ResultTask, since they do not contain rdd instance, they are very small.
Detection of the size of the serialized RDD should be more reasonable. I added the corresponding code in the DAGScheduler class, but did not add the test case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a stretch, but Task could be big, right? Eg. if there was a long chain of partitions, and for some reason they had a lot of data? That seems unlikely, but I also have no idea if it ever happens or not. It seems easy enough to add the check back in -- is there any reason not to?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree with @squito that we should keep this test / check. I've seen many huge task sizes due to a developer mistake, so think we should absolutely keep warning about it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will add this test case back.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: with so many args, all with pretty generic types, its really helpful to name each one, eg

TaskDescription(
  taskId=1L,
  attemptNumber = 0,
...

(here and elsewhere)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I'll change it

@SparkQA

SparkQA commented Mar 1, 2017

Copy link
Copy Markdown

Test build #73694 has finished for PR 15505 at commit 335b7b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2017

Copy link
Copy Markdown

Test build #73721 has finished for PR 15505 at commit b2b1eec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo

witgo commented Mar 2, 2017

Copy link
Copy Markdown
Contributor Author

@kayousterhout It takes some time to update the test report.

@kayousterhout

Copy link
Copy Markdown
Contributor

@witgo OK I'll hold off on doing another pass on the code until you have the test results.

@witgo

witgo commented Mar 2, 2017

Copy link
Copy Markdown
Contributor Author

@kayousterhout
Test results have been updated:

SPARK-17931 db0ddce
9.427 s 9.566 s

@witgo

witgo commented Mar 2, 2017

Copy link
Copy Markdown
Contributor Author

I don't know which PR causes the run time of this test case to be reduced from 21.764 s to 9.566 s.

@kayousterhout

Copy link
Copy Markdown
Contributor

@witgo I don't think the ~1.5% improvement in runtime merits the added complexity of this change. I could be convinced to merge this if it simplified the code or the ability to reason about the code, but unfortunately this makes things somewhat more complicated because of the new logic about aborting task sets.

@mridulm

mridulm commented Mar 3, 2017

Copy link
Copy Markdown
Contributor

@kayousterhout I am surprised it is not more, but I agree that the added complexity for such low returns is not worth it.

@witgo

witgo commented Mar 3, 2017

Copy link
Copy Markdown
Contributor Author

Yes, maybe a multithreaded serialization task code can have a better performance, let me close the PR

@witgo witgo closed this Mar 3, 2017
@witgo

witgo commented Mar 3, 2017

Copy link
Copy Markdown
Contributor Author

SPARK-18890_20170303 `s code is older but the test case running time is 5.2 s

@libratiger

libratiger commented May 16, 2017

Copy link
Copy Markdown

I agree with Kay that putting in a smaller change first is better, assuming it still has the performance gains. That doesn't preclude any further optimizations that are bigger changes.

I'm a little surprised that the serializing tasks has much of an impact, given how little data is getting serialized. But if it really is, I feel like there is a much bigger optimization we're completely missing. Why are we repeating the work of serialization for each task in a taskset? The serialized data is almost exactly the same for every task. they only differ in the partition id (an int) and the preferred locations (which aren't even used by the executor at all).

Task serialization already leverages the idea of having info across all the tasks in the Broadcast for the task binary. We just need to use that same idea for all the rest of the task data that is sent to the executor. Then the only difference between the serialized task data sent to executors is the int for the partitionId. You'd serialize into a bytebuffer once, and then your per-task "serialization" becomes copying the buffer and modifying that int directly.

@squito I like this idea very much. I just encounte the de-serialization time is too long (about more than 10s for some tasks). Is there any PR try to solve this? If there is no related PR, I would like open an issue and try to solve this.:-D

@squito

squito commented May 30, 2017

Copy link
Copy Markdown
Contributor

@djvulee this is https://issues.apache.org/jira/browse/SPARK-19108. Note that there is some discussion there about this being a bit harder than what I originally thought, though I think its still worth exploring where task serialization is an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants