Skip to content

[SPARK-31197][CORE] Shutdown executor once we are done decommissioning#29211

Closed
holdenk wants to merge 20 commits into
apache:masterfrom
holdenk:SPARK-31197-exit-execs-redone
Closed

[SPARK-31197][CORE] Shutdown executor once we are done decommissioning#29211
holdenk wants to merge 20 commits into
apache:masterfrom
holdenk:SPARK-31197-exit-execs-redone

Conversation

@holdenk

@holdenk holdenk commented Jul 23, 2020

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Exit the executor when it has been asked to decommission and there is nothing left for it to do.

This is a rebase of #28817

Why are the changes needed?

If we want to use decommissioning in Spark's own scale down we should terminate the executor once finished.
Furthermore, in graceful shutdown it makes sense to release resources we no longer need if we've been asked to shutdown by the cluster manager instead of always holding the resources as long as possible.

Does this PR introduce any user-facing change?

The decommissioned executors will exit and the end of decommissioning. This is sort of a user facing change, however decommissioning hasn't been in any releases yet.

How was this patch tested?

I changed the unit test to not send the executor exit message and still wait on the executor exited message.

@SparkQA

SparkQA commented Jul 23, 2020

Copy link
Copy Markdown

Test build #126436 has started for PR 29211 at commit 525b335.

@shaneknapp

Copy link
Copy Markdown
Contributor

test this please

@holdenk

holdenk commented Jul 23, 2020

Copy link
Copy Markdown
Contributor Author

cc @attilapiros & @agrawaldevesh

maybeCleanupApplication(id)

case DecommissionSelf =>
case WorkerDecommission(_, _) =>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should this be DecommissionWorker ? That sounds more like a command to me.

Whereas WorkerDecommissioned sounds like a state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong preferences (although this doesn't have ed at the end so I don't think it's implying a state).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most other command messages start with a verb, while the 'state updated' ones are in past tense. So I think we should rename this to DecommissionWorker to match that.

Comment thread core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala Outdated
// Just replicate blocks as fast as we can during testing, there isn't another
// workload we need to worry about.
.set(config.STORAGE_DECOMMISSION_REPLICATION_REATTEMPT_INTERVAL, 1L)
.set(config.STORAGE_DECOMMISSION_REPLICATION_REATTEMPT_INTERVAL, 10L)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this changed to 10 seconds ? If it is related to this PR (as opposed to just reducing flakyness), then please document that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just reduced the log spam in debugging this test but still allowed the test to complete quickly.


import org.mockito.{ArgumentMatchers => mc}
import org.mockito.Mockito.{mock, times, verify, when}
import org.mockito.Mockito.{atLeast => least, mock, times, verify, when}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is atLeast renamed as least ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theres something else which is somehow imported and named atLeast (I suspect from ScalaTest?) and I needed to rename it or the compiler got confused.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a bit more poking, it's not scalatest. It's easier just to rename to avoid the conflict.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126437 has finished for PR 29211 at commit ba51033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126440 has finished for PR 29211 at commit ba51033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Outdated
* Returns the last migration time and a boolean for if all blocks have been migrated.
* If there are any tasks running since that time the boolean may be incorrect.
*/
private[spark] def lastMigrationInfo(): (Long, Boolean) = {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool !. I am always anal about logging to a fault :-P

@holdenk holdenk force-pushed the SPARK-31197-exit-execs-redone branch from fea00dd to 12e9e23 Compare July 24, 2020 01:31

@agrawaldevesh agrawaldevesh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

maybeCleanupApplication(id)

case DecommissionSelf =>
case WorkerDecommission(_, _) =>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most other command messages start with a verb, while the 'state updated' ones are in past tense. So I think we should rename this to DecommissionWorker to match that.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126457 has finished for PR 29211 at commit 5678d28.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126456 has finished for PR 29211 at commit 46fb2f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126453 has finished for PR 29211 at commit 12e9e23.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126451 has finished for PR 29211 at commit fea00dd.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126460 has finished for PR 29211 at commit 50b9cb2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DecommissionWorker(

@holdenk

holdenk commented Jul 24, 2020

Copy link
Copy Markdown
Contributor Author

All of the github actions pass, if no one else wants more time to comment on this, I'll merge this today.

@agrawaldevesh agrawaldevesh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for not paying attention to these things before.

* @param id the worker id
* @param worker the worker endpoint ref
*/
case class WorkerDecommission(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I messed up when reviewing this: I am so sorry :-(

Somehow I thought that WorkerDecommission is something being introduced in this PR, but apparently it is an existing message that went in #26440 itself.

Let's please keep the original message WorkerDecommission everywhere to avoid un-necessary merge conflicts etc in people's private forks :-)

At least I learned not to trust "changes since last reviewed" feature of Github when we all just force push into our PR branches :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No stress, I did it as a seperate commit so I can revert it easily.

*/
private[executor] val taskResources = new mutable.HashMap[Long, Map[String, ResourceInformation]]

@volatile private var decommissioned = false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In the spirit of minimizing code diff, is there is a strong reason why this decommissioned flag was moved down from line 67 ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't consider it related to the variables it was grouped with strongly enough so I moved it to stand alone.

@attilapiros

Copy link
Copy Markdown
Contributor

I would like to help here too but I am still busy with SPARK-32198.

@holdenk

holdenk commented Jul 24, 2020

Copy link
Copy Markdown
Contributor Author

Sure, I can hold off on this one then @attilapiros :)

@agrawaldevesh agrawaldevesh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again :-)

@attilapiros attilapiros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just peeked into this and I can only continue the review in the second half of the next week.

val lastMigrationTime = if (
conf.get(config.STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED) &&
conf.get(config.STORAGE_DECOMMISSION_RDD_BLOCKS_ENABLED)) {
Math.min(lastRDDMigrationTime, lastShuffleMigrationTime)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be Math.max.

As I understand in CoarseGrainedExecutorBackend the lastTaskRunningTime increases with about 1000ms in every iteration, let's assume exactly 1000ms. So if the RDD migration finished at 500ms (let's count in the example from 0 here) but shuffle files to be migrated are still left and they will be finished only in the next round (let's assume in 1500ms) then we we never shutdown the executor: as in the current round blocksMigrated is false and in all the following ones migrationTime will be less than lastTaskRunningTime, so this condition will be never satisfied:

if (allBlocksMigrated && (migrationTime > lastTaskRunningTime)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking a bit more on this part.

The intention of lastRDDMigrationTime, lastShuffleMigrationTime and lastTaskRunningTime is to notify the driver only once about the exit of this executor.

My first thought was to improve this by expressing this intention directly with a simple flag in the shutdownThread(like exitTriggered which can be used as a stopping condition instead of the while (true) ) and remove all the three time variables and change lastMigrationInfo to return just a boolean and rename it for its new role (for example to isFinished or just simply to finished).

Then I went a bit further and checked the exitExecutor method and I think even the flag is not necessarily needed:

Still I like to have a stopping condition in the while loop so please consider to use the flag instead of the times.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk @agrawaldevesh what do you think?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you @attilapiros. On both the counts:

  • Converting this into a level trigger vs an edge trigger: ie the flag goes on and remains on when it is time to exit.
  • Having consistency b/w the termination condition of the thread above and the computation of the boolean variable in this method. Having a helper method would help with this and also improve longer term maintainability.

IMHO, this is great feedback and will improve this PR. Thank you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it only increases if there is a task running. We don’t want it to shutdown or consider the blocks migrated if there is a task migrated.
Unfortunately this flag can come and go depending on the tasks running on the executor which is why it’s structured the way it is.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR against your branch

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrawaldevesh: what is your opinion/idea?

I confess that even I am finding the logic in the lastMigrationInfo a bit hard to follow. And as far as I can see, this logic is the key to getting the shutdown thread to exit cleanly. If the information returned by lastMigrationInfo is wrong then the shutdown thread may exit prematurely or never. Both of which should be avoided, particularly because the shutdown thread has no timeout on the amount of time it can hang around.

This complexity is intrinsic. The logic is indeed complex but I think what we can add some more documentation here, perhaps explaining the intent of the code using a couple of examples ? Even better would be to please add a unit test for this. I think it should be easy to unit test just the lastMigrationInfo. This is just testing that the function is behaving as expected by stubbing out other parts. This test would also help with providing the necessary guardrails as we change this function in the future. I expect this function to change as we production harden the cache/shuffle block migration in the near term.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I’m down to unit test this function, makes sense to me and it’s fairly standalone.

I’ll work on the documentation over the weekend.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the approach in that PR re-opens the race condition were preventing here. I would really rather not do that.

I’d also like us to make progress here though, so we can temporarily accept the race condition and file a JIRA and revisit it later as a stand-alone item if y’all are not comfortable with any of the ways to avoid the race.

}

// Make the executor we decommissioned exit
sched.client.killExecutors(List(execToDecommission))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are missing some tests to ensure that the executors do actually exit. Particularly in light of @attilapiros recent comment :-P

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that’s the point of taking out the kill executor. If you look a line bellow we wait on the semaphore to make sure the executor has exited.

@SparkQA

SparkQA commented Jul 24, 2020

Copy link
Copy Markdown

Test build #126510 has finished for PR 29211 at commit 484f8e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class WorkerDecommission(

var lastTaskRunningTime = System.nanoTime()
val sleep_time = 1000 // 1s

while (true) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to execute this while loop within the Thread#run and also start the thread.

Currently the while runs as part of the construction of the new thread instance by the caller's thread and not as a separate new thread.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE CATCH. OMG!!.

@holdenk, I think we should add tests for this :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good catch. This will block on the decommissioning message and we want to return that. Let me think of how we can check that we’re not blocking in the decommission message.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm checking that this is non blocking is going to be a pain, and from what we've seen it works out ok even if it does block. I'll switch it to non-blocking, but if it works regardless of blocking or non-blocking do we need a test for this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to have a test (either a new one dedicated for this case or better to modify an existing one to check the executor really shutdowned).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the executor would shut down (we already test for that), it's just the RPC call would have been blocking rather than non-blocking.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you probably implying this test and this line:

I had no time to run the test yet or check the details.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'll double check by disabling the shutdown and seeing if the test fails in CI while I'm doing my other adventures today. (Then I'll either re-enable the shutdown code or work on getting it to fail first then re-enable it).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering there is no timeout for this. So if somebody in the future make a mistake this will hang as long as the Jenkins admin comes. Should not we need to use tryAcquire?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched it to tryAcquire 👍

@SparkQA

SparkQA commented Jul 27, 2020

Copy link
Copy Markdown

Test build #126664 has finished for PR 29211 at commit 3debd73.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 27, 2020

Copy link
Copy Markdown

Test build #126666 has finished for PR 29211 at commit 63846f1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk force-pushed the SPARK-31197-exit-execs-redone branch from 63846f1 to 3910d50 Compare July 28, 2020 19:05
@SparkQA

SparkQA commented Jul 28, 2020

Copy link
Copy Markdown

Test build #126735 has started for PR 29211 at commit 285600a.

@SparkQA

SparkQA commented Jul 28, 2020

Copy link
Copy Markdown

Test build #126739 has finished for PR 29211 at commit 03a88ff.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk

holdenk commented Jul 29, 2020

Copy link
Copy Markdown
Contributor Author

So it seems without the thread start, we end up in the situation where there is an orphan worker (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126735/console ) so I think that's covered. I'll double-check that this goes away once we put the thread start back in, otherwise I'll do some more poking.

@holdenk holdenk force-pushed the SPARK-31197-exit-execs-redone branch from 477e77f to e81c3fc Compare August 5, 2020 18:39
@holdenk

holdenk commented Aug 5, 2020

Copy link
Copy Markdown
Contributor Author

The only outstanding point of discussion is a test timeout length, which I don't believe is critical to address. If CI passes I intend to merge this PR and rebase the next PR which brings the feature full circle.

@agrawaldevesh agrawaldevesh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the tests and working on this. More closer to better decommissioning !!.

// Wait for the executor to be removed
executorRemovedSem.acquire(1)
// Wait for the executor to be removed automatically after migration.
assert(executorRemovedSem.tryAcquire(1, 5L, TimeUnit.MINUTES))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Would it be okay to add a comment about how long does it usually take in the code itself and mention that the larger timeout is for good measure ?

assert(done && currentTime > previousTime)
}
} finally {
bmDecomManager.stop()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good !

@SparkQA

SparkQA commented Aug 5, 2020

Copy link
Copy Markdown

Test build #127101 has finished for PR 29211 at commit e81c3fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk

holdenk commented Aug 5, 2020

Copy link
Copy Markdown
Contributor Author

Since it's approved and passing jenkins & GHA I'm going to merge this, but I'll add the comment about the time in the follow up I'll rebase on top of this after merge.

@asfgit asfgit closed this in 375d348 Aug 5, 2020
@holdenk

holdenk commented Aug 5, 2020

Copy link
Copy Markdown
Contributor Author

Merged to the dev branch :)

@HyukjinKwon

Copy link
Copy Markdown
Member

cc @tgravescs, @mridulm, @squito, @Ngone51, @jiangxb1987 as well FYI

case e: Exception =>
logError(s"Unexpected error during decommissioning ${e.toString}", e)
}
// Send decommission message to the executor, this may be a duplicate since the executor

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need to prevent duplicate DecommissionSelf at the executor side? For now, I don't see we handle duplicate DecommissionSelf and it may create duplicate threads as a result.

}
} else {
logInfo("No running tasks, no block migration configured, stopping.")
exitExecutor(0, "Finished decommissioning", notifyDriver = true)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we need to wait for tasks to finish if storage decommission is disabled. I mean, for a shuffle map task and result task with indirect result, their outputs still base on blocks. As a result, we'd spend time waiting for them to finish but get nothing good in return.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless it's an action. If someone just wants things to exit immediately they should not enable decommissioning.

exitExecutor(0, "Finished decommissioning", notifyDriver = true)
}
} else {
logInfo("Blocked from shutdown by running ${executor.numRunningtasks} tasks")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you missed a s" here ... the string interpolation isn't happening

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants