Skip to content

Part of [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage.#1561

Closed
rxin wants to merge 4 commits into
apache:masterfrom
rxin:dagSchedulerHashMaps
Closed

Part of [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage.#1561
rxin wants to merge 4 commits into
apache:masterfrom
rxin:dagSchedulerHashMaps

Conversation

@rxin

@rxin rxin commented Jul 24, 2014

Copy link
Copy Markdown
Contributor

This is part of the scheduler cleanup/refactoring effort to make the scheduler code easier to maintain.

@kayousterhout @markhamstra please take a look ...

@SparkQA

SparkQA commented Jul 24, 2014

Copy link
Copy Markdown

QA tests have started for PR 1561. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17087/consoleFull

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's a Set not a List

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we're nit-ing..."belong" -> "belongs"

@SparkQA

SparkQA commented Jul 24, 2014

Copy link
Copy Markdown

QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17087/consoleFull

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why filter instead of filterKeys? Looks to me like that only serves to make the predicate harder to read.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd need to do more hash lookups if it is filterKeys

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? How does stageIdToJobIds.filterKeys(stageId => registeredStages.get.contains(stageId)).foreach { do more hash lookups than does stageIdToStage.filter(s => registeredStages.get.contains(s._1)).foreach {? Looks the same to me: https://github.com/scala/scala/blob/v2.10.4/src/library/scala/collection/MapLike.scala#L230

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need the stage / value object in the foreach after the filter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and you've got it after filterKeys just like after filter. The result of filterKeys is a Map[Int, Stage], and nothing needs to change within the foreach { ... }.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that's pretty cool. I looked into the source code more. Didn't realize the whole FilteredKeys thing is "lazy".

@markhamstra

Copy link
Copy Markdown
Contributor

JIRA?

@kayousterhout

Copy link
Copy Markdown
Contributor

Total nit but would you mind adding to the description (so it ends up in the commit) the point of this change (which, I think, is as part of a scheduler cleanup effort)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do this as part of the Stage constructor?

@rxin rxin changed the title Removed some HashMaps from DAGScheduler by storing information in Stage. [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage. Jul 24, 2014
@rxin rxin changed the title [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage. Part of [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage. Jul 24, 2014

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as a result of the above, you can consolidate this if/else into one thing

@kayousterhout

Copy link
Copy Markdown
Contributor

Love the cleanup here!!!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do i need to call stage.pendingTasks = new HashSet here? @kayousterhout @markhamstra

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you should clear it here, I think the issue is if a stage gets resubmitted.

@rxin

rxin commented Jul 24, 2014

Copy link
Copy Markdown
Contributor Author

Thanks. Pushed a version that should have addressed most of the comments.

@SparkQA

SparkQA commented Jul 24, 2014

Copy link
Copy Markdown

QA tests have started for PR 1561. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17112/consoleFull

@SparkQA

SparkQA commented Jul 24, 2014

Copy link
Copy Markdown

QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17112/consoleFull

@SparkQA

SparkQA commented Jul 25, 2014

Copy link
Copy Markdown

QA tests have started for PR 1561. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17157/consoleFull

@SparkQA

SparkQA commented Jul 25, 2014

Copy link
Copy Markdown

QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17157/consoleFull

@mateiz

mateiz commented Jul 25, 2014

Copy link
Copy Markdown
Contributor

Looks good to me, this is definitely cleaner

@mateiz

mateiz commented Jul 25, 2014

Copy link
Copy Markdown
Contributor

Actually I made one note on clearing pendingTasks. Probably best to keep that code. I think it might come in if a stage is resubmitted.

@rxin

rxin commented Jul 25, 2014

Copy link
Copy Markdown
Contributor Author

That makes sense. I was wondering about it myself too.

@SparkQA

SparkQA commented Jul 25, 2014

Copy link
Copy Markdown

QA tests have started for PR 1561. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17191/consoleFull

@SparkQA

SparkQA commented Jul 25, 2014

Copy link
Copy Markdown

QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17191/consoleFull

@JoshRosen

Copy link
Copy Markdown
Contributor

A lot of the methods and data structures in Stage are specific to only ShuffleMapStage or ResultStage. For example, we only track output locations for ShuffleMapStages and only have ActiveJobs for ResultStages. What do you think about splitting Stage into a trait and two subclasses? This could improve the understandability of other DAGScheduler data structures; for example, we have

shuffleToMapStage = new HashMap[Int, Stage]

which only holds ShuffleMapStages, so we could give it a more specific type of HashMap[Int, ShuffleMapStage].

@markhamstra

Copy link
Copy Markdown
Contributor

@JoshRosen Why a trait instead of an abstract class? We're not expecting to need to mixin Stage outside of the Stage class hierarchy, right?

@JoshRosen

Copy link
Copy Markdown
Contributor

@markhamstra Good point; an abstract class is fine, too.

@rxin

rxin commented Jul 25, 2014

Copy link
Copy Markdown
Contributor Author

That's a good discussion to have. I'm personally in favor of keeping it simple for now until it becomes more complicated. Refactoring common parts are good, but the extra level of indirection also introduces complexity. As I see it right now, the benefit isn't huge yet (it's only a couple variables that we are tracking).

@SparkQA

SparkQA commented Jul 25, 2014

Copy link
Copy Markdown

QA tests have started for PR 1561. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17199/consoleFull

@SparkQA

SparkQA commented Jul 25, 2014

Copy link
Copy Markdown

QA results for PR 1561:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17199/consoleFull

@rxin

rxin commented Jul 25, 2014

Copy link
Copy Markdown
Contributor Author

Jenkins, retest this please.

@rxin

rxin commented Jul 25, 2014

Copy link
Copy Markdown
Contributor Author

hive-thriftserver test failed.

@mateiz

mateiz commented Jul 25, 2014

Copy link
Copy Markdown
Contributor

I'd be okay deferring the subclass thing till later too, the benefit isn't huge right now.

@SparkQA

SparkQA commented Jul 26, 2014

Copy link
Copy Markdown

QA tests have started for PR 1561. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17210/consoleFull

@SparkQA

SparkQA commented Jul 26, 2014

Copy link
Copy Markdown

QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17210/consoleFull

@rxin

rxin commented Jul 26, 2014

Copy link
Copy Markdown
Contributor Author

Merging this in master. Thanks for reviewing.

@asfgit asfgit closed this in 9d8666c Jul 26, 2014
@rxin rxin deleted the dagSchedulerHashMaps branch August 13, 2014 08:01
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…ng information in Stage.

This is part of the scheduler cleanup/refactoring effort to make the scheduler code easier to maintain.

@kayousterhout @markhamstra please take a look ...

Author: Reynold Xin <rxin@apache.org>

Closes apache#1561 from rxin/dagSchedulerHashMaps and squashes the following commits:

1c44e15 [Reynold Xin] Clear pending tasks in submitMissingTasks.
620a0d1 [Reynold Xin] Use filterKeys.
5b54404 [Reynold Xin] Code review feedback.
c1e9a1c [Reynold Xin] Removed some HashMaps from DAGScheduler by storing information in Stage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants