Skip to content

Fix the issue that the instance may not be assigned a replica as expected.#1098

Merged
jiajunwang merged 4 commits into
apache:masterfrom
jiajunwang:master
Jun 18, 2020
Merged

Fix the issue that the instance may not be assigned a replica as expected.#1098
jiajunwang merged 4 commits into
apache:masterfrom
jiajunwang:master

Conversation

@jiajunwang
Copy link
Copy Markdown
Contributor

@jiajunwang jiajunwang commented Jun 16, 2020

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#1097

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This is to fix a regression that was introduced by PR #986
The PR tried to prioritize the preference list to avoid unnecessary top state transitions. However, there was a bug in the prioritizing logic and if one participant is skipped due to low priority, it won't be picked up again during the calculating. As a result, this participant won't be assigned with any replica even it is originally in the preference list.
This only happens if the state model has been customized so it is multiple top states and there is an intermediate state with expected count -1 between the top state and the other states.

This fix will ensure the skipped participant being checked again until it gets the assignment.

Tests

  • The following tests are written for this issue:

TestAbstractRebalancer.java
Added test data that simulate a customized state model fits the problem statement.

  • The following is the result of the "mvn test" command on the appropriate module:

[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TestJobQueueCleanUp.testJobQueueAutoCleanUp » ThreadTimeout Method org.testng....
[ERROR] TestClusterVerifier.testResourceSubset:225 expected: but was:
[INFO]
[ERROR] Tests run: 1145, Failures: 2, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:25 h
[INFO] Finished at: 2020-06-17T18:59:29-07:00
[INFO] ------------------------------------------------------------------------

Rerun

[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.908 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 52.290 s
[INFO] Finished at: 2020-06-18T11:40:16-07:00
[INFO] ------------------------------------------------------------------------

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

…cted.

This is to fix a regression which was introduced by PR apache#986.
The PR tried to prioritize the preference list to avoid unnecessary top state transition. However, there was a bug in the prioritizing logic and if one participant is skipped due to low priority, it won't be picked up again during the cauculating. As a result, this participant won't be assigned with any replica even it is originally in the preference list.
This fix will ensure the skipped participant being checked again until it gets the assignment.
@@ -385,10 +389,19 @@ protected Map<String, String> computeBestPossibleMap(List<String> preferenceList
// If the desired state is the top state, but the instance cannot be transited to the
// top state in one hop, try to keep the top state on current host or a host with a closer
// state.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also additional check for single top state since multi top state even does not require these operations

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but if we have a state model which requires 2 top states, and replica is 4, do we still want this improvement? My answer is yes. I don't think we want to lose the universality.
In addition, it may not really prevent all the issues. For example, if the replica is 1, without a complete fix, I guess there will still be issues. So we should fix it and fix it for all cases.

Jiajun Wang added 2 commits June 17, 2020 17:13
Copy link
Copy Markdown
Contributor

@junkaixue junkaixue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM

Comment on lines +430 to +433
if (!assignedInstances.add(proposedInstance)) {
throw new AssertionError(String
.format("The proposed instance %s has been already assigned before.",
proposedInstance));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not happen, right. You do the assignedInstances both for peek of the queue and after adjust instance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is for the sanity check. If something goes wrong, I hope it fails early. This could be very hard to debug. And if we have some other customized SM defs, it might not be covered by the unit test. Adding this assertion will help us identify issues much easier.

@jiajunwang
Copy link
Copy Markdown
Contributor Author

This PR is ready to be merged, approved by @dasahcc

@jiajunwang jiajunwang merged commit 35857a8 into apache:master Jun 18, 2020
jiajunwang added a commit that referenced this pull request Jun 18, 2020
@jiajunwang
Copy link
Copy Markdown
Contributor Author

Accidentally hit "revert". Will discard the reverting branch. The Master is fine.

huizhilu pushed a commit to huizhilu/helix that referenced this pull request Aug 16, 2020
…cted. (apache#1098)

This is to fix a regression that was introduced by PR apache#986
The PR tried to prioritize the preference list to avoid unnecessary top state transitions. However, there was a bug in the prioritizing logic and if one participant is skipped due to low priority, it won't be picked up again during the calculating. As a result, this participant won't be assigned with any replica even it is originally in the preference list.
This only happens if the state model has been customized so it is multiple top states and there is an intermediate state with expected count -1 between the top state and the other states.

This fix will ensure the skipped participant being checked again until it gets the assignment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants