Fix the issue that the instance may not be assigned a replica as expected. #1098
Conversation
…cted. This is to fix a regression which was introduced by PR apache#986. The PR tried to prioritize the preference list to avoid unnecessary top state transitions. However, there was a bug in the prioritizing logic: if a participant is skipped due to low priority, it is not picked up again during the calculation. As a result, this participant is not assigned any replica even though it was originally in the preference list. This fix ensures that a skipped participant is checked again until it gets an assignment.
@@ -385,10 +389,19 @@ protected Map<String, String> computeBestPossibleMap(List<String> preferenceList
// If the desired state is the top state, but the instance cannot be transited to the
// top state in one hop, try to keep the top state on current host or a host with a closer
// state.
We could also add a check for a single top state, since a state model with multiple top states does not even require these operations.
I thought about it, but if we have a state model that requires 2 top states and the replica count is 4, do we still want this improvement? My answer is yes. I don't think we want to lose the generality.
In addition, that check may not really prevent all the issues. For example, if the replica count is 1, I guess there will still be issues without a complete fix. So we should fix it, and fix it for all cases.
Also includes other changes to enhance performance.
if (!assignedInstances.add(proposedInstance)) {
  throw new AssertionError(String
      .format("The proposed instance %s has been already assigned before.",
          proposedInstance));
This should not happen, right? You add to assignedInstances both for the instance peeked from the queue and after adjusting the instance.
Yeah, this is a sanity check. If something goes wrong, I want it to fail early, because this could be very hard to debug. And if we have other customized state model definitions, they might not be covered by the unit tests. Adding this assertion will help us identify issues much more easily.
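For illustration, the fail-early check discussed above can be isolated into a small helper. This is only a sketch: the class name `AssignmentSanityCheckSketch` and the method `recordAssignment` are hypothetical and are not part of this PR. It relies on `Set.add` returning `false` when the element is already present.

```java
import java.util.Set;

// Hypothetical helper, not the code in this PR.
class AssignmentSanityCheckSketch {

  // Records that proposedInstance received a replica.
  // Set.add returns false if the instance was already recorded, which would
  // mean the same instance was picked twice, so fail immediately.
  static void recordAssignment(Set<String> assignedInstances, String proposedInstance) {
    if (!assignedInstances.add(proposedInstance)) {
      throw new AssertionError(String.format(
          "The proposed instance %s has been already assigned before.", proposedInstance));
    }
  }
}
```

Calling `recordAssignment` twice with the same instance name on the same set throws on the second call, so a double assignment surfaces at the point where it happens instead of as a confusing mapping later.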
This PR is ready to be merged, approved by @dasahcc
Accidentally hit "revert". Will discard the reverting branch. The Master is fine.
…cted. (apache#1098) This is to fix a regression that was introduced by PR apache#986. The PR tried to prioritize the preference list to avoid unnecessary top state transitions. However, there was a bug in the prioritizing logic: if a participant is skipped due to low priority, it is not picked up again during the calculation. As a result, this participant is not assigned any replica even though it was originally in the preference list. This only happens if the state model has been customized so that it has multiple top states and there is an intermediate state with expected count -1 between the top state and the other states. This fix ensures that a skipped participant is checked again until it gets an assignment.
Issues
#1097
Description
This is to fix a regression that was introduced by PR #986.
The PR tried to prioritize the preference list to avoid unnecessary top state transitions. However, there was a bug in the prioritizing logic: if a participant is skipped due to low priority, it is not picked up again during the calculation. As a result, this participant is not assigned any replica even though it was originally in the preference list.
This only happens if the state model has been customized so that it has multiple top states and there is an intermediate state with expected count -1 between the top state and the other states.
This fix ensures that a skipped participant is checked again until it gets an assignment.
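A minimal sketch of the idea, assuming a much simplified model; it is not the actual `computeBestPossibleMap` code. The class `SkippedInstanceSketch`, the method `assign`, and the `closeToTop` set (instances that can reach the top state in one hop) are hypothetical names used only for illustration. The point is that a participant skipped for the top state stays in the pending queue and is considered again for later states.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simplification, not the Helix implementation.
class SkippedInstanceSketch {

  // preferenceList: instances in preference order.
  // stateCounts: state -> expected count, in priority order (top state first);
  //              a count of -1 means "all remaining instances".
  // closeToTop: instances assumed able to reach the top state in one hop.
  static Map<String, String> assign(List<String> preferenceList,
      LinkedHashMap<String, Integer> stateCounts, Set<String> closeToTop) {
    Map<String, String> result = new LinkedHashMap<>();
    Deque<String> pending = new ArrayDeque<>(preferenceList);
    boolean isTopState = true;
    for (Map.Entry<String, Integer> entry : stateCounts.entrySet()) {
      int count = entry.getValue() < 0 ? pending.size() : entry.getValue();
      for (int i = 0; i < count && !pending.isEmpty(); i++) {
        String proposed = pending.peek();
        if (isTopState && !closeToTop.contains(proposed)) {
          // Prefer a pending instance that is closer to the top state, if any.
          for (String candidate : pending) {
            if (closeToTop.contains(candidate)) {
              proposed = candidate;
              break;
            }
          }
        }
        // Remove only the chosen instance. A skipped head of the queue stays
        // pending, so it is checked again until it gets an assignment.
        pending.remove(proposed);
        result.put(proposed, entry.getKey());
      }
      isTopState = false;
    }
    return result;
  }
}
```

With a preference list of A, B, C, D, two top-state replicas, an intermediate state with count -1, and only B and C close to the top state, A is skipped twice for the top state but still receives the intermediate state, as does D.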
Tests
TestAbstractRebalancer.java
Added test data that simulates a customized state model that fits the problem statement.
[INFO] Results:
[INFO]
[ERROR] Failures:
[ERROR] TestJobQueueCleanUp.testJobQueueAutoCleanUp » ThreadTimeout Method org.testng....
[ERROR] TestClusterVerifier.testResourceSubset:225 expected: but was:
[INFO]
[ERROR] Tests run: 1145, Failures: 2, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:25 h
[INFO] Finished at: 2020-06-17T18:59:29-07:00
[INFO] ------------------------------------------------------------------------
Rerun
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.908 s - in TestSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 52.290 s
[INFO] Finished at: 2020-06-18T11:40:16-07:00
[INFO] ------------------------------------------------------------------------
Commits
Documentation (Optional)
(Link the GitHub wiki you added)
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)