-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[4.11/Future] CLOUDSTACK-9782: Host HA and KVM HA provider #1960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@blueorangutan package |
|
@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-525 |
|
@blueorangutan test |
|
@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
@rhtyd tests looks good, except this one: Looks like when simulator is not deployed test would fail. |
DaanHoogland
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general remark:description fields do not say more then the name of the command or parameter they refer to. some improvement may be possible. e.g. "lists HA providers" may read "return a list of High Availability providers for a Hypervisor type"
.travis.yml
Outdated
| smoke/test_dynamicroles | ||
| smoke/test_global_settings | ||
| smoke/test_guest_vlan_range | ||
| smoke/test_ha_for_host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 :)
|
|
||
| import javax.inject.Inject; | ||
|
|
||
| @APICommand(name = DisableHAForClusterCmd.APINAME, description = "Disables HA cluster-wide", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: looking at the code below this actually is DisableHAForHostsInClusterCmd. I don't think it is a biggy but might lead to conflicts at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaanHoogland can you illustrate an example where we may hit such a conflict? This API actually disables the feature (framework etc) for cluster/zone etc. While the Host specific APIs work on the host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to implement some kind of mirroring service for entire zones or clusters for high availability, these would then be disabled / enabled for a zone or cluster. In this case the enabling is done per host in the zone/cluster. That's why i consider the longer name I propose more appropriate. That said the shorter the name the better as long as it is unambiguous.
|
|
||
| import javax.inject.Inject; | ||
|
|
||
| @APICommand(name = DisableHAForZoneCmd.APINAME, description = "Disables HA for a zone", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: see disable cluster cmd
| responseObject = SuccessResponse.class, | ||
| requestHasSensitiveInfo = false, responseHasSensitiveInfo = false, | ||
| since = "4.11", authorized = {RoleType.Admin}) | ||
| public final class EnableHAForClusterCmd extends BaseAsyncCmd { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: see disable cmd
| responseObject = SuccessResponse.class, | ||
| requestHasSensitiveInfo = false, responseHasSensitiveInfo = false, | ||
| since = "4.11", authorized = {RoleType.Admin}) | ||
| public final class EnableHAForZoneCmd extends BaseAsyncCmd { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: see disable cluster
DaanHoogland
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not squash next time. both history is lost and review is very hard on such a big chunk of code. individual smaller commits make for easier reading.
| } | ||
| } | ||
|
|
||
| public void setHostId(final Long hostId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: consistency would dictate this be called setId()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the setters and getters are named per the variable, since the variable is hostId IDEs generate getHostId, setHostId named methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the field name is id, so i don't understand your reply
| CallContext.current().putContextParameter(Host.class, host.getUuid()); | ||
|
|
||
| final OutOfBandManagementResponse response = outOfBandManagementService.changeOutOfBandManagementPassword(host, getPassword()); | ||
| final OutOfBandManagementResponse response = outOfBandManagementService.changePassword(host, getPassword()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
I went through the code and found no real issues. for any other reviewers I recommend reading the FS first. It is quit a big chunk but very neat. LGTM |
|
Thanks @DaanHoogland wherever applicable I'll address the comments. |
|
I have already raised some questions on dev@ on the need for a new HA framework when the existing HA framework can do all the things mentioned. The new framework only supports VM HA. If we see some concrete implementation of network/storage or any other type of resource HA which doesn't currently exist using the new framework then it would be much more easier to see the value add. I think more discussion is needed on this. |
|
@koushik-das The current framework is specifically implemented for Host HA with KVM HA as the initial implementation. This framework is supposed to replace the framework that is specifically written for VM-HA. @rhtyd |
|
@koushik-das I've shared a list of advantages of this work over existing framework on dev@ that explain why existing VM-HA framework cannot be used for host-ha implementation. If you've more questions or comments, we can discuss them here or on dev@. |
24ec0dd to
a1e27c4
Compare
|
@abhinandanprateek If you refer to the discussion on dev@ https://goo.gl/cU8RuX, @rhtyd proposed it as a generic HA framework for any resources (and not limited to VM). Now if it just a replacement of the existing VM-HA framework then we need to discuss the benefits it provides in more detail compared to the existing one. I have already made some specific comments on the justifications provided in favour of a new framework. |
|
@rhtyd PRs 2003 and 2011 got merged, could you please rebase against master so it'll pick up the fixes. Once we package I'll continue with the verification on the physical hosts. |
|
@borisstoyanov copy that, done |
|
@blueorangutan package |
|
@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-600 |
|
@blueorangutan test |
|
@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
@rhtyd there seems to be a conflict for this merge, I'm currently running tests and will keep you posted |
borisstoyanov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've executed the following tests and they all look good
HA Configuration:
Setting OOBM, enabling HA -> HA state Ineligible (provider is not set)
Setting Provider, Enabling HA -> HA state Ineligible (OOBM is not set)
Settign OOBM and HA Provider, enabling HA -> HA state is Available
When Agent is killed:
Passed: Transitioned to Suspect/Checking -> Degraded
Agent started:
Passed Transitioned to Suspect/Checking -> Available
When Host is irresponsible
Passed: Recovering a host
Passed: Fencing a Host
Passed: HA Enabled VMs were migrated when host was Fenced
Passed: HA Disabled VMs were in stopped state after the host is Fenced
I think we need to address 2 things:
- SmokeTests are looking good the only failure related to these changes is because within the response there was no simulator provider.
- merge conflict
DaanHoogland
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. i had some comments on naming earlier only one is really. I re-itterated the odd one.
| } | ||
|
|
||
| public void setHostId(final Long hostId) { | ||
| id = hostId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this setter is still to be called setId(..) instead of setHostId(..), or should the field be renamed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed @DaanHoogland please see, turns out we dont' need this method.
|
@DaanHoogland I've fixed the issue now, it was not used so removed the method |
|
Ignoring intermittent test failures and new test failures around |
Fixed the issue now, closing the issue now.
borisstoyanov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ACS CI BVT RunSumarry: Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0 Failed tests:
Skipped tests: Passed test suits: |
|
Travis is failing intermittently due to large log size, I've amended changes to reduce mvn build log and made changes in simulator integration test to pass for intermittent/edge cases. Considering LGTMs and test results, I'll merge this as soon as Travis is green. |
Host-HA offers investigation, fencing and recovery mechanisms for host that for any reason are malfunctioning. It uses Activity and Health checks to determine current host state based on which it may degrade a host or try to recover it. On failing to recover it, it may try to fence the host. The core feature is implemented in a hypervisor agnostic way, with two separate implementations of the driver/provider for Simulator and KVM hypervisors. The framework also allows for implementation of other hypervisor specific provider implementation in future. The Host-HA provider implementation for KVM hypervisor uses the out-of-band management sub-system to issue IPMI calls to reset (recover) or poweroff (fence) a host. The Host-HA provider implementation for Simulator provides a means of testing and validating the core framework implementation. Signed-off-by: Abhinandan Prateek <abhinandan.prateek@shapeblue.com> Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Nested out-of-band management plugin to work with hosts that are VMs in a CloudStack env. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
- Removed three bg thread tasks, uses FSM event-trigger based scheduling - On successful recovery, kicks VM HA - Improves overall HA scheduling and task submission, lower DB access Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
5af039c to
2c3aad8
Compare
- All tests should pass on KVM, Simulator - Add test cases covering FSM state transitions and actions Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
|
Is this functionality ready to use in 4.11.1? Is there any docs for it? (how to enable host HA, because clicking button in UI just shows error.., what should be configured and so on) |
|
@izenk please find the FS here: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA |
|
@borisstoyanov thanks, went through docs, but still unclear for me the exact configuration.. But its not clear for me.. For example, my installation is CS 4.11.1 (KVM) + CEPH |
|
@rhtyd @borisstoyanov I'm trying to read up on the possibilities for high availability. Is there any way to differentiate the "old" HA from the one that's implemented here? Im sorry to say that the design document only adds to the confusion. I wasn't able to find any documentation that goes over the current state of HA in Cloudstack. Maybe you can point me in the right direction. |
|
@DennisKonrad I don't considder me to be an expert, but in my recollection 'old' ha is just for VMs and not for hosts. |
|
@DaanHoogland Ok, I understand the reasoning for the "new" ha. But where can I configure the old feature and where the new one. Specificly which config keys affect one or the other and what does the "HA Enabled" on a host actually do? Can I activate/deactivate only one or do I need to activate both? I cannot really find out if the features depend on one another? |
Host-HA offers investigation, fencing and recovery mechanisms for host that for
any reason are malfunctioning. It uses Activity and Health checks to determine
current host state based on which it may degrade a host or try to recover it. On
failing to recover it, it may try to fence the host.
The core feature is implemented in a hypervisor agnostic way, with two separate
implementations of the driver/provider for Simulator and KVM hypervisors. The
framework also allows for implementation of other hypervisor specific provider
implementation in future.
The Host-HA provider implementation for KVM hypervisor uses the out-of-band
management sub-system to issue IPMI calls to reset (recover) or poweroff (fence)
a host.
The Host-HA provider implementation for Simulator provides a means of testing
and validating the core framework implementation.
FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
Signed-off-by: Abhinandan Prateek abhinandan.prateek@shapeblue.com
Signed-off-by: Rohit Yadav rohit.yadav@shapeblue.com