@yadvr
Member

@yadvr yadvr commented Feb 22, 2017

Host-HA offers investigation, fencing and recovery mechanisms for hosts that are
malfunctioning for any reason. It uses Activity and Health checks to determine
the current host state, based on which it may degrade a host or try to recover it.
If recovery fails, it may try to fence the host.

The core feature is implemented in a hypervisor-agnostic way, with two separate
implementations of the driver/provider for the Simulator and KVM hypervisors. The
framework also allows providers for other hypervisors to be implemented in
the future.

The Host-HA provider implementation for KVM hypervisor uses the out-of-band
management sub-system to issue IPMI calls to reset (recover) or poweroff (fence)
a host.
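
As a rough illustration of the recover/fence mapping described above, the sketch below builds `ipmitool` command lines for the two actions. It assumes the `lanplus` interface; the `build_ipmi_command` helper, its parameters and the action names are hypothetical and are not taken from the code in this PR.

```python
# Hypothetical helper: maps Host-HA actions to ipmitool chassis power commands.
# The action names ("recover", "fence") follow the description above; the
# function and its parameters are illustrative, not CloudStack APIs.
HA_ACTION_TO_POWER_OP = {
    "recover": "reset",  # reset the host to try to bring it back
    "fence": "off",      # power the host off so it can no longer run VMs
}


def build_ipmi_command(action, address, username, password):
    """Return the ipmitool argument list for a Host-HA action."""
    if action not in HA_ACTION_TO_POWER_OP:
        raise ValueError("unsupported HA action: %s" % action)
    return [
        "ipmitool", "-I", "lanplus",
        "-H", address, "-U", username, "-P", password,
        "chassis", "power", HA_ACTION_TO_POWER_OP[action],
    ]
```

The returned list could be passed to `subprocess.run`; the address and credentials are placeholders supplied by the caller.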

The Host-HA provider implementation for Simulator provides a means of testing
and validating the core framework implementation.

FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA

Signed-off-by: Abhinandan Prateek <abhinandan.prateek@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@yadvr yadvr changed the title from "CLOUDSTACK-9782: Host HA and KVM HA provider" to "[4.11/Future] CLOUDSTACK-9782: Host HA and KVM HA provider" Feb 22, 2017
@borisstoyanov
Contributor

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-525

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@borisstoyanov
Contributor

@rhtyd the tests look good, except this one:

ERROR: Tests default ha providers list
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/marvin/tests/smoke/test_hostha_simulator.py", line 199, in test_ha_list_providers
    response = self.apiclient.listHostHAProviders(cmd)[0]
TypeError: 'NoneType' object has no attribute '__getitem__'

It looks like the test fails when the simulator is not deployed.
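
One possible way to make the test tolerate this, sketched under the assumption (suggested by the traceback) that the list call returns `None` rather than an empty list when nothing matches. The `first_result` helper and the stand-in test class are hypothetical and are not the fix applied in this PR.

```python
import unittest


def first_result(response):
    """Return the first item of a list-style API response, or None.

    Guards against a list call returning None instead of an empty list,
    which is what the traceback above trips over.
    """
    return response[0] if response else None


class TestListProviders(unittest.TestCase):
    def test_ha_list_providers(self):
        # Stand-in for self.apiclient.listHostHAProviders(cmd); None mimics
        # an environment where no provider is returned.
        response = None
        provider = first_result(response)
        if provider is None:
            self.skipTest("no HA providers returned (simulator not deployed?)")
        self.assertIsNotNone(provider)
```

Skipping rather than failing is a design choice; a test that must run against the simulator might prefer an explicit assertion with a clearer message.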

@yadvr yadvr closed this Feb 27, 2017
@yadvr yadvr reopened this Feb 27, 2017
Contributor

@DaanHoogland DaanHoogland left a comment

General remark: the description fields do not say more than the name of the command or parameter they refer to. Some improvement may be possible, e.g. "lists HA providers" could read "returns a list of High Availability providers for a hypervisor type".

.travis.yml Outdated
smoke/test_dynamicroles
smoke/test_global_settings
smoke/test_guest_vlan_range
smoke/test_ha_for_host
Contributor

👍 :)


import javax.inject.Inject;

@APICommand(name = DisableHAForClusterCmd.APINAME, description = "Disables HA cluster-wide",
Contributor

naming: looking at the code below this actually is DisableHAForHostsInClusterCmd. I don't think it is a biggy but might lead to conflicts at some point

Member Author

@DaanHoogland can you illustrate an example where we may hit such a conflict? This API actually disables the feature (framework etc) for cluster/zone etc. While the Host specific APIs work on the host.

Contributor

If we decide to implement some kind of mirroring service for entire zones or clusters for high availability, these would then be disabled / enabled for a zone or cluster. In this case the enabling is done per host in the zone/cluster. That's why I consider the longer name I propose more appropriate. That said, the shorter the name the better, as long as it is unambiguous.


import javax.inject.Inject;

@APICommand(name = DisableHAForZoneCmd.APINAME, description = "Disables HA for a zone",
Contributor

naming: see disable cluster cmd

responseObject = SuccessResponse.class,
requestHasSensitiveInfo = false, responseHasSensitiveInfo = false,
since = "4.11", authorized = {RoleType.Admin})
public final class EnableHAForClusterCmd extends BaseAsyncCmd {
Contributor

naming: see disable cmd

responseObject = SuccessResponse.class,
requestHasSensitiveInfo = false, responseHasSensitiveInfo = false,
since = "4.11", authorized = {RoleType.Admin})
public final class EnableHAForZoneCmd extends BaseAsyncCmd {
Contributor

naming: see disable cluster

Contributor

@DaanHoogland DaanHoogland left a comment

Please do not squash next time. History is lost, and review is very hard on such a big chunk of code. Individual smaller commits make for easier reading.

}
}

public void setHostId(final Long hostId) {
Contributor

naming: consistency would dictate this be called setId()

Member Author

I think the setters and getters are named per the variable: since the variable is hostId, IDEs generate methods named getHostId and setHostId.

Contributor

The field name is id, so I don't understand your reply.

CallContext.current().putContextParameter(Host.class, host.getUuid());

-final OutOfBandManagementResponse response = outOfBandManagementService.changeOutOfBandManagementPassword(host, getPassword());
+final OutOfBandManagementResponse response = outOfBandManagementService.changePassword(host, getPassword());
Contributor

👍

@DaanHoogland
Contributor

I went through the code and found no real issues. For any other reviewers, I recommend reading the FS first. It is quite a big chunk but very neat.

LGTM

@yadvr
Member Author

yadvr commented Feb 28, 2017

Thanks @DaanHoogland wherever applicable I'll address the comments.

@koushik-das
Contributor

I have already raised some questions on dev@ on the need for a new HA framework when the existing HA framework can do all the things mentioned. The new framework only supports VM HA. If we see some concrete implementation of network/storage or any other type of resource HA which doesn't currently exist using the new framework, then it would be much easier to see the value add. I think more discussion is needed on this.

@abhinandanprateek
Contributor

@koushik-das The current framework is specifically implemented for Host HA with KVM HA as the initial implementation. This framework is supposed to replace the framework that is specifically written for VM-HA. @rhtyd

@yadvr
Member Author

yadvr commented Feb 28, 2017

@koushik-das I've shared a list of advantages of this work over existing framework on dev@ that explain why existing VM-HA framework cannot be used for host-ha implementation. If you've more questions or comments, we can discuss them here or on dev@.

@koushik-das
Contributor

@abhinandanprateek If you refer to the discussion on dev@ https://goo.gl/cU8RuX, @rhtyd proposed it as a generic HA framework for any resources (and not limited to VM). Now if it is just a replacement of the existing VM-HA framework, then we need to discuss the benefits it provides in more detail compared to the existing one. I have already made some specific comments on the justifications provided in favour of a new framework.

@borisstoyanov
Contributor

@rhtyd PRs 2003 and 2011 got merged, could you please rebase against master so it'll pick up the fixes. Once we package I'll continue with the verification on the physical hosts.

@yadvr
Member Author

yadvr commented Mar 26, 2017

@borisstoyanov copy that, done

@borisstoyanov
Contributor

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-600

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@borisstoyanov
Contributor

@rhtyd there seems to be a conflict for this merge, I'm currently running tests and will keep you posted

Contributor

@borisstoyanov borisstoyanov left a comment

I've executed the following tests and they all look good.

HA Configuration:
Setting OOBM, enabling HA -> HA state is Ineligible (provider is not set)
Setting Provider, enabling HA -> HA state is Ineligible (OOBM is not set)
Setting OOBM and HA Provider, enabling HA -> HA state is Available

When Agent is killed:
Passed: Transitioned to Suspect/Checking -> Degraded
Agent started:
Passed: Transitioned to Suspect/Checking -> Available

When Host is unresponsive:
Passed: Recovering a host
Passed: Fencing a host
Passed: HA Enabled VMs were migrated when the host was Fenced
Passed: HA Disabled VMs were in stopped state after the host was Fenced

I think we need to address 2 things:

  • Smoke tests are looking good; the only failure related to these changes occurs because the response contained no simulator provider.
  • merge conflict
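
The state changes reported in the manual tests above can be written down as a small transition table. This is only a sketch: the state names come from this comment, the table covers just the observed paths plus the recover-then-fence order from the PR description, and it is not the state machine implemented in this PR.

```python
# Illustrative only: transitions inferred from the manual test results above
# and the PR description, not the FSM shipped in this PR.
OBSERVED_TRANSITIONS = {
    "Available": {"Suspect"},
    "Suspect": {"Checking"},
    "Checking": {"Available", "Degraded", "Recovering"},
    "Degraded": {"Suspect"},
    "Recovering": {"Available", "Fenced"},  # assumption: fence only after a failed recovery
    "Fenced": set(),
}


def is_valid_path(states):
    """Check that each consecutive pair of states is an allowed transition."""
    return all(
        nxt in OBSERVED_TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )
```

For example, the "agent killed" case corresponds to the path Available, Suspect, Checking, Degraded, and the "agent started" case to Degraded, Suspect, Checking, Available.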

Contributor

@DaanHoogland DaanHoogland left a comment

LGTM overall. I had some comments on naming earlier only one is really. I reiterated the odd one.

}

public void setHostId(final Long hostId) {
id = hostId;
Contributor

this setter is still to be called setId(..) instead of setHostId(..), or should the field be renamed?

Member Author

@yadvr yadvr Aug 29, 2017

Fixed, @DaanHoogland, please see. It turns out we don't need this method.

@yadvr
Member Author

yadvr commented Aug 29, 2017

@DaanHoogland I've fixed the issue now: the method was not used, so I removed it.

@yadvr
Member Author

yadvr commented Aug 29, 2017

Ignoring intermittent test failures and new test failures around isos, this PR has enough LGTMs and test results now, I'll merge this tomorrow and wait for any new comments and objections.

@yadvr yadvr dismissed DaanHoogland’s stale review August 29, 2017 11:46

Fixed the issue now, closing the issue now.

Contributor

@borisstoyanov borisstoyanov left a comment

LGTM

@cloudmonger

ACS CI BVT Run

Summary:
Build Number 1164
Hypervisor xenserver
NetworkType Advanced
Passed=109
Failed=8
Skipped=40

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0

Failed tests:

  • test_vm_snapshots.py
      • test_change_service_offering_for_vm_with_snapshots Failed

  • test_router_dnsservice.py
      • test_router_dns_guestipquery Failing since 2 runs

  • test_non_contigiousvlan.py
      • test_extendPhysicalNetworkVlan Failed

  • test_volumes.py
      • test_06_download_detached_volume Failed

  • test_routers_network_ops.py
      • test_01_isolate_network_FW_PF_default_routes_egress_true Failing since 28 runs
      • test_02_isolate_network_FW_PF_default_routes_egress_false Failing since 155 runs
      • test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failing since 150 runs
      • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 150 runs

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_01_verify_libvirt
test_02_verify_libvirt_after_restart
test_03_verify_libvirt_attach_disk
test_04_verify_guest_lspci
test_05_change_vm_ostype_restart
test_06_verify_guest_lspci_again
test_disable_oobm_ha_state_ineligible
test_ha_kvm_host_degraded
test_ha_kvm_host_fencing
test_ha_kvm_host_recovering
test_hostha_configure_default_driver
test_hostha_enable_ha_when_host_disabled
test_hostha_enable_ha_when_host_disconected
test_hostha_enable_ha_when_host_in_maintenance
test_remove_ha_provider_not_possible
test_configure_ha_provider_invalid
test_configure_ha_provider_valid
test_ha_disable_feature_invalid
test_ha_enable_feature_invalid
test_ha_enabledisable_across_clusterzones
test_ha_list_providers
test_ha_multiple_mgmt_server_ownership
test_ha_verify_fsm_available
test_ha_verify_fsm_degraded
test_ha_verify_fsm_fenced
test_ha_verify_fsm_recovering
test_hostha_configure_default_driver
test_hostha_configure_invalid_provider
test_hostha_disable_feature_valid
test_hostha_enable_feature_valid
test_hostha_enable_feature_without_setting_provider
test_list_ha_for_host
test_list_ha_for_host_invalid
test_list_ha_for_host_valid
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_outofbandmanagement_nestedplugin.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_disk_offerings.py

@yadvr
Member Author

yadvr commented Aug 30, 2017

Travis is failing intermittently due to the large log size. I've amended the changes to reduce the mvn build log and changed the simulator integration test to pass in intermittent/edge cases.

Considering LGTMs and test results, I'll merge this as soon as Travis is green.

yadvr added 3 commits August 30, 2017 12:20

Host-HA offers investigation, fencing and recovery mechanisms for hosts that are
malfunctioning for any reason. It uses Activity and Health checks to determine
the current host state, based on which it may degrade a host or try to recover it.
If recovery fails, it may try to fence the host.

The core feature is implemented in a hypervisor-agnostic way, with two separate
implementations of the driver/provider for the Simulator and KVM hypervisors. The
framework also allows providers for other hypervisors to be implemented in
the future.

The Host-HA provider implementation for KVM hypervisor uses the out-of-band
management sub-system to issue IPMI calls to reset (recover) or poweroff (fence)
a host.

The Host-HA provider implementation for Simulator provides a means of testing
and validating the core framework implementation.

Signed-off-by: Abhinandan Prateek <abhinandan.prateek@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

Nested out-of-band management plugin to work with hosts that are VMs in
a CloudStack env.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

- Removed three bg thread tasks, uses FSM event-trigger based scheduling
- On successful recovery, kicks VM HA
- Improves overall HA scheduling and task submission, lower DB access

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@yadvr yadvr force-pushed the host-ha-master branch 5 times, most recently from 5af039c to 2c3aad8 Compare August 30, 2017 14:06

- All tests should pass on KVM, Simulator
- Add test cases covering FSM state transitions and actions

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

@yadvr yadvr merged commit f917ab6 into apache:master Aug 30, 2017

@izenk

izenk commented Sep 12, 2018

Is this functionality ready to use in 4.11.1? Are there any docs for it (how to enable host HA, what should be configured, and so on)? Clicking the button in the UI just shows an error.

@borisstoyanov
Contributor

@izenk please find the FS here: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
It has a full description of what hardware you'll need and how to set it up. You can mail the dev@ or user@ mailing lists if you need assistance.
And yes, the feature is fully functional in 4.11.1.

@izenk

izenk commented Oct 24, 2018

@borisstoyanov thanks, I went through the docs, but the exact configuration is still unclear to me.
There is no description of what should be done to enable and configure HA (is it enough to switch it on in the UI or not?).
What are the requirements in terms of feature support or software versions?
I found some text about this:

A host must meet the following criteria to be deemed eligible for HA operations by the KVM HA host provider:
    The host must be a member of a cluster using the KVM hypervisor
    The host must have a power management status of ON or OFF
    The version of the KVM agent deployed on the host must support performing activity checks
    At least one volume attached to the VM(s) must support the activity check capability
    There should be at least one other host in the cluster.

But it's not clear to me.
What does "The version of the KVM agent deployed on the host must support performing activity checks" mean? There is no such term (activity checks) in the KVM docs.
"At least one volume attached to the VM(s) must support the activity check capability" - how can that be determined? In what terms: in terms of storage or KVM?

For example, my installation is CS 4.11.1 (KVM) + Ceph.
I can enable HA for hosts and no errors appear, but in fact all hosts are in the DEGRADED state.
And I can't work out where to start investigating the issue.
Are there any KVMHAProvider docs (what the checks are, how they are executed, and so on)?
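
For readers trying to reason about the quoted criteria, they can be expressed as a simple checklist. The sketch below is only a restatement of that list in code: the `HostFacts` fields and the `ineligibility_reasons` function are hypothetical names, they are not CloudStack APIs, and how each fact is actually determined is exactly what the question above leaves open.

```python
from dataclasses import dataclass


@dataclass
class HostFacts:
    # All field names are hypothetical; they mirror the quoted criteria only.
    hypervisor: str
    power_state: str
    agent_supports_activity_check: bool
    has_volume_with_activity_check: bool
    other_hosts_in_cluster: int


def ineligibility_reasons(host):
    """Return the quoted criteria that the host fails (empty list = eligible)."""
    reasons = []
    if host.hypervisor != "KVM":
        reasons.append("host is not in a KVM cluster")
    if host.power_state not in ("ON", "OFF"):
        reasons.append("power management status is not ON or OFF")
    if not host.agent_supports_activity_check:
        reasons.append("KVM agent does not support activity checks")
    if not host.has_volume_with_activity_check:
        reasons.append("no attached volume supports the activity check")
    if host.other_hosts_in_cluster < 1:
        reasons.append("no other host in the cluster")
    return reasons
```

Returning the list of failed criteria, rather than a single boolean, makes it easier to see which requirement to investigate first.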

@DennisKonrad
Contributor

@rhtyd @borisstoyanov I'm trying to read up on the possibilities for high availability.

Is there any way to differentiate the "old" HA from the one that's implemented here? I'm sorry to say that the design document only adds to the confusion.

I wasn't able to find any documentation that goes over the current state of HA in CloudStack. Maybe you can point me in the right direction.

@DaanHoogland
Contributor

@DennisKonrad I don't consider myself an expert, but as I recall the 'old' HA is just for VMs and not for hosts.

@DennisKonrad
Contributor

@DaanHoogland OK, I understand the reasoning for the "new" HA. But where can I configure the old feature, and where the new one?

Specifically, which config keys affect one or the other, and what does "HA Enabled" on a host actually do?

Can I activate/deactivate only one, or do I need to activate both? I cannot really find out whether the features depend on one another.
