@mlsorensen
Contributor

@mlsorensen mlsorensen commented Sep 14, 2023

Description

Pushing this to get some feedback. It tries to provide a feature by reusing existing functionality, such as the VM power state reporting, rather than inventing a whole new messaging mechanism.

This PR allows KVM to detect guests that stop or crash, and immediately trigger a power state report to update the VM state in CloudStack.

In the current design, the Agent is responsible for sending pings on an interval. These pings contain a state report for each VM. If something changes between pings, it can take a long time for the VM state change to be discovered.

When the Agent first starts, it loads the ServerResource (the implementation is LibvirtComputingResource in the shipping agent) and initializes it. Later, it calls the ServerResource's getCurrentStatus() on an interval to collect the host and VM status and send it in a PingCommand. This change adds two interfaces: if the ServerResource implements ResourceStatusUpdater, the Agent registers itself as an AgentStatusUpdater, which gives the ServerResource a way to trigger an update by calling AgentStatusUpdater.triggerUpdate(). This keeps the monitoring and collection of VM status in the ServerResource while the Agent still handles sending the update, and it avoids changing the existing ServerResource and IAgentControl interfaces, so existing implementations don't have to implement anything new. There may be a cleaner way to do this.
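The registration handshake described above could look roughly like the following. This is a simplified sketch, not the code in this PR: only the names ResourceStatusUpdater, AgentStatusUpdater, triggerUpdate() and getCurrentStatus() come from the description. The method name registerStatusUpdater, the classes SketchResource and SketchAgent, and the string-based status are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Implemented by the agent: lets a resource request an immediate status report.
interface AgentStatusUpdater {
    void triggerUpdate();
}

// Optionally implemented by a ServerResource that wants to push updates.
// The method name registerStatusUpdater is an assumption for this sketch.
interface ResourceStatusUpdater {
    void registerStatusUpdater(AgentStatusUpdater updater);
}

// Simplified stand-in for the existing ServerResource interface, which stays unchanged.
interface ServerResource {
    String getCurrentStatus();
}

// Hypothetical stand-in for LibvirtComputingResource.
class SketchResource implements ServerResource, ResourceStatusUpdater {
    private AgentStatusUpdater updater;

    @Override
    public String getCurrentStatus() {
        return "vm-1=PowerOff";
    }

    @Override
    public void registerStatusUpdater(AgentStatusUpdater updater) {
        this.updater = updater;
    }

    // Called by the resource's own monitoring when it sees a relevant change.
    void onVmStopped() {
        if (updater != null) {
            updater.triggerUpdate();
        }
    }
}

// Hypothetical stand-in for Agent.
class SketchAgent implements AgentStatusUpdater {
    private final ServerResource resource;
    final List<String> sentPings = new ArrayList<>();

    SketchAgent(ServerResource resource) {
        this.resource = resource;
        // Only resources that opt in get a callback; others are untouched.
        if (resource instanceof ResourceStatusUpdater) {
            ((ResourceStatusUpdater) resource).registerStatusUpdater(this);
        }
    }

    @Override
    public void triggerUpdate() {
        // The agent still owns sending; here we just record the ping.
        sentPings.add("out-of-band:" + resource.getCurrentStatus());
    }
}
```

Because the agent checks with instanceof, a ServerResource that does not implement ResourceStatusUpdater keeps working exactly as before.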

In LibvirtComputingResource, we register an event listener and process domain lifecycle events, looking only for STOPPED events that are due to a crash or a shutdown. Domain stops caused by things like virsh destroy or CloudStack issuing a stop will have a detail of DESTROYED, or MIGRATED in the case of migration, rather than CRASHED or SHUTDOWN. I briefly considered adding code to track whether we were in the middle of a StopCommand or similar to filter out superfluous events; this seems simpler, at the expense of not being able to update on an admin virsh destroy.
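The filtering rule above can be sketched as a small predicate. The enums below are hypothetical stand-ins that cover only the event type and details named in this description; they are not the libvirt Java binding's types, and StopEventFilter is an illustrative name rather than a class from this PR.

```java
// Hypothetical stand-ins for libvirt lifecycle event types and STOPPED details.
enum EventType { STARTED, SUSPENDED, RESUMED, STOPPED }

enum StoppedDetail { SHUTDOWN, DESTROYED, CRASHED, MIGRATED }

final class StopEventFilter {
    private StopEventFilter() {
    }

    // True only for stops that CloudStack did not initiate itself:
    // a shutdown from inside the guest, or a crash.
    static boolean shouldTriggerUpdate(EventType type, StoppedDetail detail) {
        if (type != EventType.STOPPED || detail == null) {
            return false;
        }
        switch (detail) {
            case CRASHED:
            case SHUTDOWN:
                return true;
            default:
                // DESTROYED and MIGRATED are expected during normal operations.
                return false;
        }
    }
}
```

A listener would call shouldTriggerUpdate for each lifecycle event and call AgentStatusUpdater.triggerUpdate() only when it returns true.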

The PingCommand has been given a boolean to indicate whether the ping is out of band. This is important because the code that processes pings on the management server ignores pings that arrive more often than the expected interval (presumably to avoid state thrashing?). The boolean lets us force processing of out-of-band pings. It is therefore also important that we trigger these only on valid state change events, and don't issue superfluous updates every time a VM stops because CloudStack issued a stop, etc.
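To illustrate why the flag matters, here is a minimal sketch of a gate that drops pings arriving sooner than the expected interval unless they are marked out of band. It is not the management server's actual logic: the class PingGate, its fields, and the choice to reset the timer on an out-of-band ping are all assumptions made for this example.

```java
// Hypothetical model of interval-based ping throttling with an out-of-band override.
final class PingGate {
    private final long minIntervalMillis;
    private Long lastProcessedMillis; // null until the first ping is processed

    PingGate(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns true if the ping should be processed, false if it should be ignored.
    boolean shouldProcess(long nowMillis, boolean outOfBand) {
        boolean tooSoon = lastProcessedMillis != null
                && nowMillis - lastProcessedMillis < minIntervalMillis;
        if (tooSoon && !outOfBand) {
            return false; // regular ping inside the interval: ignore it
        }
        lastProcessedMillis = nowMillis; // assumption: out-of-band pings reset the timer
        return true;
    }
}
```

With a 60-second interval, a regular ping one second after the previous one is ignored, while an out-of-band ping at the same moment is processed.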

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested locally by shutting down a VM from within the guest, versus virsh destroy or stopping via the CloudStack API. Also tested that the listener still works after a libvirt restart.

There are no existing tests for VirtualMachinePowerStateSyncImpl, and the change here is very minor (reacting to the boolean).

It seems tricky to build a unit test for the Libvirt event listener.

Could possibly write a smoke test that SSHes into a VM, shuts it down, and checks the state of the VM via the API?

@codecov

codecov bot commented Sep 14, 2023

Codecov Report

Merging #7963 (f598354) into main (45616aa) will decrease coverage by 1.69%.
Report is 2 commits behind head on main.
The diff coverage is 6.55%.

@@             Coverage Diff              @@
##               main    #7963      +/-   ##
============================================
- Coverage     29.16%   27.48%   -1.69%     
+ Complexity    30377    28203    -2174     
============================================
  Files          5100     5100              
  Lines        358273   358325      +52     
  Branches      52304    52308       +4     
============================================
- Hits         104496    98481    -6015     
- Misses       239406   246236    +6830     
+ Partials      14371    13608     -763     
| Flag | Coverage Δ | |
|---|---|---|
| simulator-marvin-tests | 23.02% <4.91%> (-2.18%) | ⬇️ |
| uitests | 4.87% <ø> (ø) | |
| unit-tests | 14.40% <2.38%> (+<0.01%) | ⬆️ |

| Files Changed | Coverage Δ | |
|---|---|---|
| agent/src/main/java/com/cloud/agent/Agent.java | 0.00% <0.00%> (ø) | |
| ...ervisor/kvm/resource/LibvirtComputingResource.java | 18.44% <2.70%> (-0.25%) | ⬇️ |
| ...src/main/java/com/cloud/agent/api/PingCommand.java | 18.51% <14.28%> (-1.49%) | ⬇️ |
| ...com/cloud/vm/VirtualMachinePowerStateSyncImpl.java | 62.50% <25.00%> (-1.14%) | ⬇️ |
| ...n/java/com/cloud/vm/VirtualMachineManagerImpl.java | 37.47% <100.00%> (-0.13%) | ⬇️ |

... and 398 files with indirect coverage changes

Contributor

@DaanHoogland DaanHoogland left a comment

Code seems to do what you describe, @mlsorensen. To be doubly clear: this marks VMs as Stopped when they are stopped either from within the client OS or from the host CLI, leaving other functionality as is.
This is in addition to the migration events you recently submitted, is it?

@mlsorensen
Contributor Author

> Code seems to do what you describe, @mlsorensen. To be doubly clear: this marks VMs as Stopped when they are stopped either from within the client OS or from the host CLI, leaving other functionality as is.
> This is in addition to the migration events you recently submitted, is it?

It will notify the Management Server immediately if a VM is stopped from within the guest OS, or in the event of a crash. As for the admin CLI on the host, Libvirt events don't seem to distinguish between CloudStack destroying a VM and an admin running something like virsh destroy or virsh shutdown: both call the same Libvirt API with the same commands. To tell them apart, we would probably need to add some sort of tracking to filter out or ignore events when we know the CloudStack agent is in the middle of processing a shutdown.
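The tracking idea mentioned above is not part of this PR, but it could be sketched roughly as follows. The class InFlightStopTracker and its methods are hypothetical names for illustration only.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker for VMs that the agent itself is currently stopping.
final class InFlightStopTracker {
    // Concurrent set, since commands and libvirt events arrive on different threads.
    private final Set<String> stopping = ConcurrentHashMap.newKeySet();

    // Call before the agent starts handling a stop for this VM.
    void beginStop(String vmName) {
        stopping.add(vmName);
    }

    // Call once the stop has finished, whether it succeeded or failed.
    void endStop(String vmName) {
        stopping.remove(vmName);
    }

    // An event listener would skip stop events for which this returns true.
    boolean isExpectedStop(String vmName) {
        return stopping.contains(vmName);
    }
}
```

With something like this in place, a DESTROYED event for a VM that is not in the set could be treated as an admin action and reported, at the cost of extra state to keep consistent.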

@yadvr
Member

yadvr commented Sep 26, 2023

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7104

Contributor

@DaanHoogland DaanHoogland left a comment

code looks good, will do a monkey test

@blueorangutan

Packaging result [LL]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6185

@yadvr
Member

yadvr commented Sep 27, 2023

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[LL]Trillian test result (tid-6747)
Environment: kvm-alma8 (x1), Advanced Networking with Mgmt server a8
Total time taken: 36025 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7963-t6747-kvm-alma8.zip
Smoke tests completed. 108 look OK, 5 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_non_strict_host_anti_affinity | Failure | 107.83 | test_nonstrict_affinity_group.py |
| test_02_non_strict_host_affinity | Error | 83.27 | test_nonstrict_affinity_group.py |
| test_DeployVmAntiAffinityGroup | Error | 35.54 | test_affinity_groups.py |
| test_DeployVmAntiAffinityGroup_in_project | Error | 73.40 | test_affinity_groups_projects.py |
| test_03_deploy_and_scale_kubernetes_cluster | Failure | 22.59 | test_kubernetes_clusters.py |
| test_07_deploy_kubernetes_ha_cluster | Failure | 0.03 | test_kubernetes_clusters.py |
| test_08_upgrade_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_09_delete_kubernetes_ha_cluster | Failure | 0.05 | test_kubernetes_clusters.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 302.85 | test_hostha_kvm.py |

@blueorangutan

[SF] Trillian test result (tid-7725)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 52010 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7963-t7725-kvm-centos7.zip
Smoke tests completed. 112 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_invalid_upgrade_kubernetes_cluster | Failure | 3604.47 | test_kubernetes_clusters.py |
| test_02_upgrade_kubernetes_cluster | Failure | 3610.24 | test_kubernetes_clusters.py |
| test_03_deploy_and_scale_kubernetes_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_04_autoscale_kubernetes_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_05_basic_lifecycle_kubernetes_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_06_delete_kubernetes_cluster | Failure | 0.03 | test_kubernetes_clusters.py |
| test_07_deploy_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_08_upgrade_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_09_delete_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_10_vpc_tier_kubernetes_cluster | Failure | 47.76 | test_kubernetes_clusters.py |
| test_11_test_unmanaged_cluster_lifecycle | Error | 1.21 | test_kubernetes_clusters.py |
| ContextSuite context=TestKubernetesCluster>:teardown | Error | 54.43 | test_kubernetes_clusters.py |

@yadvr yadvr added this to the 4.19.0.0 milestone Sep 28, 2023
@yadvr yadvr merged commit 3694667 into apache:main Sep 28, 2023
shwstppr pushed a commit to shapeblue/cloudstack that referenced this pull request Oct 12, 2023
…pache#7963)

* Trigger out of band VM state update via libvirt event when VM stops

* Add License headers, refactor nested try

---------

Co-authored-by: Marcus Sorensen <mls@apple.com>
(cherry picked from commit 3694667)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>