Skip to content

Conversation

@mlsorensen
Copy link
Contributor

Description

This PR fixes the libvirt event processing to handle changes in Libvirt domains (and other events if registered).

There is some history around this. Initially it seems there was an event loop initialization added in an earlier PR that supports live block copy. This introduced Library.initEventLoop(); to LibvirtConnection, however it never polled for events, and this caused the keepalive issue warned about by libvirt doc here:

For proper event handling, it is important that the event implementation is registered before a connection to the Hypervisor is opened.

Once registered, the application has to invoke [virEventRunDefaultImpl](https://libvirt.org/html/libvirt-libvirt-event.html#virEventRunDefaultImpl)() in a loop to process events. Failure to do so may result in connections being closed unexpectedly as a result of keepalive timeout.

Thus it was found that long running VM migrations would fail due to keep alive issues, which resulted in #7945

In the meantime, support for detecting out of band VM shutdowns was being developed to also utilize events, and in parallel was relying on the above Library.initEventLoop(); in LibvirtConnection. This new feature added the missing polling, but ultimately doesn't work (fails gracefully to provide new event handling) because Library.initEventLoop() was lost in parallel.

This PR moves the thread responsible for running the Libvirt event loop out of LibvirtComputingResource and into LibvirtConnection, and reintroduces Library.initEventLoop(), which shouldn't result in keep alive errors now that we are properly polling for events. It also sets up the event loop before any Libvirt connection as documentation requires.

Finally, when a connection is tested using conn.getVersion(), if an exception is thrown we simply create a new connection (but no attempt is made to clean up the old connection). This was resulting in file handle leaks in cases such as the keep alive timeout that was causing connections to go stale, so now we try to close the connection before creating a new one.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tested with main by deleting VMs out of band, also checking for file handle leaks that were apparent in the previous observed introduction of Library.initEventLoop() using lsof +E -aUc java on the hypervisor hosts.

@mlsorensen mlsorensen force-pushed the main-fix-domain-events branch from 7eeabc6 to 7228071 Compare January 3, 2024 19:11
@codecov
Copy link

codecov bot commented Jan 3, 2024

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (9d4f071) 30.80% compared to head (96eeed6) 30.71%.
Report is 31 commits behind head on main.

Files Patch % Lines
...oud/hypervisor/kvm/resource/LibvirtConnection.java 0.00% 18 Missing ⚠️
...ervisor/kvm/resource/LibvirtComputingResource.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8437      +/-   ##
============================================
- Coverage     30.80%   30.71%   -0.10%     
- Complexity    33981    34021      +40     
============================================
  Files          5341     5341              
  Lines        374864   377364    +2500     
  Branches      54518    55347     +829     
============================================
+ Hits         115485   115906     +421     
- Misses       244114   246166    +2052     
- Partials      15265    15292      +27     
Flag Coverage Δ
simulator-marvin-tests 24.65% <0.00%> (-0.08%) ⬇️
uitests 4.35% <ø> (-0.04%) ⬇️
unit-tests 16.46% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@harikrishna-patnala
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8197

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm, no event handling changes (other than FAILED) should happen. needs testing though.

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8208

@DaanHoogland
Copy link
Contributor

@blueorangutan test ol9 kvm-ol9 keepEnv

@blueorangutan
Copy link

@DaanHoogland [SL] unsupported parameters provided. Supported mgmt server os are: centos7, centos6, suse15, alma8, ubuntu18, ubuntu22, ubuntu20, rocky8, alma9. Supported hypervisors are: kvm-centos6, kvm-centos7, kvm-rocky8, kvm-alma8, kvm-alma9, kvm-ubuntu18, kvm-ubuntu20, kvm-ubuntu22, kvm-suse15, vmware-55u3, vmware-60u2, vmware-65u2, vmware-67u3, vmware-70u1, vmware-70u2, vmware-70u3, vmware-80, vmware-80u1, xenserver-65sp1, xenserver-71, xenserver-74, xcpng74, xcpng76, xcpng80, xcpng81, xcpng82

@DaanHoogland
Copy link
Contributor

@blueorangutan test alm9 kvm-alma9 keepEnv

@blueorangutan
Copy link

@DaanHoogland [SL] unsupported parameters provided. Supported mgmt server os are: centos7, centos6, suse15, alma8, ubuntu18, ubuntu22, ubuntu20, rocky8, alma9. Supported hypervisors are: kvm-centos6, kvm-centos7, kvm-rocky8, kvm-alma8, kvm-alma9, kvm-ubuntu18, kvm-ubuntu20, kvm-ubuntu22, kvm-suse15, vmware-55u3, vmware-60u2, vmware-65u2, vmware-67u3, vmware-70u1, vmware-70u2, vmware-70u3, vmware-80, vmware-80u1, xenserver-65sp1, xenserver-71, xenserver-74, xcpng74, xcpng76, xcpng80, xcpng81, xcpng82

@DaanHoogland
Copy link
Contributor

@blueorangutan test alma9 kvm-alma9 keepEnv

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (alma9 mgmt + kvm-alma9) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-8739)
Environment: kvm-alma9 (x2), Advanced Networking with Mgmt server a9
Total time taken: 53509 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8437-t8739-kvm-alma9.zip
Smoke tests completed. 120 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_00_deploy_vm_root_resize Error 3661.56 test_deploy_vm_root_resize.py

@yadvr
Copy link
Member

yadvr commented Jan 8, 2024

@blueorangutan test ol9 kvm-ol9

@blueorangutan
Copy link

@rohityadavcloud [SL] unsupported parameters provided. Supported mgmt server os are: centos7, centos6, suse15, alma8, ubuntu18, ubuntu22, ubuntu20, rocky8, alma9. Supported hypervisors are: kvm-centos6, kvm-centos7, kvm-rocky8, kvm-alma8, kvm-alma9, kvm-ubuntu18, kvm-ubuntu20, kvm-ubuntu22, kvm-suse15, vmware-55u3, vmware-60u2, vmware-65u2, vmware-67u3, vmware-70u1, vmware-70u2, vmware-70u3, vmware-80, vmware-80u1, xenserver-65sp1, xenserver-71, xenserver-74, xcpng74, xcpng76, xcpng80, xcpng81, xcpng82

@DaanHoogland
Copy link
Contributor

teste out-of-bounds shutdown of a vm. Is there anything else that should be tested here, @mlsorensen ?

@mlsorensen
Copy link
Contributor Author

I tested libvirt restart. This triggers an agent restart (as it did before) so it doesn't really test well what happens if connection to libvirt is lost, but the idea is that any LibvirtConnection should set up event processing again, if needed.

@DaanHoogland
Copy link
Contributor

I tested libvirt restart. This triggers an agent restart (as it did before) so it doesn't really test well what happens if connection to libvirt is lost, but the idea is that any LibvirtConnection should set up event processing again, if needed.

ok, now focussing on 4.19, but I'll try to device a (more intrusive) test for this later.

Copy link
Contributor

@harikrishna-patnala harikrishna-patnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, also tested the VM migrations with local storage.

@sureshanaparti
Copy link
Contributor

I tested libvirt restart. This triggers an agent restart (as it did before) so it doesn't really test well what happens if connection to libvirt is lost, but the idea is that any LibvirtConnection should set up event processing again, if needed.

tested it, the agent restarts after libvirtd restart (service libvirtd restart). Can the agent retry / wait for libvirtd sometime, and setup connection & event listeners?

Copy link
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested some VM / Volumes operations from CloudStack, including the live migration between local pools (validates the issue #7942, caused by Library.initEventLoop() earlier), and out of band shutdown. No issues observed. Changes LGTM.

@DaanHoogland DaanHoogland added this to the 4.18.2.0 milestone Jan 16, 2024
@yadvr yadvr closed this Jan 29, 2024
@yadvr yadvr reopened this Jan 29, 2024
@yadvr
Copy link
Member

yadvr commented Jan 29, 2024

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8459

@mlsorensen
Copy link
Contributor Author

Sorry everyone, didn't mean to disappear on this and ghost any feedback. I initially had some time to look at the problem, created the PR, and promptly switched context to other issues. Thanks for taking the time to review and do some testing on it @sureshanaparti @harikrishna-patnala @DaanHoogland @GutoVeronezi @weizhouapache @rohityadavcloud

@yadvr yadvr changed the base branch from main to 4.19 February 5, 2024 07:59
@yadvr yadvr modified the milestones: 4.18.2.0, 4.19.1.0 Feb 5, 2024
@yadvr yadvr merged commit 9f1b34a into apache:4.19 Feb 5, 2024
@yadvr
Copy link
Member

yadvr commented Feb 5, 2024

cc @JoaoJandre if you want to consider this fix for 4.18.2.0, I'm not sure if it would be risky to have it in 4.18 branch so I've shifted this to 4.19 branch.

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Feb 7, 2024
…he#8437)

* Fix libvirt domain event listener by properly processing events

* Add javadoc for setupEventListener

---------

Co-authored-by: Marcus Sorensen <mls@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

8 participants