Fix libvirt domain event listener by properly processing events #8437
7eeabc6 to 7228071
Codecov Report
Attention:
Additional details and impacted files
@@ Coverage Diff @@
## main #8437 +/- ##
============================================
- Coverage 30.80% 30.71% -0.10%
- Complexity 33981 34021 +40
============================================
Files 5341 5341
Lines 374864 377364 +2500
Branches 54518 55347 +829
============================================
+ Hits 115485 115906 +421
- Misses 244114 246166 +2052
- Partials 15265 15292 +27
Flags with carried forward coverage won't be shown. |
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/LibvirtConnection.java
|
@blueorangutan package |
|
@harikrishna-patnala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8197 |
DaanHoogland
left a comment
clgtm, no event handling changes (other than FAILED) should happen. needs testing though.
|
@blueorangutan package |
|
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
weizhouapache
left a comment
code lgtm
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8208 |
|
@blueorangutan test ol9 kvm-ol9 keepEnv |
|
@DaanHoogland [SL] unsupported parameters provided. Supported mgmt server os are: |
|
@blueorangutan test alm9 kvm-alma9 keepEnv |
|
@DaanHoogland [SL] unsupported parameters provided. Supported mgmt server os are: |
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/LibvirtConnection.java
|
@blueorangutan test alma9 kvm-alma9 keepEnv |
|
@DaanHoogland a [SL] Trillian-Jenkins test job (alma9 mgmt + kvm-alma9) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-8739)
|
|
@blueorangutan test ol9 kvm-ol9 |
|
@rohityadavcloud [SL] unsupported parameters provided. Supported mgmt server os are: |
|
Tested out-of-band shutdown of a VM. Is there anything else that should be tested here, @mlsorensen? |
|
I tested libvirt restart. This triggers an agent restart (as it did before) so it doesn't really test well what happens if connection to libvirt is lost, but the idea is that any LibvirtConnection should set up event processing again, if needed. |
OK, now focussing on 4.19, but I'll try to devise a (more intrusive) test for this later. |
harikrishna-patnala
left a comment
LGTM, also tested the VM migrations with local storage.
tested it, the agent restarts after libvirtd restart ( |
sureshanaparti
left a comment
Tested some VM / Volumes operations from CloudStack, including the live migration between local pools (validates the issue #7942, caused by Library.initEventLoop() earlier), and out of band shutdown. No issues observed. Changes LGTM.
|
@blueorangutan package |
|
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8459 |
|
Sorry everyone, didn't mean to disappear on this and ghost any feedback. I initially had some time to look at the problem, created the PR, and promptly switched context to other issues. Thanks for taking the time to review and do some testing on it @sureshanaparti @harikrishna-patnala @DaanHoogland @GutoVeronezi @weizhouapache @rohityadavcloud |
|
cc @JoaoJandre in case you want to consider this fix for 4.18.2.0. I'm not sure whether it would be risky to have it in the 4.18 branch, so I've shifted this to the 4.19 branch. |
…he#8437) * Fix libvirt domain event listener by properly processing events * Add javadoc for setupEventListener --------- Co-authored-by: Marcus Sorensen <mls@apple.com>
Description
This PR fixes the libvirt event processing to handle changes in Libvirt domains (and other events if registered).
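As a rough illustration of what "processing events" means here, the sketch below registers an event loop once and then polls it from a background thread. It is not the code from this PR. EventLoopApi and EventLoopRunner are hypothetical names; the two interface methods are assumed to play the roles of Library.initEventLoop() and Library.processEvent() from the libvirt Java bindings. Only the method name setupEventListener is taken from the commit message.

```java
/** Hypothetical stand-in for the libvirt-java calls (assumption, not the real API). */
interface EventLoopApi {
    void initEventLoop() throws Exception; // role of Library.initEventLoop()
    void processEvent() throws Exception;  // role of Library.processEvent(): one loop iteration
}

/** Hypothetical helper: registers the event loop once, then keeps polling it. */
final class EventLoopRunner {
    private final EventLoopApi api;
    private boolean initialized = false;
    private volatile boolean running = false;
    private Thread thread;

    EventLoopRunner(EventLoopApi api) {
        this.api = api;
    }

    /** Safe to call repeatedly, e.g. before every attempt to open a connection. */
    synchronized void setupEventListener() throws Exception {
        if (!initialized) {
            // Register the event loop before any connection is opened.
            api.initEventLoop();
            initialized = true;
        }
        if (thread != null && thread.isAlive()) {
            return; // already polling
        }
        running = true;
        thread = new Thread(() -> {
            // Without this polling, registered events (and keepalives) are never serviced.
            while (running) {
                try {
                    api.processEvent();
                } catch (Exception e) {
                    break; // sketch only: stop polling on the first failure
                }
            }
        }, "event-loop-sketch");
        thread.setDaemon(true);
        thread.start();
    }

    /** Asks the polling thread to stop and waits briefly for it to exit. */
    synchronized void stop() {
        running = false;
        if (thread != null) {
            try {
                thread.join(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Calling setupEventListener() a second time does not register the loop again; it only restarts the polling thread if the thread has died.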
There is some history around this. Initially it seems there was an event loop initialization added in an earlier PR that supports live block copy. This introduced Library.initEventLoop(); to LibvirtConnection, however it never polled for events, and this caused the keepalive issue warned about by libvirt doc here:
Thus it was found that long running VM migrations would fail due to keep alive issues, which resulted in #7945
In the meantime, support for detecting out of band VM shutdowns was being developed to also utilize events, and in parallel was relying on the above Library.initEventLoop(); in LibvirtConnection. This new feature added the missing polling, but ultimately doesn't work (fails gracefully to provide new event handling) because Library.initEventLoop() was lost in parallel.
This PR moves the thread responsible for running the Libvirt event loop out of LibvirtComputingResource and into LibvirtConnection, and reintroduces Library.initEventLoop(), which shouldn't result in keep alive errors now that we are properly polling for events. It also sets up the event loop before any Libvirt connection as documentation requires.
Finally, when a connection is tested using conn.getVersion(), if an exception is thrown we simply create a new connection (but no attempt is made to clean up the old connection). This was resulting in file handle leaks in cases such as the keep alive timeout that was causing connections to go stale, so now we try to close the connection before creating a new one.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Tested with main by deleting VMs out of band, also checking for file handle leaks that were apparent in the previous observed introduction of Library.initEventLoop() using lsof +E -aUc java on the hypervisor hosts.
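The file handle leak checked for above came from replacing a stale connection without closing it. A minimal sketch of the close-before-reconnect idea follows. It is not the PR's code. Conn, ConnectionCache and the opener factory are hypothetical names used for illustration; getVersion() and close() are assumed to behave like the corresponding calls on a libvirt connection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the parts of a libvirt connection used here. */
interface Conn {
    long getVersion() throws Exception; // cheap liveness probe, as in the description
    void close() throws Exception;      // releases the underlying file handles
}

/** Hypothetical cache that hands out one connection per URI. */
final class ConnectionCache {
    private final Map<String, Conn> cache = new HashMap<>();
    private final Function<String, Conn> opener; // hypothetical factory: URI -> new connection

    ConnectionCache(Function<String, Conn> opener) {
        this.opener = opener;
    }

    synchronized Conn get(String uri) {
        Conn conn = cache.get(uri);
        if (conn != null) {
            try {
                conn.getVersion(); // probe: throws if the connection has gone stale
                return conn;
            } catch (Exception probeFailure) {
                // Close the stale connection first so its file handles are not leaked.
                try {
                    conn.close();
                } catch (Exception ignored) {
                    // best effort: the connection is being discarded anyway
                }
                cache.remove(uri);
            }
        }
        conn = opener.apply(uri);
        cache.put(uri, conn);
        return conn;
    }
}
```

The close is best effort by design: a failure to close a connection that is already broken should not prevent a fresh one from being opened.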