Speedup resource count calculation #8903
@blueorangutan package
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9234
@blueorangutan test
@sureshanaparti a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
sureshanaparti
left a comment
code lgtm
DaanHoogland
left a comment
code lgtm
    }

    @Override
    public boolean updateCountByDeltaForIds(List<Long> ids, boolean increment, long delta) {
Would it make sense to remove increment and check for positive/negative numbers instead?
The higher-level call to updateResourceCountForAccount is only made in two places, so it wouldn't be a big refactor.
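To make the suggestion above concrete, here is a minimal sketch of how a signed delta could replace the `increment` flag. It is not CloudStack code: the class `SignedDeltaSketch`, its in-memory map (a stand-in for the resource count table), the two-argument overload, and the choice to return `false` for an empty id list are all assumptions made for illustration. Only the three-argument signature comes from the diff under review.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; the real DAO issues SQL against the database.
class SignedDeltaSketch {
    // Stand-in for resource count rows, keyed by row id (assumption).
    private final Map<Long, Long> counts = new HashMap<>();

    // Signature from the diff: direction and magnitude are passed separately.
    boolean updateCountByDeltaForIds(List<Long> ids, boolean increment, long delta) {
        // Delegate by folding the flag into the sign of the delta.
        return updateCountByDeltaForIds(ids, increment ? delta : -delta);
    }

    // Suggested shape (hypothetical overload): the sign of the delta carries the direction.
    boolean updateCountByDeltaForIds(List<Long> ids, long signedDelta) {
        if (ids == null || ids.isEmpty()) {
            return false; // assumption: nothing to update counts as "not updated"
        }
        for (Long id : ids) {
            // Add the signed delta to the current count, treating a missing row as 0.
            counts.merge(id, signedDelta, Long::sum);
        }
        return true;
    }

    long get(long id) {
        return counts.getOrDefault(id, 0L);
    }

    public static void main(String[] args) {
        SignedDeltaSketch dao = new SignedDeltaSketch();
        dao.updateCountByDeltaForIds(List.of(1L, 2L), true, 5); // flag-based call
        dao.updateCountByDeltaForIds(List.of(2L), -3);          // signed call
        System.out.println(dao.get(1L) + " " + dao.get(2L));
    }
}
```

With this shape, callers would pass a negative delta to decrement, so the flag and the magnitude can no longer disagree.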
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #8903 +/- ##
============================================
+ Coverage 13.17% 15.20% +2.03%
- Complexity 9214 12936 +3722
============================================
Files 2725 4880 +2155
Lines 258235 327065 +68830
Branches 40249 46220 +5971
============================================
+ Hits 34013 49726 +15713
- Misses 219913 270395 +50482
- Partials 4309 6944 +2635
@blueorangutan package
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9239
[SF] Trillian test result (tid-9811)
borisstoyanov
left a comment
LGTM. I've tested it with resource tagging assigned to the offerings for both volume and VM provisioning, and neither deploy nor teardown triggered any of the errors and locks spotted before. In my setup with cs-bench I created 50 domains and deployed 100 VMs in each, each VM containing 2 disks. Both the compute and disk offerings had resource limit tags.
With the patch
+----------+-------+-------+--------+--------+--------+-----------------+-----------------+-----------------+
| TYPE | COUNT | MIN | MAX | AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+----------+-------+-------+--------+--------+--------+-----------------+-----------------+-----------------+
| vm - All | 4600 | 2.078 | 33.925 | 11.619 | 7.487 | 24.935 | 27.468 | 29.964 |
+----------+-------+-------+--------+--------+--------+-----------------+-----------------+-----------------+
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
| TYPE | COUNT | MIN | MAX | AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
| vm-destroy - All | 4600 | 10.395 | 41.303 | 26.924 | 29.519 | 30.658 | 31.201 | 38.534 |
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
Without the patch (errors/failed requests left out of the equation)
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| TYPE | COUNT | MIN | MAX | AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| vm - Successful | 554 | 5.634 | 53.07 | 22.63 | 21.673 | 40.348 | 46.28 | 51.728 |
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
+------------------+-------+--------+--------+-------+--------+-----------------+-----------------+-----------------+
| TYPE | COUNT | MIN | MAX | AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+------------------+-------+--------+--------+-------+--------+-----------------+-----------------+-----------------+
| vm-destroy - All | 560 | 18.001 | 72.317 | 49.22 | 49.947 | 60.231 | 61.157 | 62.736 |
+------------------+-------+--------+--------+-------+--------+-----------------+-----------------+-----------------+
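As a rough reading of the tables above, the ratio of the average latencies can be computed directly from the reported AVG columns. The sketch below does only that arithmetic; the class and method names are hypothetical, and the comparison is approximate because the two runs cover different request counts (4600 versus 554 and 560) and the unpatched figures exclude failed requests.

```java
// Hypothetical helper that compares the AVG columns reported in the tables above.
class BenchRatio {
    // Ratio of the unpatched average to the patched average; values above 1 favour the patch.
    static double ratio(double withoutPatch, double withPatch) {
        return withoutPatch / withPatch;
    }

    public static void main(String[] args) {
        // AVG values copied from the tables: VM deploy and VM destroy.
        double deploy = ratio(22.63, 11.619);
        double destroy = ratio(49.22, 26.924);
        System.out.printf("vm deploy avg ratio:  %.2f%n", deploy);
        System.out.printf("vm destroy avg ratio: %.2f%n", destroy);
    }
}
```

Both ratios come out a little under 2, meaning the average deploy and destroy times in the patched run were roughly half of those in the unpatched run.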
* Speed up resource count calculation
* Refactor resource count calculation
* Start transaction for updateCountByDeltaForIds
Description
This PR fixes the issues that occur when the increment/decrement methods are waiting for a lock on the domain tables while ResourceCountCheckTask is running at the same time. The issue appears when innodb_lock_wait_timeout is many times smaller than the time it takes for recalculateDomainResourceCount to complete (see the steps below on how to reproduce the error).
We fix this by removing unnecessary locks and simplifying count updates.
As of now, calculating the resource count for the root domain takes a lock on the entire table.
This PR also splits the domain count calculation transaction into multiple transactions and locks. This is done by breaking up the domain count calculation process by:
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
* time cmk update resourcecount domainid=1
* Set innodb_lock_wait_timeout to a value a few seconds less than the time it took for the above request to complete.
* [...] innodb_lock_wait_timeout change to take effect.
* In parallel to the above requests, execute cmk update resourcecount domainid=1 to trigger resource count recalculation while VMs are getting created or destroyed.
* [...] ClientPreparedStatement: grep "ClientPreparedStatement" vmops.log
Results
With patch - creation of VM in stopped state
Without patch