Skip to content

Conversation

@vishesh92
Copy link
Member

@vishesh92 vishesh92 commented Apr 12, 2024

Description

This PR fixes the issues which occur when increment/decrement methods are waiting for a lock on domain tables and ResourceCountCheckTask is running at the same time. This issue appears when innodb_lock_wait_timeout is many times less than the time it takes for recalculateDomainResourceCount to complete. (Check steps below on how to reproduce the error).

com.cloud.utils.exception.CloudRuntimeException: DB Exception on: com.mysql.cj.jdbc.ClientPreparedStatement: SELECT resource_count.id, resource_count.type, resource_count.account_i
d, resource_count.domain_id, resource_count.count, resource_count.tag FROM resource_count WHERE resource_count.id IN (33,4785,3513,4845)  FOR UPDATE 
        at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:438)
        at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:366)
        at com.cloud.utils.db.GenericDaoBase.search(GenericDaoBase.java:355)
        at com.cloud.utils.db.GenericDaoBase.lockRows(GenericDaoBase.java:341)
        .....

We do this by removing unnecessary locks and simplifying count updates.

As of now, to calculate the resource count for root domain, we are taking the lock on the entire table.
This PR also splits the domain count calculation transaction into multiple transactions locks. This is done by breaking up the domain count calculation process by:

  1. Calculate resource count for all accounts in a domain
  2. Calculate resource count for all child domains in a domain
  3. In a transaction, fetch the child domain & accounts count and update the count if required

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

  1. Setup multiple domains & networks. And update their limits. I used the below command.
csbench -create -domain -network -limits
# csbench-config
numdomains = 10
numnetworks = 1
numvms = 100
startvm = false  # For faster creation of VMs
  1. Check the time it takes for resource count calculation to run. To manually trigger resource count calculation, run this command:
time cmk update resourcecount domainid=1
  1. Update innodb_lock_wait_timeout to a value less than by a few seconds it took for the above request to complete.
SET GLOBAL innodb_lock_wait_timeout=3;
  1. Restart the management server for innodb_lock_wait_timeout change to take effect.
  2. Run the below commands.
csbench -create -vm -workers=50
csbench -teardown -vm -workers=50

In parallel to above requests, execute cmk update resourcecount domainid=1 to trigger resource count recalculation while VMs are getting created or destroyed.

  1. Check logs for ClientPreparedStatement.
grep "ClientPreparedStatement" vmops.log

Results

With patch - creation of VM in stopped state
+----------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| TYPE     | COUNT |   MIN |    MAX |   AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+----------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| vm - All |  1000 | 1.475 | 13.879 | 4.012 |  2.955 |           8.397 |           9.648 |          12.616 |
+----------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
+------------------+-------+--------+-------+--------+--------+-----------------+-----------------+-----------------+
| TYPE             | COUNT |    MIN |   MAX |    AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+------------------+-------+--------+-------+--------+--------+-----------------+-----------------+-----------------+
| vm-destroy - All |  1000 | 15.301 | 29.74 | 20.129 | 21.438 |          21.625 |          22.681 |          28.547 |
+------------------+-------+--------+-------+--------+--------+-----------------+-----------------+-----------------+
Without patch
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| TYPE            | COUNT |   MIN |    MAX |   AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| vm - All        |  1000 | 2.039 | 17.463 | 5.656 |   4.77 |          10.484 |          11.758 |          13.645 |
| vm - Successful |   988 | 2.039 | 17.463 |  5.67 |  4.773 |          10.489 |          11.791 |          13.753 |
| vm - Failed     |    12 | 3.181 |  5.414 | 4.493 |  4.679 |            5.21 |           5.313 |           5.313 |
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
| TYPE             | COUNT |    MIN |    MAX |    AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
| vm-destroy - All |   996 | 10.295 | 29.176 | 20.111 | 21.417 |          21.655 |           22.27 |          28.691 |
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+

@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9234

@sureshanaparti
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@sureshanaparti a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

}

@Override
public boolean updateCountByDeltaForIds(List<Long> ids, boolean increment, long delta) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to remove increment and check for positive/negative numbers?

the higher level call to updateResourceCountForAccount is only done in two places so it wouldn't be a big refactor.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 87.64045% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 15.20%. Comparing base (a5508ac) to head (b1797bf).
Report is 558 commits behind head on main.

Files Patch % Lines
.../cloud/configuration/dao/ResourceCountDaoImpl.java 80.00% 5 Missing and 2 partials ⚠️
.../cloud/resourcelimit/ResourceLimitManagerImpl.java 91.30% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8903      +/-   ##
============================================
+ Coverage     13.17%   15.20%   +2.03%     
- Complexity     9214    12936    +3722     
============================================
  Files          2725     4880    +2155     
  Lines        258235   327065   +68830     
  Branches      40249    46220    +5971     
============================================
+ Hits          34013    49726   +15713     
- Misses       219913   270395   +50482     
- Partials       4309     6944    +2635     
Flag Coverage Δ
simulator-marvin-tests 15.20% <87.64%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9239

@blueorangutan
Copy link

[SF] Trillian test result (tid-9811)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48680 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8903-t9811-kvm-centos7.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Copy link
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I've tested it with resource tagging assigned to the offerings for both the volumes and VMs provisioning and both deploy and teardown did not prompt any of the spotted errors and locks before. In my setup with cs-bench I've created 50 domains and deployed 100 VM in each. Each VM containing 2 disks. Both compute and disk offerings had a resource limit tags.

With the patch

+----------+-------+-------+--------+--------+--------+-----------------+-----------------+-----------------+
| TYPE     | COUNT |   MIN |    MAX |    AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+----------+-------+-------+--------+--------+--------+-----------------+-----------------+-----------------+
| vm - All |  4600 | 2.078 | 33.925 | 11.619 |  7.487 |          24.935 |          27.468 |          29.964 |
+----------+-------+-------+--------+--------+--------+-----------------+-----------------+-----------------+
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
| TYPE             | COUNT |    MIN |    MAX |    AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+
| vm-destroy - All |  4600 | 10.395 | 41.303 | 26.924 | 29.519 |          30.658 |          31.201 |          38.534 |
+------------------+-------+--------+--------+--------+--------+-----------------+-----------------+-----------------+

Without the patch - removed errors/failed out of the equation

+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| TYPE            | COUNT |   MIN |    MAX |   AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
| vm - Successful |   554 | 5.634 |  53.07 | 22.63 | 21.673 |          40.348 |           46.28 |          51.728 |
+-----------------+-------+-------+--------+-------+--------+-----------------+-----------------+-----------------+
+------------------+-------+--------+--------+-------+--------+-----------------+-----------------+-----------------+
| TYPE             | COUNT |    MIN |    MAX |   AVG | MEDIAN | 90TH PERCENTILE | 95TH PERCENTILE | 99TH PERCENTILE |
+------------------+-------+--------+--------+-------+--------+-----------------+-----------------+-----------------+
| vm-destroy - All |   560 | 18.001 | 72.317 | 49.22 | 49.947 |          60.231 |          61.157 |          62.736 |
+------------------+-------+--------+--------+-------+--------+-----------------+-----------------+-----------------+

@yadvr yadvr added this to the 4.20.0.0 milestone Apr 17, 2024
@yadvr yadvr marked this pull request as ready for review April 17, 2024 08:51
@yadvr yadvr merged commit ebaf5a4 into apache:main Apr 17, 2024
@yadvr yadvr deleted the speedup-resource-count-calculation branch April 17, 2024 08:51
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request May 3, 2024
* Speed up resource count calculation

* Refactor resource count calculation

* Start transaction for updateCountByDeltaForIds
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants