Skip to content

Conversation

@vishesh92
Copy link
Member

@vishesh92 vishesh92 commented Dec 5, 2023

Description

Needs some rework which will be done after #8362 is merged.

This PR fixes the resource count discrepancies which happen when resource count is being incremented or decremented and recalculation of resource count happens at the same time.

Requires 2 Management servers to reproduce

  1. On MS1, Add a debugger at
    for (ResourceCountVO rowToUpdate : rowsToUpdate) {
  2. Deploy a VM.
  3. When the debugger stops at above line, execute cmk update resourcecount domainid=`1on MS2 to trigger recalculation of resource count (this also happens periodically. cmk command triggers the same method on demand). cmk command will get blocked because of the debugger.
  4. Resume the debugger.
  5. cmk command will complete and you will see the discrepancy error in logs.

You will see a log line with the following text

Discrepency in the resource count has been detected (original count = 1 correct count = 2) for Type = user_vm for Domain ID = 2 is fixed during resource count recalculation

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@vishesh92 vishesh92 changed the base branch from main to 4.18 December 5, 2023 09:10
@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link

codecov bot commented Dec 5, 2023

Codecov Report

Attention: Patch coverage is 75.86207% with 49 lines in your changes are missing coverage. Please review.

Project coverage is 30.92%. Comparing base (6dc3d06) to head (7dcc999).

Files Patch % Lines
.../src/main/java/com/cloud/vm/UserVmManagerImpl.java 62.96% 9 Missing and 1 partial ⚠️
.../cloud/resourcelimit/ResourceLimitManagerImpl.java 87.09% 1 Missing and 7 partials ⚠️
...n/java/com/cloud/vm/VirtualMachineManagerImpl.java 61.11% 4 Missing and 3 partials ⚠️
...va/com/cloud/resourcelimit/CheckedReservation.java 73.07% 5 Missing and 2 partials ⚠️
...main/java/com/cloud/storage/dao/VolumeDaoImpl.java 64.70% 4 Missing and 2 partials ⚠️
...cloudstack/reservation/dao/ReservationDaoImpl.java 83.78% 6 Missing ⚠️
.../src/main/java/com/cloud/vm/dao/UserVmDaoImpl.java 72.72% 1 Missing and 2 partials ⚠️
...g/apache/cloudstack/reservation/ReservationVO.java 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #8302       +/-   ##
=============================================
+ Coverage     15.88%   30.92%   +15.04%     
- Complexity    15718    33688    +17970     
=============================================
  Files          5172     5397      +225     
  Lines        364426   379499    +15073     
  Branches      53574    55373     +1799     
=============================================
+ Hits          57874   117354    +59480     
+ Misses       299648   246565    -53083     
- Partials       6904    15580     +8676     
Flag Coverage Δ
simulator-marvin-tests 24.41% <71.92%> (?)
uitests 4.34% <ø> (ø)
unit-tests 16.88% <41.87%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@blueorangutan
Copy link

Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 7924

@vishesh92 vishesh92 force-pushed the fix-resource-count-discrepancies branch 2 times, most recently from 45d6668 to e7ec0d4 Compare December 5, 2023 10:17
@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7928

@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@yadvr
Copy link
Member

yadvr commented Dec 6, 2023

@vishesh92 does it affect only 4.18, or we're aiming to fix in main/4.19+ ?

@vishesh92 vishesh92 added this to the 4.19.0.0 milestone Dec 6, 2023
@vishesh92
Copy link
Member Author

@vishesh92 does it affect only 4.18, or we're aiming to fix in main/4.19+ ?

@rohityadavcloud Both 4.18 & main are affected. This requires a migration which will make the upgrade path a little complex with 4.18.2. So, I have raised this PR against main/4.19.

_resourceLimitMgr.incrementResourceCount(accountId, ResourceType.cpu, displayVm, cpu);
_resourceLimitMgr.incrementResourceCount(accountId, ResourceType.memory, displayVm, memory);
if (!VirtualMachineManager.ResourceCountRunningVMsonly.value()) {
Transaction.execute(new TransactionCallbackNoReturn() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to do this without incrementing all three in a transaction would require adding columns for each reservation type (user_vm, cpu, memory) in user_vm table since cpu & memory are calculated using the user_vm table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar solution for primary_storage & volume as well.

@vishesh92 vishesh92 marked this pull request as ready for review December 11, 2023 04:13
@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@rohityadavcloud a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8855

@blueorangutan
Copy link

[SF] Trillian test result (tid-9393)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46035 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9393-xenserver-71.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-9395)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 47264 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9395-kvm-centos7.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-9394)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 52646 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9394-vmware-67u3.zip
Smoke tests completed. 128 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_balanced_drs_algorithm Error 435.05 test_cluster_drs.py

@vishesh92
Copy link
Member Author

@blueorangutan test matrix

@blueorangutan
Copy link

@vishesh92 a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-9401)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44625 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9401-kvm-centos7.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-9399)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 47034 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9399-xenserver-71.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-9400)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server r8
Total time taken: 63088 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9400-vmware-67u3.zip
Smoke tests completed. 127 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_balanced_drs_algorithm Error 435.71 test_cluster_drs.py
test_10_reboot_cpvm_forced Error 19.30 test_ssvm.py

@github-actions
Copy link

github-actions bot commented Mar 8, 2024

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@yadvr
Copy link
Member

yadvr commented Mar 12, 2024

@vishesh92 can you address the merge conflict and re-run the packaging and smoktests

@vishesh92
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8916

@vishesh92
Copy link
Member Author

@blueorangutan test

@blueorangutan
Copy link

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-9451)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 48193 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8302-t9451-kvm-centos7.zip
Smoke tests completed. 128 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_hostha_enable_ha_when_host_disabled Error 3.79 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 306.30 test_hostha_kvm.py

@yadvr yadvr marked this pull request as ready for review March 13, 2024 12:52
@yadvr yadvr merged commit e87c6cf into apache:main Mar 13, 2024
@yadvr yadvr deleted the fix-resource-count-discrepancies branch March 13, 2024 12:52
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Mar 21, 2024
* Fix resource count discrepancies

* Fixup while removing vm

* Fix discrepancies when starting VMs

* Fixup tests

* Fix failing tests

* Don't take lock when amount is negative

---------

Co-authored-by: dahn <daan@onecht.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants