Skip to content

server: check if there are active nics before network GC#8204

Merged
DaanHoogland merged 3 commits intoapache:4.18from
weizhouapache:4.18-fix-network-gc
Nov 29, 2023
Merged

server: check if there are active nics before network GC#8204
DaanHoogland merged 3 commits intoapache:4.18from
weizhouapache:4.18-fix-network-gc

Conversation

@weizhouapache
Copy link
Member

Description

This PR fixes #8200

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@weizhouapache weizhouapache added this to the 4.18.2.0 milestone Nov 9, 2023
@weizhouapache weizhouapache changed the base branch from main to 4.18 November 9, 2023 08:44
@weizhouapache
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link

codecov bot commented Nov 9, 2023

Codecov Report

Merging #8204 (f2992f3) into 4.18 (e790047) will decrease coverage by 0.01%.
Report is 3 commits behind head on 4.18.
The diff coverage is 45.00%.

@@             Coverage Diff              @@
##               4.18    #8204      +/-   ##
============================================
- Coverage     13.10%   13.09%   -0.01%     
- Complexity     9121     9124       +3     
============================================
  Files          2720     2720              
  Lines        257599   257636      +37     
  Branches      40158    40165       +7     
============================================
+ Hits          33747    33750       +3     
- Misses       219588   219620      +32     
- Partials       4264     4266       +2     
Files Coverage Δ
.../main/java/com/cloud/network/NetworkModelImpl.java 11.11% <0.00%> (ø)
...ema/src/main/java/com/cloud/vm/dao/NicDaoImpl.java 23.44% <50.00%> (-0.06%) ⬇️

... and 11 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7692

@DaanHoogland
Copy link
Contributor

Code looks good @weizhouapache , one concern about the situation the a VM is being created but does not have any nics yet. Is this the situation you describe in #8200? If GC hits at that time, it might be possible that the network is going to be teared down, while the process creating the VM thinks it can continue.
Am I right?

@weizhouapache
Copy link
Member Author

Code looks good @weizhouapache , one concern about the situation the a VM is being created but does not have any nics yet.

This situlation has already been handled.

Is this the situation you describe in #8200? If GC hits at that time, it might be possible that the network is going to be teared down, while the process creating the VM thinks it can continue. Am I right?

This PR aims to handle the situation that nics_count is 0 but there are running /migrating/stopping vms. Ths nics_count should not be 0 but we have seens this issue sometimes already. The root cause is still unknown. To prevent the network to be shutdown and VR to be stopped , we should consider the situation that nics_count is wrong as 0.

@DaanHoogland
Copy link
Contributor

Code looks good @weizhouapache , one concern about the situation the a VM is being created but does not have any nics yet.

This situlation has already been handled.

How has it?

Is this the situation you describe in #8200? If GC hits at that time, it might be possible that the network is going to be teared down, while the process creating the VM thinks it can continue. Am I right?

This PR aims to handle the situation that nics_count is 0 but there are running /migrating/stopping vms. Ths nics_count should not be 0 but we have seens this issue sometimes already. The root cause is still unknown. To prevent the network to be shutdown and VR to be stopped , we should consider the situation that nics_count is wrong as 0.

I understand this part and I approve of this PR because of that.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@DaanHoogland
Copy link
Contributor

Is there a reason this needs to stay in draft @weizhouapache ?

@weizhouapache
Copy link
Member Author

Code looks good @weizhouapache , one concern about the situation the a VM is being created but does not have any nics yet.

This situlation has already been handled.

How has it?

@DaanHoogland
When vm is being created, there might be two vm state: starting and stopped. For stopped vms, the nic is not considered active and it is not impacted if network is shutdown. For starting vms, there is already a check :

    if (_nicDao.countNicsForStartingVms(networkId) > 0) {

Is this the situation you describe in #8200? If GC hits at that time, it might be possible that the network is going to be teared down, while the process creating the VM thinks it can continue. Am I right?

This PR aims to handle the situation that nics_count is 0 but there are running /migrating/stopping vms. Ths nics_count should not be 0 but we have seens this issue sometimes already. The root cause is still unknown. To prevent the network to be shutdown and VR to be stopped , we should consider the situation that nics_count is wrong as 0.

I understand this part and I approve of this PR because of that.

@DaanHoogland
Copy link
Contributor

@DaanHoogland When vm is being created, there might be two vm state: starting and stopped. For stopped vms, the nic is not considered active and it is not impacted if network is shutdown. For starting vms, there is already a check :

    if (_nicDao.countNicsForStartingVms(networkId) > 0) {

Getting into "muggenzift"a area here but,
are we sure the nic is created before the check if a VR needs to be created is done? A vm is started (on deploy) and at that moment the VR is started if it is not running. I am looking for the possibility of a race condition there. Can it be that the GC doesn't see the nic (yet) and decides to shutdown the VR, while the VM start process has seen the network implemented and does not try to start the VR for it?

@weizhouapache
Copy link
Member Author

@DaanHoogland When vm is being created, there might be two vm state: starting and stopped. For stopped vms, the nic is not considered active and it is not impacted if network is shutdown. For starting vms, there is already a check :

    if (_nicDao.countNicsForStartingVms(networkId) > 0) {

Getting into "muggenzift"a area here but,
are we sure the nic is created before the check if a VR needs to be created is done? A vm is started (on deploy) and at that moment the VR is started if it is not running. I am looking for the possibility of a race condition there. Can it be that the GC doesn't see the nic (yet) and decides to shutdown the VR, while the VM start process has seen the network implemented and does not try to start the VR for it?

discussed with Daan, there is indeed a race condition that start vm when network is being shutdown, maybe there is an issue with it. It needs testing.

We have found an issue with my pr that running VRs should not be considered as ACS thread will shutdown the VRs in GC. I will update this PR.

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@weizhouapache weizhouapache marked this pull request as ready for review November 17, 2023 10:12
@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7783

@weizhouapache
Copy link
Member Author

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-8355)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44513 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8204-t8355-kvm-centos7.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_upgrade_kubernetes_ha_cluster Failure 658.73 test_kubernetes_clusters.py
test_08_migrate_vm Error 0.06 test_vm_life_cycle.py

@DaanHoogland
Copy link
Contributor

@harikrishna-patnala can you approve now?
@weizhouapache any test spec or advice possible?

@weizhouapache
Copy link
Member Author

@DaanHoogland @harikrishna-patnala @shwstppr
I did some testing, look good.

Here the steps

  • create a network and deploy/start a new vm
  • update the nics_count from 1 to 0 (update op_networks set nics_count =0 where id=295;)
  • wait for network GC to run

prior to this PR: network is shutdown (unexpected as there is a running VM)

2023-11-20 07:55:04,616 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Network-Scavenger-1:ctx-c2c336b0) (logid:0a493c1f) Lock is acquired for network Network {"id": 295, "name": "cks-001-network", "uuid": "0306e71a-d587-4437-84a7-3839fd0f53d8", "networkofferingid": 21} as a part of network shutdown
2023-11-20 07:55:04,623 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Network-Scavenger-1:ctx-c2c336b0) (logid:0a493c1f) Sending network shutdown to VirtualRouter
2023-11-20 07:55:04,625 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Network-Scavenger-1:ctx-c2c336b0) (logid:0a493c1f) Stopping router VM instance {"id":223,"instanceName":"r-223-VM","type":"DomainRouter","uuid":"74915ce2-bcf5-4f68-b2f5-89274093d10a"}

with this PR, network is not shutdown (expected as there is a running VM)

2023-11-20 09:27:50,359 DEBUG [c.c.n.NetworkModelImpl] (Network-Scavenger-1:ctx-8419537c) (logid:8b4be21a) Network id=295 is not ready for GC as it has vms that are not Stopped at the moment

stop the vm, and wait for network GC, network is shutdown (expected as there is not a running VM)

2023-11-20 09:37:50,394 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Network-Scavenger-1:ctx-f6ef312a) (logid:4343d2e6) Lock is acquired for network Network {"id": 210, "name": "admin-001", "uuid": "35f01c5c-f543-4c67-93ed-79bbeb52d7e1", "networkofferingid": 10} as a part of network shutdown
2023-11-20 09:37:50,401 DEBUG [o.a.c.e.o.NetworkOrchestrator] (Network-Scavenger-1:ctx-f6ef312a) (logid:4343d2e6) Sending network shutdown to VirtualRouter
2023-11-20 09:37:50,404 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Network-Scavenger-1:ctx-f6ef312a) (logid:4343d2e6) Stopping router VM instance {"id":163,"instanceName":"r-163-VM","type":"DomainRouter","uuid":"6ee36e60-a388-41d6-8688-162b0b402a55"}

Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code change looks okay for an improvement.
I'm not sure if this is a bug or a root cause for an issue as the only way is to update DB manually.

@DaanHoogland
Copy link
Contributor

@shwstppr , concerning

I'm not sure if this is a bug or a root cause for an issue as the only way is to update DB manually.

see

discussed with Daan, there is indeed a race condition that start vm when network is being shutdown, maybe there is an issue with it. It needs testing.

When a thread starts a VM while another one runs the GC for networks this may happen. We have not reproduced in a lab, but seen it in "the wild".

@weizhouapache
Copy link
Member Author

Code change looks okay for an improvement. I'm not sure if this is a bug or a root cause for an issue as the only way is to update DB manually.

@shwstppr @DaanHoogland
I think there is a way to reproduce the issue but I did not test it yet

  • create and start a vm, nics_count is increased by 1 (e.g. 0 -> 1)
  • stop the vm, nics_count is decreased by 1 (e.g. 1 -> 0)
  • start the vm out-of-band (e.g. in vCenter)
  • cloustack detects the VM is running, and changes the state from Stopped to Running, but nics_count is not changed (e.g. still 0)
    This is just an example that nics_count is incorrect.

@shwstppr
Copy link
Contributor

Thanks @weizhouapache @DaanHoogland I've approved the PR. We can merge unless you feel we need more testing.

@shwstppr
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7848

@weizhouapache
Copy link
Member Author

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-8406)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44803 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8204-t8406-kvm-centos7.zip
Smoke tests completed. 109 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@shwstppr
Copy link
Contributor

@weizhouapache @DaanHoogland do we need any additional testing here or is it good to merge?

@weizhouapache
Copy link
Member Author

@weizhouapache @DaanHoogland do we need any additional testing here or is it good to merge?

@shwstppr
I have manually tested it, it looks good to me.
However, as you know, I am the author of this PR, please ignore my lgtm ... :-D

@DaanHoogland
Copy link
Contributor

@weizhouapache @DaanHoogland do we need any additional testing here or is it good to merge?

@shwstppr I have manually tested it, it looks good to me. However, as you know, I am the author of this PR, please ignore my lgtm ... :-D

I will repro @weizhouapache last testing description and merge if ok.

@DaanHoogland DaanHoogland self-assigned this Nov 29, 2023
@DaanHoogland
Copy link
Contributor

I think there is a way to reproduce the issue but I did not test it yet

* create and start a vm, nics_count is increased by 1 (e.g. 0 -> 1)

* stop the vm, nics_count is decreased by 1 (e.g. 1 -> 0)

* start the vm out-of-band (e.g. in vCenter)

* cloustack detects the VM is running, and changes the state from Stopped to Running, but nics_count is not changed (e.g. still 0)
  This is just an example that nics_count is incorrect.

I reproduced the faulty state this way

  • set the network.gc.interval and network.gc.timeout both to 30 secs
  • create and start a vm
  • stop the vm
  • start the vm out of bounds
    pre this PR the network gets shutdown despite the vm being started
    with the PR the network keeps running

merging

@DaanHoogland DaanHoogland merged commit cb2b6ac into apache:4.18 Nov 29, 2023
@DaanHoogland DaanHoogland deleted the 4.18-fix-network-gc branch November 29, 2023 17:55
@DaanHoogland DaanHoogland removed their assignment Nov 29, 2023
@weizhouapache
Copy link
Member Author

great, thanks a lot for the testing @DaanHoogland !

DaanHoogland added a commit that referenced this pull request Dec 4, 2023
* 4.18:
  server: Initial new vpnuser state (#8268)
  UI: Removed redundant IP Address Column when create Port forwarding rules (#8275)
  UI: Removed ICMP input fields for protocol number from ACL List rules modal (#8253)
  server: check if there are active nics before network GC (#8204)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Dec 6, 2023
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Dec 6, 2023
* 4.18:
  server: Initial new vpnuser state (apache#8268)
  UI: Removed redundant IP Address Column when create Port forwarding rules (apache#8275)
  UI: Removed ICMP input fields for protocol number from ACL List rules modal (apache#8253)
  server: check if there are active nics before network GC (apache#8204)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Isolated network is shutdown even when there are running VMs on it

5 participants