Skip to content

Conversation

@remibergsma
Copy link
Contributor

It seems the VR needs more time for some of its commands. Until we figured out the root cause, this allows the VRs to start again.

Error seen:

2015-12-28 14:35:18,201 ERROR [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-1:ctx-34ff7f80 job-39723/job-39726 ctx-d63de41b) Timed out in waiting SSH execution result
2015-12-28 14:35:18,201 WARN [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-1:ctx-34ff7f80 job-39723/job-39726 ctx-d63de41b) Command: com.cloud.agent.api.Command failed while starting virtua
l router
2015-12-28 14:35:18,201 INFO [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-34ff7f80 job-39723/job-39726 ctx-d63de41b) The guru did not like the answers so stopping VM[DomainRouter|r-1534-VM]
.Answer":{"result":true,"wait":0}},{"com.cloud.agent.api.Answer":{"result":false,"details":"Timed out in waiting SSH execution result","wait":0}}] }

It seems the VR needs more time for some of its commands. Until we figured
out the root cause, this allows the VRs to start again.
@jayapalu
Copy link
Contributor

@remibergsma
I am facing the same issue while running the tests for strongswan s2s vpn.
I will apply this patch and run the tests once.

@remibergsma
Copy link
Contributor Author

With this fix it survives longer wait times:

2015-12-29 12:40:02,011 DEBUG [c.c.a.r.v.VirtualRoutingResource] (DirectAgent-111:ctx-d584fccd) Processing ScriptConfigItem, executing update_config.py forwarding_rules.json took 63767ms

@remibergsma
Copy link
Contributor Author

Especially in combination with bumping global setting router.aggregation.command.each.timeout from its default of 3 sec to something like 10-15 will resolve most issues. The issue that remains is of course why it takes so long in the first place. But at least it doesn't fail anymore.

@wido
Copy link
Contributor

wido commented Jan 5, 2016

LGTM based on the code. Didn't perform a test on it. Trivial fix for some integers.

@remibergsma
Copy link
Contributor Author

LGTM based on these tests:

nosetests --with-marvin --marvin-config=${marvinCfg} -s -a tags=advanced,required_hardware=true \
component/test_vpc_redundant.py \
component/test_routers_iptables_default_policy.py \
component/test_routers_network_ops.py \
component/test_vpc_router_nics.py \
smoke/test_loadbalance.py \
smoke/test_internal_lb.py \
smoke/test_ssvm.py \
smoke/test_network.py

Result:

Create a redundant VPC with two networks with two VMs in each network ... === TestName: test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Status : SUCCESS ===
ok
Create a redundant VPC with two networks with two VMs in each network and check default routes ... === TestName: test_02_redundant_VPC_default_routes | Status : SUCCESS ===
ok
Create a redundant VPC with two networks with two VMs in each network ... === TestName: test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Status : SUCCESS ===
ok
Test iptables default INPUT/FORWARD policy on RouterVM ... === TestName: test_02_routervm_iptables_policies | Status : SUCCESS ===
ok
Test iptables default INPUT/FORWARD policies on VPC router ... === TestName: test_01_single_VPC_iptables_policies | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_01_isolate_network_FW_PF_default_routes_egress_true | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_02_isolate_network_FW_PF_default_routes_egress_false | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | Status : SUCCESS ===
ok
Create a VPC with two networks with one VM in each network and test nics after destroy ... === TestName: test_01_VPC_nics_after_destroy | Status : SUCCESS ===
ok
Create a VPC with two networks with one VM in each network and test default routes ... === TestName: test_02_VPC_default_routes | Status : SUCCESS ===
ok
Check the password file in the Router VM ... === TestName: test_isolate_network_password_server | Status : SUCCESS ===
ok
Check that the /etc/dhcphosts.txt doesn't contain duplicate IPs ... === TestName: test_router_dhcphosts | Status : SUCCESS ===
ok
Test to create Load balancing rule with source NAT ... === TestName: test_01_create_lb_rule_src_nat | Status : SUCCESS ===
ok
Test to create Load balancing rule with non source NAT ... === TestName: test_02_create_lb_rule_non_nat | Status : SUCCESS ===
ok
Test for assign & removing load balancing rule ... === TestName: test_assign_and_removal_lb | Status : SUCCESS ===
ok
Test to verify access to loadbalancer haproxy admin stats page ... === TestName: test02_internallb_haproxy_stats_on_all_interfaces | Status : SUCCESS ===
ok
Test create, assign, remove of an Internal LB with roundrobin http traffic to 3 vm's ... === TestName: test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 | Status : SUCCESS ===
ok
Test SSVM Internals ... === TestName: test_03_ssvm_internals | Status : SUCCESS ===
ok
Test CPVM Internals ... === TestName: test_04_cpvm_internals | Status : SUCCESS ===
ok
Test stop SSVM ... === TestName: test_05_stop_ssvm | Status : SUCCESS ===
ok
Test stop CPVM ... === TestName: test_06_stop_cpvm | Status : SUCCESS ===
ok
Test reboot SSVM ... === TestName: test_07_reboot_ssvm | Status : SUCCESS ===
ok
Test reboot CPVM ... === TestName: test_08_reboot_cpvm | Status : SUCCESS ===
ok
Test destroy SSVM ... === TestName: test_09_destroy_ssvm | Status : SUCCESS ===
ok
Test destroy CPVM ... === TestName: test_10_destroy_cpvm | Status : SUCCESS ===
ok
Test Remote Access VPN in VPC ... === TestName: test_vpc_remote_access_vpn | Status : SUCCESS ===
ok
Test VPN in VPC ... === TestName: test_vpc_site2site_vpn | Status : SUCCESS ===
ok
Test for port forwarding on source NAT ... === TestName: test_01_port_fwd_on_src_nat | Status : SUCCESS ===
ok
Test for port forwarding on non source NAT ... === TestName: test_02_port_fwd_on_non_src_nat | Status : SUCCESS ===
ok
Test for reboot router ... === TestName: test_reboot_router | Status : SUCCESS ===
ok
Test for Router rules for network rules on acquired public IP ... === TestName: test_network_rules_acquired_public_ip_1_static_nat_rule | Status : SUCCESS ===
ok
Test for Router rules for network rules on acquired public IP ... === TestName: test_network_rules_acquired_public_ip_2_nat_rule | Status : SUCCESS ===
ok
Test for Router rules for network rules on acquired public IP ... === TestName: test_network_rules_acquired_public_ip_3_Load_Balancer_Rule | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 34 tests in 17286.238s

OK

And:

nosetests --with-marvin --marvin-config=${marvinCfg} -s -a tags=advanced,required_hardware=false \
smoke/test_routers.py \
smoke/test_network_acl.py \
smoke/test_privategw_acl.py \
smoke/test_reset_vm_on_reboot.py \
smoke/test_vm_life_cycle.py \
smoke/test_vpc_vpn.py \
smoke/test_service_offerings.py \
component/test_vpc_offerings.py \
component/test_vpc_routers.py

Result:

Test router internal advanced zone ... === TestName: test_02_router_internal_adv | Status : SUCCESS ===
ok
Test restart network ... === TestName: test_03_restart_network_cleanup | Status : SUCCESS ===
ok
Test router basic setup ... === TestName: test_05_router_basic | Status : SUCCESS ===
ok
Test router advanced setup ... === TestName: test_06_router_advanced | Status : SUCCESS ===
ok
Test stop router ... === TestName: test_07_stop_router | Status : SUCCESS ===
ok
Test start router ... === TestName: test_08_start_router | Status : SUCCESS ===
ok
Test reboot router ... === TestName: test_09_reboot_router | Status : SUCCESS ===
ok
Test reset virtual machine on reboot ... === TestName: test_01_reset_vm_on_reboot | Status : SUCCESS ===
ok
Test advanced zone virtual router ... === TestName: test_advZoneVirtualRouter | Status : SUCCESS ===
ok
Test Deploy Virtual Machine ... === TestName: test_deploy_vm | Status : SUCCESS ===
ok
Test Multiple Deploy Virtual Machine ... === TestName: test_deploy_vm_multiple | Status : SUCCESS ===
ok
Test Stop Virtual Machine ... === TestName: test_01_stop_vm | Status : SUCCESS ===
ok
Test Start Virtual Machine ... === TestName: test_02_start_vm | Status : SUCCESS ===
ok
Test Reboot Virtual Machine ... === TestName: test_03_reboot_vm | Status : SUCCESS ===
ok
Test destroy Virtual Machine ... === TestName: test_06_destroy_vm | Status : SUCCESS ===
ok
Test recover Virtual Machine ... === TestName: test_07_restore_vm | Status : SUCCESS ===
ok
Test migrate VM ... === TestName: test_08_migrate_vm | Status : SUCCESS ===
ok
Test destroy(expunge) Virtual Machine ... === TestName: test_09_expunge_vm | Status : SUCCESS ===
ok
Test to create service offering ... === TestName: test_01_create_service_offering | Status : SUCCESS ===
ok
Test to update existing service offering ... === TestName: test_02_edit_service_offering | Status : SUCCESS ===
ok
Test to delete service offering ... === TestName: test_03_delete_service_offering | Status : SUCCESS ===
ok
Test for delete account ... === TestName: test_delete_account | Status : SUCCESS ===
ok
Test for Associate/Disassociate public IP address for admin account ... === TestName: test_public_ip_admin_account | Status : SUCCESS ===
ok
Test for Associate/Disassociate public IP address for user account ... === TestName: test_public_ip_user_account | Status : SUCCESS ===
ok
Test for release public IP address ... === TestName: test_releaseIP | Status : SUCCESS ===
ok
Test create VPC offering ... === TestName: test_01_create_vpc_offering | Status : SUCCESS ===
ok
Test VPC offering without load balancing service ... === TestName: test_03_vpc_off_without_lb | Status : SUCCESS ===
ok
Test VPC offering without static NAT service ... === TestName: test_04_vpc_off_without_static_nat | Status : SUCCESS ===
ok
Test VPC offering without port forwarding service ... === TestName: test_05_vpc_off_without_pf | Status : SUCCESS ===
ok
Test VPC offering with invalid services ... === TestName: test_06_vpc_off_invalid_services | Status : SUCCESS ===
ok
Test update VPC offering ... === TestName: test_07_update_vpc_off | Status : SUCCESS ===
ok
Test list VPC offering ... === TestName: test_08_list_vpc_off | Status : SUCCESS ===
ok
test_09_create_redundant_vpc_offering (integration.component.test_vpc_offerings.TestVPCOffering) ... === TestName: test_09_create_redundant_vpc_offering | Status : SUCCESS ===
ok
Test start/stop of router after addition of one guest network ... === TestName: test_01_start_stop_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test reboot of router after addition of one guest network ... === TestName: test_02_reboot_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test to change service offering of router after addition of one guest network ... === TestName: test_04_chg_srv_off_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test destroy of router after addition of one guest network ... === TestName: test_05_destroy_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test to stop and start router after creation of VPC ... === TestName: test_01_stop_start_router_after_creating_vpc | Status : SUCCESS ===
ok
Test to reboot the router after creating a VPC ... === TestName: test_02_reboot_router_after_creating_vpc | Status : SUCCESS ===
ok
Tests to change service offering of the Router after ... === TestName: test_04_change_service_offerring_vpc | Status : SUCCESS ===
ok
Test to destroy the router after creating a VPC ... === TestName: test_05_destroy_router_after_creating_vpc | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 41 tests in 8487.523s

OK

We successfully run this on a production cloud.

@wilderrodrigues
Copy link
Contributor

Ping @remibergsma

LGTM, please proceed with merge.

Cheers,
Wilder

@asfgit asfgit merged commit 1c5e899 into apache:4.7 Jan 16, 2016
asfgit pushed a commit that referenced this pull request Jan 16, 2016
CLOUDSTACK-9202 Bump ssh timeout for VR commandsIt seems the VR needs more time for some of its commands. Until we figured out the root cause, this allows the VRs to start again.

Error seen:
```
2015-12-28 14:35:18,201 ERROR [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-1:ctx-34ff7f80 job-39723/job-39726 ctx-d63de41b) Timed out in waiting SSH execution result
2015-12-28 14:35:18,201 WARN [c.c.n.r.VirtualNetworkApplianceManagerImpl] (Work-Job-Executor-1:ctx-34ff7f80 job-39723/job-39726 ctx-d63de41b) Command: com.cloud.agent.api.Command failed while starting virtua
l router
2015-12-28 14:35:18,201 INFO [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-1:ctx-34ff7f80 job-39723/job-39726 ctx-d63de41b) The guru did not like the answers so stopping VM[DomainRouter|r-1534-VM]
.Answer":{"result":true,"wait":0}},{"com.cloud.agent.api.Answer":{"result":false,"details":"Timed out in waiting SSH execution result","wait":0}}] }
```

* pr/1291:
  CLOUDSTACK-9202 Bump ssh timeout

Signed-off-by: Remi Bergsma <github@remi.nl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants