
intermittent network connectivity loss observed on helios guest #334

Description

@jordanhendricks

Summary

I was running a helios guest on maxwell, a lab machine. I wanted to scp some files onto the guest, so I assigned it an IP and set up ssh access. It worked initially. Then I observed that intermittently (at intervals ranging from minutes to hours), both ping and ssh to the guest from systems outside the guest (other lab machines as well as maxwell itself) would stop working, start working again later, and then repeat the cycle.

I'm not certain where exactly this problem is coming from, but I wanted to file it somewhere until it's better understood.

Setup Details

  • propolis-server was on a branch of mine for migrating tsc-related data; the most recent commit from master was e874e4
  • maxwell was also on custom helios bits with only bhyve changes related to migrating tsc data; the most recent commit from master there was 5d9d909
  • helios image is helios-1.0.21408, built on January 9, 2023
  • the vnic is named vnic_prop0, with igb0 as the underlay device, and the vnic's MAC address is "02:08:20:ac:e9:16"
  • In most of my debugging, I was running ping or ssh from maxwell itself
  • the guest IP is 172.20.30.160
  • maxwell's IP is 172.20.30.73

Failure Modes

I observed two failure modes of ping:

  1. No answer from IP, with no additional errors
jordan@maxwell ~ $ ping 172.20.30.160
no answer from 172.20.30.160
  2. No answer, with an ICMP Net Unreachable error
jordan@maxwell ~ $ ping 172.20.30.160
ICMP Net Unreachable from gateway xe-2-0-0.edge01.emy01.paxio.net (64.201.240.182) for icmp from 172.20.3.73 to 172.20.30.160
ICMP Net Unreachable from gateway xe-2-0-0.edge01.emy01.paxio.net (64.201.240.182) for icmp from 172.20.3.73 to 172.20.30.160
no answer from 172.20.30.160

Of the two, (1) was far more common, and most of my debugging observations come from it.
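To tally how often each failure mode occurs over a long run, the ping output can be bucketed mechanically. The following is a minimal sketch, not something used in the original debugging. The function name and category labels are hypothetical, and the `is alive` success string does not appear in this report; it is an assumption about ping's default output on the host.

```python
def classify_ping_output(output: str) -> str:
    """Bucket one ping invocation's output into a failure mode.

    Categories (names are illustrative only):
      "alive"           - assumed success string "<host> is alive"
      "net_unreachable" - at least one "ICMP Net Unreachable" line (mode 2)
      "no_answer"       - only "no answer from <host>" (mode 1)
      "unknown"         - anything else
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if any(ln.endswith("is alive") for ln in lines):
        return "alive"
    # Check for the ICMP error first: mode 2 also ends with a "no answer" line.
    if any(ln.startswith("ICMP Net Unreachable") for ln in lines):
        return "net_unreachable"
    if any(ln.startswith("no answer from") for ln in lines):
        return "no_answer"
    return "unknown"


MODE_1 = "no answer from 172.20.30.160"
MODE_2 = (
    "ICMP Net Unreachable from gateway xe-2-0-0.edge01.emy01.paxio.net "
    "(64.201.240.182) for icmp from 172.20.3.73 to 172.20.30.160\n"
    "no answer from 172.20.30.160"
)
```

Running this against timestamped ping output collected in a loop would give a rough picture of how long each outage lasts and which mode it falls into.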

Observations

I ran some variation of ping 172.20.30.160 (sometimes with -nv) on maxwell. I snooped igb0 and vnic_prop0 on the host, and vioif0 inside the guest, for both ICMP and ARP traffic.

For ICMP traffic, I found that:

  • I saw no ICMP traffic inside the guest on vioif0
  • I saw no ICMP traffic on vnic_prop0 on the host
  • I did see ICMP echo requests outgoing on igb0 of the host
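The three ICMP observations above amount to walking the capture points in order and finding where the traffic disappears. Here is a small sketch of that reasoning. The helper name is hypothetical, and the hop ordering (igb0, then vnic_prop0, then vioif0) is an assumption about the path an echo request takes toward the guest.

```python
def locate_drop(path, observed):
    """Return (last hop where traffic was seen, first hop where it was missing).

    path:     capture points in the order a packet is assumed to traverse them
    observed: mapping of capture point -> True if the traffic was seen there
    """
    last_seen = None
    for hop in path:
        if observed.get(hop, False):
            last_seen = hop
        else:
            return last_seen, hop
    # Traffic was seen at every capture point.
    return last_seen, None


# Assumed ordering for an ICMP echo request heading to the guest.
PATH = ["igb0", "vnic_prop0", "vioif0"]
# Observations taken from the list above.
SEEN = {"igb0": True, "vnic_prop0": False, "vioif0": False}

last_seen, first_missing = locate_drop(PATH, SEEN)
```

With the observations from this report, the result points at the step between igb0 and vnic_prop0.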

For ARP traffic, I found that:

  • I saw ARP traffic inside the guest that matched what vnic_prop0 was seeing
  • I didn't see any ARP replies coming from the guest.

Here's some example ARP output from the vnic while an unsuccessful ping was running:

jordan@maxwell ~ $ pfexec snoop -t a -r -d vnic_prop0 arp                                                
Using device vnic_prop0 (promiscuous mode)                                                               
00:12:39.35278   172.20.3.1 -> 172.20.3.11  ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe               
00:12:44.08748 172.20.3.160 -> *            ARP C Who is 172.20.3.1, 172.20.3.1 ?                        
00:12:44.08755   172.20.3.1 -> 172.20.3.160 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe               
00:12:57.07828  172.20.3.63 -> 172.20.3.3   ARP R 172.20.3.63, 172.20.3.63 is a0:42:3f:42:91:50          
00:13:01.43025   172.20.3.1 -> 172.20.3.72  ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe  

I saw that the ARP cache on the host had the guest IP / MAC address populated:

jordan@maxwell ~ $ arp -a
Net to Media Table: IPv4
Device   IP Address               Mask      Flags      Phys Addr
------ -------------------- --------------- -------- ---------------
igb0   172.20.3.160         255.255.255.255          02:08:20:ac:e9:16
igb0   centrum.eng.oxide.computer 255.255.255.255          aa:00:04:00:ca:fe
igb0   scarydoor.eng.oxide.computer 255.255.255.255          aa:00:04:00:ca:fe
igb0   172.20.3.73          255.255.255.255 SPLA     d8:5e:d3:09:1f:6b
igb0   172.20.3.71          255.255.255.255          d8:5e:d3:09:1e:d7
igb0   atrium.eng.oxide.computer 255.255.255.255          18:c0:4d:81:7e:7c

The guest also had its own address populated:

root@unknown:~# arp -a                              
Net to Media Table: IPv4                            
Device   IP Address               Mask      Flags      Phys Addr                                                                                                                                                  
------ -------------------- --------------- -------- ---------------
vioif0 172.20.3.160         255.255.255.255 SPLA     02:08:20:ac:e9:16
vioif0 scarydoor.eng.oxide.computer 255.255.255.255          aa:00:04:00:ca:fe                           
vioif0 172.20.3.73          255.255.255.255          d8:5e:d3:09:1f:6b         
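To check that the host and the guest agree on the guest's IP-to-MAC mapping, the two `arp -a` tables can be parsed and compared. This is a rough sketch with hypothetical helper names. It assumes each data row has four whitespace-separated fields (no flags) or five (with flags), with the MAC address last, as in the output above.

```python
def normalize_mac(mac):
    """Zero-pad each octet so '2:8:20:ac:e9:16' equals '02:08:20:ac:e9:16'."""
    return ":".join(f"{int(part, 16):02x}" for part in mac.split(":"))


def parse_arp_table(text):
    """Map IP (or hostname) -> {'device', 'flags', 'mac'} from `arp -a` output."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip prompts, headers, and separators: a data row ends in a MAC address.
        if len(fields) not in (4, 5) or fields[-1].count(":") != 5:
            continue
        flags = fields[3] if len(fields) == 5 else ""
        entries[fields[1]] = {
            "device": fields[0],
            "flags": flags,
            "mac": normalize_mac(fields[-1]),
        }
    return entries


HOST_ARP = """\
Net to Media Table: IPv4
Device   IP Address               Mask      Flags      Phys Addr
------ -------------------- --------------- -------- ---------------
igb0   172.20.3.160         255.255.255.255          02:08:20:ac:e9:16
igb0   172.20.3.73          255.255.255.255 SPLA     d8:5e:d3:09:1f:6b
"""
GUEST_ARP = """\
Net to Media Table: IPv4
Device   IP Address               Mask      Flags      Phys Addr
------ -------------------- --------------- -------- ---------------
vioif0 172.20.3.160         255.255.255.255 SPLA     02:08:20:ac:e9:16
vioif0 172.20.3.73          255.255.255.255          d8:5e:d3:09:1f:6b
"""

host = parse_arp_table(HOST_ARP)
guest = parse_arp_table(GUEST_ARP)
tables_agree = host["172.20.3.160"]["mac"] == guest["172.20.3.160"]["mac"]
```

For the tables captured here, both sides map 172.20.3.160 to the same MAC address, so a stale or wrong cache entry on the host does not look like the cause.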

I did manage to capture some ARP output while ping was working, and I noticed one difference: the guest was replying to an ARP request for its own address.

00:28:34.28157   172.20.3.1 -> 172.20.3.60  ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:28:37.26110 172.20.3.160 -> *            ARP C Who is 172.20.3.73, 172.20.3.73 ?
00:28:37.26116  172.20.3.73 -> 172.20.3.160 ARP R 172.20.3.73, 172.20.3.73 is d8:5e:d3:9:1f:6b
00:28:37.26122  172.20.3.73 -> *            ARP C Who is 172.20.3.160, 172.20.3.160 ?

# guest reply
00:28:37.26133 172.20.3.160 -> 172.20.3.73  ARP R 172.20.3.160, 172.20.3.160 is 2:8:20:ac:e9:16
00:28:41.09649  172.20.3.63 -> 172.20.3.3   ARP R 172.20.3.63, 172.20.3.63 is a0:42:3f:42:91:50
00:28:52.35501  172.20.3.64 -> (broadcast)  ARP C Who is 172.20.3.64, 172.20.3.64 ?
00:29:03.41786   172.20.3.1 -> 172.20.3.71  ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:29:09.82297 172.20.3.160 -> (broadcast)  ARP C Who is 172.20.3.160, 172.20.3.160 ?
00:29:15.21767  172.20.3.64 -> 172.20.3.3   ARP R 172.20.3.64, 172.20.3.64 is 0:90:fb:65:d6:6d
00:29:17.38574  172.20.3.63 -> 172.20.3.3   ARP R 172.20.3.63, 172.20.3.63 is a0:42:3f:42:91:50
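The difference between the failing and working captures can also be checked mechanically by filtering for ARP replies sent by the guest. The sketch below uses a hypothetical helper and a regular expression written against the `snoop -t a -r` line format shown in this report. It is not a general snoop parser, and the sample strings are excerpts of the captures above.

```python
import re

# Matches ARP reply lines as printed in the captures above, for example:
# 00:28:37.26133 172.20.3.160 -> 172.20.3.73  ARP R 172.20.3.160, 172.20.3.160 is 2:8:20:ac:e9:16
ARP_REPLY = re.compile(
    r"^\s*(?P<time>\S+)\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s+ARP R\s+"
    r"(?P<ip>[\d.]+),\s+\S+ is (?P<mac>[0-9a-fA-F:]+)"
)


def normalize_mac(mac):
    """Zero-pad each octet, since snoop prints '2:8:20' where arp prints '02:08:20'."""
    return ":".join(f"{int(part, 16):02x}" for part in mac.split(":"))


def arp_replies_from(capture, src_ip):
    """Return (time, destination, mac) for every ARP reply sent by src_ip."""
    replies = []
    for line in capture.splitlines():
        m = ARP_REPLY.match(line)
        if m and m.group("src") == src_ip:
            replies.append((m.group("time"), m.group("dst"), normalize_mac(m.group("mac"))))
    return replies


FAILING = """\
00:12:39.35278   172.20.3.1 -> 172.20.3.11  ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:12:44.08748 172.20.3.160 -> *            ARP C Who is 172.20.3.1, 172.20.3.1 ?
00:12:44.08755   172.20.3.1 -> 172.20.3.160 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
"""
WORKING = """\
00:28:37.26116  172.20.3.73 -> 172.20.3.160 ARP R 172.20.3.73, 172.20.3.73 is d8:5e:d3:9:1f:6b
00:28:37.26122  172.20.3.73 -> *            ARP C Who is 172.20.3.160, 172.20.3.160 ?
00:28:37.26133 172.20.3.160 -> 172.20.3.73  ARP R 172.20.3.160, 172.20.3.160 is 2:8:20:ac:e9:16
"""

failing_replies = arp_replies_from(FAILING, "172.20.3.160")
working_replies = arp_replies_from(WORKING, "172.20.3.160")
```

Applied to longer captures, this makes it easy to see whether outages line up with periods in which the guest sends no ARP replies at all.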

My expectation is that it shouldn't matter whether the guest replies to ARP requests while ping is failing, since the host already has the guest's IP/MAC address in its ARP cache. But I could be wrong there?

In any event, it seems like ICMP packets are making it out of igb0, but not making it to vnic_prop0 for whatever reason.


Labels: guest-os (Related to compatibility and/or functionality observed by guest software), networking (Related to networking devices/backends)
