Summary
I was running a helios guest on maxwell, a lab machine. I wanted to scp some files onto the guest, so I assigned it an IP and set up ssh access. It worked initially. Then I observed that intermittently (at intervals ranging from minutes to hours), both ping and ssh to the guest from external systems in the lab (including maxwell) would stop working, start working again, and then repeat.
I'm not certain where exactly this problem is coming from, but I wanted to file it somewhere until it's better understood.
Setup Details
- propolis-server was on a branch of mine for migrating tsc-related data; the most recent commit from master was e874e4
- maxwell was also on custom helios bits with only bhyve changes related to migrating tsc data; the most recent commit from master there was 5d9d909
- helios image is helios-1.0.21408, built on January 9, 2023
- the vnic is named vnic_prop0, with igb0 as the underlay device, and the vnic's MAC address is "02:08:20:ac:e9:16"
- In most of my debugging, I was running ping or ssh from maxwell itself
- the guest IP is 172.20.30.160
- maxwell's IP is 172.20.30.73
Failure Modes
I observed two failure modes of ping:
- No answer from IP, with no additional errors
jordan@maxwell ~ $ ping 172.20.30.160
no answer from 172.20.30.160
- No answer, with an ICMP Net Unreachable Error
jordan@maxwell ~ $ ping 172.20.30.160
ICMP Net Unreachable from gateway xe-2-0-0.edge01.emy01.paxio.net (64.201.240.182) for icmp from 172.20.3.73 to 172.20.30.160
ICMP Net Unreachable from gateway xe-2-0-0.edge01.emy01.paxio.net (64.201.240.182) for icmp from 172.20.3.73 to 172.20.30.160
no answer from 172.20.30.160
Of the two, the first was far more common, and most of my debugging observations come from it.
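To pin down how often and for how long the guest becomes unreachable, the outages could be timestamped with a small watcher. The following is a minimal sketch under stated assumptions, not something that was run as part of this report: the names `probe`, `transitions`, and `watch` are hypothetical, and `probe` assumes the illumos-style `ping host timeout` invocation, which is expected to exit with status 0 when the host answers.

```python
import subprocess
import time
from typing import Iterable, Iterator, Tuple


def transitions(samples: Iterable[Tuple[float, bool]]) -> Iterator[Tuple[float, bool]]:
    """Yield (timestamp, reachable) only when reachability changes."""
    previous = None
    for timestamp, reachable in samples:
        if reachable != previous:
            yield timestamp, reachable
            previous = reachable


def probe(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ping to host succeeds.

    Assumption: illumos-style `ping host timeout`, exit status 0 on success.
    """
    result = subprocess.run(
        ["ping", host, str(timeout_s)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(host: str, interval_s: float = 5.0) -> None:
    """Print one line each time host flips between reachable and unreachable."""
    def samples() -> Iterator[Tuple[float, bool]]:
        while True:
            yield time.time(), probe(host)
            time.sleep(interval_s)

    for timestamp, reachable in transitions(samples()):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
        print(stamp, "reachable" if reachable else "unreachable")


# Example (runs until interrupted):
# watch("172.20.30.160")
```

Logging only the transitions keeps the output short enough to line up against snoop captures taken over the same period.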
Observations
I ran some variation of ping 172.20.30.160 (sometimes with -nv) on maxwell. I snooped igb0, vnic_prop0, and vioif0 inside the guest for both ICMP and ARP traffic.
For ICMP traffic, I found that:
- I saw no ICMP traffic inside the guest on vioif0
- I saw no ICMP traffic on vnic_prop0 on the host
- I did see ICMP echo requests outgoing on igb0 of the host
For ARP traffic, I found that:
- I saw ARP traffic inside the guest that matched what vnic_prop0 was seeing
- I didn't see any ARP replies coming from the guest.
Here's some example ARP output from the vnic while an unsuccessful ping was running:
jordan@maxwell ~ $ pfexec snoop -t a -r -d vnic_prop0 arp
Using device vnic_prop0 (promiscuous mode)
00:12:39.35278 172.20.3.1 -> 172.20.3.11 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:12:44.08748 172.20.3.160 -> * ARP C Who is 172.20.3.1, 172.20.3.1 ?
00:12:44.08755 172.20.3.1 -> 172.20.3.160 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:12:57.07828 172.20.3.63 -> 172.20.3.3 ARP R 172.20.3.63, 172.20.3.63 is a0:42:3f:42:91:50
00:13:01.43025 172.20.3.1 -> 172.20.3.72 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
I saw that the ARP cache on the host had the guest IP / MAC address populated:
jordan@maxwell ~ $ arp -a
Net to Media Table: IPv4
Device IP Address Mask Flags Phys Addr
------ -------------------- --------------- -------- ---------------
igb0 172.20.3.160 255.255.255.255 02:08:20:ac:e9:16
igb0 centrum.eng.oxide.computer 255.255.255.255 aa:00:04:00:ca:fe
igb0 scarydoor.eng.oxide.computer 255.255.255.255 aa:00:04:00:ca:fe
igb0 172.20.3.73 255.255.255.255 SPLA d8:5e:d3:09:1f:6b
igb0 172.20.3.71 255.255.255.255 d8:5e:d3:09:1e:d7
igb0 atrium.eng.oxide.computer 255.255.255.255 18:c0:4d:81:7e:7c
The guest also had its own address populated:
root@unknown:~# arp -a
Net to Media Table: IPv4
Device IP Address Mask Flags Phys Addr
------ -------------------- --------------- -------- ---------------
vioif0 172.20.3.160 255.255.255.255 SPLA 02:08:20:ac:e9:16
vioif0 scarydoor.eng.oxide.computer 255.255.255.255 aa:00:04:00:ca:fe
vioif0 172.20.3.73 255.255.255.255 d8:5e:d3:09:1f:6b
I did manage to capture some ARP output while ping was working. The one difference I noticed is that the guest replied to an ARP request for its own address:
00:28:34.28157 172.20.3.1 -> 172.20.3.60 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:28:37.26110 172.20.3.160 -> * ARP C Who is 172.20.3.73, 172.20.3.73 ?
00:28:37.26116 172.20.3.73 -> 172.20.3.160 ARP R 172.20.3.73, 172.20.3.73 is d8:5e:d3:9:1f:6b
00:28:37.26122 172.20.3.73 -> * ARP C Who is 172.20.3.160, 172.20.3.160 ?
# guest reply
00:28:37.26133 172.20.3.160 -> 172.20.3.73 ARP R 172.20.3.160, 172.20.3.160 is 2:8:20:ac:e9:16
00:28:41.09649 172.20.3.63 -> 172.20.3.3 ARP R 172.20.3.63, 172.20.3.63 is a0:42:3f:42:91:50
00:28:52.35501 172.20.3.64 -> (broadcast) ARP C Who is 172.20.3.64, 172.20.3.64 ?
00:29:03.41786 172.20.3.1 -> 172.20.3.71 ARP R 172.20.3.1, 172.20.3.1 is aa:0:4:0:ca:fe
00:29:09.82297 172.20.3.160 -> (broadcast) ARP C Who is 172.20.3.160, 172.20.3.160 ?
00:29:15.21767 172.20.3.64 -> 172.20.3.3 ARP R 172.20.3.64, 172.20.3.64 is 0:90:fb:65:d6:6d
00:29:17.38574 172.20.3.63 -> 172.20.3.3 ARP R 172.20.3.63, 172.20.3.63 is a0:42:3f:42:91:50
My expectation is that it shouldn't matter whether the guest replies to ARP requests while ping is failing, as the host already has the guest's IP/MAC address in its ARP cache. But I could be wrong there?
In any event, it seems like ICMP packets are making it out of igb0, but not making it to vnic_prop0 for whatever reason.
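The ARP comparison above was done by eye. For longer captures, a small parser can pull out the requests for the guest's address and the replies the guest sent. This is a minimal sketch, assuming the `snoop -t a -r` line format shown in the captures above; the function names `normalize_mac` and `summarize_guest_arp` are hypothetical, and the guest address is passed in as it appears in the captures.

```python
import re
from typing import Iterable, List, Tuple

# Patterns for the two ARP line shapes seen in the snoop output above.
REQUEST = re.compile(
    r"^(?P<ts>\S+)\s+(?P<src>\S+)\s+->\s+(?P<dst>.+?)\s+ARP C Who is\s+(?P<target>[\d.]+),"
)
REPLY = re.compile(
    r"^(?P<ts>\S+)\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s+ARP R\s+(?P<target>[\d.]+),"
    r"\s+\S+ is (?P<mac>[0-9a-fA-F:]+)"
)


def normalize_mac(mac: str) -> str:
    """Zero-pad each octet, so snoop's 2:8:20:... matches arp's 02:08:20:..."""
    return ":".join(f"{int(part, 16):02x}" for part in mac.split(":"))


def summarize_guest_arp(
    lines: Iterable[str], guest_ip: str
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (request timestamps, [(reply timestamp, MAC)]) for guest_ip.

    Requests sent by the guest for its own address are skipped, since those
    are announcements rather than lookups by another host.
    """
    requests: List[str] = []
    replies: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        match = REQUEST.match(line)
        if match:
            if match["target"] == guest_ip and match["src"] != guest_ip:
                requests.append(match["ts"])
            continue
        match = REPLY.match(line)
        if match and match["src"] == guest_ip and match["target"] == guest_ip:
            replies.append((match["ts"], normalize_mac(match["mac"])))
    return requests, replies
```

Run over the capture taken while ping was working, this should report one request for 172.20.3.160 and one reply carrying the vnic's MAC address. Run over the failing capture, it should report neither, which matches the observation that nothing on the segment was asking for the guest at that point.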