[networking] Clean up stale veth devices #11391

Merged
dan-stowell merged 4 commits into master from investigate-network-failures
Feb 24, 2026

dan-stowell (Contributor) commented Feb 23, 2026

Root cause: when a process using veth pairs is killed without cleanup (e.g. SIGKILL from Bazel test runner), the flock on the IP range is released but the kernel-level veth devices and routes persist in the root network namespace. On the next allocation of the same /30 network, duplicate routes are created. Return traffic then gets routed to a stale veth (whose namespace-side peer no longer exists) instead of the new one, causing 100% packet loss.
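One way to spot the condition described above is to look for a destination network that is routed through more than one device. The following is a minimal diagnostic sketch, not code from this PR: the `Route` type and `duplicateRoutes` function are hypothetical, and the device names are made up. It assumes the routing table has already been read into memory by some other means.

```go
package main

import (
	"fmt"
	"sort"
)

// Route is a hypothetical, simplified view of one routing-table entry:
// the destination prefix and the device it points at.
type Route struct {
	Dst string // e.g. "192.168.0.4/30"
	Dev string // e.g. "veth-new"
}

// duplicateRoutes returns, for every destination that is routed via more
// than one device, the sorted list of those devices.
func duplicateRoutes(routes []Route) map[string][]string {
	devsByDst := map[string]map[string]bool{}
	for _, r := range routes {
		if devsByDst[r.Dst] == nil {
			devsByDst[r.Dst] = map[string]bool{}
		}
		devsByDst[r.Dst][r.Dev] = true
	}
	dups := map[string][]string{}
	for dst, devs := range devsByDst {
		if len(devs) < 2 {
			continue // a single device for this destination is the healthy case
		}
		for dev := range devs {
			dups[dst] = append(dups[dst], dev)
		}
		sort.Strings(dups[dst])
	}
	return dups
}

func main() {
	// Made-up example: a stale veth and a new veth both claim the same /30.
	routes := []Route{
		{Dst: "192.168.0.4/30", Dev: "veth-stale"},
		{Dst: "192.168.0.4/30", Dev: "veth-new"},
		{Dst: "192.168.0.8/30", Dev: "veth-other"},
	}
	for dst, devs := range duplicateRoutes(routes) {
		fmt.Printf("%s is routed via multiple devices: %v\n", dst, devs)
	}
}
```

A non-empty result for a /30 that was just allocated would be consistent with a leftover veth from a killed process.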

The fix:

  • networking.go: add cleanupStaleVeths() which enumerates all network interfaces and deletes any that have the same host IP we're about to assign. Called in both setupVethPair (new veth creation) and VethNetworkPool.Get (pooled veth reuse with new IP). This is defense-in-depth for production executors.
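The selection step of such a cleanup might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual `cleanupStaleVeths()` from networking.go: the `ifaceAddrs` type and the `staleCandidates` and `snapshotInterfaces` helpers are hypothetical, and the sketch only reports matching interfaces rather than deleting them. It uses the standard library's `net.Interfaces()`, which does not distinguish veth devices from other interface types, so a real implementation would need an additional filter.

```go
package main

import (
	"fmt"
	"net"
)

// ifaceAddrs is a hypothetical snapshot of one interface and its
// addresses in CIDR form (e.g. "192.168.0.5/30").
type ifaceAddrs struct {
	Name  string
	Addrs []string
}

// staleCandidates returns the names of interfaces that already carry the
// IP in hostIPWithCIDR. Only the IP is compared, not the prefix length.
func staleCandidates(ifaces []ifaceAddrs, hostIPWithCIDR string) ([]string, error) {
	wantIP, _, err := net.ParseCIDR(hostIPWithCIDR)
	if err != nil {
		return nil, fmt.Errorf("parse host IP %q: %w", hostIPWithCIDR, err)
	}
	var stale []string
	for _, iface := range ifaces {
		for _, a := range iface.Addrs {
			ip, _, err := net.ParseCIDR(a)
			if err != nil {
				continue // skip addresses that are not in CIDR form
			}
			if ip.Equal(wantIP) {
				stale = append(stale, iface.Name)
				break
			}
		}
	}
	return stale, nil
}

// snapshotInterfaces reads the current interfaces and their addresses
// using the standard library.
func snapshotInterfaces() ([]ifaceAddrs, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var out []ifaceAddrs
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			return nil, err
		}
		entry := ifaceAddrs{Name: iface.Name}
		for _, a := range addrs {
			entry.Addrs = append(entry.Addrs, a.String())
		}
		out = append(out, entry)
	}
	return out, nil
}

func main() {
	ifaces, err := snapshotInterfaces()
	if err != nil {
		fmt.Println("listing interfaces:", err)
		return
	}
	// The address below is a made-up example host IP.
	stale, err := staleCandidates(ifaces, "192.168.0.5/30")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("interfaces that would be deleted:", stale)
}
```

Keeping the matching logic separate from the system calls makes it testable without root privileges; the deletion itself (for example via `ip link delete <name>`) is left out of the sketch.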

@dan-stowell dan-stowell force-pushed the investigate-network-failures branch 3 times, most recently from 14d8af7 to 0e0ed28 on February 23, 2026 20:53
Root cause: when a process using veth pairs is killed without cleanup
(e.g. SIGKILL from Bazel test runner), the flock on the IP range is
released but the kernel-level veth devices and routes persist in the
root network namespace. On the next allocation of the same /30 network,
duplicate routes are created. Return traffic then gets routed to a stale
veth (whose namespace-side peer no longer exists) instead of the new
one, causing 100% packet loss.

Two-part fix:

1. networking.go: add cleanupStaleVeths() which enumerates all veth
   interfaces and deletes any that have the same host IP we're about to
   assign. Called in both setupVethPair (new veth creation) and
   VethNetworkPool.Get (pooled veth reuse with new IP). This is
   defense-in-depth for production executors.

2. firecracker_test.go: use a dedicated task_ip_range (10.200.0.0/16)
   so the test doesn't share an IP range with other tests on the same
   executor host. Other tests use the default 192.168.0.0/16, and their
   stale veths would cause routing conflicts even with cleanup, since
   new stale veths can appear between cleanup and use.

Co-authored-by: Shelley <shelley@exe.dev>
@dan-stowell dan-stowell force-pushed the investigate-network-failures branch from 0e0ed28 to c962fc7 on February 23, 2026 20:53

// Clean up stale veths before assigning the new IP, to avoid routing
// conflicts with orphaned devices from killed processes.
if err := cleanupStaleVeths(ctx, network.HostIPWithCIDR()); err != nil {
Member

I think this is a good fix for the tests, but there might be some performance implications for prod, because networking-related operations tend to be on the slower side.

Maybe we could start by putting it behind a flag and only doing this in tests, and if we want to enable the flag everywhere we could do some benchmarking first on some prod machines.

(Same comment re the callsite below)

dan-stowell and others added 2 commits February 23, 2026 21:55
Address review feedback:

1. Remove the dedicated 10.200.0.0/16 test IP range. firecracker_test is
   the only test on the bare pool, so there's no cross-test conflict.

2. Add executor.cleanup_stale_veth_devices flag (default false) to gate
   the cleanupStaleVeths() calls. This avoids potential performance
   impact in prod from enumerating links on every veth setup. The test
   enables the flag since test processes are routinely killed without
   cleanup by the test runner.
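The gating described in item 2 can be sketched with Go's standard `flag` package. This is an assumption-laden illustration rather than the code in the PR: only the flag name and its default of false come from the commit message, while `parseCleanupFlag`, `maybeCleanup`, and the help text are hypothetical, and the real executor may register its flags differently.

```go
package main

import (
	"flag"
	"fmt"
)

// cleanupFlagName is the flag name given in the commit message.
const cleanupFlagName = "executor.cleanup_stale_veth_devices"

// parseCleanupFlag parses args with a private FlagSet and reports whether
// stale-veth cleanup is enabled. The flag defaults to false.
func parseCleanupFlag(args []string) (bool, error) {
	fs := flag.NewFlagSet("executor", flag.ContinueOnError)
	enabled := fs.Bool(cleanupFlagName, false,
		"If true, delete stale veth devices holding the host IP about to be assigned.")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

// maybeCleanup runs cleanup only when the flag is enabled, so the default
// configuration pays no cost for enumerating links.
func maybeCleanup(enabled bool, cleanup func() error) error {
	if !enabled {
		return nil
	}
	return cleanup()
}

func main() {
	enabled, err := parseCleanupFlag([]string{"--" + cleanupFlagName + "=true"})
	if err != nil {
		fmt.Println("parsing flags:", err)
		return
	}
	calls := 0
	err = maybeCleanup(enabled, func() error {
		calls++ // stand-in for the real cleanup call
		return nil
	})
	fmt.Println("enabled:", enabled, "cleanup calls:", calls, "err:", err)
}
```

With this shape, both call sites can share one guard, and a test can turn the behavior on by passing the flag explicitly.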

The cleanup logic itself is unchanged -- when enabled, it still deletes
any veth carrying the host IP we're about to assign, which is safe
because holding the flock means any such device is orphaned.

Co-authored-by: Shelley <shelley@exe.dev>
@dan-stowell dan-stowell marked this pull request as ready for review February 23, 2026 22:02
@dan-stowell dan-stowell changed the title Clean up stale veth devices and use dedicated test IP range Clean up stale veth devices Feb 23, 2026
@dan-stowell dan-stowell changed the title Clean up stale veth devices [networking] Clean up stale veth devices Feb 23, 2026
Co-authored-by: Brandon Duffany <brandon@buildbuddy.io>
@dan-stowell dan-stowell merged commit d0f727d into master Feb 24, 2026
9 of 13 checks passed
@dan-stowell dan-stowell deleted the investigate-network-failures branch February 24, 2026 15:13