[networking] Clean up stale veth devices by dan-stowell · Pull Request #11391 · buildbuddy-io/buildbuddy

dan-stowell · 2026-02-23T20:46:37Z

Root cause: when a process using veth pairs is killed without cleanup (e.g. SIGKILL from Bazel test runner), the flock on the IP range is released but the kernel-level veth devices and routes persist in the root network namespace. On the next allocation of the same /30 network, duplicate routes are created. Return traffic then gets routed to a stale veth (whose namespace-side peer no longer exists) instead of the new one, causing 100% packet loss.

The fix:

networking.go: add cleanupStaleVeths() which enumerates all network interfaces and deletes any that have the same host IP we're about to assign. Called in both setupVethPair (new veth creation) and VethNetworkPool.Get (pooled veth reuse with new IP). This is defense-in-depth for production executors.
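
For illustration, the matching step described above can be sketched with only the Go standard library. This is a minimal sketch, not the code in networking.go: the `link` type and the `staleLinks` and `systemLinks` helpers are hypothetical names, and the actual deletion of a matched device (for example via `ip link delete <name>`) is deliberately left out.

```go
package main

import (
	"net"
)

// link is a hypothetical, simplified view of a network interface:
// its name plus the addresses assigned to it in CIDR form.
type link struct {
	Name  string
	Addrs []string // e.g. "192.168.0.5/30"
}

// staleLinks returns the names of links that already carry hostIPWithCIDR.
// Addresses are compared by parsed IP and prefix length rather than by string.
func staleLinks(links []link, hostIPWithCIDR string) ([]string, error) {
	wantIP, wantNet, err := net.ParseCIDR(hostIPWithCIDR)
	if err != nil {
		return nil, err
	}
	wantOnes, _ := wantNet.Mask.Size()

	var stale []string
	for _, l := range links {
		for _, a := range l.Addrs {
			ip, ipNet, err := net.ParseCIDR(a)
			if err != nil {
				continue // skip addresses we cannot parse
			}
			ones, _ := ipNet.Mask.Size()
			if ip.Equal(wantIP) && ones == wantOnes {
				stale = append(stale, l.Name)
				break // one match is enough for this link
			}
		}
	}
	return stale, nil
}

// systemLinks snapshots the interfaces visible in the current network
// namespace, so the result can be passed to staleLinks.
func systemLinks() ([]link, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	links := make([]link, 0, len(ifaces))
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			return nil, err
		}
		l := link{Name: iface.Name}
		for _, a := range addrs {
			l.Addrs = append(l.Addrs, a.String())
		}
		links = append(links, l)
	}
	return links, nil
}
```

Keeping the matching logic separate from interface enumeration means it can be exercised with hand-built `link` values, without root privileges or real veth devices.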

Root cause: when a process using veth pairs is killed without cleanup (e.g. SIGKILL from Bazel test runner), the flock on the IP range is released but the kernel-level veth devices and routes persist in the root network namespace. On the next allocation of the same /30 network, duplicate routes are created. Return traffic then gets routed to a stale veth (whose namespace-side peer no longer exists) instead of the new one, causing 100% packet loss.

Two-part fix:

1. networking.go: add cleanupStaleVeths() which enumerates all veth interfaces and deletes any that have the same host IP we're about to assign. Called in both setupVethPair (new veth creation) and VethNetworkPool.Get (pooled veth reuse with new IP). This is defense-in-depth for production executors.

2. firecracker_test.go: use a dedicated task_ip_range (10.200.0.0/16) so the test doesn't share an IP range with other tests on the same executor host. Other tests use the default 192.168.0.0/16, and their stale veths would cause routing conflicts even with cleanup, since new stale veths can appear between cleanup and use.

Co-authored-by: Shelley <shelley@exe.dev>

enterprise/server/remote_execution/containers/firecracker/firecracker_test.go

bduffany · 2026-02-23T21:46:30Z

server/util/networking/networking.go


+	// Clean up stale veths before assigning the new IP, to avoid routing
+	// conflicts with orphaned devices from killed processes.
+	if err := cleanupStaleVeths(ctx, network.HostIPWithCIDR()); err != nil {


I think this is a good fix for the tests, but there might be some performance implications for prod, because networking-related operations tend to be on the slower side.

Maybe we could start by putting it behind a flag and only doing this in tests, and if we want to enable the flag everywhere we could do some benchmarking first on some prod machines.

(Same comment re the callsite below)

Address review feedback:

1. Remove the dedicated 10.200.0.0/16 test IP range. firecracker_test is the only test on the bare pool, so there's no cross-test conflict.

2. Add executor.cleanup_stale_veth_devices flag (default false) to gate the cleanupStaleVeths() calls. This avoids potential performance impact in prod from enumerating links on every veth setup. The test enables the flag since test processes are routinely killed without cleanup by the test runner.

The cleanup logic itself is unchanged -- when enabled, it still deletes any veth carrying the host IP we're about to assign, which is safe because holding the flock means any such device is orphaned.

Co-authored-by: Shelley <shelley@exe.dev>
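
As a rough sketch of the gating described above, the cleanup can be wrapped so that it is a no-op unless the flag is set. This uses the standard library `flag` package purely for illustration; the flag registration mechanism actually used in the repository is not shown in this PR page, and the `maybeCleanup` helper and the flag's help text are hypothetical.

```go
package main

import "flag"

// cleanupStaleVethDevices mirrors the flag name and default from the commit
// message. Registering it with the stdlib flag package is an assumption.
var cleanupStaleVethDevices = flag.Bool(
	"executor.cleanup_stale_veth_devices",
	false,
	"If true, delete stale veth devices carrying the host IP before assigning it.",
)

// maybeCleanup runs cleanup only when enabled is true, so the default
// configuration never pays the cost of enumerating links.
func maybeCleanup(enabled bool, cleanup func() error) error {
	if !enabled {
		return nil
	}
	return cleanup()
}
```

A call site would then look like `maybeCleanup(*cleanupStaleVethDevices, func() error { ... })`, with the closure performing the stale-veth cleanup for the host IP about to be assigned.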

server/util/networking/networking.go

Co-authored-by: Brandon Duffany <brandon@buildbuddy.io>

dan-stowell force-pushed the investigate-network-failures branch 3 times, most recently from 14d8af7 to 0e0ed28, February 23, 2026 20:53

dan-stowell force-pushed the investigate-network-failures branch from 0e0ed28 to c962fc7, February 23, 2026 20:53

bduffany reviewed Feb 23, 2026


dan-stowell and others added 2 commits February 23, 2026 21:55

trim flag comment

990d74a

dan-stowell marked this pull request as ready for review February 23, 2026 22:02

dan-stowell changed the title ~~Clean up stale veth devices and use dedicated test IP range~~ Clean up stale veth devices Feb 23, 2026

dan-stowell changed the title ~~Clean up stale veth devices~~ [networking] Clean up stale veth devices Feb 23, 2026

bduffany approved these changes Feb 23, 2026


server/util/networking/networking.go (outdated, resolved)

Update server/util/networking/networking.go

9633a4b

Co-authored-by: Brandon Duffany <brandon@buildbuddy.io>

dan-stowell merged commit d0f727d into master Feb 24, 2026
9 of 13 checks passed

dan-stowell deleted the investigate-network-failures branch February 24, 2026 15:13

dan-stowell merged 4 commits into master from investigate-network-failures
