[networking] Clean up stale veth devices#11391
Merged
dan-stowell merged 4 commits intomasterfrom Feb 24, 2026
Merged
Conversation
14d8af7 to
0e0ed28
Compare
Root cause: when a process using veth pairs is killed without cleanup (e.g. SIGKILL from Bazel test runner), the flock on the IP range is released but the kernel-level veth devices and routes persist in the root network namespace. On the next allocation of the same /30 network, duplicate routes are created. Return traffic then gets routed to a stale veth (whose namespace-side peer no longer exists) instead of the new one, causing 100% packet loss. Two-part fix: 1. networking.go: add cleanupStaleVeths() which enumerates all veth interfaces and deletes any that have the same host IP we're about to assign. Called in both setupVethPair (new veth creation) and VethNetworkPool.Get (pooled veth reuse with new IP). This is defense-in-depth for production executors. 2. firecracker_test.go: use a dedicated task_ip_range (10.200.0.0/16) so the test doesn't share an IP range with other tests on the same executor host. Other tests use the default 192.168.0.0/16, and their stale veths would cause routing conflicts even with cleanup, since new stale veths can appear between cleanup and use. Co-authored-by: Shelley <shelley@exe.dev>
0e0ed28 to
c962fc7
Compare
bduffany
reviewed
Feb 23, 2026
enterprise/server/remote_execution/containers/firecracker/firecracker_test.go
Outdated
Show resolved
Hide resolved
server/util/networking/networking.go
Outdated
|
|
||
| // Clean up stale veths before assigning the new IP, to avoid routing | ||
| // conflicts with orphaned devices from killed processes. | ||
| if err := cleanupStaleVeths(ctx, network.HostIPWithCIDR()); err != nil { |
Member
There was a problem hiding this comment.
I think this is a good fix for the tests, but there might be some performance implications for prod, because networking-related operations tend to be on the slower side.
Maybe we could start by putting it behind a flag and only doing this in tests, and if we want to enable the flag everywhere we could do some benchmarking first on some prod machines.
(Same comment re the callsite below)
Address review feedback: 1. Remove the dedicated 10.200.0.0/16 test IP range. firecracker_test is the only test on the bare pool, so there's no cross-test conflict. 2. Add executor.cleanup_stale_veth_devices flag (default false) to gate the cleanupStaleVeths() calls. This avoids potential performance impact in prod from enumerating links on every veth setup. The test enables the flag since test processes are routinely killed without cleanup by the test runner. The cleanup logic itself is unchanged -- when enabled, it still deletes any veth carrying the host IP we're about to assign, which is safe because holding the flock means any such device is orphaned. Co-authored-by: Shelley <shelley@exe.dev>
bduffany
approved these changes
Feb 23, 2026
Co-authored-by: Brandon Duffany <brandon@buildbuddy.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause: when a process using veth pairs is killed without cleanup (e.g. SIGKILL from Bazel test runner), the flock on the IP range is released but the kernel-level veth devices and routes persist in the root network namespace. On the next allocation of the same /30 network, duplicate routes are created. Return traffic then gets routed to a stale veth (whose namespace-side peer no longer exists) instead of the new one, causing 100% packet loss.
The fix: