This issue/investigation stems from PRs #508 and #509, which reduce the number of conflicts by using `.Patch()` instead of `.Update()` to update the status of the Sandbox and SandboxClaim resources.
During high-concurrency scale testing (`kube-api-qps=1000`, `sandbox-concurrent-workers=400`), we observed a high volume of `409 Conflict: the object has been modified` errors across both the Sandbox and SandboxClaim controllers.
The SandboxReconciler uses `Owns(&corev1.Pod{})` to watch for changes on its child Pods. When a new sandbox pod is provisioned, the Kubelet rapidly transitions the pod through multiple setup phases (Scheduled -> Initialized -> PodIP assigned -> ContainersReady -> Running) in a fraction of a second.
The following test run shows what happens on baseline `main` (without `.Patch()` in the sandbox or sandboxclaim controller).
Test Conditions: BURST=1, WARMPOOL=2 (Sandbox Status Unpatched, SandboxClaim Status Unpatched)
Agent Sandbox Claim Startup latency (ms):
- "StartupLatency50": 653.8
- "StartupLatency90": 1150.0
- "StartupLatency99": 2365.0
API Server View (HTTP Audit Logs): This trace isolates the exact moment when Kubelet status spam causes the controllers to repeatedly collide with one another within a single second.
| Step | Delta (ms) | Method | Target Resource | Result / Notes |
|------|-----------:|--------|-----------------|----------------|
| 1 | 0 | PATCH | sandboxclaims/agent-claim-4 | 🟢 Start: Test Runner injects Claim |
| 2 | 17 | UPDATE | sandboxes/warmpool-0-6qkhx | 🟢 Claim Controller adopts Sandbox |
| 3 | 28 | DELETE | networkpolicies/... | 🟡 404 Not Found |
| 4 | 32 | CREATE | events/agent-claim-4... | 🟢 K8s records adoption event |
| 5 | 42 | CREATE | sandboxes | 🟢 WarmPool orders replacement |
| 6 | 45 | UPDATE | sandboxclaims/agent-claim-4/status | 🟢 Claim Controller updates status |
| 7 | 56 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 8 | 58 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 9 | 67 | UPDATE | sandboxclaims/agent-claim-4/status | 🔴 409 Conflict (Claim collides w/ itself) |
| 10 | 68 | UPDATE | pods/warmpool-0-6qkhx | 🟢 Sandbox Controller updates pod |
| 11 | 71 | PATCH | sandboxes/warmpool-0-qcblt | 🟢 WarmPool configures replacement |
| 12 | 77 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 13 | 85 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 14 | 90 | PATCH | sandboxes/warmpool-0-6qkhx/status | 🟢 SUCCESS: Sandbox Status patched |
| 15 | 97 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 16 | 111 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 17 | 127 | CREATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller provisions pod |
| 18 | 129 | UPDATE | sandboxclaims/agent-claim-4/status | 🟢 Retry Succeeds: Claim status resolves |
| 19 | 132 | UPDATE | pods/warmpool-0-6qkhx | 🟢 Sandbox Controller updates pod |
| 20 | 147 | PATCH | sandboxes/warmpool-0-6qkhx/status | 🟢 SUCCESS: Sandbox Status patched again |
| 21 | 150 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 22 | 152 | PATCH | sandboxes/warmpool-0-qcblt | 🟢 WarmPool configures replacement |
| 23 | 166 | CREATE | services/warmpool-0-qcblt | 🟢 Service provisioned |
| 24 | 191 | UPDATE | pods/warmpool-0-6qkhx | 🟢 Sandbox Controller updates pod |
| 25 | 193 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched |
| 26 | 244 | UPDATE | pods/warmpool-0-qcblt | 🔴 409 Conflict (Sandbox collides w/ Kubelet) |
| 27 | 264 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched again |
| 28 | 312 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
| 29 | 361 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
| 30 | 380 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched again |
| 31 | 715 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
| 32 | 734 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched again |
| 33 | 758 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 34 | 796 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
This trace follows the `reconcileID` generated by controller-runtime for a single replacement pod (warmpool-0-qcblt). It shows that the Kubelet's rapid status updates generate excessive informer notifications, driving the worker thread through 7 full reconcile loops in well under two seconds.
| Step | Delta (ms) | Reconcile ID | Controller Action / Event |
|------|-----------:|--------------|---------------------------|
| 1 | 0 | 9a3bf8f1... | 🟢 Loop 1: "Creating a new Pod" & "Creating a new Headless Service" |
| 2 | 124 | ce0ba836... | 🟡 Loop 2: "Found Pod" (woken up by Kubelet Scheduled event) |
| 3 | 196 | ce0ba836... | 🔴 CRASH: "failed to update pod... the object has been modified" |
| 4 | 196 | ce0ba836... | 🔴 CRASH: "Failed to update sandbox status" (409 Conflict) |
| 5 | 196 | 9c2064ce... | 🟡 Loop 3: "Found Pod" (woken up by Kubelet Initialized event) |
| 6 | 1232 | 8f2137f8... | 🟡 Loop 4: "Found Pod" (woken up by Kubelet network/IP assignment) |
| 7 | 1308 | 258ff9f0... | 🟡 Loop 5: "Found Pod" (woken up by Kubelet Ready event) |
| - | - | - | (Note: Controller continues to loop for subsequent Kubelet phase shifts) |