
[Performance] Mitigate redundant Reconciles #527

@vicentefb

Description


This investigation stems from PRs #508 and #509, which reduce the number of conflicts by using `.Patch()` instead of `.Update()` when updating the status of the Sandbox and SandboxClaim resources.
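To make the `.Patch()` vs `.Update()` distinction concrete, here is a dependency-free sketch of the API server's optimistic-concurrency check. The names (`fakeServer`, `MergePatch`) are illustrative stand-ins, not the real client-go API: `Update` is rejected with a 409 when the client's copy is stale, while a merge patch writes only the supplied field and does not compare `resourceVersion`.

```go
package main

import "fmt"

// object is a toy stand-in for a Kubernetes object's write-relevant fields.
type object struct {
	resourceVersion int
	status          string
}

// fakeServer mimics the API server's optimistic-concurrency behavior.
type fakeServer struct{ obj object }

// Update mirrors client-go Update semantics: the write is rejected with a
// 409 if the client's cached copy is stale.
func (s *fakeServer) Update(clientCopy object, newStatus string) error {
	if clientCopy.resourceVersion != s.obj.resourceVersion {
		return fmt.Errorf("409 Conflict: the object has been modified")
	}
	s.obj.status = newStatus
	s.obj.resourceVersion++
	return nil
}

// MergePatch mirrors merge-patch semantics: only the supplied field is
// written, so a stale read does not cause a conflict.
func (s *fakeServer) MergePatch(newStatus string) {
	s.obj.status = newStatus
	s.obj.resourceVersion++
}

func main() {
	s := &fakeServer{obj: object{resourceVersion: 1}}
	stale := s.obj            // controller reads from its (lagging) cache
	s.obj.resourceVersion = 2 // Kubelet writes in the meantime

	fmt.Println(s.Update(stale, "Ready")) // 409 Conflict
	s.MergePatch("Ready")                 // succeeds despite the stale read
	fmt.Println(s.obj.status)
}
```

This is why switching the status writes to `.Patch()` eliminates most of the 409s: the controller no longer has to win the `resourceVersion` race against the Kubelet.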

During high-concurrency scale testing (`kube-api-qps=1000`, `sandbox-concurrent-workers=400`), we observed a high volume of `409 Conflict: the object has been modified` errors across both the Sandbox and SandboxClaim controllers.

The SandboxReconciler uses `Owns(&corev1.Pod{})` to watch for changes on its child Pods. When a new sandbox pod is provisioned, the Kubelet rapidly transitions the pod through multiple setup phases (Scheduled -> Initialized -> PodIP assigned -> ContainersReady -> Running) in a fraction of a second.
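One way to mitigate this is an update predicate on the `Owns(&corev1.Pod{})` watch, so only Pod changes the controller actually projects into Sandbox status enqueue a reconcile. With controller-runtime this would be wired via `builder.WithPredicates(predicate.Funcs{UpdateFunc: ...})`; the sketch below is a dependency-free version of the comparison, with `podView` as a hypothetical stand-in for the relevant `corev1.Pod` fields.

```go
package main

import "fmt"

// podView is a minimal stand-in for the Pod fields the Sandbox controller
// consumes; a real predicate would operate on *corev1.Pod.
type podView struct {
	phase string
	podIP string
	ready bool
}

// relevantChange mirrors what an UpdateFunc predicate would return:
// enqueue a reconcile only if a field we project into Sandbox status
// actually changed, ignoring other Kubelet status churn.
func relevantChange(old, cur podView) bool {
	return old.phase != cur.phase ||
		old.podIP != cur.podIP ||
		old.ready != cur.ready
}

func main() {
	before := podView{phase: "Running", podIP: "10.0.0.5", ready: true}
	after := before // e.g. only a condition timestamp changed
	fmt.Println(relevantChange(before, after)) // false: skip reconcile

	after.ready = false
	fmt.Println(relevantChange(before, after)) // true: enqueue
}
```

The trade-off is that the predicate must list every field the reconciler reads; a field added to the reconcile logic but not the predicate silently stops triggering reconciles.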

The following test run shows what happens on baseline `main` (without `.Patch()` in the Sandbox or SandboxClaim controllers).

Test Conditions: BURST=1, WARMPOOL=2 (Sandbox Status Unpatched, SandboxClaim Status Unpatched)

Agent Sandbox Claim Startup latency (ms):

  • "StartupLatency50": 653.8145060214703,
  • "StartupLatency90": 1150.0000000000002,
  • "StartupLatency99": 2364.999999999999

API Server View (HTTP Audit Logs): This trace isolates the one-second window in which Kubelet status spam causes the controllers to repeatedly collide with each other.

| Step | Delta (ms) | Method | Target Resource | Result / Notes |
|---|---|---|---|---|
| 1 | 0 | PATCH | sandboxclaims/agent-claim-4 | 🟢 Start: Test Runner injects Claim |
| 2 | 17 | UPDATE | sandboxes/warmpool-0-6qkhx | 🟢 Claim Controller adopts Sandbox |
| 3 | 28 | DELETE | networkpolicies/... | 🟡 404 Not Found |
| 4 | 32 | CREATE | events/agent-claim-4... | 🟢 K8s records adoption event |
| 5 | 42 | CREATE | sandboxes | 🟢 WarmPool orders replacement |
| 6 | 45 | UPDATE | sandboxclaims/agent-claim-4/status | 🟢 Claim Controller updates status |
| 7 | 56 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 8 | 58 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 9 | 67 | UPDATE | sandboxclaims/agent-claim-4/status | 🔴 409 Conflict (Claim collides w/ itself) |
| 10 | 68 | UPDATE | pods/warmpool-0-6qkhx | 🟢 Sandbox Controller updates pod |
| 11 | 71 | PATCH | sandboxes/warmpool-0-qcblt | 🟢 WarmPool configures replacement |
| 12 | 77 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 13 | 85 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 14 | 90 | PATCH | sandboxes/warmpool-0-6qkhx/status | 🟢 SUCCESS: Sandbox Status patched |
| 15 | 97 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 16 | 111 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 17 | 127 | CREATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller provisions pod |
| 18 | 129 | UPDATE | sandboxclaims/agent-claim-4/status | 🟢 Retry Succeeds: Claim status resolves |
| 19 | 132 | UPDATE | pods/warmpool-0-6qkhx | 🟢 Sandbox Controller updates pod |
| 20 | 147 | PATCH | sandboxes/warmpool-0-6qkhx/status | 🟢 SUCCESS: Sandbox Status patched again |
| 21 | 150 | DELETE | networkpolicies/... | 🟡 404 Not Found (Retry) |
| 22 | 152 | PATCH | sandboxes/warmpool-0-qcblt | 🟢 WarmPool configures replacement |
| 23 | 166 | CREATE | services/warmpool-0-qcblt | 🟢 Service provisioned |
| 24 | 191 | UPDATE | pods/warmpool-0-6qkhx | 🟢 Sandbox Controller updates pod |
| 25 | 193 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched |
| 26 | 244 | UPDATE | pods/warmpool-0-qcblt | 🔴 409 Conflict (Sandbox collides w/ Kubelet) |
| 27 | 264 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched again |
| 28 | 312 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
| 29 | 361 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
| 30 | 380 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched again |
| 31 | 715 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |
| 32 | 734 | PATCH | sandboxes/warmpool-0-qcblt/status | 🟢 SUCCESS: New Sandbox Status patched again |
| 33 | 758 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 34 | 796 | UPDATE | pods/warmpool-0-qcblt | 🟢 Sandbox Controller updates pod |

This trace follows the reconcileID generated by controller-runtime for a single replacement pod (warmpool-0-qcblt). It shows that the Kubelet's rapid status updates generate excessive informer notifications, driving the worker thread through 7 full reconcile loops in under 1 second.

| Step | Delta (ms) | Reconcile ID | Controller Action / Event |
|---|---|---|---|
| 1 | 0 | 9a3bf8f1... | 🟢 Loop 1: "Creating a new Pod" & "Creating a new Headless Service" |
| 2 | 124 | ce0ba836... | 🟡 Loop 2: "Found Pod" (Woken up by Kubelet Scheduled event) |
| 3 | 196 | ce0ba836... | 🔴 CRASH: "failed to update pod... the object has been modified" |
| 4 | 196 | ce0ba836... | 🔴 CRASH: "Failed to update sandbox status" (409 Conflict) |
| 5 | 196 | 9c2064ce... | 🟡 Loop 3: "Found Pod" (Woken up by Kubelet Initialized event) |
| 6 | 1232 | 8f2137f8... | 🟡 Loop 4: "Found Pod" (Woken up by Kubelet network/IP assignment) |
| 7 | 1308 | 258ff9f0... | 🟡 Loop 5: "Found Pod" (Woken up by Kubelet Ready event) |
| - | - | - | (Note: Controller continues to loop for subsequent Kubelet phase shifts) |
