Problem
Today, when a SeiNodeDeployment has spec.deletionPolicy: Retain set and the SND is deleted, the controller calls orphanChildSeiNodes / orphanNetworkingResources (internal/controller/nodedeployment/controller.go:160), which removes ownerReferences from child SeiNodes so they survive the SND deletion. However, this orphan path does not touch the bound PersistentVolume's persistentVolumeReclaimPolicy. For dynamically provisioned EBS volumes the PV's reclaim policy defaults to Delete, so any downstream operation that ultimately deletes the PersistentVolumeClaim will cascade to destroying the underlying EBS volume — even though the SND was explicitly set to Retain.
Observed during incident triage on 2026-05-19 in pacific-1: while triaging a stuck node-0-0-0 pod (missing rbac-proxy ConfigMap), the operator's instinct was "delete the SND and let Flux recreate it cleanly." Read-only checks revealed that the SND's Retain field offered no protection for the data on the bound PV. The field's name strongly implies data preservation; in practice it only orphans the child Kubernetes objects and leaves the volume's reclaim policy untouched.
Impact
Production data-loss risk on archive nodes. In pacific-1, three archive SeiNodes (archive-0-0, archive-1-0, archive-2-0) hold large state-snapshot datasets that take many hours to days to rebuild from peers. Operators relying on deletionPolicy: Retain for safety today have a false sense of protection: a careless kubectl delete pvc (or any controller-side cascade that ultimately deletes the PVC) destroys chain data that is slow and costly to rebuild.
This also blocks "delete and recreate from Flux" as a generally safe troubleshooting pattern. Today, any incident triage that reaches for "nuke and let Flux reconcile" must separately patch every PV's reclaim policy as a defensive step. The friction means operators avoid an otherwise simple recovery path.
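The defensive step described above amounts to a merge patch on each PV's spec.persistentVolumeReclaimPolicy. A minimal sketch of building that patch body in Go, using only the standard library, follows. The helper name reclaimPatch and the local struct types are hypothetical and exist only for illustration; the resulting JSON is the kind of body one would hand to kubectl patch pv or to a client's merge-patch call.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pvSpecPatch mirrors only the single PersistentVolume spec field being changed.
type pvSpecPatch struct {
	ReclaimPolicy string `json:"persistentVolumeReclaimPolicy"`
}

// pvPatch is the top-level merge-patch document.
type pvPatch struct {
	Spec pvSpecPatch `json:"spec"`
}

// reclaimPatch (hypothetical helper) returns a JSON merge-patch body that sets
// the PV reclaim policy. Only Retain and Delete are accepted in this sketch;
// the deprecated Recycle policy is deliberately left out.
func reclaimPatch(policy string) (string, error) {
	switch policy {
	case "Retain", "Delete":
	default:
		return "", fmt.Errorf("unsupported reclaim policy %q", policy)
	}
	b, err := json.Marshal(pvPatch{Spec: pvSpecPatch{ReclaimPolicy: policy}})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := reclaimPatch("Retain")
	if err != nil {
		panic(err)
	}
	fmt.Println(body) // prints {"spec":{"persistentVolumeReclaimPolicy":"Retain"}}
}
```

Because the patch only names the one field, it leaves the rest of the PV spec untouched and is safe to apply repeatedly.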
Relevant experts
- kubernetes-specialist: controller-runtime, ensure_pvc.go, lifecycle handling
- platform-engineer: StorageClass design, EBS reclaim semantics, infra patterns
Proposed approach
Two viable paths:
1. Per-template storageClassName (preferred, smaller change). Add storageClassName to SeiNodeTemplate spec; expose a corresponding Retain-reclaim StorageClass (e.g. gp3-retain-archive) in the platform repo. Archive SNDs opt in via storageClassName: gp3-retain-archive. The SC has reclaimPolicy: Retain, so all provisioned PVs inherit that policy at creation time and no post-bind patching is needed. ~½ day Go + tests, plus 1 StorageClass YAML.
2. Post-bind PV patch in ensure_pvc.go. When the SND's deletionPolicy: Retain is set, after the PVC is Bound, look up the PV and patch persistentVolumeReclaimPolicy: Retain. Requires a reconcile loop until the PV exists. ~1 day Go + tests. Doesn't require any platform-side StorageClass.
Approach (1) is cleaner: it is declarative at provisioning time and needs no runtime patching. Approach (2) requires no opt-in from any SND that already sets Retain, but it introduces an asynchronous patch, so a PV keeps its Delete policy until the patch lands.
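The reconcile decision behind approach (2) can be sketched as a small pure function. This is only an illustration under stated assumptions: pvcState, action, and nextAction are hypothetical names, and the plain string fields stand in for the real Kubernetes API types so the example stays self-contained. It does not reflect the current contents of ensure_pvc.go.

```go
package main

import "fmt"

// pvcState is a simplified stand-in for the PVC fields this decision needs.
type pvcState struct {
	Phase      string // "Pending", "Bound", or "Lost"
	VolumeName string // name of the bound PV; empty until binding completes
}

// action is what the reconcile should do next for one PVC.
type action int

const (
	actionNone        action = iota // nothing to do
	actionRequeue                   // PV not ready yet; try again later
	actionPatchRetain               // patch the PV's reclaim policy to Retain
)

func (a action) String() string {
	switch a {
	case actionRequeue:
		return "requeue"
	case actionPatchRetain:
		return "patch-retain"
	default:
		return "none"
	}
}

// nextAction (hypothetical) decides the next step. deletionPolicy is the SND's
// spec.deletionPolicy; pvReclaim is the bound PV's current reclaim policy, or
// "" if the PV could not be found yet.
func nextAction(deletionPolicy string, pvc pvcState, pvReclaim string) action {
	if deletionPolicy != "Retain" {
		return actionNone // the SND did not ask for retention
	}
	if pvc.Phase != "Bound" || pvc.VolumeName == "" || pvReclaim == "" {
		return actionRequeue // wait until the PVC is bound and the PV exists
	}
	if pvReclaim == "Retain" {
		return actionNone // already patched; keeps the reconcile idempotent
	}
	return actionPatchRetain
}

func main() {
	bound := pvcState{Phase: "Bound", VolumeName: "pv-example"}
	fmt.Println(nextAction("Retain", pvcState{Phase: "Pending"}, ""))
	fmt.Println(nextAction("Retain", bound, "Delete"))
	fmt.Println(nextAction("Retain", bound, "Retain"))
}
```

Keeping the decision separate from the API calls makes the requeue and idempotency cases easy to unit test without a cluster.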
Acceptance criteria
- deletionPolicy: Retain does not result in PV deletion under any cascade path (e2e: create SND with Retain, delete SND, confirm PVC + PV remain, confirm EBS volume not destroyed).
- DeletionPolicy field doc (api/v1alpha1/seinodedeployment_types.go) clarifies volume-preservation behavior.
- storageClassName field on SeiNodeTemplate; integration test for a SeiNode pointing at a Retain-reclaim SC.
Out of scope
- Migrating existing PVs to Retain reclaim. This issue addresses new provisioning; existing volumes get a one-time manual patch as a runbook task.
- Name-collision handling on SND recreate. Orphaning a SeiNode and having Flux recreate the SND today causes a name conflict (orphan still exists with the same name). Solving "delete the SND, let Flux recreate cleanly" end-to-end requires this issue plus a separate orphan-adoption or replace flow. File as follow-up if/when needed.
- Non-EBS storage backends. Framing is EBS-specific; other CSI drivers (EFS, FSx) may have different reclaim semantics worth verifying separately.
References
- Discovered during incident triage 2026-05-19: pacific-1 node-0-0 stuck on missing rbac-proxy CM
- internal/controller/nodedeployment/controller.go:160 (current orphan-children behavior under DeletionPolicy: Retain)
- internal/task/ensure_pvc.go (PVC creation path that doesn't honor SND retain semantics)
- api/v1alpha1/seinodedeployment_types.go:21-27 (current DeletionPolicy field definition + doc)