`deletionPolicy: Retain` on SeiNodeDeployment doesn't protect the underlying PV reclaim policy #292

## Problem

Today, when a `SeiNodeDeployment` has `spec.deletionPolicy: Retain` set and the SND is deleted, the controller calls `orphanChildSeiNodes` / `orphanNetworkingResources` (`internal/controller/nodedeployment/controller.go:160`), which removes ownerReferences from child SeiNodes so they survive the SND deletion. However, this orphan path never touches `persistentVolumeReclaimPolicy` on the bound PersistentVolume. A dynamically provisioned PV inherits its reclaim policy from its StorageClass, which defaults to `Delete`. For our EBS volumes, any downstream operation that ultimately deletes the `PersistentVolumeClaim` therefore also destroys the underlying EBS volume, even though the SND was explicitly set to `Retain`.
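To make the gap concrete, here is a minimal Go sketch of the current behavior. `ReclaimPolicy`, `volumeSurvivesPVCDeletion`, and `dataAtRisk` are hypothetical names used only for illustration; they are local stand-ins, not types or functions from the controller or from `k8s.io/api`.

```go
package main

import "fmt"

// ReclaimPolicy is a local stand-in for the string values of a PV's
// spec.persistentVolumeReclaimPolicy; it is not the k8s.io/api type.
type ReclaimPolicy string

const (
	ReclaimDelete ReclaimPolicy = "Delete"
	ReclaimRetain ReclaimPolicy = "Retain"
)

// volumeSurvivesPVCDeletion reports whether the backing volume is kept
// once its PVC is deleted. Only Retain keeps the volume.
func volumeSurvivesPVCDeletion(pv ReclaimPolicy) bool {
	return pv == ReclaimRetain
}

// dataAtRisk models the gap described above: the SND's deletionPolicy is
// accepted as an input but deliberately ignored, because today nothing
// propagates it to the PV.
func dataAtRisk(sndDeletionPolicy string, pv ReclaimPolicy) bool {
	_ = sndDeletionPolicy
	return !volumeSurvivesPVCDeletion(pv)
}

func main() {
	// SND says Retain, PV inherited Delete from its StorageClass.
	fmt.Println(dataAtRisk("Retain", ReclaimDelete)) // true
	// SND says Retain, PV was provisioned or patched with Retain.
	fmt.Println(dataAtRisk("Retain", ReclaimRetain)) // false
}
```

The point of the sketch is that the first argument has no effect on the result: only the PV's own policy decides whether the EBS volume survives.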

Observed during incident triage on 2026-05-19 in `pacific-1`: while triaging a stuck `node-0-0-0` pod (missing rbac-proxy ConfigMap), the operator's instinct was "delete the SND and let Flux recreate it cleanly." Read-only checks revealed that the SND's `deletionPolicy: Retain` setting offered no protection for the data on the bound PV. The field's name strongly implies data preservation, but in practice it only orphans the child Kubernetes objects.

## Impact

**Production data-loss risk on archive nodes.** In `pacific-1`, three archive SeiNodes (`archive-0-0`, `archive-1-0`, `archive-2-0`) hold large state-snapshot datasets that take many hours to days to rebuild from peers. Operators relying on `deletionPolicy: Retain` for safety today have a false sense of protection — a careless `kubectl delete pvc` (or any controller-side cascade that ultimately deletes the PVC) destroys irreplaceable chain data.

This also blocks "delete and recreate from Flux" as a generally safe troubleshooting pattern. Today, any incident triage that reaches for "nuke and let Flux reconcile" must separately patch every PV's reclaim policy as a defensive step. The friction means operators avoid an otherwise simple recovery path.

## Relevant experts

- `kubernetes-specialist` — controller-runtime, `ensure_pvc.go`, lifecycle handling
- `platform-engineer` — StorageClass design, EBS reclaim semantics, infra patterns

## Proposed approach

Two viable paths:

1. **Per-template `storageClassName`** (preferred, smaller change). Add `storageClassName` to `SeiNodeTemplate` spec; expose a corresponding `Retain`-reclaim StorageClass (e.g. `gp3-retain-archive`) in the platform repo. Archive SNDs opt in via `storageClassName: gp3-retain-archive`. The SC has `reclaimPolicy: Retain` so all provisioned PVs inherit that policy at creation time — no post-bind patching needed. ~½ day Go + tests, plus 1 StorageClass YAML.

2. **Post-bind PV patch in `ensure_pvc.go`**. When the SND's `deletionPolicy: Retain` is set, after the PVC is Bound, look up the PV and patch `persistentVolumeReclaimPolicy: Retain`. Requires a reconcile loop until the PV exists. ~1 day Go + tests. Doesn't require any platform-side StorageClass.
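Option 1 from the list above could look roughly like the sketch below. It assumes a hypothetical `StorageClassName *string` field on `SeiNodeTemplate` that is copied onto the PVC unchanged. All types here are local stand-ins rather than the real API types, and the class name `gp3` is made up for the example. The one Kubernetes fact it relies on is that a StorageClass with no `reclaimPolicy` set yields PVs with `Delete`.

```go
package main

import "fmt"

// StorageClass is a local stand-in for the storage.k8s.io/v1 object.
type StorageClass struct {
	Name          string
	ReclaimPolicy *string // nil means the API default, "Delete"
}

// SeiNodeTemplate shows only the proposed field (hypothetical).
type SeiNodeTemplate struct {
	StorageClassName *string // nil: fall back to the cluster default class
}

// PVCSpec is a stand-in for the part of the PVC spec that matters here.
type PVCSpec struct {
	StorageClassName *string
}

// pvcSpecFor copies the template's class onto the PVC unchanged.
func pvcSpecFor(t SeiNodeTemplate) PVCSpec {
	return PVCSpec{StorageClassName: t.StorageClassName}
}

// provisionedReclaimPolicy returns the policy a dynamically provisioned
// PV inherits from its StorageClass.
func provisionedReclaimPolicy(sc StorageClass) string {
	if sc.ReclaimPolicy == nil {
		return "Delete"
	}
	return *sc.ReclaimPolicy
}

func main() {
	retain := "Retain"
	name := "gp3-retain-archive"
	archiveSC := StorageClass{Name: name, ReclaimPolicy: &retain}

	spec := pvcSpecFor(SeiNodeTemplate{StorageClassName: &name})
	fmt.Println(*spec.StorageClassName, provisionedReclaimPolicy(archiveSC))

	// A class without an explicit policy still provisions Delete PVs.
	fmt.Println(provisionedReclaimPolicy(StorageClass{Name: "gp3"}))
}
```

Because the policy is fixed when the volume is provisioned, there is no window in which an archive PV exists with `Delete`.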

Approach (1) is cleaner: the policy is declared at provisioning time and needs no runtime patching. Approach (2) requires no opt-in from SNDs that already set `Retain`, but it introduces an asynchronous patch, so a newly bound PV keeps the `Delete` policy until that patch lands.
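The per-reconcile decision for the post-bind patch can be sketched as a pure function. This is an assumption about how `ensure_pvc.go` might be structured, not its current code: `action`, `pvcState`, and `nextAction` are hypothetical names, and the real implementation would read the PVC and PV through the controller's client.

```go
package main

import "fmt"

// action is what the reconciler should do on this pass (hypothetical type).
type action int

const (
	actionNone    action = iota // nothing to do
	actionRequeue               // PV not available yet, try again later
	actionPatch                 // set persistentVolumeReclaimPolicy: Retain
)

// pvcState is a stand-in for the PVC fields the decision needs.
type pvcState struct {
	Phase      string // "Pending", "Bound", or "Lost"
	VolumeName string // spec.volumeName, empty until a PV is bound
}

// nextAction decides the step for one reconcile pass. pvFound and pvPolicy
// describe the PV named by pvc.VolumeName, if it could be fetched.
func nextAction(deletionPolicy string, pvc pvcState, pvFound bool, pvPolicy string) action {
	if deletionPolicy != "Retain" {
		return actionNone
	}
	if pvc.Phase != "Bound" || pvc.VolumeName == "" {
		return actionRequeue
	}
	if !pvFound {
		return actionRequeue
	}
	if pvPolicy == "Retain" {
		return actionNone // already patched, keeps the loop idempotent
	}
	return actionPatch
}

func main() {
	bound := pvcState{Phase: "Bound", VolumeName: "pv-example"}
	fmt.Println(nextAction("Retain", pvcState{Phase: "Pending"}, false, "") == actionRequeue)
	fmt.Println(nextAction("Retain", bound, true, "Delete") == actionPatch)
	fmt.Println(nextAction("Retain", bound, true, "Retain") == actionNone)
}
```

Keeping the decision separate from the API calls would make the PV-creation race straightforward to cover in a table-driven unit test.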

## Acceptance criteria

- [ ] Deleting an SND with `deletionPolicy: Retain` does not result in PV deletion under any cascade path (e2e: create SND with Retain, delete SND, confirm PVC + PV remain, confirm EBS volume not destroyed).
- [ ] `DeletionPolicy` field doc (`api/v1alpha1/seinodedeployment_types.go`) clarifies volume-preservation behavior.
- [ ] If approach (1) chosen: `storageClassName` field on `SeiNodeTemplate`; integration test for a SeiNode pointing at a Retain-reclaim SC.
- [ ] If approach (2) chosen: unit test for PV-patch reconcile with race coverage on PV creation latency.

## Out of scope

- **Migrating existing PVs** to Retain reclaim. This issue addresses new provisioning; existing volumes get a one-time manual patch as a runbook task.
- **Name-collision handling on SND recreate.** Orphaning a SeiNode and having Flux recreate the SND today causes a name conflict (orphan still exists with the same name). Solving "delete the SND, let Flux recreate cleanly" end-to-end requires this issue plus a separate orphan-adoption or replace flow. File as follow-up if/when needed.
- **Non-EBS storage backends.** Framing is EBS-specific; other CSI drivers (EFS, FSx) may have different reclaim semantics worth verifying separately.
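For the one-time manual patch of existing PVs mentioned in the first item, the patch body is small enough to show. The sketch below only builds the JSON; the type and function names are illustrative, and applying the patch is left to the runbook.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pvPatch models just the field being changed on the PersistentVolume.
type pvPatch struct {
	Spec pvPatchSpec `json:"spec"`
}

type pvPatchSpec struct {
	ReclaimPolicy string `json:"persistentVolumeReclaimPolicy"`
}

// retainPatch returns the patch body that switches a PV to Retain.
func retainPatch() ([]byte, error) {
	return json.Marshal(pvPatch{Spec: pvPatchSpec{ReclaimPolicy: "Retain"}})
}

func main() {
	body, err := retainPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"spec":{"persistentVolumeReclaimPolicy":"Retain"}}
}
```

The same body can be applied by hand with `kubectl patch pv <pv-name> -p '<body>'`, once per existing archive PV.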

## References

- Discovered during incident triage 2026-05-19 — `pacific-1` `node-0-0` stuck on missing rbac-proxy CM
- `internal/controller/nodedeployment/controller.go:160` — current orphan-children behavior under `DeletionPolicy: Retain`
- `internal/task/ensure_pvc.go` — PVC creation path that doesn't honor SND retain semantics
- `api/v1alpha1/seinodedeployment_types.go:21-27` — current `DeletionPolicy` field definition + doc
