feat: bump seictl to use S3 transfer manager for parallel snapshot downloads by bdchatham · Pull Request #3 · sei-protocol/sei-k8s-controller

bdchatham · 2026-03-15T15:53:34Z

Summary

Bumps seictl dependency to pick up sei-protocol/seictl#27 which integrates the AWS S3 transfer manager for parallel byte-range downloads during snapshot restore
Updates defaultSidecarImage in labels.go and sample manifests (pacific-1-snapshotter, pacific-1-replay) to the new container digest (sha256:ad50d546c3aa...)
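
The parallel byte-range idea behind the transfer manager can be sketched with the standard library alone. This is a minimal illustration, not the seictl or AWS SDK implementation: `memObject`, `splitRanges`, and `downloadParallel` are hypothetical names, and an in-memory `io.ReaderAt` stands in for ranged S3 GET requests.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

// memObject is a hypothetical in-memory stand-in for a remote object that
// supports ranged reads (what an S3 GET with a Range header would provide).
type memObject []byte

func (m memObject) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(m)) {
		return 0, io.EOF
	}
	n := copy(p, m[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// byteRange is an inclusive [Start, End] span, matching HTTP Range semantics.
type byteRange struct{ Start, End int64 }

// splitRanges cuts an object of the given size into parts of at most partSize bytes.
func splitRanges(size, partSize int64) []byteRange {
	if partSize <= 0 {
		return nil
	}
	var parts []byteRange
	for off := int64(0); off < size; off += partSize {
		end := off + partSize - 1
		if end >= size {
			end = size - 1
		}
		parts = append(parts, byteRange{Start: off, End: end})
	}
	return parts
}

// downloadParallel fetches every range with at most `concurrency` in-flight
// reads and writes each part at its own offset, so parts may finish in any order.
func downloadParallel(src io.ReaderAt, size, partSize int64, concurrency int) ([]byte, error) {
	if partSize <= 0 || concurrency <= 0 {
		return nil, errors.New("partSize and concurrency must be positive")
	}
	dst := make([]byte, size)
	sem := make(chan struct{}, concurrency) // bounds the number of parallel reads
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, r := range splitRanges(size, partSize) {
		wg.Add(1)
		sem <- struct{}{}
		go func(r byteRange) {
			defer wg.Done()
			defer func() { <-sem }()
			buf := dst[r.Start : r.End+1] // disjoint sub-slice, no lock needed for the copy
			n, err := src.ReadAt(buf, r.Start)
			if n < len(buf) { // a short read means the part is incomplete
				mu.Lock()
				if firstErr == nil {
					firstErr = fmt.Errorf("range %d-%d: %w", r.Start, r.End, err)
				}
				mu.Unlock()
			}
		}(r)
	}
	wg.Wait()
	return dst, firstErr
}

func main() {
	object := memObject("snapshot-bytes-standing-in-for-a-real-archive")
	got, err := downloadParallel(object, int64(len(object)), 8, 4)
	fmt.Println(err == nil, string(got) == string(object))
}
```

Writing each part into a disjoint slice of the destination is what lets the parts complete out of order; a real transfer manager applies the same idea against a file or `io.WriterAt`.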

Test plan

Verify controller builds cleanly with updated dependency
Deploy to brandon cluster and confirm snapshot restore uses parallel downloads
Monitor snapshotter and replay nodes through full state-sync cycle

…wnloads

Update seictl dependency to pick up the transfer manager integration (sei-protocol/seictl#27) which uses the AWS S3 transfer manager for parallel byte-range downloads during snapshot restore. Also updates the default sidecar image and sample manifests to the new container digest.

Updates the LLD's Open Dependencies section to reflect the resolution of three companion sub-issues:

#3 SND inline genesis rejection — Resolved by #148 (merged); the CEL rule lands on SeiNodeDeploymentSpec rejecting genesis on non-validator-role deployments. Broader than the original ValidationRun-side framing (covers fullNode/archive/replayer; SND has a 4-role discriminator).

#4 PodMonitor for sei-k8s-controller — Resolved as already-implemented. The upstream config/monitoring/ ships a ServiceMonitor (functionally equivalent to PodMonitor) which is pulled in transitively via the platform's kustomization. sei-protocol/platform#243 closed-as-already-implemented after kubectl kustomize verification. Same pattern as Open Dependency #1's lag_status mis-claim.

#6 monitoring/ namespace label — Resolved by sei-protocol/platform#269 (merged); the cluster's monitoring/ namespace now carries sei.io/validation-shared-rules=true.

Open Dependencies #2 (TaskPlan.TargetPhase decision), #5 (heartbeat PrometheusRule, deferred), #7 (Indexed-Job shard env injection), and #8 (controller registration opt-in plumbing) remain open.

Refs sei-k8s-controller#143 #145 #144 platform#243 #245

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l) (#328)

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence. Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field.

What's asserted (stuck shape, holds for 10s via SatisfyAll + HaveField under Consistently):
- Status.Rollout != nil, RolloutInProgress=True
- Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True
- Status.Phase = Upgrading
- Status.ReadyReplicas = replicas (existing pods stay healthy)
- ConditionNodesReady = True
- Status.Rollout.StartedAt stable across reconciles (alert-rule load-bearing field; a regression that re-stamps it would hide stalls from age-based alerting)
- No RolloutComplete event fires during the stall
- Children Spec.Image advanced to v2 (UpdateNodeSpecs ran)
- Children Status.CurrentImage stays at v1 (ObserveImage stalled)

Recovery (Eventually with 60s budget): resuming the StatefulSet status faker lets the rollout reach terminal state on its own, proving the stuck state is the stall, not a deeper controller failure.

Harness: uses the existing StatusFaker.Pause/Resume primitive introduced in #325. No production code touched.

Coral cross-review applied (k8s-specialist + sre-engineer):

Incorporated:
- Struct-snapshot pattern + SatisfyAll(HaveField(...)) for per-field diagnostic messages on Consistently flakes
- Status.Phase == Upgrading (kubectl get snd column #3 — the literal thing an on-call types first)
- ReadyReplicas + ConditionNodesReady simultaneously True (the diagnostic contradiction pattern operators look for)
- StartedAt stability inside Consistently (load-bearing for any future "rollout stuck >1h" alert)
- ConditionPlanInProgress (operators read conditions before reading embedded Status.Plan)
- Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2)) — a regression writing garbage would have passed before)
- Recovery uses Eventually with 60s budget, not the default 30s pollTimeout — recovery has to drain a 10s faker backlog plus 5+ reconcile hops, and p99 under loaded CI can spike past 30s

Deferred to follow-ups (out of scope for this PR):
- Controller-still-reconciling liveness probe (harness pattern, separate test class)
- Sibling failure-class tests for image-pull, PVC bind, sidecar crash (each has a distinct envtest-stub shape)
- PromQL/alert-rule rotation in /Users/brandon/platform/clusters/ monitoring stacks — file /issue against observability-platform-engineer; no SND-keyed PromQL exists today, so a stuck prod rollout pages nobody until someone manually checks
- Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate alongside the sibling failure-class tests

Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests); stuck test individual ~17s. verify-generated clean.
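
The "hold the stuck shape, then recover within a budget" structure described above can be sketched without the test framework. This is a rough stdlib analogue, assuming nothing about the real envtest harness: `consistently` and `eventually` only imitate the idea of Gomega's matchers of the same name, and `fakeRollout`, `status`, `stuckShape`, `runScenario`, and the terminal phase string "Ready" are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// consistently polls check for the whole window and fails on the first violation.
func consistently(window, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(window)
	for {
		if err := check(); err != nil {
			return fmt.Errorf("condition broke during hold window: %w", err)
		}
		if !time.Now().Before(deadline) {
			return nil
		}
		time.Sleep(interval)
	}
}

// eventually polls until check passes or the budget runs out.
func eventually(budget, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(budget)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("not met within %s: %w", budget, err)
		}
		time.Sleep(interval)
	}
}

// status is a hypothetical snapshot of the fields the stuck-shape check reads.
type status struct {
	Phase        string
	CurrentImage string
	StartedAt    time.Time
}

// fakeRollout reports a stalled rollout while paused and a finished one after resume.
type fakeRollout struct {
	paused    atomic.Bool
	startedAt time.Time
}

func newFakeRollout() *fakeRollout {
	f := &fakeRollout{startedAt: time.Now()}
	f.paused.Store(true)
	return f
}

func (f *fakeRollout) observe() status {
	if f.paused.Load() {
		return status{Phase: "Upgrading", CurrentImage: "v1", StartedAt: f.startedAt}
	}
	return status{Phase: "Ready", CurrentImage: "v2", StartedAt: f.startedAt}
}

// stuckShape checks each field separately so a failure names the field that moved.
func stuckShape(f *fakeRollout) func() error {
	wantStart := f.startedAt
	return func() error {
		s := f.observe()
		switch {
		case s.Phase != "Upgrading":
			return fmt.Errorf("Phase = %q, want Upgrading", s.Phase)
		case s.CurrentImage != "v1":
			return fmt.Errorf("CurrentImage = %q, want v1", s.CurrentImage)
		case !s.StartedAt.Equal(wantStart):
			return fmt.Errorf("StartedAt was re-stamped")
		}
		return nil
	}
}

// runScenario holds the stuck shape, resumes the fake, then waits for recovery.
func runScenario() (stuckErr, recoverErr error) {
	f := newFakeRollout()
	stuckErr = consistently(200*time.Millisecond, 20*time.Millisecond, stuckShape(f))
	f.paused.Store(false) // analogous to resuming the status faker
	recoverErr = eventually(time.Second, 20*time.Millisecond, func() error {
		if s := f.observe(); s.CurrentImage != "v2" {
			return fmt.Errorf("CurrentImage = %q, want v2", s.CurrentImage)
		}
		return nil
	})
	return stuckErr, recoverErr
}

func main() {
	stuckErr, recoverErr := runScenario()
	fmt.Println("stuck shape held:", stuckErr == nil, "recovered:", recoverErr == nil)
}
```

Asserting an exact value (`CurrentImage` equals v1) rather than a negation mirrors the typed-assertion point in the review notes: a check that only rules out v2 would also accept garbage.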

Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:
- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
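
As a rough illustration of the CM-name guard, the check reduces to comparing every referenced ConfigMap name against the one the helper would derive. The sketch below makes several assumptions: `workflowVarsName` and its "-workflow-vars" suffix are invented for illustration (the real WorkflowVarsName convention is not shown in the commit message), and `scenario` holds fields as if the YAML were already decoded.

```go
package main

import "fmt"

// workflowVarsName is a HYPOTHETICAL stand-in for the WorkflowVarsName helper
// named in the commit message; the suffix here is an assumption.
func workflowVarsName(workflowName string) string {
	return workflowName + "-workflow-vars"
}

// scenario holds only the fields the guard needs, as if already decoded from YAML.
type scenario struct {
	File          string
	WorkflowName  string   // Workflow CR metadata.name
	ConfigMapRefs []string // every envFrom configMapRef.name found in the scenario
}

// checkCMNames returns one message per configMapRef that drifts from the convention.
func checkCMNames(s scenario) []string {
	want := workflowVarsName(s.WorkflowName)
	var problems []string
	for _, got := range s.ConfigMapRefs {
		if got != want {
			problems = append(problems,
				fmt.Sprintf("%s: configMapRef.name %q, want %q", s.File, got, want))
		}
	}
	return problems
}

func main() {
	drifted := scenario{
		File:          "release-test.yaml",
		WorkflowName:  "release-test",
		ConfigMapRefs: []string{"release-test-workflow-vars", "release-test-vars"},
	}
	for _, p := range checkCMNames(drifted) {
		fmt.Println(p)
	}
}
```

Deriving the expected name through the same helper the binary uses, rather than hard-coding it in the test, is what turns a first-fire failure into a build-time one.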

* fix(seitask): register sei.io scheme + grant workflownodes RBAC

Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1.
2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list).

Run-release-test's "Base URL is missing a protocol" was a downstream symptom — provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors — each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(seitask): build-time guards for scheme + scenario CM-name contracts

Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:
- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
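
The scheme-registration failure and the round-trip guard can be illustrated with a toy registry. This is a simplified sketch and not apimachinery's runtime.Scheme: `registry`, `builtinOnly`, `taskScheme`, `missingKinds`, and the empty struct types are hypothetical stand-ins that only mimic the "type must be registered before a typed client can use it" behaviour.

```go
package main

import (
	"fmt"
	"reflect"
)

// registry is a toy stand-in for a scheme: it maps Go types to kind names.
type registry map[reflect.Type]string

func (r registry) register(kind string, obj any) { r[reflect.TypeOf(obj)] = kind }

// kindFor fails for unregistered types, loosely mirroring the first-fire error.
func (r registry) kindFor(obj any) (string, error) {
	kind, ok := r[reflect.TypeOf(obj)]
	if !ok {
		return "", fmt.Errorf("no kind is registered for the type %T", obj)
	}
	return kind, nil
}

// Placeholder types; the real API types carry specs and metadata.
type (
	Pod               struct{}
	SeiNodeDeployment struct{}
	SeiNodeTask       struct{}
)

// builtinOnly models a client built with only the built-in Kubernetes types.
func builtinOnly() registry {
	r := registry{}
	r.register("Pod", &Pod{})
	return r
}

// taskScheme models the fix: built-in types plus the custom API group's types.
func taskScheme() registry {
	r := builtinOnly()
	r.register("SeiNodeDeployment", &SeiNodeDeployment{})
	r.register("SeiNodeTask", &SeiNodeTask{})
	return r
}

// missingKinds is the guard: it lists every object type the registry cannot resolve.
func missingKinds(r registry, objs ...any) []string {
	var missing []string
	for _, obj := range objs {
		if _, err := r.kindFor(obj); err != nil {
			missing = append(missing, err.Error())
		}
	}
	return missing
}

func main() {
	used := []any{&SeiNodeDeployment{}, &SeiNodeTask{}}
	fmt.Println("builtin-only missing:", len(missingKinds(builtinOnly(), used...)))
	fmt.Println("taskScheme missing:", len(missingKinds(taskScheme(), used...)))
}
```

The guard is only as good as the list of types passed to it, so in practice that list would need to track every type the subcommands construct.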

bdchatham marked this pull request as ready for review March 15, 2026 16:00

bdchatham merged commit da619cb into main Mar 15, 2026
2 checks passed

bdchatham deleted the feat/seictl-transfer-manager branch March 19, 2026 21:00

This was referenced Apr 29, 2026

Admission validation: reject SeiNodeDeployment with genesis on full-node-role deployments #145

Closed

docs: mark resolved Open Dependencies — sweep #3, #4, #6 #151

Merged

bdchatham mentioned this pull request May 21, 2026

test(envtest): Phase 4 — stuck rollout baseline (AwaitSpecUpdate stall) #328

Merged

3 tasks

bdchatham merged 1 commit into main from feat/seictl-transfer-manager
