feat: bump seictl to use S3 transfer manager for parallel snapshot downloads #3
Merged
…wnloads Update seictl dependency to pick up the transfer manager integration (sei-protocol/seictl#27) which uses the AWS S3 transfer manager for parallel byte-range downloads during snapshot restore. Also updates the default sidecar image and sample manifests to the new container digest.
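As a rough illustration of what "parallel byte-range downloads" means, the sketch below splits an object into inclusive byte ranges and copies each range concurrently into its own slot of the destination buffer. It uses only the Go standard library and an in-memory byte slice as a stand-in for the S3 object. It is not the transfer manager's API or seictl's code, and `byteRange`, `splitRanges`, `fetchParallel`, and the part size are hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// byteRange is an inclusive [Start, End] span, matching HTTP Range semantics.
type byteRange struct {
	Start, End int64
}

// Header renders the span as an HTTP Range header value.
func (r byteRange) Header() string {
	return fmt.Sprintf("bytes=%d-%d", r.Start, r.End)
}

// splitRanges divides an object of size bytes into consecutive parts of at
// most partSize bytes. partSize is a hypothetical tuning knob.
func splitRanges(size, partSize int64) []byteRange {
	if size <= 0 || partSize <= 0 {
		return nil
	}
	var out []byteRange
	for start := int64(0); start < size; start += partSize {
		end := start + partSize - 1
		if end > size-1 {
			end = size - 1 // the last part may be shorter
		}
		out = append(out, byteRange{Start: start, End: end})
	}
	return out
}

// fetchParallel copies src into a new buffer one range per goroutine.
// src stands in for the remote object; a real client would issue one ranged
// GET per part. The ranges are disjoint, so the writes do not overlap.
func fetchParallel(src []byte, partSize int64) []byte {
	dst := make([]byte, len(src))
	var wg sync.WaitGroup
	for _, r := range splitRanges(int64(len(src)), partSize) {
		wg.Add(1)
		go func(r byteRange) {
			defer wg.Done()
			copy(dst[r.Start:r.End+1], src[r.Start:r.End+1])
		}(r)
	}
	wg.Wait()
	return dst
}

func main() {
	for _, r := range splitRanges(10, 4) {
		fmt.Println(r.Header())
	}
	fmt.Println(string(fetchParallel([]byte("hello world"), 4)))
}
```

Because every part writes to a fixed offset, parts can complete in any order without extra coordination, which is the property that makes range-based parallelism attractive for large snapshot files.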
This was referenced Apr 29, 2026
Closed
bdchatham added a commit that referenced this pull request Apr 30, 2026
Updates the LLD's Open Dependencies section to reflect the resolution of three companion sub-issues:

- #3 SND inline genesis rejection: Resolved by #148 (merged); the CEL rule lands on SeiNodeDeploymentSpec rejecting genesis on non-validator-role deployments. Broader than the original ValidationRun-side framing (covers fullNode/archive/replayer; SND has a 4-role discriminator).
- #4 PodMonitor for sei-k8s-controller: Resolved as already-implemented. The upstream config/monitoring/ ships a ServiceMonitor (functionally equivalent to PodMonitor) which is pulled in transitively via the platform's kustomization. sei-protocol/platform#243 closed-as-already-implemented after kubectl kustomize verification. Same pattern as Open Dependency #1's lag_status mis-claim.
- #6 monitoring/ namespace label: Resolved by sei-protocol/platform#269 (merged); the cluster's monitoring/ namespace now carries sei.io/validation-shared-rules=true.

Open Dependencies #2 (TaskPlan.TargetPhase decision), #5 (heartbeat PrometheusRule, deferred), #7 (Indexed-Job shard env injection), and #8 (controller registration opt-in plumbing) remain open.

Refs sei-k8s-controller#143 #145 #144 platform#243 #245

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
bdchatham added a commit that referenced this pull request May 21, 2026
…l) (#328)

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate cannot observe child convergence. Establishes the contrast baseline the forthcoming Paused feature must remain observably distinguishable from on every status field.

What's asserted (stuck shape, holds for 10s via SatisfyAll + HaveField under Consistently):
- Status.Rollout != nil, RolloutInProgress=True
- Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True
- Status.Phase = Upgrading
- Status.ReadyReplicas = replicas (existing pods stay healthy)
- ConditionNodesReady = True
- Status.Rollout.StartedAt stable across reconciles (alert-rule load-bearing field; a regression that re-stamps it would hide stalls from age-based alerting)
- No RolloutComplete event fires during the stall
- Children Spec.Image advanced to v2 (UpdateNodeSpecs ran)
- Children Status.CurrentImage stays at v1 (ObserveImage stalled)

Recovery (Eventually with 60s budget): resuming the StatefulSet status faker lets the rollout reach terminal state on its own, proving the stuck state is the stall, not a deeper controller failure.

Harness: uses the existing StatusFaker.Pause/Resume primitive introduced in #325. No production code touched.

Coral cross-review applied (k8s-specialist + sre-engineer):

Incorporated:
- Struct-snapshot pattern + SatisfyAll(HaveField(...)) for per-field diagnostic messages on Consistently flakes
- Status.Phase == Upgrading (kubectl get snd column #3: the literal thing an on-call types first)
- ReadyReplicas + ConditionNodesReady simultaneously True (the diagnostic contradiction pattern operators look for)
- StartedAt stability inside Consistently (load-bearing for any future "rollout stuck >1h" alert)
- ConditionPlanInProgress (operators read conditions before reading embedded Status.Plan)
- Typed CurrentImage assertion (Equal(v1), not the weak NotTo(Equal(v2)); a regression writing garbage would have passed before)
- Recovery uses Eventually with 60s budget, not the default 30s pollTimeout; recovery has to drain a 10s faker backlog plus 5+ reconcile hops, and p99 under loaded CI can spike past 30s

Deferred to follow-ups (out of scope for this PR):
- Controller-still-reconciling liveness probe (harness pattern, separate test class)
- Sibling failure-class tests for image-pull, PVC bind, sidecar crash (each has a distinct envtest-stub shape)
- PromQL/alert-rule rotation in /Users/brandon/platform/clusters/ monitoring stacks; file /issue against observability-platform-engineer; no SND-keyed PromQL exists today, so a stuck prod rollout pages nobody until someone manually checks
- Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate alongside the sibling failure-class tests

Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall (all 5 envtest tests); stuck test individual ~17s. verify-generated clean.
bdchatham added a commit that referenced this pull request May 21, 2026
Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:
- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded; its CM is bash-created with a different convention; revisit when the half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
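The CM-name guard described above can be sketched, in simplified form, as a pure check over already-parsed scenario data. This is only an illustration of the shape of the assertion: `workflowVarsName` is a hypothetical stand-in for `WorkflowVarsName`, the `-vars` suffix is an assumed convention (the real one is not shown in this thread), and the `scenario` struct replaces the YAML walking that the real test performs.

```go
package main

import "fmt"

// workflowVarsName is a hypothetical stand-in for WorkflowVarsName.
// The "-vars" suffix is an assumption made for this example only.
func workflowVarsName(workflowName string) string {
	return workflowName + "-vars"
}

// scenario holds only the fields the guard needs: the Workflow CR's
// metadata.name and every envFrom configMapRef.name found in the scenario.
type scenario struct {
	WorkflowName  string
	ConfigMapRefs []string
}

// checkCMNames returns one error per configMapRef whose name differs from
// the name derived from the workflow's metadata.name.
func checkCMNames(s scenario) []error {
	want := workflowVarsName(s.WorkflowName)
	var errs []error
	for _, got := range s.ConfigMapRefs {
		if got != want {
			errs = append(errs, fmt.Errorf(
				"workflow %q: configMapRef %q, want %q", s.WorkflowName, got, want))
		}
	}
	return errs
}

func main() {
	s := scenario{
		WorkflowName:  "release-test",
		ConfigMapRefs: []string{"release-test-vars", "release-vars"},
	}
	for _, err := range checkCMNames(s) {
		fmt.Println(err)
	}
}
```

Deriving the expected name from the same helper the binary uses, instead of hard-coding it in the test, is what lets a guard like this catch drift when the convention changes.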
bdchatham added a commit that referenced this pull request May 21, 2026
* fix(seitask): register sei.io scheme + grant workflownodes RBAC

Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1.
2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list).

Run-release-test's "Base URL is missing a protocol" was a downstream symptom: provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors: each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(seitask): build-time guards for scheme + scenario CM-name contracts

Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:
- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded; its CM is bash-created with a different convention; revisit when the half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Update `seictl` dependency to pick up sei-protocol/seictl#27, which integrates the AWS S3 transfer manager for parallel byte-range downloads during snapshot restore
- Update `defaultSidecarImage` in `labels.go` and sample manifests (`pacific-1-snapshotter`, `pacific-1-replay`) to the new container digest (`sha256:ad50d546c3aa...`)

Test plan