fix(seitask): register sei.io scheme + grant workflownodes RBAC #339
Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1.
2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list).

run-release-test's "Base URL is missing a protocol" was a downstream symptom: provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors. Each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR Summary
Medium Risk
Overview: Adds regression tests to ensure the scheme includes required Sei CRDs and adds a scenario YAML contract test to prevent [...]. Extends runner RBAC to allow read/list access to Chaos Mesh [...].
Reviewed by Cursor Bugbot for commit 295f480.
Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded; its CM is bash-created with a different convention. Revisit when the half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):

- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Third manual fire of the release-test Workflow surfaced two more contract bugs across the seitask + RBAC interface — both load-bearing for the harness to actually function. Both fixed here.
Bug 1 — `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`
`provision-validator-chain` and `provision-rpc-fleet` Task pods both failed at `c.Create(snd, ...)`. Root cause: `cmd/seitask/main.go:kubeClientFromEnv` built the controller-runtime client with `client-go/kubernetes/scheme.Scheme`, which has only the built-in K8s types registered. `sei.io/v1alpha1.SeiNodeDeployment` was never added, so the client could not map the typed SND to a kind and returned the error client-side, before any request reached the API server.
Fix: explicit local `taskScheme` initialized once at package level, registering builtins + sei.io/v1alpha1. Chaos Mesh CRs stay on `unstructured` and don't need registration.
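To illustrate the failure mode, here is a minimal stdlib-only model of a type-to-kind registry. It is a sketch, not controller-runtime's `runtime.Scheme`: `toyScheme`, the empty `Pod` and `SeiNodeDeployment` structs, and the `builtinOnly` / `taskScheme` constructors are stand-ins invented for this example (the real `taskScheme` is a package-level value in `cmd/seitask`).

```go
package main

import (
	"fmt"
	"reflect"
)

// Stand-in types for this sketch only; the real ones live in the
// Kubernetes API packages and the sei.io/v1alpha1 API package.
type Pod struct{}
type SeiNodeDeployment struct{}

// toyScheme maps Go types to kind names, loosely mimicking the lookup a
// typed client performs before it can build a request.
type toyScheme map[reflect.Type]string

func (s toyScheme) register(obj any, kind string) {
	s[reflect.TypeOf(obj)] = kind
}

func (s toyScheme) kindFor(obj any) (string, error) {
	kind, ok := s[reflect.TypeOf(obj)]
	if !ok {
		return "", fmt.Errorf("no kind is registered for the type %T in scheme", obj)
	}
	return kind, nil
}

// builtinOnly models the pre-fix client: only built-in types are known.
func builtinOnly() toyScheme {
	s := toyScheme{}
	s.register(Pod{}, "Pod")
	return s
}

// taskScheme models the fix: built-ins plus the sei.io/v1alpha1 type.
func taskScheme() toyScheme {
	s := builtinOnly()
	s.register(SeiNodeDeployment{}, "SeiNodeDeployment")
	return s
}

func main() {
	_, err := builtinOnly().kindFor(SeiNodeDeployment{})
	fmt.Println("before fix:", err)

	kind, err := taskScheme().kindFor(SeiNodeDeployment{})
	fmt.Println("after fix:", kind, err)
}
```

The lookup fails purely on the Go type, which is why the bug only showed up when a subcommand first issued a typed `Create`, and why a scheme-level unit test is enough to catch it.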
Bug 2 — `workflownodes.chaos-mesh.org is forbidden`
`upload-report` failed listing WorkflowNodes for the S3 snapshot. `runner/rbac.yaml` granted `workflows: [get]` for `LoadWorkflowIdentity` but not `workflownodes`.
Fix: extend the existing chaos-mesh.org rule to `["workflows", "workflownodes"]` with verbs `["get", "list"]`.
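The effect of the rule change can be sketched with a simplified rule matcher. This is an illustrative model only: `policyRule` and `allows` are hypothetical names, and wildcards, `resourceNames`, and multi-group rules are deliberately not modelled.

```go
package main

import "fmt"

// policyRule is a simplified stand-in for one RBAC rule.
type policyRule struct {
	APIGroup  string
	Resources []string
	Verbs     []string
}

func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want {
			return true
		}
	}
	return false
}

// allows reports whether the rule permits verb on resource in group.
func (r policyRule) allows(group, resource, verb string) bool {
	return r.APIGroup == group &&
		contains(r.Resources, resource) &&
		contains(r.Verbs, verb)
}

// ruleBefore mirrors the original grant: workflows, get only.
var ruleBefore = policyRule{
	APIGroup:  "chaos-mesh.org",
	Resources: []string{"workflows"},
	Verbs:     []string{"get"},
}

// ruleAfter mirrors the fix: workflows + workflownodes, get + list.
var ruleAfter = policyRule{
	APIGroup:  "chaos-mesh.org",
	Resources: []string{"workflows", "workflownodes"},
	Verbs:     []string{"get", "list"},
}

func main() {
	fmt.Println("before:", ruleBefore.allows("chaos-mesh.org", "workflownodes", "list"))
	fmt.Println("after: ", ruleAfter.allows("chaos-mesh.org", "workflownodes", "list"))
}
```

Because both resources share one rule, the change also grants `list` on `workflows`, which the original rule did not.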
Note on the third symptom
`run-release-test` errored with `Base URL is missing a protocol. Expected 'ws://' or 'wss://'`. This was a downstream symptom of Bug 1: provision-snd never reached its endpoint-publishing step, so `RPC_TM_RPC` never landed in workflow-vars, and the release-test pod's `SEI_TENDERMINT_RPC:$(RPC_TM_RPC)` resolved to the literal string `$(RPC_TM_RPC)` (K8s leaves unresolved `$(VAR)` references unchanged). Should resolve automatically once Bug 1 is fixed.
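The pass-through behaviour can be sketched as follows. This is a simplified model of `$(VAR)` expansion, not the Kubernetes implementation: `expand` and `hasWSScheme` are hypothetical helpers, the `$$` escape is not modelled, and the example endpoint URL is made up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// varRef matches $(NAME) references; the capture group is the name.
var varRef = regexp.MustCompile(`\$\(([^)]+)\)`)

// expand replaces $(NAME) with env[NAME] when NAME is defined and leaves
// the reference untouched otherwise.
func expand(value string, env map[string]string) string {
	return varRef.ReplaceAllStringFunc(value, func(ref string) string {
		name := ref[2 : len(ref)-1] // strip "$(" and ")"
		if v, ok := env[name]; ok {
			return v
		}
		return ref
	})
}

// hasWSScheme mimics the protocol check behind the reported error.
func hasWSScheme(url string) bool {
	return strings.HasPrefix(url, "ws://") || strings.HasPrefix(url, "wss://")
}

func main() {
	// RPC_TM_RPC was never published, so the reference survives as-is.
	unresolved := expand("$(RPC_TM_RPC)", map[string]string{})
	fmt.Println(unresolved, hasWSScheme(unresolved))

	// Hypothetical endpoint, as it might look once provision-snd publishes it.
	env := map[string]string{"RPC_TM_RPC": "ws://rpc.example:26657/websocket"}
	resolved := expand("$(RPC_TM_RPC)", env)
	fmt.Println(resolved, hasWSScheme(resolved))
}
```

Since the unresolved reference is a non-empty string, the container starts normally and the problem only surfaces when the consumer parses the value as a URL.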
Separately: Chaos Mesh Serial doesn't fail-fast
The manual fire also confirmed a structural Chaos Mesh quirk: in v2.8.0, the `Serial` template type does NOT fail-fast on child Task errors. All 4 downstream pods ran to termination despite Bug 1 + Bug 2; each WorkflowNode showed `Accomplished=True` regardless of pod exit code. Chaos Mesh's primary use case is fault injection where "the fault ran" is the goal, so marching through child failures is upstream design intent.
Not in scope for this PR. Filed as a separate follow-up — needs design: ConditionalBranches gating, orchestrator-side EXIT_REASON polling + workflow abort, or bash-Task wrappers that abort the parent on `exit 1`.
Test plan
🤖 Generated with Claude Code