fix(seitask): register sei.io scheme + grant workflownodes RBAC (#339)

Merged: bdchatham merged 2 commits into main from fix/scheme-and-workflownodes-rbac on May 21, 2026

@bdchatham bdchatham commented May 21, 2026

Summary

Third manual fire of the release-test Workflow surfaced two more contract bugs across the seitask + RBAC interface — both load-bearing for the harness to actually function. Both fixed here.

Bug 1 — `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`

`provision-validator-chain` and `provision-rpc-fleet` Task pods both failed at `c.Create(snd, ...)`. Root cause: `cmd/seitask/main.go:kubeClientFromEnv` built the controller-runtime client with `client-go/kubernetes/scheme.Scheme`, which only has builtin K8s types registered. `sei.io/v1alpha1.SeiNodeDeployment` was never added, so the client could not resolve a GroupVersionKind for the typed SND.

Fix: explicit local `taskScheme` initialized once at package level, registering builtins + sei.io/v1alpha1. Chaos Mesh CRs stay on `unstructured` and don't need registration.
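
For readers less familiar with this failure mode, here is a minimal stdlib-only sketch of a type registry. It is a toy model, not apimachinery's `runtime.Scheme`: the `Scheme`, `GVK`, `addBuiltins`, and `addSei` names and the empty `ConfigMap`/`SeiNodeDeployment` structs are hypothetical stand-ins. It only illustrates why a lookup keyed by Go type fails for an unregistered CRD type and succeeds once a package-level scheme registers both sets.

```go
package main

import (
	"fmt"
	"reflect"
)

// GVK is a simplified stand-in for a Kubernetes GroupVersionKind.
type GVK struct{ Group, Version, Kind string }

// Scheme is a toy registry mapping Go types to GVKs. It mimics only the
// lookup that failed in Bug 1; it is NOT apimachinery's runtime.Scheme.
type Scheme struct{ kinds map[reflect.Type]GVK }

func NewScheme() *Scheme { return &Scheme{kinds: map[reflect.Type]GVK{}} }

// Register records the GVK for obj's concrete Go type.
func (s *Scheme) Register(obj any, gvk GVK) { s.kinds[reflect.TypeOf(obj)] = gvk }

// KindFor resolves the GVK a typed client needs before it can send obj.
func (s *Scheme) KindFor(obj any) (GVK, error) {
	gvk, ok := s.kinds[reflect.TypeOf(obj)]
	if !ok {
		return GVK{}, fmt.Errorf("no kind is registered for the type %T in scheme", obj)
	}
	return gvk, nil
}

// Hypothetical stand-ins for one builtin type and the sei.io CRD type.
type ConfigMap struct{}
type SeiNodeDeployment struct{}

func addBuiltins(s *Scheme) {
	s.Register(&ConfigMap{}, GVK{"", "v1", "ConfigMap"})
}

func addSei(s *Scheme) {
	s.Register(&SeiNodeDeployment{}, GVK{"sei.io", "v1alpha1", "SeiNodeDeployment"})
}

// taskScheme is built once at package level with both type sets registered.
var taskScheme = func() *Scheme {
	s := NewScheme()
	addBuiltins(s)
	addSei(s)
	return s
}()

func main() {
	// The pre-fix situation: only builtin types are known.
	builtinOnly := NewScheme()
	addBuiltins(builtinOnly)
	_, err := builtinOnly.KindFor(&SeiNodeDeployment{})
	fmt.Println("builtin-only scheme:", err)

	// The post-fix situation: the CRD type resolves.
	gvk, err := taskScheme.KindFor(&SeiNodeDeployment{})
	fmt.Println("taskScheme:", gvk, err)
}
```

Building the scheme in a package-level initializer means every client constructed in the package sees the same registrations, which is the property the fix relies on.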

Bug 2 — `workflownodes.chaos-mesh.org is forbidden`

`upload-report` failed listing WorkflowNodes for the S3 snapshot. `runner/rbac.yaml` granted `workflows: [get]` for `LoadWorkflowIdentity` but not `workflownodes`.

Fix: extend the existing chaos-mesh.org rule to `["workflows", "workflownodes"]` with verbs `["get", "list"]`.
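
The effect of the rule change can be sketched with a simplified model of RBAC rule matching. This is an illustration under stated assumptions, not the Kubernetes authorizer: `PolicyRule` and `Allows` are hypothetical names, and wildcards, `resourceNames`, and subresources are ignored.

```go
package main

import "fmt"

// PolicyRule is a simplified model of one RBAC rule for a single API group.
type PolicyRule struct {
	APIGroup  string
	Resources []string
	Verbs     []string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// Allows reports whether the rule grants verb on resource in group.
// Wildcards and resourceNames are deliberately ignored in this sketch.
func (r PolicyRule) Allows(group, resource, verb string) bool {
	return r.APIGroup == group && contains(r.Resources, resource) && contains(r.Verbs, verb)
}

func main() {
	// Rule as described before the fix: workflows, get only.
	before := PolicyRule{"chaos-mesh.org", []string{"workflows"}, []string{"get"}}
	// Rule as described after the fix.
	after := PolicyRule{"chaos-mesh.org", []string{"workflows", "workflownodes"}, []string{"get", "list"}}

	fmt.Println("before, list workflownodes:", before.Allows("chaos-mesh.org", "workflownodes", "list"))
	fmt.Println("after, list workflownodes: ", after.Allows("chaos-mesh.org", "workflownodes", "list"))
	fmt.Println("after, get workflows:      ", after.Allows("chaos-mesh.org", "workflows", "get"))
}
```

Because the rule is extended rather than split, `list` is also granted on `workflows`, not only on `workflownodes`.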

Note on the third symptom

`run-release-test` errored with `Base URL is missing a protocol. Expected 'ws://' or 'wss://'`. This was a downstream symptom of Bug 1: provision-snd never reached its endpoint-publishing step, so `RPC_TM_RPC` never landed in workflow-vars, and the release-test pod's `SEI_TENDERMINT_RPC: $(RPC_TM_RPC)` resolved to the literal string `$(RPC_TM_RPC)` (K8s leaves unresolved `$(VAR)` references unchanged). Should resolve automatically once Bug 1 is fixed.
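
The pass-through behavior can be reproduced with a simplified sketch of `$(VAR)` expansion. This is not the kubelet's implementation (it skips the `$$(VAR)` escape, for instance), and the `expand` and `hasWSProtocol` helpers and the example URL are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ref matches $(NAME) references. Simplified: no $$(NAME) escape handling.
var ref = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// expand substitutes known variables and leaves unknown references unchanged,
// mirroring the behavior described above.
func expand(value string, vars map[string]string) string {
	return ref.ReplaceAllStringFunc(value, func(m string) string {
		name := m[2 : len(m)-1] // strip the "$(" prefix and ")" suffix
		if v, ok := vars[name]; ok {
			return v
		}
		return m
	})
}

// hasWSProtocol is a hypothetical version of the release-test image's check.
func hasWSProtocol(url string) bool {
	return strings.HasPrefix(url, "ws://") || strings.HasPrefix(url, "wss://")
}

func main() {
	// RPC_TM_RPC was never published, so the reference survives as a literal.
	missing := expand("$(RPC_TM_RPC)", map[string]string{})
	fmt.Printf("%q valid=%v\n", missing, hasWSProtocol(missing))

	// With the variable present (example value is an assumption), it resolves.
	present := expand("$(RPC_TM_RPC)", map[string]string{"RPC_TM_RPC": "ws://rpc.example:26657/websocket"})
	fmt.Printf("%q valid=%v\n", present, hasWSProtocol(present))
}
```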

Separately: Chaos Mesh Serial doesn't fail-fast

The manual fire also confirmed a structural Chaos Mesh quirk: in v2.8.0 the `Serial` template type does NOT fail-fast on child Task errors. All 4 downstream pods ran to termination despite Bug 1 + Bug 2; each WorkflowNode showed `Accomplished=True` regardless of pod exit code. Chaos Mesh's primary use case is fault injection where "the fault ran" is the goal, so marching through child failures is upstream design intent.

Not in scope for this PR. Filed as a separate follow-up — needs design: ConditionalBranches gating, orchestrator-side EXIT_REASON polling + workflow abort, or bash-Task wrappers that abort the parent on `exit 1`.
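
To make the gap concrete, here is a toy contrast between the observed `Serial` semantics and a fail-fast variant. It models behavior only and uses no Chaos Mesh API; `Task`, `runSerial`, and `runSerialFailFast` are hypothetical names, and the task order and exit codes are illustrative assumptions.

```go
package main

import "fmt"

// Task is a hypothetical child step reporting its pod's exit code.
type Task struct {
	Name     string
	ExitCode int
}

// runSerial models the observed behavior: every child runs in order and
// exit codes are ignored.
func runSerial(tasks []Task) []string {
	var ran []string
	for _, t := range tasks {
		ran = append(ran, t.Name)
	}
	return ran
}

// runSerialFailFast stops at the first child with a non-zero exit code.
func runSerialFailFast(tasks []Task) ([]string, error) {
	var ran []string
	for _, t := range tasks {
		ran = append(ran, t.Name)
		if t.ExitCode != 0 {
			return ran, fmt.Errorf("task %s exited with code %d", t.Name, t.ExitCode)
		}
	}
	return ran, nil
}

func main() {
	// Illustrative ordering and exit codes, not taken from the real run.
	tasks := []Task{
		{"provision-validator-chain", 1},
		{"provision-rpc-fleet", 1},
		{"run-release-test", 1},
		{"upload-report", 1},
	}
	fmt.Println("serial:   ", runSerial(tasks))
	ran, err := runSerialFailFast(tasks)
	fmt.Println("fail-fast:", ran, err)
}
```

Any of the three follow-up options listed above would need to reintroduce the early return that the second function has and the first lacks.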

Test plan

  • `go test ./...` passes
  • `golangci-lint run` clean
  • After merge + image build + SCENARIO_REF bump in platform: manual fire walks past provision-validator-chain successfully (the bug-1 truth-test)

🤖 Generated with Claude Code

Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create
   time with `no kind is registered for the type v1alpha1.SeiNodeDeployment
   in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a
   controller-runtime client with client-go's built-in scheme (K8s
   types only); sei.io/v1alpha1 was never registered. Fix: local
   taskScheme registering builtin + sei.io/v1alpha1.

2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml
   granting workflows (get) but not workflownodes. Fix: add workflownodes
   (get, list).

Run-release-test's "Base URL is missing a protocol" was a downstream
symptom — provision-snd never published endpoints to workflow-vars, so
$(RPC_TM_RPC) couldn't resolve and the literal string passed through to
the release-test image. Resolves automatically once #1 is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors —
each WorkflowNode transitions to Accomplished=True on pod termination
regardless of exit code, and Serial proceeds to the next child. Filed
as separate follow-up; tracked as architectural concern, not in scope
for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cursor Bot commented May 21, 2026

PR Summary

Medium Risk
Medium risk: changes how seitask constructs its controller-runtime client scheme and expands Chaos Mesh RBAC, which can affect runtime behavior and permissions in test harness clusters.

Overview
Fixes seitask Kubernetes client initialization by introducing a package-level taskScheme that registers built-in K8s types plus sei.io/v1alpha1, enabling typed Create/Get round-trips for SeiNodeDeployment/SeiNodeTask.

Adds regression tests to ensure the scheme includes required Sei CRDs and adds a scenario YAML contract test to prevent workflow-vars ConfigMap name mismatches.

Extends runner RBAC to allow get/list access to Chaos Mesh workflownodes (alongside workflows) so upload-report can enumerate workflow node trees.

Reviewed by Cursor Bugbot for commit 295f480.

Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.
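
The shape of the CM-name contract check can be sketched as follows. This is an illustration, not the test that landed: `workflowVarsName` is a HYPOTHETICAL stand-in for `WorkflowVarsName` with an assumed `-vars` suffix, and the `Scenario` struct stands in for fields already extracted from a scenario YAML.

```go
package main

import "fmt"

// workflowVarsName is a HYPOTHETICAL stand-in for seitask's WorkflowVarsName.
// The "-vars" suffix is an assumption made only for this illustration.
func workflowVarsName(workflowName string) string {
	return workflowName + "-vars"
}

// Scenario holds the two things the contract check compares, as if already
// extracted from a scenario YAML.
type Scenario struct {
	WorkflowName  string   // the Workflow CR's metadata.name
	ConfigMapRefs []string // every envFrom configMapRef.name in the Workflow
}

// mismatchedRefs returns each configMapRef name that does not equal the name
// derived from the Workflow's metadata.name.
func mismatchedRefs(s Scenario) []string {
	want := workflowVarsName(s.WorkflowName)
	var bad []string
	for _, ref := range s.ConfigMapRefs {
		if ref != want {
			bad = append(bad, ref)
		}
	}
	return bad
}

func main() {
	ok := Scenario{"release-test", []string{"release-test-vars", "release-test-vars"}}
	drifted := Scenario{"release-test", []string{"release-test-vars", "workflow-vars"}}
	fmt.Println("ok mismatches:     ", mismatchedRefs(ok))
	fmt.Println("drifted mismatches:", mismatchedRefs(drifted))
}
```

Deriving the expected name from the same helper the binary uses is what turns a manually mirrored convention into something `go test` can check.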

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit 922d599 into main May 21, 2026
5 checks passed