fix(scenarios/release-test): align workflow-vars CM name with seitask convention by bdchatham · Pull Request #337 · sei-protocol/sei-k8s-controller

bdchatham · 2026-05-21T17:45:56Z

Summary

Second bug from the post-#334 manual fire of the release-test Workflow.

`keygen-admin` successfully created the workflow-vars CM at `workflow-vars-release-test-20260521-173309` (derived from `taskruntime.WorkflowVarsName(workflowName)` where workflowName is the full Workflow CR name). The next Task (`provision-validator-chain`) referenced `workflow-vars-20260521-173309` via envFrom — missing the `release-test-` prefix. Kubelet never found the CM and the pod stuck in `CreateContainerConfigError` indefinitely.

Fix

Three `configMapRef.name` references in `scenarios/release-test.yaml` updated from `workflow-vars-$SEI_WORKFLOW_RUN_ID` to `workflow-vars-release-test-$SEI_WORKFLOW_RUN_ID`. Matches what `taskruntime.WorkflowVarsName` produces and is consistent with the existing `admin-release-test-$SEI_WORKFLOW_RUN_ID` secretKeyRef at line 155 (same workflow-name-based convention).

Why this got missed

Unit tests pass because `taskruntime.WorkflowVarsName(...)` is internally consistent — keygen, provision-snd, and upload-report all use the same helper. The break is at the YAML interface boundary where the scenario author has to manually mirror the convention; there's no compile-time enforcement.

Follow-up worth tracking: factoring the CM name derivation out into a render-time substitution (e.g., `{{ workflowVarsCM }}` template helper) would eliminate the manual-mirroring class of bug entirely. Out of scope for this fix.

Test plan

`SEI_NAMESPACE=nightly SEI_CHAIN_ID=rel-test SEI_WORKFLOW_RUN_ID=test123 ... envsubst < scenarios/release-test.yaml | kubectl apply --dry-run=server --validate=strict` passes on harbor
After merge + SCENARIO_REF bump in sei-protocol/platform: manual fire of release-test-workflow walks past provision-validator-chain successfully

🤖 Generated with Claude Code

… convention Second bug from the post-#334 manual fire. keygen-admin creates the workflow-vars CM via taskruntime.EnsureWorkflowVarsCM which derives the CM name from the FULL Workflow CR name (release-test-${RUN_ID}). The scenario YAML's three envFrom references used just ${SEI_WORKFLOW_RUN_ID} (no scenario prefix), so the kubelet looked for a CM that didn't exist and provision-validator-chain's pod stuck in CreateContainerConfigError indefinitely. Fix: align all three envFrom references to workflow-vars-release-test-${SEI_WORKFLOW_RUN_ID}, matching what keygen actually creates (and the admin-secret reference at line 155 already uses the same workflow-name-based convention). Convention: workflow-vars-<workflow-cr-name>. Sourced from taskruntime.WorkflowVarsName which has no scenario knowledge and only takes the Workflow CR name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor · 2026-05-21T17:46:03Z

PR Summary

Low Risk
Low risk: updates only YAML configMapRef.name values for the release-test workflow; main risk is misnaming could still break workflow startup, but change is narrowly scoped.

Overview
Fixes the release-test Chaos Mesh workflow to load env vars from the correctly named workflow vars ConfigMap.

Updates three envFrom.configMapRef.name references to include the release-test- prefix (i.e., workflow-vars-release-test-$SEI_WORKFLOW_RUN_ID), preventing pods from failing with missing ConfigMap during provision-* and run-release-test steps.

^{Reviewed by Cursor Bugbot for commit ef6e9ff. Bugbot is set up for automated code reviews on this repo. Configure here.}

Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(seitask): register sei.io scheme + grant workflownodes RBAC Third manual fire of release-test surfaced two more contract bugs: 1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1. 2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list). Run-release-test's "Base URL is missing a protocol" was a downstream symptom — provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed. Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors — each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(seitask): build-time guards for scheme + scenario CM-name contracts Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…draft) (#343) * feat(seitask): per-pod EVM endpoint publishing + load-test Workflow draft Scaffolding for N=2 scenario migration (load-test) while N=1 (release-test) is still validating end-to-end on harbor. Draft PR — do not merge until release-test fires fully green. provision-snd additive change: - Adds taskruntime.KeyEVMJSONRPCList var key for comma-separated per-pod EVM URLs. RoleScoped publishes RPC_EVM_RPC_LIST alongside the existing pod-0 RPC_EVM_RPC. Seiload needs all-pod URLs for its stateful EVM workload (sei-k8s-controller#244); release-test stays on pod-0-only. - Test extended to assert both keys, fake SND now has 2 nodes. scenarios/load-test.yaml (Chaos Mesh Workflow): - Mirrors release-test's structure: keygen omitted (seiload manages its own accounts), wait-rpc-caught-up + render-seiload-profile added as bash kubectl Tasks specific to this scenario. - Profile JSON renders via sed substitution against a source ConfigMap mounted from the nightly namespace (seiload-profiles). Per-run rendered CM stamped with workflow ownerRef → cascade on delete. - Seiload Task mirrors the existing clusters/harbor/nightly/load/ seiload-job.yaml's container shape (resources, env, ports). scenarios/load-test/{validator,rpc}.yaml.tmpl: - Minimal SND templates. No genesis.accounts (seiload doesn't fund admin), no rpc overrides (release-test's lag_threshold tightening is harness-specific). Open prereqs for first fire (not in this PR): - pods/exec verb on the seitask-runner Role for wait-rpc-caught-up (kubectl exec seid status). Track in platform-repo wrapper PR. - Platform-repo wrapper CronJob mirroring clusters/harbor/nightly/ release/workflow.yaml's shape. Follows same merge order: this PR → image build → platform wrapper PR. - seiload-profiles ConfigMap in nightly namespace (currently the profile lives in nightly-load-orchestrator CM alongside bash; split when wrapper PR lands). Build-time scenarios_test extended to include load-test.yaml in its CM-name validation allowlist — guards against #337-class drift for the new scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address platform + k8s cross-review on #343 Both reviewers converged on three load-bearing fixes before un-drafting: 1. **pods/exec verb in runner/rbac.yaml**. wait-rpc-caught-up uses `kubectl exec seid status` to poll catching_up=false; without this verb the Task 403s on first fire. Source-of-truth Role belongs in this repo; platform-vendored copy needs a coordinated mirror PR. 2. **--duration interpolation**. Was `--duration=${DURATION_MINUTES}m` (envsubst form), which K8s containers do not substitute. Switched to `--duration=$(DURATION_MINUTES)m` with a matching env: entry — resolves at pod creation via K8s container env interpolation, no dependency on wrapper envsubst expanding it. 3. **render-seiload-profile preflight CM check**. Without it, a missing `seiload-profiles` source CM hangs the pod in ContainerCreating for ~5m before deadline fires. One-line `kubectl get configmap` upfront converts silent failure into a labeled exit. Also folded in: - SEILOAD_COMMIT_ID env on run-seiload (legacy seiload-job.yaml has it; dashboards keyed on this label would drop a dimension on the new path). Wrapper parses from $SEID_IMAGE tag and passes via envsubst. - Dropped dead-weight SEI_WORKFLOW_RUN_ID env on render-seiload-profile (already arrives via envFrom workflow-vars). NOT applying: both reviewers' suggestion to stamp `serviceAccountName: seitask-runner` on Task containers. Chaos Mesh v2.8's Task.container schema does not expose serviceAccountName (documented in project_chaos_mesh_workflow_gaps memory + the comment block in clusters/harbor/nightly/major-upgrade/rbac.yaml). The established workaround is the platform#626 RoleBinding that grants seitask-runner permissions to the namespace `default` SA, which every Chaos Mesh Task pod runs as. Both reviewers missed this constraint; the fix is already in place at platform layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: wait-rpc-caught-up via Tendermint /status network call User suggestion on cross-review: hit Tendermint /status directly instead of kubectl exec into the RPC pod. Strict improvement: - Removes pods/exec RBAC requirement entirely — no platform-vendored Role update needed for first fire. (Reverts the runner/rbac.yaml pods/exec verb added in the prior commit.) - Drops kubectl from the wait Task; uses curlimages/curl (leaner). - Hits ${RPC_TM_RPC}/status (aggregate ClusterIP, already published by provision-snd to workflow-vars). For a "is anyone caught up" check, load-balanced polling gives natural multi-pod coverage as iterations land on different backends. - Uses the same `.result // .` envelope tolerance pattern as provision-snd's waitForFirstBlock — Sei CometBFT sometimes returns /status unwrapped. One fewer cross-repo coordination point: this PR no longer requires platform/clusters/harbor/nightly/major-upgrade/rbac.yaml to mirror a pods/exec verb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bdchatham merged commit bbf6a15 into main May 21, 2026
5 checks passed

bdchatham mentioned this pull request May 21, 2026

Chaos Mesh Serial doesn't fail-fast on child Task errors — need explicit abort mechanism #340

Open

4 tasks

bdchatham mentioned this pull request May 21, 2026

Scenario contract enforcement: build-time guards + single-sourced CM name #341

Open

4 tasks

bdchatham mentioned this pull request May 21, 2026

feat(seitask): per-pod EVM endpoint publishing + load-test Workflow (draft) #343

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scenarios/release-test): align workflow-vars CM name with seitask convention#337

fix(scenarios/release-test): align workflow-vars CM name with seitask convention#337
bdchatham merged 1 commit into
mainfrom
fix/workflow-vars-cm-name

bdchatham commented May 21, 2026 •

edited

Loading

Uh oh!

cursor Bot commented May 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix

Why this got missed

Test plan

Uh oh!

cursor Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bdchatham commented May 21, 2026 •

edited

Loading

cursor Bot commented May 21, 2026 •

edited

Loading