fix(scenarios/release-test): align workflow-vars CM name with seitask convention#337
Merged
Conversation
… convention Second bug from the post-#334 manual fire. keygen-admin creates the workflow-vars CM via taskruntime.EnsureWorkflowVarsCM which derives the CM name from the FULL Workflow CR name (release-test-${RUN_ID}). The scenario YAML's three envFrom references used just ${SEI_WORKFLOW_RUN_ID} (no scenario prefix), so the kubelet looked for a CM that didn't exist and provision-validator-chain's pod stuck in CreateContainerConfigError indefinitely. Fix: align all three envFrom references to workflow-vars-release-test-${SEI_WORKFLOW_RUN_ID}, matching what keygen actually creates (and the admin-secret reference at line 155 already uses the same workflow-name-based convention). Convention: workflow-vars-<workflow-cr-name>. Sourced from taskruntime.WorkflowVarsName which has no scenario knowledge and only takes the Workflow CR name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryLow Risk Overview Updates three Reviewed by Cursor Bugbot for commit ef6e9ff. Bugbot is set up for automated code reviews on this repo. Configure here. |
4 tasks
bdchatham
added a commit
that referenced
this pull request
May 21, 2026
Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
bdchatham
added a commit
that referenced
this pull request
May 21, 2026
* fix(seitask): register sei.io scheme + grant workflownodes RBAC Third manual fire of release-test surfaced two more contract bugs: 1. provision-validator-chain + provision-rpc-fleet failed at Create time with `no kind is registered for the type v1alpha1.SeiNodeDeployment in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a controller-runtime client with client-go's built-in scheme (K8s types only); sei.io/v1alpha1 was never registered. Fix: local taskScheme registering builtin + sei.io/v1alpha1. 2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml granting workflows (get) but not workflownodes. Fix: add workflownodes (get, list). Run-release-test's "Base URL is missing a protocol" was a downstream symptom — provision-snd never published endpoints to workflow-vars, so $(RPC_TM_RPC) couldn't resolve and the literal string passed through to the release-test image. Resolves automatically once #1 is fixed. Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors — each WorkflowNode transitions to Accomplished=True on pod termination regardless of exit code, and Serial proceeds to the next child. Filed as separate follow-up; tracked as architectural concern, not in scope for this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(seitask): build-time guards for scheme + scenario CM-name contracts Cross-review feedback from platform-engineer + kubernetes-specialist on the #339 chain of fixes: both reviewers independently noted that we've been discovering contract drift between the seitask binary internals and the scenario YAML / RBAC layer at first-fire instead of at build time. Each first-fire bug (#334, #337, #339) is the same shape: an internal helper has a convention, the scenario author has to mirror it manually, no test catches the drift. Two narrow guards land here, ranked by ROI: - TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have caught the #339 scheme-registration bug at `go test`. Validates that the package-level taskScheme actually has every sei.io/v1alpha1 type the seitask subcommands construct via typed Create/Get. - TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught the #337 CM-name drift at `go test`. Walks every scenario YAML in the opt-in allow list (release-test.yaml today), extracts the Workflow CR's metadata.name, asserts every envFrom configMapRef.name matches WorkflowVarsName(metadataName). Major-upgrade is excluded — its CM is bash-created with a different convention; revisit when the half-bash legacy retires. Defers (filed/tracked separately, not in scope for this PR): - RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist ranked #3; defer until a third recurrence). - Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340). - EXIT_REASON write-once-or-fail-classification semantics for #340 (amend on #340). - Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env approach (file new issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
bdchatham
added a commit
that referenced
this pull request
May 21, 2026
…draft) (#343) * feat(seitask): per-pod EVM endpoint publishing + load-test Workflow draft Scaffolding for N=2 scenario migration (load-test) while N=1 (release-test) is still validating end-to-end on harbor. Draft PR — do not merge until release-test fires fully green. provision-snd additive change: - Adds taskruntime.KeyEVMJSONRPCList var key for comma-separated per-pod EVM URLs. RoleScoped publishes RPC_EVM_RPC_LIST alongside the existing pod-0 RPC_EVM_RPC. Seiload needs all-pod URLs for its stateful EVM workload (sei-k8s-controller#244); release-test stays on pod-0-only. - Test extended to assert both keys, fake SND now has 2 nodes. scenarios/load-test.yaml (Chaos Mesh Workflow): - Mirrors release-test's structure: keygen omitted (seiload manages its own accounts), wait-rpc-caught-up + render-seiload-profile added as bash kubectl Tasks specific to this scenario. - Profile JSON renders via sed substitution against a source ConfigMap mounted from the nightly namespace (seiload-profiles). Per-run rendered CM stamped with workflow ownerRef → cascade on delete. - Seiload Task mirrors the existing clusters/harbor/nightly/load/ seiload-job.yaml's container shape (resources, env, ports). scenarios/load-test/{validator,rpc}.yaml.tmpl: - Minimal SND templates. No genesis.accounts (seiload doesn't fund admin), no rpc overrides (release-test's lag_threshold tightening is harness-specific). Open prereqs for first fire (not in this PR): - pods/exec verb on the seitask-runner Role for wait-rpc-caught-up (kubectl exec seid status). Track in platform-repo wrapper PR. - Platform-repo wrapper CronJob mirroring clusters/harbor/nightly/ release/workflow.yaml's shape. Follows same merge order: this PR → image build → platform wrapper PR. - seiload-profiles ConfigMap in nightly namespace (currently the profile lives in nightly-load-orchestrator CM alongside bash; split when wrapper PR lands). Build-time scenarios_test extended to include load-test.yaml in its CM-name validation allowlist — guards against #337-class drift for the new scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address platform + k8s cross-review on #343 Both reviewers converged on three load-bearing fixes before un-drafting: 1. **pods/exec verb in runner/rbac.yaml**. wait-rpc-caught-up uses `kubectl exec seid status` to poll catching_up=false; without this verb the Task 403s on first fire. Source-of-truth Role belongs in this repo; platform-vendored copy needs a coordinated mirror PR. 2. **--duration interpolation**. Was `--duration=${DURATION_MINUTES}m` (envsubst form), which K8s containers do not substitute. Switched to `--duration=$(DURATION_MINUTES)m` with a matching env: entry — resolves at pod creation via K8s container env interpolation, no dependency on wrapper envsubst expanding it. 3. **render-seiload-profile preflight CM check**. Without it, a missing `seiload-profiles` source CM hangs the pod in ContainerCreating for ~5m before deadline fires. One-line `kubectl get configmap` upfront converts silent failure into a labeled exit. Also folded in: - SEILOAD_COMMIT_ID env on run-seiload (legacy seiload-job.yaml has it; dashboards keyed on this label would drop a dimension on the new path). Wrapper parses from $SEID_IMAGE tag and passes via envsubst. - Dropped dead-weight SEI_WORKFLOW_RUN_ID env on render-seiload-profile (already arrives via envFrom workflow-vars). NOT applying: both reviewers' suggestion to stamp `serviceAccountName: seitask-runner` on Task containers. Chaos Mesh v2.8's Task.container schema does not expose serviceAccountName (documented in project_chaos_mesh_workflow_gaps memory + the comment block in clusters/harbor/nightly/major-upgrade/rbac.yaml). The established workaround is the platform#626 RoleBinding that grants seitask-runner permissions to the namespace `default` SA, which every Chaos Mesh Task pod runs as. Both reviewers missed this constraint; the fix is already in place at platform layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: wait-rpc-caught-up via Tendermint /status network call User suggestion on cross-review: hit Tendermint /status directly instead of kubectl exec into the RPC pod. Strict improvement: - Removes pods/exec RBAC requirement entirely — no platform-vendored Role update needed for first fire. (Reverts the runner/rbac.yaml pods/exec verb added in the prior commit.) - Drops kubectl from the wait Task; uses curlimages/curl (leaner). - Hits ${RPC_TM_RPC}/status (aggregate ClusterIP, already published by provision-snd to workflow-vars). For a "is anyone caught up" check, load-balanced polling gives natural multi-pod coverage as iterations land on different backends. - Uses the same `.result // .` envelope tolerance pattern as provision-snd's waitForFirstBlock — Sei CometBFT sometimes returns /status unwrapped. One fewer cross-repo coordination point: this PR no longer requires platform/clusters/harbor/nightly/major-upgrade/rbac.yaml to mirror a pods/exec verb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Second bug from the post-#334 manual fire of the release-test Workflow.
`keygen-admin` successfully created the workflow-vars CM at `workflow-vars-release-test-20260521-173309` (derived from `taskruntime.WorkflowVarsName(workflowName)` where workflowName is the full Workflow CR name). The next Task (`provision-validator-chain`) referenced `workflow-vars-20260521-173309` via envFrom — missing the `release-test-` prefix. Kubelet never found the CM and the pod stuck in `CreateContainerConfigError` indefinitely.
Fix
Three `configMapRef.name` references in `scenarios/release-test.yaml` updated from `workflow-vars-$SEI_WORKFLOW_RUN_ID` to `workflow-vars-release-test-$SEI_WORKFLOW_RUN_ID`. Matches what `taskruntime.WorkflowVarsName` produces and is consistent with the existing `admin-release-test-$SEI_WORKFLOW_RUN_ID` secretKeyRef at line 155 (same workflow-name-based convention).
Why this got missed
Unit tests pass because `taskruntime.WorkflowVarsName(...)` is internally consistent — keygen, provision-snd, and upload-report all use the same helper. The break is at the YAML interface boundary where the scenario author has to manually mirror the convention; there's no compile-time enforcement.
Follow-up worth tracking: factoring the CM name derivation out into a render-time substitution (e.g., `{{ workflowVarsCM }}` template helper) would eliminate the manual-mirroring class of bug entirely. Out of scope for this fix.
Test plan
🤖 Generated with Claude Code