Skip to content

fix(scenarios/release-test): align workflow-vars CM name with seitask convention#337

Merged
bdchatham merged 1 commit into
mainfrom
fix/workflow-vars-cm-name
May 21, 2026
Merged

fix(scenarios/release-test): align workflow-vars CM name with seitask convention#337
bdchatham merged 1 commit into
mainfrom
fix/workflow-vars-cm-name

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented May 21, 2026

Summary

Second bug from the post-#334 manual fire of the release-test Workflow.

`keygen-admin` successfully created the workflow-vars CM at `workflow-vars-release-test-20260521-173309` (derived from `taskruntime.WorkflowVarsName(workflowName)` where workflowName is the full Workflow CR name). The next Task (`provision-validator-chain`) referenced `workflow-vars-20260521-173309` via envFrom — missing the `release-test-` prefix. Kubelet never found the CM and the pod stuck in `CreateContainerConfigError` indefinitely.

Fix

Three `configMapRef.name` references in `scenarios/release-test.yaml` updated from `workflow-vars-$SEI_WORKFLOW_RUN_ID` to `workflow-vars-release-test-$SEI_WORKFLOW_RUN_ID`. Matches what `taskruntime.WorkflowVarsName` produces and is consistent with the existing `admin-release-test-$SEI_WORKFLOW_RUN_ID` secretKeyRef at line 155 (same workflow-name-based convention).

Why this got missed

Unit tests pass because `taskruntime.WorkflowVarsName(...)` is internally consistent — keygen, provision-snd, and upload-report all use the same helper. The break is at the YAML interface boundary where the scenario author has to manually mirror the convention; there's no compile-time enforcement.

Follow-up worth tracking: factoring the CM name derivation out into a render-time substitution (e.g., `{{ workflowVarsCM }}` template helper) would eliminate the manual-mirroring class of bug entirely. Out of scope for this fix.

Test plan

  • `SEI_NAMESPACE=nightly SEI_CHAIN_ID=rel-test SEI_WORKFLOW_RUN_ID=test123 ... envsubst < scenarios/release-test.yaml | kubectl apply --dry-run=server --validate=strict` passes on harbor
  • After merge + SCENARIO_REF bump in sei-protocol/platform: manual fire of release-test-workflow walks past provision-validator-chain successfully

🤖 Generated with Claude Code

… convention

Second bug from the post-#334 manual fire. keygen-admin creates the
workflow-vars CM via taskruntime.EnsureWorkflowVarsCM which derives the
CM name from the FULL Workflow CR name (release-test-${RUN_ID}). The
scenario YAML's three envFrom references used just ${SEI_WORKFLOW_RUN_ID}
(no scenario prefix), so the kubelet looked for a CM that didn't exist
and provision-validator-chain's pod stuck in CreateContainerConfigError
indefinitely.

Fix: align all three envFrom references to
workflow-vars-release-test-${SEI_WORKFLOW_RUN_ID}, matching what keygen
actually creates (and the admin-secret reference at line 155 already
uses the same workflow-name-based convention).

Convention: workflow-vars-<workflow-cr-name>. Sourced from
taskruntime.WorkflowVarsName which has no scenario knowledge and only
takes the Workflow CR name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 21, 2026

PR Summary

Low Risk
Low risk: updates only YAML configMapRef.name values for the release-test workflow; main risk is misnaming could still break workflow startup, but change is narrowly scoped.

Overview
Fixes the release-test Chaos Mesh workflow to load env vars from the correctly named workflow vars ConfigMap.

Updates three envFrom.configMapRef.name references to include the release-test- prefix (i.e., workflow-vars-release-test-$SEI_WORKFLOW_RUN_ID), preventing pods from failing with missing ConfigMap during provision-* and run-release-test steps.

Reviewed by Cursor Bugbot for commit ef6e9ff. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham bdchatham merged commit bbf6a15 into main May 21, 2026
5 checks passed
bdchatham added a commit that referenced this pull request May 21, 2026
Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 21, 2026
* fix(seitask): register sei.io scheme + grant workflownodes RBAC

Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create
   time with `no kind is registered for the type v1alpha1.SeiNodeDeployment
   in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a
   controller-runtime client with client-go's built-in scheme (K8s
   types only); sei.io/v1alpha1 was never registered. Fix: local
   taskScheme registering builtin + sei.io/v1alpha1.

2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml
   granting workflows (get) but not workflownodes. Fix: add workflownodes
   (get, list).

Run-release-test's "Base URL is missing a protocol" was a downstream
symptom — provision-snd never published endpoints to workflow-vars, so
$(RPC_TM_RPC) couldn't resolve and the literal string passed through to
the release-test image. Resolves automatically once #1 is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors —
each WorkflowNode transitions to Accomplished=True on pod termination
regardless of exit code, and Serial proceeds to the next child. Filed
as separate follow-up; tracked as architectural concern, not in scope
for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(seitask): build-time guards for scheme + scenario CM-name contracts

Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 21, 2026
…draft) (#343)

* feat(seitask): per-pod EVM endpoint publishing + load-test Workflow draft

Scaffolding for N=2 scenario migration (load-test) while N=1
(release-test) is still validating end-to-end on harbor. Draft PR —
do not merge until release-test fires fully green.

provision-snd additive change:
- Adds taskruntime.KeyEVMJSONRPCList var key for comma-separated
  per-pod EVM URLs. RoleScoped publishes RPC_EVM_RPC_LIST alongside
  the existing pod-0 RPC_EVM_RPC. Seiload needs all-pod URLs for its
  stateful EVM workload (sei-k8s-controller#244); release-test stays
  on pod-0-only.
- Test extended to assert both keys, fake SND now has 2 nodes.

scenarios/load-test.yaml (Chaos Mesh Workflow):
- Mirrors release-test's structure: keygen omitted (seiload manages
  its own accounts), wait-rpc-caught-up + render-seiload-profile
  added as bash kubectl Tasks specific to this scenario.
- Profile JSON renders via sed substitution against a source
  ConfigMap mounted from the nightly namespace (seiload-profiles).
  Per-run rendered CM stamped with workflow ownerRef → cascade on
  delete.
- Seiload Task mirrors the existing clusters/harbor/nightly/load/
  seiload-job.yaml's container shape (resources, env, ports).

scenarios/load-test/{validator,rpc}.yaml.tmpl:
- Minimal SND templates. No genesis.accounts (seiload doesn't fund
  admin), no rpc overrides (release-test's lag_threshold tightening
  is harness-specific).

Open prereqs for first fire (not in this PR):
- pods/exec verb on the seitask-runner Role for wait-rpc-caught-up
  (kubectl exec seid status). Track in platform-repo wrapper PR.
- Platform-repo wrapper CronJob mirroring clusters/harbor/nightly/
  release/workflow.yaml's shape. Follows same merge order: this PR
  → image build → platform wrapper PR.
- seiload-profiles ConfigMap in nightly namespace (currently the
  profile lives in nightly-load-orchestrator CM alongside bash;
  split when wrapper PR lands).

Build-time scenarios_test extended to include load-test.yaml in
its CM-name validation allowlist — guards against #337-class drift
for the new scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address platform + k8s cross-review on #343

Both reviewers converged on three load-bearing fixes before un-drafting:

1. **pods/exec verb in runner/rbac.yaml**. wait-rpc-caught-up uses
   `kubectl exec seid status` to poll catching_up=false; without this
   verb the Task 403s on first fire. Source-of-truth Role belongs in
   this repo; platform-vendored copy needs a coordinated mirror PR.
2. **--duration interpolation**. Was `--duration=${DURATION_MINUTES}m`
   (envsubst form), which K8s containers do not substitute. Switched
   to `--duration=$(DURATION_MINUTES)m` with a matching env: entry —
   resolves at pod creation via K8s container env interpolation, no
   dependency on wrapper envsubst expanding it.
3. **render-seiload-profile preflight CM check**. Without it, a missing
   `seiload-profiles` source CM hangs the pod in ContainerCreating for
   ~5m before deadline fires. One-line `kubectl get configmap` upfront
   converts silent failure into a labeled exit.

Also folded in:
- SEILOAD_COMMIT_ID env on run-seiload (legacy seiload-job.yaml has it;
  dashboards keyed on this label would drop a dimension on the new
  path). Wrapper parses from $SEID_IMAGE tag and passes via envsubst.
- Dropped dead-weight SEI_WORKFLOW_RUN_ID env on render-seiload-profile
  (already arrives via envFrom workflow-vars).

NOT applying: both reviewers' suggestion to stamp
`serviceAccountName: seitask-runner` on Task containers. Chaos Mesh
v2.8's Task.container schema does not expose serviceAccountName
(documented in project_chaos_mesh_workflow_gaps memory + the comment
block in clusters/harbor/nightly/major-upgrade/rbac.yaml). The
established workaround is the platform#626 RoleBinding that grants
seitask-runner permissions to the namespace `default` SA, which every
Chaos Mesh Task pod runs as. Both reviewers missed this constraint;
the fix is already in place at platform layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: wait-rpc-caught-up via Tendermint /status network call

User suggestion on cross-review: hit Tendermint /status directly
instead of kubectl exec into the RPC pod. Strict improvement:

- Removes pods/exec RBAC requirement entirely — no platform-vendored
  Role update needed for first fire. (Reverts the runner/rbac.yaml
  pods/exec verb added in the prior commit.)
- Drops kubectl from the wait Task; uses curlimages/curl (leaner).
- Hits ${RPC_TM_RPC}/status (aggregate ClusterIP, already published
  by provision-snd to workflow-vars). For a "is anyone caught up"
  check, load-balanced polling gives natural multi-pod coverage as
  iterations land on different backends.
- Uses the same `.result // .` envelope tolerance pattern as
  provision-snd's waitForFirstBlock — Sei CometBFT sometimes returns
  /status unwrapped.

One fewer cross-repo coordination point: this PR no longer requires
platform/clusters/harbor/nightly/major-upgrade/rbac.yaml to mirror a
pods/exec verb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant