docs: LLD for ValidationRun CRD (for #139) #143
Captures the council-driven low-level design for ValidationRun — a new
namespaced CRD (group validation.sei.io/v1alpha1) that orchestrates
ephemeral-chain validation workloads on Harbor.
v1 ships:
- ValidationRun with spec.type=LoadTest discriminator
- chain.{validators,fullNodes} embedding SeiNodeDeploymentSpec
- rules: alert + query types, continuous polling with stop-on-failure
- 7-task plan (ensure-chain, wait-chain-ready, resolve-endpoints,
render-config, apply-job, monitor-run, mark-done)
- Conditions[Succeeded, TestComplete]; .status.report.s3Url only
- Tenant-pre-provisioned SAs; same-namespace by construction
- CRDs reserved (no controllers in v1) for ValidationSuite +
ValidationSchedule so v1 doesn't paint future kinds into a corner
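The rule semantics listed above (continuous polling with stop-on-failure) can be sketched in plain Go. This is a minimal illustration, not the controller's code: `Rule`, `Poll`, and `Verdict` are hypothetical names, ticks stand in for the polling interval, and it assumes a workload exit code of 0 means success, combined with the rules as the PR summary describes (workload exit code AND all rules).

```go
package main

import "fmt"

// Rule is a hypothetical stand-in for one alert or query rule.
type Rule struct {
	Name string
	// Eval reports whether the rule currently passes.
	Eval func() bool
}

// Verdict passes only if the workload exited 0 (assumed success code)
// and no rule failed during the run.
func Verdict(exitCode int, failed []string) bool {
	return exitCode == 0 && len(failed) == 0
}

// Poll evaluates every rule once per tick, for at most maxTicks ticks.
// With stopOnFailure set, it returns after the first tick in which any rule failed.
func Poll(rules []Rule, maxTicks int, stopOnFailure bool) (failed []string, ticks int) {
	seen := map[string]bool{}
	for ticks = 1; ticks <= maxTicks; ticks++ {
		for _, r := range rules {
			if !r.Eval() && !seen[r.Name] {
				seen[r.Name] = true
				failed = append(failed, r.Name)
			}
		}
		if stopOnFailure && len(failed) > 0 {
			return failed, ticks
		}
	}
	return failed, maxTicks
}

func main() {
	healthy := Rule{Name: "no-firing-alerts", Eval: func() bool { return true }}
	failed, ticks := Poll([]Rule{healthy}, 5, true)
	fmt.Println("failed rules:", failed, "ticks:", ticks, "pass:", Verdict(0, failed))
}
```

In a real controller the tick loop would be replaced by requeues on the reconcile loop; the sketch only shows how stop-on-failure shortens the run and how the verdict is combined.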
Council session: kubernetes-specialist (primary author), platform-engineer,
product-manager, opentelemetry-expert. /coral mid-council answered the
Argo Workflows pivot question: unanimous Path X (custom CRD); Argo
re-evaluation triggers documented in Future Work.
Council gate closed 2026-04-28 with 18 resolved one-way-door decisions
documented in the LLD's "Resolved one-way-door decisions" section.
For #139.
Companion Phase 1 workstream: sei-protocol/platform#235.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Material revision following continued council session refinement. Spec
shape generalizes; controller architecture splits into rails for
multi-actor expansion.
Schema:
- chain.deployments[] (named map of SeiNodeDeploymentSpec) replaces
chain.{validators, fullNodes}; CEL enforces "exactly one validator
+ ≥1 fullNode role"; child SNDs named {chainId}-{deploymentName}
- drop spec.type discriminator; introduce composable optional blocks
spec.{load, sequence, chaos}; rules-only Runs are valid
- rename spec.loadTest → spec.load
- drop endpointPolicy field; fullNodes-fleet endpoints always used
- IntegrationTest collapses into "load + rules with workload-as-verifier"
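The deployments invariant in the schema list above can be sketched as an ordinary Go check. The LLD enforces it with CEL; this sketch only mirrors the logic. It assumes the rule means exactly one deployment with a validator role and at least one with a fullNode role, and the `Deployment` type, role strings, and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Deployment is a hypothetical reduction of one chain.deployments[] entry.
type Deployment struct {
	Name string
	Role string // "validator" or "fullNode" (assumed role values)
}

// ValidateDeployments mirrors the stated invariant in plain Go:
// exactly one validator deployment and at least one fullNode deployment.
func ValidateDeployments(ds []Deployment) error {
	validators, fullNodes := 0, 0
	for _, d := range ds {
		switch d.Role {
		case "validator":
			validators++
		case "fullNode":
			fullNodes++
		}
	}
	if validators != 1 {
		return errors.New("exactly one validator deployment is required")
	}
	if fullNodes < 1 {
		return errors.New("at least one fullNode deployment is required")
	}
	return nil
}

// ChildName builds the child SND name as {chainId}-{deploymentName}.
func ChildName(chainID, deploymentName string) string {
	return chainID + "-" + deploymentName
}

func main() {
	ds := []Deployment{{Name: "vals", Role: "validator"}, {Name: "rpc", Role: "fullNode"}}
	fmt.Println(ValidateDeployments(ds), ChildName("bench-1", "rpc"))
}
```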
Controller architecture (rails for v2 actors):
- Two sub-controllers in one binary: OrchestrationReconciler (always-on)
and LoadGenerationReconciler (Helm-opt-in, default-on)
- Predicate-gated event delivery on LoadGen (spec.load != nil AND
Conditions[TestRunning]=True); gate-on-creation, no barrier task
- Field-manager isolation per controller; single-writer condition table
- .status.plans.{orchestration, loadGeneration} replaces .status.plan
- OrchestrationPlan + LoadGenerationPlan; v2 actors slot in additively
- Helm chart controllers.<name>.enabled opt-in
- Cancellation via Conditions[TestCancelled] (cooperative halt)
- rename monitor-run → monitor-task-completion (gateway role)
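The LoadGen gate described above (spec.load != nil AND Conditions[TestRunning]=True) reduces to a small boolean predicate. The sketch below uses hypothetical stand-in types rather than the real API types or controller-runtime predicates, and assumes condition status is the string "True" as in standard Kubernetes conditions.

```go
package main

import "fmt"

// LoadSpec is a hypothetical placeholder for the spec.load block.
type LoadSpec struct{}

// Condition is a reduced stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// ValidationRun carries only the fields the gate inspects.
type ValidationRun struct {
	Load       *LoadSpec
	Conditions []Condition
}

// conditionTrue reports whether the named condition exists with Status "True".
func conditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// LoadGenShouldReconcile is the gate: events reach the load-generation
// controller only when a load block exists and the test is running.
func LoadGenShouldReconcile(run ValidationRun) bool {
	return run.Load != nil && conditionTrue(run.Conditions, "TestRunning")
}

func main() {
	run := ValidationRun{
		Load:       &LoadSpec{},
		Conditions: []Condition{{Type: "TestRunning", Status: "True"}},
	}
	fmt.Println(LoadGenShouldReconcile(run))
}
```

Because rules-only Runs are valid, a Run with no load block never passes the gate, which is what lets the load-generation controller stay idle without a barrier task.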
Coral findings incorporated:
- pods/exec REJECTED for v2 sequence plan (one-way door); txs submit
via short-lived Jobs against RPC service
- chaos plan namespace-label gate sei.io/chaos-allowed=true +
--enable-chaos-plan compile-time flag for v2
- TaskPlan.TargetPhase decoupling: v1 controllers ignore the field
and own phase transitions in finalize task
- .status.failedPlan pointer reserved for v2 multi-actor debugging
- Implementation invariants: monitor-task-completion uses RequeueAfter
for transient errors (never TerminalError); status patches use
optimistic concurrency
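The first implementation invariant above (transient errors lead to a requeue, never a terminal error) can be illustrated with a small classifier. This is a sketch under assumptions: `Result` only imitates the shape of a reconcile result, `ErrTransient` is a hypothetical sentinel, and how non-transient errors are handled is not stated in the source, so the sketch simply returns them to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient is a hypothetical sentinel marking retryable failures
// (for example, a Prometheus query timeout).
var ErrTransient = errors.New("transient")

// Result imitates the shape of a reconcile result; it is not the
// controller-runtime type.
type Result struct {
	RequeueAfter time.Duration
}

// HandlePollError maps a polling error to a reconcile outcome.
// Transient errors are swallowed and turned into a delayed requeue,
// so they can never surface as a terminal failure of the task.
func HandlePollError(err error, retry time.Duration) (Result, error) {
	switch {
	case err == nil:
		return Result{}, nil
	case errors.Is(err, ErrTransient):
		return Result{RequeueAfter: retry}, nil
	default:
		// Assumption: anything else is surfaced to the caller unchanged.
		return Result{}, err
	}
}

func main() {
	wrapped := fmt.Errorf("query failed: %w", ErrTransient)
	res, err := HandlePollError(wrapped, 10*time.Second)
	fmt.Println(res.RequeueAfter, err)
}
```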
Resolved one-way-door table: 18 → 26 rows
Status block updated: architectural refinement noted 2026-04-29
For #139.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revision pushed (commit ef3de08)
Material LLD revision following continued council session refinement after gate close. Spec shape generalizes; controller architecture introduces sub-controller rails for v2 actor expansion.
Schema changes
Controller architecture
Two sub-controllers in the same binary, registered via Helm-opt-in:
Coordination is conditions-as-IPC with field-manager isolation:
Single-writer table in the LLD documents which controller owns each condition + status field. v2 expansion is purely additive — Sequence, Chaos, etc. each land as new sub-controllers, predicate-gated identically, no refactor of v1 code.
Architectural decisions baked in (Coral-validated)
Resolved one-way-door decisions table
Grew from 18 to 26 rows. New entries cover composable-blocks shift, deployments[] generalization, two-controller architecture, predicate-gated event delivery, field-manager isolation, TargetPhase decoupling, Helm opt-in, pods/exec rejection, chaos namespace gate, and
Tensions worth explicit reviewer attention
Stat: 1369 lines diff (819 insertions, 550 deletions). 1610 total lines, ~12,400 words.
Two small refinements per user feedback:
Naming consistency — add Validation prefix to all validation
reconciler types so they're distinguishable from the always-on
SeiNode/SeiNodeDeployment defaults:
- OrchestrationReconciler → ValidationOrchestrationReconciler
- LoadGenerationReconciler → ValidationLoadGenerationReconciler
- SequenceReconciler → ValidationSequenceReconciler (v2)
- ChaosReconciler → ValidationChaosReconciler (v2)
- OrchestrationPlan → ValidationOrchestrationPlan
- LoadGenerationPlan → ValidationLoadGenerationPlan
Reframe opt-in deployment — pull back Helm-specific framing since
the chart isn't shipping in v1:
- Section retitled "Controller registration and opt-in deployment"
(was "Helm chart and controller registration")
- Drop specific values.yaml YAML; document mechanism-agnostic Go
sketch instead
- Frame the validation slice as opt-in at deployment time;
SeiNode/SND controllers remain the always-on default
- Document that node operators running production validators
don't need the validation machinery
- Specific opt-in mechanism (values flags, build tags, env vars,
separate Deployment manifests) deferred to implementation
- One-way-door table row 23 updated; future work entries +
open dependency 8 reframed accordingly
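One way to picture the mechanism-agnostic opt-in described above is a registry where the SeiNode/SND controllers are always registered and validation controllers are registered only when opted in. This is a sketch, not the LLD's own Go sketch: the `Registration` type, the `Select` function, and the opt-in map are hypothetical, and the source deliberately leaves the actual mechanism (values flags, build tags, env vars, separate manifests) to implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// Registration describes one controller that the binary could start.
type Registration struct {
	Name     string
	AlwaysOn bool // true for the default SeiNode/SND controllers
}

// Registry lists the controllers named in this PR. Which ones are
// always-on follows the refinement notes; the list itself is illustrative.
func Registry() []Registration {
	return []Registration{
		{Name: "SeiNode", AlwaysOn: true},
		{Name: "SeiNodeDeployment", AlwaysOn: true},
		{Name: "ValidationOrchestrationReconciler"},
		{Name: "ValidationLoadGenerationReconciler"},
	}
}

// Select returns the sorted names of controllers to start: every
// always-on controller plus any controller present in optIn.
func Select(regs []Registration, optIn map[string]bool) []string {
	var out []string
	for _, r := range regs {
		if r.AlwaysOn || optIn[r.Name] {
			out = append(out, r.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// A node operator running production validators opts into nothing.
	fmt.Println(Select(Registry(), nil))
}
```

The opt-in map could be populated from any deployment-time source; the sketch only fixes the rule that the validation slice is absent unless requested.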
For #139.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refinement pushed (commit df7e0f8)
Two small refinements per user direction:
1. Naming consistency. All validation reconciler types now carry the
2. Opt-in deployment, mechanism-agnostic. The prior draft prescribed a Helm-chart-specific values.yaml shape. Since the chart isn't shipping in v1, that framing was premature. Refinements:
Stat: 248 lines diff (113 insertions, 135 deletions). 1588 total lines (down from 1610).
Companion sub-issues filed
The seven Open Dependencies surfaced in the LLD's "Open dependencies" section are now tracked as separate workstreams. Implementation of the ValidationRun controllers themselves (this PR) is independent of these: they're cross-cutting hygiene + platform-side work that benefits this PR plus other consumers.
sei-k8s-controller:
sei-protocol/platform:
None of these block this PR's review. They land independently and converge with the controller implementation when it ships. Implementation handoff issues for the controllers (types + deepcopy, ValidationOrchestrationReconciler, ValidationLoadGenerationReconciler, etc.) will be filed once the LLD direction here is locked by review feedback.
Superseded: pivot to CLI substrate (sei-protocol/seictl#96)
Following implementation start, the value question surfaced: what does putting validation orchestration into a CRD actually buy? Honest answer after re-evaluation: GitOps applicability + resilience to controller restart mid-run. Both are deliverable from a CLI substrate. Everything else in this LLD's shape (declarative desired state, edge-triggered reconciliation, status partitioning, two cooperating reconcilers, phase machine, condition machine) pays for semantics that test orchestration doesn't actually need: tests are imperative, time-bounded, and one-shot. The replacement design is captured at sei-protocol/seictl#96 (`docs/design/validation-substrate.md`). Highlights:
What survives from this LLD
What dies
Process notes for future work
The new design ran a tight coral round (platform-engineer / product-manager / product-engineer in parallel) rather than full council. The reasoning: this is one CLI's surface, not multi-component cross-cutting work. The LLD here was correctly council-tier because it spanned controller types + planner internals + observability contract; the CLI replacement is component-tier and lighter in surface. The merged 1588 lines of design here aren't lost: the problem statement, OSS survey, and resolved gate decisions remain useful reference material when individual primitives or composites land. This LLD becomes the "why we considered this and walked away" artifact.
…#96)
## Summary
Captures the pivot away from the merged ValidationRun CRD LLD (sei-protocol/sei-k8s-controller#143) toward a CLI-substrate model: seictl primitives (chain/rpc/load/harness/rules) composed by sugar verbs (bench/qa/shadow). The runtime workload contract from sei-protocol/platform#235 is kept verbatim and already implemented in `bench up`.
## Key decisions captured
- **Replace CRD with CLI primitives.** Test orchestration is imperative + time-bounded; the CRD's phase machine + condition machine + two-controller dance was paying for declarative-desired-state semantics that tests don't need.
- **v1 ships effectively zero new code.** Today's `bench up` covers the seiload-nightly use case (the LLD's primary Phase 1 consumer). Primitives land on demand with named triggers, not speculatively.
- **Single binary, two install paths.** Standalone `seictl` AND kubectl plugin via `kubectl-sei` symlink: one parser, one help tree, zero code change.
- **Label-driven cascade-delete, not OwnerRefs across primitives.** Cross-primitive coupling is rejected in favor of `sei.io/chain-id` selectors.
- **`rules watch` is a Job, not a controller**, deferred until a real engineer hits a "passed-but-validators-OOM" signal.
## Anti-features (deliberate)
The doc explicitly enumerates what the LLD's gravitational pull would tempt us to build:
- Unified `validation.sei.io/v1` YAML schema
- Generic `harness` substrate
- Symmetric verb sets for symmetry's sake
- Observability-as-test-oracle in the CLI
- Per-verb kubectl plugin symlinks
## Process
Coral round dispatched three specialists in parallel: platform-engineer (substrate), product-manager (scope discipline), product-engineer (cross-surface ergonomics). Outputs synthesized inline. The PM's "v1 ships nothing new" stance won on scope; the platform-engineer's label contract + peer-discovery mechanism won on substrate; the product-engineer's MCP composite-as-tool / kubectl-sei prefix won on distribution.
## Test plan
- [ ] Skim the doc for tone consistency with existing `docs/design/cluster-cli.md`
- [ ] Confirm the v1 ship cut table matches what's actually shipped today (`bench up/down/list` only)
- [ ] Confirm anti-features list reflects coral synthesis (not random YAGNI)
- [ ] Comment thread on sei-protocol/sei-k8s-controller#143 documenting the supersession (will fire after this lands)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
) Previous DefaultSidecarImage (sha256:f3ed1297..., set in PR #168) predates seictl PR #143, which bumped to sei-config v0.0.13. As a result the rendered /sei/config/app.toml on archive nodes has no [receipt-store] section: seid uses the upstream default keep-recent=100000 and prunes historical receipts on first boot. This makes BYOV archive nodes with pre-populated receipt data (e.g. pacific-1-archive-0) effectively unusable: the data on disk is fine, but seid prunes it within minutes of starting.
Bump to a6e00256..., the current ghcr.io/sei-protocol/seictl:latest multi-arch index, built from seictl main, which has sei-config v0.0.13 ([receipt-store] archive-mode override) and the most recent seictl client features through v0.0.48. Pinning by digest (not :latest) keeps deploys deterministic.
Note: a separate chain-id rendering issue surfaced during the pacific-1-archive-0 attempt. config.toml is missing chain-id entirely, causing seid to panic on genesis/config mismatch. This bump may or may not address that; chain-id rendering needs to be investigated in sei-config legacy.go regardless.
After this lands: build new controller image, bump controller image tag in platform repo so the running controller picks up the new DefaultSidecarImage.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…201) PR #199 bumped DefaultSidecarImage to sha256:a6e00256... (the seictl:latest tag), but that tag is stale and predates two critical changes:
- /v0/livez handler (commit a595641, present in v0.0.31+): kubelet's livenessProbe on /v0/livez gets 404 → restarts sei-sidecar in a CrashLoopBackOff
- sei-config v0.0.13 [receipt-store] archive override (seictl PR #143)
The seictl Containerize workflow does not push a `latest` tag; the metadata-action config only emits semver, branch (`main`), and SHA tags. The `:latest` tag in the registry is from some other publishing mechanism that hasn't been updated in a while.
Bump to sha256:d3ecb1a0..., the index digest for ghcr.io/sei-protocol/seictl:main / :sha-d829dcf... built by the latest Containerize run on commit d829dcf (chore: bump version to v0.0.48).
Confirmed:
- go.mod: github.com/sei-protocol/sei-config v0.0.13
- server.go: registers GET /v0/livez handler
Verified live in pod:
- GET /v0/healthz → 503 Service Unavailable (handler exists)
- GET /v0/livez → 404 Not Found (handler MISSING; confirms a6e00256 predates a595641)
- GET /v0/status → 200 OK (handler exists)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Captures the council-driven low-level design for ValidationRun, a new namespaced CRD (validation.sei.io/v1alpha1) that orchestrates ephemeral-chain validation workloads on Harbor.
- ValidationRun (implemented), ValidationSuite and ValidationSchedule (CRDs ship with v1, no reconcilers; reserved so future kinds don't paint v1 into a corner).
- spec.type: LoadTest (v1) / SequenceTest / IntegrationTest (future). Avoids the Testkube TestWorkflow consolidation lesson.
- spec.chain.{validators,fullNodes} embeds SeiNodeDeploymentSpec directly (both required); the controller materializes both fleets as owned children with chainId/role-discriminator/peer-selector injection. Cascade delete via OwnerReferences.
- alert + query against Prometheus, continuously polled per runProperties.interval (default 30s). One Prometheus client serves both types; alert rules use the synthetic ALERTS series. Observability-as-test-oracle: verdict = workload_exit_code AND ⋀rules.
- monitor-run plan task is the central refinement: a single async polling loop combining workload-monitoring + per-rule evaluation in each iteration. Conditions[TestComplete] is the workload-completion boundary; runProperties.stopOnFailure enables cooperative Job cancel on rule trip.
- alert.ruleRef into label-allowlisted monitoring namespaces). Tenants pre-provision ServiceAccounts; the controller is never an IAM controller.
- 0/1/2), ${RESULT_DIR} artifacts, .status.report.s3Url only (no termination-message echo into status).
- internal/planner/: same Build → Persist → Execute → Complete/Fail lifecycle, single-patch model, condition ownership on planner.
Council process
This LLD is the artifact of a /council Component-tier design pass. Process documented in the local checkpoint (not committed; lives at .council/workstream.yaml).
- validation.sei.io, discriminator spec.type, embed SeiNodeDeploymentSpec, hard-reject reserved env conflicts via CEL, drop mode field (continuous-only), drop .status.report.raw (S3 is authoritative), spec immutability via CEL, Tekton-style Succeeded + TestComplete conditions, Failed vs Error distinct phases.
Test plan
- kubernetes-specialist, platform-engineer, product-manager, opentelemetry-expert (council reviewers); resolve any MISMATCH/MISSING findings before merge
- seiload nightly + qa-testing on Harbor) without per-workload special cases
- spec.loadTest.workload envelope
- catching_up via sidecar /health probe; monitoring/ namespace gets sei.io/validation-shared-rules=true label; PodMonitor for sei-k8s-controller itself; heartbeat PrometheusRule)
Out of scope for this PR
- ValidationSuite and ValidationSchedule reconcilers (CRDs ship with v1; controllers are deferred)
- SequenceTest and IntegrationTest discriminator bodies (designed-as-extension-points, not implemented)
References
- docs/design/composable-genesis.md (matched here)
🤖 Generated with Claude Code