feat: bump seictl to use S3 transfer manager for parallel snapshot downloads #3

Merged
bdchatham merged 1 commit into main from feat/seictl-transfer-manager on Mar 15, 2026

Conversation

@bdchatham
Collaborator

Summary

  • Bumps seictl dependency to pick up sei-protocol/seictl#27 which integrates the AWS S3 transfer manager for parallel byte-range downloads during snapshot restore
  • Updates defaultSidecarImage in labels.go and sample manifests (pacific-1-snapshotter, pacific-1-replay) to the new container digest (sha256:ad50d546c3aa...)

Test plan

  • Verify controller builds cleanly with updated dependency
  • Deploy to brandon cluster and confirm snapshot restore uses parallel downloads
  • Monitor snapshotter and replay nodes through full state-sync cycle

…wnloads

Update seictl dependency to pick up the transfer manager integration
(sei-protocol/seictl#27) which uses the AWS S3 transfer manager for
parallel byte-range downloads during snapshot restore. Also updates the
default sidecar image and sample manifests to the new container digest.
@bdchatham bdchatham marked this pull request as ready for review March 15, 2026 16:00
@bdchatham bdchatham merged commit da619cb into main Mar 15, 2026
2 checks passed
@bdchatham bdchatham deleted the feat/seictl-transfer-manager branch March 19, 2026 21:00
bdchatham added a commit that referenced this pull request Apr 30, 2026
Updates the LLD's Open Dependencies section to reflect the resolution
of three companion sub-issues:

#3 SND inline genesis rejection — Resolved by #148 (merged); the CEL
   rule lands on SeiNodeDeploymentSpec rejecting genesis on non-
   validator-role deployments. Broader than the original ValidationRun-
   side framing (covers fullNode/archive/replayer; SND has a 4-role
   discriminator).

#4 PodMonitor for sei-k8s-controller — Resolved as already-implemented.
   The upstream config/monitoring/ ships a ServiceMonitor (functionally
   equivalent to PodMonitor) which is pulled in transitively via the
   platform's kustomization. sei-protocol/platform#243 closed-as-already-
   implemented after kubectl kustomize verification. Same pattern as
   Open Dependency #1's lag_status mis-claim.

#6 monitoring/ namespace label — Resolved by sei-protocol/platform#269
   (merged); the cluster's monitoring/ namespace now carries
   sei.io/validation-shared-rules=true.

Open Dependencies #2 (TaskPlan.TargetPhase decision), #5 (heartbeat
PrometheusRule, deferred), #7 (Indexed-Job shard env injection), and
#8 (controller registration opt-in plumbing) remain open.

Refs sei-k8s-controller#143 #145 #144 platform#243 #245

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 21, 2026
…l) (#328)

Pins the operator-visible "stuck rollout" shape when AwaitSpecUpdate
cannot observe child convergence. Establishes the contrast baseline
the forthcoming Paused feature must remain observably distinguishable
from on every status field.

What's asserted (stuck shape, holds for 10s via SatisfyAll +
HaveField under Consistently):

- Status.Rollout != nil, RolloutInProgress=True
- Status.Plan != nil, Phase=Active, ConditionPlanInProgress=True
- Status.Phase = Upgrading
- Status.ReadyReplicas = replicas (existing pods stay healthy)
- ConditionNodesReady = True
- Status.Rollout.StartedAt stable across reconciles (alert-rule
  load-bearing field; a regression that re-stamps it would hide
  stalls from age-based alerting)
- No RolloutComplete event fires during the stall
- Children Spec.Image advanced to v2 (UpdateNodeSpecs ran)
- Children Status.CurrentImage stays at v1 (ObserveImage stalled)

Recovery (Eventually with 60s budget): resuming the StatefulSet
status faker lets the rollout reach terminal state on its own,
proving the stuck state is the stall, not a deeper controller
failure.

Harness: uses the existing StatusFaker.Pause/Resume primitive
introduced in #325. No production code touched.

Coral cross-review applied (k8s-specialist + sre-engineer):

Incorporated:
- Struct-snapshot pattern + SatisfyAll(HaveField(...)) for
  per-field diagnostic messages on Consistently flakes
- Status.Phase == Upgrading (kubectl get snd column #3 — the
  literal thing an on-call types first)
- ReadyReplicas + ConditionNodesReady simultaneously True
  (the diagnostic contradiction pattern operators look for)
- StartedAt stability inside Consistently (load-bearing for any
  future "rollout stuck >1h" alert)
- ConditionPlanInProgress (operators read conditions before
  reading embedded Status.Plan)
- Typed CurrentImage assertion (Equal(v1), not the weak
  NotTo(Equal(v2)) — a regression writing garbage would have
  passed before)
- Recovery uses Eventually with 60s budget, not the default 30s
  pollTimeout — recovery has to drain a 10s faker backlog plus
  5+ reconcile hops, and p99 under loaded CI can spike past 30s

Deferred to follow-ups (out of scope for this PR):
- Controller-still-reconciling liveness probe (harness pattern,
  separate test class)
- Sibling failure-class tests for image-pull, PVC bind, sidecar
  crash (each has a distinct envtest-stub shape)
- PromQL/alert-rule rotation in /Users/brandon/platform/clusters/
  monitoring stacks — file /issue against observability-platform-
  engineer; no SND-keyed PromQL exists today, so a stuck prod
  rollout pages nobody until someone manually checks
- Test rename to TestInPlaceRollout_StuckOnAwaitSpecUpdate
  alongside the sibling failure-class tests

Verifications: 3 consecutive runs at 46.2s / 45.5s / 44.8s wall
(all 5 envtest tests); stuck test individual ~17s. verify-generated
clean.
bdchatham added a commit that referenced this pull request May 21, 2026
Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.
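
The CM-name guard boils down to comparing every referenced ConfigMap name against the name the helper would derive. A minimal sketch of that comparison follows; it skips YAML parsing, and both the `scenario` struct and the `-vars` suffix inside `workflowVarsName` are assumptions for illustration, since the real WorkflowVarsName convention is not shown here.

```go
package main

import "fmt"

// workflowVarsName is a hypothetical stand-in for WorkflowVarsName.
// ASSUMPTION: the "-vars" suffix is invented for this sketch.
func workflowVarsName(workflowName string) string {
	return workflowName + "-vars"
}

// scenario holds the two facts the guard extracts from a scenario YAML:
// the Workflow CR's metadata.name and every envFrom configMapRef.name.
type scenario struct {
	WorkflowName  string
	ConfigMapRefs []string
}

// cmNameDrift returns one message per configMapRef that does not match
// the name derived from the workflow's metadata.name.
func cmNameDrift(s scenario) []string {
	want := workflowVarsName(s.WorkflowName)
	var drift []string
	for _, ref := range s.ConfigMapRefs {
		if ref != want {
			drift = append(drift, fmt.Sprintf("configMapRef %q does not match %q", ref, want))
		}
	}
	return drift
}

func main() {
	s := scenario{
		WorkflowName:  "release-test",
		ConfigMapRefs: []string{"release-test-vars", "workflow-vars"},
	}
	for _, msg := range cmNameDrift(s) {
		fmt.Println(msg)
	}
}
```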

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request May 21, 2026
* fix(seitask): register sei.io scheme + grant workflownodes RBAC

Third manual fire of release-test surfaced two more contract bugs:

1. provision-validator-chain + provision-rpc-fleet failed at Create
   time with `no kind is registered for the type v1alpha1.SeiNodeDeployment
   in scheme`. cmd/seitask/main.go's kubeClientFromEnv built a
   controller-runtime client with client-go's built-in scheme (K8s
   types only); sei.io/v1alpha1 was never registered. Fix: local
   taskScheme registering builtin + sei.io/v1alpha1.

2. upload-report failed listing workflownodes: 403 from runner/rbac.yaml
   granting workflows (get) but not workflownodes. Fix: add workflownodes
   (get, list).
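
To make the first failure mode concrete, the sketch below models a type registry that only knows the kinds registered into it. It is a stdlib-only analogy, not the apimachinery scheme or the actual taskScheme code; `registry`, `Pod`, `SeiNodeDeployment`, and `SeiNodeTask` are empty hypothetical stand-ins, and the error text only imitates the message quoted above.

```go
package main

import (
	"fmt"
	"reflect"
)

// registry maps Go types to kind names, loosely modelling a scheme.
type registry struct{ kinds map[reflect.Type]string }

func newRegistry() *registry {
	return &registry{kinds: map[reflect.Type]string{}}
}

// register records the kind name of every given object's type.
func (r *registry) register(objs ...any) {
	for _, o := range objs {
		t := elemType(o)
		r.kinds[t] = t.Name()
	}
}

// kindFor fails for any type that was never registered.
func (r *registry) kindFor(obj any) (string, error) {
	if obj == nil {
		return "", fmt.Errorf("nil object")
	}
	t := elemType(obj)
	kind, ok := r.kinds[t]
	if !ok {
		return "", fmt.Errorf("no kind is registered for the type %s in scheme", t)
	}
	return kind, nil
}

// elemType strips one level of pointer so *T and T resolve alike.
func elemType(obj any) reflect.Type {
	t := reflect.TypeOf(obj)
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	return t
}

// Hypothetical stand-ins for a built-in type and the sei.io/v1alpha1 types.
type Pod struct{}
type SeiNodeDeployment struct{}
type SeiNodeTask struct{}

func main() {
	builtinOnly := newRegistry()
	builtinOnly.register(Pod{})
	_, err := builtinOnly.kindFor(&SeiNodeDeployment{})
	fmt.Println("builtin-only registry:", err)

	// The fix, in this model: register the custom types as well.
	task := newRegistry()
	task.register(Pod{}, SeiNodeDeployment{}, SeiNodeTask{})
	kind, err := task.kindFor(&SeiNodeDeployment{})
	fmt.Println("task registry:", kind, err)
}
```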

Run-release-test's "Base URL is missing a protocol" was a downstream
symptom — provision-snd never published endpoints to workflow-vars, so
$(RPC_TM_RPC) couldn't resolve and the literal string passed through to
the release-test image. Resolves automatically once item 1 above is fixed.

Separately: Chaos Mesh Serial does NOT fail-fast on child Task errors —
each WorkflowNode transitions to Accomplished=True on pod termination
regardless of exit code, and Serial proceeds to the next child. Filed
as separate follow-up; tracked as architectural concern, not in scope
for this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(seitask): build-time guards for scheme + scenario CM-name contracts

Cross-review feedback from platform-engineer + kubernetes-specialist on
the #339 chain of fixes: both reviewers independently noted that we've
been discovering contract drift between the seitask binary internals
and the scenario YAML / RBAC layer at first-fire instead of at build
time. Each first-fire bug (#334, #337, #339) is the same shape: an
internal helper has a convention, the scenario author has to mirror it
manually, no test catches the drift.

Two narrow guards land here, ranked by ROI:

- TestTaskScheme_RoundTripsSND / _RoundTripsSeiNodeTask: would have
  caught the #339 scheme-registration bug at `go test`. Validates that
  the package-level taskScheme actually has every sei.io/v1alpha1 type
  the seitask subcommands construct via typed Create/Get.
- TestScenarioYAMLs_CMNameMatchesWorkflowVarsName: would have caught
  the #337 CM-name drift at `go test`. Walks every scenario YAML in
  the opt-in allow list (release-test.yaml today), extracts the
  Workflow CR's metadata.name, asserts every envFrom configMapRef.name
  matches WorkflowVarsName(metadataName). Major-upgrade is excluded —
  its CM is bash-created with a different convention; revisit when the
  half-bash legacy retires.

Defers (filed/tracked separately, not in scope for this PR):
- RBAC vs kubebuilder-marker reconciliation test (kubernetes-specialist
  ranked #3; defer until a third recurrence).
- Wrapper SA workflows: [patch] prereq for #340 path 1 (amend on #340).
- EXIT_REASON write-once-or-fail-classification semantics for #340
  (amend on #340).
- Scenario contract enforcement subcommand + SEI_WORKFLOW_VARS_CM env
  approach (file new issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>