Problem
SeiNodeDeployment.status.phase=Ready (computed at api/v1alpha1/seinodedeployment_types.go:188) is satisfied when child node phase=Running plus ConditionNodesReady, but does not include "all full nodes have caught up" (catching_up=false). The AwaitNodesCaughtUp task exists at internal/task/deployment.go:46-51 and is invoked during hard-fork rollouts, but not during initial bring-up.
For genesis-bootstrap chains the gap is essentially zero — validators produce blocks from height 0. For full-node fleets that block-sync from validators or external state, there is a real gap between phase=Ready (pods running) and "RPC is actually serving caught-up data."
Impact
Primary use case — multi-consumer reliability
This affects three consumers, none of which is currently blocked but all of which would benefit:
Cost of not addressing
Each consumer reimplements its own catch-up probe (kubectl exec seid status | grep catching_up) — exactly the bash-glue pattern the controller boundary exists to absorb. This issue moves the responsibility into the right place once.
Relevant experts
kubernetes-specialist — readiness probe contract, sidecar /health endpoint design
blockchain-developer — seid status semantics, catching_up signal interpretation
Proposed approach
Per the ValidationRun LLD's recommendation (option b in Open Dependency #1):
Extend the seictl sidecar's HTTP /health endpoint to return 503 while seid status.SyncInfo.catching_up=true. kubelet readiness probe consumes /health; Pod isn't Ready until caught up; SND phase=Ready automatically requires catch-up via the existing ConditionNodesReady chain.
Concretely:
- Sidecar /health handler: poll seid status (or its more efficient ABCI-direct equivalent), return 503 if catching_up=true, 200 otherwise.
- StatefulSet pod template: readinessProbe.httpGet.path=/health (already wired for the sidecar in most deployments; verify).
- No SND status schema change needed; the existing ConditionNodesReady already aggregates from pod-readiness.
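As a rough illustration of the handler behaviour described in the list above, here is a minimal Go sketch. It is not the seictl implementation. It assumes the sidecar can reach the node's Tendermint-style RPC /status endpoint over HTTP and that the reply carries result.sync_info.catching_up; both should be verified against seid. The names healthHandler, statusResponse, and probe are hypothetical, and the in-process fake node exists only so the sketch runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusResponse models only the field this sketch needs from the node's
// /status reply. The JSON shape is an assumption; verify it against seid.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			CatchingUp bool `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

// healthHandler (hypothetical name) returns 200 only when the node at
// statusURL reports catching_up=false. Any failure to fetch or decode the
// status is treated as "not ready" and also yields 503.
func healthHandler(statusURL string) http.HandlerFunc {
	client := &http.Client{Timeout: 2 * time.Second}
	return func(w http.ResponseWriter, r *http.Request) {
		resp, err := client.Get(statusURL)
		if err != nil {
			http.Error(w, "node status unreachable", http.StatusServiceUnavailable)
			return
		}
		defer resp.Body.Close()

		var st statusResponse
		if resp.StatusCode != http.StatusOK || json.NewDecoder(resp.Body).Decode(&st) != nil {
			http.Error(w, "node status unreadable", http.StatusServiceUnavailable)
			return
		}
		if st.Result.SyncInfo.CatchingUp {
			http.Error(w, "catching up", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// probe runs the handler against a fake in-process node and returns the
// status code a kubelet readiness probe would observe.
func probe(catchingUp bool) int {
	node := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, `{"result":{"sync_info":{"catching_up":%t}}}`, catchingUp)
	}))
	defer node.Close()

	rec := httptest.NewRecorder()
	healthHandler(node.URL+"/status")(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(true), probe(false)) // prints "503 200"
}
```

The sketch fails closed: if the node cannot be queried, the pod stays un-Ready rather than being reported healthy. Whether the real sidecar should behave that way during node restarts is a design choice this issue leaves open.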
Alternative options considered and rejected per the LLD:
- (a) Extend SND status with explicit CaughtUp condition: more invasive schema change.
- (c) ValidationRun separately probes child SeiNodes: pushes responsibility into the wrong controller.
Acceptance criteria
- Sidecar /health endpoint returns 503 when seid status.SyncInfo.catching_up=true, 200 otherwise
- Readiness probe wired to /health on validator and full-node SND templates
- SND phase=Ready automatically gates on the new probe via existing ConditionNodesReady
- Behavior documented in the SeiNodeDeployment CRD docs
Out of scope
- Surfacing catching_up as a separate first-class SND status field (handled by the option-b approach via existing readiness aggregation)
- ValidationRun controller probes (delegated to SND readiness; do not reimplement)
References
- api/v1alpha1/seinodedeployment_types.go:188: existing phase=Ready computation
- internal/task/deployment.go:46-51: existing AwaitNodesCaughtUp task (hard-fork rollout path)
- sei-protocol/sei-k8s-controller#139: design ask referencing this gap