SeiNode readiness probe should include catching_up=false on full-node-role nodes #144

@bdchatham

Problem

SeiNodeDeployment.status.phase=Ready (computed at api/v1alpha1/seinodedeployment_types.go:188) is satisfied when child node phase=Running plus ConditionNodesReady, but does not include "all full nodes have caught up" (catching_up=false). The AwaitNodesCaughtUp task exists at internal/task/deployment.go:46-51 and is invoked during hard-fork rollouts, but not during initial bring-up.

For genesis-bootstrap chains the gap is essentially zero — validators produce blocks from height 0. For full-node fleets that block-sync from validators or external state, there is a real gap between phase=Ready (pods running) and "RPC is actually serving caught-up data."

Impact

Primary use case — multi-consumer reliability

This affects three consumers, none of which is currently blocked but all of which would benefit:

Cost of not addressing

Each consumer reimplements its own catch-up probe (kubectl exec seid status | grep catching_up) — exactly the bash-glue pattern the controller boundary exists to absorb. This issue moves the responsibility into the right place once.

Relevant experts

  • kubernetes-specialist — readiness probe contract, sidecar /health endpoint design
  • blockchain-developer — seid status semantics, catching_up signal interpretation

Proposed approach

Per the ValidationRun LLD's recommendation (option b in Open Dependency #1):

Extend the seictl sidecar's HTTP /health endpoint to return 503 while seid status.SyncInfo.catching_up=true. kubelet readiness probe consumes /health; Pod isn't Ready until caught up; SND phase=Ready automatically requires catch-up via the existing ConditionNodesReady chain.
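As a rough illustration of the signal involved, the sketch below decodes a status payload and reports whether the node has caught up. It assumes the JSON shape implied by the `SyncInfo.catching_up` path named above; the `nodeStatus` type and `isCaughtUp` function are hypothetical names, not existing seictl code. Treating a missing field as "not caught up" is also an assumption, chosen so the probe fails closed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeStatus mirrors only the part of the status JSON this sketch needs.
// Assumption: the payload exposes SyncInfo.catching_up as referenced in this issue.
type nodeStatus struct {
	SyncInfo struct {
		// Pointer so an absent field can be told apart from an explicit false.
		CatchingUp *bool `json:"catching_up"`
	} `json:"SyncInfo"`
}

// isCaughtUp (hypothetical helper) returns true only when the payload
// explicitly says catching_up=false. Malformed or incomplete payloads
// return false plus an error, so callers fail closed.
func isCaughtUp(raw []byte) (bool, error) {
	var st nodeStatus
	if err := json.Unmarshal(raw, &st); err != nil {
		return false, fmt.Errorf("decode status: %w", err)
	}
	if st.SyncInfo.CatchingUp == nil {
		return false, fmt.Errorf("status has no SyncInfo.catching_up field")
	}
	return !*st.SyncInfo.CatchingUp, nil
}

func main() {
	ok, err := isCaughtUp([]byte(`{"SyncInfo":{"catching_up":true}}`))
	fmt.Println(ok, err)
}
```

This is the same check each consumer currently does with `grep catching_up`, moved into one place where a parse failure cannot be mistaken for "caught up".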

Concretely:

  • Sidecar /health handler: poll seid status (or its more efficient ABCI-direct equivalent), return 503 if catching_up=true, 200 otherwise.
  • StatefulSet pod template: readinessProbe.httpGet.path=/health (already wired for the sidecar in most deployments — verify).
  • No SND status schema change needed; the existing ConditionNodesReady already aggregates from pod-readiness.

Alternative options considered and rejected per the LLD:

  • (a) Extend SND status with explicit CaughtUp condition: more invasive schema change.
  • (c) ValidationRun separately probes child SeiNodes: pushes responsibility into the wrong controller.

Acceptance criteria

  • Sidecar /health endpoint returns 503 when seid status.SyncInfo.catching_up=true, 200 otherwise
  • Pod readiness probe wired to /health on validator and full-node SND templates
  • Verify SND phase=Ready automatically gates on the new probe via existing ConditionNodesReady
  • Integration test: bootstrap a chain, force a fullnode behind, observe Pod NotReady → SND not Ready, then catch up → Ready
  • Document the contract in the sidecar's README and in SeiNodeDeployment CRD docs

Out of scope

  • Surfacing catching_up as a separate first-class SND status field (handled by the option-b approach via existing readiness aggregation)
  • ValidationRun controller probes (delegated to SND readiness; do not reimplement)
