Add Druid deployment lifecycle tracking, pipeline status semantics, and rollout metrics #16
razinbouzar wants to merge 1 commit
Thanks for the PR. Do you have a proposal for this? It would make the review easier. Even an AI-generated doc would help. BTW, what do you mean by "pipeline-facing status contract"?
I’ll put together a short proposal to make this easier to review. On “pipeline-facing status contract”: the idea is to expose a stable status surface on the Druid CR that external pipelines can rely on, instead of inferring state from pods, metrics, StatefulSets, or events. The operator would publish rollout state under something like status.deploymentLifecycle, with fields for the trigger, observed and expected revision, phase, reason or message, timestamps, etc. That way a pipeline can watch the CR and know whether a rollout is pending, in progress, succeeded, or failed. The main goal is to make the operator the source of truth for rollout completion, especially for image or manual rollouts, where “pods are ready” does not necessarily mean the cluster is on the expected build.
+1 on this. The current status, tbh, isn't mature and only gives an aggregated view. Looking forward to the proposal. :)
This PR adds deployment lifecycle tracking to the Druid operator, exposes it as a pipeline-facing status contract, and adds rollout metrics.
It introduces spec.forceRedeployToken, spec.expectedBuildRevision, and status.deploymentLifecycle, then wires lifecycle state through reconcile with trigger classification (SpecChange, ImageChange, ManualRollout), phase tracking (Pending, InProgress, Succeeded, Failed), generation/revision semantics for polling, and Kubernetes events for observability. For image and manual rollouts, the operator verifies the live Druid runtime build identifier before completing the lifecycle, using sys.servers.build_revision when available and falling back to sys.servers.version for older Druid versions.
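As a rough illustration of how an external pipeline might consume this contract, here is a minimal Python sketch. It is not code from this PR. The sub-fields observedGeneration, phase, and observedBuildRevision under status.deploymentLifecycle are assumed names for illustration only, and the row passed to pick_build_identifier is assumed to be a plain dict built from a sys.servers query result.

```python
from typing import Any, Dict, Optional

# Phases named in the PR description that end a rollout.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def pick_build_identifier(row: Dict[str, Any]) -> Optional[str]:
    """Prefer build_revision; fall back to version when it is missing or empty."""
    return row.get("build_revision") or row.get("version")


def rollout_verdict(cr: Dict[str, Any]) -> str:
    """Return 'wait', 'succeeded', or 'failed' for a Druid CR given as a dict.

    Field names below status.deploymentLifecycle are hypothetical.
    """
    generation = cr.get("metadata", {}).get("generation")
    lifecycle = cr.get("status", {}).get("deploymentLifecycle") or {}

    # A stale status describes an older spec, so keep polling.
    if lifecycle.get("observedGeneration") != generation:
        return "wait"

    phase = lifecycle.get("phase")
    if phase not in TERMINAL_PHASES:
        return "wait"  # Pending, InProgress, or not yet populated
    if phase == "Failed":
        return "failed"

    # When a build is pinned, require the observed build to match it.
    expected = cr.get("spec", {}).get("expectedBuildRevision")
    if expected and lifecycle.get("observedBuildRevision") != expected:
        return "wait"
    return "succeeded"
```

The generation check comes first so that a pipeline never treats a Succeeded phase left over from a previous spec as completion of the rollout it just triggered.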
The PR also adds a Druid-specific Prometheus metrics surface for cluster and workload rollout health, makes lifecycle metrics per cluster, standardizes labels on namespace, druid_instance, and node_type, and includes test/doc/e2e updates to support the new contract.
Fixes #XXXX.