Add Druid deployment lifecycle tracking, pipeline status semantics, and rollout metrics #16

Open
razinbouzar wants to merge 1 commit into apache:master from razinbouzar:deployment-signals

@razinbouzar
Contributor

This PR adds deployment lifecycle tracking to the Druid operator, exposes it as a pipeline-facing status contract, and adds rollout metrics.

It introduces spec.forceRedeployToken, spec.expectedBuildRevision, and status.deploymentLifecycle, then wires lifecycle state through reconcile. The lifecycle carries trigger classification (SpecChange, ImageChange, ManualRollout), phase tracking (Pending, InProgress, Succeeded, Failed), and generation/revision semantics for polling, and the operator emits Kubernetes events for observability. For image and manual rollouts, the operator verifies the live Druid runtime build identifier before completing the lifecycle, using sys.servers.build_revision when available and falling back to sys.servers.version for older Druid versions.
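To make the trigger and phase vocabulary concrete, here is a minimal sketch of how the classification could look. The trigger and phase names come from the description above; the `rolloutInputs` struct, the `SpecHash` field, the function names, and the precedence order (manual token first, then image, then any other spec change) are assumptions for illustration, not necessarily what the PR implements.

```go
package main

import "fmt"

// Trigger values as named in the PR description.
type Trigger string

const (
	TriggerSpecChange    Trigger = "SpecChange"
	TriggerImageChange   Trigger = "ImageChange"
	TriggerManualRollout Trigger = "ManualRollout"
)

// Phase values as named in the PR description.
type Phase string

const (
	PhasePending    Phase = "Pending"
	PhaseInProgress Phase = "InProgress"
	PhaseSucceeded  Phase = "Succeeded"
	PhaseFailed     Phase = "Failed"
)

// rolloutInputs is a hypothetical snapshot of the spec fields that
// matter for classification. SpecHash stands in for "everything else
// in the spec" and is an assumption of this sketch.
type rolloutInputs struct {
	Image              string
	ForceRedeployToken string // mirrors spec.forceRedeployToken
	SpecHash           string
}

// classifyTrigger compares the previous and next inputs and reports
// which trigger applies, plus whether anything changed at all.
// The precedence order is an assumption.
func classifyTrigger(prev, next rolloutInputs) (Trigger, bool) {
	switch {
	case prev.ForceRedeployToken != next.ForceRedeployToken:
		return TriggerManualRollout, true
	case prev.Image != next.Image:
		return TriggerImageChange, true
	case prev.SpecHash != next.SpecHash:
		return TriggerSpecChange, true
	default:
		return "", false
	}
}

// needsBuildVerification reflects the statement that only image and
// manual rollouts verify the live build identifier before completing.
func needsBuildVerification(t Trigger) bool {
	return t == TriggerImageChange || t == TriggerManualRollout
}

func main() {
	// Example values only; the image tags are placeholders.
	prev := rolloutInputs{Image: "apache/druid:30.0.0", ForceRedeployToken: "t1", SpecHash: "abc"}
	next := prev
	next.Image = "apache/druid:31.0.0"

	trigger, changed := classifyTrigger(prev, next)
	fmt.Println(trigger, changed, needsBuildVerification(trigger), PhasePending)
}
```

Keeping the classification a pure function of two snapshots makes it easy to unit test apart from the reconcile loop.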

The PR also adds a Druid-specific Prometheus metrics surface for cluster and workload rollout health, makes lifecycle metrics per cluster, standardizes labels on namespace, druid_instance, and node_type, and includes test/doc/e2e updates to support the new contract.
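As an illustration of the standardized label set, the sketch below renders one sample in the Prometheus text exposition format using only the standard library. The three label names come from the description above; the metric name `druid_rollout_in_progress`, the type, and the helper names are hypothetical. The operator itself would more likely register metrics through a Prometheus client library than format lines by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// rolloutLabels holds the three standardized labels named in the PR.
type rolloutLabels struct {
	Namespace     string
	DruidInstance string
	NodeType      string
}

// escapeLabelValue escapes backslash, double quote, and newline, as
// the Prometheus text exposition format requires for label values.
func escapeLabelValue(v string) string {
	return strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`).Replace(v)
}

// formatSample renders a single sample line for the given metric name.
func formatSample(name string, l rolloutLabels, value float64) string {
	return fmt.Sprintf(`%s{namespace="%s",druid_instance="%s",node_type="%s"} %g`,
		name,
		escapeLabelValue(l.Namespace),
		escapeLabelValue(l.DruidInstance),
		escapeLabelValue(l.NodeType),
		value,
	)
}

func main() {
	// Hypothetical metric name and example label values.
	labels := rolloutLabels{Namespace: "prod", DruidInstance: "cluster-a", NodeType: "historical"}
	fmt.Println(formatSample("druid_rollout_in_progress", labels, 1))
}
```

Using the same three labels on every series lets dashboards join cluster-level and workload-level metrics without relabeling.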

Fixes #XXXX.

Description


This PR has:

  • been tested on a real K8S cluster to ensure creation of a brand new Druid cluster works.
  • been tested for backward compatibility on a real K8S cluster by applying the changes introduced here on an existing Druid cluster. If there are any backward incompatible changes then they have been noted in the PR description.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Key changed/added files in this PR
  • MyFoo
  • OurBar
  • TheirBaz

@AdheipSingh
Member

Thanks for the PR. Do you have any proposal for this? It will be easier to review. Even an AI generated doc would help.

BTW what do you mean by pipeline-facing status contract?

@razinbouzar
Contributor Author

> Thanks for the PR. Do you have any proposal for this? It will be easier to review. Even an AI generated doc would help.
>
> BTW what do you mean by pipeline-facing status contract?

I’ll put together a short proposal to make this easier to review.

On “pipeline-facing status contract,” the idea is to expose a stable status surface on the Druid CR that external pipelines can rely on, instead of inferring state from pod state, metrics, StatefulSets, or events.

The operator would publish rollout state under something like status.deploymentLifecycle with fields for trigger, observed and expected revision, phase, reason or message, timestamps, etc. That way a pipeline can watch the CR and know if a rollout is pending, in progress, succeeded, or failed.
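A pipeline-side check against that status could look roughly like the following. This is only a sketch: `metadata.generation` is standard Kubernetes, but the JSON field names under `status.deploymentLifecycle` (`observedGeneration`, `expectedRevision`, and so on) and the `evaluate` function are assumptions, since the final schema is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deploymentLifecycle mirrors the fields described above. The JSON
// names are assumptions, not the confirmed CRD schema.
type deploymentLifecycle struct {
	Trigger            string `json:"trigger"`
	Phase              string `json:"phase"`
	ObservedGeneration int64  `json:"observedGeneration"`
	ExpectedRevision   string `json:"expectedRevision"`
	ObservedRevision   string `json:"observedRevision"`
	Reason             string `json:"reason"`
	Message            string `json:"message"`
}

// druidCR decodes only the parts of the Druid CR this check needs.
type druidCR struct {
	Metadata struct {
		Generation int64 `json:"generation"`
	} `json:"metadata"`
	Status struct {
		DeploymentLifecycle *deploymentLifecycle `json:"deploymentLifecycle"`
	} `json:"status"`
}

type verdict string

const (
	verdictWait   verdict = "wait"
	verdictDone   verdict = "done"
	verdictFailed verdict = "failed"
)

// evaluate decides what a pipeline should do with one CR snapshot.
// It waits while the lifecycle is missing or has not yet caught up
// with metadata.generation, so a stale Succeeded is never trusted.
func evaluate(raw []byte) (verdict, error) {
	var cr druidCR
	if err := json.Unmarshal(raw, &cr); err != nil {
		return verdictWait, fmt.Errorf("decode Druid CR: %w", err)
	}
	lc := cr.Status.DeploymentLifecycle
	if lc == nil || lc.ObservedGeneration < cr.Metadata.Generation {
		return verdictWait, nil
	}
	switch lc.Phase {
	case "Succeeded":
		return verdictDone, nil
	case "Failed":
		return verdictFailed, nil
	default: // Pending, InProgress, or anything unrecognized
		return verdictWait, nil
	}
}

func main() {
	snapshot := []byte(`{"metadata":{"generation":4},"status":{"deploymentLifecycle":{"phase":"InProgress","observedGeneration":4}}}`)
	v, err := evaluate(snapshot)
	fmt.Println(v, err)
}
```

The generation comparison is what lets a pipeline poll safely right after applying a change, before the operator has reconciled it.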

The main goal is to make the operator the source of truth for rollout completion, especially for image or manual rollouts where “pods are ready” does not necessarily mean the cluster is on the expected build.
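The gap between "pods are ready" and "cluster is on the expected build" can be sketched as a check over rows read from `sys.servers`. The column fallback (`build_revision`, then `version`) follows the PR description; the `serverRow` type, the function names, and the choice to treat an empty result as unverified are assumptions of this example, and the query itself is left out.

```go
package main

import "fmt"

// serverRow is a hypothetical projection of one sys.servers row.
// BuildRevision is empty when the column is not available, as on
// older Druid versions.
type serverRow struct {
	Server        string
	BuildRevision string
	Version       string
}

// buildIdentifier prefers build_revision and falls back to version.
func buildIdentifier(r serverRow) string {
	if r.BuildRevision != "" {
		return r.BuildRevision
	}
	return r.Version
}

// laggingServers returns the servers whose build identifier does not
// match the expected one.
func laggingServers(rows []serverRow, expected string) []string {
	var lagging []string
	for _, r := range rows {
		if buildIdentifier(r) != expected {
			lagging = append(lagging, r.Server)
		}
	}
	return lagging
}

// buildVerified reports whether every known server runs the expected
// build. An empty row set counts as unverified (an assumption), so a
// failed or empty query can never complete a rollout.
func buildVerified(rows []serverRow, expected string) bool {
	return len(rows) > 0 && len(laggingServers(rows, expected)) == 0
}

func main() {
	// Example rows only; the identifiers are placeholders.
	rows := []serverRow{
		{Server: "broker-0:8082", BuildRevision: "abc123", Version: "31.0.0"},
		{Server: "historical-0:8083", BuildRevision: "def456", Version: "30.0.0"},
	}
	fmt.Println(buildVerified(rows, "abc123"), laggingServers(rows, "abc123"))
}
```

Returning the lagging servers, not just a boolean, gives the operator something specific to put in the lifecycle message while the rollout is still in progress.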

@AdheipSingh
Member

+1 on this. The current status tbh isn't mature and gives an aggregated status. Looking forward to the proposal. :)
