
fix: deterministic rollout order for multiple nodes per NodeType #15

Closed

aruraghuwanshi wants to merge 2 commits into apache:master from aruraghuwanshi:deterministic-rollout-ordering


Conversation

@aruraghuwanshi aruraghuwanshi commented Apr 23, 2026

Summary

getNodeSpecsByOrder groups node specs by NodeType, but previously appended them in whatever order for key, nodeSpec := range m.Spec.Nodes produced. In Go, map iteration order is not stable, so the relative order of multiple specs with the same NodeType could change between reconciles.

With rollingDeploy: true, the handler walks that ordered list and may return early while a workload is still rolling. If the intra–NodeType order flips between calls, the operator can effectively start or advance rollouts for more than one StatefulSet/Deployment of the same NodeType at a time, instead of finishing one before the next.

This PR sorts specs by their map key (ascending) within each NodeType before building the final list, while keeping the existing cross–NodeType order from druidServicesOrder unchanged.
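
In a self-contained sketch (type names and the druidServicesOrder values below are simplified stand-ins for the operator's actual code), the change looks like this:

package main

import (
    "fmt"
    "sort"
)

// Minimal stand-ins for the operator's types; field names follow the PR text.
type nodeSpec struct {
    NodeType string
}

type serviceGroup struct {
    key  string
    spec nodeSpec
}

// Cross-NodeType order kept from druidServicesOrder (values here are illustrative).
var druidServicesOrder = []string{"historical", "middleManager", "broker", "coordinator", "router", "overlord"}

// getNodeSpecsByOrder groups specs by NodeType in druidServicesOrder, then sorts
// each per-NodeType group by its map key so the intra-NodeType order no longer
// depends on Go's randomized map iteration.
func getNodeSpecsByOrder(nodes map[string]nodeSpec) []*serviceGroup {
    grouped := map[string][]*serviceGroup{}
    for key, spec := range nodes {
        grouped[spec.NodeType] = append(grouped[spec.NodeType], &serviceGroup{key: key, spec: spec})
    }
    var ordered []*serviceGroup
    for _, nodeType := range druidServicesOrder {
        group := grouped[nodeType]
        sort.Slice(group, func(i, j int) bool { return group[i].key < group[j].key })
        ordered = append(ordered, group...)
    }
    return ordered
}

func main() {
    nodes := map[string]nodeSpec{
        "historicalstier2": {NodeType: "historical"},
        "historicalstier1": {NodeType: "historical"},
        "brokers":          {NodeType: "broker"},
    }
    // Prints historicalstier1, historicalstier2, brokers on every run.
    for _, sg := range getNodeSpecsByOrder(nodes) {
        fmt.Println(sg.key)
    }
}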


What changed

  • getNodeSpecsByOrder (controllers/druid/ordering.go): sort.Slice on each per–NodeType slice by ServiceGroup.key.
  • Unit tests (controllers/druid/ordering_test.go): Ginkgo test data uses multiple historical tiers (historicalstier1-3); added testing.T tests that would fail on the pre-fix map-only ordering and pass with stable sorting, plus a guard for cross–NodeType order.
  • E2E (e2e/configs/druid-rolling-deploy-cr.yaml, e2e/test-rolling-deploy-ordering.sh, wired from e2e/e2e.sh): two historical tiers (historicalstier1 / historicalstier2) with rollingDeploy: true, patch to trigger a rollout, and checks that only one of the two historical StatefulSets is mid-update at a time, with lexicographically first tier finishing before the second starts (when transitions are observable at the poll interval).

Testing

  • Tested on a real Kubernetes cluster
  • Tested for backward compatibility on an existing cluster (no API/schema change; behavior change is ordering only, which is the intended fix for rollingDeploy with multiple nodes per NodeType).
  • Comments in getNodeSpecsByOrder document why sorting is required (kept short).
  • User-facing docs (e.g. operator docs): not added; the release note text below can be used if the project publishes release notes.

Release note (suggested)

Druid Operator: When rollingDeploy is enabled, rollout order for multiple StatefulSets/Deployments that share the same NodeType (e.g. historicalstier1 and historicalstier2) is now stable (sorted by node spec key). That avoids concurrent rollouts within the same NodeType caused by non-deterministic map iteration.

This is especially helpful when segment replicas are spread across the two tiers (one replica in each tier). Today, both historical tiers rolling out at once leaves the Druid cluster partially unavailable.


Key changed/added files

  • controllers/druid/ordering.go — sort within each NodeType by spec key
  • controllers/druid/ordering_test.go — Ginkgo + testing.T coverage
  • controllers/druid/testdata/ordering.yaml — fixture with multiple historical tiers
  • e2e/configs/druid-rolling-deploy-cr.yaml — rolling-deploy E2E CR
  • e2e/test-rolling-deploy-ordering.sh — E2E script
  • e2e/e2e.sh — invoke the new E2E test

Fixes #XXXX.

Sort node specs by map key within each NodeType in getNodeSpecsByOrder
so rollingDeploy does not flap on Go map iteration. Add unit tests
(prove non-determinism pre-fix) and an E2E check for two historical
tiers (historicalstier1/2).
@razinbouzar
Contributor

@aruraghuwanshi I believe the failing 900s timeout is caused by the test’s rollout trigger, not by the deterministic ordering change itself.

e2e/test-rolling-deploy-ordering.sh currently patches spec.nodes.historicalstier{1,2}.workloadAnnotations and assumes that this forces both StatefulSets to roll. In practice, that only changes StatefulSet object metadata, not the pod template, so the StatefulSet updateRevision never changes and the script eventually times out.

I’d recommend:

  • patch podAnnotations instead of workloadAnnotations to force a real StatefulSet revision change
  • fail early if tier1 never picks up a new revision
  • assert the ordering directly via revision progression:
    • tier1 revision changes first
    • tier2 revision is still unchanged at that moment
    • tier1 rollout completes
    • tier2 revision changes only afterwards
  • use trap-based cleanup so failed runs do not leak test resources

I tested this locally and the revised flow behaves as expected: the rollout is triggered, tier1 updates first, tier2 waits, and the test completes successfully without hitting the 900s timeout.
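
For reference, a minimal sketch of the distinction, assuming the operator copies workloadAnnotations onto the StatefulSet object and podAnnotations onto its pod template (the operator's real plumbing may differ):

import appsv1 "k8s.io/api/apps/v1"

// applyAnnotations illustrates why only podAnnotations force a rollout: the
// controller-revision hash (updateRevision) is computed from spec.template,
// not from the StatefulSet's own metadata.
func applyAnnotations(sts *appsv1.StatefulSet, workloadAnnotations, podAnnotations map[string]string) {
    // workloadAnnotations land on the StatefulSet object itself; changing them
    // does not change .status.updateRevision, so no pods roll.
    if sts.Annotations == nil {
        sts.Annotations = map[string]string{}
    }
    for k, v := range workloadAnnotations {
        sts.Annotations[k] = v
    }

    // podAnnotations land on the pod template; any change here yields a new
    // updateRevision, which is what the E2E script can poll to see tier1
    // rolling before tier2.
    if sts.Spec.Template.Annotations == nil {
        sts.Spec.Template.Annotations = map[string]string{}
    }
    for k, v := range podAnnotations {
        sts.Spec.Template.Annotations[k] = v
    }
}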

@razinbouzar
Contributor

@AdheipSingh can you take a look at this PR?

workloadAnnotations only touch StatefulSet object metadata, not the pod
template, so updateRevision never changes and the test times out at 900s.

Switch to podAnnotations (which flow into PodTemplateSpec), add trap-based
cleanup, and fail fast if tier1 never picks up a new revision.
@aruraghuwanshi
Author

Thanks @razinbouzar for the insights. That does seem to be the core issue. I've pushed another commit fixing it. Let's see.

@razinbouzar
Contributor

@abhishekrb19 or @AdheipSingh can you kick off the test workflow and review?

@AdheipSingh
Member

AdheipSingh commented Apr 26, 2026

Note: The following review is written purely by me and was not edited, reviewed, or inspired in any way by AI.

Problem Statement

The issue raised is valid: the operator reconciles multiple druid nodes based purely on nodeType. The underlying structure is map[string]nodeSpec, and within nodeSpec the only way to segregate nodes is by nodeType. So if I have 10 historicals with X, Y, Z names, they will all get upgraded at the same time, which is a pure disruption.

The implementation raised in this PR is basically a workaround where you end up sorting nodes by the map[string]nodeSpec key, i.e. just on the string. Go's sort.Slice with string comparison (specs[i].key < specs[j].key) uses lexicographic (dictionary) ordering. This doesn't solve the core problem, which is that the operator isn't aware of the upgrade order and tiers. Solving this within the code isn't the best way until we abstract and build that awareness. Also, tbh, druid CRs are large and maintaining the upgrade order based on the order in which the spec is written is not very friendly. I should be able to define my order and write my spec in any way.

I would propose for an implementation which fixes the problem at its core, and sticks to druid native design on how it represents nodeTypes and tiers.

Approach

If I break down the main abstractions, we are dealing with 4 constructs (nodeType is present as of now; I plan to introduce order of upgrade, tiers, and order of upgrade of tiers):

  • OrderOfUpgrade
  • OrderOfUpgradeOfTiers
  • NodeTypes
  • Tiers

As a user I should be able to define the following for my druid nodes:

  • I want to upgrade my druid nodes in a certain defined order.
  • I want to define tiers within my druid nodes.
  • I want to define the order of upgrade of my tiers for each druid nodeType defined.

Tiers

In druid we can categorize historicals as well as brokers into tiers: a tier defines separate groups of Historicals and Brokers that receive different query assignments, loading rules, etc.

How does this map in the operator spec ?

We introduce tier as a key within the nodeSpec, scoped the same as nodeType.

historical-aws-az1:
    tier: hot
    nodeType: historical
historical-aws-az2:
    tier: cold
    nodeType: historical
broker-aws-az1:
    tier: hot
    nodeType: broker
broker-aws-az2:
    tier: cold
    nodeType: broker
.....

OrderOfUpgrade

Remove the hardcoded order in the code. Although that order follows druid's recommendation, a lot of times I have needed a custom order of upgrade. The main for loop should construct an order based on this:

orderOfUpgrade:
 - historical
 - broker
 - middlemanager
 - coordinator
 - router
 - overlord

This is a []string structure and should be built on each reconcile.
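
A possible sketch of that per-reconcile construction, with a fallback to the current hardcoded order (function and field names here are assumed, not existing code):

// resolveUpgradeOrder returns the nodeType rollout order for this reconcile:
// the user-defined orderOfUpgrade if present, otherwise the current default,
// with any omitted nodeTypes appended so they are still reconciled.
func resolveUpgradeOrder(specOrder []string) []string {
    defaultOrder := []string{"historical", "broker", "middleManager", "coordinator", "router", "overlord"}
    if len(specOrder) == 0 {
        return defaultOrder
    }
    seen := map[string]bool{}
    order := make([]string, 0, len(defaultOrder))
    for _, nt := range specOrder {
        if !seen[nt] {
            seen[nt] = true
            order = append(order, nt)
        }
    }
    for _, nt := range defaultOrder {
        if !seen[nt] {
            order = append(order, nt)
        }
    }
    return order
}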

OrderOfUpgradeOfTiers

Within a nodeType we need the ability to define the order of upgrade of its tiers, so that needs to be defined as a separate structure per nodeType:

orderOfUpgradeOfTiers:
  historical:
    - hot
    - cold
    - glacier
  broker:
    - hot
    - cold

Here's a combined spec.

spec:
  orderOfUpgrade:
    - historical
    - broker
    - middleManager
    - coordinator
    - router
    - overlord
  
  orderOfUpgradeOfTiers:
    historical:
      - hot
      - cold
      - glacier
    broker:
      - hot
      - cold
  
  nodes:
    historical-aws-az1:
      tier: hot
      nodeType: historical
    
    historical-aws-az2:
      tier: cold
      nodeType: historical
    
    historical-aws-az3:
      tier: glacier
      nodeType: historical
    
    broker-aws-az1:
      tier: hot
      nodeType: broker
    
    broker-aws-az2:
      tier: cold
      nodeType: broker
    
    middlemanagers:
      nodeType: middleManager
    
    coordinators:
      nodeType: coordinator
    
    routers:
      nodeType: router
    
    overlords:
      nodeType: overlord

Upgrade execution:

  1. historical:hot → historical-aws-az1
  2. historical:cold → historical-aws-az2
  3. historical:glacier → historical-aws-az3
  4. broker:hot → broker-aws-az1
  5. broker:cold → broker-aws-az2
  6. middleManager → middlemanagers
  7. coordinator → coordinators
  8. router → routers
  9. overlord → overlords
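
A rough sketch of how these additions could look in the Go API types (field names and json tags are illustrative, not final):

type DruidSpec struct {
    // ... existing fields ...

    // OrderOfUpgrade overrides the hardcoded nodeType rollout order.
    OrderOfUpgrade []string `json:"orderOfUpgrade,omitempty"`

    // OrderOfUpgradeOfTiers maps a nodeType to the order its tiers roll out in.
    OrderOfUpgradeOfTiers map[string][]string `json:"orderOfUpgradeOfTiers,omitempty"`

    Nodes map[string]DruidNodeSpec `json:"nodes"`
}

type DruidNodeSpec struct {
    // ... existing fields ...

    NodeType string `json:"nodeType"`

    // Tier groups nodes of the same nodeType (e.g. hot/cold historicals); empty
    // means untiered, preserving backward compatibility.
    Tier string `json:"tier,omitempty"`
}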

Implementation

The signature remains as it is:

func getNodeSpecsByOrder(m *v1alpha1.Druid) []*ServiceGroup {
    // Returns ordered list of ServiceGroups ready for sequential rollout
}

The existing ServiceGroup should be extended to:

type ServiceGroup struct {
    key      string                 // map key from spec.nodes
    nodeType string                 // druid nodeType (historical, broker, ...)
    tier     string                 // tier within the nodeType, empty if untiered
    spec     v1alpha1.DruidNodeSpec
}

An ideal service group after construction should look like this:

[
    {key: "historical-aws-az1", nodeType: "historical", tier: "hot", spec: ...},
    {key: "historical-aws-az2", nodeType: "historical", tier: "cold", spec: ...},
    {key: "historical-aws-az3", nodeType: "historical", tier: "glacier", spec: ...},
    {key: "broker-az1", nodeType: "broker", tier: "hot", spec: ...},
    {key: "broker-az2", nodeType: "broker", tier: "cold", spec: ...},
    {key: "mm-1", nodeType: "middleManager", tier: "", spec: ...},
]
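
A possible sketch of how getNodeSpecsByOrder could build that list under this proposal (uses the resolveUpgradeOrder fallback sketched above and the sort package; all new fields are the proposed additions, not existing API):

func getNodeSpecsByOrder(m *v1alpha1.Druid) []*ServiceGroup {
    var ordered []*ServiceGroup
    for _, nodeType := range resolveUpgradeOrder(m.Spec.OrderOfUpgrade) {
        // Group this nodeType's specs by tier (empty tier for untiered nodes).
        byTier := map[string][]*ServiceGroup{}
        for key, spec := range m.Spec.Nodes {
            if spec.NodeType != nodeType {
                continue
            }
            byTier[spec.Tier] = append(byTier[spec.Tier],
                &ServiceGroup{key: key, nodeType: nodeType, tier: spec.Tier, spec: spec})
        }

        // Tier order from the CR; if none is defined, fall back to all observed
        // tiers sorted lexicographically so the behavior stays deterministic and
        // backward compatible with untiered specs.
        tiers := m.Spec.OrderOfUpgradeOfTiers[nodeType]
        if len(tiers) == 0 {
            for t := range byTier {
                tiers = append(tiers, t)
            }
            sort.Strings(tiers)
        }

        for _, tier := range tiers {
            group := byTier[tier]
            // Sort within a tier by map key so specs in the same tier also
            // roll out in a stable order.
            sort.Slice(group, func(i, j int) bool { return group[i].key < group[j].key })
            ordered = append(ordered, group...)
        }
    }
    return ordered
}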

This way we solve the problem at its core. I have half of the implementation already done and would like to raise a PR and get feedback. Also, this needs to be backward compatible, so we make sure that regardless of whether the user specifies the above, we fall back to the default/current behavior.

@aruraghuwanshi
Author

Thanks for putting this up @AdheipSingh; the structure you're proposing makes sense. Looking forward to your PR.

@AdheipSingh
Member

closing this, PR raised

@AdheipSingh AdheipSingh closed this May 6, 2026