
fix: deterministic rollout order for multiple nodes per NodeType #15

Closed

aruraghuwanshi wants to merge 2 commits into apache:master from aruraghuwanshi:deterministic-rollout-ordering


Conversation

@aruraghuwanshi aruraghuwanshi commented Apr 23, 2026

Summary

getNodeSpecsByOrder groups node specs by NodeType, but previously appended them in whatever order for key, nodeSpec := range m.Spec.Nodes produced. In Go, map iteration order is not stable, so the relative order of multiple specs with the same NodeType could change between reconciles.

With rollingDeploy: true, the handler walks that ordered list and may return early while a workload is still rolling. If the intra–NodeType order flips between calls, the operator can effectively start or advance rollouts for more than one StatefulSet/Deployment of the same NodeType at a time, instead of finishing one before the next.

This PR sorts specs by their map key (ascending) within each NodeType before building the final list, while keeping the existing cross–NodeType order from druidServicesOrder unchanged.
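
In a self-contained sketch (type names and the druidServicesOrder values below are simplified stand-ins for the operator's actual code), the change looks like this:

package main

import (
    "fmt"
    "sort"
)

// Minimal stand-ins for the operator's types; field names follow the PR text.
type nodeSpec struct {
    NodeType string
}

type serviceGroup struct {
    key  string
    spec nodeSpec
}

// Cross-NodeType order kept from druidServicesOrder (values here are illustrative).
var druidServicesOrder = []string{"historical", "middleManager", "broker", "coordinator", "router", "overlord"}

// getNodeSpecsByOrder groups specs by NodeType in druidServicesOrder, then sorts
// each per-NodeType group by its map key so the intra-NodeType order no longer
// depends on Go's randomized map iteration.
func getNodeSpecsByOrder(nodes map[string]nodeSpec) []*serviceGroup {
    grouped := map[string][]*serviceGroup{}
    for key, spec := range nodes {
        grouped[spec.NodeType] = append(grouped[spec.NodeType], &serviceGroup{key: key, spec: spec})
    }
    var ordered []*serviceGroup
    for _, nodeType := range druidServicesOrder {
        group := grouped[nodeType]
        sort.Slice(group, func(i, j int) bool { return group[i].key < group[j].key })
        ordered = append(ordered, group...)
    }
    return ordered
}

func main() {
    nodes := map[string]nodeSpec{
        "historicalstier2": {NodeType: "historical"},
        "historicalstier1": {NodeType: "historical"},
        "brokers":          {NodeType: "broker"},
    }
    // Prints historicalstier1, historicalstier2, brokers on every run.
    for _, sg := range getNodeSpecsByOrder(nodes) {
        fmt.Println(sg.key)
    }
}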


What changed

  • getNodeSpecsByOrder (controllers/druid/ordering.go): sort.Slice on each per–NodeType slice by ServiceGroup.key.
  • Unit tests (controllers/druid/ordering_test.go): Ginkgo test data uses multiple historical tiers (historicalstier1-3); added testing.T tests that would fail on the pre-fix map-only ordering and pass with stable sorting, plus a guard for cross–NodeType order.
  • E2E (e2e/configs/druid-rolling-deploy-cr.yaml, e2e/test-rolling-deploy-ordering.sh, wired from e2e/e2e.sh): two historical tiers (historicalstier1 / historicalstier2) with rollingDeploy: true, patch to trigger a rollout, and checks that only one of the two historical StatefulSets is mid-update at a time, with lexicographically first tier finishing before the second starts (when transitions are observable at the poll interval).

Testing

  • Tested on a real Kubernetes cluster
  • Tested for backward compatibility on an existing cluster (no API/schema change; behavior change is ordering only, which is the intended fix for rollingDeploy with multiple nodes per NodeType).
  • Comments in getNodeSpecsByOrder document why sorting is required (kept short).
  • User-facing docs (e.g. operator docs): not added; the release note text below can be used if the project publishes release notes.

Release note (suggested)

Druid Operator: When rollingDeploy is enabled, rollout order for multiple StatefulSets/Deployments that share the same NodeType (e.g. historicalstier1 and historicalstier2) is now stable (sorted by node spec key). That avoids concurrent rollouts within the same NodeType caused by non-deterministic map iteration.

This is especially helpful when segment replicas are spread across the two tiers (one replica in each tier). Today, both historical tiers rolling out at once leaves the Druid cluster partially unavailable.


Key changed/added files

  • controllers/druid/ordering.go — sort within each NodeType by spec key
  • controllers/druid/ordering_test.go — Ginkgo + testing.T coverage
  • controllers/druid/testdata/ordering.yaml — fixture with multiple historical tiers
  • e2e/configs/druid-rolling-deploy-cr.yaml — rolling-deploy E2E CR
  • e2e/test-rolling-deploy-ordering.sh — E2E script
  • e2e/e2e.sh — invoke the new E2E test

Fixes #XXXX.

Sort node specs by map key within each NodeType in getNodeSpecsByOrder
so rollingDeploy does not flap on Go map iteration. Add unit tests
(prove non-determinism pre-fix) and an E2E check for two historical
tiers (historicalstier1/2).
@razinbouzar
Contributor

@aruraghuwanshi I believe the failing 900s timeout is caused by the test’s rollout trigger, not by the deterministic ordering change itself.

e2e/test-rolling-deploy-ordering.sh currently patches spec.nodes.historicalstier{1,2}.workloadAnnotations and assumes that this forces both StatefulSets to roll. In practice, that only changes StatefulSet object metadata, not the pod template, so the StatefulSet updateRevision never changes and the script eventually times out.

I’d recommend:

  • patch podAnnotations instead of workloadAnnotations to force a real StatefulSet revision change
  • fail early if tier1 never picks up a new revision
  • assert the ordering directly via revision progression:
    • tier1 revision changes first
    • tier2 revision is still unchanged at that moment
    • tier1 rollout completes
    • tier2 revision changes only afterwards
  • use trap-based cleanup so failed runs do not leak test resources

I tested this locally and the revised flow behaves as expected: the rollout is triggered, tier1 updates first, tier2 waits, and the test completes successfully without hitting the 900s timeout.
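
For reference, a minimal sketch of the distinction, assuming the operator copies workloadAnnotations onto the StatefulSet object and podAnnotations onto its pod template (the operator's real plumbing may differ):

import appsv1 "k8s.io/api/apps/v1"

// applyAnnotations illustrates why only podAnnotations force a rollout: the
// controller-revision hash (updateRevision) is computed from spec.template,
// not from the StatefulSet's own metadata.
func applyAnnotations(sts *appsv1.StatefulSet, workloadAnnotations, podAnnotations map[string]string) {
    // workloadAnnotations land on the StatefulSet object itself; changing them
    // does not change .status.updateRevision, so no pods roll.
    if sts.Annotations == nil {
        sts.Annotations = map[string]string{}
    }
    for k, v := range workloadAnnotations {
        sts.Annotations[k] = v
    }

    // podAnnotations land on the pod template; any change here yields a new
    // updateRevision, which is what the E2E script can poll to see tier1
    // rolling before tier2.
    if sts.Spec.Template.Annotations == nil {
        sts.Spec.Template.Annotations = map[string]string{}
    }
    for k, v := range podAnnotations {
        sts.Spec.Template.Annotations[k] = v
    }
}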

@razinbouzar
Contributor

@AdheipSingh can you take a look at this PR?

workloadAnnotations only touch StatefulSet object metadata, not the pod
template, so updateRevision never changes and the test times out at 900s.

Switch to podAnnotations (which flow into PodTemplateSpec), add trap-based
cleanup, and fail fast if tier1 never picks up a new revision.
@aruraghuwanshi
Author

Thanks @razinbouzar for the insights. That does seem to be the core issue. I've pushed another commit fixing it. Let's see.

@razinbouzar
Contributor

@abhishekrb19 or @AdheipSingh can you kick off the test workflow and review?

@AdheipSingh
Member

AdheipSingh commented Apr 26, 2026

Note: The following review is written purely by me and was not edited, reviewed, or inspired in any way by AI.

Problem Statement

The issue raised is valid: the operator reconciles multiple druid nodes based purely on nodeType. The underlying structure is map[string]nodeSpec, and within nodeSpec the only way to segregate nodes is by nodeType. So if I have 10 historicals with X, Y, Z names, they will all get upgraded at the same time, which is a pure disruption.

The implementation raised in this PR is basically a workaround where you end up sorting nodes by the map[string]nodeSpec key, i.e. just on the string. Go's sort.Slice with string comparison (specs[i].key < specs[j].key) uses lexicographic (dictionary) ordering. This doesn't solve the core problem, which is that the operator isn't aware of the upgrade order and tiers. Solving this within the code isn't the best way until we abstract and build that awareness. Also, tbh, druid CRs are large and maintaining the upgrade order based on the order in which the spec is written is not very friendly. I should be able to define my order and write my spec in any way.

I would propose for an implementation which fixes the problem at its core, and sticks to druid native design on how it represents nodeTypes and tiers.

Approach

If I break down the main abstractions, we are dealing with 4 constructs (nodeType is present as of now; I plan to introduce order of upgrade, tiers, and order of upgrade of tiers):

  • OrderOfUpgrade
  • OrderOfUpgradeOfTiers
  • NodeTypes
  • Tiers

As a user I should be able to define the following for my druid nodes:

  • I want to upgrade my druid nodes in a certain defined order.
  • I want to define tiers within my druid nodes.
  • I want to define the order of upgrade of my tiers for each druid nodeType defined.

Tiers

In druid we can categorize historicals as well as brokers into tiers: a tier defines separate groups of Historicals and Brokers that receive different query assignments, loading rules, etc.

How does this map in the operator spec ?

We introduce tier as a key within the nodeSpec, scoped the same as nodeType.

historical-aws-az1:
    tier: hot
    nodeType: historical
historical-aws-az2:
    tier: cold
    nodeType: historical
broker-aws-az1:
    tier: hot
    nodeType: broker
broker-aws-az2:
    tier: cold
    nodeType: broker
.....

OrderOfUpgrade

Remove the hardcoded order in the code. Although that order follows druid's recommendation, a lot of times I have needed a custom order of upgrade. The main for loop should construct an order based on this:

orderOfUpgrade:
 - historical
 - broker
 - middlemanager
 - coordinator
 - router
 - overlord

This is a []string structure and should be built on each reconcile.
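
A possible sketch of that per-reconcile construction, with a fallback to the current hardcoded order (function and field names here are assumed, not existing code):

// resolveUpgradeOrder returns the nodeType rollout order for this reconcile:
// the user-defined orderOfUpgrade if present, otherwise the current default,
// with any omitted nodeTypes appended so they are still reconciled.
func resolveUpgradeOrder(specOrder []string) []string {
    defaultOrder := []string{"historical", "broker", "middleManager", "coordinator", "router", "overlord"}
    if len(specOrder) == 0 {
        return defaultOrder
    }
    seen := map[string]bool{}
    order := make([]string, 0, len(defaultOrder))
    for _, nt := range specOrder {
        if !seen[nt] {
            seen[nt] = true
            order = append(order, nt)
        }
    }
    for _, nt := range defaultOrder {
        if !seen[nt] {
            order = append(order, nt)
        }
    }
    return order
}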

OrderOfUpgradeOfTiers

Within a nodeType we need the ability to define the order of upgrade of its tiers, so that needs to be defined as a separate structure per nodeType:

orderOfUpgradeOfTiers:
  historical:
    - hot
    - cold
    - glacier
  broker:
    - hot
    - cold

Here's a combined spec.

spec:
  orderOfUpgrade:
    - historical
    - broker
    - middleManager
    - coordinator
    - router
    - overlord
  
  orderOfUpgradeOfTiers:
    historical:
      - hot
      - cold
      - glacier
    broker:
      - hot
      - cold
  
  nodes:
    historical-aws-az1:
      tier: hot
      nodeType: historical
    
    historical-aws-az2:
      tier: cold
      nodeType: historical
    
    historical-aws-az3:
      tier: glacier
      nodeType: historical
    
    broker-aws-az1:
      tier: hot
      nodeType: broker
    
    broker-aws-az2:
      tier: cold
      nodeType: broker
    
    middlemanagers:
      nodeType: middleManager
    
    coordinators:
      nodeType: coordinator
    
    routers:
      nodeType: router
    
    overlords:
      nodeType: overlord

Upgrade execution:

  1. historical:hot → historical-aws-az1
  2. historical:cold → historical-aws-az2
  3. historical:glacier → historical-aws-az3
  4. broker:hot → broker-aws-az1
  5. broker:cold → broker-aws-az2
  6. middleManager → middlemanagers
  7. coordinator → coordinators
  8. router → routers
  9. overlord → overlords
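
A rough sketch of how these additions could look in the Go API types (field names and json tags are illustrative, not final):

type DruidSpec struct {
    // ... existing fields ...

    // OrderOfUpgrade overrides the hardcoded nodeType rollout order.
    OrderOfUpgrade []string `json:"orderOfUpgrade,omitempty"`

    // OrderOfUpgradeOfTiers maps a nodeType to the order its tiers roll out in.
    OrderOfUpgradeOfTiers map[string][]string `json:"orderOfUpgradeOfTiers,omitempty"`

    Nodes map[string]DruidNodeSpec `json:"nodes"`
}

type DruidNodeSpec struct {
    // ... existing fields ...

    NodeType string `json:"nodeType"`

    // Tier groups nodes of the same nodeType (e.g. hot/cold historicals); empty
    // means untiered, preserving backward compatibility.
    Tier string `json:"tier,omitempty"`
}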

Implementation

The signature remains as it is:

func getNodeSpecsByOrder(m *v1alpha1.Druid) []*ServiceGroup {
    // Returns ordered list of ServiceGroups ready for sequential rollout
}

The existing ServiceGroup should be extended to:

type ServiceGroup struct {
    key      string                 // map key from spec.nodes
    nodeType string                 // druid nodeType (historical, broker, ...)
    tier     string                 // tier within the nodeType, empty if untiered
    spec     v1alpha1.DruidNodeSpec
}

An ideal service group after construction should look like this:

[
    {key: "historical-aws-az1", nodeType: "historical", tier: "hot", spec: ...},
    {key: "historical-aws-az2", nodeType: "historical", tier: "cold", spec: ...},
    {key: "historical-aws-az3", nodeType: "historical", tier: "glacier", spec: ...},
    {key: "broker-az1", nodeType: "broker", tier: "hot", spec: ...},
    {key: "broker-az2", nodeType: "broker", tier: "cold", spec: ...},
    {key: "mm-1", nodeType: "middleManager", tier: "", spec: ...},
]
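
A possible sketch of how getNodeSpecsByOrder could build that list under this proposal (uses the resolveUpgradeOrder fallback sketched above and the sort package; all new fields are the proposed additions, not existing API):

func getNodeSpecsByOrder(m *v1alpha1.Druid) []*ServiceGroup {
    var ordered []*ServiceGroup
    for _, nodeType := range resolveUpgradeOrder(m.Spec.OrderOfUpgrade) {
        // Group this nodeType's specs by tier (empty tier for untiered nodes).
        byTier := map[string][]*ServiceGroup{}
        for key, spec := range m.Spec.Nodes {
            if spec.NodeType != nodeType {
                continue
            }
            byTier[spec.Tier] = append(byTier[spec.Tier],
                &ServiceGroup{key: key, nodeType: nodeType, tier: spec.Tier, spec: spec})
        }

        // Tier order from the CR; if none is defined, fall back to all observed
        // tiers sorted lexicographically so the behavior stays deterministic and
        // backward compatible with untiered specs.
        tiers := m.Spec.OrderOfUpgradeOfTiers[nodeType]
        if len(tiers) == 0 {
            for t := range byTier {
                tiers = append(tiers, t)
            }
            sort.Strings(tiers)
        }

        for _, tier := range tiers {
            group := byTier[tier]
            // Sort within a tier by map key so specs in the same tier also
            // roll out in a stable order.
            sort.Slice(group, func(i, j int) bool { return group[i].key < group[j].key })
            ordered = append(ordered, group...)
        }
    }
    return ordered
}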

This way we solve the problem at its core. I have half of the implementation already done and would like to raise a PR and get feedback. Also, this needs to be backward compatible, so we make sure that regardless of whether the user specifies the above, we fall back to the default/current behavior.

@aruraghuwanshi
Author

Thanks for putting this up @AdheipSingh; the structure you're proposing makes sense. Looking forward to your PR.

@AdheipSingh
Member

closing this, PR raised

@AdheipSingh AdheipSingh closed this May 6, 2026