feat: add DruidNode deploymentGroup field to support R/B deployments #19413
jtuglu1 wants to merge 2 commits into
Conversation
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
This is an automated review by Codex GPT-5
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 2 |
| P3 | 0 |
| Total | 2 |
This is an automated review by Codex GPT-5
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
Reviewed 33 of 33 changed files.
This is an automated review by Codex GPT-5.5
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 1 |
| P3 | 0 |
| Total | 2 |
Reviewed 35 of 35 changed files.
This is an automated review by Codex GPT-5.5
```java
final Set<String> watched = segmentWatcherConfig.getWatchedDeploymentGroups();
if (watched != null && !watched.contains(server.getDeploymentGroup())) {
```
[P1] Propagate deploymentGroup through inventory DruidServer construction
This filter depends on DruidServerMetadata.getDeploymentGroup(), but the normal discovery inventory path still builds DruidServer instances through DruidServer constructors that do not pass DruidNode.getDeploymentGroup() into DruidServerMetadata. In a real cluster, Historicals therefore arrive here with a null deploymentGroup, so setting druid.broker.segment.watchedDeploymentGroups filters out all Historicals; the same lost metadata also prevents coordinator resources from seeing active deployment groups. Please preserve the DruidNode deploymentGroup when constructing inventory DruidServer metadata.
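To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical stand-in types, not Druid's actual `DruidServerMetadata`): if an inventory constructor drops `DruidNode.getDeploymentGroup()`, the discovered server carries a null group, and the watched-groups filter then rejects every such server.

```java
import java.util.HashSet;
import java.util.Set;

public class DeploymentGroupFilterSketch
{
  // Stand-in for the metadata carried by a discovered server.
  record ServerMetadata(String name, String deploymentGroup) {}

  // Mirrors the broker-side filter from the diff above: a server passes
  // only if no watch list is set or its group is in the list.
  static boolean isWatched(ServerMetadata server, Set<String> watchedGroups)
  {
    return watchedGroups == null || watchedGroups.contains(server.deploymentGroup());
  }

  public static void main(String[] args)
  {
    Set<String> watched = new HashSet<>(Set.of("black"));

    // Construction path that propagates the deployment group: kept.
    System.out.println(isWatched(new ServerMetadata("historical-1", "black"), watched));

    // Construction path that loses it (deploymentGroup == null): the
    // server is filtered out, so setting watchedDeploymentGroups would
    // silently drop all Historicals built through that path.
    System.out.println(isWatched(new ServerMetadata("historical-2", null), watched));
  }
}
```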
```java
    (remainingSegmentsToMove + remainingGroups - 1) / remainingGroups;
if (groupMaxSegmentsToMove > 0) {
  new TierSegmentBalancer(tier, groupServers, groupMaxSegmentsToMove, params).run();
  remainingSegmentsToMove -= groupMaxSegmentsToMove;
```
[P2] Do not consume unused balance budget
The per-group loop subtracts groupMaxSegmentsToMove even when TierSegmentBalancer moves zero segments. With maxSegmentsToMove lower than the number of groups, a first group that is already balanced can consume the whole budget every coordinator run, leaving later skewed groups with groupMaxSegmentsToMove == 0 indefinitely. This still honors the numeric cap, but can permanently prevent balancing in later deployment groups; base the budget on actual moves or rotate/fairly allocate across runs.
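One possible shape of the fix, sketched with hypothetical names rather than the actual coordinator classes: keep the same ceiling split per group, but deduct only the segments a group actually moved, so an already-balanced group cannot starve the groups after it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BalanceBudgetSketch
{
  // pendingMovesPerGroup: how many segments each group would move if given
  // unlimited budget. Returns how much budget each group actually consumes.
  static Map<String, Integer> allocate(Map<String, Integer> pendingMovesPerGroup, int maxSegmentsToMove)
  {
    Map<String, Integer> moved = new LinkedHashMap<>();
    int remaining = maxSegmentsToMove;
    int remainingGroups = pendingMovesPerGroup.size();
    for (Map.Entry<String, Integer> entry : pendingMovesPerGroup.entrySet()) {
      // Same ceiling split as the PR's loop...
      int cap = (remaining + remainingGroups - 1) / remainingGroups;
      // ...but only the segments actually moved are charged to the budget.
      int used = Math.min(cap, entry.getValue());
      moved.put(entry.getKey(), used);
      remaining -= used;
      remainingGroups--;
    }
    return moved;
  }

  public static void main(String[] args)
  {
    Map<String, Integer> pending = new LinkedHashMap<>();
    pending.put("groupA", 0); // already balanced
    pending.put("groupB", 5); // skewed
    // With maxSegmentsToMove = 1, groupA consumes nothing, so groupB can
    // still move a segment; charging the full allotment would give it 0.
    System.out.println(allocate(pending, 1));
  }
}
```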
Description
Druid Deployment Overview
Currently, deployment in Druid is geared towards "rolling" deployments, which, while potentially cheaper and faster, are not the safest deployment mechanism because there is no isolation while the new cluster comes up.
A red/black (a.k.a. blue/green) deployment is better suited to cases where you want to bring up another Druid cluster in isolation from the existing one (but in the same ZK/K8s discovery namespace) in order to perform safety/performance checks before cutting over to the new deployment. The Overlord already supports a similar concept of worker "versioning", where it only schedules peons on MMs/Indexers running the version it is itself configured with (or higher), allowing the cluster to eventually drain the ASGs running older Druid versions.
However, that functionality only gets us part of the way to what is effectively a zero-downtime (both query and ingest) deployment. To achieve a fully isolated deployment environment (with the exception of the master nodes: {coordinator, overlord}) where we can mirror queries, observe state, etc., we also need to support version-based routing of queries.
Requirements
To do this, we need two things:
Users must not be impacted by data loading/unavailability issues when a node is rolled. This is especially pertinent when a segment has no replica in a given tier, in which case rolling an instance requires either cloning the historical first or taking downtime. This applies to both historical and realtime segments: we should avoid any data availability/freshness problems during deployment.
681cbde provided support for tier aliases (so duplicate historical tier deployments can be brought up transparently to the user/operator).
Since Druid queries are generally read-only, isolating the new cluster from user traffic until it is deemed "healthy" is critical to ensure no regressions are deployed. This is also particularly helpful for performance/load tests. This PR provides the query routing support to route user queries strictly to `old` nodes (router/broker/historical/peon), and any traffic sent to `new` nodes to route to `new` router/broker/historical/peon nodes.
NB: we allow queries from both versions to hit old and new peons, to allow for convenient roll-back and easier data availability properties when forcing task group hand-off. This behavior can be toggled via `druid.broker.segment.strictRealtimeDeploymentGroupFilter`; setting it to `true` excludes realtime servers with non-matching versions from query planning.
Deployment Steps
The combination of these two changes supports the following deployment process:
1. Bring up the `new` Druid ASGs: router, broker, historical, coordinator, overlord, MM, etc.
2. Run the `new` Druid version coordinator.
3. The `new` router/broker will be able to query only historicals of their same version (peons are by default queryable by all versions).
4. Cut traffic over to the `new` router/broker/historicals.
This deployment method combines the traditional red/black deployment with Druid's rolling deployment, providing zero ingest downtime as well as zero query downtime for users (both in terms of availability and data freshness). It also provides ample time to experiment/canary changes without impacting user traffic.
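As a concrete illustration, a broker in the `new` deployment group might set the two properties discussed above in its runtime.properties; the group name and values here are purely illustrative:

```properties
# Only watch servers in the "black" deployment group (illustrative value).
druid.broker.segment.watchedDeploymentGroups=["black"]
# Allow realtime servers (peons) from either group, easing roll-back.
druid.broker.segment.strictRealtimeDeploymentGroupFilter=false
```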
Release note
This PR has: