Introduce cost-based tasks autoscaler for streaming ingestion#18819

Merged
kfaraz merged 23 commits into apache:master from Fly-Style:new-autoscaler
Dec 17, 2025

Fly-Style commented Dec 5, 2025

Cost-Based Autoscaler for Seekable Stream Supervisors

Overview

Implements a cost-based autoscaling algorithm for seekable stream supervisor tasks that optimizes task count by balancing lag reduction against resource efficiency.

Note: this patch does not yet support scaling down during task rollover. For now, scale-down is performed in the same manner as scale-up.
Introduces WeightedCostFunction for cost-based autoscaling decisions. The function computes a cost score (in seconds) for each candidate task count, balancing lag recovery time against idle resource waste.

Key Design Decisions

Cost Formula

totalCost = lagWeight × lagRecoveryTime + idleWeight × idlenessCost
  • lagRecoveryTime = aggregateLag / (taskCount × avgProcessingRate) — time to clear backlog
  • idlenessCost = taskCount × taskDuration × predictedIdleRatio — wasted compute time
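
A minimal sketch of the formula above, for illustration only. It is written in Python rather than taken from the patch, and the function name, parameter names, and the example weights are assumptions, not the actual WeightedCostFunction API.

```python
def total_cost(task_count, aggregate_lag, avg_processing_rate,
               task_duration_s, predicted_idle_ratio,
               lag_weight, idle_weight):
    """Weighted cost (in seconds) of running `task_count` tasks.

    All names here are hypothetical; only the formula comes from the PR text.
    """
    # Time to clear the backlog if every task processes at the average rate.
    lag_recovery_time = aggregate_lag / (task_count * avg_processing_rate)
    # Compute time expected to be spent idle over one task duration.
    idleness_cost = task_count * task_duration_s * predicted_idle_ratio
    return lag_weight * lag_recovery_time + idle_weight * idleness_cost


# Example with made-up numbers: 120,000 records of lag, 4 tasks at
# 100 records/s each, 1-hour tasks, 10% predicted idle, equal weights.
example = total_cost(4, 120_000, 100.0, 3600, 0.1, lag_weight=1.0, idle_weight=1.0)
```

With these inputs the lag term is 300 s and the idle term is about 1440 s, so in this example the idle term dominates unless the weights are adjusted.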

Idle Prediction Model

Uses capacity-based linear scaling:

predictedIdle = 1 - (1 - currentIdle) / (proposedTasks / currentTasks)

More tasks → more idle per task; fewer tasks → busier tasks.
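
The model above can be sketched as follows. This is an illustrative Python version, not code from the patch. The clamping to [0, 1] is an assumption added here, because the raw formula goes negative when scaling down a busy supervisor; the function name is hypothetical.

```python
def predicted_idle(current_idle, current_tasks, proposed_tasks):
    """Predicted per-task idle ratio after changing the task count.

    Assumes the total busy work stays constant and is spread evenly
    across the proposed number of tasks (capacity-based linear scaling).
    """
    scale = proposed_tasks / current_tasks
    raw = 1.0 - (1.0 - current_idle) / scale
    # Assumption: clamp to a valid ratio; the PR text does not state this.
    return min(1.0, max(0.0, raw))


# Doubling 4 tasks that are 40% idle halves the busy fraction (0.6 to 0.3).
scaled_up = predicted_idle(0.4, current_tasks=4, proposed_tasks=8)
# Halving them would need a busy fraction of 1.2, so the result clamps to 0.
scaled_down = predicted_idle(0.4, current_tasks=4, proposed_tasks=2)
```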

Ideal Idle Range

Defines optimal utilization as idle ratio within [0.2, 0.6]:

  • Below 0.2: overloaded → scale up
  • Within range: optimal → no action
  • Above 0.6: underutilized → scale down
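
As a rough illustration of the three bands above, the sketch below classifies an observed idle ratio. The function name and return values are hypothetical, and treating both bounds as inclusive follows the closed interval [0.2, 0.6] stated above.

```python
def scaling_hint(idle_ratio, low=0.2, high=0.6):
    """Map an idle ratio to the action suggested by the ideal idle range."""
    if idle_ratio < low:
        return "scale_up"      # overloaded
    if idle_ratio > high:
        return "scale_down"    # underutilized
    return "no_action"         # within [low, high]: optimal
```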

Conservative Cold Start Behavior

When processing rate is unavailable (cold start, new tasks):

  • Current task count: cost = 0.01 (allowed)
  • Any scaling: cost = +∞ (prohibited)

This prevents scaling decisions based on incomplete data.
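
A small sketch of how such a guard could behave, in Python for illustration. How the patch represents an unavailable processing rate is not stated here, so treating `None` or a non-positive value as "unavailable" is an assumption, as are the function and parameter names.

```python
import math


def cost_with_cold_start_guard(candidate_tasks, current_tasks,
                               avg_processing_rate, regular_cost):
    """Cost of a candidate task count, with the cold-start rule applied.

    `regular_cost` is a hypothetical callable that computes the normal
    weighted cost for a task count once a processing rate is known.
    """
    rate_unavailable = avg_processing_rate is None or avg_processing_rate <= 0
    if rate_unavailable:
        # Only the current task count is allowed; any change is prohibited.
        return 0.01 if candidate_tasks == current_tasks else math.inf
    return regular_cost(candidate_tasks)


# During cold start, the cheapest candidate is always the current count.
costs = {n: cost_with_cold_start_guard(n, 4, None, lambda k: 1.0)
         for n in (1, 2, 4, 8)}
best = min(costs, key=costs.get)
```

Because every other candidate costs +∞, a minimum-cost selection keeps the task count unchanged until a processing rate becomes available.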

Additionally, this PR adds reading of the poll-idle-ratio-avg metric from the /rowStats task endpoint.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

6 participants