[Swarming] Add backpressure mechanism to SwarmingService#5284

Merged
IvanBM18 merged 43 commits into master from feature/swarming/backpressure on Jun 2, 2026

Conversation

Collaborator

@IvanBM18 IvanBM18 commented May 19, 2026

Overview

This PR implements a backpressure mechanism in SwarmingService to prevent overloading the Swarming pool with too many tasks.

Changes

  • Added MAX_PENDING_TASKS = 25 to limit the number of pending tasks.
    • This threshold was chosen considering a pool of at most 25 bots and that a single scheduler takes ~3 min to schedule all of its tasks. Since we have 1000 schedulers in prod, the best choice is to aim for a stable backlog instead.
    • A limit of 25 ensures that the 25 bots always have a buffer of work without building up a massive backlog.
  • Updated create_utask_main_jobs to check the queue size before pushing each task using Swarming's CountTasks RPC.
  • Implemented a fail-closed strategy: if the CountTasks RPC call fails, we assume the queue is full and stop scheduling further tasks in the batch.
  • Added unit tests to verify backpressure and failure handling.
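
The check-before-push loop and the fail-closed rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `is_queue_full`, `schedule_batch`, the `count_pending` and `push` callables, and the `>=` comparison are all hypothetical stand-ins, since the actual CountTasks call and its error types are not shown here.

```python
from typing import Callable, Iterable, List

MAX_PENDING_TASKS = 25  # Value taken from the PR description.


def is_queue_full(count_pending: Callable[[], int],
                  max_pending: int = MAX_PENDING_TASKS) -> bool:
  """Returns True when no more tasks should be pushed.

  Fail closed: if the count lookup raises, the queue is treated as full.
  """
  try:
    pending = count_pending()
  except Exception:  # Assumption: the real RPC error types are not shown.
    return True
  # Assumption: the limit is inclusive (full once pending reaches the limit).
  return pending >= max_pending


def schedule_batch(tasks: Iterable[str],
                   count_pending: Callable[[], int],
                   push: Callable[[str], None],
                   max_pending: int = MAX_PENDING_TASKS) -> List[str]:
  """Pushes tasks one by one, checking the queue size before each push.

  Stops at the first task for which the queue is full or the count fails,
  and returns the tasks that were actually pushed.
  """
  pushed = []
  for task in tasks:
    if is_queue_full(count_pending, max_pending):
      break
    push(task)
    pushed.append(task)
  return pushed
```

Checking before every push, rather than once per batch, keeps a single scheduler from overshooting the limit by a whole batch; the cost is one count lookup per task.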

Tests performed

This image shows that the utask main scheduler picks tasks from both the utask_main queue and the new utask_main-swarming queue:
image

Here is another log showing that non-Swarming tasks are still being scheduled:
image

Here are some logs of the scheduler pushing tasks into swarming:
image

@IvanBM18 IvanBM18 self-assigned this May 20, 2026
@IvanBM18 IvanBM18 added the swarming Changes related to the clusterfuzz-swarming integration label May 20, 2026
@IvanBM18 IvanBM18 marked this pull request as ready for review May 26, 2026 06:14
@IvanBM18 IvanBM18 requested a review from a team as a code owner May 26, 2026 06:14
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py
@IvanBM18 IvanBM18 requested review from Xeicker and jardondiego and removed request for jardondiego May 26, 2026 21:37
Comment thread src/clusterfuzz/_internal/base/feature_flags.py
Comment thread src/clusterfuzz/_internal/cron/schedule_fuzz.py
Collaborator

@letitz letitz left a comment

One substantive comment about the queue size limit being per-OS, otherwise a bunch of small comments.

Comment thread src/clusterfuzz/_internal/swarming/api.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py
Comment thread src/clusterfuzz/_internal/swarming/api.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/tests/core/swarming/api_test.py
Comment thread src/clusterfuzz/_internal/tests/core/swarming/service_test.py Outdated
Comment thread src/clusterfuzz/_internal/tests/core/swarming/service_test.py Outdated
Comment thread src/clusterfuzz/_internal/tests/core/swarming/service_test.py Outdated
Collaborator

@letitz letitz left a comment

LGTM % a couple remaining comments to avoid blocking on me further.

Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/api.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/tests/core/swarming/service_test.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
return True

count = response.count
max_pending_tasks = get_max_size_for_queue(
Collaborator Author

Note for reviewer:
Since the Swarming queue is not a Pub/Sub queue, I extracted some of the logic from the data class into a public function so it can be reused here.

The alternative is to create a PubSubTaskQueue instance for the Swarming queue (I'm not against this!), but if I do so I would also like to rename the PubSubTaskQueue module, since it would be confusing to have an object instance that is not really a Pub/Sub queue.

Collaborator

That makes sense, and it's good enough for me to land this PR. I think ideally you would have something like:

# task_queue.py

class TaskQueueLimiter:
  default_limit: int
  feature_flag: FeatureFlags

  def max_pending_tasks(self) -> int:
    ...


class PubSubTaskQueue:
  name: str
  limiter: TaskQueueLimiter

Since the module name here is still awkward. I don't think you should actually do it though, this is good enough and we have other things to do :)
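
For illustration only, the shape suggested above could be fleshed out roughly like this. It is a sketch under assumptions, not ClusterFuzz code: the `read_override` callable is a hypothetical stand-in for a `FeatureFlags` lookup (whose real interface is not shown in this thread), `TaskQueue` is a hypothetical queue-agnostic name, and `is_full` with its inclusive comparison is an added convenience.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TaskQueueLimiter:
  """Resolves the pending-task limit for a queue."""
  default_limit: int
  # Hypothetical stand-in for a feature flag lookup: returns the configured
  # override, or None when no override is set.
  read_override: Callable[[], Optional[int]]

  def max_pending_tasks(self) -> int:
    override = self.read_override()
    if override is None:
      return self.default_limit
    return override


@dataclass(frozen=True)
class TaskQueue:
  """A named queue with a limiter, independent of the queue backend."""
  name: str
  limiter: TaskQueueLimiter

  def is_full(self, pending_count: int) -> bool:
    # Assumption: the queue counts as full once it reaches the limit.
    return pending_count >= self.limiter.max_pending_tasks()
```

With this split, a Swarming-backed queue and a Pub/Sub-backed queue could share the limit logic without either one having to pretend to be the other.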

@IvanBM18 IvanBM18 requested a review from letitz June 1, 2026 04:14
Collaborator

@letitz letitz left a comment

Thanks, we're nearly there!

Comment thread src/clusterfuzz/_internal/swarming/api.py Outdated
Comment thread src/clusterfuzz/_internal/base/tasks/pub_sub_task_queue.py
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/service.py Outdated

@IvanBM18 IvanBM18 requested a review from letitz June 1, 2026 14:18
Collaborator

@letitz letitz left a comment

LGTM % tiny fixes to SwarmingApi.count_tasks().

Comment thread src/clusterfuzz/_internal/swarming/api.py Outdated
Comment thread src/clusterfuzz/_internal/swarming/api.py Outdated
@IvanBM18 IvanBM18 requested a review from javanlacerda June 1, 2026 19:45
Collaborator

@javanlacerda javanlacerda left a comment

lgtm!

@IvanBM18 IvanBM18 merged commit 2b32898 into master Jun 2, 2026
14 checks passed
@IvanBM18 IvanBM18 deleted the feature/swarming/backpressure branch June 2, 2026 19:55

Labels

swarming Changes related to the clusterfuzz-swarming integration

4 participants