[build_manager] Skip unzipping during fuzz task target discovery#5298

Open
PauloVLB wants to merge 1 commit into master from fix/target-discovery-space-optimization
PauloVLB commented May 29, 2026

Collaborator

Context b/500991018 and b/509600495.

The Problem

During the very first target discovery run of a new engine fuzzer job (or after mappings are reset), no target is selected yet (fuzz_target=None / "unknown").

The build manager currently attempts to download and decompress the entire build archive just to list fuzzer target names and exit. These archives are massive: 123 GB on Linux, and up to 397 GB on Windows.

This causes space allocation checks (_make_space) to fail on standard GCE bot disks (75 GB to 200 GB), leaving the job stuck in an infinite crash loop.

The Fix

During fuzz task target discovery runs (where fuzz_target is None for engine jobs with selective unzipping), we bypass zip file extraction entirely. We only open the archive, read fuzzer target names in memory from the catalog index (which takes milliseconds over HTTP without disk allocation), save them to Datastore, and exit early.

Once saved, subsequent runs select a target and run selective unzipping (only ~500 MB), which fits comfortably.
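
The discovery step described above can be sketched roughly as follows. This is a minimal illustration using Python's standard zipfile module, not the actual build_manager code: the function name list_candidate_targets and the "_fuzzer" name suffix are hypothetical, and a small in-memory archive stands in for a real build read over HTTP. The point is only that listing member names reads the archive's index and never writes extracted files to disk.

```python
import io
import zipfile


def list_candidate_targets(archive, suffix="_fuzzer"):
    """Return member names that look like fuzz targets, without extracting.

    Hypothetical helper: the suffix-based filter is an assumption made for
    illustration, not the real target-detection logic.
    """
    with zipfile.ZipFile(archive) as zf:
        # namelist() is served from the zip central directory, so no member
        # is decompressed and nothing is written to disk.
        return sorted(
            name
            for name in zf.namelist()
            if not name.endswith("/")
            and name.rsplit("/", 1)[-1].endswith(suffix)
        )


# Build a tiny in-memory archive to stand in for a real build archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("out/url_parser_fuzzer", b"binary")
    zf.writestr("out/README.txt", b"docs")
    zf.writestr("out/json_fuzzer", b"binary")
buf.seek(0)

targets = list_candidate_targets(buf)
print(targets)
```

Any seekable file-like object works here, which is why a reader that fetches byte ranges on demand could replace the in-memory buffer without changing the listing code.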

Impact in other workflows

To ensure this optimization has zero impact on other active workflows, the bypass is protected by six guards:

  1. not self.fuzz_target: Restricts only to target-discovery runs (where no target is selected yet).
  2. not self._unpack_everything: Restricts only when selective target unzipping is enabled (if a job disables selective unzipping, it requires a full unpack of all targets, so we must not bypass).
  3. environment.is_engine_fuzzer_job(): Restricts only to engine fuzzers (blackbox fuzzer jobs don't selective-unpack and always need to fully unzip their application binaries).
  4. environment.get_value('TASK_NAME') == 'fuzz': Restricts only to fuzzing tasks (progression/regression tasks working on crashes always need their target build unpacked to disk to run reproductions).
  5. not self.build_prefix: Restricts only to the primary target build and not supporting extra engine binary packages (which must always be fully unpacked to disk).
  6. environment.platform() == 'WINDOWS': Restricts this bypass strictly to Windows bots. This limits the production rollout blast radius exclusively to the Windows platform to resolve the Windows bot crash block (b/509600495) with absolute safety.
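
Taken together, the guards above amount to a single conjunction. The sketch below restates them as a standalone predicate so the combinations are easy to check; DiscoveryContext and should_skip_unpack are hypothetical names introduced for illustration, and the fields are plain values standing in for the build attributes and environment lookups named in the list.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DiscoveryContext:
    """Hypothetical stand-in for build attributes and environment state."""
    fuzz_target: Optional[str]
    unpack_everything: bool
    is_engine_fuzzer_job: bool
    task_name: str
    build_prefix: str
    platform: str


def should_skip_unpack(ctx):
    """Return True only when all six guards hold."""
    return (
        not ctx.fuzz_target               # 1. no target selected yet
        and not ctx.unpack_everything     # 2. selective unzipping enabled
        and ctx.is_engine_fuzzer_job      # 3. engine fuzzer job
        and ctx.task_name == "fuzz"       # 4. fuzz task only
        and not ctx.build_prefix          # 5. primary build only
        and ctx.platform == "WINDOWS"     # 6. Windows bots only
    )


discovery = DiscoveryContext(
    fuzz_target=None,
    unpack_everything=False,
    is_engine_fuzzer_job=True,
    task_name="fuzz",
    build_prefix="",
    platform="WINDOWS",
)

print(should_skip_unpack(discovery))
print(should_skip_unpack(replace(discovery, fuzz_target="url_parser_fuzzer")))
print(should_skip_unpack(replace(discovery, platform="LINUX")))
```

Flipping any single field away from the discovery case makes the predicate false, which is the property the unit tests would need to cover for each guard.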

Note on Testing

Unit tests were updated and a new discovery test was added. All tests are passing. Note that this cannot be easily tested in dev because we don't have working local Windows bots running right now, but the logic is fully covered by unit tests.

PauloVLB requested a review from a team as a code owner on May 29, 2026 13:46
PauloVLB force-pushed the fix/target-discovery-space-optimization branch from e3c6c9b to d1f75c6 on May 29, 2026 17:50
PauloVLB force-pushed the fix/target-discovery-space-optimization branch from d1f75c6 to 3be95e5 on May 29, 2026 18:27
letitz commented Jun 1, 2026

Collaborator

Drive-by comments.

Context b/500991018 and b/509600495.

The Problem

During the very first target discovery run of a new engine fuzzer job (or after mappings are reset), no target is selected yet (fuzz_target=None / "unknown").

The build manager was currently attempting to download and uncompress the entire build archive (which are massive: 123 GB on Linux, and up to 397 GB on Windows) just to list fuzzer target names and exit.

This causes space allocation checks (_make_space) to fail on standard GCE bot disks (75GB - 200GB), leaving the job stuck in an infinite crash loop.

The Fix

During fuzz task target discovery runs (where fuzz_target is None for engine jobs with selective unzipping), we bypass zip file extraction entirely. We only open the archive, read fuzzer target names in memory from the catalog index (which takes milliseconds over HTTP without disk allocation), save them to Datastore, and exit early.

Note that checking for fuzz targets in the archive will still involve unzipping a significant portion of the contents of the archive (over HTTP, in-memory only), because fuzzer_utils.is_fuzz_target() checks the contents of a file when it is unsure about it.

@notvictorl is working on improving this in crbug.com/508214240, so for chrome archives we will soon only need to unzip a tiny json file.

Once saved, subsequent runs select a target and run selective unzipping (only ~500 MB), which fits comfortably.

Impact in other workflows

To ensure this optimization has zero impact on other active workflows, the bypass is protected by five guards:

  1. not self.fuzz_target: Restricts only to target-discovery runs (where no target is selected yet).
  2. not self._unpack_everything: Restricts only when selective target unzipping is enabled (if a job disables selective unzipping, it requires a full unpack of all targets, so we must not bypass).
  3. environment.is_engine_fuzzer_job(): Restricts only to engine fuzzers (blackbox fuzzer jobs don't selective-unpack and always need to fully unzip their application binaries).

Blackbox fuzzer jobs don't have a concept of fuzz targets to discover anyway.

  4. environment.get_value('TASK_NAME') == 'fuzz': Restricts only to fuzzing tasks (progression/regression tasks working on crashes always need their target build unpacked to disk to run reproductions).

Same here: progression and regression tasks never need to discover fuzz targets anyway, do they? They should always run with a specific fuzz target.

  5. not self.build_prefix: Restricts only to the primary target build and not supporting extra engine binary packages (which must always be fully unpacked to disk).
  6. environment.platform() == 'WINDOWS': Restricts this bypass strictly to Windows bots. This limits the production rollout blast radius exclusively to the Windows platform to resolve the Windows bot crash block (b/509600495) with absolute safety.

This bug has been affecting Linux bots too.

Note on Testing

Unit tests were updated and a new discovery test was added. All tests are passing. Note that this cannot be easily tested in dev because we don't have working local Windows bots running right now, but the logic is fully covered by unit tests.

IIRC we hit this issue in dev also. @notvictorl will remember the details, if I'm not hallucinating :)
