
win: improve PATH search worst-case performance #5120

Open
squeek502 wants to merge 1 commit into libuv:v1.x from squeek502:win-wildcard-path-search

@squeek502
Contributor

@squeek502 squeek502 commented Apr 12, 2026

Take advantage of FindFirstFileExW and wildcards to reduce the number of Win32 API calls used per-directory when searching the PATH. Instead of <number of allowed extensions> GetFileAttributesW calls per directory in the PATH, each entry is now first checked for viability using a FindFirstFileExW call, and then only if matches are found will it check each possibility using GetFileAttributesW.

This does not affect average case performance at all, but it does make an impact when both (a) many extensions are allowed, and (b) the file is not found early in the PATH search. The impact also scales with the size of the PATH (the more entries, the larger the potential impact).


benchmark code
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include "uv.h"

uv_loop_t *loop;
uv_process_t child_req;
uv_process_options_t options;

void on_exit(uv_process_t *req, int64_t exit_status, int term_signal) {
    uv_close((uv_handle_t*) req, NULL);
}

int main() {
    loop = uv_default_loop();

    for (int i=0; i<100; i++) {
        char* args[2];
        args[0] = "hello";
        args[1] = NULL;

        options.exit_cb = on_exit;
        options.file = "hello";
        options.args = args;

        int r;
        if ((r = uv_spawn(loop, &child_req, &options))) {
            fprintf(stderr, "%s\n", uv_strerror(r));
            return 1;
        }

        uv_run(loop, UV_RUN_DEFAULT);
    }

    return uv_run(loop, UV_RUN_DEFAULT);
}

For my testing, hello.exe is found in the 25th entry in PATH.

No difference with default spawn flags:

Benchmark 1: bench-getattributes.exe
  Time (mean ± σ):     313.2 ms ±   1.7 ms    [User: 34.9 ms, System: 260.6 ms]
  Range (min … max):   311.5 ms … 316.6 ms    10 runs

Benchmark 2: bench-findfirst.exe
  Time (mean ± σ):     312.4 ms ±   3.6 ms    [User: 23.9 ms, System: 259.1 ms]
  Range (min … max):   308.1 ms … 321.0 ms    10 runs

Summary
  'bench-findfirst.exe' ran
    1.00 ± 0.01 times faster than 'bench-getattributes.exe'

With the changes in #5096 and UV_PROCESS_WINDOWS_RESOLVE_BATCH set, though, there is a difference since the extra 2 possible extensions add 2 * <number of path entries searched> extra GetFileAttributesW calls:

Benchmark 1: bench-allpathext-getattributes.exe
  Time (mean ± σ):     394.7 ms ±   3.0 ms    [User: 60.6 ms, System: 327.2 ms]
  Range (min … max):   390.6 ms … 398.3 ms    10 runs

Benchmark 2: bench-allpathext-findfirst.exe
  Time (mean ± σ):     312.4 ms ±   3.6 ms    [User: 52.8 ms, System: 235.0 ms]
  Range (min … max):   309.1 ms … 318.8 ms    10 runs

Summary
  'bench-allpathext-findfirst.exe' ran
    1.26 ± 0.02 times faster than 'bench-allpathext-getattributes.exe'
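
As a rough way to see where this difference comes from, the number of probes each strategy issues can be modeled with simple arithmetic. This is a hypothetical sketch, not code from the PR: the function names are made up, it assumes no PATH entry before the one holding the file contains anything matching the wildcard, and it treats one wildcard probe as costing about the same as one GetFileAttributesW call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not libuv code.
 *   found_at  : 1-based index of the PATH entry that holds the file
 *   num_exts  : number of candidate names tried per entry
 *   hit_index : 1-based position of the candidate that exists */

/* One GetFileAttributesW-style probe per candidate in every entry searched. */
size_t probes_per_extension(size_t found_at, size_t num_exts, size_t hit_index) {
  assert(found_at >= 1 && hit_index >= 1 && hit_index <= num_exts);
  return (found_at - 1) * num_exts + hit_index;
}

/* One wildcard probe per entry searched, then per-candidate probes only in
 * the entry where the wildcard reported a match. */
size_t probes_wildcard_first(size_t found_at, size_t num_exts, size_t hit_index) {
  assert(found_at >= 1 && hit_index >= 1 && hit_index <= num_exts);
  return found_at + hit_index;
}
```

With illustrative numbers (file in the 25th entry, 5 candidates, the 3rd one hits), this gives 24 * 5 + 3 = 123 probes for the per-extension strategy versus 25 + 3 = 28 for wildcard-first. Going from 3 to 5 candidates in the per-extension strategy adds 2 * 24 = 48 probes, which is the "2 extra calls per entry searched" effect described above.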

The motivation for this is twofold:


I'm fairly certain the implementation could be further improved by swapping out FindFirstFileExW/etc. for CreateFileW plus the lower-level NtQueryDirectoryFile, since that allows for more control over the buffers used, memory allocated, etc. (and this is what libuv's scandir already does). I'll probably try that out, but I don't think it needs to be a blocker.

EDIT: I tried this and it didn't affect performance at all

Member

@vtjnash vtjnash left a comment

I'm not too keen on reimplementing this lookup manually in libuv. I expect that FindNextFileW can often end up much slower since it always has to read the whole directory, instead of using the btree for guaranteed performance.

@squeek502
Contributor Author

squeek502 commented May 14, 2026

> I expect that FindNextFileW can often end up much slower since it always has to read the whole directory, instead of using the btree for guaranteed performance.

It should be possible to check this assumption.

  • Create a directory with 1 million randomly generated filenames (base64 of 12 bytes)
  • Compare running FindFirstFileExW(L"foo*") in that directory vs 5 GetFileAttributesW calls with foo/foo.com/foo.exe/foo.bat/foo.cmd
Test code

findfirst.c

#include <windows.h>
#include <stdio.h>

int main() {
  HANDLE find;
  WIN32_FIND_DATAW find_data;
  int num_iterations = 0;

  find = FindFirstFileExW(L"foo*", FindExInfoBasic, &find_data, FindExSearchNameMatch, NULL, 0);
  if (find == INVALID_HANDLE_VALUE) {
    goto end;
  }
  do {
    if (find_data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) continue;
    num_iterations++;
  } while (FindNextFileW(find, &find_data));
  FindClose(find);

end:
  printf("%d\n", num_iterations);
  return 0;
}

getattr.c

#include <windows.h>
#include <stdio.h>

int main() {
  DWORD attrs;

  attrs = GetFileAttributesW(L"foo");
  attrs = GetFileAttributesW(L"foo.com");
  attrs = GetFileAttributesW(L"foo.exe");
  attrs = GetFileAttributesW(L"foo.bat");
  attrs = GetFileAttributesW(L"foo.cmd");

  printf("5\n");
  return 0;
}

populate.zig (to generate the 1 million files)

const std = @import("std");

pub fn main(init: std.process.Init) !void {
    const io = init.io;

    const random_bytes_count = 12;
    const sub_path_len = comptime std.base64.url_safe.Encoder.calcSize(random_bytes_count);

    for (0..1_000_000) |_| {
        var random_bytes: [random_bytes_count]u8 = undefined;
        io.random(&random_bytes);
        var sub_path: [sub_path_len]u8 = undefined;
        _ = std.base64.url_safe.Encoder.encode(&sub_path, &random_bytes);

        const file = try std.Io.Dir.cwd().createFile(io, sub_path[0..], .{});
        defer file.close(io);
    }
}

Results:

> hyperfine --warmup 1 ..\build\Release\findfirst.exe ..\build\Release\getattr.exe
Benchmark 1: ..\build\Release\findfirst.exe
  Time (mean ± σ):       3.6 ms ±   0.4 ms    [User: 4.8 ms, System: 5.8 ms]
  Range (min … max):     2.8 ms …   5.1 ms    128 runs

Benchmark 2: ..\build\Release\getattr.exe
  Time (mean ± σ):       3.6 ms ±   0.4 ms    [User: 4.8 ms, System: 4.7 ms]
  Range (min … max):     2.7 ms …   4.7 ms    132 runs

Summary
  '..\build\Release\getattr.exe' ran
    1.03 ± 0.15 times faster than '..\build\Release\findfirst.exe'

The two approaches perform the same, so it appears likely that the FindFirstFile/FindNextFile implementation has some optimizations for trailing wildcards that allow it to avoid full iteration.


However, this does introduce the possibility for a new worst-case whenever there are a ton of files that match the wildcard search.

  • Create a directory with 10 thousand randomly generated filenames that all start with foo and then are followed by the base64 of 12 bytes (e.g. fooF6K7l7jJSPNA4O6a)
  • Run the same test as above in that directory
Test code

populate-foo.zig

const std = @import("std");

pub fn main(init: std.process.Init) !void {
    const io = init.io;

    const random_bytes_count = 12;
    const sub_path_len = comptime std.base64.url_safe.Encoder.calcSize(random_bytes_count);

    var sub_path: [sub_path_len + 3]u8 = undefined;
    @memcpy(sub_path[0..3], "foo");

    for (0..100_000) |_| {
        var random_bytes: [random_bytes_count]u8 = undefined;
        io.random(&random_bytes);
        _ = std.base64.url_safe.Encoder.encode(sub_path[3..], &random_bytes);

        const file = try std.Io.Dir.cwd().createFile(io, sub_path[0..], .{});
        defer file.close(io);
    }
}

Now there's an extreme effect:

> hyperfine --warmup 1 ..\build\Release\findfirst.exe ..\build\Release\getattr.exe
Benchmark 1: ..\build\Release\findfirst.exe
  Time (mean ± σ):      98.0 ms ±   0.9 ms    [User: 6.4 ms, System: 97.5 ms]
  Range (min … max):    96.9 ms … 100.8 ms    25 runs

Benchmark 2: ..\build\Release\getattr.exe
  Time (mean ± σ):       3.4 ms ±   0.3 ms    [User: 2.9 ms, System: 7.7 ms]
  Range (min … max):     2.7 ms …   4.4 ms    124 runs

Summary
  '..\build\Release\getattr.exe' ran
   28.73 ± 2.60 times faster than '..\build\Release\findfirst.exe'

For context, this whole approach was inspired by using NtTrace to check how PATH searching is done from within a batch script, so e.g.

test.bat

@echo off
foo

Running nttrace test.bat will show something like:

NtOpenFile( FileHandle=0x41392fe320 [0x138], DesiredAccess=SYNCHRONIZE|0x1, ObjectAttributes="\??\C:\Users\Ryan\Programming\Luvit\libuv-test\", IoStatusBlock=0x41392fe388 [0/1], ShareAccess=7, OpenOptions=0x4021 ) => 0
NtQueryDirectoryFileEx( FileHandle=0x138, Event=0, ApcRoutine=null, ApcContext=null, IoStatusBlock=0x41392fe388, FileInformation=0x41392fe3d0, Length=0x268, FileInformationClass=2 [FileFullDirectoryInformation], QueryFlags=2, FileName="foo"*" ) => 0xc000000f [2 'The system cannot find the file specified.']
NtClose( Handle=0x138 ) => 0

for every entry in PATH.

Something I missed, though, is that it doesn't fall into this worst-case trap when there are a ton of files that match the wildcard--it only uses the wildcard to determine if any match exists, and then it switches to non-wildcard searching:

NtQueryDirectoryFileEx( FileHandle=0x144, Event=0, ApcRoutine=null, ApcContext=null, IoStatusBlock=0xc3b10fe848 [0/0x52], FileInformation=0xc3b10fe890, Length=0x268, FileInformationClass=2 [FileFullDirectoryInformation], QueryFlags=2, FileName="foo"*" ) => 0
NtClose( Handle=0x144 ) => 0
NtOpenFile( FileHandle=0xc3b10fe7e0 [0x144], DesiredAccess=SYNCHRONIZE|0x1, ObjectAttributes="\??\C:\Users\Ryan\Programming\Luvit\libuv-test\worstcase\", IoStatusBlock=0xc3b10fe848 [0/1], ShareAccess=7, OpenOptions=0x4021 ) => 0
NtQueryDirectoryFileEx( FileHandle=0x144, Event=0, ApcRoutine=null, ApcContext=null, IoStatusBlock=0xc3b10fe848, FileInformation=0xc3b10fe890, Length=0x268, FileInformationClass=2 [FileFullDirectoryInformation], QueryFlags=2, FileName="foo.COM" ) => 0xc000000f [2 'The system cannot find the file specified.']
NtClose( Handle=0x144 ) => 0
NtOpenFile( FileHandle=0xc3b10fe7e0 [0x144], DesiredAccess=SYNCHRONIZE|0x1, ObjectAttributes="\??\C:\Users\Ryan\Programming\Luvit\libuv-test\worstcase\", IoStatusBlock=0xc3b10fe848 [0/1], ShareAccess=7, OpenOptions=0x4021 ) => 0
NtQueryDirectoryFileEx( FileHandle=0x144, Event=0, ApcRoutine=null, ApcContext=null, IoStatusBlock=0xc3b10fe848, FileInformation=0xc3b10fe890, Length=0x268, FileInformationClass=2 [FileFullDirectoryInformation], QueryFlags=2, FileName="foo.EXE" ) => 0xc000000f [2 'The system cannot find the file specified.']
NtClose( Handle=0x144 ) => 0

The use of NtQueryDirectoryFileEx for this non-wildcard searching may not be ideal, though, as including test.bat in our worst-case benchmark doesn't show the full expected improvement (although there are potentially a lot of confounders):

Modified test.bat to only check the CWD rather than all of PATH:

@echo off
.\foo

> hyperfine -i --warmup 1 ..\build\Release\findfirst.exe ..\build\Release\getattr.exe ..\test.bat
Benchmark 1: ..\build\Release\findfirst.exe
  Time (mean ± σ):      97.9 ms ±   1.0 ms    [User: 6.5 ms, System: 95.6 ms]
  Range (min … max):    95.9 ms … 100.1 ms    25 runs

Benchmark 2: ..\build\Release\getattr.exe
  Time (mean ± σ):       3.6 ms ±   0.6 ms    [User: 4.7 ms, System: 4.7 ms]
  Range (min … max):     2.6 ms …   5.8 ms    128 runs

Benchmark 3: ..\test.bat
  Time (mean ± σ):      26.6 ms ±   0.3 ms    [User: 4.2 ms, System: 26.4 ms]
  Range (min … max):    25.9 ms …  27.9 ms    65 runs

Summary
  '..\build\Release\getattr.exe' ran
    7.31 ± 1.11 times faster than '..\test.bat'
   26.88 ± 4.08 times faster than '..\build\Release\findfirst.exe'

Ultimately, I think it should be possible to get a best-of-both-worlds out of this. Here's what I'm thinking:

  • Use CreateFileW/NtQueryDirectoryFile instead of FindFirstFileExW/FindNextFileW to have control over the buffer and iteration (instead of the one-entry-per-iteration API of the kernel32 functions)
  • Use CreateFileW/NtQueryDirectoryFile to rule in/out a full directory (if it fails, then we can just move on; this is what I see as the primary benefit of this whole approach)
  • Check the found entries from the initial NtQueryDirectoryFile call as is done in this PR currently, except add a cap on the number of NtQueryDirectoryFile calls (potentially a max of 1 if that's feasible), so something like:
    • if there are more entries than could fit in the first N buffers, then fall back to using individual GetFileAttributesW calls instead to avoid the above worst case

Will try that out and report back, unless you still prefer the "N GetFileAttributesW calls" approach (where N is the number of allowed extensions, so that'd be a max of 5 per PATH entry if #5096 is merged).

@vtjnash
Member

vtjnash commented May 14, 2026

People canonically assume it will be PATHEXT (https://gitlab.kitware.com/cmake/cmake/-/work_items/23992), which is around 11 items or more: https://superuser.com/questions/1027078/what-is-the-default-value-of-the-pathext-environment-variable-for-windows. I like that option, though I don't think we should necessarily try to use the result either, since it would only be sufficient if it matches the first entry in the list, but the first entry in PATHEXT is .com and unlikely to ever match

@squeek502
Contributor Author

squeek502 commented May 14, 2026

Ah, right, without full iteration we can't fully know if a higher priority entry exists (FindFirstFileExW/NtQueryDirectoryFile iteration order is undefined and depends on the underlying filesystem). So, it'd basically be:

  • Use wildcard search to rule each PATH entry in/out for possible matches
  • Use GetFileAttributesW calls to actually find the best match (for directories that have at least one match)

EDIT: Getting STATUS_NO_MORE_FILES from NtQueryDirectoryFile should be sufficient in determining if all matches were iterated, so the hybrid approach in my previous comment would actually work but may not be worth it.
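
To make the completeness point concrete, here is a minimal sketch of the decision the hybrid approach would need to make once a bounded wildcard enumeration has returned. Everything here is hypothetical (the `enum_result_t` type, the `PICK_*` values and `pick_candidate` are not libuv or Win32 names), the enumerated names are assumed to already exclude directories, and the case-insensitive comparison only handles ASCII.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result of a bounded wildcard enumeration of one directory. */
typedef struct {
  const char *const *names; /* matches seen, in filesystem-defined order */
  size_t count;
  int complete;             /* nonzero if the enumeration reached its end
                               (the STATUS_NO_MORE_FILES case) */
} enum_result_t;

enum { PICK_NONE = -1, PICK_FALLBACK = -2 };

/* ASCII-only case-insensitive equality; a simplification of how Windows
 * compares file names. */
static int eq_nocase(const char *a, const char *b) {
  while (*a && *b) {
    if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
    a++;
    b++;
  }
  return *a == *b;
}

/* Returns the index into `candidates` (ordered by priority) of the best
 * match, PICK_NONE if the directory is ruled out, or PICK_FALLBACK if the
 * enumeration was cut short and exact probes are needed instead. */
int pick_candidate(const enum_result_t *r,
                   const char *const *candidates, size_t ncandidates) {
  assert(r != NULL);
  /* A truncated listing cannot prove that no higher-priority name exists. */
  if (!r->complete) return PICK_FALLBACK;
  for (size_t c = 0; c < ncandidates; c++)
    for (size_t i = 0; i < r->count; i++)
      if (eq_nocase(r->names[i], candidates[c])) return (int)c;
  return PICK_NONE;
}
```

The outer loop walks the candidates in priority order, so the enumeration order of the filesystem does not influence the result as long as the listing is complete.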

Take advantage of FindFirstFileExW and wildcards to reduce the number of
Win32 API calls used per-directory when searching the PATH. Instead of
<number of allowed extensions> GetFileAttributesW calls per directory
in the PATH, each entry is now first checked for viability using a
FindFirstFileExW call, and then only if matches are found will it check
each possibility using GetFileAttributesW.

This does not affect average case performance at all, but it does make
an impact when both (a) many extensions are allowed, and (b) the file
is not found early in the PATH search. The impact also scales with the
size of the PATH (the more entries, the larger the potential impact).

@squeek502 squeek502 force-pushed the win-wildcard-path-search branch from 498d3d8 to 362caa9 on May 14, 2026 23:47
@squeek502
Contributor Author

squeek502 commented May 15, 2026

Went with the simpler approach (just use FindFirstFileExW to check for viability, ignore the result otherwise).
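
For illustration, the overall shape of that simpler approach across a whole PATH can be sketched as below. This is not the PR's code: `probes_t`, `search_path` and the in-memory `fake_*` stand-ins are hypothetical names so the logic can be exercised without Win32. On Windows, `any_match` would presumably wrap FindFirstFileExW on `<dir>\<name>*` and `exists` would wrap GetFileAttributesW on `<dir>\<name><ext>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical probe interface plus an in-memory stand-in file list. */
typedef struct probes {
  int (*any_match)(struct probes *p, const char *dir, const char *name);
  int (*exists)(struct probes *p, const char *dir, const char *name,
                const char *ext);
  const char *const *files; /* "dir/file" strings (case-sensitive here) */
  size_t nfiles;
  size_t wildcard_calls, exact_calls;
} probes_t;

/* Stand-in for the wildcard probe: does anything in dir start with name? */
int fake_any_match(probes_t *p, const char *dir, const char *name) {
  char prefix[256];
  snprintf(prefix, sizeof prefix, "%s/%s", dir, name);
  p->wildcard_calls++;
  for (size_t i = 0; i < p->nfiles; i++)
    if (strncmp(p->files[i], prefix, strlen(prefix)) == 0) return 1;
  return 0;
}

/* Stand-in for the exact probe: does dir/name+ext exist? */
int fake_exists(probes_t *p, const char *dir, const char *name,
                const char *ext) {
  char path[256];
  snprintf(path, sizeof path, "%s/%s%s", dir, name, ext);
  p->exact_calls++;
  for (size_t i = 0; i < p->nfiles; i++)
    if (strcmp(p->files[i], path) == 0) return 1;
  return 0;
}

/* Returns the index of the winning directory and stores the index of the
 * winning extension in *ext_out, or returns -1 when nothing matches. */
int search_path(probes_t *p, const char *const *dirs, size_t ndirs,
                const char *name, const char *const *exts, size_t nexts,
                size_t *ext_out) {
  assert(p != NULL && ext_out != NULL);
  for (size_t d = 0; d < ndirs; d++) {
    /* The wildcard probe only rules the directory in or out. */
    if (!p->any_match(p, dirs[d], name)) continue;
    /* Exact probes in priority order decide the actual winner. */
    for (size_t e = 0; e < nexts; e++) {
      if (p->exists(p, dirs[d], name, exts[e])) {
        *ext_out = e;
        return (int)d;
      }
    }
  }
  return -1;
}
```

Because the wildcard result is discarded after the yes/no check, a directory holding many `foo*` files costs one wildcard probe plus at most one exact probe per extension, rather than a full iteration.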

All the info (and benchmark results) in the OP is still accurate (just made some minor updates to the description).

(btw in terms of order of operations, I think it'd make the most sense to merge #5096 first since this PR kind of assumes those changes are going to be made [support for finding .bat/.cmd files], and then I'll be able to resolve the resulting conflicts in this PR)

EDIT: For completeness, here's some confirmation that this approach avoids the worst case:

Test code

hybrid.c

#include <windows.h>
#include <stdio.h>

int main() {
  HANDLE find;
  WIN32_FIND_DATAW find_data;
  DWORD attrs;
  int num_iterations = 0;

  find = FindFirstFileExW(L"foo*", FindExInfoBasic, &find_data, FindExSearchNameMatch, NULL, 0);
  if (find == INVALID_HANDLE_VALUE) {
    goto end;
  }

  attrs = GetFileAttributesW(L"foo");
  attrs = GetFileAttributesW(L"foo.com");
  attrs = GetFileAttributesW(L"foo.exe");
  attrs = GetFileAttributesW(L"foo.bat");
  attrs = GetFileAttributesW(L"foo.cmd");

end:
  printf("6\n");
  return 0;
}

> hyperfine --warmup 1 ..\build\Release\findfirst.exe ..\build\Release\getattr.exe ..\build\Release\hybrid.exe
Benchmark 1: ..\build\Release\findfirst.exe
  Time (mean ± σ):      97.2 ms ±   0.8 ms    [User: 6.5 ms, System: 92.9 ms]
  Range (min … max):    95.6 ms …  98.9 ms    26 runs

Benchmark 2: ..\build\Release\getattr.exe
  Time (mean ± σ):       3.6 ms ±   0.5 ms    [User: 4.4 ms, System: 4.6 ms]
  Range (min … max):     2.4 ms …   5.2 ms    128 runs

Benchmark 3: ..\build\Release\hybrid.exe
  Time (mean ± σ):       3.2 ms ±   0.5 ms    [User: 4.4 ms, System: 4.4 ms]
  Range (min … max):     2.3 ms …   5.3 ms    133 runs

Summary
  '..\build\Release\hybrid.exe' ran
    1.10 ± 0.24 times faster than '..\build\Release\getattr.exe'
   29.94 ± 4.80 times faster than '..\build\Release\findfirst.exe'
