perf: use get_unchecked for TwoWaySearcher#155607
Conversation
This comment has been minimized.
This comment has been minimized.
6d37582 to
40692f5
Compare
|
rustbot has assigned @Mark-Simulacrum. Use Why was this reviewer chosen?The reviewer was selected based on:
|
There was a problem hiding this comment.
Didn't review past the first block. In general it seems promising that you're seeing good perf improvement but I'm also hesitant to approve unsafe code in these fairly tricky string searching routines. I'm not sure how thorough our test coverage is on boundary conditions that would have caused panics in the indexing before... at minimum I don't think we have any fuzzing which would normally help build confidence.
cc @BurntSushi, I'm wondering if the out-of-tree(?) implementations for ripgrep/regex of the algorithms here have these optimizations applied?
| // `self.position + i < haystack.len()`. | ||
| // Every path that mutates `self.position` below either returns or re-enters `'search`, | ||
| // which re-runs the check before reaching the loop again. | ||
| // `i < needle.len()` also guarantees `needle.get_unchecked(i)` is safe. |
There was a problem hiding this comment.
Can you benchmark just changing the haystack indexing? It seems like LLVM should be able to make use of the for loops range being bounded by needle.len() to avoid that indexing actually staying in the final program.
The haystack bit (self.position + needle.len() <= haystack.len()) doesn't seem quite sufficient to me without deeper scrutiny of the underlying algorithm, since self.position is getting updated by a non-obviously +1 on each iteration... and if it was a +1 on each iteration, I'd guess LLVM would be able to simplify this itself?
There was a problem hiding this comment.
Sure, I'll give that a go!
There was a problem hiding this comment.
Sorry for taking so long to get back to you on this... Lots of things going on. Anyway, here are the results.
I have three versions of the code (that I pushed to the branch for now):
- The baseline
- My initial diff with
get_uncheckedon both haystack and needle - The partial diff with
get_uncheckedon haystack only (needle[i]stays checked)
One of the reasons I took so much time to do this is that running the benchmarks on my macOS laptop kept yielding noisy results and I wasn't sure exactly what to think about them, so I eventually moved to an Ubuntu box.
The results were sort of disappointing there (and I think that's exactly what you expected with LLVM figuring things out on its own) -- the diff is pretty much 0 for all existing benchmarks (although the improvement seems to be consistently showing on macOS).
... But maybe there was something else.
I decided to add different benchmarks.
The existing ones use a 42 bytes haystack and a 1 character needle with starts_with / ends_with. This actually doesn't do much with TwoWaySearcher.
Since the proposed changes optimise the inner loop in TwoWaySearcher::next (and TwoWaySearcher::next_back) which matter mostly for find / rfind with longer needles on longer haystacks, I added benchmarks that exercise that path. (So it's a ~30 KB haystack, and a 20 chars needle).
(The new benchmarks are on the branch as well.)
Those are the results (run on Linux).
| Benchmark | main |
Partial | All changes | Partial vs mainline | All vs mainline |
|---|---|---|---|---|---|
find_str |
14,163 | 8,190 | 7,017 | −42% | −50% |
rfind_str |
23,360 | 11,800 | 8,306 | −49% | −64% |
find_str_worst_case |
1,103 | 1,104 | 1,399 | ~0% | +27% |
rfind_str_worst_case |
61,031 | 36,630 | 35,380 | −40% | −42% |
So what it seems like is that
- Having all the changes leads to slightly better results in most cases but a regression in one benchmark (and a somewhat significant one at that)
- Keeping only the changes you suggested shows smaller improvements in most cases but regresses on no benchmark.
... So it might be worth pursuing this, but it also probably is worth pursuing it only partially.
|
@Mark-Simulacrum Thanks for taking a look! I appreciate it.
Is there anything I could do to increase this confidence (regarding testing on edge cases and/or fuzzing)? If that requires additional work somewhere else and you're open to me contributing there, I'm definitely open to. As I mentioned in the description, this function is currently a hotspot of ours and we'd really like to see it go faster. |
40692f5 to
9c879de
Compare
|
This comment has been minimized.
This comment has been minimized.
9c879de to
4627283
Compare
This comment has been minimized.
This comment has been minimized.
| // loop applies: `haystack.get(self.position + needle_last)` at the | ||
| // top of `'search` established the bound for this iteration, and every mutation | ||
| // of `self.position` is followed by `continue 'search` (which re-runs the check) | ||
| // or a `return` on match. |
There was a problem hiding this comment.
I think this needs to say something about start / self.crit_pos, no? Otherwise it's not clear to me how to determine this is actually sound from this SAFETY comment.
There was a problem hiding this comment.
Does this sound better?
// SAFETY: on every iteration of `'search`, the `haystack.get(self.position + needle_last)`
// check returned `Some`, so `self.position + needle_last < haystack.len()`.
// Since `i < self.crit_pos <= needle.len()`, we have `i <= needle_last`, and thus
// `self.position + i <= self.position + needle_last < haystack.len()`.
// Every path that mutates `self.position` below either returns or re-enters `'search`,
// which re-runs the check before reaching the loop again.
I initially was a bit handwavy in the interest of non being too long-winded; this version should be more explicit about why this is correct.
There was a problem hiding this comment.
I think that seems better. Thanks!
| // `self.position + i < haystack.len()`. | ||
| // Every path that mutates `self.position` below either returns or re-enters `'search`, | ||
| // which re-runs the check before reaching the loop again. | ||
| if needle[i] != unsafe { *haystack.get_unchecked(self.position + i) } { |
There was a problem hiding this comment.
The rationale here makes sense to me. I do wonder if we could tell LLVM enough that it would see it -- e.g. maybe subslicing needle and then having haystack[self.position..][..needle_slice.len()] would be enough? But I'm OK with this one as-is too.
There was a problem hiding this comment.
I tried to do...
let haystack_window = &haystack[self.position..][..needle.len()];... to then use it in the condition (I believe that's what you meant)...
if needle[i] != haystack_window[i] {... but the benchmarks results ended up being way worse than the implementation currently on the PR (on the newly added benchmarks):
| Benchmark | Latest version of the PR | LLVM trick | Delta |
|---|---|---|---|
find_str |
8,190 ns | 16,167 ns | +97% |
rfind_str |
11,800 ns | 10,539 ns | −11% |
find_str_worst_case |
1,104 ns | 2,070 ns | +87% |
rfind_str_worst_case |
36,630 ns | 38,000 ns | +4% |
This comment has been minimized.
This comment has been minimized.
|
r=me with tidy fixed |
|
Trying to fix the tidy job by adding |
…d-for-pattern, r=Mark-Simulacrum perf: use `get_unchecked` for `TwoWaySearcher` ## What is this PR? *This is related to #27721.* This PR is a proposal for a performance improvement in `std::pattern`. Profiling of [https://github.com/quickwit-oss/quickwit](https://github.com/quickwit-oss/quickwit) in production shows that `TwoWaySearcher::next` is one of the most CPU-time-consuming functions, so I thought I would give it a look. I read the [contribution guide](https://std-dev-guide.rust-lang.org/development/perf-benchmarking.html) and this seems to be a fitting proposal. It seems like `TwoWaySearcher::next` and `TwoWaySearcher::next_back` could be made faster by using `get_unchecked` in the inner loop comparisons instead of regular indexing, which is safe in the conditions where it would be done (indices are within bounds by construction). I added some `SAFETY` comments in the code to explain why this is safe, as I believe is customary in those cases (and according to [this page as well](https://std-dev-guide.rust-lang.org/policy/safety-comments.html)). ### Benchmarks I ran the existing bencharmks before/after the changes (only on my laptop, I can run them in other places if that's necessary). ``` ./x.py bench library/coretests -- pattern:: ``` We seem to be getting a ~7.5-12% performance improvement at a very low cost, which sounds worthwhile to me. But this is the first time I'm proposing a change in Rust, so I'm looking forward to feedback on this. ``` BEFORE CHANGES pattern::ends_with_char 3398.91ns/iter +/- 526.28 pattern::ends_with_str 3545.04ns/iter +/- 1108.76 pattern::starts_with_char 3348.31ns/iter +/- 352.38 pattern::starts_with_str 3710.59ns/iter +/- 435.57 AFTER CHANGES pattern::ends_with_char 3125.99ns/iter +/- 567.09 (-8.03%) pattern::ends_with_str 3106.43ns/iter +/- 258.33 (-12.38%) pattern::starts_with_char 3094.55ns/iter +/- 595.42 (-7.59%) pattern::starts_with_str 3365.75ns/iter +/- 268.88 (-9.29%) ``` System info for the benchmarks run <details> ``` Based on commit 8317fef rustc 1.96.0-dev binary: rustc commit-hash: unknown commit-date: unknown host: aarch64-apple-darwin release: 1.96.0-dev LLVM version: 22.1.2 Apple M4 Max 16 64 GB ProductName: macOS ProductVersion: 26.3 BuildVersion: 25D125 (this was run on AC and without any heavy load from other apps or whatnot) ``` </details>
|
The job Click to see the possible cause of the failure (guessed by this bot) |
|
💔 Test for f94864f failed: CI. Failed job:
|
|
Flaky rustdoc-gui test? |
This comment has been minimized.
This comment has been minimized.
|
@bors treeclosed- |
|
Tree is now open for merging. |
What is this?This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.Comparing 029c9e1 (parent) -> 06293ff (this PR) Test differencesShow 8 test diffsStage 1
Stage 2
Job group index
Test dashboardRun cargo run --manifest-path src/ci/citool/Cargo.toml -- \
test-dashboard 06293ff2b120aecfc29f84b90a22a743a5b90fef --output-dir test-dashboardAnd then open Job duration changes
How to interpret the job duration changes?Job durations can vary a lot, based on the actual runner instance |
|
Finished benchmarking commit (06293ff): comparison URL. Overall result: ❌✅ regressions and improvements - please read:Our benchmarks found a performance regression caused by this PR. Next Steps:
@rustbot label: +perf-regression Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -2.9%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (secondary 3.4%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -0.2%, secondary -0.1%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 517.684s -> 516.268s (-0.27%) |
|
Pre-merge (with no changes) results were a slight improvement, and the regressions look like noise. Unclear what caused them but marking as triaged. |
View all comments
What is this PR?
This is related to #27721.
This PR is a proposal for a performance improvement in
std::pattern.Profiling of https://github.com/quickwit-oss/quickwit in production shows that
TwoWaySearcher::nextis one of the most CPU-time-consuming functions, so I thought I would give it a look.I read the contribution guide and this seems to be a fitting proposal.
It seems like
TwoWaySearcher::nextandTwoWaySearcher::next_backcould be made faster by usingget_uncheckedin the inner loop comparisons instead of regular indexing, which is safe in the conditions where it would be done (indices are within bounds by construction).I added some
SAFETYcomments in the code to explain why this is safe, as I believe is customary in those cases (and according to this page as well).Benchmarks
I ran the existing bencharmks before/after the changes (only on my laptop, I can run them in other places if that's necessary).
We seem to be getting a ~7.5-12% performance improvement at a very low cost, which sounds worthwhile to me.
But this is the first time I'm proposing a change in Rust, so I'm looking forward to feedback on this.
System info for the benchmarks run
Details