[Optimization] Zero padding in typed copies to help LLVM merge stores #157690
ChuanqiXu9 wants to merge 2 commits into
r? rust-lang/codegen
@bors try @rust-timer queue
// CHECK-LABEL: @via_ptr_write(
#[no_mangle]
pub fn via_ptr_write(dest: &mut MaybeUninit<InnerPadded>) {
    let val = InnerPadded { a: 0, b: 0, c: 0 };
-    let val = InnerPadded { a: 0, b: 0, c: 0 };
+    let val = InnerPadded { a: 0, b: 1, c: 0 };
Could the new test case be merged into one store with your PR?
Finished benchmarking commit (68b77d3): comparison URL.
Overall result: ❌ regressions - please read:
Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf.
Next, please: If you can, justify the regressions found in this try perf run in writing along with @bors rollup=never
Instruction count
Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)
Results (secondary 2.3%)
A less reliable metric. May be of interest, but not used to determine the overall result above.
Cycles
This perf run didn't have relevant results for this metric.
Binary size
Results (primary 0.1%, secondary 0.3%)
A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 518.98s -> 524.616s (1.09%)
What exactly is nesting structs doing that affects this optimization? See #157373 (comment)
For small repr(C) aggregates with padding, direct constant
initialization can still lower into field-wise construction plus
memcpy. That leaves the backend to rediscover that the whole object
is a single constant byte pattern.
This is especially visible for non-zero constant aggregates. Instead
of materializing them as separate field stores, we want codegen_ssa to
emit the packed value directly. For example, a value like
InnerPadded { a: 0, b: 1, c: 0 }
can otherwise lower to something like
store i16 0, ptr %val, align 4
store i8 1, ptr %val_plus_2, align 2
store i32 0, ptr %val_plus_4, align 4
call void @llvm.memcpy(..., ptr %val, ...)
while this change produces
store i64 65536, ptr %val, align 4
call void @llvm.memcpy(..., ptr %val, ...)
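The constant 65536 in the packed store follows from the byte layout. As a minimal sketch, assuming `InnerPadded` is a `repr(C)` struct with `u16`, `u8`, and `u32` fields (the field types are inferred from the `i16`/`i8`/`i32` stores above and are not shown in this PR text), and assuming a little-endian target, `b` sits at byte offset 2, so `b = 1` contributes `1 << 16`. The helper name `little_endian_immediate` is hypothetical.

```rust
use std::mem::offset_of;

// Assumed layout: a at offset 0, b at offset 2, one padding byte at
// offset 3, c at offset 4; total size 8, alignment 4.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct InnerPadded {
    pub a: u16,
    pub b: u8,
    pub c: u32,
}

/// Value of the 8-byte object when read as one little-endian integer,
/// with the padding byte taken to be zero. On a little-endian target a
/// field at byte offset `o` lands at bit position `8 * o`.
pub fn little_endian_immediate(v: InnerPadded) -> u64 {
    (u64::from(v.a) << (8 * offset_of!(InnerPadded, a)))
        | (u64::from(v.b) << (8 * offset_of!(InnerPadded, b)))
        | (u64::from(v.c) << (8 * offset_of!(InnerPadded, c)))
}
```

For `InnerPadded { a: 0, b: 1, c: 0 }` this gives `1 << 16`, which is the 65536 in the single `i64` store.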
Why not solve this in LLVM?
At the problematic lowering point, rustc still knows that the MIR
aggregate is small and fully constant. LLVM only sees a lowered stack
temporary built from per-field stores and then copied out. Recovering
that packed constant there would require rediscovering front-end
aggregate semantics after lowering, so emitting the packed store in
rustc is the simpler and more local fix.
Why not keep the previous typed-copy approach?
An earlier approach zeroed padding on typed copies whose source could
be traced back to a constant assignment. That helped some cases, but it
also widened the optimization to runtime copy paths and could introduce
extra runtime stores purely to maintain padding knowledge. That is not
an acceptable tradeoff here.
Keep the scope narrow instead: only handle direct MIR aggregates whose
fields are all constants, and pack them according to the target
endianness before emitting a single integer store.
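The endianness-aware packing step can be illustrated outside the compiler. The following is only a sketch of the idea, not the code in this PR: `ConstField`, `Endian`, and `pack_const_aggregate` are hypothetical names, and the real implementation works on MIR operands and backend types rather than plain integers.

```rust
/// One constant scalar field: byte offset, size in bytes, and value.
#[derive(Clone, Copy)]
pub struct ConstField {
    pub offset: usize,
    pub size: usize,
    pub value: u64,
}

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Endian {
    Little,
    Big,
}

/// Packs the constant fields of an aggregate of `total` bytes (at most 8)
/// into one integer immediate. Padding bytes are left as zero.
/// Returns `None` if the aggregate or a field does not fit.
pub fn pack_const_aggregate(fields: &[ConstField], total: usize, endian: Endian) -> Option<u64> {
    if total == 0 || total > 8 {
        return None;
    }
    let mut bytes = [0u8; 8];
    for f in fields {
        if f.size == 0 || f.offset >= total || f.size > total - f.offset {
            return None;
        }
        let le = f.value.to_le_bytes();
        for i in 0..f.size {
            // The in-memory order of a field's bytes depends on the target.
            let src = match endian {
                Endian::Little => i,
                Endian::Big => f.size - 1 - i,
            };
            bytes[f.offset + i] = le[src];
        }
    }
    // Read the first `total` memory bytes back as one integer of that width.
    let mut imm = 0u64;
    for (i, byte) in bytes.iter().enumerate().take(total) {
        let shift = match endian {
            Endian::Little => 8 * i,
            Endian::Big => 8 * (total - 1 - i),
        };
        imm |= u64::from(*byte) << shift;
    }
    Some(imm)
}
```

With fields at offsets 0, 2, and 4 (sizes 2, 1, and 4) and only the middle field set to 1, the little-endian immediate is `1 << 16` and the big-endian immediate is `1 << 40`, so the same aggregate needs a different constant per target.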
Add a focused codegen test covering the original PR 157690 entry
points together with non-zero constant cases. The test uses
-Cno-prepopulate-passes so it checks the immediate-store shape directly
at rustc codegen time, instead of depending on later LLVM store
merging.
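A test of this kind might look roughly like the sketch below. It is not the test added by this PR: the function name `write_nonzero_const`, the field types of `InnerPadded`, and the exact CHECK lines are assumptions for illustration, and the expected immediate 65536 only holds on a little-endian target.

```rust
//@ compile-flags: -Cno-prepopulate-passes

use std::mem::MaybeUninit;

// Assumed layout, inferred from the i16/i8/i32 stores in the description.
#[repr(C)]
pub struct InnerPadded {
    pub a: u16,
    pub b: u8,
    pub c: u32,
}

// FileCheck reads the directives below from the comments. They ask for one
// packed store and forbid the separate byte store for `b`.
// CHECK-LABEL: @write_nonzero_const(
// CHECK: store i64 65536, ptr %{{.*}}, align 4
// CHECK-NOT: store i8 1
#[unsafe(no_mangle)]
pub fn write_nonzero_const(dest: &mut MaybeUninit<InnerPadded>) {
    // All fields are constants, so the aggregate qualifies for packing.
    let val = InnerPadded { a: 0, b: 1, c: 0 };
    dest.write(val);
}
```

Because no LLVM passes run under `-Cno-prepopulate-passes`, a passing check of this shape shows that rustc itself emitted the packed store.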
Yeah, it makes sense to do this in LLVM too. I just thought it would still be meaningful to do it in Rust, and I tried to keep the patch as small as possible. I am also using this work to get familiar with the Rust compiler setup. So if you think it is pointless to do this on the Rust side, please let me know and I'll look at other things. (I'd be happy if you could tell me what would be worth working on next.)
The previous implementation showed some runtime regressions, so I don't think it was good. I rewrote the whole patch to avoid any runtime regressions. At a high level, the patch tries to recognize the pattern at the MIR level and, when it finds it, explicitly transforms it into a single store.
I feel this is the same issue. With the nested struct there are some temporary variables, and these temporaries help LLVM understand that it is one whole object. But without the nested struct, LLVM's SROA inserted some
My thought is that this shows some of the complexity and randomness of middle-end optimizations. Yes, we can put the blame there, but the complexity and randomness remain. To avoid such cases, what the current patch does may be meaningful: we perform the optimization we want on our side, and that introduces determinism.
The new approach sounds good to me, although I haven’t read it thoroughly yet.
Finished benchmarking commit (b506142): comparison URL.
Overall result: no relevant changes - no action needed
Benchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up.
@rustbot label: -S-waiting-on-perf -perf-regression
Instruction count
This perf run didn't have relevant results for this metric.
Max RSS (memory usage)
Results (primary -4.9%, secondary 1.1%)
A less reliable metric. May be of interest, but not used to determine the overall result above.
Cycles
Results (primary 3.1%, secondary 4.5%)
A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary size
Results (primary 0.1%, secondary -0.0%)
A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 510.446s -> 505.42s (-0.98%)
Close #157373