Skip to content

[Optimization] Zero padding in typed copies to help LLVM merge stores#157690

Open
ChuanqiXu9 wants to merge 2 commits into
rust-lang:mainfrom
ChuanqiXu9:PaddingZero
Open

[Optimization] Zero padding in typed copies to help LLVM merge stores#157690
ChuanqiXu9 wants to merge 2 commits into
rust-lang:mainfrom
ChuanqiXu9:PaddingZero

Conversation

@ChuanqiXu9

@ChuanqiXu9 ChuanqiXu9 commented Jun 10, 2026

Copy link
Copy Markdown

View all comments

Close #157373

For small repr(C) aggregates with padding, direct constant
initialization can still lower into field-wise construction plus
memcpy. That leaves the backend to rediscover that the whole object
is a single constant byte pattern.

This is especially visible for non-zero constant aggregates. Instead
of materializing them as separate field stores, we want codegen_ssa to
emit the packed value directly. For example, a value like

InnerPadded { a: 0, b: 1, c: 0 }

can otherwise lower to something like

store i16 0, ptr %val, align 4
store i8 1, ptr %val_plus_2, align 2
store i32 0, ptr %val_plus_4, align 4
call void @llvm.memcpy(..., ptr %val, ...)

while this change produces

store i64 65536, ptr %val, align 4
call void @llvm.memcpy(..., ptr %val, ...)

Why not solve this in LLVM?

At the problematic lowering point, rustc still knows that the MIR
aggregate is small and fully constant. LLVM only sees a lowered stack
temporary built from per-field stores and then copied out. Recovering
that packed constant there would require rediscovering front-end
aggregate semantics after lowering, so emitting the packed store in
rustc is the simpler and more local fix.

Why not keep the previous typed-copy approach?

An earlier approach zeroed padding on typed copies whose source could
be traced back to a constant assignment. That helped some cases, but it
also widened the optimization to runtime copy paths and could introduce
extra runtime stores purely to maintain padding knowledge. That is not
an acceptable tradeoff here.

Keep the scope narrow instead: only handle direct MIR aggregates whose
fields are all constants, and pack them according to the target
endianness before emitting a single integer store.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 10, 2026
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@ChuanqiXu9

Copy link
Copy Markdown
Author

r? rust-lang/codegen

@dianqk

dianqk commented Jun 12, 2026

Copy link
Copy Markdown
Member

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 12, 2026
@rust-bors

This comment has been minimized.

rust-bors Bot pushed a commit that referenced this pull request Jun 12, 2026
[Optimization] Zero padding in typed copies to help LLVM merge stores

@dianqk dianqk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave this to LLVM. Maybe we can emit store undef in SROA and let instcombine or memcpyopt consume it.

View changes since this review

// CHECK-LABEL: @via_ptr_write(
#[no_mangle]
pub fn via_ptr_write(dest: &mut MaybeUninit<InnerPadded>) {
let val = InnerPadded { a: 0, b: 0, c: 0 };

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let val = InnerPadded { a: 0, b: 0, c: 0 };
let val = InnerPadded { a: 0, b: 1, c: 0 };

Could the new test case be merged into one store with your PR?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, included now.

@rust-bors

rust-bors Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

☀️ Try build successful (CI)
Build commit: 68b77d3 (68b77d32f1786d41b84fa6dfba36f5c3809b1f9f, parent: b30f3df3ba3c4c9de2f58f1a75dd9500b79b3f8d)

@rust-timer

This comment has been minimized.

@rust-timer

Copy link
Copy Markdown
Collaborator

Finished benchmarking commit (68b77d3): comparison URL.

Overall result: ❌ regressions - please read:

Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf.

Next, please: If you can, justify the regressions found in this try perf run in writing along with @rustbot label: +perf-regression-triaged. If not, fix the regressions and do another perf run. Neutral or positive results will clear the label automatically.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

mean range count
Regressions ❌
(primary)
0.4% [0.2%, 0.6%] 16
Regressions ❌
(secondary)
0.4% [0.2%, 0.6%] 10
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.4% [0.2%, 0.6%] 16

Max RSS (memory usage)

Results (secondary 2.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
2.3% [2.3%, 2.3%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Cycles

This perf run didn't have relevant results for this metric.

Binary size

Results (primary 0.1%, secondary 0.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
0.1% [0.0%, 0.2%] 22
Regressions ❌
(secondary)
0.3% [0.0%, 0.4%] 8
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 0.1% [0.0%, 0.2%] 22

Bootstrap: 518.98s -> 524.616s (1.09%)
Artifact size: 400.92 MiB -> 400.88 MiB (-0.01%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jun 12, 2026
@theemathas

theemathas commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What exactly is nesting structs doing that affects this optimization? See #157373 (comment)

For small repr(C) aggregates with padding, direct constant
initialization can still lower into field-wise construction plus
memcpy. That leaves the backend to rediscover that the whole object
is a single constant byte pattern.

This is especially visible for non-zero constant aggregates. Instead
of materializing them as separate field stores, we want codegen_ssa to
emit the packed value directly. For example, a value like

    InnerPadded { a: 0, b: 1, c: 0 }

can otherwise lower to something like

    store i16 0, ptr %val, align 4
    store i8 1, ptr %val_plus_2, align 2
    store i32 0, ptr %val_plus_4, align 4
    call void @llvm.memcpy(..., ptr %val, ...)

while this change produces

    store i64 65536, ptr %val, align 4
    call void @llvm.memcpy(..., ptr %val, ...)

Why not solve this in LLVM?

At the problematic lowering point, rustc still knows that the MIR
aggregate is small and fully constant. LLVM only sees a lowered stack
temporary built from per-field stores and then copied out. Recovering
that packed constant there would require rediscovering front-end
aggregate semantics after lowering, so emitting the packed store in
rustc is the simpler and more local fix.

Why not keep the previous typed-copy approach?

An earlier approach zeroed padding on typed copies whose source could
be traced back to a constant assignment. That helped some cases, but it
also widened the optimization to runtime copy paths and could introduce
extra runtime stores purely to maintain padding knowledge. That is not
an acceptable tradeoff here.

Keep the scope narrow instead: only handle direct MIR aggregates whose
fields are all constants, and pack them according to the target
endianness before emitting a single integer store.

Add a focused codegen test covering the original PR 157690 entry
points together with non-zero constant cases. The test uses
-Cno-prepopulate-passes so it checks the immediate-store shape directly
at rustc codegen time, instead of depending on later LLVM store
merging.
@ChuanqiXu9

Copy link
Copy Markdown
Author

I think we should leave this to LLVM. Maybe we can emit store undef in SROA and let instcombine or memcpyopt consume it.

View changes since this review

Yeah, it makes sense to do this in LLVM too. I just thought it is still meaningful to do this in rust. As I tried to make the patch as small as possible.

I was trying to train my self to get familiar with the rust compiler setups. So if you think it is meaningless to do this in rust side, please let me know and I'll try to look at other stuffs. (I'll be happy if you can tell me what is worth next).

@ChuanqiXu9

Copy link
Copy Markdown
Author

As the last implementation shows some runtime regressions, I think it is not good. I rewrote the whole patch to avoid any runtime regressions. The high level view of the patch is, we try to recognize the pattern in the MIR level, and if we find it, we will try to transform it into a single store explicitly.

@ChuanqiXu9

ChuanqiXu9 commented Jun 24, 2026

Copy link
Copy Markdown
Author

What exactly is nesting structs doing that affects this optimization? See #157373 (comment)

I feel this is the same. With nested struct, it has some temp variables. But these temp variables helps the LLVM to understand that it is a whole object. But without the nested struct, LLVM's SROA inserted some store undef and finally makes the other pass failed to recognize it is a whole object.

My thought is, this shows some complexities and randomness of middle end optimizations. Yeah, we can blame that. But it is still complex and randomness. To avoid such cases, what the current patch did may be meaningful. We did the optimization we want in our part. We introduces the determinism.

@dianqk

dianqk commented Jun 25, 2026

Copy link
Copy Markdown
Member

The new approach sounds good to me, although I haven’t read it thoroughly yet.
@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rust-bors

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 25, 2026
rust-bors Bot pushed a commit that referenced this pull request Jun 25, 2026
[Optimization] Zero padding in typed copies to help LLVM merge stores
@rust-bors

rust-bors Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

☀️ Try build successful (CI)
Build commit: b506142 (b50614215251a44128e69d72272cc8f1bfe4ae05)
Base parent: f28ac76 (f28ac764c36004fa6a6e098d15b4016a838c13c6)

@rust-timer

This comment has been minimized.

@rust-timer

Copy link
Copy Markdown
Collaborator

Finished benchmarking commit (b506142): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up.

@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This perf run didn't have relevant results for this metric.

Max RSS (memory usage)

Results (primary -4.9%, secondary 1.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
5.0% [2.6%, 7.3%] 4
Improvements ✅
(primary)
-4.9% [-5.6%, -4.2%] 2
Improvements ✅
(secondary)
-2.7% [-4.3%, -2.1%] 4
All ❌✅ (primary) -4.9% [-5.6%, -4.2%] 2

Cycles

Results (primary 3.1%, secondary 4.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
3.1% [2.9%, 3.2%] 2
Regressions ❌
(secondary)
4.5% [3.4%, 6.7%] 4
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 3.1% [2.9%, 3.2%] 2

Binary size

Results (primary 0.1%, secondary -0.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

mean range count
Regressions ❌
(primary)
0.3% [0.3%, 0.3%] 4
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-0.0% [-0.0%, -0.0%] 3
Improvements ✅
(secondary)
-0.0% [-0.0%, -0.0%] 1
All ❌✅ (primary) 0.1% [-0.0%, 0.3%] 7

Bootstrap: 510.446s -> 505.42s (-0.98%)
Artifact size: 353.05 MiB -> 353.47 MiB (0.12%)

@rustbot rustbot removed S-waiting-on-perf Status: Waiting on a perf run to be completed. perf-regression Performance regression. labels Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

missed optimization when writing a value with inner padding.

6 participants