[Optimization] Zero padding in typed copies to help LLVM merge stores by ChuanqiXu9 · Pull Request #157690 · rust-lang/rust

ChuanqiXu9 · 2026-06-10T02:56:38Z

For small repr(C) aggregates with padding, direct constant
initialization can still lower into field-wise construction plus
memcpy. That leaves the backend to rediscover that the whole object
is a single constant byte pattern.

This is especially visible for non-zero constant aggregates. Instead
of materializing them as separate field stores, we want codegen_ssa to
emit the packed value directly. For example, a value like

InnerPadded { a: 0, b: 1, c: 0 }

can otherwise lower to something like

store i16 0, ptr %val, align 4
store i8 1, ptr %val_plus_2, align 2
store i32 0, ptr %val_plus_4, align 4
call void @llvm.memcpy(..., ptr %val, ...)

while this change produces

store i64 65536, ptr %val, align 4
call void @llvm.memcpy(..., ptr %val, ...)

Why not solve this in LLVM?

At the problematic lowering point, rustc still knows that the MIR
aggregate is small and fully constant. LLVM only sees a lowered stack
temporary built from per-field stores and then copied out. Recovering
that packed constant there would require rediscovering front-end
aggregate semantics after lowering, so emitting the packed store in
rustc is the simpler and more local fix.

Why not keep the previous typed-copy approach?

An earlier approach zeroed padding on typed copies whose source could
be traced back to a constant assignment. That helped some cases, but it
also widened the optimization to runtime copy paths and could introduce
extra runtime stores purely to maintain padding knowledge. That is not
an acceptable tradeoff here.

Keep the scope narrow instead: only handle direct MIR aggregates whose
fields are all constants, and pack them according to the target
endianness before emitting a single integer store.

ChuanqiXu9 · 2026-06-12T03:09:00Z

r? rust-lang/codegen

dianqk · 2026-06-12T03:52:35Z

@bors try @rust-timer queue

[Optimization] Zero padding in typed copies to help LLVM merge stores

dianqk

I think we should leave this to LLVM. Maybe we can emit store undef in SROA and let instcombine or memcpyopt consume it.

View changes since this review

dianqk · 2026-06-12T05:06:07Z

+// CHECK-LABEL: @via_ptr_write(
+#[no_mangle]
+pub fn via_ptr_write(dest: &mut MaybeUninit<InnerPadded>) {
+    let val = InnerPadded { a: 0, b: 0, c: 0 };


Suggested change

let val = InnerPadded { a: 0, b: 0, c: 0 };

let val = InnerPadded { a: 0, b: 1, c: 0 };

Could the new test case be merged into one store with your PR?

Yeah, included now.

rust-bors · 2026-06-12T06:01:05Z

☀️ Try build successful (CI)
Build commit: 68b77d3 (68b77d32f1786d41b84fa6dfba36f5c3809b1f9f, parent: b30f3df3ba3c4c9de2f58f1a75dd9500b79b3f8d)

rust-timer · 2026-06-12T06:41:59Z

Finished benchmarking commit (68b77d3): comparison URL.

Overall result: ❌ regressions - please read:

Benchmarking means the PR may be perf-sensitive. It's automatically marked not fit for rolling up. Overriding is possible but disadvised: it risks changing compiler perf.

Next, please: If you can, justify the regressions found in this try perf run in writing along with @rustbot label: +perf-regression-triaged. If not, fix the regressions and do another perf run. Neutral or positive results will clear the label automatically.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

	mean	range	count
Regressions ❌ (primary)	0.4%	[0.2%, 0.6%]	16
Regressions ❌ (secondary)	0.4%	[0.2%, 0.6%]	10
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.4%	[0.2%, 0.6%]	16

Max RSS (memory usage)

Results (secondary 2.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.3%	[2.3%, 2.3%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

Cycles

This perf run didn't have relevant results for this metric.

Binary size

Results (primary 0.1%, secondary 0.3%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	0.1%	[0.0%, 0.2%]	22
Regressions ❌ (secondary)	0.3%	[0.0%, 0.4%]	8
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.1%	[0.0%, 0.2%]	22

Bootstrap: 518.98s -> 524.616s (1.09%)
Artifact size: 400.92 MiB -> 400.88 MiB (-0.01%)

theemathas · 2026-06-17T13:00:14Z

What exactly is nesting structs doing that affects this optimization? See #157373 (comment)

For small repr(C) aggregates with padding, direct constant initialization can still lower into field-wise construction plus memcpy. That leaves the backend to rediscover that the whole object is a single constant byte pattern. This is especially visible for non-zero constant aggregates. Instead of materializing them as separate field stores, we want codegen_ssa to emit the packed value directly. For example, a value like InnerPadded { a: 0, b: 1, c: 0 } can otherwise lower to something like store i16 0, ptr %val, align 4 store i8 1, ptr %val_plus_2, align 2 store i32 0, ptr %val_plus_4, align 4 call void @llvm.memcpy(..., ptr %val, ...) while this change produces store i64 65536, ptr %val, align 4 call void @llvm.memcpy(..., ptr %val, ...) Why not solve this in LLVM? At the problematic lowering point, rustc still knows that the MIR aggregate is small and fully constant. LLVM only sees a lowered stack temporary built from per-field stores and then copied out. Recovering that packed constant there would require rediscovering front-end aggregate semantics after lowering, so emitting the packed store in rustc is the simpler and more local fix. Why not keep the previous typed-copy approach? An earlier approach zeroed padding on typed copies whose source could be traced back to a constant assignment. That helped some cases, but it also widened the optimization to runtime copy paths and could introduce extra runtime stores purely to maintain padding knowledge. That is not an acceptable tradeoff here. Keep the scope narrow instead: only handle direct MIR aggregates whose fields are all constants, and pack them according to the target endianness before emitting a single integer store. Add a focused codegen test covering the original PR 157690 entry points together with non-zero constant cases. The test uses -Cno-prepopulate-passes so it checks the immediate-store shape directly at rustc codegen time, instead of depending on later LLVM store merging.

ChuanqiXu9 · 2026-06-24T07:33:19Z

I think we should leave this to LLVM. Maybe we can emit store undef in SROA and let instcombine or memcpyopt consume it.

View changes since this review

Yeah, it makes sense to do this in LLVM too. I just thought it is still meaningful to do this in rust. As I tried to make the patch as small as possible.

I was trying to train my self to get familiar with the rust compiler setups. So if you think it is meaningless to do this in rust side, please let me know and I'll try to look at other stuffs. (I'll be happy if you can tell me what is worth next).

ChuanqiXu9 · 2026-06-24T07:35:48Z

As the last implementation shows some runtime regressions, I think it is not good. I rewrote the whole patch to avoid any runtime regressions. The high level view of the patch is, we try to recognize the pattern in the MIR level, and if we find it, we will try to transform it into a single store explicitly.

ChuanqiXu9 · 2026-06-24T08:56:10Z

What exactly is nesting structs doing that affects this optimization? See #157373 (comment)

I feel this is the same. With nested struct, it has some temp variables. But these temp variables helps the LLVM to understand that it is a whole object. But without the nested struct, LLVM's SROA inserted some store undef and finally makes the other pass failed to recognize it is a whole object.

My thought is, this shows some complexities and randomness of middle end optimizations. Yeah, we can blame that. But it is still complex and randomness. To avoid such cases, what the current patch did may be meaningful. We did the optimization we want in our part. We introduces the determinism.

dianqk · 2026-06-25T01:22:29Z

The new approach sounds good to me, although I haven’t read it thoroughly yet.
@bors try @rust-timer queue

[Optimization] Zero padding in typed copies to help LLVM merge stores

rust-bors · 2026-06-25T03:42:25Z

☀️ Try build successful (CI)
Build commit: b506142 (b50614215251a44128e69d72272cc8f1bfe4ae05)
Base parent: f28ac76 (f28ac764c36004fa6a6e098d15b4016a838c13c6)

rust-timer · 2026-06-25T04:24:14Z

Finished benchmarking commit (b506142): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up.

@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This perf run didn't have relevant results for this metric.

Max RSS (memory usage)

Results (primary -4.9%, secondary 1.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	5.0%	[2.6%, 7.3%]	4
Improvements ✅ (primary)	-4.9%	[-5.6%, -4.2%]	2
Improvements ✅ (secondary)	-2.7%	[-4.3%, -2.1%]	4
All ❌✅ (primary)	-4.9%	[-5.6%, -4.2%]	2

Cycles

Results (primary 3.1%, secondary 4.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	3.1%	[2.9%, 3.2%]	2
Regressions ❌ (secondary)	4.5%	[3.4%, 6.7%]	4
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	3.1%	[2.9%, 3.2%]	2

Binary size

Results (primary 0.1%, secondary -0.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

	mean	range	count
Regressions ❌ (primary)	0.3%	[0.3%, 0.3%]	4
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.0%	[-0.0%, -0.0%]	3
Improvements ✅ (secondary)	-0.0%	[-0.0%, -0.0%]	1
All ❌✅ (primary)	0.1%	[-0.0%, 0.3%]	7

Bootstrap: 510.446s -> 505.42s (-0.98%)
Artifact size: 353.05 MiB -> 353.47 MiB (0.12%)

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 10, 2026