Commit f6db15e

avrabe and claude committed
silicon: wasm-cross-LTO spike — full integration attempt + 3rd synth bug
Pushed the wasm-cross-LTO experiment all the way to a buildable bench ELF integrated via wasm-ld+arm-ar+linker-substitute. Discovered an additional synth backend bug while attempting silicon measurement: synth's emitted memset/memcpy/memmove don't terminate correctly on Zephyr's startup `memset(bss, 0, sizeof(bss))` invocation. The chip hangs in memset+0x4c forever, bouncing between offsets 0x668 and 0x67e in a tight inner loop. The synth disassembly reveals i64 shift instructions (`subs.w r3, r2, #32; rsb r3, r2, #32; lsl.w r3, r1, r3`) lowered into what should be a byte-counter loop. This is the same root cause as the u64-packed FFI return codegen issue documented earlier: synth's i64 codegen is incomplete.

End-to-end status:

- wasm-ld static-merging: WORKS. shim.wasm.o + libgale_ffi.a → 1MB merged.wasm with z_impl_k_sem_give and gale_k_sem_give_decide both present.
- synth inlining at merged-module scope: STRUCTURALLY WORKS. The output `z_impl_k_sem_give` body has zero bl gale_k_sem_give_decide instructions. Verified by disassembly. 138 bytes vs LLVM-LTO's 82 bytes: 1.68x larger but inlined.
- Bench integration: BUILDS. CMake bench builds with -DGALE_WASM_LTO_OVERRIDE_SEM_GIVE=1 + custom libgale_ffi.a + --allow-multiple-definition. Final ELF 219 KB FLASH, 66 KB RAM.
- Chip boot: BLOCKED. PC stuck in synth-emitted memset. Workarounds via objcopy --weaken-symbol, --strip-symbol, --redefine-sym all failed to evict synth's broken memset bytes from the final ELF.

Three synth backend issues filed against pulseengine/synth, ordered:

1. (blocker) memset/memcpy/memmove i64-codegen non-termination: prevents the merged-wasm bench from booting at all.
2. u64-packed FFI return unpacking: ~50% of the LTO-parity size delta. Same i64-codegen root cause as #1.
3. wasm linear-memory access lowering: ~20% of the size delta. Cosmetic compared to #1 and #2.

Plus one issue against pulseengine/loom:

- Z3 SortDiffers panic in inline_functions pass on i64-heavy wasm modules. Without loom, the verified-LTO claim doesn't hold.

The structural claim, "wasm-cross-LTO via PulseEngine pipeline dissolves the C↔Rust seam at wasm IR level", is **proven by disassembly**. The silicon-cycle claim, "silicon timing matches LLVM-LTO", is **blocked on synth's memset codegen**. Neither is a fundamental architectural barrier; both are well-scoped engineering work.

This commit only updates the NOTES with the integration findings. The bench source is restored to clean state (the gale_sem.c #ifndef edit was transient) and verified building unchanged at 27 KB FLASH at the canonical rustc-direct path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 35f90fb commit f6db15e

1 file changed

Lines changed: 94 additions & 7 deletions

benches/engine_control/silicon/boards/nucleo_g474re/NOTES-wasm-cross-lto-spike.md

@@ -97,7 +97,56 @@ unpacked via generic shifts instead of a direct field access.
 
 Both are fixable in synth's backend.
 
-## Action items (filed against pulseengine/synth and pulseengine/loom)
+## Full integration attempt + new finding: synth memset is broken
+
+After the initial spike I attempted full integration into the bench
+to get silicon-measurable timing. The integration path:
+
+1. Edit `zephyr/gale_sem.c` to wrap `z_impl_k_sem_give` in
+   `#ifndef GALE_WASM_LTO_OVERRIDE_SEM_GIVE` so the bench's native
+   compilation skips it.
+2. Build merged.wasm via wasm-ld → synth → ARM ET_REL.
+3. Wrap merged.o in libgale_ffi.a using arm-zephyr-eabi-ar (not
+   Apple ar: Apple ar's ranlib doesn't index ARM ELF symbols
+   correctly, archive shows up as "no global symbols").
+4. Place the merged libgale_ffi.a at the bench's expected path
+   (replacing the rustc-direct cargo output).
+5. Build with `-DEXTRA_CFLAGS=-DGALE_WASM_LTO_OVERRIDE_SEM_GIVE=1
+   -DEXTRA_LDFLAGS=-Wl,--allow-multiple-definition`.
+
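The guard from step 1 can be sketched as follows. This is a minimal stand-in, not the contents of `zephyr/gale_sem.c`: only the macro name and the `z_impl_k_sem_give` symbol come from the notes, while `struct k_sem_stub`, the placeholder body, and `sem_give_twice_demo` are hypothetical and exist only to make the example compile on a host.

```c
#include <assert.h>

/* Hypothetical stand-in for Zephyr's struct k_sem (not the real layout). */
struct k_sem_stub {
    unsigned int count;
    unsigned int limit;
};

#ifndef GALE_WASM_LTO_OVERRIDE_SEM_GIVE
/* Native path: compiled only when the override macro is NOT defined.
 * Placeholder body; the real function does more than bump a counter. */
void z_impl_k_sem_give(struct k_sem_stub *sem)
{
    if (sem->count < sem->limit) {
        sem->count++;
    }
}
#else
/* Override path: declaration only. The definition is expected to come
 * from the substituted archive at link time. */
void z_impl_k_sem_give(struct k_sem_stub *sem);
#endif

/* Gives twice on a limit-1 semaphore and returns the resulting count. */
unsigned int sem_give_twice_demo(void)
{
    struct k_sem_stub s = { 0u, 1u };
    z_impl_k_sem_give(&s);
    z_impl_k_sem_give(&s);
    assert(s.count <= s.limit);
    return s.count;
}
```

Building with the macro defined drops the native body, so the symbol must then be supplied by the archive; otherwise the link fails with an undefined reference.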
+The build succeeded. Final ELF: 219 KB FLASH (vs LTO's 26 KB), 66 KB
+RAM. The linker warns about an orphan `.meld_import_table` section
+(synth-emitted, lands at VMA 0 in a non-loaded section, harmless).
+
+**But the chip doesn't boot.** PC stays in the synth-emitted `memset`
+function (`0x0802c614`, 454 bytes) for >10 seconds, bouncing between
+0x0802c668 and 0x0802c67e: a tight inner loop that doesn't terminate
+correctly on the boundaries Zephyr's startup uses. Zephyr's z_bss_zero
+calls memset(bss_start, 0, bss_size); synth's memset never returns.
+
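For comparison, a terminating byte-counter loop and a host-side check of the same call shape might look like the sketch below. Everything here is an assumption for illustration: `ref_memset`, `memset_fills_exactly`, the 256-byte region, and the guard bytes are hypothetical names and sizes, not Zephyr's or synth's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void *(*memset_fn)(void *dst, int c, size_t n);

/* Reference byte-counter loop: one store per byte and n strictly
 * decreasing, so termination is easy to see. */
void *ref_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    while (n > 0) {
        *p++ = (unsigned char)c;
        n--;
    }
    return dst;
}

/* Checks that fn zeroes exactly n bytes of an 8-byte-aligned region and
 * leaves 8 guard bytes on each side untouched. Returns 1 on success. */
int memset_fills_exactly(memset_fn fn, size_t n)
{
    /* uint64_t storage keeps the region 8-byte aligned: 8 + 256 + 8 bytes. */
    static uint64_t storage[34];
    unsigned char *region = (unsigned char *)storage;
    const size_t total = sizeof storage;

    assert(n <= 256 && n % 8 == 0);

    ref_memset(region, 0xA5, total);       /* poison everything first */
    void *ret = fn(region + 8, 0, n);      /* same shape as memset(bss_start, 0, bss_size) */

    if (ret != (void *)(region + 8)) {
        return 0;
    }
    for (size_t i = 0; i < total; i++) {
        unsigned char want = (i >= 8 && i < 8 + n) ? 0x00 : 0xA5;
        if (region[i] != want) {
            return 0;
        }
    }
    return 1;
}
```

A candidate that never returns would hang this check rather than fail it, so a harness built on it would need an external timeout.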
+Workaround attempts that didn't stick:
+- `--allow-multiple-definition` + `objcopy --weaken-symbol=memset`:
+  weak-vs-strong resolution didn't override; ld picked synth's.
+- `objcopy --redefine-sym memset=__synth_memset`: renamed the C-symbol
+  but the Rust mangled `_ZN17compiler_builtins3mem6memset17h...E`
+  remained, and the final ELF still resolves `memset` to synth's
+  buggy code at 0x0802c614 (the bytes are still there from merged.o's
+  .text section).
+- `objcopy --strip-symbol=memset`: removes the symbol table entry
+  but doesn't remove the bytes; ld still places synth's code at
+  0x0802c614 and exposes it as `memset` from another reference.
+
+The root cause is that synth's wasm-to-ARM lowering of memset
+produces a loop that doesn't terminate on the boundaries Zephyr's
+startup uses (`memset(bss, 0, bss_size)` where bss_size is in bytes,
+8-byte aligned). The synth output disassembles to a pattern with
+`subs.w r3, r2, #32; bpl.n ...; rsb r3, r2, #32; lsl.w r3, r1, r3 …`
+which is the i64-shift-with-bytecount pattern from Rust's u64
+left-shift implementation. synth seems to have lowered memset's
+inner loop using i64 shift operations that don't apply to byte
+counts.
+
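To make the shift pattern concrete, here is a minimal C sketch of how a 64-bit left shift is commonly decomposed into 32-bit operations on a 32-bit target. It is a generic illustration, not a reconstruction of synth's output: the function names are hypothetical, and the `n - 32` test and `32 - n` complement merely mirror the `subs ..., #32` and `rsb ..., #32` arithmetic visible in the quoted disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit left shift built from two 32-bit halves. Valid for 0 <= n <= 63. */
uint64_t shl64_via_halves(uint32_t lo, uint32_t hi, unsigned n)
{
    uint32_t out_lo, out_hi;

    assert(n < 64u);

    if (n >= 32u) {
        /* n - 32 is non-negative: the low word moves wholly into the high word. */
        out_hi = lo << (n - 32u);
        out_lo = 0u;
    } else if (n == 0u) {
        /* A 32-bit shift by 32 is undefined in C, so n == 0 is handled apart. */
        out_hi = hi;
        out_lo = lo;
    } else {
        /* 32 - n selects the bits of lo that cross into the high word. */
        out_hi = (hi << n) | (lo >> (32u - n));
        out_lo = lo << n;
    }
    return ((uint64_t)out_hi << 32) | out_lo;
}

/* Compares the decomposition against the compiler's native 64-bit shift. */
int shl64_selfcheck(void)
{
    const uint64_t samples[] = { 0u, 1u, 0x80000000u, 0xFFFFFFFFu,
                                 0x0123456789ABCDEFull, UINT64_MAX };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        for (unsigned n = 0; n < 64u; n++) {
            uint64_t v = samples[i];
            uint64_t got = shl64_via_halves((uint32_t)v, (uint32_t)(v >> 32), n);
            if (got != (v << n)) {
                return 0;
            }
        }
    }
    return 1;
}
```

The shift amount here is a bit count between 0 and 63. A byte counter in a fill loop has no such bound, which is why seeing this arithmetic inside memset's inner loop is suspicious.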
+## Action items, updated (3 synth bugs, 1 loom bug)
 
 ### loom: Z3 i64 sort handling
 
@@ -113,19 +162,57 @@ LLVM-LTO.
 
 ### synth: codegen patterns
 
-1. **u64-packed FFI return unpacking:** when synth lowers a wasm
+1. **memset/memcpy/memmove are MIS-COMPILED** (newly discovered, severity:
+   blocker). Synth's wasm→ARM lowering of compiler_builtins' memset
+   produces a non-terminating loop on Zephyr's startup
+   `memset(bss, 0, sizeof(bss))` invocation. The chip hangs in
+   memset+0x4c forever. Until this is fixed, no integration of
+   merged-wasm into a real bench can boot. **First-priority fix.**
+
+2. **u64-packed FFI return unpacking:** when synth lowers a wasm
    function that returns i64 and the caller immediately bit-masks
    into byte-fields, recognize the packed-struct-return pattern and
    emit register-direct field access (no shifts). Reduces LTO-parity
-   gap by ~50% of the size delta.
-2. **wasm linear-memory access lowering:** when a wasm `i32.load` is
+   gap by ~50% of the size delta. Same root issue as memset's bug:
+   synth's i64 codegen is incomplete.
+
+3. **wasm linear-memory access lowering:** when a wasm `i32.load` is
    from a constant address that's known to be in `.data`, emit
    `ldr rN, [base, #imm]` instead of `movw + movt + ldr`. Reduces
    another ~20% of the size delta.
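
The packed-return pattern in item 2 can be sketched in C as below. The record layout is an assumption made for illustration: `struct decision`, its field names and widths, and all function names are hypothetical and are not taken from the Gale FFI. The sketch contrasts a generic unpack, where every field is a 64-bit shift plus mask, with one that splits the value into its two 32-bit halves once and then uses only 32-bit operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decision record; field names and widths are assumptions. */
struct decision {
    uint8_t  action;
    uint8_t  wake;
    uint16_t reserved;
    uint32_t new_count;
};

/* Callee side: pack the record into one u64 (a single i64 return in wasm). */
uint64_t pack_decision(struct decision d)
{
    return  (uint64_t)d.action
         | ((uint64_t)d.wake      << 8)
         | ((uint64_t)d.reserved  << 16)
         | ((uint64_t)d.new_count << 32);
}

/* Caller side, generic form: each field is a 64-bit shift and mask. */
struct decision unpack_generic(uint64_t packed)
{
    struct decision d;
    d.action    = (uint8_t)(packed & 0xFFu);
    d.wake      = (uint8_t)((packed >> 8) & 0xFFu);
    d.reserved  = (uint16_t)((packed >> 16) & 0xFFFFu);
    d.new_count = (uint32_t)(packed >> 32);
    return d;
}

/* Caller side, by halves: on a 32-bit target the u64 already sits in two
 * 32-bit registers, so the fields need only 32-bit operations. */
struct decision unpack_by_halves(uint64_t packed)
{
    uint32_t lo = (uint32_t)packed;
    uint32_t hi = (uint32_t)(packed >> 32);
    struct decision d;
    d.action    = (uint8_t)lo;
    d.wake      = (uint8_t)(lo >> 8);
    d.reserved  = (uint16_t)(lo >> 16);
    d.new_count = hi;               /* the whole high register, no shift */
    return d;
}

/* Asserts that both unpack forms recover the original record. */
void check_unpack_forms_agree(struct decision d)
{
    uint64_t packed = pack_decision(d);
    struct decision a = unpack_generic(packed);
    struct decision b = unpack_by_halves(packed);
    assert(a.action == d.action && b.action == d.action);
    assert(a.wake == d.wake && b.wake == d.wake);
    assert(a.reserved == d.reserved && b.reserved == d.reserved);
    assert(a.new_count == d.new_count && b.new_count == d.new_count);
}
```

An optimizing C compiler will usually reduce both forms to the same code. The point of the sketch is only to show the equivalence a backend would have to recognize in order to emit the second form directly.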
 
-With both fixes applied, the wasm-LTO route should approach LLVM-LTO
-parity (within ~10% on silicon cycles) while delivering the
-verification-by-construction property LLVM-LTO doesn't have.
+With (1) fixed, the wasm-LTO bench will boot and we get measurable
+silicon cycles. With (2)+(3) on top, the wasm-LTO route should
+approach LLVM-LTO parity (within ~10% on silicon cycles) while
+delivering the verification-by-construction property LLVM-LTO doesn't
+have.
+
+## What we have data to compare on
+
+Silicon (sha b48a81ac/f6f61281):
+    baseline (no Gale, ADC=n)                 528 cyc handoff median
+    rustc-direct gale (ADC=n)                 574 cyc (+46 = FFI seam)
+    gale via wasm→synth (Rust only, ADC=y)    582 cyc (seam preserved)
+    LLVM-LTO gale (ADC=n)                     471 cyc (-57 below baseline)
+    LLVM-LTO gale (ADC=y, post-fix)           558 cyc (+52 above baseline)
+    wasm-LTO via wasm-ld+synth (ADC=y)        ELF builds, chip won't boot
+
+Toolchain-level:
+    wasm-ld merge + arm-zephyr-eabi-ar        works
+    synth inlining via merged-module          works (no bl in z_impl_k_sem_give)
+    synth emitted body size                   138 bytes (1.68x LTO's 82 bytes)
+    synth memset codegen                      broken (infinite loop)
+    loom inline_functions                     broken (Z3 SortDiffers on i64)
+
+The **structural claim** holds: the wasm-LTO toolchain (meld/wasm-ld → loom
+→ synth) inlines through the C↔Rust seam. The disassembly evidence is
+robust: synth's emitted ARM has zero `bl gale_k_sem_give_decide` in
+the inlined `z_impl_k_sem_give`.
+
+The **silicon-cycle claim** requires fixes upstream: the memset bug
+blocks boot, and the i64 codegen patterns prevent LTO parity once we
+get there. That means two PRs against pulseengine/synth and one against
+pulseengine/loom.
 
 ## Why this matters for the publication
 