Skip to content

Expand all 5 verification tracks: Verus 35, Rocq 9/9, Lean 3/3, Kani 87/87#1

Merged
avrabe merged 5 commits into
mainfrom
feat/verification-expansion
Mar 18, 2026
Merged

Expand all 5 verification tracks: Verus 35, Rocq 9/9, Lean 3/3, Kani 87/87#1
avrabe merged 5 commits into
mainfrom
feat/verification-expansion

Conversation

@avrabe
Copy link
Copy Markdown
Contributor

@avrabe avrabe commented Mar 18, 2026

Summary

  • Verus SMT: 11 → 35 modules verified (out of 39 total). Added fifo, lifo, timer, mem_slab, queue, mbox, timeout, thread_lifecycle, timeslice, heap, fatal, dynamic, smp_state, stack_config, spinlock, atomic, futex, mempool, kheap, fault_decode, device_init, ring_buf, event, work. Only poll (mutable array indexing), sched (fn pointer types), mem_domain, userspace remain excluded.
  • Rocq proofs: 0/9 → 9/9 passing. Fixed Rocq 9.0 compatibility (Stdlib.Init.Logic, Z_scope, Z.lor_assoc direction, unfold scope, removed unnecessary deps).
  • Lean proofs: 0/3 → 3/3 passing. Fixed Task→SchedTask rename (stdlib conflict), def→abbrev for instance propagation, qualified stdlib lemmas for Lean 4.27.0. No Mathlib needed.
  • Kani BMC: Not wired → 87/87 harnesses pass. Added extern crate kani, #[kani::unwind(N)] annotations for loop harnesses.
  • verus-strip: Added --standalone mode for rocq_of_rust translation. Inlines error constants and type stubs.
  • Bazel: Added verus_strip rule, rocq_module macro, exports_files for cross-package visibility.
  • README: Updated with honest verification status.

Test plan

  • bazel test //:verus_test — 35 modules verified
  • bazel test //:cargo_test — all Rust tests pass
  • bazel test //:clippy_test — zero warnings
  • bazel test //:fmt_test — all formatted
  • bazel test //proofs:sem_proofs_test //proofs:mutex_proofs_test //proofs:condvar_proofs_test //proofs:msgq_proofs_test //proofs:stack_proofs_test //proofs:pipe_proofs_test //proofs:timer_proofs_test //proofs:event_proofs_test //proofs:mem_slab_proofs_test — 9/9 Rocq
  • bazel test //proofs/lean:scheduling_test //proofs/lean:priority_ceiling_test //proofs/lean:priority_queue_test — 3/3 Lean
  • cargo kani --tests — 87/87 harnesses

🤖 Generated with Claude Code

avrabe and others added 5 commits March 17, 2026 20:53
Verus SMT verification now covers 27 modules (up from 11):
added fifo, lifo, timer, mem_slab, queue, mbox, timeout,
thread_lifecycle, timeslice, heap, fatal, dynamic, smp_state,
stack_config, spinlock, atomic.

Key fixes: spec functions for wrapping arithmetic (atomic.rs),
by(bit_vector) proofs for bitwise properties, decreases clauses
for loops, helper proof lemmas for CFSR/HFSR mask decomposition.

Also fixes all clippy warnings across test files (clone-on-Copy,
format string inlining, literal separators, cast_lossless) and
regenerates plain/ stripped output with comprehensive clippy allows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…trip Bazel rule

Verus: added futex and mempool (29 modules, up from 27). Fixed mempool
overflow proof using checked_mul.

Rocq: 6 of 9 proof files now pass (sem, stack, msgq, pipe, timer,
mem_slab). Fixed Rocq 9.0 compatibility: added Stdlib.Init.Logic import,
Open Scope Z_scope, fixed bullet focusing, updated Z.lor lemma names.
Event proofs need Z bitwise lemma renames; mutex/condvar need standalone
file generation with type stubs.

Bazel: added verus_strip rule (bzl/private/verus_to_rocq.bzl) and
rocq_module macro for chaining verus-strip → coq_of_rust → rocq.
Added exports_files for src/*.rs cross-package visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…7/87

Verus: expanded from 29 to 35 modules. Added kheap (checked_mul),
fault_decode (bit_vector classify proofs), device_init (loop invariant
bounds), ring_buf, event (local vars for bit_vector), work (flag
mutation proofs). Only poll/sched (Verus limitations), mem_domain,
userspace remain excluded.

Rocq: all 9 proof files pass. Fixed event Z.lor_assoc direction,
removed unnecessary //plain deps from mutex/condvar, fixed condvar
unfold scope.

Lean: all 3 proofs pass without Mathlib. Renamed Task→SchedTask
(stdlib conflict), def→abbrev for instance propagation, qualified
stdlib lemmas for Lean 4.27.0, explicit Rat arithmetic.

Kani: 87/87 harnesses pass. Added #[kani::unwind(N)] to 12 loop
harnesses, added extern crate kani to both src/ and plain/ lib.rs.
Bazel passes with 600s timeout.

verus-strip: added --standalone mode for rocq_of_rust generation.
Inlines error constants and type stubs, strips crate:: imports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
35/39 Verus modules pass (not 38). poll, sched, mem_domain, userspace
marked as "Annotated" with specific Verus limitation noted. Updated
verification counts: Kani 87/87, Rocq 9/9, Lean 3/3. Fixed architecture
diagram to show verus-strip standalone path for Rocq.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@avrabe avrabe merged commit 7d3ce5f into main Mar 18, 2026
24 of 25 checks passed
@avrabe avrabe deleted the feat/verification-expansion branch March 19, 2026 19:29
avrabe added a commit that referenced this pull request Mar 28, 2026
Performance (semaphore test, qemu_cortex_m3):
- FLASH: +2,944 bytes (+7.8%) — within 10% threshold
- RAM: +864 bytes (+6.8%)
- Gale code: 492 bytes total (9 functions, largest 112 bytes)
- LTO tested: no improvement (GCC can't cross-language LTO with Rust)

Next ASIL-D targets (Verus models already exist, need FFI wiring):
1. userspace.c (1,039L C → 751L Rust, US1-US8) — syscall validation
2. mem_domain.c (424L C → 554L Rust, MD1-MD6) — spatial isolation
3. heap.c (592L C → 755L Rust, HP1-HP8) — #1 CVE class in embedded
4. MPU config (2,390L) — deferred, arch-specific

Key insight: Gale already has verified models for the entire ASIL-D
safety perimeter. The verification is done — only FFI wiring remains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Apr 24, 2026
…ent 2)

Follow-up to 9460426. The plain #[inline] annotation didn't change the
LLVM LTO output — all builds still reported 10 gale_ surviving symbols
per run 24912237109. Hypothesis #1 from the post-mortem: #[inline] is a
no-op on pub extern "C" fn + #[no_mangle] because #[no_mangle] forces
external linkage and the inline hint loses to that.

#[inline(always)] is a stronger signal — it tells rustc to emit the
body in every CU that calls it, which should survive the linkage
override. If this one still produces 10 symbols, the blocker is on
the C side (Zephyr compile steps not emitting bitcode) or in the
rustc bitcode-emission path (release-lto profile with lto=false may
not actually produce bitcode-only .rlib archives).

Same 4 functions:
  gale_sem_count_give        (ffi/src/lib.rs:323)
  gale_sem_count_take        (ffi/src/lib.rs:349)
  gale_k_sem_give_decide     (ffi/src/lib.rs:396)
  gale_k_sem_take_decide     (ffi/src/lib.rs:444)

cargo check + cargo clippy --all-targets -- -D warnings both PASS.

Refs: #10

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Apr 28, 2026
… optimize mode)

Diagnosed during local debugging: synth's optimized register-allocation
path clobbers r0/r1 (input parameter registers) at function entry when
the wasm body pushes i32 constants before the first local.get. The
function's prologue ends up looking like:

  movs r0, #1     ← clobbers param 0 (count)
  movs r1, #0     ← clobbers param 1 (limit)
  ...
  cmp r0, r1      ← compares clobbered values, not the actual params

This crashes the engine bench in Renode (HardFault on first gale call,
infinite handler loop, never reaches the test's Zero Drops assertion).
The CI run hit a 60-min step timeout without producing a single sample.

Minimal repro saved at /tmp/match_gale.wat (3 i32 params, i64 local,
push 3 i32 constants, then local.get 0). Worth filing as a synth issue
once the experiment lands.

Workaround: synth --no-optimize. Disables the offending pass and emits
proper AAPCS prologue (push r4..r8/lr, locals on stack, params read
from r0/r1/r2 unchanged). Verified locally: same gale_k_sem_give_decide
function now starts with `stmdb sp!, {r4..r8, lr}` and reads r0/r1
correctly.

Cost: ~68 bytes of additional flash (22624 → 22692) and unknown cycles.
The --no-optimize path uses stack-based locals which is wasteful but
correct. Stack frame size also goes up — synth reserves ~4KB per
function for locals which may be excessive; will need to validate
no stack overflow on the engine bench worker thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Apr 30, 2026
Run 25135494876 timed out at step 13 of 27 (120-min budget) because
the bench emitted one row per sensor ISR (~1 kHz) of which only ~5%
carried t_lock/t_post/t_round/t_bcast — controller runs at 100 Hz, so
the matching pair-tag covers 1 in ~10 sensor rows, and the partial CSV
shows only 109 of 2012 emitted rows had a real measurement. The other
95% were near-empty rows starving Renode at the UART.

Two changes, applied together:

1. emit_event returns bool. Rows whose slot has no controller-cycle
   pair-tag (t_lock == 0) are dropped. reader_count++ only when a
   row was actually emitted; reader_skipped tracks the dropped
   sensor-rate rows for visibility. UART traffic falls ~10x.

2. Long sweep trimmed from 27 cells (sensor_hz x contention x payload)
   to 9 cells (sensor_hz=1000 only x 3 contention x 3 payload). The
   sensor_hz=2000 axis was the timeout cause; sensor_hz=500 carries
   identical primitive signal at lower rate. Per-cell sensor budget
   bumped from 150–200 to 1000 so each cell yields ~100 controller-
   tagged rows (samples * 100 / sensor_hz). TOTAL_SAMPLES recomputed
   to 900.

The drain loop is rewritten in controller-rate units: it now waits
for `expected_ctrl = samples * 100 / sensor_hz` rows to land, with a
short 5s drain timeout because the cell already waited budget_ms for
sensor ISRs to retire.

CI timeout bumped 120 -> 180 min. Workflow comment block updated to
match.

Per audit P3 #1 #5 (cycle-delta column names) and the partial-CSV
diagnostic from run 25135494876. Local build clean for both
qemu_cortex_m3 baseline (16,392 B FLASH, 41,488 B RAM) and gale
variant (18,480 B FLASH, 41,488 B RAM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Apr 30, 2026
Three correctness fixes from the Mythos audit (10 personas, 1
fresh-session validator). All three are confirmed by the partial
CSV from run 25135494876 — none of them is hypothetical.

B2 — actuator_done stale-token drain (P1 #1):
   ctrl_loop's K_MSEC(2) timeout path on actuator_done leaves the
   sem token uncollected if the actuator gives later. The next
   cycle's k_sem_take returns 0 immediately, reads a previous-cycle
   g_actuator_done_cyc, and computes
       t_round = old_done_cyc - new_t_post_out
   which underflows to ~2^32 cycles. Add a drain loop before each
   k_msgq_put to flush stale tokens.

B3 — slot collision wrap (P1 #3):
   Bump SLOT_COUNT 512 → 1024 so the per-cell sensor-ISR budget
   (1000 in the trimmed long sweep) cannot wrap within a single
   cell. Cross-cell wrap remains harmless: sweep_driver stops the
   sensor timer and drains the reader between cells, so any
   in-flight controller stamps from cell X land in slots the
   reader has already consumed before cell X+1 starts.
   RAM cost: 5 arrays × 512 × 4 bytes = +10 KB. RAM use 41,488 B
   → 51,728 B (78.9 %); FLASH +52 B.

B4 — emit_ring drops counter (P4 #2):
   ring_buf_put failures into emit_ring were silently dropped.
   gale's potentially-faster sem_give could mean the reader drains
   emit_ring better → fewer dropped emits → more rows in gale's CSV
   than baseline → biased comparison. Add g_emit_drops volatile,
   emit it in the === END === footer, and assert == 0 in the
   analyzer for both variants. Forces both onto the same
   denominator.

Build clean for qemu_cortex_m3 baseline (16,444 B FLASH, 51,728 B
RAM = 78.9 %).

Audit cross-references:
  P1 (Cortex-M RTOS engineer) — B2, B3
  P4 (counter-attacker)       — B4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Apr 30, 2026
The column names (t_round, t_bcast) are inherited from the design
doc, but the actual measurement windows are narrower than the names
suggest:

  - t_round is named "round-trip" but measures only
    controller_post_exit → actuator-0 stamp; it does NOT include the
    controller's post-wake sem_take. It also includes actuator 0's
    cycles_busy=100 busy-loop (same for both variants).
  - t_bcast is named "broadcast" but measures the broadcaster's own
    lock+broadcast+unlock window on the fusion thread; the telemetry
    wake is never sampled.

Per audit P3 #1 #5: a reader who treats these names as stated will
over-attribute the measurement scope. The cheapest fix that protects
publication credibility is to define the columns precisely where the
reader will look — at the top of the analyzer's markdown report and
in main.c's file header. Numbers stay; honest scope-setting is
appended.

CSV column positions are unchanged so engine_control's analyzer
docstring's "strict superset" claim remains true.

Build clean; analyzer parses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request May 1, 2026
…ulator-artifactual (#27)

* experiment: add GALE_USE_SYNTH build mode (wasm → synth → Cortex-M)

When GALE_USE_SYNTH=ON, the gale-ffi crate is compiled to wasm32 first,
then run through the synth AOT compiler (pulseengine/synth) which emits
a Cortex-M ET_REL relocatable object. The object is wrapped into the
same libgale_ffi.a path the rest of the build expects via ar, so the
per-module gale_sem.c / gale_mutex.c / etc. consumers need no changes.

This is the build-system half of the 4th-variant experiment for the
cross-language LTO blog post: same engine bench, three existing builds
(GCC baseline / GCC + Gale / LLVM + LTO + Gale) plus a 4th data point
where verified Rust reaches Cortex-M via Verus → rustc → wasm → synth's
Rocq-proved i32 instruction selection.

Requires synth with --relocatable flag (pulseengine/synth#83).

Default behaviour (GALE_USE_SYNTH=OFF) is unchanged: rustc-direct-to-Cortex-M
is still the production path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add engine-bench-renode-synth workflow for the 4th-variant experiment

Adds the gale-via-synth lane to the engine-bench Renode matrix as a
follow-up to the cross-language LTO post. Builds the GCC baseline and
the GALE_USE_SYNTH=ON variant (wasm32 -> synth -> Cortex-M ET_REL ->
libgale_ffi.a) on the same CI run, then sweeps both through Renode at
the long sample count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+cmake: insert wasm-opt + loom optional passes before synth

Adds two optional formally-verified-(or-not) wasm optimizers between
rustc and synth in the GALE_USE_SYNTH pipeline:

  rustc -> [wasm-opt -Oz] -> [loom optimize] -> synth -> ar -> .a

Both are detected via find_program() and only inserted into the
pipeline if found. If neither is on the path the pipeline reduces
to rustc -> synth (unchanged behaviour).

Effect on engine bench (stm32f4_disco, prj-gale.conf, GCC C kernel):
  synth alone:                  text=22448, total=38533
  synth + wasm-opt + loom:      text=22420, total=38505 (-28 B)

Wasm-level reduction is dramatic (-34% from wasm-opt -Oz), but the
synth-emitted ARM code is dominated by per-function instruction
selection overhead, so the final ELF only moves a few dozen bytes.
The verification-chain story is the bigger win: loom proves each
pass it applies preserves semantics; rejected passes are skipped
rather than applied unsoundly.

CI workflow installs both: binaryen apt package (wasm-opt) and
loom-cli from pulseengine/loom main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: trigger engine-bench-renode-synth on push to experiment branch

workflow_dispatch alone requires the workflow to exist on the
default branch before it can be invoked, which we can't do without
merging the whole experiment first. Adding a push trigger on
experiment/gale-via-synth so the workflow auto-runs whenever the
experiment branch advances. Strip this trigger before any merge
to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: fix cargo install syntax (--git takes package name positional, not --path)

cargo install --git URL --path PATH is invalid — the two flags are
mutually exclusive. When installing a sub-crate from a git workspace,
pass the package name as a positional argument:

  cargo install --git URL [--branch B] PACKAGE_NAME --force

Fixes the workflow's synth-cli install (was rejected with: 'the
argument --git <URL> cannot be used with --path <PATH>') and the
identical pattern used for loom-cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* workaround: synth --no-optimize (regalloc clobbers param registers in optimize mode)

Diagnosed during local debugging: synth's optimized register-allocation
path clobbers r0/r1 (input parameter registers) at function entry when
the wasm body pushes i32 constants before the first local.get. The
function's prologue ends up looking like:

  movs r0, #1     ← clobbers param 0 (count)
  movs r1, #0     ← clobbers param 1 (limit)
  ...
  cmp r0, r1      ← compares clobbered values, not the actual params

This crashes the engine bench in Renode (HardFault on first gale call,
infinite handler loop, never reaches the test's Zero Drops assertion).
The CI run hit a 60-min step timeout without producing a single sample.

Minimal repro saved at /tmp/match_gale.wat (3 i32 params, i64 local,
push 3 i32 constants, then local.get 0). Worth filing as a synth issue
once the experiment lands.

Workaround: synth --no-optimize. Disables the offending pass and emits
proper AAPCS prologue (push r4..r8/lr, locals on stack, params read
from r0/r1/r2 unchanged). Verified locally: same gale_k_sem_give_decide
function now starts with `stmdb sp!, {r4..r8, lr}` and reads r0/r1
correctly.

Cost: ~68 bytes of additional flash (22624 → 22692) and unknown cycles.
The --no-optimize path uses stack-based locals which is wasteful but
correct. Stack frame size also goes up — synth reserves ~4KB per
function for locals which may be excessive; will need to validate
no stack overflow on the engine bench worker thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: install synth from fix/synth-i64-locals-and-frame to pick up i64+frame fixes

The previous CI run booted Zephyr with the synth-built ELF but every
RPM step reported count=0 [drain_timeout]. Diagnosed two synth bugs:

1. i64 local storage dropped the upper half (--no-optimize path)
2. Locals area aliased the saved-register spill (also --no-optimize)

Both fixed in pulseengine/synth#85. This commit points the workflow
at that branch so the next CI run uses the fixed synth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: re-trigger with synth fix/synth-i64-locals-and-frame@b8da214

Pulls in:
- explicit I64Or/And/Xor/ExtendI32U/ExtendI32S/Shl/ShrU/ShrS handlers
  in synth's select_with_stack (no more wildcard fallthrough to
  select_default's R0:R1/R2:R3 assumption)
- alloc_consecutive_pair now reserves implicit pair_hi of every
  stack entry plus extra_avoid for popped operands

Local build verified: gale_k_sem_give_decide ends with
  orr r0, r6, r8
  orr r1, r7, ip
matching the wasm i64.or semantics. 22644 B FLASH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): skip loom install to A/B test cycle delta source

Tracing the −34.5% handoff cycle delta in the synth bench. Found that
loom's optimizer hoists `local.set 3 = 0` from the fall-through arm
of gale_k_sem_give_decide to BEFORE the dispatcher, dropping the
WakeThread/Increment distinction at the wasm level — synth then emits
ARM that always returns action=INCREMENT regardless of has_waiter.

The bench passes (samples=7750, drops=0) because the engine_control
worker is rarely actually blocked at sem_give time, so the WAKE path
is rarely needed for correctness. But the cycle delta is then
comparing a degenerate always-INCREMENT path to rustc's correct
WAKE/INCREMENT discrimination — apples to oranges.

This run skips loom in CI so we can A/B against the loom-on result and
validate the hypothesis. CMakeLists' find_program(LOOM) fails when
loom isn't on PATH, falling through to wasm-opt -> synth without loom.

Filed for follow-up: pulseengine/loom optimizer bug. The hoisting is
unsound for this control-flow pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): A/B same ELFs under Renode nightly to attribute -34.5% delta

The container ships Renode 1.16.0 (ARG RENODE_VERSION=1.16.0 in
zephyrproject-rtos/docker-image Dockerfile.base, unchanged across
v0.28.x..v0.29.2). 1.16.1 (Feb 2026) touches several Cortex-M paths
that could shift cycle accounting on the gale instruction stream:
fixed ARMv8-M Thumb2 data-processing instructions, fixed Stack
Pointer bits[1:0] handling, fixed wrong exception when FPU is
disabled. None is labelled a cycle-counter fix in the changelog,
but Thumb-2 dispatch changes shift cycle accounting whenever an
instruction takes a different micro-op path.

Adds nightly Renode (builds.renode.io) alongside 1.16.0 and runs
the same two ELFs under both. Yields three controls:

  (a) baseline vs gale, both under nightly
      — does the gale delta change when the cycle model changes?
  (b) baseline_1.16.0 vs baseline_nightly (same ELF)
      — control: cycle-model drift on identical instructions.
  (c) gale_1.16.0 vs gale_nightly (same ELF)
      — does the model shift gale's instructions more than baseline's?
        If yes, the 1.16.0 model is mis-scoring gale-specific
        instructions and the delta is partly artifactual.

Implementation: PATH override puts /opt/renode-nightly first for the
two new run steps. Robot file unchanged (it reads ELF / BENCH_CSV_OUT
from env). Existing 1.16.0 comparison is undisturbed; nightly outputs
go to events-nightly.csv and a separate report section.

Timeout bumped 120 -> 240 min to cover all four Renode runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* synth CI: B6 (rustc-direct cross-check) + D6 (manifest.txt)

Two audit-driven additions to the synth Renode A/B:

B6 — synth correctness control (audit P7, DO-330 §6.1.4):
   The headline -34.5% gale-vs-baseline delta has been ruled non-
   simulator-artifactual by the cycle-model A/B (controls a/b/c
   showed 0.0% drift on identical ELFs across Renode 1.16.0 and
   nightly). What remains is "could synth be miscompiling in a way
   that produces semantically-equivalent-but-faster code?" — a
   research compiler's normal failure mode.

   Adds a third gale build via rustc-direct (no synth in the chain;
   default GALE_USE_SYNTH=OFF path), runs it under 1.16.0, and
   renders an additional comparison "synth vs rustc-direct" in the
   step summary. If the two agree on handoff median within ~3%, the
   synth-emitted ELF is a faithful Cortex-M codegen of the same
   Rust source. This is the cheapest possible "tool error
   mitigation" in the DO-330 §6.1.4 sense — a redundant computation
   by a battle-tested compiler with the same input contract.

   Total Renode runs in this workflow: 5
   (baseline 1.16.0, synth 1.16.0, rustc-direct 1.16.0, baseline
   nightly, synth nightly). timeout-minutes bumped 240 -> 300.

D6 — manifest.txt artifact (audit P9):
   Single point of provenance. The bench's input layer is layered
   with branch-tip pointers (synth --branch fix/..., zephyr --mr
   gale/sem-replacement) and mutable tags (container :v0.29.0).
   Without a captured manifest, "I ran exactly this configuration"
   reduces to "trust the gale_sha and hope nothing else moved" —
   wrong by construction.

   Adds a "Compose build manifest" step at the end of the job that
   captures: rustc/cargo/synth/wasm-opt/west/robotframework/SDK
   versions, Renode 1.16.0 and nightly versions, sha256 of the
   nightly tarball, sha256 of all built ELFs, sha256 + byte-size of
   all CSV outputs, and `west list` for Zephyr fork + module shas.
   Output to /tmp/manifest.txt, included in the artifact bundle
   alongside the three ELFs themselves. The nightly tarball is
   downloaded to /tmp/renode-nightly.tgz first so the sha256 can be
   recorded (was: piped curl -> tar with no intermediate file).

   Six months from now, every reproducibility claim reduces to
   downloading the artifact, opening manifest.txt, and re-resolving
   the three sha-pinned things plus the listed branch tips.

Both YAMLs validated; no behavior changes to existing arms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request May 1, 2026
…28)

* experiment: add GALE_USE_SYNTH build mode (wasm → synth → Cortex-M)

When GALE_USE_SYNTH=ON, the gale-ffi crate is compiled to wasm32 first,
then run through the synth AOT compiler (pulseengine/synth) which emits
a Cortex-M ET_REL relocatable object. The object is wrapped into the
same libgale_ffi.a path the rest of the build expects via ar, so the
per-module gale_sem.c / gale_mutex.c / etc. consumers need no changes.

This is the build-system half of the 4th-variant experiment for the
cross-language LTO blog post: same engine bench, three existing builds
(GCC baseline / GCC + Gale / LLVM + LTO + Gale) plus a 4th data point
where verified Rust reaches Cortex-M via Verus → rustc → wasm → synth's
Rocq-proved i32 instruction selection.

Requires synth with --relocatable flag (pulseengine/synth#83).

Default behaviour (GALE_USE_SYNTH=OFF) is unchanged: rustc-direct-to-Cortex-M
is still the production path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add engine-bench-renode-synth workflow for the 4th-variant experiment

Adds the gale-via-synth lane to the engine-bench Renode matrix as a
follow-up to the cross-language LTO post. Builds the GCC baseline and
the GALE_USE_SYNTH=ON variant (wasm32 -> synth -> Cortex-M ET_REL ->
libgale_ffi.a) on the same CI run, then sweeps both through Renode at
the long sample count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci+cmake: insert wasm-opt + loom optional passes before synth

Adds two optional formally-verified-(or-not) wasm optimizers between
rustc and synth in the GALE_USE_SYNTH pipeline:

  rustc -> [wasm-opt -Oz] -> [loom optimize] -> synth -> ar -> .a

Both are detected via find_program() and only inserted into the
pipeline if found. If neither is on the path the pipeline reduces
to rustc -> synth (unchanged behaviour).

Effect on engine bench (stm32f4_disco, prj-gale.conf, GCC C kernel):
  synth alone:                  text=22448, total=38533
  synth + wasm-opt + loom:      text=22420, total=38505 (-28 B)

Wasm-level reduction is dramatic (-34% from wasm-opt -Oz), but the
synth-emitted ARM code is dominated by per-function instruction
selection overhead, so the final ELF only moves a few dozen bytes.
The verification-chain story is the bigger win: loom proves each
pass it applies preserves semantics; rejected passes are skipped
rather than applied unsoundly.

CI workflow installs both: binaryen apt package (wasm-opt) and
loom-cli from pulseengine/loom main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: trigger engine-bench-renode-synth on push to experiment branch

workflow_dispatch alone requires the workflow to exist on the
default branch before it can be invoked, which we can't do without
merging the whole experiment first. Adding a push trigger on
experiment/gale-via-synth so the workflow auto-runs whenever the
experiment branch advances. Strip this trigger before any merge
to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: fix cargo install syntax (--git takes package name positional, not --path)

cargo install --git URL --path PATH is invalid — the two flags are
mutually exclusive. When installing a sub-crate from a git workspace,
pass the package name as a positional argument:

  cargo install --git URL [--branch B] PACKAGE_NAME --force

Fixes the workflow's synth-cli install (was rejected with: 'the
argument --git <URL> cannot be used with --path <PATH>') and the
identical pattern used for loom-cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* workaround: synth --no-optimize (regalloc clobbers param registers in optimize mode)

Diagnosed during local debugging: synth's optimized register-allocation
path clobbers r0/r1 (input parameter registers) at function entry when
the wasm body pushes i32 constants before the first local.get. The
function's prologue ends up looking like:

  movs r0, #1     ← clobbers param 0 (count)
  movs r1, #0     ← clobbers param 1 (limit)
  ...
  cmp r0, r1      ← compares clobbered values, not the actual params

This crashes the engine bench in Renode (HardFault on first gale call,
infinite handler loop, never reaches the test's Zero Drops assertion).
The CI run hit a 60-min step timeout without producing a single sample.

Minimal repro saved at /tmp/match_gale.wat (3 i32 params, i64 local,
push 3 i32 constants, then local.get 0). Worth filing as a synth issue
once the experiment lands.

Workaround: synth --no-optimize. Disables the offending pass and emits
proper AAPCS prologue (push r4..r8/lr, locals on stack, params read
from r0/r1/r2 unchanged). Verified locally: same gale_k_sem_give_decide
function now starts with `stmdb sp!, {r4..r8, lr}` and reads r0/r1
correctly.

Cost: ~68 bytes of additional flash (22624 → 22692) and unknown cycles.
The --no-optimize path uses stack-based locals which is wasteful but
correct. Stack frame size also goes up — synth reserves ~4KB per
function for locals which may be excessive; will need to validate
no stack overflow on the engine bench worker thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: install synth from fix/synth-i64-locals-and-frame to pick up i64+frame fixes

The previous CI run booted Zephyr with the synth-built ELF but every
RPM step reported count=0 [drain_timeout]. Diagnosed two synth bugs:

1. i64 local storage dropped the upper half (--no-optimize path)
2. Locals area aliased the saved-register spill (also --no-optimize)

Both fixed in pulseengine/synth#85. This commit points the workflow
at that branch so the next CI run uses the fixed synth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: re-trigger with synth fix/synth-i64-locals-and-frame@b8da214

Pulls in:
- explicit I64Or/And/Xor/ExtendI32U/ExtendI32S/Shl/ShrU/ShrS handlers
  in synth's select_with_stack (no more wildcard fallthrough to
  select_default's R0:R1/R2:R3 assumption)
- alloc_consecutive_pair now reserves implicit pair_hi of every
  stack entry plus extra_avoid for popped operands

Local build verified: gale_k_sem_give_decide ends with
  orr r0, r6, r8
  orr r1, r7, ip
matching the wasm i64.or semantics. 22644 B FLASH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): skip loom install to A/B test cycle delta source

Tracing the −34.5% handoff cycle delta in the synth bench. Found that
loom's optimizer hoists `local.set 3 = 0` from the fall-through arm
of gale_k_sem_give_decide to BEFORE the dispatcher, dropping the
WakeThread/Increment distinction at the wasm level — synth then emits
ARM that always returns action=INCREMENT regardless of has_waiter.

The bench passes (samples=7750, drops=0) because the engine_control
worker is rarely actually blocked at sem_give time, so the WAKE path
is rarely needed for correctness. But the cycle delta is then
comparing a degenerate always-INCREMENT path to rustc's correct
WAKE/INCREMENT discrimination — apples to oranges.

This run skips loom in CI so we can A/B against the loom-on result and
validate the hypothesis. CMakeLists' find_program(LOOM) fails when
loom isn't on PATH, falling through to wasm-opt -> synth without loom.

Filed for follow-up: pulseengine/loom optimizer bug. The hoisting is
unsound for this control-flow pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(investigate): A/B same ELFs under Renode nightly to attribute -34.5% delta

The container ships Renode 1.16.0 (ARG RENODE_VERSION=1.16.0 in
zephyrproject-rtos/docker-image Dockerfile.base, unchanged across
v0.28.x..v0.29.2). 1.16.1 (Feb 2026) touches several Cortex-M paths
that could shift cycle accounting on the gale instruction stream:
fixed ARMv8-M Thumb2 data-processing instructions, fixed Stack
Pointer bits[1:0] handling, fixed wrong exception when FPU is
disabled. None is labelled a cycle-counter fix in the changelog,
but Thumb-2 dispatch changes shift cycle accounting whenever an
instruction takes a different micro-op path.

Adds nightly Renode (builds.renode.io) alongside 1.16.0 and runs
the same two ELFs under both. Yields three controls:

  (a) baseline vs gale, both under nightly
      — does the gale delta change when the cycle model changes?
  (b) baseline_1.16.0 vs baseline_nightly (same ELF)
      — control: cycle-model drift on identical instructions.
  (c) gale_1.16.0 vs gale_nightly (same ELF)
      — does the model shift gale's instructions more than baseline's?
        If yes, the 1.16.0 model is mis-scoring gale-specific
        instructions and the delta is partly artifactual.

Implementation: PATH override puts /opt/renode-nightly first for the
two new run steps. Robot file unchanged (it reads ELF / BENCH_CSV_OUT
from env). Existing 1.16.0 comparison is undisturbed; nightly outputs
go to events-nightly.csv and a separate report section.

Timeout bumped 120 -> 240 min to cover all four Renode runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight_control: macro-bench skeleton (Phases 1-4)

Add the six-thread / two-timer / five-primitive macro benchmark
described in docs/research/macro-bench-design.md. Composes ring_buf +
sem + mutex + msgq + condvar on a 100 Hz fixed-rate flight-control
loop, capturing per-sensor-ISR algo + handoff (engine_control parity)
plus per-controller-period t_lock, t_post, t_round, t_bcast.

Single CSV row per sensor sample with -1 sentinels for the segments
not measured on that row (~9-of-10 cycles have no t_bcast, ~9-of-10
sensor rows have no t_lock/t_post/t_round). 3-axis sweep
(sensor_hz x contention x payload) totalling ~4500 events on the
long sweep, matching engine_control's Renode lane density.

Verified:
  - Builds clean for qemu_cortex_m3 (baseline + gale variants).
  - QEMU smoke run: 150/150 samples, drops=0, telemetry_emits=11
    (priority inheritance keeps the lowest-priority telemetry thread
    alive under fusion/actuator contention).
  - All four new cycle-delta segments populate as expected.

Two-ring split (sensor_ring -> fusion -> emit_ring -> reader) avoids
the single-sem race where reader_loop and the fusion thread would
otherwise compete for sensor_data_ready and steal samples from each
other. Reader thread runs at priority 10 (below all workers) so its
UART back-pressure never starves the measured chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight_control: analyzer + QEMU runner + Renode robot (Phase 5a)

Add analyze.py extending engine_control's per-step + Mann-Whitney
shape with four new metric columns (t_lock, t_post, t_round,
t_bcast). Negative values in the new columns are the "not measured
on this row" sentinel and are filtered out per metric.

New asserts beyond the engine_control set:
  - telemetry_emits > 0 on both variants (design doc Section
    "Risks": priority-inheritance must keep the lowest-priority
    telemetry thread off the starvation floor)
  - gale p99 <= 2 * baseline p99 on each of t_lock, t_post,
    t_round, t_bcast (one regression guard per primitive segment)

run_qemu_bench.sh + tag_events.py mirror engine_control's shape
1:1 (same env conventions, same per-run-id tagging). Renode robot
file is engine_stm32f4.robot with the wait line updated to match
the macro bench's startup banner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add engine-bench-renode-flight workflow (Phase 5b)

Modeled on engine-bench-renode-synth.yml, runs the new macro
benchmark on stm32f4_disco under Renode for the long sweep
(~4500 events). Same variant matrix (baseline + gale), same
artifact upload shape, same MD report rendered into the job
summary.

Triggered on push to experiment/macro-bench-flight-control and
manually via workflow_dispatch. Uses only safe GitHub contexts
(github.workspace, github.ref) — no untrusted inputs flow into
shell commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight: emit controller-rate rows only; trim long sweep to 9 cells

Run 25135494876 timed out at step 13 of 27 (120-min budget) because
the bench emitted one row per sensor ISR (~1 kHz) of which only ~5%
carried t_lock/t_post/t_round/t_bcast — controller runs at 100 Hz, so
the matching pair-tag covers 1 in ~10 sensor rows, and the partial CSV
shows only 109 of 2012 emitted rows had a real measurement. The other
95% were near-empty rows starving Renode at the UART.

Two changes, applied together:

1. emit_event returns bool. Rows whose slot has no controller-cycle
   pair-tag (t_lock == 0) are dropped. reader_count++ only when a
   row was actually emitted; reader_skipped tracks the dropped
   sensor-rate rows for visibility. UART traffic falls ~10x.

2. Long sweep trimmed from 27 cells (sensor_hz x contention x payload)
   to 9 cells (sensor_hz=1000 only x 3 contention x 3 payload). The
   sensor_hz=2000 axis was the timeout cause; sensor_hz=500 carries
   identical primitive signal at lower rate. Per-cell sensor budget
   bumped from 150–200 to 1000 so each cell yields ~100 controller-
   tagged rows (samples * 100 / sensor_hz). TOTAL_SAMPLES recomputed
   to 900.

The drain loop is rewritten in controller-rate units: it now waits
for `expected_ctrl = samples * 100 / sensor_hz` rows to land, with a
short 5s drain timeout because the cell already waited budget_ms for
sensor ISRs to retire.

CI timeout bumped 120 -> 180 min. Workflow comment block updated to
match.

Per audit P3 #1 #5 (cycle-delta column names) and the partial-CSV
diagnostic from run 25135494876. Local build clean for both
qemu_cortex_m3 baseline (16,392 B FLASH, 41,488 B RAM) and gale
variant (18,480 B FLASH, 41,488 B RAM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight: audit fixes B2/B3/B4 (stale-token drain, slot wrap, emit_drops)

Three correctness fixes from the Mythos audit (10 personas, 1
fresh-session validator). All three are confirmed by the partial
CSV from run 25135494876 — none of them is hypothetical.

B2 — actuator_done stale-token drain (P1 #1):
   ctrl_loop's K_MSEC(2) timeout path on actuator_done leaves the
   sem token uncollected if the actuator gives later. The next
   cycle's k_sem_take returns 0 immediately, reads a previous-cycle
   g_actuator_done_cyc, and computes
       t_round = old_done_cyc - new_t_post_out
   which underflows to ~2^32 cycles. Add a drain loop before each
   k_msgq_put to flush stale tokens.

B3 — slot collision wrap (P1 #3):
   Bump SLOT_COUNT 512 → 1024 so the per-cell sensor-ISR budget
   (1000 in the trimmed long sweep) cannot wrap within a single
   cell. Cross-cell wrap remains harmless: sweep_driver stops the
   sensor timer and drains the reader between cells, so any
   in-flight controller stamps from cell X land in slots the
   reader has already consumed before cell X+1 starts.
   RAM cost: 5 arrays × 512 × 4 bytes = +10 KB. RAM use 41,488 B
   → 51,728 B (78.9 %); FLASH +52 B.

B4 — emit_ring drops counter (P4 #2):
   ring_buf_put failures into emit_ring were silently dropped.
   gale's potentially-faster sem_give could mean the reader drains
   emit_ring better → fewer dropped emits → more rows in gale's CSV
   than baseline → biased comparison. Add g_emit_drops volatile,
   emit it in the === END === footer, and assert == 0 in the
   analyzer for both variants. Forces both onto the same
   denominator.

Build clean for qemu_cortex_m3 baseline (16,444 B FLASH, 51,728 B
RAM = 78.9 %).

Audit cross-references:
  P1 (Cortex-M RTOS engineer) — B2, B3
  P4 (counter-attacker)       — B4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight: B1 — column-semantics block in main.c + analyzer report

The column names (t_round, t_bcast) are inherited from the design
doc, but the actual measurement windows are narrower than the names
suggest:

  - t_round is named "round-trip" but measures only
    controller_post_exit → actuator-0 stamp; it does NOT include the
    controller's post-wake sem_take. It also includes actuator 0's
    cycles_busy=100 busy-loop (same for both variants).
  - t_bcast is named "broadcast" but measures the broadcaster's own
    lock+broadcast+unlock window on the fusion thread; the telemetry
    wake is never sampled.

Per audit P3 #1 #5: a reader who treats these names as stated will
over-attribute the measurement scope. The cheapest fix that protects
publication credibility is to define the columns precisely where the
reader will look — at the top of the analyzer's markdown report and
in main.c's file header. Numbers stay; honest scope-setting is
appended.

CSV column positions are unchanged so engine_control's analyzer
docstring's "strict superset" claim remains true.

Build clean; analyzer parses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight CI: D6 — manifest.txt artifact (single point of provenance)

Per audit P9: a reader downloading flight-bench-renode-long.zip six
months from now should be able to identify exactly which Renode,
Zephyr fork, rustc, SDK, and gale_sha produced the cycles. Without
this, "I ran exactly this configuration" reduces to "trust the
gale_sha and hope nothing else moved" — but every input below the
gale repo (Zephyr fork at branch tip, container at mutable tag,
rustc on stable channel) is a moving target.

Adds one new step "Compose build manifest" right before the upload
step. The manifest captures: rustc / cargo / west / robotframework /
SDK versions, Renode 1.16.0 version, Zephyr fork + modules sha via
`west list`, and sha256 + byte-size of every built ELF and emitted
CSV. Output goes to /tmp/manifest.txt and is included in the
artifact bundle. Both ELFs are also uploaded so binary-level
reproducibility can be verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* flight: B5 — moving-block bootstrap CI on pooled p99 + multi-comparison note

Per audit P4 #6 / P2 F1 / P2 F2: the flight bench reports per-step
medians with naive-bootstrap CI and pooled p99 as a point estimate
with no CI. Two related issues:

1. The bench's samples are consecutive 100 Hz controller cycles —
   autocorrelated. A slow noise burst contaminates 5–10 consecutive
   samples; naive bootstrap underestimates the CI by treating
   dependent samples as independent. Politis-Romano predict the
   correction factor is ~sqrt((1+ρ)/(1-ρ)) for first-order auto-
   correlation.
2. The Mann-Whitney p-values are reported uncorrected across
   162 simultaneous tests (27 cells × 6 metrics). At α=0.05 under H0
   that yields ~8 false-positive cells by chance; a reader scanning
   the per-step table for "where did gale win?" will pick those up
   as signal.

Fixes:
- New helper `block_bootstrap_percentile_ci` (block_size=10,
  iters=2000). Used for pooled p50/p75/p95/p99 in the per-metric
  pooled tables. Per-step medians keep the naive bootstrap (median
  is robust to autocorrelation; the issue is tails, not central
  tendency).
- One-paragraph note above the pooled tables explains the bootstrap
  choice and points readers at Holm-Bonferroni / BH-FDR for the
  per-step MW-U cells.

Smoke-tested against engine_control's events.csv (different schema
so 0 rows, but the report header + column-semantics block render
correctly). Block bootstrap on synthetic xs=range(100) gives p99
CI [66, 98] for point=98 — wider than naive, as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request May 10, 2026
…CAL_DECLARATION

LTO crash root-caused (NOTES-llvm-lto-crash.md, b48a81a) and fixed.
The LLVM linker plugin's whole-program LTO partitioning evicts the
ADC1_2 IRQ-18 vector handler when CONFIG_ISR_TABLES_LOCAL_DECLARATION=y;
the chip then takes a spurious IRQ 18 during boot and halts. Removing
the LOCAL_DECLARATION bit restores conventional Zephyr ISR-table layout
that LLVM handles cleanly.

This commit adds:

  - benches/engine_control/silicon/boards/nucleo_g474re/prj-lto-no-isr-local.conf
    (Kconfig overlay enabling LTO without the LOCAL_DECLARATION trigger)

  - benches/engine_control/silicon/runs/2026-05-10-nucleo_g474re-f6f61281-gale-lto-systick/
    (the publication-grade LTO+ADC=y capture: 7321 events received,
    drops=0, sentinel ✅, handoff median 558 cyc with 99.7% stability)

Build invocation:

  west build -p always -b nucleo_g474re -d /tmp/silicon-lto-adc \
    -s gale-smart-data/benches/engine_control -- \
      -DZEPHYR_TOOLCHAIN_VARIANT=llvm \
      -DCMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY \
      -DZEPHYR_EXTRA_MODULES=<gale-smart-data> \
      -DOVERLAY_CONFIG="<gale-smart-data>/benches/engine_control/prj-gale.conf;<gale-smart-data>/benches/engine_control/silicon/boards/nucleo_g474re/prj-lto-no-isr-local.conf" \
      -DCMAKE_EXE_LINKER_FLAGS="-L<arm-zephyr-eabi libgcc> -L<picolibc>" \
      -DENGINE_BENCH_SWEEP=long

  PATH must include /opt/homebrew/opt/llvm@21/bin and /opt/homebrew/opt/lld@21/bin
  (LLVM 21.1.8 + lld 21.1.8 to match rustc 1.94.1's LLVM major version).

Final silicon timing matrix (all gale@f6f61281 / 8a4d817 / 418c6b8,
NUCLEO-G474RE @ 170 MHz, drops=0, 99.7% per-step stability):

                        ADC=y     ADC=n
  baseline (no Gale)     506       528
  gale rustc-direct      582       574    ← FFI seam: +46 to +76 cyc
  gale wasm-synth        582       n/a    (bit-identical to rustc-direct)
  gale LLVM-LTO          558       471    ← LTO recovers part / all of seam

LTO impact, summarized:
  - With ADC=y: LTO recovers 24 of the 76 cyc FFI penalty (582→558).
                Still +52 above baseline; the ADC subsystem in the
                LTO partition apparently affects code layout in ways
                that prevent full inlining recovery.
  - With ADC=n: LTO recovers ALL 46 cyc FFI penalty AND beats baseline
                by 57 cyc (574→471, baseline 528). The verified Rust
                decision logic, once inlined and dedup'd against the C
                bound-check, is measurably tighter than stock Zephyr.
  - LLVM truly inlined gale_k_sem_give_decide (symbol absent from LTO
    ELF, decision logic became 3-instruction `cmp r2,r1; it cc; addcc r2,#1`
    inside z_impl_k_sem_give per disassembly verification).

Renode CI's claim that LTO ≈ rustc-direct (both -2.0%) is inverted
on silicon: at ADC=n, silicon LTO is -10.8% vs baseline (vs +8.7%
for rustc-direct at the same axis); at ADC=y, silicon LTO is +10.3%
above baseline (vs +15.0% for rustc-direct). The TB cost model
under-counts cross-language inlining, and the ADC peripheral's
interaction with LTO partitioning is a real silicon-only effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
avrabe added a commit that referenced this pull request May 10, 2026
Pushed the wasm-cross-LTO experiment all the way to a buildable bench
ELF integrated via wasm-ld+arm-ar+linker-substitute. Discovered an
additional synth backend bug while attempting silicon measurement:

  synth's emitted memset/memcpy/memmove don't terminate correctly on
  Zephyr's startup `memset(bss, 0, sizeof(bss))` invocation. The chip
  hangs in memset+0x4c forever, bouncing between offsets 0x668 and
  0x67e in a tight inner loop. The synth disassembly reveals i64
  shift instructions (`subs.w r3, r2, #32; rsb r3, r2, #32;
  lsl.w r3, r1, r3`) lowered into what should be a byte-counter loop
  — same root cause as the u64-packed FFI return codegen issue
  documented earlier: synth's i64 codegen is incomplete.

End-to-end status:

  - wasm-ld static-merging: WORKS. shim.wasm.o + libgale_ffi.a → 1MB
    merged.wasm with z_impl_k_sem_give and gale_k_sem_give_decide both
    present.

  - synth inlining at merged-module scope: STRUCTURALLY WORKS. The
    output `z_impl_k_sem_give` body has zero bl gale_k_sem_give_decide
    instructions. Verified by disassembly. 138 bytes vs LLVM-LTO's
    82 bytes — 1.68x larger but inlined.

  - Bench integration: BUILDS. CMake bench builds with
    -DGALE_WASM_LTO_OVERRIDE_SEM_GIVE=1 + custom libgale_ffi.a +
    --allow-multiple-definition. Final ELF 219 KB FLASH, 66 KB RAM.

  - Chip boot: BLOCKED. PC stuck in synth-emitted memset. Workarounds
    via objcopy --weaken-symbol, --strip-symbol, --redefine-sym all
    failed to evict synth's broken memset bytes from the final ELF.

Three synth backend issues filed against pulseengine/synth, ordered:

  1. (blocker) memset/memcpy/memmove i64-codegen non-termination —
     prevents the merged-wasm bench from booting at all.
  2. u64-packed FFI return unpacking — ~50% of the LTO-parity size
     delta. Same i64-codegen root cause as #1.
  3. wasm linear-memory access lowering — ~20% of the size delta.
     Cosmetic compared to #1 and #2.

Plus one issue against pulseengine/loom:

  - Z3 SortDiffers panic in inline_functions pass on i64-heavy
    wasm modules. Without loom, the verified-LTO claim doesn't hold.

The structural claim — "wasm-cross-LTO via PulseEngine pipeline
dissolves the C↔Rust seam at wasm IR level" — is **proven by
disassembly**. The cyclical claim — "silicon timing matches LLVM-LTO"
— is **blocked on synth's memset codegen**. Neither is a fundamental
architectural barrier; both are well-scoped engineering work.

This commit only updates the NOTES with the integration findings.
The bench source is restored to clean state (the gale_sem.c
#ifndef edit was transient) and verified building unchanged at
27 KB FLASH at the canonical rustc-direct path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant