experiment(smp): bisect step 2 — also disable CPU_MASK + IPI (do not merge) #41

Draft
avrabe wants to merge 4 commits into main from bisect/smp-shim-no-ipi-cpumask

avrabe commented May 9, 2026

Not for merge — CI signal only

Bisect step 2 for the qemu_x86_64 SMP AP-startup hang. Stacked on PR #38.

Step 1 outcome (PR #38, `b4f97a5`): disabling `CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE` made `smp_semaphore` go green; `smp_mutex` and `smp_threads` stayed red.

Step 2 (this branch, `d111f95`): also disable `CONFIG_GALE_KERNEL_CPU_MASK` and `CONFIG_GALE_KERNEL_IPI`. This is a binary-search collapse: if `smp_mutex` and `smp_threads` both pass, the remaining culprit is in {CPU_MASK, IPI} and a follow-up re-enables them one at a time.

Reading the result:

  • both pass → next bisect: re-enable CPU_MASK alone and see which test, if any, goes red again.
  • exactly one passes → the passing test's hang was gated by {CPU_MASK, IPI}; the other primitive's hang has a different root cause.
  • both still fail → fourth source we haven't named (compiler codegen, gale_overlay.conf itself, or x86 boot path outside Gale).
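
The three outcomes above can be written as a small lookup. This is only an illustrative sketch: the function name `read_step2_result` and the `NextStep` enum are hypothetical and do not exist in the repository; the test names and the three branches come from the list above.

```python
from enum import Enum


class NextStep(Enum):
    """Follow-up actions for bisect step 2 (hypothetical names)."""
    LOCALIZE_WITHIN_PAIR = "re-enable CPU_MASK alone to localize within {CPU_MASK, IPI}"
    SPLIT_ROOT_CAUSES = "passing test was gated by {CPU_MASK, IPI}; debug the other separately"
    LOOK_ELSEWHERE = "culprit is outside the three disabled shims"


def read_step2_result(smp_mutex_green: bool, smp_threads_green: bool) -> NextStep:
    """Map the CI status of the two still-red tests to the next action."""
    if smp_mutex_green and smp_threads_green:
        return NextStep.LOCALIZE_WITHIN_PAIR
    if smp_mutex_green or smp_threads_green:
        return NextStep.SPLIT_ROOT_CAUSES
    return NextStep.LOOK_ELSEWHERE
```

For example, `read_step2_result(True, False)` yields `NextStep.SPLIT_ROOT_CAUSES`: `smp_mutex` was gated by one of the two shims, while `smp_threads` needs its own investigation.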

avrabe and others added 4 commits May 3, 2026 09:06
Three SMP tests on qemu_x86_64 (smp_semaphore, smp_mutex, smp_threads)
have been failing in the libgale_ffi.a build step with:

  warning: linking module flags 'Code Model':
    IDs have conflicting values: 'i32 2' from , and 'i32 1' from gale_ffi...
  error: failed to load bitcode of module "core-...rcgu.o"

Root cause: the precompiled `core` rustlib for x86_64-unknown-none
ships bitcode tagged with code-model=kernel (=2, the target's default).
We pass -Ccode-model=small (=1) so addresses match Zephyr's lower-2GB
kernel mapping. With ffi/Cargo.toml's release profile `lto = true`
(fat LTO), rustc tries to merge the two bitcode modules and LLVM
rejects the link.

Fix: switch x86 to the release-lto profile (lto=false), which already
exists for cross-language LTO. rustc then emits a regular static
archive without doing fat LTO, so no bitcode merge is attempted. All
other release tuning (opt-level=z, codegen-units=1, panic=abort,
overflow-checks=true) is preserved via `inherits = "release"`.

Cortex-M targets are unaffected — the change is scoped inside
`if(CONFIG_X86)`. Their precompiled core ships with a matching
code-model, so fat LTO continues to work there. Reproduced the
failure locally with rustc 1.95.0 + the exact CI cargo invocation;
confirmed the fix builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous fix on this branch (release-lto profile) only resolved
the bitcode-merge half of the problem. CI showed it built libgale_ffi.a
clean but then failed at link with a different error:

    ld.bfd: discarded output section: `.got.plt'
    ld.bfd: final link failed
    collect2: error: ld returned 1 exit status

Both errors share one root cause: rustup's precompiled `core`
rustlib for x86_64-unknown-none was built with the target's defaults
(code-model=kernel, PIC), which don't match the flags Zephyr's kernel
needs (code-model=small, static relocation, no PIE).

  - Fat LTO (release profile) → bitcode merge rejects code-model
    mismatch.
  - lto=false (release-lto profile) → bitcode merge avoided, but the
    precompiled core's GOT/PLT references hit the non-PIC link script
    and ld.bfd refuses to discard the now-non-empty .got.plt section.

Proper fix: rebuild `core` from source with our flags via -Zbuild-std,
so the rebuilt core picks up RUSTFLAGS=-Ccode-model=small
-Crelocation-model=static and is both bitcode-compatible AND PIC-free.
Reverts to the original release profile (fat LTO works again) and
adds:

  - GALE_CARGO_EXTRA_ARGS = "-Zbuild-std=core" (threaded into all 42
    cargo build invocations in CMakeLists.txt)
  - RUSTC_BOOTSTRAP=1 in GALE_CARGO_ENV (to permit unstable cargo
    flag on stable rustc)
  - rustup component add rust-src in the SMP test job

Scoped inside if(CONFIG_X86); Cortex-M codegen is byte-identical to
before.
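
To make the x86-only scoping concrete, here is a rough Python model of how the extra cargo argument and environment variable might be assembled. The real logic lives in CMakeLists.txt and is not shown here, so the function `cargo_invocation`, its parameters, and the base `cargo build --release --target ...` command line are assumptions for illustration. Only `-Zbuild-std=core`, `RUSTC_BOOTSTRAP=1`, and the two `-C` flags come from the message above. Whether RUSTFLAGS is actually set only on x86 is also an assumption.

```python
def cargo_invocation(target, is_x86, base_env=None):
    """Return (args, env) for one cargo build (hypothetical helper).

    Only x86 gets -Zbuild-std=core and RUSTC_BOOTSTRAP=1, mirroring the
    if(CONFIG_X86) scoping described in the commit message.
    """
    env = dict(base_env or {})
    # Assumed base command; the actual CI invocation is not shown in this PR.
    args = ["cargo", "build", "--release", "--target", target]
    if is_x86:
        # Rebuild `core` from source so it picks up the flags below.
        args.append("-Zbuild-std=core")
        # Permits the unstable cargo flag on a stable toolchain.
        env["RUSTC_BOOTSTRAP"] = "1"
        # Assumption: these flags are exported via RUSTFLAGS for x86 only.
        env["RUSTFLAGS"] = "-Ccode-model=small -Crelocation-model=static"
    return args, env
```

Calling `cargo_invocation("x86_64-unknown-none", True)` adds the flag and the environment variable. A Cortex-M target such as `thumbv7em-none-eabihf` (an illustrative target name, not taken from this PR) gets the unmodified command.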

Also addresses the root-cause comment in zephyr-tests.yml
about the SMP build failures (the runtime hang remains a separate
known issue tracked in the workflow comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #30's build fix (-Zbuild-std=core, eliminating bitcode/code-model
mismatch) is correct, but the resulting kernel now hangs at AP startup
on qemu_x86_64 SMP — never reaches `*** Booting Zephyr OS ***`. The
hang was previously masked by the build failure and is the same
known issue documented in zephyr-tests.yml:273-275.

This branch is bisect step 1: disable CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE
in gale_smp_overlay.conf (highest-suspicion shim, touches every spinlock
acquire/release on the AP path) and watch CI.

Decision tree:
- smp_semaphore/smp_mutex/smp_threads green here → spinlock-validate is
  the culprit; investigate the validation hook's interaction with x86
  AP startup.
- still red → push next bisect step disabling CPU_MASK; then IPI.

Not for merge — experimental signal only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 1 (PR #38, b4f97a5) confirmed CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE
is the culprit for smp_semaphore (qemu_x86_64, SMP) but smp_mutex and
smp_threads remained red. So at least one more of {CPU_MASK, IPI} is
implicated in the AP-startup hang.

This step disables BOTH remaining shims at once instead of bisecting
them sequentially — a binary-search collapse. Reading the result:

- smp_mutex AND smp_threads green → second culprit is in {CPU_MASK, IPI};
  follow-up branch re-enables one at a time to localize.
- exactly one green → that one's gating fix is in {CPU_MASK, IPI};
  the other primitive's hang has a separate root cause.
- still red → AP-hang has a fourth source we haven't named yet
  (compiler codegen elsewhere, gale_overlay.conf, x86 boot path beyond
  the Gale shims).

Branched off bisect/smp-shim-no-spinlock-validate so this represents
the cumulative state, not a fresh from-#30 run.
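
The cumulative state can be pictured as each bisect step appending to the set of disabled options. The sketch below is hypothetical: `BISECT_STEPS` and `overlay_lines` are illustrative names, and the actual contents of gale_smp_overlay.conf are not shown in this PR. The option names come from the two bisect steps, and `CONFIG_...=n` is the usual way a Zephyr configuration fragment disables a bool symbol.

```python
# Options disabled at each bisect step, in order (names from this PR).
BISECT_STEPS = [
    ["CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE"],                  # step 1
    ["CONFIG_GALE_KERNEL_CPU_MASK", "CONFIG_GALE_KERNEL_IPI"], # step 2
]


def overlay_lines(up_to_step):
    """Return the cumulative `=n` overlay lines after `up_to_step` steps."""
    disabled = []
    for options in BISECT_STEPS[:up_to_step]:
        disabled.extend(options)
    return [f"{option}=n" for option in disabled]
```

`overlay_lines(2)` therefore lists all three shims as disabled, which is the state this branch tests.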

Not for merge — experimental signal only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
