experiment(smp): bisect step 2 — also disable CPU_MASK + IPI (do not merge)#41
Draft
avrabe wants to merge 4 commits into
Three SMP tests on qemu_x86_64 (smp_semaphore, smp_mutex, smp_threads)
have been failing in the libgale_ffi.a build step with:
warning: linking module flags 'Code Model':
IDs have conflicting values: 'i32 2' from , and 'i32 1' from gale_ffi...
error: failed to load bitcode of module "core-...rcgu.o"
Root cause: the precompiled `core` rustlib for x86_64-unknown-none
ships bitcode tagged with code-model=kernel (=2, the target's default).
We pass -Ccode-model=small (=1) so addresses match Zephyr's lower-2GB
kernel mapping. With ffi/Cargo.toml's release profile `lto = true`
(fat LTO), rustc tries to merge the two bitcode modules and LLVM
rejects the link.
Fix: switch x86 to the release-lto profile (lto=false), which already
exists for cross-language LTO. rustc then emits a regular static
archive without doing fat LTO, so no bitcode merge is attempted. All
other release tuning (opt-level=z, codegen-units=1, panic=abort,
overflow-checks=true) is preserved via `inherits = "release"`.
Cortex-M targets are unaffected — the change is scoped inside
`if(CONFIG_X86)`. Their precompiled core ships with a matching
code-model, so fat LTO continues to work there. Reproduced the
failure locally with rustc 1.95.0 + the exact CI cargo invocation;
confirmed the fix builds clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous fix on this branch (release-lto profile) only resolved
the bitcode-merge half of the problem. CI showed it built libgale_ffi.a
clean but then failed at link with a different error:
ld.bfd: discarded output section: `.got.plt'
ld.bfd: final link failed
collect2: error: ld returned 1 exit status
Both errors root-cause to the same thing: rustup's precompiled `core`
rustlib for x86_64-unknown-none was built with the target's defaults
(code-model=kernel, PIC), which don't match the flags Zephyr's kernel
needs (code-model=small, static relocation, no PIE).
- Fat LTO (release profile) → bitcode merge rejects code-model
mismatch.
- lto=false (release-lto profile) → bitcode merge avoided, but the
precompiled core's GOT/PLT references hit the non-PIC link script
and ld.bfd refuses to discard the now-non-empty .got.plt section.
Proper fix: rebuild `core` from source with our flags via -Zbuild-std,
so the rebuilt core picks up RUSTFLAGS=-Ccode-model=small
-Crelocation-model=static and is both bitcode-compatible AND PIC-free.
Reverts to the original release profile (fat LTO works again) and
adds:
- GALE_CARGO_EXTRA_ARGS = "-Zbuild-std=core" (threaded into all 42
cargo build invocations in CMakeLists.txt)
- RUSTC_BOOTSTRAP=1 in GALE_CARGO_ENV (to permit unstable cargo
flag on stable rustc)
- rustup component add rust-src in the SMP test job
Scoped inside if(CONFIG_X86); Cortex-M codegen is byte-identical to
before.
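The x86-only scoping described above can be sketched roughly as follows. This is an illustration, not the project's build code: the real change threads `GALE_CARGO_EXTRA_ARGS` and `GALE_CARGO_ENV` through CMakeLists.txt, whereas `build_cargo_invocation`, its `is_x86` parameter, and the exact argv shape (including `--release`) are hypothetical names and assumptions made for this example. Only the flags and environment variables themselves come from the commit message.

```python
import os


def build_cargo_invocation(is_x86, base_env=None):
    """Return (argv, env) for one cargo build of the FFI crate.

    Hypothetical helper: it mirrors the if(CONFIG_X86) scoping from the
    commit message but is not taken from the repository.
    """
    env = dict(os.environ if base_env is None else base_env)
    argv = ["cargo", "build", "--release"]
    if is_x86:
        # Rebuild `core` from source so it picks up our codegen flags.
        argv += ["--target", "x86_64-unknown-none", "-Zbuild-std=core"]
        # Permit the unstable -Z cargo flag on a stable toolchain.
        env["RUSTC_BOOTSTRAP"] = "1"
        # Flags the rebuilt core must share with the rest of the crate.
        env["RUSTFLAGS"] = "-Ccode-model=small -Crelocation-model=static"
    return argv, env


if __name__ == "__main__":
    argv, env = build_cargo_invocation(is_x86=True, base_env={})
    print(" ".join(argv))
    print(env)
```

Non-x86 callers get the unmodified command and environment, which is the property the commit relies on when it says Cortex-M codegen is unchanged.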
Also addresses the root-cause comment in zephyr-tests.yml about the
SMP build failures (the runtime hang remains a separate known issue,
tracked in the workflow comment).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #30's build fix (-Zbuild-std=core, eliminating bitcode/code-model
mismatch) is correct, but the resulting kernel now hangs at AP startup
on qemu_x86_64 SMP and never reaches `*** Booting Zephyr OS ***`. The
hang was previously masked by the build failure and is the same known
issue documented in zephyr-tests.yml:273-275.
This branch is bisect step 1: disable CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE
in gale_smp_overlay.conf (highest-suspicion shim, touches every spinlock
acquire/release on the AP path) and watch CI.
Decision tree:
- smp_semaphore/smp_mutex/smp_threads green here → spinlock-validate
is the culprit; investigate the validation hook's interaction with
x86 AP startup.
- still red → push next bisect step disabling CPU_MASK; then IPI.
Not for merge: experimental signal only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 1 (PR #38, b4f97a5) confirmed CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE
is the culprit for smp_semaphore (qemu_x86_64, SMP) but smp_mutex and
smp_threads remained red. So at least one more of {CPU_MASK, IPI} is
implicated in the AP-startup hang.
This step disables BOTH remaining shims at once instead of bisecting
them sequentially (a binary-search collapse).
Reading the result:
- smp_mutex AND smp_threads green → second culprit is in
{CPU_MASK, IPI}; follow-up branch re-enables one at a time to localize.
- exactly one green → that one's gating fix is in {CPU_MASK, IPI};
the other primitive's hang has a separate root cause.
- still red → AP-hang has a fourth source we haven't named yet
(compiler codegen elsewhere, gale_overlay.conf, x86 boot path beyond
the Gale shims).
Branched off bisect/smp-shim-no-spinlock-validate so this represents
the cumulative state, not a fresh from-#30 run.
Not for merge: experimental signal only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
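The three outcomes listed for step 2 can be written down as a small lookup, which may help when reading the CI run. This is only a sketch of the decision tree as stated in the commit message; the function name `interpret_step2` and its boolean parameters are hypothetical and do not exist in the repository or the workflow.

```python
def interpret_step2(smp_mutex_green: bool, smp_threads_green: bool) -> str:
    """Map the CI results of the two still-red tests to the next action."""
    if smp_mutex_green and smp_threads_green:
        # Both pass with CPU_MASK and IPI disabled.
        return "second culprit is in {CPU_MASK, IPI}; re-enable one at a time"
    if smp_mutex_green or smp_threads_green:
        # Exactly one passes: the two hangs do not share a cause.
        fixed, other = (
            ("smp_mutex", "smp_threads")
            if smp_mutex_green
            else ("smp_threads", "smp_mutex")
        )
        return (
            f"{fixed} is gated by {{CPU_MASK, IPI}}; "
            f"{other} hangs for a separate root cause"
        )
    # Neither passes: look beyond the three Gale shims.
    return "still red: the hang has a source outside the three Gale shims"


if __name__ == "__main__":
    for mutex in (True, False):
        for threads in (True, False):
            print(mutex, threads, "->", interpret_step2(mutex, threads))
```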
Codecov Report: All modified and coverable lines are covered by tests.
Not for merge — CI signal only
Bisect step 2 for the qemu_x86_64 SMP AP-startup hang. Stacked on PR #38.
Step 1 outcome (PR #38, `b4f97a5`): disabling `CONFIG_GALE_KERNEL_SPINLOCK_VALIDATE` made `smp_semaphore` go green; `smp_mutex` and `smp_threads` stayed red.
Step 2 (this branch, `d111f95`): ALSO disable `CONFIG_GALE_KERNEL_CPU_MASK` and `CONFIG_GALE_KERNEL_IPI`. This is a binary-search collapse: if `smp_mutex` and `smp_threads` both pass, we know the culprit is in {CPU_MASK, IPI}, and a follow-up re-enables them one at a time.
Reading the result: