Add baremetal RISC-V smoke tests (rv32, rv64) by luhenry · Pull Request #4 · riseproject-dev/executorch

luhenry · 2026-05-23T16:40:00Z

Summary

Add baremetal RISC-V testing on CI for rv32 and rv64.

Test plan

This PR only adds CI jobs and no new runtime code, so the CI run itself is the test.

Will submit to https://github.com/pytorch/executorch once pytorch#19741 is merged

Differential Revision: D105973185 Pull Request resolved: pytorch#19736

@digantdesai

Add model tests of currently not supported models:
- yolo11
- wav2letter
- silero_vad

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Signed-off-by: Adrian Lundell <adrian.lundell@arm.com>

Differential Revision: D102880053 Pull Request resolved: pytorch#19211

Differential Revision: D106123930 Pull Request resolved: pytorch#19742

pytorch#19746) pytorch#18476 clone version due to bot crash

…ackend (pytorch#19747) clone pytorch#18477 due to bot crash

clone pytorch#18728 due to bot crash

Differential Revision: D106162684 Pull Request resolved: pytorch#19749

@robert-kalmar

### Summary Add tests verifying correct support for add.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar

…#19752) Differential Revision: D106254596 Pull Request resolved: pytorch#19752

Treat BUCK and TARGETS files as build metadata in the Arm pre-push license check so they do not need copyright headers. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70

@robert-kalmar

### Summary Add tests verifying correct support for sub.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

…opy (pytorch#19751)

Follow-up to pytorch#17097, which added BF16 support to the TOSA GATHER op. `aten.index_select` and `aten.unfold_copy` both lower via TOSA GATHER, but their support checks were not updated at the time.

In both decompositions (`DecomposeIndexSelectToGatherPass()` and `DecomposeUnfoldToGatherPass()`), the bf16 values tensor flows through dtype-agnostic reshape ops and `tosa.GATHER`, which accepts `BF16`. The support check was the only blocker.

| Op | bf16 before | bf16 after |
|---------------------|:-----------:|:----------:|
| `aten.gather` | ✅ | ✅ |
| `aten.index.Tensor` | ✅ | ✅ |
| `aten.slice_copy` | ✅ | ✅ |
| `aten.index_select` | ❌ | ✅ |
| `aten.unfold_copy` | ❌ | ✅ |

Changes:
- `index_select_support.py`, `unfold_copy_support.py`: extend float branch to include `bfloat16`; add bf16 extension guard; update rejection message.
- `test_index_select.py`, `test_unfold_copy.py`: add isolated `_tosa_FP_bf16` test functions using `TosaPipelineFP(..., tosa_extensions=["bf16"])`.

### Test plan

`test_index_select_tosa_FP_bf16` and `test_unfold_copy_tosa_FP_bf16` exercise the bf16 path end-to-end through `TosaPipelineFP` with the bf16 extension enabled, following the same pattern as the existing `test_slice_tensor_tosa_FP_bf16` from pytorch#17492.
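The shape of the support-check change described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in `index_select_support.py`: the function name `is_values_dtype_supported`, the string dtype names, and the `extensions` argument are hypothetical stand-ins chosen so the example runs without torch.

```python
# Hypothetical stand-in for a support check; dtypes are plain strings here.
FLOAT_DTYPES = {"float16", "float32", "bfloat16"}


def is_values_dtype_supported(dtype: str, extensions: frozenset) -> tuple:
    """Return (supported, reason) for the values tensor of a GATHER-based op."""
    if dtype in FLOAT_DTYPES:
        # bfloat16 is part of the float branch, but only when the bf16
        # extension is enabled for the target.
        if dtype == "bfloat16" and "bf16" not in extensions:
            return False, "bfloat16 requires the bf16 extension"
        return True, ""
    return False, f"unsupported values dtype: {dtype}"
```

With the guard in place, the same dtype is accepted or rejected depending only on whether the extension was requested, which is the behaviour the isolated `_tosa_FP_bf16` tests rely on.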

@psiddh

This is done for conv, depthwise conv, transpose conv, and bmm.

Add scratch tensors to the operator signatures, which are then assigned `exir.memory.alloc`. These allocs are automatically memory planned by ExecuTorch.

Introduce `required_cmsis_buffer_size`, which computes the buffer size from node properties and the Cortex-M configuration. It uses functions registered per target in backends/cortex_m/passes/scratch_buffer_sizes.py, and its result sets the size of the allocs in ConvertToCortexMPass.

Finally, modify the kernels to use the new scratch tensor instead of allocating temporary memory. Add a new macro, CORTEX_M_ENABLE_RUNTIME_CHECKS, that enables a safety check that the AOT-computed buffer size equals the buffer size computed at runtime. Use this when testing.

cc @psiddh @AdrianLundell @digantdesai @rascani @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

---------

Signed-off-by: Erik Lundell <erik.lundell@arm.com>
Co-authored-by: Måns Nilsson <mans.nilsson@arm.com>
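The "size functions registered by target" idea in the commit message above can be sketched as a small registry. Everything here is illustrative: `register_size_func`, `scratch_size`, `check_runtime_size`, and the property names are hypothetical, and the sizing formula is a placeholder, not the CMSIS-NN rule.

```python
from typing import Callable, Dict

# Hypothetical registry: target name -> function mapping node properties to a
# scratch-buffer size in bytes.
_SIZE_FUNCS: Dict[str, Callable[[dict], int]] = {}


def register_size_func(target: str):
    """Decorator that registers a size function for one operator target."""
    def decorator(fn):
        _SIZE_FUNCS[target] = fn
        return fn
    return decorator


@register_size_func("conv2d")
def _conv2d_size(props: dict) -> int:
    # Placeholder formula for illustration only.
    return 2 * props["in_channels"] * props["kernel_h"] * props["kernel_w"]


def scratch_size(target: str, props: dict) -> int:
    """Ahead-of-time lookup; returns 0 when no size function is registered."""
    fn = _SIZE_FUNCS.get(target)
    return fn(props) if fn else 0


def check_runtime_size(aot_size: int, runtime_size: int) -> None:
    """Analogue of the optional runtime check: both sizes must agree."""
    if aot_size != runtime_size:
        raise RuntimeError(f"scratch size mismatch: {aot_size} != {runtime_size}")
```

Keeping the size computation in one registry means the ahead-of-time pass and the test-time check can be compared against a single source of truth.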

@cccclai

…es (pytorch#19146)

### Summary

To enable GPU backend support in the Llama runner, refactoring is required because the dtypes of kv_cache, attention_mask, and logits are currently hardcoded, which prevents floating-point models from running. This PR focuses on removing those hardcoded dtypes.

#### Key changes

- Remove template parameter `<typename T>` from KVManager, LhdTokenGenerator, MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at construction time instead of compile-time bitwidth detection
- Switch to `std::byte*` pointer arithmetic with `getDtypeSize()` for all buffer offsets; add `fill_mask()` helper for multi-dtype attention mask filling
- Update spec_prop pass for custom llama op for sharding case greater than 1

### Test plan

```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android --device acfa9311 --executorch_root . --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16
```

<img width="1977" height="468" alt="image" src="https://github.com/user-attachments/assets/8bf3bffa-9b9f-4655-9cbc-b20127c2468a" />

cc @cccclai @cbilgin @abhinaykukkadapu
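The move from a compile-time element type to byte offsets scaled by a runtime dtype size can be illustrated in a few lines. This is a sketch in Python rather than the runner's C++; the dtype table and the signature of `fill_mask` below are assumptions made for the example, not the runner's actual helper.

```python
import struct

# Assumed dtype table for the sketch: name -> (size in bytes, struct format).
DTYPES = {
    "uint8": (1, "B"),
    "uint16": (2, "H"),
    "float16": (2, "e"),
    "float32": (4, "f"),
}


def dtype_size(dtype: str) -> int:
    """Element size in bytes, looked up at runtime instead of via a template."""
    return DTYPES[dtype][0]


def fill_mask(buf: bytearray, dtype: str, start: int, count: int, value) -> None:
    """Write `value` into `count` elements starting at element index `start`."""
    size, fmt = DTYPES[dtype]
    for i in range(count):
        offset = (start + i) * size  # byte offset = element index * element size
        struct.pack_into("<" + fmt, buf, offset, value)
```

The buffer itself stays untyped; only the offset computation and the final store depend on the dtype discovered at construction time.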

Summary: Pull Request resolved: pytorch#19764 Reviewed By: kirklandsign Differential Revision: D106332819

@digantdesai

As documented at https://vkdoc.net/man/VkDataGraphPipelineSessionBindPointRequirementARM, the `sType` field of VkDataGraphPipelineSessionBindPointRequirementARM should always be set to VK_STRUCTURE_TYPE_DATA_GRAPH_PIPELINE_SESSION_BIND_POINT_REQUIREMENT_ARM.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Signed-off-by: Erik Lundell <erik.lundell@arm.com>

Enable CPPCHECK for Cortex-M sources and headers. The Cortex-M kernels are registered through generated wrappers, so cppcheck cannot see direct call sites for the exported *_out entry points and reports them as unused. Keep narrow unusedFunction suppressions for those registration-visible functions. The scratch buffer context header is linted as a standalone header but currently exposes helper API without in-tree call sites, so suppress unusedFunction at file scope there instead of dropping Cortex-M header coverage. Keep the quantize and dequantize context parameters non-const to match the generated kernel ABI; changing them to const changes the mangled symbols used by registration. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I3bcb6e5d3f125ae400005d1b033b24a07eb7924f

### Summary It relates to pytorch#18833. It doesn't add Yolo on baremetal, but it at least makes sure that it works using Portable Kernels and XNNPACK backends. ### Test plan It's only adding a model to CI, so the CI is the test plan.

Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark, LlmModelRunner, and ModelRunner from Java to Kotlin. Differential Revision: D106195816

@digantdesai

…rch#19731)

### Summary

Extend the Cortex-M cross-CPU build pipeline to Armv6-M by patching two upstream issues that block the Corstone-300 target source and the CMSIS Cortex DFP from building for `cortex-m0plus`:

* `core_platform/0003-*.patch` guards the `HardFault_Handler` in `targets/corstone-300/target.cpp`. The handler uses an `ite eq` IT-block in inline asm and dereferences the SCB CFSR/BFAR/MMFAR fault-status registers; both are Armv7-M / Armv8-M Mainline only. The patch wraps the rich handler in `__ARM_ARCH_7M__ / 7EM / 8M_MAIN / 8_1M_MAIN` and falls back to a minimal stub on Armv6-M / Armv8-M Baseline (M0/M0+/M23).
* `core_software/0002-*.patch` fixes `cmsis.cmake`'s handling of the M0+ device. The Cortex DFP names the device directory and headers `ARMCM0plus` (lowercase suffix), while the device sources (`startup_ARMCM0plus.c`, `system_ARMCM0plus.c`) gate their implementations on the `ARMCM0P` preprocessor macro (three different spellings). The previous `string(TOUPPER ...)` produced `ARMCM0PLUS`: the include path lookup failed and the source files hit their `#error device not specified!` guard. Override `ARM_CPU` to `ARMCM0plus` for the directory + filename and introduce a separate `CMSIS_DEVICE_CPU_DEFINE` set to `ARMCM0P` for the cmsis_startup and cmsis_system compile-definitions; all other cores still drive both paths from the uppercased default.

Both patches are layered via the existing `patch_repo` mechanism; the `corstone_utils.cmake` TODO is updated so the deletion plan for 0002 and 0003 is documented together.

### Test Plan

Locally validated end-to-end on the Corstone-300 FVP with the `qadd` model: `cortex-m0plus` build links a runner that includes `startup_ARMCM0plus.c` / `system_ARMCM0plus.c` and the patched `target.cpp`, and the FVP run prints `TEST: BundleIO index[0] Test_result: PASS` with all error stats zero.

The bundled `libcmsis-nn.a` reports `Tag_CPU_arch: v6S-M` and `Tag_THUMB_ISA_use: Thumb-1` with zero DSP / MVE / saturating instructions, confirming the scalar code path was exercised.

Authored with Claude.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell
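The naming mismatch that the `core_software` patch works around can be restated as a small lookup. The sketch below is in Python rather than CMake and is only illustrative: the function name `cmsis_device_names` and the derivation of `ARMCM` plus an uppercased suffix for other cores are assumptions, while the M0+ special case follows the description above.

```python
def cmsis_device_names(cpu: str) -> tuple:
    """Return (directory/file stem, preprocessor define) for a Cortex-M CPU.

    Assumes `cpu` looks like "cortex-m55"; the generic derivation below is an
    assumption made for this sketch.
    """
    suffix = cpu.lower().removeprefix("cortex-m")
    if suffix == "0plus":
        # Directory and file names keep the lowercase "plus", while the
        # device sources test the ARMCM0P macro.
        return "ARMCM0plus", "ARMCM0P"
    name = "ARMCM" + suffix.upper()
    return name, name
```

Splitting the result into two values mirrors the patch's choice to keep `ARM_CPU` for paths and add a separate define for the compile definitions.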

Differential Revision: D106026285 Pull Request resolved: pytorch#19734

Differential Revision: D106394605 Pull Request resolved: pytorch#19775

@robert-kalmar

pytorch#19772) … Registration ### Summary Docs improvement. ### Test plan Docs only. cc @robert-kalmar @JakeStevens @digantdesai @rascani

@digantdesai

Re-upload with BUCK changes. Share TOSA RESIZE parameter validation between upsample support checks and fake RESIZE lowering so invalid nearest and bilinear resize parameters are rejected before delegation. Change-Id: I57c267aca96d733879ae90329267e44adce399c6 cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Per Held <per.held@arm.com>

Differential Revision: D106408368 Pull Request resolved: pytorch#19783

### Summary

In pytorch#19651, I added a global seed for pytest runs. This was intended to reduce random tolerance flakes, but didn't actually do so in practice: the parallel test runners don't guarantee any ordering, so random state is unstable between runs. I've updated it to set the seed per test, which should make the random state independent of test execution order.
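One possible way to make the random state independent of execution order is to reseed at the start of every test. The sketch below uses only the standard library; the helper names are hypothetical, deriving the seed from the test identifier is an assumption (a fixed seed per test would work equally well), and a real suite would likely also seed torch and numpy.

```python
import hashlib
import random


def seed_for_test(test_id: str, base_seed: int = 0) -> int:
    """Derive a stable 32-bit seed from a test identifier."""
    digest = hashlib.sha256(f"{base_seed}:{test_id}".encode()).digest()
    return int.from_bytes(digest[:4], "little")


def seed_test(test_id: str, base_seed: int = 0) -> None:
    """Reset the RNG so a test sees the same stream in any execution order."""
    random.seed(seed_for_test(test_id, base_seed))
```

In a `conftest.py`, a helper like `seed_test` could be called from an autouse fixture with `request.node.nodeid` as the identifier, so every test starts from a known state regardless of which worker runs it or what ran before it.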

Differential Revision: D106430647 Pull Request resolved: pytorch#19790

…19743) Differential Revision: D105630451 Pull Request resolved: pytorch#19743

@kimishpatel

… GPU / CPU (pytorch#19252)

### Summary

CoreML decides at compile/load time which device each MIL operation will execute on, and coremltools 9.0+ exposes that through `MLComputePlan`. The recurring question on the issue tracker is *"why isn't my model running fully on the ANE?"*, for example:

- pytorch#4091: `llama model is not fully lowered to ANE`
- pytorch#11541: `CoreML model is crashing on iPhone GPU, but not on iPhone CPU or macOS GPU`
- pytorch#8439: `ANE compile OOMs on certain input shapes`
- pytorch#8445: `CPU Overhead After ANE Execution`

Today the only way for an ExecuTorch user to answer it is to break out Swift / Xcode. This PR adds a Python wrapper around `MLComputePlan` so the answer is one shell command:

```
$ python coreml_compute_plan.py --model_path my_model.mlpackage \
    --compute_units cpu_and_ne --show_non_ane

=== my_model.mlpackage ===
ANE: 412 / 480 ( 85.8%)
CPU: 68 / 480 ( 14.2%)
Non-ANE op types:
32 ios17.cast
18 ios17.gather
12 ios17.reshape
6 ios17.constexpr_blockwise_shift_scale
```

Inputs supported:

| Input | Behavior |
|---|---|
| `.pte` | Extract every Core ML partition into a tempdir, then analyze each. |
| `.mlpackage` | Compile to `.mlmodelc` in a tempdir, then analyze. |
| `.mlmodelc` | Analyze directly. |

The PTE path reuses the same JSON/named-data extraction logic that `extract_coreml_models.py` uses, and is inlined into the script so it can be run against a plain CoreML model without depending on the executorch package.

### Test plan

Added `test_coreml_compute_plan.py` covering:

- `_device_name(...)` for `None` and a stub `MLNeuralEngineComputeDevice`.
- `_COMPUTE_UNIT_CHOICES` mapping (`cpu_and_ne` / `all`).
- `analyze_one(...)` end-to-end on a tiny `relu(x @ x.T) + x.sum()` mlpackage built with `coremltools.convert(...)`: returns rows for every dispatched op, with a `main` function and the expected MIL op types (`matmul`, `relu`, `add`, `reduce_sum`).

```
$ python -m pytest examples/apple/coreml/scripts/test_coreml_compute_plan.py -v
============================== 7 passed in 3.68s ===============================
```

I also ran the script against a few hand-built `.mlpackage` and `.mlmodelc` files on macOS 26 with coremltools 9.0 and verified the output matches what `MLComputePlan` returns directly.

Authored with Claude.

cc @kimishpatel @YifanShenSZ @cymbalrush @metascroy
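The per-device summary shown in the sample output above boils down to counting rows. The following sketch covers only that aggregation step and does not call coremltools; the `(op_type, device)` row shape and the function name `summarize` are assumptions for illustration, not the script's actual internals.

```python
from collections import Counter


def summarize(rows):
    """Summarize (op_type, device) pairs, device being "ANE", "GPU" or "CPU".

    Returns the per-device summary lines and the non-ANE op types sorted by
    descending count.
    """
    rows = list(rows)
    total = len(rows)
    per_device = Counter(device for _, device in rows)
    non_ane = Counter(op for op, device in rows if device != "ANE")
    lines = []
    for device in ("ANE", "GPU", "CPU"):
        n = per_device.get(device, 0)
        if n:
            lines.append(f"{device}: {n} / {total} ({100.0 * n / total:5.1f}%)")
    return lines, non_ane.most_common()
```

Listing the non-ANE op types by frequency is what makes the report actionable: the most common fallback op is usually the first candidate to rewrite or replace.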

Differential Revision: D106412035 Pull Request resolved: pytorch#19777

…h#16986) Differential Revision: D91725222 Pull Request resolved: pytorch#16986

Add a Docker build image based on Ubuntu 26.04 with gcc 15. It is needed for the baremetal RISC-V use case because libstdc++-riscv64-unknown-elf-picolibc is only available starting with Ubuntu 26.04. It also ensures that gcc-riscv64-unknown-elf is at least gcc 14, which supports RVV.

Cross-compiles with riscv64-unknown-elf + picolibc, embeds the .bpte into the ELF, and runs under qemu-system-riscv{32,64} -machine virt with semihosting carrying stdout and exit status. Same bundled-IO PASS criterion as the existing Linux runs.
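A CI step of this kind typically wraps QEMU in a small launcher. The sketch below only builds the command line; the function name `qemu_command`, the ELF path, and the choice of `-bios none` and `-nographic` are assumptions about how such a job might be wired, not the flags used by this PR.

```python
def qemu_command(xlen: int, elf_path: str) -> list:
    """Build a QEMU command line for a bare-metal RISC-V ELF with semihosting."""
    if xlen not in (32, 64):
        raise ValueError("xlen must be 32 or 64")
    return [
        f"qemu-system-riscv{xlen}",
        "-machine", "virt",
        "-nographic",           # no display window
        "-bios", "none",        # assumption: the ELF boots without firmware
        "-kernel", elf_path,
        # Semihosting lets the guest write to host stdout and report an exit code.
        "-semihosting-config", "enable=on,target=native",
    ]
```

The resulting list can be passed to `subprocess.run(..., capture_output=True)`; the job would then check the return code and look for the bundled-IO PASS line in the captured stdout.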

metascroy and others added 7 commits May 22, 2026 19:20

Fix 2 broken tests caused by D105910457

a83e7c4

Differential Revision: D105973185 Pull Request resolved: pytorch#19736

Convert Android LLM extension from Java to Kotlin (pytorch#19211)

158c5d8

Differential Revision: D102880053 Pull Request resolved: pytorch#19211

Globally serialize XNNPACK execution, add logging (pytorch#19742)

6bda6c4

Differential Revision: D106123930 Pull Request resolved: pytorch#19742

[ET Device Support] Module: allocate device memory for planned buffers (

12f62f2

pytorch#19746) pytorch#18476 clone version due to bot crash

[ET Device Support] CudaAllocator: device memory allocator for CUDA b…

c27cc5d

…ackend (pytorch#19747) clone pytorch#18477 due to bot crash

[ET Device Support] Define AOT device copy ops registry (pytorch#19748)

7d8063f

clone pytorch#18728 due to bot crash

This was referenced May 23, 2026

[discussion] Upstreaming an HPMicro bare-metal RISC-V MCU backend pytorch/executorch#19666

Open

Export YOLO to executorch for RISC-V Baremetal environment pytorch/executorch#18833

Open

kirklandsign and others added 11 commits May 23, 2026 18:50

Add extension_llm_runner to CMake deps (pytorch#19749)

d757776

Differential Revision: D106162684 Pull Request resolved: pytorch#19749

NXP backend: Enable Add Tensor with new Neutron flow (pytorch#19550)

b69cbcd

### Summary Add tests verifying correct support for add.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar

Back out "Globally serialize XNNPACK execution, add logging" (pytorch…

ba6074c

…#19752) Differential Revision: D106254596 Pull Request resolved: pytorch#19752

Arm backend: Exclude build metadata from license checks

ee4c90a

Treat BUCK and TARGETS files as build metadata in the Arm pre-push license check so they do not need copyright headers. Signed-off-by: Per Held <per.held@arm.com> Change-Id: I4b3bbd1e03ba4b9c38fd06225156344985f0cc70

NXP backend: Enable Sub Tensor with new Neutron flow (pytorch#19588)

b73df0b

### Summary Add tests verifying correct support for sub.tensor by the Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

add cuda allocator to cmake target (pytorch#19764) (pytorch#19764)

75fb249

Summary: Pull Request resolved: pytorch#19764 Reviewed By: kirklandsign Differential Revision: D106332819

luhenry mentioned this pull request May 26, 2026

Add Yolo26 to matrix of tested models on RISC-V pytorch/executorch#19741

Merged

luhenry and others added 9 commits May 26, 2026 09:14

Convert minibench Java files to Kotlin (pytorch#19760)

6128a45

Convert BenchmarkActivity, BenchmarkMetric, LlmBenchmark, LlmModelRunner, and ModelRunner from Java to Kotlin. Differential Revision: D106195816

Harden against concurrency violations (pytorch#19734) (pytorch#19734)

fb3f6eb

Differential Revision: D106026285 Pull Request resolved: pytorch#19734

Convert Experimental, DType, MethodMetadata from Java to Kotlin

50ee05e

Differential Revision: D106394605 Pull Request resolved: pytorch#19775

NXP backend: Improve docs for NXP eIQ Neutron Kernel Selective Kernel… (

5d36c7c

pytorch#19772) … Registration ### Summary Docs improvement. ### Test plan Docs only. cc @robert-kalmar @JakeStevens @digantdesai @rascani

Fix cortex_m test failures from D106339880

29c3a23

Differential Revision: D106408368 Pull Request resolved: pytorch#19783

kirklandsign and others added 7 commits May 26, 2026 23:24

Collapse Experimental.kt annotation onto a single line to satisfy linter

b4d62ed

Differential Revision: D106430647 Pull Request resolved: pytorch#19790

Handle out_dtype in ReplacePT2DequantWithCadenceDequantPass (pytorch#…

034b044

…19743) Differential Revision: D105630451 Pull Request resolved: pytorch#19743

Fix bug with mixed weight cache + workspace sharing

fb420f3

Differential Revision: D106412035 Pull Request resolved: pytorch#19777

New exported program pass manager and exported program passes (pytorc…

77df9b7

…h#16986) Differential Revision: D91725222 Pull Request resolved: pytorch#16986

luhenry force-pushed the riscv-testing-baremetal branch from 3bac226 to 6661a84 Compare May 27, 2026 12:33

github-actions Bot added module: arm ciflow/trunk labels May 27, 2026

luhenry wants to merge 34 commits into riscv-testing-models from riscv-testing-baremetal

17 participants