Add a shmem interface implemented on the C++ side, with a Python-side wrapper, plus two runnable demo cases for A3 and A5. #585
sunkaixuan2018 wants to merge 8 commits into hw-native-sys:main
- expose host malloc/register APIs through ChipWorker and Python bindings
- implement a2a3 runtime symbol resolution with rt/acl host-memory fallbacks
- add an a2a3 tensormap_and_ringbuffer demo that reads mapped host memory
- teach code_runner to run cleanup hooks for external mapped buffers

- submit the mapped host tensor as INOUT in orchestration
- update the AIV kernel to store the +1 result back to mapped host memory
- keep the regular copy-back output so host-visible writes and runtime output can be compared

- Add standalone mallocHostDeviceShareMem/freeHostDeviceShareMem exports in the host runtime and keep their logic independent from host_malloc/host_register_mapped
- Bind the new APIs through ChipWorker and Python so golden.py can call the direct shared-memory helpers
- Update non-a2a3 backends to return unsupported explicitly for the new symbols

- Implement direct mallocHostDeviceShareMem/freeHostDeviceShareMem support on the a5 onboard host runtime
- Add an a5 host_register_mapped_demo plus a case-local README with the run command
- Keep the existing Python helper chain and document the shared-memory API usage

- Merge hw-native-sys/simpler main up to a9f3ea9 into the local feature branch
- Resolve the only content conflict in simpler_setup/code_runner.py by keeping the post_run_collect flow and upstream dump-tensor config support
- Preserve the existing host-device share memory work on top of the refreshed upstream base

- Replace stale ensure_device_set() calls in the onboard host runtime C API with attach_current_thread() after the upstream DeviceRunner refactor
- Keep host-side set_device and mapped-memory registration aligned with the new run-context lifecycle on both a2a3 and a5
…reMem

Remove the intermediate host_malloc/host_free/host_register_mapped/host_unregister_mapped API surface from all layers (C header, ChipWorker, nanobind bindings, Python wrappers, and platform implementations). Only the two aggregated APIs (mallocHostDeviceShareMem/freeHostDeviceShareMem) are retained as the public shared-memory interface.

Also removes the draft documentation file and CamelCase compatibility wrappers to narrow the PR scope for upstream submission.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces host-side mapped memory support for the a2a3 and a5 platforms, allowing AIV kernels to directly read and write host-allocated buffers. The implementation includes new C++ runtime APIs, Python bindings, and a post_run_collect hook in the test runner for resource cleanup. Review feedback identifies critical bugs in the ACL registration path, specifically incorrect function signatures and the failure to retrieve device pointers via aclrtGetDevicePointer. Additionally, the global worker management in the Python interface is noted as fragile, and improvements are suggested for the dynamic symbol resolution logic to prevent mixing RT and ACL API families.
using RtHostUnregisterFn = int (*)(void *);
using AclMallocHostFn = int (*)(void **, size_t);
using AclFreeHostFn = int (*)(void *);
using AclHostRegisterFn = int (*)(void *, uint64_t, uint32_t, void **);
The AclHostRegisterFn typedef is incorrect for the standard aclrtHostRegister API in CANN. aclrtHostRegister takes 3 arguments (void *ptr, size_t size, uint32_t flags) and does not return a device pointer directly. To obtain the device-visible address, a subsequent call to aclrtGetDevicePointer is required. Using this 4-argument typedef will lead to incorrect behavior or failures in the ACL path.
using AclHostRegisterFn = int (*)(void *, size_t, uint32_t);
using AclGetDevicePointerFn = int (*)(void *, void **);

} else if (AclHostRegisterFn register_fn =
               resolve_symbol<AclHostRegisterFn>(&symbol_name, {"aclrtHostRegister"})) {
    rc = register_fn(allocated_host_ptr, size, kHostRegisterMappedFlag, devPtr);
    if (rc != 0 || *devPtr == nullptr) {
        LOG_ERROR(
            "mallocHostDeviceShareMem via %s failed on host register: rc=%d host=%p size=%" PRIu64 " flag=%u",
            symbol_name, rc, allocated_host_ptr, size, kHostRegisterMappedFlag
        );
        if (RtFreeHostFn free_fn = resolve_symbol<RtFreeHostFn>(nullptr, {"rtFreeHost"})) {
            free_fn(allocated_host_ptr);
        } else if (AclFreeHostFn free_fn = resolve_symbol<AclFreeHostFn>(nullptr, {"aclrtFreeHost"})) {
            free_fn(allocated_host_ptr);
        }
        return (rc != 0) ? rc : -1;
    }
The ACL registration path is broken because it incorrectly calls aclrtHostRegister with 4 arguments and fails to call aclrtGetDevicePointer to retrieve the mapped address. Since aclrtHostRegister does not write to the 4th argument, *devPtr will remain nullptr (initialized at line 216), causing the check at line 260 to always trigger a failure even if registration succeeded.
    global _ACTIVE_WORKER
    self._impl.finalize()
    if _ACTIVE_WORKER is self:
        _ACTIVE_WORKER = None
The global _ACTIVE_WORKER tracking logic is fragile when multiple ChipWorker instances are used. If a user creates w1 then w2, _ACTIVE_WORKER becomes w2. If w2 is finalized, _ACTIVE_WORKER is set to None, even though w1 is still active and initialized. Subsequent calls to module-level convenience functions like malloc_host_device_share_mem will fail with a RuntimeError despite a valid worker existing.
Consider maintaining a class-level list of active workers and updating _ACTIVE_WORKER to the most recently created non-finalized worker.
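One way to realize this suggestion is sketched below. It is a minimal illustration rather than the PR's code: `Worker`, `_ACTIVE_WORKERS`, and `active_worker()` are hypothetical names standing in for `ChipWorker` and the module-level `_ACTIVE_WORKER`, and no runtime calls are made.

```python
from typing import List


class Worker:
    """Hypothetical stand-in for ChipWorker; no real runtime calls are made."""

    # Class-level list of live (non-finalized) workers, oldest first.
    _ACTIVE_WORKERS: List["Worker"] = []

    def __init__(self, name: str) -> None:
        self.name = name
        self.finalized = False
        Worker._ACTIVE_WORKERS.append(self)

    def finalize(self) -> None:
        if self.finalized:
            return  # finalizing twice is a no-op
        self.finalized = True
        Worker._ACTIVE_WORKERS.remove(self)


def active_worker() -> Worker:
    """Return the most recently created worker that is still live."""
    if not Worker._ACTIVE_WORKERS:
        raise RuntimeError("no active worker")
    return Worker._ACTIVE_WORKERS[-1]
```

With this shape, finalizing `w2` leaves `w1` as the active worker, so module-level helpers keep working until the last worker is finalized.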
if (RtMallocHostFn malloc_fn = resolve_symbol<RtMallocHostFn>(&symbol_name, {"rtMallocHost"})) {
    rc = malloc_fn(&allocated_host_ptr, size, 0U);
    if (rc != 0 || allocated_host_ptr == nullptr) {
        LOG_ERROR("mallocHostDeviceShareMem via %s failed on rtMallocHost: rc=%d size=%" PRIu64, symbol_name, rc, size);
        return (rc != 0) ? rc : -1;
    }
} else if (AclMallocHostFn malloc_fn = resolve_symbol<AclMallocHostFn>(&symbol_name, {"aclrtMallocHost"})) {
    rc = malloc_fn(&allocated_host_ptr, static_cast<size_t>(size));
    if (rc != 0 || allocated_host_ptr == nullptr) {
        LOG_ERROR("mallocHostDeviceShareMem via %s failed on aclrtMallocHost: rc=%d size=%" PRIu64, symbol_name, rc, size);
        return (rc != 0) ? rc : -1;
    }
} else {
    LOG_ERROR("mallocHostDeviceShareMem: missing symbols rtMallocHost / aclrtMallocHost");
    return -1;
}
The implementation mixes Runtime (RT) and Ascend Computing Language (ACL) symbols within the same function. While they may coexist in some environments, it is safer to stick to one API family once a selection is made. For example, if rtMallocHost is successfully resolved, the code should prioritize rtHostRegister and rtFreeHost rather than falling through to ACL symbols if RT registration fails.
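The selection rule can be sketched independently of the C++ runtime. The Python below is only an illustration of "pick one family up front, then use only its symbols"; `API_FAMILIES` and `select_api_family` are hypothetical names, the set of available symbols stands in for what a dlsym-style `resolve_symbol` would find, and `rtHostRegister` is taken from the comment above rather than verified against the runtime.

```python
from typing import Dict, Optional, Set

# Symbol names grouped by API family, as they appear in this review.
# "rtHostRegister" is assumed from the comment text, not checked against the runtime.
API_FAMILIES: Dict[str, Dict[str, str]] = {
    "rt": {"malloc": "rtMallocHost", "register": "rtHostRegister", "free": "rtFreeHost"},
    "acl": {"malloc": "aclrtMallocHost", "register": "aclrtHostRegister", "free": "aclrtFreeHost"},
}


def select_api_family(available: Set[str]) -> Optional[str]:
    """Return the first family whose symbols are all resolvable, else None.

    Families are tried in declaration order (RT before ACL), so a partially
    available RT family never gets paired with ACL symbols.
    """
    for family, symbols in API_FAMILIES.items():
        if all(name in available for name in symbols.values()):
            return family
    return None
```

Once a family is chosen, allocation, registration, and the error-path free would all go through that family's entries, so an RT allocation is never released with an ACL call or vice versa.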
Background
The framework currently has no way to allocate memory on the host and map it to a device-visible address. Some scenarios (for example, direct host-device communication through shared memory) need this capability to avoid extra data copies.
Changes
1. New C API
- mallocHostDeviceShareMem / freeHostDeviceShareMem
- Declared in pto_runtime_c_api.h and loaded by ChipWorker via dlsym
- Implemented with rtMallocHost + rtsHostRegister / aclrtMallocHost + aclrtHostRegister
2. ChipWorker / nanobind / Python layers wired through
- ChipWorker gains mallocHostDeviceShareMem() / freeHostDeviceShareMem() methods
- malloc_host_device_share_mem / free_host_device_share_mem
- task_interface.py gains methods of the same names plus module-level convenience functions
3. CodeRunner resource-safety improvements
- reset_device() / finalize() now also run when an exception occurs
- The post_run_collect lifecycle hook lets golden.py release external resources after each case
4. a2a3 / a5 examples
- examples/a2a3/tensormap_and_ringbuffer/host_register_mapped_demo/
- examples/a5/tensormap_and_ringbuffer/host_register_mapped_demo/
Files changed (19 files, +1329 / -76)
- src/common/worker/pto_runtime_c_api.h
- src/common/worker/chip_worker.h, chip_worker.cpp
- src/a2a3/platform/{onboard,sim}/host/pto_runtime_c_api.cpp
- src/a5/platform/{onboard,sim}/host/pto_runtime_c_api.cpp
- python/bindings/task_interface.cpp
- python/simpler/task_interface.py
- simpler_setup/code_runner.py
- examples/a2a3/.../host_register_mapped_demo/*
- examples/a5/.../host_register_mapped_demo/*
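The CodeRunner resource-safety change described above can be pictured as nested try/finally blocks. This is a sketch under assumptions, not the code in simpler_setup/code_runner.py: `run_case` is a hypothetical function, the device teardown steps are recorded as log entries instead of real calls, and running the hook before `reset_device` / `finalize` is an assumed ordering.

```python
from typing import Callable, List, Optional


def run_case(
    body: Callable[[], None],
    post_run_collect: Optional[Callable[[], None]],
    log: List[str],
) -> None:
    """Run one case; the cleanup hook and teardown run even if the body raises."""
    try:
        body()
    finally:
        try:
            # Hook where golden.py could release external resources,
            # such as a host-device shared buffer (assumed to run first).
            if post_run_collect is not None:
                post_run_collect()
        finally:
            # Stand-ins for reset_device() / finalize(); they run even if
            # the hook itself raises.
            log.append("reset_device")
            log.append("finalize")
```

A failing case still propagates its exception to the caller, but only after the hook and both teardown steps have been recorded.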