Add shmem interface, implemented on the C++ side with a Python-side wrapper, plus two runnable demo cases for A3 and A5. #585

Open
sunkaixuan2018 wants to merge 8 commits into hw-native-sys:main from
sunkaixuan2018:wangchao-main-shmem

Conversation

@sunkaixuan2018
Contributor

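Background

The framework currently lacks the ability to allocate memory on the Host and map it to a Device-visible address. Some scenarios (such as direct communication between Host and Device through shared memory) need this capability to avoid extra data copies.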

Changes

1. New C API: mallocHostDeviceShareMem / freeHostDeviceShareMem

  • Declared in pto_runtime_c_api.h and loaded by ChipWorker via dlsym
  • a2a3 onboard: implemented on top of rtMallocHost + rtsHostRegister / aclrtMallocHost + aclrtHostRegister
  • a5 onboard: implemented on top of the same driver APIs
  • a2a3/a5 sim: provides stubs (return -1 and log a message) so the exported symbols stay aligned

2. ChipWorker / nanobind / Python layers wired through

  • ChipWorker gains mallocHostDeviceShareMem() / freeHostDeviceShareMem() methods
  • The nanobind bindings add malloc_host_device_share_mem / free_host_device_share_mem
  • Python task_interface.py adds methods of the same names plus module-level convenience functions

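3. CodeRunner resource-safety improvements

  • An outer try/finally ensures reset_device() / finalize() still run when an exception is raised
  • A new post_run_collect lifecycle hook lets golden.py release external resources after each case finishes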
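
The cleanup ordering described in item 3 can be illustrated with a minimal Python sketch. It is not the code from simpler_setup/code_runner.py: `run_cases`, `run_case` and `cleanup` are hypothetical names, and calling the hook even when a case fails is an assumption made here for illustration.

```python
def run_cases(cases, run_case, post_run_collect=None, cleanup=()):
    """Run every case, then always run the cleanup callables.

    Hypothetical sketch: `cleanup` stands in for reset_device() / finalize(),
    and `post_run_collect` for the per-case hook that golden.py may define.
    """
    try:
        for case in cases:
            try:
                run_case(case)
            finally:
                # Per-case hook, e.g. to free an externally allocated buffer.
                # Running it on failure too is an assumption of this sketch.
                if post_run_collect is not None:
                    post_run_collect(case)
    finally:
        # Outer guard: executes whether or not a case raised.
        for fn in cleanup:
            fn()
```

With this shape, an exception raised by a case still propagates to the caller, but only after the per-case hook and the outer cleanup have run.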

4. a2a3 / a5 examples

  • examples/a2a3/tensormap_and_ringbuffer/host_register_mapped_demo/
  • examples/a5/tensormap_and_ringbuffer/host_register_mapped_demo/
  • Demo flow: Host allocates shared memory → writes data → Device kernel reads it and adds 1 → result is verified
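
The demo flow above can be mimicked on the host alone, as a rough sketch of the idea that both sides see the same memory. Nothing here calls the real API: `fake_alloc` and `fake_kernel` are hypothetical stand-ins, and a `memoryview` over an `array` plays the role of the device-visible mapping.

```python
import array


def fake_alloc(n):
    """Hypothetical stand-in for the shared allocation: two views, one buffer."""
    host = array.array("i", [0] * n)
    dev = memoryview(host)  # plays the role of the device-visible address
    return host, dev


def fake_kernel(dev):
    """Stand-in for the device kernel: read each element and add 1 in place."""
    for i in range(len(dev)):
        dev[i] = dev[i] + 1


def demo(n):
    host, dev = fake_alloc(n)
    for i in range(n):        # host writes the input data
        host[i] = i
    fake_kernel(dev)          # "device" updates the same memory, no copy
    result = list(host)       # host reads the result directly
    dev.release()
    return result
```

Because both names refer to one buffer, the host observes the incremented values without any copy-back step, which is the property the real demo verifies on hardware.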

Files touched (19 files, +1329 / -76)

Layer: Files
C API header: src/common/worker/pto_runtime_c_api.h
ChipWorker: src/common/worker/chip_worker.h, chip_worker.cpp
Platform implementation: src/a2a3/platform/{onboard,sim}/host/pto_runtime_c_api.cpp
Platform implementation: src/a5/platform/{onboard,sim}/host/pto_runtime_c_api.cpp
Python bindings: python/bindings/task_interface.cpp
Python wrapper: python/simpler/task_interface.py
Test framework: simpler_setup/code_runner.py
Examples: examples/a2a3/.../host_register_mapped_demo/*
Examples: examples/a5/.../host_register_mapped_demo/*

sunkaixuan2018 and others added 8 commits April 15, 2026 09:35
- expose host malloc/register APIs through ChipWorker and Python bindings
- implement a2a3 runtime symbol resolution with rt/acl host-memory fallbacks
- add an a2a3 tensormap_and_ringbuffer demo that reads mapped host memory
- teach code_runner to run cleanup hooks for external mapped buffers
- submit the mapped host tensor as INOUT in orchestration
- update the AIV kernel to store the +1 result back to mapped host memory
- keep the regular copy-back output so host-visible writes and runtime output can be compared
- Add standalone mallocHostDeviceShareMem/freeHostDeviceShareMem exports in the host runtime and keep their logic independent from host_malloc/host_register_mapped
- Bind the new APIs through ChipWorker and Python so golden.py can call the direct shared-memory helpers
- Update non-a2a3 backends to return unsupported explicitly for the new symbols
- Implement direct mallocHostDeviceShareMem/freeHostDeviceShareMem support on the a5 onboard host runtime
- Add an a5 host_register_mapped_demo plus a case-local README with the run command
- Keep the existing Python helper chain and document the shared-memory API usage
- Merge hw-native-sys/simpler main up to a9f3ea9 into the local feature branch
- Resolve the only content conflict in simpler_setup/code_runner.py by keeping the post_run_collect flow and upstream dump-tensor config support
- Preserve the existing host-device share memory work on top of the refreshed upstream base
- Replace stale ensure_device_set() calls in the onboard host runtime C API with attach_current_thread() after the upstream DeviceRunner refactor
- Keep host-side set_device and mapped-memory registration aligned with the new run-context lifecycle on both a2a3 and a5
…reMem

Remove the intermediate host_malloc/host_free/host_register_mapped/
host_unregister_mapped API surface from all layers (C header, ChipWorker,
nanobind bindings, Python wrappers, and platform implementations).
Only the two aggregated APIs (mallocHostDeviceShareMem/freeHostDeviceShareMem)
are retained as the public shared-memory interface.

Also removes the draft documentation file and CamelCase compatibility
wrappers to narrow the PR scope for upstream submission.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces host-side mapped memory support for the a2a3 and a5 platforms, allowing AIV kernels to directly read and write host-allocated buffers. The implementation includes new C++ runtime APIs, Python bindings, and a post_run_collect hook in the test runner for resource cleanup. Review feedback identifies critical bugs in the ACL registration path, specifically incorrect function signatures and the failure to retrieve device pointers via aclrtGetDevicePointer. Additionally, the global worker management in the Python interface is noted as fragile, and improvements are suggested for the dynamic symbol resolution logic to prevent mixing RT and ACL API families.

using RtHostUnregisterFn = int (*)(void *);
using AclMallocHostFn = int (*)(void **, size_t);
using AclFreeHostFn = int (*)(void *);
using AclHostRegisterFn = int (*)(void *, uint64_t, uint32_t, void **);

high

The AclHostRegisterFn typedef is incorrect for the standard aclrtHostRegister API in CANN. aclrtHostRegister takes 3 arguments (void *ptr, size_t size, uint32_t flags) and does not return a device pointer directly. To obtain the device-visible address, a subsequent call to aclrtGetDevicePointer is required. Using this 4-argument typedef will lead to incorrect behavior or failures in the ACL path.

using AclHostRegisterFn = int (*)(void *, size_t, uint32_t);
using AclGetDevicePointerFn = int (*)(void *, void **);

Comment on lines +257 to +271
} else if (AclHostRegisterFn register_fn =
               resolve_symbol<AclHostRegisterFn>(&symbol_name, {"aclrtHostRegister"})) {
    rc = register_fn(allocated_host_ptr, size, kHostRegisterMappedFlag, devPtr);
    if (rc != 0 || *devPtr == nullptr) {
        LOG_ERROR(
            "mallocHostDeviceShareMem via %s failed on host register: rc=%d host=%p size=%" PRIu64 " flag=%u",
            symbol_name, rc, allocated_host_ptr, size, kHostRegisterMappedFlag
        );
        if (RtFreeHostFn free_fn = resolve_symbol<RtFreeHostFn>(nullptr, {"rtFreeHost"})) {
            free_fn(allocated_host_ptr);
        } else if (AclFreeHostFn free_fn = resolve_symbol<AclFreeHostFn>(nullptr, {"aclrtFreeHost"})) {
            free_fn(allocated_host_ptr);
        }
        return (rc != 0) ? rc : -1;
    }

high

The ACL registration path is broken because it incorrectly calls aclrtHostRegister with 4 arguments and fails to call aclrtGetDevicePointer to retrieve the mapped address. Since aclrtHostRegister does not write to the 4th argument, *devPtr will remain nullptr (initialized at line 216), causing the check at line 260 to always trigger a failure even if registration succeeded.

Comment on lines +212 to +215
    global _ACTIVE_WORKER
    self._impl.finalize()
    if _ACTIVE_WORKER is self:
        _ACTIVE_WORKER = None

medium

The global _ACTIVE_WORKER tracking logic is fragile when multiple ChipWorker instances are used. If a user creates w1 then w2, _ACTIVE_WORKER becomes w2. If w2 is finalized, _ACTIVE_WORKER is set to None, even though w1 is still active and initialized. Subsequent calls to module-level convenience functions like malloc_host_device_share_mem will fail with a RuntimeError despite a valid worker existing.

Consider maintaining a class-level list of active workers and updating _ACTIVE_WORKER to the most recently created non-finalized worker.
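
One way the suggestion above might look, as a minimal sketch. `Worker` and `active_worker` are hypothetical stand-ins rather than the ChipWorker class or the module-level helpers from this PR; only the tracking logic is shown.

```python
class Worker:
    """Hypothetical stand-in for ChipWorker; only tracking is illustrated."""

    _active = []  # class-level list of non-finalized workers, oldest first

    def __init__(self, name):
        self.name = name
        Worker._active.append(self)

    def finalize(self):
        # Drop only this worker; older workers remain reachable.
        if self in Worker._active:
            Worker._active.remove(self)


def active_worker():
    """Return the most recently created worker that has not been finalized."""
    if not Worker._active:
        raise RuntimeError("no active worker")
    return Worker._active[-1]
```

In the w1/w2 scenario described above, finalizing w2 makes `active_worker()` fall back to w1 instead of raising.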

Comment on lines +225 to +240
if (RtMallocHostFn malloc_fn = resolve_symbol<RtMallocHostFn>(&symbol_name, {"rtMallocHost"})) {
    rc = malloc_fn(&allocated_host_ptr, size, 0U);
    if (rc != 0 || allocated_host_ptr == nullptr) {
        LOG_ERROR("mallocHostDeviceShareMem via %s failed on rtMallocHost: rc=%d size=%" PRIu64, symbol_name, rc, size);
        return (rc != 0) ? rc : -1;
    }
} else if (AclMallocHostFn malloc_fn = resolve_symbol<AclMallocHostFn>(&symbol_name, {"aclrtMallocHost"})) {
    rc = malloc_fn(&allocated_host_ptr, static_cast<size_t>(size));
    if (rc != 0 || allocated_host_ptr == nullptr) {
        LOG_ERROR("mallocHostDeviceShareMem via %s failed on aclrtMallocHost: rc=%d size=%" PRIu64, symbol_name, rc, size);
        return (rc != 0) ? rc : -1;
    }
} else {
    LOG_ERROR("mallocHostDeviceShareMem: missing symbols rtMallocHost / aclrtMallocHost");
    return -1;
}

medium

The implementation mixes Runtime (RT) and AscendCL (ACL) symbols within the same function. While they may coexist in some environments, it is safer to stick to one API family once a selection is made. For example, if rtMallocHost is successfully resolved, the code should prioritize rtHostRegister and rtFreeHost rather than falling through to ACL symbols if RT registration fails.
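
The "pick one family, then stay in it" idea can be sketched as follows. This is illustrative only: `pick_family` is a hypothetical helper, the `available` dict stands in for whatever the runtime's symbol lookup returns, and the symbol names are copied from this thread rather than checked against the driver headers.

```python
# Symbol groups as named in this PR thread; each group must resolve as a whole.
FAMILIES = [
    ("rt", ("rtMallocHost", "rtsHostRegister", "rtFreeHost")),
    ("acl", ("aclrtMallocHost", "aclrtHostRegister", "aclrtFreeHost")),
]


def pick_family(available, families=FAMILIES):
    """Return (name, functions) for the first fully resolvable family.

    `available` maps symbol name to callable and is a stand-in for dlsym
    results. A family with any missing symbol is skipped entirely, so
    allocation, registration and free never come from different families.
    """
    for name, symbols in families:
        if all(sym in available for sym in symbols):
            return name, {sym: available[sym] for sym in symbols}
    return None
```

Resolving the whole group up front means a later failure in registration is reported as a failure of that family, and the matching free function is already known for cleanup.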
