Add: child_memory flag on ContinuousTensor to skip H2D copy #579
ChaoWao wants to merge 1 commit into hw-native-sys:main
Conversation
Code Review
This pull request introduces a child_memory flag to the ContinuousTensor structure, allowing the runtime to skip host-to-device copies for tensors already managed by a child process. The changes span the core task interface, Python bindings, and runtime implementations for multiple platforms. Feedback highlights a critical regression where skipping these tensors in the initialization phase causes misalignment in the result validation logic, potentially leading to data corruption. Furthermore, the provided system test uses host-side shared memory to simulate device memory, which may not accurately represent behavior on physical hardware.
```cpp
if (t.is_child_memory()) {
  LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
  device_args.add_tensor(t);
  continue;
}
```
The skip logic introduced here for child_memory tensors causes a regression in the result validation phase. In validate_runtime_impl (around line 351), the runtime uses a first_output_tensor flag to identify which tensor should receive the packed graph output from graph_out_ptr. Since child_memory tensors are not recorded in tensor_pairs, they are omitted from the validation loop. If a child_memory tensor is the first output in the original TaskArgs, the validation loop will incorrectly apply the packed output buffer to the next available tensor in tensor_pairs, leading to silent data corruption.
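To make the failure mode concrete, here is a minimal, self-contained sketch of the two phases described above; all names (`init_phase`, `validate_phase`, the tensor fields) are hypothetical stand-ins for the C++ runtime logic:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    is_output: bool = False
    child_memory: bool = False

def init_phase(tensors):
    """Record tensor pairs, skipping child_memory tensors (the new behavior)."""
    return [t for t in tensors if not t.child_memory]

def validate_phase(tensor_pairs, graph_out_ptr):
    """Apply the packed graph output to the first output tensor recorded."""
    results = {}
    first_output_tensor = True
    for t in tensor_pairs:
        if t.is_output and first_output_tensor:
            results[t.name] = graph_out_ptr  # packed output lands here
            first_output_tensor = False
    return results

task_args = [
    Tensor("x"),                                        # input
    Tensor("out0", is_output=True, child_memory=True),  # true first output
    Tensor("out1", is_output=True),                     # second output
]

pairs = init_phase(task_args)
applied = validate_phase(pairs, graph_out_ptr="PACKED_OUTPUT")
# out0 was skipped, so the packed output is applied to out1 instead --
# the silent corruption described above.
print(applied)  # {'out1': 'PACKED_OUTPUT'}
```

Under this model, the fix would need validate_runtime_impl to identify the first output against the original TaskArgs ordering, not against the filtered tensor_pairs.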
```cpp
if (t.is_child_memory()) {
  LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
  device_args.add_tensor(t);
  continue;
}
```
Same regression as the previous comment, in this copy of the skip logic: child_memory tensors are not recorded in tensor_pairs, so if such a tensor is the first output in the original TaskArgs, validate_runtime_impl (around line 379) will apply the packed output buffer from graph_out_ptr to the next available tensor in tensor_pairs, leading to silent data corruption.
```cpp
if (t.is_child_memory()) {
  LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
  device_args.add_tensor(t);
  continue;
}
```
Same regression as noted on the other copies of this skip logic: child_memory tensors are not recorded in tensor_pairs, so if such a tensor is the first output in the original TaskArgs, validate_runtime_impl (around line 379) will apply the packed output buffer from graph_out_ptr to the next available tensor in tensor_pairs, leading to silent data corruption.
```python
SIZE = 128 * 128
return TaskArgsBuilder(
    Tensor("a", torch.full((SIZE,), 2.0, dtype=torch.float32).share_memory_()),
    Tensor("w", torch.full((SIZE,), 3.0, dtype=torch.float32).share_memory_()),
```
This test uses `Tensor.share_memory_()`, which creates a host-side shared memory buffer, but then marks it as `child_memory=True`. Per the documentation in tensor_arg.h, the `child_memory` flag is intended for device pointers allocated by a child process. While this works in simulation (a2a3sim), where the address space is often unified or accessible, it is misleading and would fail on real hardware, where the AICPU cannot access host memory directly. A more robust test would use a proper device-allocated buffer.
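To illustrate the suggestion, here is a sketch of staging the weight through device memory first. The `FakeChipWorker` below is a stand-in for the `device_malloc` / `copy_to_device` methods this PR describes on ChipWorker (the real binding signatures may differ), and stdlib `array` stands in for the torch tensor so the sketch is self-contained:

```python
import array

class FakeChipWorker:
    """Stand-in for the ChipWorker device-memory API described in this PR
    (device_malloc / copy_to_device); real signatures may differ. Device
    memory is modeled as bytearrays keyed by a fake device address."""
    def __init__(self):
        self._mem = {}
        self._next_addr = 0x1000
    def device_malloc(self, nbytes):
        addr = self._next_addr
        self._next_addr += nbytes
        self._mem[addr] = bytearray(nbytes)
        return addr
    def copy_to_device(self, addr, host_bytes):
        self._mem[addr][: len(host_bytes)] = host_bytes

SIZE = 128 * 128
worker = FakeChipWorker()

# Stage the weight into (fake) device memory instead of host shared memory.
# The real test builds it with torch.full(...); array keeps this runnable.
w_host = array.array("f", [3.0] * SIZE)
w_dev = worker.device_malloc(len(w_host) * w_host.itemsize)
worker.copy_to_device(w_dev, w_host.tobytes())

# w_dev -- a device address, not a host pointer -- is what a tensor marked
# child_memory=True should carry.
print(hex(w_dev))  # 0x1000
```

With this pattern, the pointer handed to the runtime is one the AICPU could actually dereference on real hardware, rather than a host address that only happens to work in simulation.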
Add a 1-byte `child_memory` field in the existing padding of ContinuousTensor (sizeof stays 40B). When set, init_runtime_impl passes the tensor pointer through as-is instead of malloc + H2D copy + record_tensor_pair. This enables child-process-allocated device buffers (e.g. HCCL windows, pre-staged weights) to be referenced in TaskArgs without being re-copied or freed per task.

- tensor_arg.h: add child_memory field + is_child_memory() helper
- runtime_maker.cpp (a2a3 aicpu, a2a3 tmar, a5 tmar): skip loop
- nanobind: expose child_memory on ContinuousTensor.make() + property
- C++ unit tests: sizeof, default, blob roundtrip, view_to_chip_storage
- Python unit tests: make, property, repr, ChipStorageTaskArgs
- L3 scene test: same child_memory weight across two kernel invocations
Force-pushed from 08eed02 to cdef4ce
Summary
Add a 1-byte `child_memory` field in the existing tail padding of `ContinuousTensor` (`sizeof` stays 40B, ABI unchanged). When set, `init_runtime_impl` passes the tensor pointer through as-is instead of `device_malloc` + `copy_to_device` + `record_tensor_pair`.

This closes the gap identified during review of #571: worker-allocated device buffers (e.g. HCCL windows, pre-staged weights from `device_malloc_ctx`) can now be referenced in `TaskArgs` without being re-copied or freed per task invocation.

Changes

- `tensor_arg.h`: add `uint8_t child_memory` field + `is_child_memory()` helper
- `pto_runtime_c_api`: export `device_malloc_ctx` / `device_free_ctx` / `copy_to_device_ctx` / `copy_from_device_ctx` (all 4 platforms: a2a3 sim/onboard, a5 sim/onboard)
- `chip_worker`: load the new symbols via dlsym, add `device_malloc` / `device_free` / `copy_to_device` / `copy_from_device` methods
- `runtime_maker.cpp` (a2a3 aicpu_build_graph, a2a3 tensormap_and_ringbuffer, a5 tensormap_and_ringbuffer): skip malloc + copy + record when `is_child_memory()`
- nanobind: `child_memory` on `ContinuousTensor.make(child_memory=True)` + read/write property + repr; expose `device_malloc` / `device_free` / `copy_to_device` / `copy_from_device` on `ChipWorker`
- `Worker`: add `malloc` / `free` / `copy_to` / `copy_from` (L2, delegates to ChipWorker)
- C++ unit tests: `sizeof` unchanged, default = 0, blob roundtrip, `view_to_chip_storage` preservation
- Python unit tests: `make()`, property, repr, `ChipStorageTaskArgs` roundtrip
- L3 scene test: `device_malloc` → `copy_to_device` → `ContinuousTensor(child_memory=True)` → run kernel twice with the same weight

Design
The `child_memory` byte sits in `ContinuousTensor`'s existing 7-byte tail padding (after `dtype`), so:

- `sizeof(ContinuousTensor)` remains 40
- `write_blob` / `read_blob` / `view_to_chip_storage` carry it transparently via `memcpy`
- `validate_runtime_impl` naturally skips these tensors (not in `tensor_pairs`)

Testing

- L3 scene test (`device_malloc` weight, kernel invoked twice, same pointer pass-through both times)

Related: #571
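The tail-padding design described above can be sketched with `ctypes`. The field layout below is illustrative only (the real ContinuousTensor fields are not shown in this PR page); what it demonstrates is the claimed invariant: a `uint8` flag placed in padding after `dtype` keeps the struct at 40 bytes, and a raw byte-for-byte blob roundtrip carries the flag transparently:

```python
import ctypes

class ContinuousTensorSketch(ctypes.Structure):
    """Illustrative layout only -- real field names/order may differ.
    Four u64 fields (32B) + dtype byte leave 7 bytes of tail padding;
    child_memory occupies one of them, so sizeof stays 40."""
    _fields_ = [
        ("data", ctypes.c_uint64),          # device pointer
        ("offset", ctypes.c_uint64),
        ("length", ctypes.c_uint64),
        ("stride", ctypes.c_uint64),
        ("dtype", ctypes.c_uint8),
        ("child_memory", ctypes.c_uint8),   # new flag in former padding
        ("_pad", ctypes.c_uint8 * 6),       # remaining tail padding
    ]

assert ctypes.sizeof(ContinuousTensorSketch) == 40  # size unchanged

t = ContinuousTensorSketch(data=0xDEAD0000, dtype=1, child_memory=1)

# write_blob / read_blob style roundtrip: a plain memcpy of the 40 bytes
# carries the flag with no serializer changes.
blob = bytes(t)
t2 = ContinuousTensorSketch.from_buffer_copy(blob)
print(t2.child_memory)  # 1
```

This is also why the default-zero behavior comes for free: a freshly zero-initialized struct (or a blob written by code predating the flag) reads back as `child_memory == 0`, preserving the old copy-in behavior.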