Add: child_memory flag on ContinuousTensor to skip H2D copy #579
ChaoWao wants to merge 1 commit into hw-native-sys:main
Conversation
Code Review
This pull request introduces a child_memory flag to the ContinuousTensor structure, allowing the runtime to skip host-to-device copies for tensors already managed by a child process. The changes span the core task interface, Python bindings, and runtime implementations for multiple platforms. Feedback highlights a critical regression where skipping these tensors in the initialization phase causes misalignment in the result validation logic, potentially leading to data corruption. Furthermore, the provided system test uses host-side shared memory to simulate device memory, which may not accurately represent behavior on physical hardware.
```cpp
if (t.is_child_memory()) {
  LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
  device_args.add_tensor(t);
  continue;
}
```
The skip logic introduced here for child_memory tensors causes a regression in the result validation phase. In validate_runtime_impl (around line 351), the runtime uses a first_output_tensor flag to identify which tensor should receive the packed graph output from graph_out_ptr. Since child_memory tensors are not recorded in tensor_pairs, they are omitted from the validation loop. If a child_memory tensor is the first output in the original TaskArgs, the validation loop will incorrectly apply the packed output buffer to the next available tensor in tensor_pairs, leading to silent data corruption.
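To make the failure mode concrete, here is a minimal, self-contained sketch of the two phases described above; all names (`init_phase`, `validate_phase`, the tensor fields) are hypothetical stand-ins for the C++ runtime logic:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    is_output: bool = False
    child_memory: bool = False

def init_phase(tensors):
    """Record tensor pairs, skipping child_memory tensors (the new behavior)."""
    return [t for t in tensors if not t.child_memory]

def validate_phase(tensor_pairs, graph_out_ptr):
    """Apply the packed graph output to the first output tensor recorded."""
    results = {}
    first_output_tensor = True
    for t in tensor_pairs:
        if t.is_output and first_output_tensor:
            results[t.name] = graph_out_ptr  # packed output lands here
            first_output_tensor = False
    return results

task_args = [
    Tensor("x"),                                        # input
    Tensor("out0", is_output=True, child_memory=True),  # true first output
    Tensor("out1", is_output=True),                     # second output
]

pairs = init_phase(task_args)
applied = validate_phase(pairs, graph_out_ptr="PACKED_OUTPUT")
# out0 was skipped, so the packed output is applied to out1 instead --
# the silent corruption described above.
print(applied)  # {'out1': 'PACKED_OUTPUT'}
```

Under this model, the fix would need validate_runtime_impl to identify the first output against the original TaskArgs ordering, not against the filtered tensor_pairs.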
```cpp
if (t.is_child_memory()) {
  LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
  device_args.add_tensor(t);
  continue;
}
```
Same regression as the previous comment, in this copy of the skip logic: child_memory tensors are not recorded in tensor_pairs, so if such a tensor is the first output in the original TaskArgs, validate_runtime_impl (around line 379) will apply the packed output buffer from graph_out_ptr to the next available tensor in tensor_pairs, leading to silent data corruption.
```cpp
if (t.is_child_memory()) {
  LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
  device_args.add_tensor(t);
  continue;
}
```
Same regression as noted on the other copies of this skip logic: child_memory tensors are not recorded in tensor_pairs, so if such a tensor is the first output in the original TaskArgs, validate_runtime_impl (around line 379) will apply the packed output buffer from graph_out_ptr to the next available tensor in tensor_pairs, leading to silent data corruption.
```python
SIZE = 128 * 128
return TaskArgsBuilder(
    Tensor("a", torch.full((SIZE,), 2.0, dtype=torch.float32).share_memory_()),
    Tensor("w", torch.full((SIZE,), 3.0, dtype=torch.float32).share_memory_()),
```
This test uses `Tensor.share_memory_()`, which creates a host-side shared memory buffer, but then marks it as `child_memory=True`. Per the documentation in tensor_arg.h, the `child_memory` flag is intended for device pointers allocated by a child process. While this works in simulation (a2a3sim), where the address space is often unified or accessible, it is misleading and would fail on real hardware, where the AICPU cannot access host memory directly. A more robust test would use a proper device-allocated buffer.
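To illustrate the suggestion, here is a sketch of staging the weight through device memory first. The `FakeChipWorker` below is a stand-in for the `device_malloc` / `copy_to_device` methods this PR describes on ChipWorker (the real binding signatures may differ), and stdlib `array` stands in for the torch tensor so the sketch is self-contained:

```python
import array

class FakeChipWorker:
    """Stand-in for the ChipWorker device-memory API described in this PR
    (device_malloc / copy_to_device); real signatures may differ. Device
    memory is modeled as bytearrays keyed by a fake device address."""
    def __init__(self):
        self._mem = {}
        self._next_addr = 0x1000
    def device_malloc(self, nbytes):
        addr = self._next_addr
        self._next_addr += nbytes
        self._mem[addr] = bytearray(nbytes)
        return addr
    def copy_to_device(self, addr, host_bytes):
        self._mem[addr][: len(host_bytes)] = host_bytes

SIZE = 128 * 128
worker = FakeChipWorker()

# Stage the weight into (fake) device memory instead of host shared memory.
# The real test builds it with torch.full(...); array keeps this runnable.
w_host = array.array("f", [3.0] * SIZE)
w_dev = worker.device_malloc(len(w_host) * w_host.itemsize)
worker.copy_to_device(w_dev, w_host.tobytes())

# w_dev -- a device address, not a host pointer -- is what a tensor marked
# child_memory=True should carry.
print(hex(w_dev))  # 0x1000
```

With this pattern, the pointer handed to the runtime is one the AICPU could actually dereference on real hardware, rather than a host address that only happens to work in simulation.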
Add a 1-byte `child_memory` field in the existing padding of ContinuousTensor (sizeof stays 40B). When set, init_runtime_impl passes the tensor pointer through as-is instead of malloc + H2D copy + record_tensor_pair. This enables child-process-allocated device buffers (e.g. HCCL windows, pre-staged weights) to be referenced in TaskArgs without being re-copied or freed per task.

- tensor_arg.h: add child_memory field + is_child_memory() helper
- runtime_maker.cpp (a2a3 aicpu, a2a3 tmar, a5 tmar): skip loop
- nanobind: expose child_memory on ContinuousTensor.make() + property
- C++ unit tests: sizeof, default, blob roundtrip, view_to_chip_storage
- Python unit tests: make, property, repr, ChipStorageTaskArgs
- L3 scene test: same child_memory weight across two kernel invocations
Force-pushed from 08eed02 to cdef4ce
Summary
Add a 1-byte `child_memory` field in the existing tail padding of `ContinuousTensor` (`sizeof` stays 40B, ABI unchanged). When set, `init_runtime_impl` passes the tensor pointer through as-is instead of `device_malloc` + `copy_to_device` + `record_tensor_pair`.

This closes the gap identified during review of #571: worker-allocated device buffers (e.g. HCCL windows, pre-staged weights from `device_malloc_ctx`) can now be referenced in `TaskArgs` without being re-copied or freed per task invocation.

Changes

- `tensor_arg.h`: add `uint8_t child_memory` field + `is_child_memory()` helper
- `pto_runtime_c_api`: export `device_malloc_ctx` / `device_free_ctx` / `copy_to_device_ctx` / `copy_from_device_ctx` (all 4 platforms: a2a3 sim/onboard, a5 sim/onboard)
- `chip_worker`: load the new symbols via dlsym, add `device_malloc` / `device_free` / `copy_to_device` / `copy_from_device` methods
- `runtime_maker.cpp` (a2a3 aicpu_build_graph, a2a3 tensormap_and_ringbuffer, a5 tensormap_and_ringbuffer): skip malloc + copy + record when `is_child_memory()`
- nanobind: `child_memory` on `ContinuousTensor.make(child_memory=True)` + read/write property + repr; expose `device_malloc` / `device_free` / `copy_to_device` / `copy_from_device` on `ChipWorker`
- `Worker`: add `malloc` / `free` / `copy_to` / `copy_from` (L2, delegates to ChipWorker)
- C++ unit tests: `sizeof` unchanged, default = 0, blob roundtrip, `view_to_chip_storage` preservation
- Python unit tests: `make()`, property, repr, `ChipStorageTaskArgs` roundtrip
- L3 scene test: `device_malloc` → `copy_to_device` → `ContinuousTensor(child_memory=True)` → run kernel twice with the same weight

Design
The `child_memory` byte sits in `ContinuousTensor`'s existing 7-byte tail padding (after `dtype`), so:

- `sizeof(ContinuousTensor)` remains 40
- `write_blob` / `read_blob` / `view_to_chip_storage` carry it transparently via `memcpy`
- `validate_runtime_impl` naturally skips these tensors (not in `tensor_pairs`)

Testing

- L3 scene test (`device_malloc` weight, kernel invoked twice, same pointer pass-through both times)

Related: #571
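The tail-padding design described above can be sketched with `ctypes`. The field layout below is illustrative only (the real ContinuousTensor fields are not shown in this PR page); what it demonstrates is the claimed invariant: a `uint8` flag placed in padding after `dtype` keeps the struct at 40 bytes, and a raw byte-for-byte blob roundtrip carries the flag transparently:

```python
import ctypes

class ContinuousTensorSketch(ctypes.Structure):
    """Illustrative layout only -- real field names/order may differ.
    Four u64 fields (32B) + dtype byte leave 7 bytes of tail padding;
    child_memory occupies one of them, so sizeof stays 40."""
    _fields_ = [
        ("data", ctypes.c_uint64),          # device pointer
        ("offset", ctypes.c_uint64),
        ("length", ctypes.c_uint64),
        ("stride", ctypes.c_uint64),
        ("dtype", ctypes.c_uint8),
        ("child_memory", ctypes.c_uint8),   # new flag in former padding
        ("_pad", ctypes.c_uint8 * 6),       # remaining tail padding
    ]

assert ctypes.sizeof(ContinuousTensorSketch) == 40  # size unchanged

t = ContinuousTensorSketch(data=0xDEAD0000, dtype=1, child_memory=1)

# write_blob / read_blob style roundtrip: a plain memcpy of the 40 bytes
# carries the flag with no serializer changes.
blob = bytes(t)
t2 = ContinuousTensorSketch.from_buffer_copy(blob)
print(t2.child_memory)  # 1
```

This is also why the default-zero behavior comes for free: a freshly zero-initialized struct (or a blob written by code predating the flag) reads back as `child_memory == 0`, preserving the old copy-in behavior.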