Add: child_memory flag on ContinuousTensor to skip H2D copy #579

Open
ChaoWao wants to merge 1 commit into hw-native-sys:main from ChaoWao:feat/device-resident-tensor

Conversation

@ChaoWao (Collaborator) commented Apr 16, 2026

Summary

Add a 1-byte child_memory field in the existing tail padding of ContinuousTensor (sizeof stays 40B, ABI unchanged). When set, init_runtime_impl passes the tensor pointer through as-is instead of device_malloc + copy_to_device + record_tensor_pair.

This closes the gap identified during review of #571: worker-allocated device buffers (e.g. HCCL windows, pre-staged weights from device_malloc_ctx) can now be referenced in TaskArgs without being re-copied or freed per task invocation.

Changes

  • tensor_arg.h: add uint8_t child_memory field + is_child_memory() helper
  • pto_runtime_c_api: export device_malloc_ctx / device_free_ctx / copy_to_device_ctx / copy_from_device_ctx (all 4 platforms: a2a3 sim/onboard, a5 sim/onboard)
  • chip_worker: load new symbols via dlsym, add device_malloc / device_free / copy_to_device / copy_from_device methods
  • runtime_maker.cpp (a2a3 aicpu_build_graph, a2a3 tensormap_and_ringbuffer, a5 tensormap_and_ringbuffer): skip malloc+copy+record when is_child_memory()
  • nanobind bindings: expose child_memory on ContinuousTensor.make(child_memory=True) + read/write property + repr; expose device_malloc/device_free/copy_to_device/copy_from_device on ChipWorker
  • Worker: add malloc / free / copy_to / copy_from (L2, delegates to ChipWorker)
  • C++ unit tests: sizeof unchanged, default=0, blob roundtrip, view_to_chip_storage preservation
  • Python unit tests: make(), property, repr, ChipStorageTaskArgs roundtrip
  • L2 scene test: device_malloc → copy_to_device → ContinuousTensor(child_memory=True) → run kernel twice with same weight

Design

The child_memory byte sits in ContinuousTensor's existing 7-byte tail padding (after dtype), so:

  • sizeof(ContinuousTensor) remains 40
  • write_blob / read_blob / view_to_chip_storage carry it transparently via memcpy
  • No wire format or mailbox protocol changes
  • validate_runtime_impl naturally skips these tensors (not in tensor_pairs)
  • Default 0 = full backward compatibility

Testing

  • C++ unit tests pass (8/8)
  • Python unit tests pass (7/7 new + 89 existing)
  • L2 scene test passes on a2a3sim (device_malloc weight, kernel invoked twice, same pointer pass-through both times)
  • Hardware tests pending (requires a physical device)

Related: #571


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a child_memory flag to the ContinuousTensor structure, allowing the runtime to skip host-to-device copies for tensors already managed by a child process. The changes span the core task interface, Python bindings, and runtime implementations for multiple platforms. Feedback highlights a critical regression where skipping these tensors in the initialization phase causes misalignment in the result validation logic, potentially leading to data corruption. Furthermore, the provided system test uses host-side shared memory to simulate device memory, which may not accurately represent behavior on physical hardware.

Comment on lines +139 to +143
if (t.is_child_memory()) {
    LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
    device_args.add_tensor(t);
    continue;
}

high

The skip logic introduced here for child_memory tensors causes a regression in the result validation phase. In validate_runtime_impl (around line 351), the runtime uses a first_output_tensor flag to identify which tensor should receive the packed graph output from graph_out_ptr. Since child_memory tensors are not recorded in tensor_pairs, they are omitted from the validation loop. If a child_memory tensor is the first output in the original TaskArgs, the validation loop will incorrectly apply the packed output buffer to the next available tensor in tensor_pairs, leading to silent data corruption.

Comment on lines +163 to +167
if (t.is_child_memory()) {
    LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
    device_args.add_tensor(t);
    continue;
}

high

The skip logic introduced here for child_memory tensors causes a regression in the result validation phase. In validate_runtime_impl (around line 379), the runtime uses a first_output_tensor flag to identify which tensor should receive the packed graph output from graph_out_ptr. Since child_memory tensors are not recorded in tensor_pairs, they are omitted from the validation loop. If a child_memory tensor is the first output in the original TaskArgs, the validation loop will incorrectly apply the packed output buffer to the next available tensor in tensor_pairs, leading to silent data corruption.

Comment on lines +163 to +167
if (t.is_child_memory()) {
    LOG_INFO("  Tensor %d: child memory, pass-through (0x%" PRIx64 ")", i, t.data);
    device_args.add_tensor(t);
    continue;
}

high

The skip logic introduced here for child_memory tensors causes a regression in the result validation phase. In validate_runtime_impl (around line 379), the runtime uses a first_output_tensor flag to identify which tensor should receive the packed graph output from graph_out_ptr. Since child_memory tensors are not recorded in tensor_pairs, they are omitted from the validation loop. If a child_memory tensor is the first output in the original TaskArgs, the validation loop will incorrectly apply the packed output buffer to the next available tensor in tensor_pairs, leading to silent data corruption.

SIZE = 128 * 128
return TaskArgsBuilder(
    Tensor("a", torch.full((SIZE,), 2.0, dtype=torch.float32).share_memory_()),
    Tensor("w", torch.full((SIZE,), 3.0, dtype=torch.float32).share_memory_()),

medium

This test uses torch.share_memory_() which creates a host-side shared memory buffer, but then marks it as child_memory=True. As per the documentation in tensor_arg.h, the child_memory flag is intended for device pointers allocated by a child process. While this works in simulation (a2a3sim) where the address space is often unified or accessible, it is misleading and would fail on real hardware where the AICPU cannot access host memory directly. For a more robust test, this should ideally use a proper device-allocated buffer.

Add a 1-byte `child_memory` field in the existing padding of
ContinuousTensor (sizeof stays 40B). When set, init_runtime_impl
passes the tensor pointer through as-is instead of malloc + H2D
copy + record_tensor_pair. This enables child-process-allocated
device buffers (e.g. HCCL windows, pre-staged weights) to be
referenced in TaskArgs without being re-copied or freed per task.

- tensor_arg.h: add child_memory field + is_child_memory() helper
- runtime_maker.cpp (a2a3 aicpu, a2a3 tmar, a5 tmar): skip loop
- nanobind: expose child_memory on ContinuousTensor.make() + property
- C++ unit tests: sizeof, default, blob roundtrip, view_to_chip_storage
- Python unit tests: make, property, repr, ChipStorageTaskArgs
- L3 scene test: same child_memory weight across two kernel invocations
@ChaoWao ChaoWao force-pushed the feat/device-resident-tensor branch from 08eed02 to cdef4ce Compare April 16, 2026 03:44