Skip to content

[Bugfix] [Relay] Insertion of "device_copy" CallNode to Resolve Device Conflict on Unconstrained Nodes#15090

Merged
masahi merged 3 commits intoapache:mainfrom
lecoan:fix/plan_device
Jun 22, 2023
Merged

[Bugfix] [Relay] Insertion of "device_copy" CallNode to Resolve Device Conflict on Unconstrained Nodes#15090
masahi merged 3 commits intoapache:mainfrom
lecoan:fix/plan_device

Conversation

@lecoan
Copy link
Copy Markdown

@lecoan lecoan commented Jun 13, 2023

This PR addresses an issue #15019 I opened previously, regarding the PlanDevices pass's failure in cases where two operators share the same input but are intended to be assigned to different target devices. This scenario can often occur in the context of a neural network, where multiple layers can process the same input.

In the specific case of an operation (a+b)*(b+c), where the first add operator is assigned to the CPU and the second one to the GPU, PlanDevices pass would fail as it had difficulty determining the correct device for b.

The problematic behavior seemed to be due to PlanDevices pass marking b for the CPU (when it first visits a+b), and then throwing an error when it attempts to place b on the GPU while visiting b+c.

The solution I've implemented in this PR is the automatic addition of a device_copy. This means that if b is assigned to CPU after visiting a+b , the PlanDevices pass will append a device_copy to copy b to GPU when visiting b+c.

@tvm-bot
Copy link
Copy Markdown
Collaborator

tvm-bot commented Jun 13, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@lecoan
Copy link
Copy Markdown
Author

lecoan commented Jun 16, 2023

@mbs-octoml @Lunderberg Hi, all the test cases passed. Can you help me review the patch?

Comment thread src/relay/transforms/device_planner.cc Outdated
*
* Phase 1
* -------
* We iterate process the programs to find those nodes with conflicting virtual devices. If the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iteratively

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review! My apologies for the spelling errors. I have double-checked the comment using ChatGPT to ensure accuracy.

Comment thread src/relay/transforms/device_planner.cc Outdated
};

/*!
* \brief Flows the device constraints over the module and find all the conflicted nodes. The
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flow

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies for the spelling errors. I have double-checked the comment using ChatGPT to ensure accuracy.

}

IRModule mod_;
std::unique_ptr<DeviceContext> dev_ctx_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need pointer here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is similar to the PlanDevicesCore sub-pass, which uses a pointer for DeviceDomains to prevent unnecessary copying. Since the necessary information is contained in dev_ctx_, which is created in ConflictedNodeFinder and then passed to ConflictedNodeRewriter, we also use a pointer here.

Comment thread src/relay/transforms/device_planner.cc Outdated
*
* Phase 1
* -------
* We iterately process the programs and find nodes with conflicting virtual devices. If the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wrong, "iteratively"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@masahi masahi merged commit 6b20cae into apache:main Jun 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants