[ET-VK] Deprecate gpu_sizes_ubo() and toggle packing layout via specialization shader #3181
Closed
SS-JIA wants to merge 2 commits into gh/SS-JIA/42/base from
Conversation
…cialization shader

## Context

This changeset cleans up how shaders consume tensor metadata in two ways.

### Pass in Packing Layout via Specialization Shader

The packing layout of a tensor determines how to convert between tensor indices and physical texture coordinates. Currently, the packing layout is selected by generating a completely new variant of a shader; however, this is rather expensive in terms of build size. Specialization constant support was added a while back, which enables the packing layout to be communicated to the shader via a specialization constant. This is a much more natural way for a shader to determine the packing layout of its tensors and vary its behaviour accordingly.

The primary benefit is that we can vastly reduce the number of shader variants that are generated; generating variants for every combination of dtype and memory layout can lead to a combinatorial explosion of build size. Note that dtype cannot be passed as a specialization constant, since it affects the types used in the layout portion of a shader.

### Deprecate GPU sizes

Currently there are two representations of a tensor's sizes: `cpu_sizes()` and `gpu_sizes()`. The GPU sizes are a simple modification of the CPU sizes in which the packed dim is aligned to the next multiple of 4. However, shaders often need to reference the original sizes of the tensor, so we end up passing both: the CPU sizes are used to detect out-of-bounds elements, while the GPU sizes are used to convert between logical tensor indices and physical texture coordinates.

Since the GPU sizes are easily derived from the CPU sizes given the packing layout, deprecate GPU sizes and use CPU sizes exclusively as the canonical tensor sizes. Accordingly, `cpu_sizes()` is renamed to simply `sizes()`.

The primary benefits of this change:

1. Less confusion over how to reference the tensor sizes
2. Fewer descriptors to bind when constructing compute pipelines

Differential Revision: [D56377775](https://our.internmc.facebook.com/intern/diff/D56377775/)

[ghstack-poisoned]
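The relationship between the two size representations can be sketched as follows. This is a hypothetical Python model of the math described above, not the actual C++/GLSL implementation; the `[N, C, H, W]` example and the function names are illustrative only.

```python
def alignup4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3

def gpu_sizes(sizes: list[int], packed_dim: int) -> list[int]:
    """Derive the (deprecated) GPU sizes from the canonical sizes:
    only the packed dimension is padded to a multiple of 4, which is
    why GPU sizes need not be passed to the shader separately."""
    out = list(sizes)
    out[packed_dim] = alignup4(out[packed_dim])
    return out

# A width-packed [N, C, H, W] tensor of sizes [2, 3, 4, 6]: only the
# packed (last) dim is padded, so the derived GPU sizes are [2, 3, 4, 8].
print(gpu_sizes([2, 3, 4, 6], packed_dim=3))  # [2, 3, 4, 8]
```

Since the derivation needs only the canonical sizes plus the packing layout (available as a specialization constant), the shader can compute it in place instead of binding a second uniform buffer.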
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3181

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 unrelated failure) As of commit 6859982 with merge base 36453fc.

BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
SS-JIA
added a commit
that referenced
this pull request
Apr 19, 2024
…cialization shader

Differential Revision: [D56377775](https://our.internmc.facebook.com/intern/diff/D56377775/)

ghstack-source-id: 223274863
Pull Request resolved: #3181
Contributor
This pull request was exported from Phabricator. Differential Revision: D56377775
yipjustin
approved these changes
Apr 19, 2024
…out via specialization shader"

Differential Revision: [D56377775](https://our.internmc.facebook.com/intern/diff/D56377775/)

[ghstack-poisoned]
SS-JIA
added a commit
that referenced
this pull request
Apr 20, 2024
…ing layout via specialization constants

Pull Request resolved: #3181

## Context

This changeset cleans up how shaders consume tensor metadata in two ways.

### Pass in Packing Layout via Specialization Shader

The packing layout of a tensor determines how to convert between tensor indices and physical texture coordinates. Currently, the packing layout is selected by generating a completely new variant of a shader; however, this is rather expensive in terms of build size. Specialization constant support was added a while back, which enables the packing layout to be communicated to the shader via a specialization constant. This is a much more natural way for a shader to determine the packing layout of its tensors and vary its behaviour accordingly.

The primary benefit is that we can vastly reduce the number of shader variants that are generated; generating variants for every combination of dtype and memory layout can lead to a combinatorial explosion of build size. Note that dtype cannot be passed as a specialization constant, since it affects the types used in the layout portion of a shader.

### Deprecate GPU sizes and Extents

Currently there are three representations of a tensor's sizes: `cpu_sizes()`, `gpu_sizes()`, and `extents()`. The GPU sizes are a simple modification of the CPU sizes in which the packed dim is aligned to the next multiple of 4, while the extents represent the physical extents of the image texture used to store the tensor. However, shaders often need to reference the original sizes of the tensor, so we end up passing multiple representations: the CPU sizes and extents are used to detect out-of-bounds elements, while the GPU sizes are used to convert between logical tensor indices and physical texture coordinates.

Since the GPU sizes and extents are easily derived from the CPU sizes given the packing layout, deprecate both and use CPU sizes exclusively as the canonical tensor sizes. Accordingly, `cpu_sizes()` is renamed to simply `sizes()`.

The primary benefits of this change:

1. Less confusion over how to reference the tensor sizes
2. Fewer descriptors to bind when constructing compute pipelines
3. Fewer uniform buffers to update when resizing tensors between inferences

Differential Revision: [D56377775](https://our.internmc.facebook.com/intern/diff/D56377775/)

ghstack-source-id: 223317313
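Since the extents are likewise described as derivable from the canonical sizes plus the packing layout, the derivation can be sketched as below. This is an illustrative Python model only: the `[N, C, H, W]` ordering, the `packed_dim` encoding, and the dim-to-axis mapping are assumptions for demonstration, not the ET-VK implementation.

```python
import math

def texture_extents(sizes: list[int], packed_dim: int) -> tuple[int, int, int]:
    """One plausible mapping from [N, C, H, W] sizes to 3D texture
    extents, where each texel packs 4 elements along the packed dim
    (0 = width-packed, 1 = height-packed, 2 = channels-packed)."""
    n, c, h, w = sizes
    if packed_dim == 0:                        # width-packed
        return (math.ceil(w / 4), h, c * n)
    if packed_dim == 1:                        # height-packed
        return (w, math.ceil(h / 4), c * n)
    return (w, h, math.ceil(c / 4) * n)        # channels-packed

# Channels-packed [1, 6, 5, 7]: 6 channels pack into ceil(6/4) = 2 texel
# layers, giving extents (7, 5, 2).
print(texture_extents([1, 6, 5, 7], packed_dim=2))  # (7, 5, 2)
```

Because the packing layout is already available to the shader as a specialization constant, this computation can live in shader code, removing the need to bind a dedicated extents uniform buffer.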
Contributor
This pull request was exported from Phabricator. Differential Revision: D56377775
Contributor
This pull request has been merged in c350e58.
SS-JIA
added a commit
that referenced
this pull request
Apr 22, 2024
## Context

#3181 deprecated the `gpu_sizes_ubo()` and `extents_ubo()` functions of `vTensor` in order to standardize how shaders consume the shape/size metadata of input tensors. However, this came at the cost of increased bounds-checking overhead, which is needed to support dynamic shapes: shaders now had to convert the input sizes to texture limits before checking whether a given texel position is out of bounds. Benchmarking revealed that this overhead can be quite significant, especially on lower-power mobile GPUs.

In the interest of preserving performance, `extents_ubo()` is re-introduced, since bounds checking is an operation common to every single shader. However, some improvements are made:

* Instead of `extents`, the nomenclature `texture_limits` is used in order to differentiate from the physical image extents of the texture.
* `texture_limits` is represented as an `ivec3` (previously `uvec4`); this means that using it for bounds checking requires no implicit cast from `uvec` to `ivec` and no swizzling.

Also introduced in this changeset is the convention of passing in both the texture limits and the tensor sizes instead of using `pos_out_of_bounds()`, since passing in the texture limits is probably cheaper. There are some exceptions, though, where I chose not to migrate to this pattern to avoid passing in too many variants of tensor metadata.

### What about `gpu_sizes_ubo`?

I will hold off on re-introducing `gpu_sizes_ubo` for now, since converting `sizes` to `gpu_sizes` is much cheaper compared to `pos_out_of_bounds()`:

```
sizes[packed_dim] = alignup4(sizes[packed_dim]);
```

I will perform some additional benchmarking on this to see if the overhead of the alignment warrants an explicit API for passing GPU sizes to shaders.

Differential Revision: [D56435574](https://our.internmc.facebook.com/intern/diff/D56435574/)

[ghstack-poisoned]
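The trade-off described above can be illustrated with a small sketch: with only the canonical sizes bound, every shader invocation must first derive the texture limits before it can bounds-check, whereas a precomputed `texture_limits` makes the check a plain comparison. This is an illustrative Python model (channels-packed `[N, C, H, W]` layout assumed); the real work happens in GLSL.

```python
import math

def limits_from_sizes(sizes: list[int], packed_dim: int) -> tuple[int, int, int]:
    """The per-invocation derivation that #3181 forced onto shaders
    (channels-packed assumed for brevity) -- the overhead this change removes."""
    n, c, h, w = sizes
    assert packed_dim == 2, "sketch covers channels-packed only"
    return (w, h, math.ceil(c / 4) * n)

def out_of_bounds(pos: tuple[int, int, int],
                  limits: tuple[int, int, int]) -> bool:
    """The cheap check once limits are already known: a plain
    component-wise comparison, in the spirit of an ivec3 `texture_limits`."""
    return any(p >= lim for p, lim in zip(pos, limits))

limits = limits_from_sizes([1, 6, 5, 7], packed_dim=2)  # precompute once on host
print(out_of_bounds((6, 4, 1), limits))  # False: inside extents (7, 5, 2)
print(out_of_bounds((7, 0, 0), limits))  # True: x equals the limit
```

Precomputing the limits once on the host and binding them mirrors the re-introduced `texture_limits` UBO; the derivation function is what each invocation would otherwise repeat.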
SS-JIA
added a commit
that referenced
this pull request
Apr 22, 2024
Differential Revision: [D56435574](https://our.internmc.facebook.com/intern/diff/D56435574/)

ghstack-source-id: 223453651
Pull Request resolved: #3217
facebook-github-bot
pushed a commit
that referenced
this pull request
Apr 22, 2024
Summary: Pull Request resolved: #3217

ghstack-source-id: 223453651
exported-using-ghexport

Reviewed By: yipjustin, jorgep31415

Differential Revision: D56435574

fbshipit-source-id: 656f79eecbfc7c77cbe067df6c9ea54c51c50633
Stack from ghstack (oldest at bottom):
`gpu_sizes_ubo()` and toggle packing layout via specialization shader #3181
Differential Revision: D56377775