RFC: Compute only in int32/long/float/double for portable ops to save size #9635
swolchok wants to merge 2 commits into gh/swolchok/400/head
Conversation
CI status (Dr. CI): as of commit ac64f9e with merge base 811352d, 5 new failures, 11 pending. Artifacts and rendered test results: hud.pytorch.org/pr/pytorch/executorch/9635
I wasn't sure if I was making things up, so: https://developer.arm.com/Processors/Ethos-U55 is a real, present-day example.
Size impact: on my Mac, test/build_size_test.sh reports that size_test_all_ops has size 1205136 bytes before this PR and 1105856 after, a decrease of roughly 8%.
This is known to break tests, I think because it breaks SupportedTensorDtypes::SAME_AS_COMMON for the reasons outlined in #9613; hence the RFC status. The problem is fixable, but if we have directional concerns about this approach, I don't want to invest in fixing it.
Per discussion with @manuelcandales: if we do this, then we need to cast through the "actual" compute type before casting to the output type so that we match ATen. Example: computing in int32 or int16 would cause this to yield 10000, not 16; casting through int8 would correct this.
This is a bad idea because smaller compute dtypes benefit from additional SIMD lanes. |
// Gate the above optimization off if we appear to be on some kind of 8-bit or
// 16-bit CPU, which would invalidate our assumption about 32-bit
// math being just as fast.
constexpr bool cpu_appears_to_be_at_least_32_bit =
    sizeof(void*) >= 4 && sizeof(int) >= 4;
Concern: what if we're running on some sort of 16-bit microcontroller where this is a pessimization?