🚀 The feature, motivation and pitch
It seems suboptimal to me that we have to create separate optimized ops just to get basic stuff like parallelization (and vectorization, but let's start with parallelization). Here's what I'd like to do: (The timeline here is "ASAP", but I'm opening an issue because this got too long for chat and so that I can point to this issue on the PRs.)
Set up a proper CMake build for extension/parallel; right now it's free-riding on buck and getting automatically duplicated into 3 different targets per the generated executorch_srcs.cmake. (done; Add proper CMake build for extension_parallel #8938)
[...] the -DET_USE_THREADPOOL macro we already use and define somewhat ad-hoc. (done; Properly export ET_USE_THREADPOOL from the threadpool extension #8947)
Move extension/parallel/thread_parallel.h to core. (@larryliu0820 suggests runtime/kernel/thread_parallel.h) (Yes I will leave a stub header behind for backward compatibility.) Move thread_parallel.cpp to threadpool, since there will be no reason not to provide it when threads are available. Provide a default implementation of parallel_for if threadpool is not built (gated behind ET_USE_THREADPOOL) that is just an inlinable for loop. (Split & remove extension_parallel #8983)
[...] parallel_for in at least one portable op, either directly or via the workhorse "util" functions. (Add basic parallel_for support to reduce_util #8986)
[...] parallel_for across portable ops and workhorse "util" functions.
Thoughts? Blockers?
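To make the "default implementation of parallel_for ... that is just an inlinable for loop" idea concrete, here is a rough sketch of what such a fallback could look like. Everything in it is an assumption for illustration: the `sketch` namespace, the (begin, end, grain_size, func) signature, and the `fill_squares` helper are hypothetical and are not taken from the actual header or PRs. Only the ET_USE_THREADPOOL gate comes from the plan above.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

namespace sketch {  // hypothetical namespace, not the real ExecuTorch one

#ifdef ET_USE_THREADPOOL
// Threaded build: the definition would live in the threadpool library.
// Declaration only; the signature is an assumption.
bool parallel_for(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& func);
#else
// Fallback when no threadpool is built: a plain serial loop over
// grain_size-sized chunks. Being a header-only template, the compiler
// is free to inline func into the loop.
template <typename Func>
inline bool parallel_for(
    int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  if (begin > end || grain_size <= 0) {
    return false;  // reject malformed ranges instead of looping forever
  }
  for (int64_t i = begin; i < end; i += grain_size) {
    func(i, std::min(i + grain_size, end));  // half-open chunk [i, chunk_end)
  }
  return true;
}
#endif

// Hypothetical caller: an elementwise kernel written once against
// parallel_for, so it works with or without the threadpool.
inline bool fill_squares(std::vector<int64_t>& out) {
  return parallel_for(
      0, static_cast<int64_t>(out.size()), /*grain_size=*/4,
      [&out](int64_t chunk_begin, int64_t chunk_end) {
        for (int64_t i = chunk_begin; i < chunk_end; ++i) {
          out[i] = i * i;
        }
      });
}

}  // namespace sketch
```

The point of the shape is that portable ops call one function unconditionally, and whether the work is spread across threads is decided at build time rather than per op.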
Alternatives
status quo -- slow portable ops
Additional context
No response
RFC (Optional)
No response
cc @larryliu0820 @manuelcandales