ExecutionSpace: flattened OpState#304
Conversation
| inner_opstate.start(); | ||
|
|
||
| try { | ||
| Kokkos::parallel_for( | ||
| std::format("{}: then", Kokkos::Impl::TypeInfo<typename Schd::execution_space>::name()), | ||
| Kokkos::RangePolicy(this->schd.state->exec, 0, 1), | ||
| wrapper); | ||
| } catch (...) { | ||
| stdexec::set_error(std::move(this->rcvr), std::current_exception()); | ||
| } |
There was a problem hiding this comment.
This has no chance to work.
Say we have:
auto chain = stdexec::schedule(esc.get_scheduler()) | stdexec::then(..A..) | stdexec::then(..B..);
stdexec::sync_wait(std::move(chain));What you'll get is:
- the operation state of
sync_waitwill start the operation state of B - operation state of B calls operation state of A
- operation state of A call operation state of the scheduler, that call the set value on its receiver
- thus calling set value on A recevier
- calling set value on B receiver
- calling set value on sync_wait receiver
- triggering the sync wait fence
- Only after that does control return to A operation state start() function, which then launches kernel A.
- Then control returns to B operation state start() function, which launches kernel B
So you got everything in the wrong order, well done 👍
I agree we can have operation states, yet, it's the call to the ThenReceiver set_value that tells us "now is the time to launch you async operation".
And so we're back to the same situation that #262 solves.
If you feel strongly about this, you can followup on #262 and move all the propagate_completion_signal stuff in some operation state, much like in nvexec.
There was a problem hiding this comment.
I agree it can't work on its own. We'd have to modify the schedule sender and sync wait too. Essentially, the start function of the op state of the schedule sender would do nothing. The one of the sync wait would fence.
The issue that this tries to solve is that we currently launch work and fence in our set_value functions. And if we read the standard literally, it seems to say that the set_value must run on an execution agent, i.e. the gpu. This is a problem because we can't launch and fence from device. So even if we don't retain the upstate from this PR, there's the food for thought 😄.
There was a problem hiding this comment.
We'd have to modify the schedule sender and sync wait too.
It will not help you with the when_all case at all.
dfb7d79 to
d97f2c2
Compare
|
Hey @romintomasetti. This is more concretely what I am thinking of. There are two commits:
The main reason I think of the second step is point 10 from https://eel.is/c++draft/exec#async.ops-10:
I interpret this as saying that the set_value will in principle be called on device. And so that's why I feel it may not be the best place to launch work. There could also be a third step and move the launch into a "child opstate". There could be some nice links with graph nodes in that case. |
625a39d to
7cfb5db
Compare
77a3c1a to
dfdc450
Compare
OpState
901eba4 to
4ba0a8d
Compare
11bafad to
173ed6a
Compare
4293b8d to
c163709
Compare
romintomasetti
left a comment
There was a problem hiding this comment.
LEt's also update the PR description.
| std::is_nothrow_move_constructible_v<typename std::remove_cvref_t<Data>::functor_t> | ||
| && stdexec::__nothrow_decay_copyable<Sndr&&>) { |
There was a problem hiding this comment.
I think this is not right.
You'd better have a helper that givens you the ParallelForSender type given the Data and Sndr.
There was a problem hiding this comment.
Let's not write it at all, it does not serve any purpose (error channel or overload resolution) and just make it harder to read.
| requires std::same_as< | ||
| stdexec::__completion_domain_of_t<stdexec::set_value_t, Sndr, stdexec::env_of_t<Rcvr>>, | ||
| Kokkos::Experimental::details::execution_space::Domain | ||
| > |
3a2c3b0 to
8d07d67
Compare
OpStateOpState
d079a48 to
31915fe
Compare
Signed-off-by: Maarten Arnst <maarten.arnst@uliege.be>
31915fe to
7208251
Compare
OpStateOpState
Food for thought :) An alternative design with an opstate.
The main change is that in this design, work is launched from the
startfunction on host. This way, we avoid launching it from theset_valuefunction that the standard seems to want to run on the execution agent, which may be the device.If we make the op state queryable for the exec, we can decide in the
startfunctor whether or not to fence.