Expose cancellation signal on request#650

Draft
mcuelenaere wants to merge 2 commits into restatedev:main from mcuelenaere:feature/cancellation-signal
Conversation

@mcuelenaere

Summary

This PR adds a cancellationSignal: AbortSignal property to the Request interface that aborts as soon as the Restate runtime sends a cancellation, rather than waiting for the next ctx.run() call.

This allows passing the signal to fetch(), database clients, or any API that accepts an AbortSignal for proactive cancellation of in-flight operations. The signal's reason is a CancelledError.
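To make the intended usage concrete, here is a minimal sketch of the pattern described above. The property name `cancellationSignal` is the one this PR proposes; since the Restate SDK itself isn't needed to show the mechanics, a plain `AbortController` stands in for the runtime side, and `streamCompletion` is a hypothetical in-flight operation.

```typescript
// A plain AbortController stands in for the runtime-driven signal;
// in a real handler you would read ctx.request().cancellationSignal.
const controller = new AbortController();
const cancellationSignal = controller.signal;

// Hypothetical long-running operation that honors an AbortSignal.
// A real handler would instead pass `signal` straight to fetch():
//   await fetch(llmEndpoint, { signal });
function streamCompletion(signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    signal.addEventListener("abort", () => reject(signal.reason), { once: true });
    // ... streaming work would eventually call resolve(result) ...
  });
}

const pending = streamCompletion(cancellationSignal);
pending.catch((err) => console.log("aborted with:", err.message));

// The runtime sends a cancellation: the signal aborts immediately,
// rather than waiting for the next ctx.run() call.
controller.abort(new Error("CancelledError"));
```

The key property is that the abort propagates to the in-flight promise as soon as cancellation arrives, with the signal's `reason` carrying the cancellation error.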

Motivation

When an invocation is cancelled, the SDK currently only detects this at the next Restate operation. For handlers with long-running user code between operations (e.g., AI/LLM streaming), cancellation detection can be significantly delayed.

Implementation

A new CancellationWatcherPromise hooks into the existing PromisesExecutor loop to monitor is_completed(cancel_handle()) and abort an AbortController when cancellation is detected. The watcher is stopped via invocationEndPromise when the invocation ends, preventing interaction with a closed VM.
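The watcher described above can be sketched as follows. This is an illustrative reduction, not the PR's actual code: the `poll` method stands in for the check against `is_completed(cancel_handle())` inside the `PromisesExecutor` loop, and `stop()` models the hook registered on `invocationEndPromise`.

```typescript
// Illustrative watcher: aborts an AbortController once the runtime
// reports cancellation, and becomes inert once the invocation ends.
class CancellationWatcher {
  private readonly controller = new AbortController();
  private stopped = false;

  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Called from the progress loop; models checking
  // is_completed(cancel_handle()) in the PromisesExecutor loop.
  poll(isCancelled: boolean): void {
    if (this.stopped || this.controller.signal.aborted) return;
    if (isCancelled) {
      this.controller.abort(new Error("CancelledError"));
    }
  }

  // Registered on invocationEndPromise so the watcher never
  // interacts with a closed VM after sys_end().
  stop(): void {
    this.stopped = true;
  }
}

const watcher = new CancellationWatcher();
watcher.poll(false); // no cancellation yet: signal stays live
watcher.poll(true);  // runtime signalled cancellation: signal aborts
```

Stopping the watcher before the loop observes cancellation leaves the signal unaborted, which is what prevents the post-`sys_end()` errors described in the second commit.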

Design document

Notes

I've verified this change works correctly in my local setup. I haven't run the added E2E tests yet; I was hoping CI could help with that.

mcuelenaere and others added 2 commits February 21, 2026 10:21
…tion

Add a `cancellationSignal: AbortSignal` property to the `Request` interface
that aborts as soon as the Restate runtime sends a cancellation, rather than
waiting for the next `ctx.run()` call. This allows users to pass the signal
to fetch(), database clients, or other async operations for proactive
cancellation handling.

Implementation uses a new CancellationWatcherPromise that hooks into the
existing PromisesExecutor loop to monitor the VM's cancel_handle() without
modifying InputPump or doProgressInner.

Co-authored-by: Cursor <cursoragent@cursor.com>
Stop the watcher's doProgressInner loop when the invocation ends by
registering cancelWatcher.stop() on invocationEndPromise. Without this,
the watcher keeps calling do_progress() on the closed VM after sys_end(),
causing cascading "(598) State machine was closed" errors and DANGER
warnings about operations running after invocation close.

Co-authored-by: Cursor <cursoragent@cursor.com>
@tillrohrmann tillrohrmann requested a review from nikrooz February 23, 2026 09:46
@tillrohrmann
Contributor

Thanks a lot for your contribution @mcuelenaere. I think this is a really nice improvement. Curious to hear what @nikrooz thinks about this idea.

The one thing to be aware of is that, with this change, the code that reacts to the AbortSignal might already have performed some side effects by the time we record the journal entry as completed with a CancellationException. Hence, CancellationException will no longer mean that this journal entry hasn't been executed. This is no different from ctx.run failing with a terminal exception. Another aspect is that we will now potentially have a cancelled journal entry as well as the cancellation signal in the journal. @slinkydeveloper is this a problem?

@mcuelenaere
Author

Hi @tillrohrmann, I hadn't thought of any possible side effects that could occur from this. To give some context, our use case is a Restate service that does LLM streaming (via an out-of-band pub-sub mechanism), but we'd like the user to be able to abort it. My thinking was to leverage the built-in Restate cancellation mechanism, but we do need to do some simple cleanup/bookkeeping (setting some state in the virtual object) after cancellation is requested. If that breaks certain Restate assumptions, then I could look into pivoting the cancellation to an out-of-band mechanism as well (and drop this PR). WDYT?

@tillrohrmann
Contributor

I don't think that this is a problem @mcuelenaere, but I wanted to double-check with @slinkydeveloper, who knows the SDK implementation a lot better than me. I like your approach and the capability to abort an ongoing ctx.run block. If this is a solution that works, then I'd love to have it in the other SDKs as well, if there is a comparable mechanism.

@nikrooz
Contributor

nikrooz commented Feb 24, 2026

Thanks @mcuelenaere for doing this PR. As Till mentioned, this is something for Francesco to look at.
It's also worth mentioning that cancel is cooperative and surfaces at safe boundaries. If cancellationSignal aborts in-flight work inside ctx.run, and that abort is turned into a RetryableError (which it will be if, for instance, you pass the signal to a fetch), the SDK records a transient run failure and retries the run closure. The handler remains blocked on await ctx.run(...), so cancellation can be starved and user code may never observe CancelledError (until retries are exhausted, if ever). So you will end up with something like:

CC @igalshilman

[Screenshot attached, 2026-02-24 16:14: retrying run failure output]

@mcuelenaere
Author

For reference, I didn't have the time to look into this yet, but this PR might be causing journal mismatch errors: https://restatecommunity.slack.com/archives/C0821C5RBH9/p1772110842398699

@slinkydeveloper
Collaborator

I skimmed through this, and I think the current design won't work correctly, as we won't be able to replay it deterministically: it creates a situation with two competing asynchronous tasks:

  • The abort handler you register
  • The restate handler that will eventually get the promise completed with cancel error

If the abort handler runs some asynchronous stuff, and uses the context (most likely), then you will have the abort handler competing with the restate handler.

I think for this to work well, we would have to change how we expose cancellation in the first place, and/or deterministically be able to run the abort handler and later resume the restate handler code... Will have to think about this...

@mcuelenaere
Author

Just FYI: internally, we are no longer using this PR. We switched to an out-of-band signalling system (basically, using Redis), which fulfills the requirements just as well. I can close this PR or keep it open, up to you. I will most likely not be maintaining this change.
