fix: stop genservers gracefully#319

Open
dicethedev wants to merge 1 commit into lambdaclass:main from dicethedev:fix/stop-genservers-gracefully

Conversation

@dicethedev

🗒️ Description

This PR implements graceful shutdown for the ethlambda node's actors and HTTP servers. Previously, the node would abruptly terminate on Ctrl+C without waiting for in-flight operations to complete.

Now, on shutdown signal:

  • The BlockChain and P2P actors are stopped cleanly via their context's stop() method
  • The API and metrics HTTP servers are shut down gracefully using Axum's with_graceful_shutdown()
  • The main thread waits for all actors and server tasks to finish before exiting

These changes ensure that:

  • Aggregation workers complete their current jobs
  • Database writes are flushed
  • Network connections are closed properly

This prevents data corruption and improves reliability.
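A minimal, self-contained sketch of this flow (the server helper is a stub standing in for the real start_api_server, which passes the shutdown future to Axum's with_graceful_shutdown(); the actor calls are shown as comments):

```rust
use std::sync::Arc;
use tokio::sync::Notify;

// Stub standing in for the real server helpers, which hand the shutdown
// future to axum's with_graceful_shutdown() internally.
async fn start_api_server(shutdown: impl std::future::Future<Output = ()>) {
    shutdown.await; // "serve" until the shutdown future resolves
}

#[tokio::main]
async fn main() {
    let shutdown_notify = Arc::new(Notify::new());

    let api_notify = shutdown_notify.clone();
    let api_handle = tokio::spawn(async move {
        start_api_server(async move { api_notify.notified().await }).await;
    });

    // Block until Ctrl+C, then fan the shutdown signal out.
    tokio::signal::ctrl_c().await.expect("failed to listen for ctrl_c");

    // The actors are stopped via their context's stop() method here,
    // e.g. blockchain_ref.context().stop() and p2p_ref.context().stop().
    shutdown_notify.notify_waiters();

    // Wait for the server task to finish before exiting.
    let _ = api_handle.await;
}
```

Note that the review below flags a race in exactly this notify_waiters() pattern.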

🔗 Related Issues or PRs

Closes #195

@greptile-apps
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR adds graceful shutdown to the ethlambda node by stopping the BlockChain and P2P actors via context().stop(), signalling both HTTP servers through a shared Arc<Notify>, and awaiting all handles before exit. The crates/net/rpc changes are clean; the main concern is in the shutdown signalling path.

  • notify_waiters() race: Notify::notify_waiters() only wakes currently registered waiters. If the spawned server tasks haven't been polled (and thus haven't registered their notified() futures) before Ctrl+C fires, both notifications are silently dropped and api_handle.await / metrics_handle.await block indefinitely. Using notify_one() with two separate Arc<Notify> instances, or a watch channel, would eliminate this race (see the sketch just after this list).
  • Unbounded shutdown awaits: There is no timeout on the four join/await calls, so a stuck actor or a server holding open long-lived connections will prevent the process from ever exiting.
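A tiny self-contained illustration of the permit semantics behind that P1 finding (not project code):

```rust
use std::sync::Arc;
use tokio::sync::Notify;

#[tokio::main]
async fn main() {
    let n = Arc::new(Notify::new());

    // notify_one() stores a permit when no task is waiting, so a waiter
    // that registers afterwards still completes:
    n.notify_one();
    n.notified().await; // resolves immediately, consuming the stored permit

    // notify_waiters() stores nothing: called with no registered waiters,
    // the wake-up is lost, and a later notified().await would hang forever.
    n.notify_waiters();
    println!("done");
}
```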

Confidence Score: 3/5

Safe to merge for typical usage, but a race condition in the shutdown signalling path can cause an indefinite hang.

One P1 finding (notify_waiters race that silently drops shutdown signals and causes indefinite hangs) and one P2 (no shutdown timeout). The P1 drives the score below the 4/5 ceiling.

bin/ethlambda/src/main.rs — specifically the notify_waiters() call and the unbounded join awaits in the shutdown sequence.

Important Files Changed

bin/ethlambda/src/main.rs: Adds graceful shutdown via Arc<Notify> and actor context().stop(); the notify_waiters() call can silently drop notifications if spawned tasks haven't registered their waiters yet, causing an indefinite hang on shutdown.
crates/net/rpc/src/lib.rs: Adds a generic shutdown future parameter to both server functions and wires it into axum's with_graceful_shutdown(); a clean and idiomatic change with no issues.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Main
    participant BlockChain
    participant P2P
    participant ApiServer
    participant MetricsServer

    Main->>BlockChain: spawn actor
    Main->>P2P: spawn actor
    Main->>ApiServer: tokio::spawn (with shutdown_future)
    Main->>MetricsServer: tokio::spawn (with shutdown_future)
    Main->>Main: ctrl_c().await

    Main->>BlockChain: context().stop()
    Main->>P2P: context().stop()
    Main->>ApiServer: shutdown_notify.notify_waiters()
    Main->>MetricsServer: shutdown_notify.notify_waiters()

    Main->>BlockChain: join().await
    BlockChain-->>Main: done
    Main->>P2P: join().await
    P2P-->>Main: done
    Main->>ApiServer: api_handle.await
    ApiServer-->>Main: done
    Main->>MetricsServer: metrics_handle.await
    MetricsServer-->>Main: done

    Main->>Main: info!("Shutdown complete")
```

Comment thread bin/ethlambda/src/main.rs
let p2p_ref = p2p.actor_ref().clone();
blockchain_ref.context().stop();
p2p_ref.context().stop();
shutdown_notify.notify_waiters();
Contributor

P1 notify_waiters() can silently drop shutdown signals

Notify::notify_waiters() only wakes futures that are currently registered as waiters at the moment of the call — it does not store a permit for futures that haven't been polled yet. The shutdown futures inside the spawned tasks (async move { X.notified().await }) don't register themselves until axum polls the shutdown future for the first time, which happens asynchronously after tokio::spawn. If ctrl_c fires before the spawned tasks have had a chance to run, both notifications are silently dropped and api_handle.await / metrics_handle.await will block indefinitely — the process never exits.

The clearest fix is two separate Arc<Notify> handles each called with notify_one(), which stores a permit that survives until consumed:

```rust
let api_shutdown_notify = Arc::new(Notify::new());
let metrics_shutdown_notify = Arc::new(Notify::new());
// ... pass clones into spawned tasks ...
api_shutdown_notify.notify_one();
metrics_shutdown_notify.notify_one();
```

Alternatively, a tokio::sync::watch channel with a stored value avoids this race entirely for any number of listeners.
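For the watch-channel alternative, a sketch might look like this (channel and task names are illustrative):

```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // watch stores the latest value, so even a receiver that starts
    // polling after send(true) observes the shutdown flag: no lost wake-up.
    let (shutdown_tx, shutdown_rx) = watch::channel(false);

    let mut api_rx = shutdown_rx.clone();
    let api_handle = tokio::spawn(async move {
        // wait_for resolves immediately if the value is already true.
        let _ = api_rx.wait_for(|&stop| stop).await;
    });

    let mut metrics_rx = shutdown_rx;
    let metrics_handle = tokio::spawn(async move {
        let _ = metrics_rx.wait_for(|&stop| stop).await;
    });

    // On Ctrl+C: flip the flag once; every listener sees it.
    let _ = shutdown_tx.send(true);

    let _ = api_handle.await;
    let _ = metrics_handle.await;
}
```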


Collaborator

This is why I suggested CancellationToken, since it was made specifically for these cases
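A sketch of that suggestion, assuming the tokio-util crate is available (task wiring is illustrative):

```rust
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();

    let api_token = token.clone();
    let api_handle = tokio::spawn(async move {
        // cancelled() resolves immediately if cancel() already happened,
        // so the notify_waiters() race above does not apply.
        api_token.cancelled().await;
    });

    // On Ctrl+C:
    token.cancel();

    let _ = api_handle.await;
}
```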

Comment thread bin/ethlambda/src/main.rs
Comment on lines +240 to +243
blockchain_ref.join().await;
p2p_ref.join().await;
let _ = api_handle.await;
let _ = metrics_handle.await;
Contributor

P2 No timeout on shutdown awaits

All four await calls — blockchain_ref.join(), p2p_ref.join(), api_handle.await, and metrics_handle.await — are unbounded. If any actor gets stuck in a long-running handler or the HTTP servers hold open long-lived connections, the process will hang indefinitely. Consider wrapping the shutdown sequence in tokio::time::timeout so the node can force-exit after a reasonable deadline (e.g., 30 s) even if something stalls.
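A sketch of that suggestion (the commented-out calls stand in for the real handles in main.rs):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Bound the whole shutdown sequence with one deadline.
async fn shutdown_with_deadline(api_handle: tokio::task::JoinHandle<()>) {
    let result = timeout(Duration::from_secs(30), async {
        // blockchain_ref.join().await;
        // p2p_ref.join().await;
        let _ = api_handle.await;
        // let _ = metrics_handle.await;
    })
    .await;

    if result.is_err() {
        eprintln!("shutdown deadline exceeded, force-exiting");
    }
}
```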


Collaborator

I think it'd be better to do "1 ctrl+C to stop, N ctrl+C to abort immediately" instead. We can tackle that in another PR, if we find it necessary
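A sketch of that behavior (signal wiring only; the graceful path is elided):

```rust
use std::process;

#[tokio::main]
async fn main() {
    // First Ctrl+C: begin graceful shutdown.
    tokio::signal::ctrl_c().await.expect("failed to listen for ctrl_c");
    eprintln!("shutting down gracefully; press Ctrl+C again to abort");

    // Second Ctrl+C, handled in the background: force-exit.
    tokio::spawn(async {
        if tokio::signal::ctrl_c().await.is_ok() {
            eprintln!("aborting immediately");
            process::exit(1);
        }
    });

    // ... graceful shutdown sequence runs here ...
}
```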

Collaborator

@MegaRedHand left a comment


Hey @dicethedev! Thank you for your contribution! I left you some comments.

Also, remember to sign your commits, as explained in our CONTRIBUTING.md

Comment thread bin/ethlambda/src/main.rs
Comment on lines 207 to +208

tokio::spawn(async move {
let _ = ethlambda_rpc::start_metrics_server(metrics_socket)
let shutdown_notify = Arc::new(Notify::new());
Collaborator

Can we use a CancellationToken here?

Comment thread crates/net/rpc/src/lib.rs
address: SocketAddr,
store: Store,
aggregator: AggregatorController,
shutdown: impl std::future::Future<Output = ()> + Send + 'static,
Collaborator

Let's try to use the specific type here, instead of generics
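One way that could look, assuming a CancellationToken from tokio-util and axum 0.7-style serving (the Store and AggregatorController parameters are elided; only the signature shape is the point):

```rust
use std::net::SocketAddr;
use tokio_util::sync::CancellationToken;

// Concrete shutdown parameter instead of `impl Future`.
pub async fn start_api_server(address: SocketAddr, shutdown: CancellationToken) {
    let listener = tokio::net::TcpListener::bind(address).await.unwrap();
    let app = axum::Router::new();
    axum::serve(listener, app)
        // cancelled_owned() yields the Send + 'static future required here.
        .with_graceful_shutdown(shutdown.cancelled_owned())
        .await
        .unwrap();
}
```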

