fix: stop genservers gracefully #319
Conversation
Greptile Summary

This PR adds graceful shutdown to the ethlambda node by stopping the actors and HTTP servers cleanly on exit.
Confidence Score: 3/5

Safe to merge for typical usage, but a race condition in the shutdown signalling path can cause an indefinite hang. One P1 finding (a notify_waiters race that silently drops shutdown signals and causes indefinite hangs) and one P2 (no shutdown timeout). The P1 keeps the score below the 4/5 ceiling. The affected code is in bin/ethlambda/src/main.rs, specifically the notify_waiters() call and the unbounded join awaits in the shutdown sequence.
| Filename | Overview |
|---|---|
| bin/ethlambda/src/main.rs | Adds graceful shutdown via Arc<Notify> and actor context().stop(); the notify_waiters() call can silently drop notifications if spawned tasks haven't registered their waiters yet, causing an indefinite hang on shutdown. |
| crates/net/rpc/src/lib.rs | Adds a generic shutdown future parameter to both server functions and wires it into axum's with_graceful_shutdown(); clean and idiomatic change with no issues. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main
    participant BlockChain
    participant P2P
    participant ApiServer
    participant MetricsServer
    Main->>BlockChain: spawn actor
    Main->>P2P: spawn actor
    Main->>ApiServer: tokio::spawn (with shutdown_future)
    Main->>MetricsServer: tokio::spawn (with shutdown_future)
    Main->>Main: ctrl_c().await
    Main->>BlockChain: context().stop()
    Main->>P2P: context().stop()
    Main->>ApiServer: shutdown_notify.notify_waiters()
    Main->>MetricsServer: shutdown_notify.notify_waiters()
    Main->>BlockChain: join().await
    BlockChain-->>Main: done
    Main->>P2P: join().await
    P2P-->>Main: done
    Main->>ApiServer: api_handle.await
    ApiServer-->>Main: done
    Main->>MetricsServer: metrics_handle.await
    MetricsServer-->>Main: done
    Main->>Main: info!("Shutdown complete")
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: bin/ethlambda/src/main.rs
Line: 238
Comment:
**`notify_waiters()` can silently drop shutdown signals**
`Notify::notify_waiters()` only wakes futures that are *currently registered* as waiters at the moment of the call — it does not store a permit for futures that haven't been polled yet. The shutdown futures inside the spawned tasks (`async move { X.notified().await }`) don't register themselves until axum polls the shutdown future for the first time, which happens asynchronously after `tokio::spawn`. If `ctrl_c` fires before the spawned tasks have had a chance to run, both notifications are silently dropped and `api_handle.await` / `metrics_handle.await` will block indefinitely — the process never exits.
The clearest fix is two separate `Arc<Notify>` handles each called with `notify_one()`, which stores a permit that survives until consumed:
```rust
let api_shutdown_notify = Arc::new(Notify::new());
let metrics_shutdown_notify = Arc::new(Notify::new());
// ... pass clones into spawned tasks ...
api_shutdown_notify.notify_one();
metrics_shutdown_notify.notify_one();
```
Alternatively, a `tokio::sync::watch` channel with a stored value avoids this race entirely for any number of listeners.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: bin/ethlambda/src/main.rs
Line: 240-243
Comment:
**No timeout on shutdown awaits**
All four `await` calls — `blockchain_ref.join()`, `p2p_ref.join()`, `api_handle.await`, and `metrics_handle.await` — are unbounded. If any actor gets stuck in a long-running handler or the HTTP servers hold open long-lived connections, the process will hang indefinitely. Consider wrapping the shutdown sequence in `tokio::time::timeout` so the node can force-exit after a reasonable deadline (e.g., 30 s) even if something stalls.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: stop genservers gracefully"
```rust
let p2p_ref = p2p.actor_ref().clone();
blockchain_ref.context().stop();
p2p_ref.context().stop();
shutdown_notify.notify_waiters();
```
**`notify_waiters()` can silently drop shutdown signals**
This is why I suggested CancellationToken, since it was made specifically for these cases
```rust
blockchain_ref.join().await;
p2p_ref.join().await;
let _ = api_handle.await;
let _ = metrics_handle.await;
```
**No timeout on shutdown awaits**
I think it'd be better to do "1 ctrl+C to stop, N ctrl+C to abort immediately" instead. We can tackle that in another PR, if we find it necessary
MegaRedHand left a comment
Hey @dicethedev! Thank you for your contribution! I left you some comments.
Also, remember to sign your commits, as explained in our CONTRIBUTING.md
```rust
tokio::spawn(async move {
let _ = ethlambda_rpc::start_metrics_server(metrics_socket)
let shutdown_notify = Arc::new(Notify::new());
```
```rust
address: SocketAddr,
store: Store,
aggregator: AggregatorController,
shutdown: impl std::future::Future<Output = ()> + Send + 'static,
```
Let's try to use the specific type here, instead of generics
🗒️ Description
This PR implements graceful shutdown for the ethlambda node's actors and HTTP servers. Previously, the node would abruptly terminate on Ctrl+C without waiting for in-flight operations to complete.
Now, on shutdown signal:

- actors are stopped via their `stop()` method
- HTTP servers drain via axum's `with_graceful_shutdown()`

These changes ensure that in-flight operations complete before the process exits, which prevents data corruption and improves reliability.
🔗 Related Issues or PRs
Closes #195