Conversation
    "submitted bottom up checkpoint({}) in parent at height {}",
    event.height,
    epoch
.unwrap();
Just curious about the behaviour when the number of checkpoints to submit (say 10) is actually more than the number of permits (say 2). So acquire_owned will actually block if 2 checkpoints are being submitted until one of 2 actually finishes? Even if submit_checkpoint fails and the thread crashes, drop(submission_permit) will be executed and no dead lock here right?
So acquire_owned will actually block if 2 checkpoints are being submitted until one of 2 actually finishes?
Yes, it will literally block (the for loop just pauses).
Even if submit_checkpoint fails and the thread crashes, drop(submission_permit) will be executed and no dead lock here right?
My last version had an issue where the async task could panic before dropping the permit (because of the use of unwrap()). It's been fixed here, so the permit will be dropped whether submitting the checkpoint succeeds or fails.
ipc/provider/src/checkpoint.rs
Outdated
epoch
.unwrap();
all_submit_tasks.push(tokio::task::spawn(async move {
    Self::submit_checkpoint(parent_handler_clone, submitter, bundle, event).await;
Will this actually work in parallel, because there is a sequential enforcement in the checkpoint height, so say we are submitting two checkpoints at height 100 and 130. If height 100 is not submitted, 130 cannot be submitted. My question is really about sort of race in the sense that if height 100 is still in the memory pool and the transaction for height 130 is submitted, will it pass the gas estimation without errors? Or will it be rejected due to nonce not set incrementally?
I'm not entirely sure whether the submission of 130 can succeed if the submission of 100 is still in progress... @aakoshh Can you share your opinions here?
I don't know how Lotus works exactly, but let's say that it:
- limits the number of pending transactions in the mempool to 4
- orders the pending transactions by nonce
- applies them to an in-memory check-state like fendermint
If that were true, then as long as the transactions are created in a logical order, their order could be restored by Lotus when they are included in the block. As for the checks, it probably depends on whether that particular node has seen the preceding transactions or not.
Note that we would still send the transactions sequentially, we just don't want to wait for the receipt between each submission.
Indeed, blockchain nodes accept gapped transactions via their RPC and via the mpool, because they live in an eventually consistent universe. Note that any of these can happen, in addition to other scenarios, of course:
- The user submits transactions in sequence to a load-balanced RPC endpoint, and they land gapped in the backend nodes.
- The user submits out of order transactions to the same node, which is just a subcase of the gapped situation.
- The user submits many transactions, all in order, to the same backend node, but each tx takes different gossip propagation routes through the network and it arrives to various nodes in various non-sequential and gapped orders.
ipc/provider/src/checkpoint.rs
Outdated
Err(_err) => {
    log::debug!("Failed to submit checkpoint");
- Err(_err) => {
-     log::debug!("Failed to submit checkpoint");
+ Err(err) => {
+     log::error!(error = err, height = bundle.height, "Failed to submit checkpoint");
Not sure how to log the height, but ignoring errors and even demoting the log to debug level is a recipe for frustration.
Agreed. Changed to error level logging.
The error = err, height = bundle.height do not seem to work. I just log them in a string instead.
@aakoshh key values were added to the log crate in v0.4.21, but our Cargo.lock is fixed on v0.4.20. We'll need to upgrade before we can adopt them without unlocking the unstable_kv feature.
ipc/provider/src/checkpoint.rs
Outdated
.await
.map_err(|e| {
    anyhow!(
        "cannot submit bottom up checkpoint at height {} due to: {e:}",
- "cannot submit bottom up checkpoint at height {} due to: {e:}",
+ "cannot submit bottom up checkpoint at height {} due to: {e:?}",
Not sure what {e:} is but I suspect it's the same as {e}. There is also {e:#}.
This is from original code...
I changed it to just {}.
"submitted bottom up checkpoint({}) in parent at height {}",
event.height,
epoch
- "submitted bottom up checkpoint({}) in parent at height {}",
- event.height,
- epoch
+ height = event.height, epoch, "submitted bottom up checkpoint"
This is also from original code.
Your suggestion doesn't compile: height is not recognized in this log::info! macro. I ended up putting the values in the string instead.
This is OK for a start, but it's closer to a batched submission than to actual parallelism. The goal was to have max-parallelism active threads submitting checkpoints at all times.
EDIT: apologies, I think I misread a loop condition. This indeed is performing parallel submissions!
Could you please comment on how this was tested?
ipc/provider/src/checkpoint.rs
Outdated
    log::error!("Fail to submit checkpoint at height {height}: {e}");
    drop(submission_permit);
    Err(e)
} else {
    drop(submission_permit);
    Ok(())
}
If all we're doing is logging, we can use inspect_err.
I tested it by running the relayer locally with much more debug logging added (not checked in), and had these observations:
I think these observations are what we expected from the change.
This closes ENG-767.
We now support submitting bottom-up checkpoints with limited parallelism, configured by the --max-parallel-submission flag in the ipc-cli checkpoint relayer command.