feat(consensus): implement consensus/qbft#463
Conversation
|
@claude review this PR |
|
Claude finished @iamquang95's task in 7m 43s —— View job Code review — feat(consensus): implement consensus/qbftSummarySubstantial port of Charon's Findings[High]
|
| Component | Go | Rust | Match | Notes |
|---|---|---|---|---|
| Inbound admission order: verify→duty gate→justifications→values→deadline | yes | yes | yes | component.rs:336-387 vs qbft.go:614-676 |
| Recv buffer capacity (100) | instance.RecvBufferSize |
RECV_BUFFER_SIZE |
yes | instance.rs:38 |
| Single-slot channels (hash/value/verify/err/decided_at, cap 1) | yes | yes | yes | instance.rs:117-122 |
leader = (slot + duty_type + round) % nodes |
yes | yes | yes | definition.rs:461-476; signed arithmetic via rem_euclid |
| QBFT v2 protocol ID | protocols.QBFTv2ProtocolID |
QBFT_V2_PROTOCOL_ID |
yes | |
| Inbound stream receive timeout | 5s | 5s | yes | p2p.rs:43 |
| Outbound stream send timeout | 7s | 7s | yes | p2p.rs:45 |
| Skip self on broadcast | yes | yes | yes | p2p.rs:492-507 |
| Sniffer message buffer | append-only | append-only | yes | sniffer.rs |
| Consensus metrics | ported | absent | no | see Medium #2 |
| Round timers (inc / dlinear / linear) | yes | yes | yes | timer.rs |
compareAttestations source/target equality |
yes | yes | yes | definition.rs:167-201 |
qbft.Run cancellation by decide callback |
yes | yes | yes | runner.rs:277-282; instance_ct.cancel() |
Subscribe / SubscribePriority value-type filtering |
yes | yes | yes | component.rs:237-256 |
Tests
I could not run cargo +nightly fmt --all --check, cargo clippy ... -D warnings, or cargo test --workspace --all-features from this sandbox (cargo is not on the allowlist). Please verify locally — the in-memory consensus test (qbft_run_test.rs) and the libp2p loopback test in p2p.rs provide strong coverage but rely on tokio::time and may be flaky under high parallelism.
Open questions / assumptions
- Are consensus metrics intentionally out of scope for this PR? If yes, drop
decided_at_tx/take_decided_at_rxto keep the surface honest; if no, file a follow-up. - The retain-until-deadline behaviour added in
233e80eis correct, but please confirm there's a follow-up test that explicitly walks the "late inbound message after decision" code path (handle_drops_late_message_after_started_receiver_closedcovers the basic case). qbft_run_testshould be extended to actually run a sub-quorum cluster — the current test names are misleading.
· Branch:iamquang95/consensus-qbft
|
@varex83agent /review-pr |
varex83agent
left a comment
There was a problem hiding this comment.
QBFT consensus port — review
This PR ports Charon's core/consensus/qbft component to a new pluto-consensus crate (moving the existing consensus module out of pluto-core), plus the supporting libp2p adapter, transport, sniffer, round timers, and the dutydb unsigned-data-set SSZ/JSON decoding. Overall this is a high-quality, faithful port with strong test coverage and golden-vector parity tests. Three independent review passes (functional-equivalence vs Charon v1.7.1, security, and concurrency/state-machine) found no correctness bugs.
Verified parity (v1.7.1): leader election (slot+dutyType+round)%nodes, quorum (2n+2)/3, verify_msg/handle validation order and error strings, justification duty-differs check, compare_attester field comparison, hash_proto (SSZ PutBytes over deterministic proto + BTreeMap ordering), increasing/eager-double-linear/proposal timers, sniffer fields & protocol id, dutydb SSZ offsets (8/96), and participate() duty-type/feature gating all match the pinned Go reference.
Security: signature verification runs before any expensive work or instance creation; the instances map is gated by the duty gater + deadliner and reaped by the expired-cleanup loop; inbound size/timeout bounds are enforced; SSZ offset/length validation in decode_attestation_data_ssz is sound; no key material reaches logs/Debug; justification recursion is depth-1. No reachable issues found.
Bugs (must-fix): none.
Findings: 3 minor (a block_on-on-blocking-thread robustness/documentation note, error double-stringification that loses the typed source chain, and a test-coverage gap for the attester-compare end-to-end path) and 2 nits, all inline. None block merge — leaving open for author discretion.
Note (out of scope, FYI only): crates/consensus/src/timer.rs linear_subsequent_round_timeout uses from_millis with a comment claiming charon#4537 landed in the reference, but v1.7.1's linearRoundTimer still uses the bare time.Duration(200*(round-1)+200) (nanoseconds). This is pre-existing code relocated by the rename, not introduced here, so it is not flagged inline — worth a follow-up to confirm the deliberate deviation.
emlautarom1
left a comment
There was a problem hiding this comment.
Could you provide some guidance on this PR? At over 6k LOC it's hard to figure out what its is scope.
|
@emlautarom1 I updated the PR description on the structure of the PR. Also I have to move consensus out from core to avoid the cyclic dependency |
emlautarom1
left a comment
There was a problem hiding this comment.
I did not finish reviewing the entire PR but I would like to share some partial feedback to not block.
emlautarom1
left a comment
There was a problem hiding this comment.
I have to say the code is quite convoluted in both Pluto and Charon. It seems to be the result of:
- Use of many callbacks which accept references/arc/box
- Methods on structs with different lifetimes, specially in closures
- Two type of channels (crossbeam and tokio)
I don't see any specific issue beyond some minor duplication. P2P requires an extra pair of eyes though before approving.
|
@varex83 can you help to take a look at p2p layer |
|
sure, let me take a look
|
| if self.is_connected(&peer_id) { | ||
| self.pending_events.push_back(ToSwarm::NotifyHandler { | ||
| peer_id, | ||
| handler: NotifyHandler::Any, |
There was a problem hiding this comment.
One possible concern here - if we have multiple connection handlers (in case of multiple connection protocols), we can get to the point of sending multiple qbft messages
There was a problem hiding this comment.
Any is actually exactly one (not all connections), so we will only send one message
| /// Queues a behaviour send request until libp2p opens a stream. | ||
| fn on_behaviour_event(&mut self, event: Self::FromBehaviour) { | ||
| match event { | ||
| ToHandler::Send { request_id, msg } => self.pending_open.push_back((request_id, msg)), |
There was a problem hiding this comment.
Could be wrong here, but I think we should make a check for a duplicate send events in the connection handler, it should be created globally per peer id in the behaviour and passed to conn handlers.
There was a problem hiding this comment.
I don't think a duplicate can happen here. Broadcast will emit only one Send event per peer
Replace stringly-typed inbound/send errors in the QBFT libp2p adapter with InboundError/SendError enums threaded through FromHandler and Event, so consumers can match on variants and walk the source() chain instead of parsing strings. The common write-path I/O keeps the full io::Error. Drops the now-dead timeout_error helper (timeout is a payload-free variant) and the unused request_id param of write_outbound.
Remove the redundant `peers: Vec<PeerId>` field from the QBFT p2p adapter Config: cluster membership and broadcast fan-out now read from `P2PContext::known_peers()`, the single source of truth already shared with the rest of the p2p stack. `local_peer_id` stays on Config because the Node sets it on the context only after Behaviour::new runs. handle_broadcast collects non-self targets up front so the immutable known_peers borrow is released before enqueue_send borrows self mutably; target_count is now the exact deduplicated peer count.
…qbft # Conflicts: # crates/core/src/validatorapi/component.rs
5e007fb to
b294cd3
Compare
Fix: #157
This PR ports Charon’s
core/consensus/qbftwrapper into Rust. The existingpluto_core::qbftalgorithm remains the generic QBFT engine; this PR adds the production consensus layer around it: admission, value handling, runner lifecycle, transport/self-send, p2p wiring, decision dispatch, and parity tests.Because Charon’s
qbft.gois large, the Rust port is split by responsibility:component.rs:Consensus, constructor,subscribe,start, inbound admission, peer/pubkey lookup, instance IO map.runner.rs:propose,propose_priority,participate, internal propose flow,run_instance, async-to-core bridge, cancellation/teardown.definition.rs: coreqbft::Definition, leader selection, decide callback, compare callback, round-change logging, timeout reason logic.transport.rs: adapter between core QBFT broadcasts andQbftConsensusMsg; value cache, self-send, sniffer integration.p2p.rs: QBFT libp2p protocol, handler, behaviour, framing, send/receive loop.msg.rs: Charonmsg.goequivalent: hashing, signing, verification, wrapping/unwrapping consensus messages.sniffer.rs: Charonsniffer.goequivalent.Inbound flow:
p2p.rsreceivesQbftConsensusMsg→ callsConsensus::handleincomponent.rs→ admission checks duty/peer/signature/hash/justification →msg.rswraps protobuf into coreqbft::Msg→ message enters that duty’s receive buffer →runner.rsbridges it intopluto_core::qbft.Outbound flow:
pluto_core::qbftcalls broadcast callback fromdefinition.rs→transport.rsconverts core message toQbftConsensusMsg, resolves values, self-sends locally, records sniffer data → broadcaster sends viap2p.rs.Runner flow:
runner.rsimplementspropose,propose_priority,participate, andrun_instance. It getsInstanceIofromcomponent.rs, wires local value/hash/verify channels, buildsdefinition.rs+transport.rs, then runs the core QBFT engine.Decision flow:
Core QBFT decides →
definition.rsdecide callback decodes the value →component.rssubscribers receive the decided duty/value.Test coverage includes: