feat(consensus): implement consensus/qbft by iamquang95 · Pull Request #463 · NethermindEth/pluto

iamquang95 · 2026-06-01T07:52:29Z

This PR ports Charon’s core/consensus/qbft wrapper into Rust. The existing pluto_core::qbft algorithm remains the generic QBFT engine; this PR adds the production consensus layer around it: admission, value handling, runner lifecycle, transport/self-send, p2p wiring, decision dispatch, and parity tests.

Because Charon’s qbft.go is large, the Rust port is split by responsibility:

component.rs: Consensus, constructor, subscribe, start, inbound admission, peer/pubkey lookup, instance IO map.
runner.rs: propose, propose_priority, participate, internal propose flow, run_instance, async-to-core bridge, cancellation/teardown.
definition.rs: core qbft::Definition, leader selection, decide callback, compare callback, round-change logging, timeout reason logic.
transport.rs: adapter between core QBFT broadcasts and QbftConsensusMsg; value cache, self-send, sniffer integration.
p2p.rs: QBFT libp2p protocol, handler, behaviour, framing, send/receive loop.
msg.rs: Charon msg.go equivalent: hashing, signing, verification, wrapping/unwrapping consensus messages.
sniffer.rs: Charon sniffer.go equivalent.

Inbound flow:
p2p.rs receives QbftConsensusMsg → calls Consensus::handle in component.rs → admission checks duty/peer/signature/hash/justification → msg.rs wraps protobuf into core qbft::Msg → message enters that duty’s receive buffer → runner.rs bridges it into pluto_core::qbft.

Outbound flow:
pluto_core::qbft calls broadcast callback from definition.rs → transport.rs converts core message to QbftConsensusMsg, resolves values, self-sends locally, records sniffer data → broadcaster sends via p2p.rs.

Runner flow:
runner.rs implements propose, propose_priority, participate, and run_instance. It gets InstanceIo from component.rs, wires local value/hash/verify channels, builds definition.rs + transport.rs, then runs the core QBFT engine.

Decision flow:
Core QBFT decides → definition.rs decide callback decodes the value → component.rs subscribers receive the decided duty/value.

Test coverage includes:

message hashing/signing/wrapping,
admission and validation,
runner lifecycle,
transport self-send and value cache,
libp2p framing and loopback,
full in-memory QBFT consensus scenarios,
strategy simulator parity coverage.

iamquang95 · 2026-06-01T08:00:33Z

@claude review this PR

claude · 2026-06-01T08:00:54Z

Claude finished @iamquang95's task in 7m 43s

Code review — feat(consensus): implement consensus/qbft

Summary

Substantial port of Charon's core/consensus/qbft into a dedicated pluto-consensus crate. The bulk of the work — message wrapping (msg.rs), transport (transport.rs), the QBFT-core wiring (definition.rs), the runner (runner.rs), the inbound admission path (component.rs), the libp2p NetworkBehaviour (p2p.rs), and the sniffer — closely tracks the Go reference. Code is well-tested (~5k lines, including a full in-memory consensus simulation) and parity on the wire format is verified against a reference protobuf payload. A few concerns worth addressing below, none blocking.

Findings

[High] `qbft_consensus` test never exercises a sub-quorum cluster

In crates/consensus/src/qbft/qbft_run_test.rs:27-32, the test takes both threshold and cluster_nodes, but only spins up threshold peers: let active_nodes = in_memory_network(threshold, sniffed_tx);. The cluster_nodes arg is only used for the assert!(threshold <= cluster_nodes) and is discarded afterward. Cases like (4, 6), described as "4 of 6", actually run a 4-of-4 healthy-cluster simulation. The intended threshold scenarios (silent peers, leader rotation across more nodes) aren't covered.

[Medium] Consensus duration / decided-leader / decided-rounds metrics not ported

Charon's runInstance records ObserveConsensusDuration, SetDecidedLeaderIndex, SetDecidedRounds, IncConsensusTimeout, IncConsensusError (charon/core/consensus/qbft/qbft.go:432-606). Pluto plumbs decided_at_tx / take_decided_at_rx in crates/consensus/src/instance.rs:110-205 and writes Instant::now() in runner.rs:279, but nothing consumes the receiver — take_decided_at_rx is unused. Either wire up the metrics (core/consensus/metrics) or remove the dead channel to avoid the appearance of completeness. The dead plumbing has been allocated and tracked since eadb107.

[Medium] `Behaviour::handle_broadcast` + `Handler::pending_open` are unbounded queues

crates/consensus/src/qbft/p2p.rs:282,492-538 — pending_open: VecDeque<…> in the handler and pending_by_peer: HashMap<PeerId, VecDeque<PendingSend>> in the behaviour grow without backpressure. If a peer connection stays half-open and never accepts streams, every QBFT round adds another entry. Since QBFT rounds are bounded by round count this is unlikely to bite in practice, but worth documenting or capping (e.g. drop oldest after N pending sends per peer with a warn!). Charon's p2p.Sender.SendAsync has equivalent shape but is bounded by a per-peer worker queue.
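The "drop oldest after N pending sends per peer" idea could look roughly like the sketch below. It is not taken from the PR: `PendingQueues`, `MAX_PENDING_PER_PEER`, and the `u64` / `Vec<u8>` stand-ins for `PeerId` and `PendingSend` are assumptions made for illustration only.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical per-peer cap; the review only says "N".
const MAX_PENDING_PER_PEER: usize = 4;

/// Stand-ins for the real `PeerId` and `PendingSend` types.
type PeerId = u64;
type PendingSend = Vec<u8>;

#[derive(Default)]
struct PendingQueues {
    pending_by_peer: HashMap<PeerId, VecDeque<PendingSend>>,
}

impl PendingQueues {
    /// Enqueues `msg` for `peer`. If the queue is already at the cap, the
    /// oldest entry is evicted and returned so the caller can log it.
    fn enqueue(&mut self, peer: PeerId, msg: PendingSend) -> Option<PendingSend> {
        let queue = self.pending_by_peer.entry(peer).or_default();
        let dropped = if queue.len() >= MAX_PENDING_PER_PEER {
            // The real code would emit a `warn!` here.
            queue.pop_front()
        } else {
            None
        };
        queue.push_back(msg);
        dropped
    }

    /// Number of sends currently queued for `peer`.
    fn len(&self, peer: PeerId) -> usize {
        self.pending_by_peer.get(&peer).map_or(0, VecDeque::len)
    }
}
```

Dropping the oldest entry favours the most recent round's messages, which seems the more useful choice for a round-based protocol, but that is a design judgment rather than something the review prescribes.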

[Medium] `compare_attestations` validity hinges on local value canonicality

crates/consensus/src/qbft/transport.rs:133-144 — get_value re-hashes local.value via hash_proto_bytes(&local.value) and inserts under that hash. The doc comment correctly notes "the local producer must pack canonical deterministic bytes." Today this is satisfied because the only writer is runner::propose → Any::from_msg(&value) (line runner.rs:121). If anyone in the future enqueues an inbound-derived Any into value_tx (which currently never happens), the hash would no longer match hash_proto(decoded) and get_value would silently start returning UnknownValue. Consider asserting the invariant at the boundary, or wrap value_tx so only locally-encoded Any can be enqueued.
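One way to "wrap value_tx so only locally-encoded Any can be enqueued" is a newtype with a private field, sketched below. Everything here is hypothetical: `LocalValue`, its toy `encode`, and `value_hash` (a stand-in for `hash_proto_bytes` built on the standard library hasher) are illustrative names, not the crate's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

mod local {
    /// Bytes produced by the local encoder. The inner field is private, so
    /// code outside this module cannot wrap inbound bytes in a `LocalValue`.
    pub struct LocalValue(Vec<u8>);

    impl LocalValue {
        /// Toy deterministic encoding: sort by tag, then length-prefix each field.
        pub fn encode(fields: &[(u32, &str)]) -> Self {
            let mut sorted = fields.to_vec();
            sorted.sort();
            let mut out = Vec::new();
            for (tag, value) in sorted {
                out.extend_from_slice(&tag.to_be_bytes());
                out.extend_from_slice(&(value.len() as u32).to_be_bytes());
                out.extend_from_slice(value.as_bytes());
            }
            LocalValue(out)
        }

        /// Read-only access to the canonical bytes.
        pub fn bytes(&self) -> &[u8] {
            &self.0
        }
    }
}

/// Stand-in for `hash_proto_bytes`; the real hash function is not reproduced here.
fn value_hash(value: &local::LocalValue) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(value.bytes());
    hasher.finish()
}
```

If the channel's item type were such a wrapper instead of a bare `Any`, the canonicality invariant would be enforced by the type system rather than by a doc comment.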

[Medium] `decide_callback` may drop the decided-at signal silently

crates/consensus/src/qbft/runner.rs:278-282 — let _ = decided_at_tx.try_send(Instant::now()); uses try_send on a capacity-1 channel. If for any reason the decide-callback fires twice (the core does not, but the closure is Fn not FnOnce), the second value is dropped. Today this is harmless because the receiver isn't consumed (see Medium finding above). Once metrics are wired up, prefer let _ = decided_at_tx.send(...).await from a non-blocking context, or document that exactly-one decision is guaranteed.
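The dropped-second-value behaviour can be reproduced in isolation. The sketch below uses `std::sync::mpsc::sync_channel` as a stand-in for whichever channel type the runner actually uses (an assumption), and `send_twice` is a hypothetical helper name.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::time::Instant;

/// Sends two timestamps back to back on a capacity-1 channel and reports
/// (first send accepted, second send rejected because the channel is full).
fn send_twice() -> (bool, bool) {
    let (decided_at_tx, decided_at_rx) = sync_channel::<Instant>(1);

    let first_accepted = decided_at_tx.try_send(Instant::now()).is_ok();
    let second_full = matches!(
        decided_at_tx.try_send(Instant::now()),
        Err(TrySendError::Full(_))
    );

    // Keep the receiver alive until both sends have run; otherwise the
    // second send would report `Disconnected` instead of `Full`.
    drop(decided_at_rx);
    (first_accepted, second_full)
}
```

With `let _ = ...try_send(...)` the `Full` error is discarded, which is why a second decision timestamp would vanish without any log line.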

[Medium] `definition::new_timer` swallows timer construction errors

crates/consensus/src/qbft/definition.rs:281-287 — on round_timer.timer(round) failure, the code logs a warn! and returns a qbft::Timer whose receive channel never fires. The QBFT core would then never advance that round. This affects only extreme inputs (negative or i64::MAX rounds, both rejected by verify_msg for inbound messages). Acceptable, but a tracing::error! plus an instance-cancel would be safer than silently stalling.

[Low] `Consensus::handle` enqueue race when receiver is closed mid-receive

crates/consensus/src/qbft/component.rs:373-386 — the branch Err(_) if inst.has_started() => Ok(()) correctly treats late messages as drops after consensus completed. Note the has_started() check races with the deadline-cleanup task; if the runner just finished and the cleanup task hasn't run yet, late inbound messages are silently dropped, which is the desired behaviour. Good comment in the source. No change required — calling out as evidence that the recent retain-until-deadline change (233e80e) is correct.
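The shape of that match arm can be sketched with a standard-library channel. This is an illustration under assumptions: `HandleError::NotStarted` and the `enqueue` helper are hypothetical, and the real code's handling of the not-started case is not shown in this review.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical error for a closed receiver on an instance that never started.
#[derive(Debug, PartialEq)]
enum HandleError {
    NotStarted,
}

/// Forwards `msg` to the instance's receive buffer. A closed receiver is
/// treated as a silent drop once the instance has started, because that
/// means consensus already finished and the message is simply late.
fn enqueue(tx: &Sender<u8>, has_started: bool, msg: u8) -> Result<(), HandleError> {
    match tx.send(msg) {
        Ok(()) => Ok(()),
        Err(_) if has_started => Ok(()),
        Err(_) => Err(HandleError::NotStarted),
    }
}
```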

[Low] `subscribe` / `subscribe_priority` thread-safety doc drift

crates/consensus/src/qbft/component.rs:319-333 — Charon documents these as "not thread safe, called before Start and Propose." Pluto's SubscriberSet uses Arc<Mutex<…>>, so it is thread-safe — an improvement over Go. Worth updating the doc comment to reflect that subscribers may be registered at any time (the doc currently inherits Charon's stricter contract by omission).
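For readers unfamiliar with the pattern, a minimal `Arc<Mutex<…>>` subscriber set might look like the sketch below. The callback signature `(u64, &[u8])` and the method names are assumptions; the PR's actual `SubscriberSet` is not reproduced here.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical callback signature: (duty slot, decided value bytes).
type Subscriber = Box<dyn Fn(u64, &[u8]) + Send + Sync>;

/// Cloneable handle to a shared list of subscribers.
#[derive(Clone, Default)]
struct SubscriberSet {
    inner: Arc<Mutex<Vec<Subscriber>>>,
}

impl SubscriberSet {
    /// Registers a subscriber; safe to call from any thread at any time.
    fn subscribe(&self, subscriber: Subscriber) {
        self.inner
            .lock()
            .expect("subscriber lock poisoned")
            .push(subscriber);
    }

    /// Invokes every subscriber and returns how many were called.
    fn notify(&self, slot: u64, value: &[u8]) -> usize {
        let subscribers = self.inner.lock().expect("subscriber lock poisoned");
        for subscriber in subscribers.iter() {
            subscriber(slot, value);
        }
        subscribers.len()
    }
}
```

Note that this sketch holds the lock while callbacks run, so a callback that calls `subscribe` on the same set would deadlock; a doc comment promising "register at any time" should state whether that case is supported.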

[Low] `decode_attestation_data_ssz` magic numbers lack a spec reference

crates/core/src/dutydb/memory.rs:18-19,225-251 — the constants ATTESTATION_DATA_SSZ_OFFSET = 8, ATTESTER_DUTY_SSZ_SIZE = 96, and the field offsets 48..96 come from the Charon AttestationData + AttesterDuty SSZ layout. Add a comment pointing at the Go source (charon/core/encode.go / core/attestation.go) so future maintainers can verify against the spec.

[Low] Cargo dependency cleanup is well done

crates/core/Cargo.toml correctly drops futures, k256, libp2p, pluto-k1util, pluto-featureset now that the consensus module has moved out. Nothing else in pluto-core should still pull them in — verifiable via cargo machete (the last commit ran it).

Parity matrix

| Component | Go | Rust | Match | Notes |
| --- | --- | --- | --- | --- |
| Inbound admission order: verify→duty gate→justifications→values→deadline | yes | yes | yes | `component.rs:336-387` vs `qbft.go:614-676` |
| Recv buffer capacity (100) | `instance.RecvBufferSize` | `RECV_BUFFER_SIZE` | yes | `instance.rs:38` |
| Single-slot channels (hash/value/verify/err/decided_at, cap 1) | yes | yes | yes | `instance.rs:117-122` |
| `leader = (slot + duty_type + round) % nodes` | yes | yes | yes | `definition.rs:461-476`; signed arithmetic via `rem_euclid` |
| QBFT v2 protocol ID | `protocols.QBFTv2ProtocolID` | `QBFT_V2_PROTOCOL_ID` | yes | |
| Inbound stream receive timeout | 5s | 5s | yes | `p2p.rs:43` |
| Outbound stream send timeout | 7s | 7s | yes | `p2p.rs:45` |
| Skip self on broadcast | yes | yes | yes | `p2p.rs:492-507` |
| Sniffer message buffer | append-only | append-only | yes | `sniffer.rs` |
| Consensus metrics | ported | absent | no | see Medium #2 |
| Round timers (inc / dlinear / linear) | yes | yes | yes | `timer.rs` |
| `compareAttestations` source/target equality | yes | yes | yes | `definition.rs:167-201` |
| `qbft.Run` cancellation by decide callback | yes | yes | yes | `runner.rs:277-282`; `instance_ct.cancel()` |
| `Subscribe` / `SubscribePriority` value-type filtering | yes | yes | yes | `component.rs:237-256` |
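To illustrate the leader row, here is a small sketch of `(slot + duty_type + round) % nodes` using `rem_euclid`. The parameter types and the `Option` return are assumptions for the example, not the signature in `definition.rs`.

```rust
/// Leader index for a round, assuming `u64` slots and `i64` for the rest.
/// Returns `None` when `nodes` is not positive, since the modulus is undefined.
fn leader(slot: u64, duty_type: i64, round: i64, nodes: i64) -> Option<i64> {
    if nodes <= 0 {
        return None;
    }
    // Widen to i128 so the sum cannot overflow before the modulus is taken.
    let sum = i128::from(slot) + i128::from(duty_type) + i128::from(round);
    // `rem_euclid` always yields a value in `0..nodes`, even for negative
    // sums, whereas the `%` operator would keep the sign of `sum`.
    Some(sum.rem_euclid(i128::from(nodes)) as i64)
}
```

For example, `leader(10, 2, 1, 4)` is `Some(1)` because 13 mod 4 is 1, and a negative sum such as `leader(0, 0, -1, 4)` still maps to a valid index, `Some(3)`.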

Tests

I could not run cargo +nightly fmt --all --check, cargo clippy ... -D warnings, or cargo test --workspace --all-features from this sandbox (cargo is not on the allowlist). Please verify locally — the in-memory consensus test (qbft_run_test.rs) and the libp2p loopback test in p2p.rs provide strong coverage but rely on tokio::time and may be flaky under high parallelism.

Open questions / assumptions

Are consensus metrics intentionally out of scope for this PR? If yes, drop decided_at_tx/take_decided_at_rx to keep the surface honest; if no, file a follow-up.
The retain-until-deadline behaviour added in 233e80e is correct, but please confirm there's a follow-up test that explicitly walks the "late inbound message after decision" code path (handle_drops_late_message_after_started_receiver_closed covers the basic case).
qbft_run_test should be extended to actually run a sub-quorum cluster — the current test names are misleading.
· Branch: iamquang95/consensus-qbft

iamquang95 · 2026-06-01T08:16:08Z

@varex83agent /review-pr

varex83agent

QBFT consensus port — review

This PR ports Charon's core/consensus/qbft component to a new pluto-consensus crate (moving the existing consensus module out of pluto-core), plus the supporting libp2p adapter, transport, sniffer, round timers, and the dutydb unsigned-data-set SSZ/JSON decoding. Overall this is a high-quality, faithful port with strong test coverage and golden-vector parity tests. Three independent review passes (functional-equivalence vs Charon v1.7.1, security, and concurrency/state-machine) found no correctness bugs.

Verified parity (v1.7.1): leader election (slot+dutyType+round)%nodes, quorum (2n+2)/3, verify_msg/handle validation order and error strings, justification duty-differs check, compare_attester field comparison, hash_proto (SSZ PutBytes over deterministic proto + BTreeMap ordering), increasing/eager-double-linear/proposal timers, sniffer fields & protocol id, dutydb SSZ offsets (8/96), and participate() duty-type/feature gating all match the pinned Go reference.
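As a quick illustration of the quorum formula named above, the integer form `(2n+2)/3` can be written as follows. The function names are hypothetical; only the formula comes from the review.

```rust
/// Quorum size for `nodes` participants, using the `(2n+2)/3` integer form.
fn quorum(nodes: u64) -> u64 {
    (2 * nodes + 2) / 3
}

/// Number of silent peers a cluster can have while the rest still form a quorum.
fn max_silent(nodes: u64) -> u64 {
    nodes - quorum(nodes)
}
```

For a 6-node cluster this gives a quorum of 4 and room for 2 silent peers, and for a 4-node cluster a quorum of 3 with 1 silent peer.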

Security: signature verification runs before any expensive work or instance creation; the instances map is gated by the duty gater + deadliner and reaped by the expired-cleanup loop; inbound size/timeout bounds are enforced; SSZ offset/length validation in decode_attestation_data_ssz is sound; no key material reaches logs/Debug; justification recursion is depth-1. No reachable issues found.

Bugs (must-fix): none.

Findings: 3 minor (a block_on-on-blocking-thread robustness/documentation note, error double-stringification that loses the typed source chain, and a test-coverage gap for the attester-compare end-to-end path) and 2 nits, all inline. None block merge — leaving open for author discretion.

Note (out of scope, FYI only): crates/consensus/src/timer.rs linear_subsequent_round_timeout uses from_millis with a comment claiming charon#4537 landed in the reference, but v1.7.1's linearRoundTimer still uses the bare time.Duration(200*(round-1)+200) (nanoseconds). This is pre-existing code relocated by the rename, not introduced here, so it is not flagged inline — worth a follow-up to confirm the deliberate deviation.
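The size of that unit gap is easy to see in a sketch. The two functions below are hypothetical and use `u64` rounds for simplicity; they only contrast reading `200*(round-1)+200` as milliseconds with reading the same count as nanoseconds, which is how a bare Go `time.Duration(n)` is interpreted.

```rust
use std::time::Duration;

/// Raw tick count from the formula `200*(round-1)+200`, saturating on overflow.
fn linear_ticks(round: u64) -> u64 {
    200u64
        .saturating_mul(round.saturating_sub(1))
        .saturating_add(200)
}

/// The count interpreted as milliseconds, as the Rust comment describes.
fn linear_timeout_millis(round: u64) -> Duration {
    Duration::from_millis(linear_ticks(round))
}

/// The same count interpreted as nanoseconds.
fn linear_timeout_nanos(round: u64) -> Duration {
    Duration::from_nanos(linear_ticks(round))
}
```

For round 3 the first form yields 600 ms and the second 600 ns, a factor of one million, so the two implementations would behave very differently if the deviation were accidental.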

emlautarom1

Could you provide some guidance on this PR? At over 6k LOC it's hard to figure out what its scope is.

iamquang95 · 2026-06-03T02:52:07Z

@emlautarom1 I updated the PR description to explain the structure of the PR. I also had to move consensus out of core to avoid a cyclic dependency.

emlautarom1

I did not finish reviewing the entire PR but I would like to share some partial feedback to not block.

emlautarom1

I have to say the code is quite convoluted in both Pluto and Charon. It seems to be the result of:

Use of many callbacks which accept references/arc/box
Methods on structs with different lifetimes, especially in closures
Two types of channels (crossbeam and tokio)

I don't see any specific issue beyond some minor duplication. P2P requires an extra pair of eyes though before approving.

iamquang95 · 2026-06-11T10:30:58Z

@varex83 can you help take a look at the p2p layer?

varex83 · 2026-06-11T10:32:37Z

sure, let me take a look

varex83 · 2026-06-11T11:04:54Z

+        if self.is_connected(&peer_id) {
+            self.pending_events.push_back(ToSwarm::NotifyHandler {
+                peer_id,
+                handler: NotifyHandler::Any,


One possible concern here: if we have multiple connection handlers (for example, with multiple connection protocols), we could end up sending multiple QBFT messages.

NotifyHandler::Any targets exactly one connection (not all connections), so we will only send one message.

varex83 · 2026-06-11T11:06:58Z

+    /// Queues a behaviour send request until libp2p opens a stream.
+    fn on_behaviour_event(&mut self, event: Self::FromBehaviour) {
+        match event {
+            ToHandler::Send { request_id, msg } => self.pending_open.push_back((request_id, msg)),


I could be wrong here, but I think we should add a check for duplicate send events in the connection handler. It should be created globally per peer id in the behaviour and passed to the connection handlers.

I don't think a duplicate can happen here. Broadcast will emit only one Send event per peer

Replace stringly-typed inbound/send errors in the QBFT libp2p adapter with InboundError/SendError enums threaded through FromHandler and Event, so consumers can match on variants and walk the source() chain instead of parsing strings. The common write-path I/O keeps the full io::Error. Drops the now-dead timeout_error helper (timeout is a payload-free variant) and the unused request_id param of write_outbound.
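A typed error along these lines might look like the sketch below. The variant names `Timeout` and `Io` and the `describe` helper are assumptions based on the commit message; the PR's actual `SendError` definition is not shown here.

```rust
use std::{error::Error, fmt, io};

/// Hypothetical shape of a typed send error.
#[derive(Debug)]
enum SendError {
    /// Payload-free timeout variant.
    Timeout,
    /// Write-path I/O failure that keeps the full `io::Error`.
    Io(io::Error),
}

impl fmt::Display for SendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SendError::Timeout => write!(f, "send timed out"),
            SendError::Io(_) => write!(f, "send failed with an I/O error"),
        }
    }
}

impl Error for SendError {
    /// Exposes the underlying `io::Error` so callers can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            SendError::Timeout => None,
            SendError::Io(err) => Some(err),
        }
    }
}

impl From<io::Error> for SendError {
    fn from(err: io::Error) -> Self {
        SendError::Io(err)
    }
}

/// Consumers can match on variants instead of parsing strings.
fn describe(err: &SendError) -> &'static str {
    match err {
        SendError::Timeout => "timeout",
        SendError::Io(_) => "io",
    }
}
```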

Remove the redundant `peers: Vec<PeerId>` field from the QBFT p2p adapter Config: cluster membership and broadcast fan-out now read from `P2PContext::known_peers()`, the single source of truth already shared with the rest of the p2p stack. `local_peer_id` stays on Config because the Node sets it on the context only after Behaviour::new runs. handle_broadcast collects non-self targets up front so the immutable known_peers borrow is released before enqueue_send borrows self mutably; target_count is now the exact deduplicated peer count.
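The borrow pattern described here, collecting targets before mutating, can be sketched with plain standard-library types. `Context`, the `u32` peer ids, and the field layout are stand-ins chosen for the example; only the names `known_peers`, `handle_broadcast`, `enqueue_send`, and `local_peer_id` come from the commit message.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Stand-in for the real `PeerId`.
type PeerId = u32;

/// Stand-in for `P2PContext`, the single source of truth for membership.
struct Context {
    known: Vec<PeerId>,
}

impl Context {
    fn known_peers(&self) -> &[PeerId] {
        &self.known
    }
}

struct Behaviour {
    local_peer_id: PeerId,
    context: Context,
    pending_by_peer: HashMap<PeerId, VecDeque<Vec<u8>>>,
}

impl Behaviour {
    fn enqueue_send(&mut self, peer: PeerId, msg: Vec<u8>) {
        self.pending_by_peer.entry(peer).or_default().push_back(msg);
    }

    /// Queues `msg` for every known peer except ourselves and returns the
    /// exact, deduplicated target count.
    fn handle_broadcast(&mut self, msg: &[u8]) -> usize {
        let local = self.local_peer_id;
        // Collect first: the immutable borrow of `self.context` ends here,
        // before `enqueue_send` needs `&mut self`.
        let targets: BTreeSet<PeerId> = self
            .context
            .known_peers()
            .iter()
            .copied()
            .filter(|peer| *peer != local)
            .collect();
        for peer in &targets {
            self.enqueue_send(*peer, msg.to_vec());
        }
        targets.len()
    }
}
```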

…qbft # Conflicts: # crates/core/src/validatorapi/component.rs

iamquang95 added 8 commits May 29, 2026 17:22

feat(consensus): implement consensus/qbft

dd4e82e

fix: wait for qbft receive buffer capacity

712354d

fix: implement qbft attester compare

1cf0c84

fix: retain qbft instance until deadline

233e80e

fix: update docs

a74e6b6

fix: avoid qbft attester compare polling

eadb107

fix: refactor code

bcf6798

ifx: machete

51182aa

varex83agent reviewed Jun 1, 2026

Comment thread crates/consensus/src/qbft/runner.rs

Comment thread crates/consensus/src/qbft/runner.rs Outdated

Comment thread crates/consensus/src/qbft/qbft_run_test.rs Outdated

Comment thread crates/consensus/src/instance.rs

Comment thread crates/consensus/src/qbft/p2p.rs Outdated

iamquang95 added 8 commits June 1, 2026 17:15

feat: add example to run qbft

d5585f9

fix: single p2p_context in example

0ac81da

fix: harden qbft receive validation

8fd1b9d

fix: harden qbft value hash admission

45c7f0c

test: cover qbft attester compare full run

00928b1

fix: simplify error

7a2a112

Merge remote-tracking branch 'origin/main' into iamquang95/consensus-…

ab5d47b

…qbft

fix: test 4_6

e577c6a

iamquang95 marked this pull request as ready for review June 2, 2026 10:12

iamquang95 requested review from emlautarom1, mskrzypkows and varex83 June 2, 2026 10:28

varex83 reviewed Jun 2, 2026

Comment thread crates/consensus/src/qbft/p2p.rs Outdated

emlautarom1 reviewed Jun 2, 2026

iamquang95 added 3 commits June 3, 2026 14:49

test: add strategy sim test

6f33555

Merge remote-tracking branch 'origin/main' into iamquang95/consensus-…

677193e

…qbft

fix: add featureset to core

4ec4a18

iamquang95 added 8 commits June 3, 2026 17:46

refactor: test:

37118ef

test: mutex to avoid timeout test

032dd76

test: spawn threshold node only

e108c90

test: improve tests quality, better cover

3562471

fix: surface qbft timer construction failures

eb2902c

fix: lint

563e035

Merge remote-tracking branch 'origin/main' into iamquang95/consensus-…

6e28b66

…qbft

fix: use local validatorapi boxed future alias

f9ac786

emlautarom1 requested changes Jun 8, 2026

iamquang95 added 7 commits June 9, 2026 10:21

fix: add logs

42cb2b2

fix: use pluto_tracing in example

94f5288

fix: make qbft example start timeout one-shot

dc23c02

refactor: example qbft

83570dc

doc: add documentation for mutex featureset

561c5ff

test: simplify qbft run task joins

e06f276

refactor: localize unsigned data decoding

30302e7

iamquang95 requested a review from emlautarom1 June 9, 2026 10:02

emlautarom1 approved these changes Jun 10, 2026

Comment thread crates/consensus/src/qbft/component.rs Outdated

Comment thread crates/consensus/src/qbft/definition.rs Outdated

fix: address comments

ebf2714

varex83 reviewed Jun 11, 2026

iamquang95 added 3 commits June 11, 2026 20:15

Merge remote-tracking branch 'origin/main' into iamquang95/consensus-…

b294cd3

…qbft # Conflicts: # crates/core/src/validatorapi/component.rs

iamquang95 force-pushed the iamquang95/consensus-qbft branch from 5e007fb to b294cd3 Compare June 11, 2026 16:02

iamquang95 merged commit 5bf008a into main Jun 12, 2026
12 checks passed

iamquang95 deleted the iamquang95/consensus-qbft branch June 12, 2026 07:15

This was referenced Jun 15, 2026

[Epic] Track A — Duty pipeline + consensus #395

Open

Implement core/consensus controller + wrapper + debugger #398

Open
