Simulator "Kill Node" Operation - Design Discussion #3017
Replies: 3 comments
---
I agree with the idea of using the message bus as just an outbox (this makes sense from the point of view of how a real TCP-based MessageBus actually works), and simulating it this way is definitely the way to go. The thing that I am puzzled about is how to do it in a way where we can use exactly the same primitive …
---
Here is a short markdown with some more details:

Core idea

Keep … The difference should be only: …

Runtime split

Do not make … Use this boundary:

Production

In the real server / cluster binary: …

So production still looks like: …

Simulator

In the simulator: …

So the simulator does: …

Important rule

Inbound should not go through …

Most important implementation points

…
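The runtime split described above can be sketched as a single trait with two backends. This is only a sketch under assumptions: `MessageBus` and `send_to_replica` appear in the discussion, but `OutboxBus`, `ReplicaId`, and `Message` are illustrative stand-ins, not the actual Iggy types.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

type ReplicaId = usize;
type Message = Vec<u8>;

// Consensus code talks only to this trait, in both runtimes.
trait MessageBus {
    fn send_to_replica(&self, to: ReplicaId, msg: Message);
}

// Simulator backend: sending only *stages* the message; the simulated
// network decides later if and when it is delivered. Since the trait
// takes `&self`, the outbox needs interior mutability, but it stays
// confined to this single-threaded type.
struct OutboxBus {
    staged: RefCell<VecDeque<(ReplicaId, Message)>>,
}

impl MessageBus for OutboxBus {
    fn send_to_replica(&self, to: ReplicaId, msg: Message) {
        self.staged.borrow_mut().push_back((to, msg));
    }
}

// In production the same trait would instead be backed by real TCP
// writes (send = write to socket), with inbound traffic handled by the
// server's accept loop rather than going back through the bus.
```

The key property is that consensus never learns which backend it is running against; only the delivery policy differs.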
---
i went through the simulator code in detail after reading this proposal. the outbox-only model is the right call - it matches how a real TCP-based message bus actually works (or will work when I implement it, haha... send = stage outbound, network = deliver), and it solves the borrow problem.

the crash semantics table is mostly correct:

- "FROM crashed, in network -> delivered" is right (already on the wire, sender can't recall).
- "FROM crashed, in outbox -> discarded" is right (never left the process).
- "TO crashed, in network -> dropped" is right.

the part that needs revision is row 4: "TO restarted node -> delivered, consensus rejects stale via view/op checks." this doesn't hold because consensus recovery isn't implemented yet. what actually happens to a restarted replica depends on whether the cluster has changed views since it went down, but both paths are broken:

- if the cluster advanced views (say view=3 while the replica comes up at view=0): …
- if the cluster is still in the same view (view=0, no view changes occurred, just ops advanced to 51): this is the more direct failure. prepare(view=0, op=52, commit=50) passes the view check (0 > 0 = false), …

either way, a restarted replica with an empty journal cannot safely participate. this isn't a flaw in your proposal - it's a known gap in our consensus layer.

i think the right approach is phased PRs:

- PR 1: wire Network into Simulator, per-replica outboxes replacing the shared …
- PR 2: add …

after that, the durability prerequisites and recovery stub are internal work we need to do in the consensus layer before …

on the "Paused" state - i'd defer it. it's functionally equivalent to a full network partition (…).

PR 1 is a great starting point. key files: …
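the same-view failure path can be shown in a few lines. this is a sketch only: `LocalState`, `rejects_as_stale`, and `missing_ops` are illustrative names, not the actual consensus code; the numbers (view=0, op=52) come from the example above.

```rust
// why the view check alone doesn't protect a restarted replica
// that came back with an empty journal.

struct LocalState {
    view: u64,
    op: u64, // highest op present in the journal
}

// stale-message rule: reject only messages from an *older* view.
fn rejects_as_stale(local: &LocalState, msg_view: u64) -> bool {
    msg_view < local.view
}

// how many ops the replica would have to fetch before it could
// apply a prepare at `msg_op` - this is the gap recovery must fill.
fn missing_ops(local: &LocalState, msg_op: u64) -> u64 {
    msg_op.saturating_sub(local.op + 1)
}
```

a restarted replica at view=0/op=0 does not reject prepare(view=0, op=52): the view comparison is 0 > 0 = false, and nothing else catches the 51-op hole in its journal.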
---
Problem
The Iggy simulator currently has no mechanism to kill or crash a node. All replicas are created at startup and live for the entire simulation. To test consensus correctness under node failures (leader crashes, minority/majority failures, recovery), we need a `replica_crash` / `replica_restart` operation.

The simulator currently uses a shared MemBus (`Arc<MemBus>`) as a simple FIFO queue (`VecDeque`) for all inter-node communication. There is no concept of a node being "up" or "down": `send_to_replica()` enqueues regardless, and `step()` dispatches to any replica. To implement kill/restart, we need message filtering for dead nodes.
The PacketSimulator requires `&mut self` to submit and step packets, but outbound messages originate deep inside consensus: `send_to_replica` is called on `&self` (the `MessageBus` trait takes `&self`). This creates a problem: the `Simulator` owns both the `Network` and the `Replicas`, and dispatching a message to a replica triggers outbound sends that need `&mut Network` while we're already borrowing `&self.replicas[id]`.

Proposal
MemBus as outbox-only:

MemBus becomes a per-replica outbox. Consensus still calls `send_to_replica()` / `send_to_client()`, but instead of being the delivery queue, the messages are staged. The Simulator drains each outbox after dispatching to a replica and feeds the messages into `network.submit()`. Each phase borrows either the replicas or the network, never both simultaneously. The borrow checker is satisfied without `Arc<Mutex<>>`, `RefCell`, or dynamic dispatch.

Behavior of `replica_crash`

…

Please let me know your thoughts about this proposal.
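The phased tick loop can be sketched as follows. This is a minimal sketch under assumptions: `Packet`, `Replica`, `Network`, and `Simulator` here are simplified stand-ins for the real simulator types, and `on_packet` fakes consensus traffic by staging an ack. It shows how the phases keep the borrows of `replicas` and `network` disjoint, and where the crash rules land.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone)]
struct Packet {
    to: usize,
    payload: u64,
}

// Handling a packet stages any outbound messages into the replica's
// own outbox instead of touching the network directly.
struct Replica {
    alive: bool,
    outbox: VecDeque<Packet>,
    received: Vec<u64>,
}

impl Replica {
    fn on_packet(&mut self, p: Packet) {
        self.received.push(p.payload);
        // Stand-in for consensus traffic: stage an ack to replica 0.
        self.outbox.push_back(Packet { to: 0, payload: p.payload + 100 });
    }
}

// The network owns all in-flight packets; it is borrowed mutably
// only in its own phase of the tick.
struct Network {
    in_flight: VecDeque<Packet>,
}

impl Network {
    fn submit(&mut self, p: Packet) {
        self.in_flight.push_back(p);
    }
    fn step(&mut self) -> Option<Packet> {
        self.in_flight.pop_front()
    }
}

struct Simulator {
    replicas: Vec<Replica>,
    network: Network,
}

impl Simulator {
    fn tick(&mut self) {
        // Phase 1: deliver in-flight packets. Each `step()` borrow of
        // the network ends before a replica is borrowed.
        while let Some(p) = self.network.step() {
            let r = &mut self.replicas[p.to];
            if r.alive {
                r.on_packet(p);
            } // packets TO a crashed node are dropped here
        }
        // Phase 2: drain every outbox into the network. `self.replicas`
        // and `self.network` are disjoint fields, so both borrows
        // coexist without Arc<Mutex<>> or RefCell.
        for r in &mut self.replicas {
            while let Some(out) = r.outbox.pop_front() {
                if r.alive {
                    self.network.submit(out);
                } // a crashed sender's staged messages are discarded
            }
        }
    }
}
```

Crashing a node then reduces to flipping `alive = false`: in-network packets addressed to it are dropped at delivery time, and anything still sitting in its outbox is discarded instead of submitted.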