Skip to content

pbio/port_lump: Recover from RX desync bursts in data recv thread#475

Open
DrTom wants to merge 1 commit into
pybricks:workfrom
DrTom:tom/2698-work-patch
Open

pbio/port_lump: Recover from RX desync bursts in data recv thread#475
DrTom wants to merge 1 commit into
pybricks:workfrom
DrTom:tom/2698-work-patch

Conversation

@DrTom
Copy link
Copy Markdown

@DrTom DrTom commented May 22, 2026

Rebased onto the work branch (keepalive grace period + increased header-read timeout). This drops the patch's earlier keepalive and header-read-timeout changes, which work now supersedes, and keeps only the in-place desync realignment in the data receive thread.

When the SPIKE Prime hub's UART driver drops bytes, the data receive parser loses byte-stream alignment and reads mid-packet data as message headers. The keepalive grace period delays disconnect but does not fix the parser state. This adds active realignment:

  • Track consecutive bad-header events (wrong message size or unexpected message type) in lump_dev->rx_bad_header_streak, stored on the device struct to match the style of err_count.
  • After EV3_UART_BAD_HEADER_STREAK_MAX (8) consecutive bad headers, flush the UART buffer to realign byte boundaries. A short burst is normal after a corrupt packet; a sustained streak means the parser is stuck reading mid-stream bytes and needs an active reset.
  • Treat a payload read timeout as an immediate desync signal (a valid header arrived but its payload bytes never came) and flush immediately.

The header-byte read timeout itself is left to work's handling, which returns the error so the keepalive logic can trigger a full reconnect.

Rebased onto the `work` branch (keepalive grace period + increased
header-read timeout). This drops the patch's earlier keepalive and
header-read-timeout changes, which `work` now supersedes, and keeps only
the in-place desync realignment in the data receive thread.

When the SPIKE Prime hub's UART driver drops bytes, the data receive
parser loses byte-stream alignment and reads mid-packet data as message
headers. The keepalive grace period delays disconnect but does not fix
the parser state. This adds active realignment:

- Track consecutive bad-header events (wrong message size or unexpected
  message type) in lump_dev->rx_bad_header_streak, stored on the device
  struct to match the style of err_count.
- After EV3_UART_BAD_HEADER_STREAK_MAX (8) consecutive bad headers, flush
  the UART buffer to realign byte boundaries. A short burst is normal
  after a corrupt packet; a sustained streak means the parser is stuck
  reading mid-stream bytes and needs an active reset.
- Treat a payload read timeout as an immediate desync signal (a valid
  header arrived but its payload bytes never came) and flush immediately.

The header-byte read timeout itself is left to `work`'s handling, which
returns the error so the keepalive logic can trigger a full reconnect.
@laurensvalk
Copy link
Copy Markdown
Member

Thanks for submitting! Now that the timeout issues are resolved, is this fixing a potential issue or is it something that we can experimentally confirm on known devices?

There is an existing TODO for this so I'm open to considering fixing this.

    // REVISIT: This is not the greatest. We can easily get a buffer overrun and
    // loose data. For now, the retry after bad message size helps get back into
    // sync with the data stream.

Could we include a test case for it here?

We could extend one of the existing tests to:

  • mimic sending normal data like it does now for one sample, but add a small loop to send a couple of 100 normal messages.
  • mimic sending a bad message / header.

And then we can verify that your proposal gets us back in sync.


I'm somewhat curious though. I think you've mentioned that the timeout fixes brings the error rate to 0. Are we never hitting this one for bad data then?


        PBIO_OS_AWAIT(state, &lump_dev->read_pt, err = pbdrv_uart_read(&lump_dev->read_pt, uart_dev, lump_dev->rx_msg + 1, lump_dev->rx_msg_size - 1, EV3_UART_IO_TIMEOUT));
        if (err != PBIO_SUCCESS) {
            debug_pr("UART Rx data error\n");
            return err;
        }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants