Fix rapid binary logs (IDFGH-17270)#36
Fix rapid binary logs (IDFGH-17270)#36jthoward64-lexcelon wants to merge 2 commits intoespressif:masterfrom
Conversation
👋 Hello jthoward64-lexcelon, we appreciate your contribution to this project! Click to see more instructions ...
Review and merge process you can expect ...
|
|
Hi @jthoward64-lexcelon, thank you for contributing. By any chance, would you have an app that I could use to reproduce this issue? |
There was a problem hiding this comment.
Pull request overview
Improves robustness of ESP-IDF Monitor’s binary log decoding during high-throughput bursts by making frame detection re-sync instead of bailing out on the first mismatch, reducing the chance of the console getting “stuck” printing binary garbage.
Changes:
- Update binary log frame scanning to skip bytes on CRC/control-structure mismatches and keep searching for the next valid frame.
- Simplify binary-log handling in the serial input path (removes fallback behavior on decoding errors).
- Adjust binary log message encoding to use default string encoding when producing output bytes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| esp_idf_monitor/base/serial_handler.py | Removes the ValueError fallback around binary log decoding and always returns after attempting binlog conversion. |
| esp_idf_monitor/base/binlog.py | Changes find_frames to re-sync by scanning forward on invalid CRC/control parsing, and tweaks message encoding. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
612c7bb to
0f29b80
Compare
0f29b80 to
29f99bb
Compare
No unfortunately all my ESP-IDF repos are proprietary, but I can reliably reproduce the issue by enabling V2 binary logs and either logging very long lines or very rapidly |
|
@jthoward64-lexcelon I am sorry, but I am struggling to reproduce the issue. Without that, it is hard to evaluate if the code is correct and covers all the corner cases - it would also be nice to include some tests to avoid having such issues in the future. Would you be able to create a minimal app that would be able to reproduce this issue without using any of your proprietary code? It would be helpful to also mention what your OS is, which chip you are using, or any other details which might be relevant (like using VM/WSL, etc.) Here is the example I tried it on (using the latest IDF on ESP32-C5 v1.0, using the UART port, with binlog turned on for the app only): #include <stdio.h>
#include "esp_log.h"
#include "esp_system.h"
static const char *TAG = "example";
void app_main(void)
{
for (int i = 0; i < 1000; i++){
ESP_LOGI(TAG, "Lorem ipsum dolor sit amet consectetur adipiscing elit. Quisque faucibus ex sapien vitae pellentesque sem placerat. In id cursus mi pretium tellus duis convallis. Tempus leo eu aenean sed diam urna tempor. Pulvinar vivamus fringilla lacus nec metus bibendum egestas. Iaculis massa nisl malesuada lacinia integer nunc posuere. Ut hendrerit semper vel class aptent taciti sociosqu. Ad litora torquent per conubia nostra inceptos himenaeos.Lorem ipsum dolor sit amet consectetur adipiscing elit. Quisque faucibus ex sapien vitae pellentesque sem placerat. In id cursus mi pretium tellus duis convallis. Tempus leo eu aenean sed diam urna tempor. Pulvinar vivamus fringilla lacus nec metus bibendum egestas. Iaculis massa nisl malesuada lacinia integer nunc posuere. Ut hendrerit semper vel class aptent taciti sociosqu. Ad litora torquent per conubia nostra inceptos himenaeos.Lorem ipsum dolor sit amet consectetur adipiscing elit. Quisque faucibus ex sapien vitae pellentesque sem placerat. In id cursus mi pretium tellus duis convallis. Tempus leo eu aenean sed diam urna tempor. Pulvinar vivamus fringilla lacus nec metus bibendum egestas. Iaculis massa nisl malesuada lacinia integer nunc posuere. Ut hendrerit semper vel class aptent taciti sociosqu. Ad litora torquent per conubia nostra inceptos himenaeos.Lorem ipsum dolor sit amet consectetur adipiscing elit. Quisque faucibus ex sapien vitae pellentesque sem placerat. In id cursus mi pretium tellus duis convallis. Tempus leo eu aenean sed diam urna tempor. Pulvinar vivamus fringilla lacus nec metus bibendum egestas. Iaculis massa nisl malesuada lacinia integer nunc posuere. Ut hendrerit semper vel class aptent taciti sociosqu. Ad litora torquent per conubia nostra inceptos himenaeos.Lorem ipsum dolor sit amet consectetur adipiscing elit. Quisque faucibus ex sapien vitae pellentesque sem placerat. In id cursus mi pretium tellus duis convallis. Tempus leo eu aenean sed diam urna tempor. Pulvinar vivamus fringilla lacus nec metus bibendum egestas. Iaculis massa nisl malesuada lacinia integer nunc posuere. Ut hendrerit semper vel class aptent taciti sociosqu. Ad litora torquent per conubia nostra inceptos himenaeos.");
}
ESP_LOGI(TAG, "Restarting now.");
fflush(stdout);
esp_restart();
} |
|
Yeah I'll throw something together. For reference I'm on IDF 6, ESP32-C3, USB-JTAG, Linux dev container |
|
Ok, not quite the same presentation, but I don't have the time to replicate the exact cause in my codebase, but this repo has two examples that have the same underlying cause and are fixed in this PR. One reboots the esp32 mid-log which causes a message to get caught in the buffer The second one (almost always around message 8000), just spams logs from multiple tasks all at once. This one takes longer than the first repro, but is much closer to how I see it in my codebase This one also crashes a bit (don't particularly care why tbh), but it works to show the issue (eventually) And, while I can't share it, I have been using this patch in my own development for the past couple days and haven't seen a single desync, whereas it was usually very frequent. |
|
Anything else I need to do to have this looked at? Or is that repro not working? |
|
Yes, I managed to reproduce the issue (with a small hack, see below). Thank you for providing it. Regarding the first log that you posted - I was able to reproduce it, but using your fix, I am missing the start of the boot log, see below: Here, the second line should be: Regarding the second issue, I was struggling a bit to reproduce it. Looks like it highly depends on the used system. I have tried this on macOS and on two other colleagues' Linux machines. All the messages are always decoded correctly, even without this fix. In the end, I was able to reproduce this by adding a small delay ( I think if we solve issues like the one that I mentioned before, we would be able to accept this. So I would suggest printing the "garbage" data anyway, as it has been until now. The new logic regarding the resync command LGTM. |
|
I have tried to solve the issue with missing text in the log. Now all the data that is not part of succesfull binary frame is printed (including some garbage bytes), please let me know if the following works for you as well: diff --git a/esp_idf_monitor/base/binlog.py b/esp_idf_monitor/base/binlog.py
index efd7d5c413..9ec11465eb 100644
--- a/esp_idf_monitor/base/binlog.py
+++ b/esp_idf_monitor/base/binlog.py
@@ -239,9 +239,11 @@ class BinaryLog:
which we reject so that scan-ahead continues instead of prematurely breaking."""
return 1 <= control.level <= 5 and control.version == 0 and control.pkg_len >= 15
- def find_frames(self, data: bytes) -> Tuple[List[bytes], bytes]:
+ def find_frames(self, data: bytes) -> Tuple[List[bytes], bytes, bytes]:
frames: List[bytes] = []
+ leaked_chunks: List[bytes] = []
i = 0
+ idx_after_last_frame = 0
idx_partial_frame = None # start index of a plausible but incomplete frame
while i < len(data):
@@ -264,7 +266,11 @@ class BinaryLog:
continue
frame = data[start_idx : start_idx + control.pkg_len]
if control.pkg_len != 0 and self.crc8(frame) == 0:
+ # Collect non-frame bytes before this frame
+ if start_idx > idx_after_last_frame:
+ leaked_chunks.append(data[idx_after_last_frame:start_idx])
frames.append(frame)
+ idx_after_last_frame = start_idx + control.pkg_len
i += control.pkg_len - 1
else:
# CRC mismatch – skip this byte and try to re-sync
@@ -278,24 +284,29 @@ class BinaryLog:
i += 1
- # Decide what to carry forward for the next call:
- # 1. Stopped early on a plausible partial frame → stash from that point so
- # the next read can complete it.
- # 2. Broke at the < 15-byte minimum-length check and the remaining bytes
- # start with a binary frame marker → stash them for the same reason.
- # If they don't start with a marker they are noise/text and must not be
- # carried forward as binary data.
- # 3. Everything else → return b'' so stale data doesn't keep the caller
- # locked in binary mode across a device reset.
+ # Trailing non-frame data (e.g. boot log text after last valid frame)
if idx_partial_frame is not None:
- return frames, data[idx_partial_frame:]
- if i < len(data) and self.detected(data[i]):
- return frames, data[i:]
- return frames, b''
+ # Stopped on partial frame: leaked text is from last frame to partial start
+ if idx_partial_frame > idx_after_last_frame:
+ leaked_chunks.append(data[idx_after_last_frame:idx_partial_frame])
+ remaining = data[idx_partial_frame:]
+ elif i < len(data) and self.detected(data[i]):
+ # Remaining bytes start with binary marker; include up to there as leaked
+ if i > idx_after_last_frame:
+ leaked_chunks.append(data[idx_after_last_frame:i])
+ remaining = data[i:]
+ else:
+ # No partial frame; everything after last frame is non-binary
+ if len(data) > idx_after_last_frame:
+ leaked_chunks.append(data[idx_after_last_frame:])
+ remaining = b''
+
+ leaked_text = b''.join(leaked_chunks)
+ return frames, remaining, leaked_text
- def convert_to_text(self, data: bytes) -> Tuple[List[bytes], bytes]:
+ def convert_to_text(self, data: bytes) -> Tuple[List[bytes], bytes, bytes]:
messages: List[bytes] = []
- frames, incomplete_fragment = self.find_frames(data)
+ frames, incomplete_fragment, leaked_text = self.find_frames(data)
for pkg_msg in frames:
elf_path = self.source_of_message(pkg_msg[0])
msg = Message(self.debug, elf_path, pkg_msg)
@@ -303,7 +314,7 @@ class BinaryLog:
messages += self.format_buffer_message(msg)
else:
messages.append(self.format_message(msg))
- return messages, incomplete_fragment
+ return messages, incomplete_fragment, leaked_text
def format_message(self, message: Message) -> bytes:
try:
diff --git a/esp_idf_monitor/base/serial_handler.py b/esp_idf_monitor/base/serial_handler.py
index 606d39ef7d..79c01fdd16 100644
--- a/esp_idf_monitor/base/serial_handler.py
+++ b/esp_idf_monitor/base/serial_handler.py
@@ -204,11 +204,23 @@ class SerialHandler:
sp = self.splitdata(data)
if self.binary_log_detected:
- text_lines, self._last_line_part = self.binlog.convert_to_text(sp[0])
+ text_lines, self._last_line_part, leaked_text = self.binlog.convert_to_text(sp[0])
for line in text_lines:
self.print_colored(line)
self.logger.handle_possible_pc_address_in_line(line)
self.monitor_cmd_executor.execute_from_log_line(line)
+ if leaked_text:
+ leaked_lines = leaked_text.splitlines(keepends=True) or [b'']
+ if leaked_lines and not (
+ leaked_lines[-1].endswith(b'\n') or leaked_lines[-1].endswith(b'\r')
+ ):
+ incomplete = leaked_lines.pop()
+ if not self._last_line_part:
+ self._last_line_part = incomplete
+ for line in leaked_lines:
+ self.print_colored(line)
+ self.logger.handle_possible_pc_address_in_line(line)
+ self.monitor_cmd_executor.execute_from_log_line(line)
return
for line in sp:This essentially combines the old approach with the new logic that you added. |
|
Sorry, on US time; thanks for the patch! And yes it does seem to work for me |
eba4761 to
ac5f499
Compare
|
@jthoward64-lexcelon Thank you again for contributing and helping out with the reproducer. I had to change a couple of things because of our internal review process and release notes generation from commit messages, so the PR was marked as "closed", but the changes have been merged. Thanks for understanding |
|
No worries, thank you! Any idea if this will get a release in the next couple weeks? |
|
Not sure yet, I would like to include a couple of things in the next release that are not done yet, but I will try to do it at the end of March, maybe early April. |
Description
I've been using binary logging for the massive stack use reduction, but I've found that when a lot of logs come in at once, it causes monitor to get out of sync and just spam the console with binary garbage until it possible recovers when it syncs up again. This PR allows the binary log
find_framesmethod to try and gracefully recover (or at least not lose track altogether) from a miss.Before:
After:
Both of those are the same firmware, without this change the logger is useless after 304, with it, it continues logging without issue.
Related
N/A
Testing
Tested with my current project in an ESP-IDF 6 docker container
Checklist
Before submitting a Pull Request, please ensure the following: