
Fix rapid binary logs (IDFGH-17270)#36

Closed
jthoward64-lexcelon wants to merge 2 commits intoespressif:masterfrom
jthoward64-lexcelon:fix-fast-binlog


@jthoward64-lexcelon

Description

I've been using binary logging for the massive reduction in stack usage, but I've found that when a lot of logs come in at once, monitor gets out of sync and just spams the console with binary garbage until it possibly recovers when it syncs up again. This PR allows the binary log find_frames method to try to recover gracefully from a miss (or at least not lose track altogether).
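As a rough illustration of the re-sync idea, the sketch below scans a byte buffer for frames. Whenever a candidate fails validation, it advances by a single byte instead of giving up. This is not the actual monitor code. The frame layout is made up for the example: a hypothetical 0xA5 marker byte, a one-byte total length, the payload, and a CRC-8 trailer (polynomial 0x07, chosen arbitrarily). ESP-IDF's real binary log frames use their own header, whose fields the monitor also sanity-checks before the CRC.

```python
from typing import List, Tuple

MARKER = 0xA5  # hypothetical start-of-frame byte, for this example only
MIN_LEN = 4    # marker + length byte + at least one payload byte + CRC byte


def crc8(data: bytes) -> int:
    """CRC-8, polynomial 0x07, initial value 0x00, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def make_frame(payload: bytes) -> bytes:
    """Build a toy frame: marker, total length, payload, CRC of everything before it.
    The payload must be at most 252 bytes so the length fits in one byte."""
    body = bytes([MARKER, len(payload) + 3]) + payload
    return body + bytes([crc8(body)])


def find_frames(data: bytes) -> Tuple[List[bytes], bytes]:
    """Return (valid frames, bytes to carry into the next read)."""
    frames: List[bytes] = []
    i = 0
    while i < len(data):
        if data[i] != MARKER:
            i += 1                    # not a frame start: keep scanning
            continue
        if i + 1 >= len(data):
            return frames, data[i:]   # length byte not received yet
        length = data[i + 1]
        if length < MIN_LEN:
            i += 1                    # implausible header: re-sync one byte later
            continue
        if i + length > len(data):
            return frames, data[i:]   # plausible but incomplete frame: wait for more
        frame = data[i:i + length]
        if crc8(frame) == 0:
            frames.append(frame)
            i += length               # jump over the whole valid frame
        else:
            i += 1                    # CRC mismatch: false start, re-sync one byte later
    return frames, b""
```

Advancing one byte at a time is what lets the scanner find the next real marker after garbage or a corrupted frame. Stashing a plausible partial frame lets a frame split across two serial reads be reassembled. The trade-off is that a false marker near the end of a buffer is held back until more bytes arrive.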

Before:

I (303) [REDACTED]
I (303) [REDACTED]
I (303) [REDACTED]
I (303) [REDACTED]
I (304) [REDACTED]
I (304) [REDACTED]
[from here on, raw binary-log bytes are printed undecoded as garbage such as .�<��0� D<�0<�2<�s<�jU, spread over many staircase-indented lines; no further log lines are decoded]

After:

I (302) [REDACTED]
I (303) [REDACTED]
I (303) [REDACTED]
I (303) [REDACTED]
I (303) [REDACTED]
I (304) [REDACTED]
I (304) [REDACTED]
E (304) [REDACTED]
I (304) [REDACTED]
I (304) [REDACTED]
I (305) [REDACTED]
I (305) [REDACTED]
E (305) [REDACTED]
E (306) [REDACTED]
I (306) [REDACTED]
I (306) [REDACTED]
I (306) [REDACTED]
E (306) [REDACTED]
E (306) [REDACTED]
E (307) [REDACTED]
I (307) [REDACTED]
W (307) [REDACTED]
I (307) [REDACTED]

Both logs come from the same firmware. Without this change, the logger is useless after timestamp 304; with it, logging continues without issue.

Related

N/A

Testing

Tested with my current project in an ESP-IDF 6 Docker container.


Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

@github-actions

github-actions bot commented Feb 20, 2026

Messages
📖 🎉 Good Job! All checks are passing!

👋 Hello jthoward64-lexcelon, we appreciate your contribution to this project!




This automated output is generated by the PR linter DangerJS, which checks if your Pull Request meets the project's requirements and helps you fix potential issues.

DangerJS is triggered with each push event to a Pull Request and modifies the contents of this comment.

Please consider the following:
- Danger mainly focuses on the PR structure and formatting and can't understand the meaning behind your code or changes.
- Danger is not a substitute for human code reviews; it's still important to request a code review from your colleagues.
- To manually retry these Danger checks, please navigate to the Actions tab and re-run the last Danger workflow.

Review and merge process you can expect ...


We do welcome contributions in the form of bug reports, feature requests and pull requests.

1. An internal issue has been created for the PR, we assign it to the relevant engineer.
2. They review the PR and either approve it or ask you for changes or clarifications.
3. Once the GitHub PR is approved we do the final review, collect approvals from core owners and make sure all the automated tests are passing.
- At this point we may do some adjustments to the proposed change, or extend it by adding tests or documentation.
4. If the change is approved and passes the tests it is merged into the default branch.

Generated by 🚫 dangerJS against ac5f499

@github-actions github-actions bot changed the title Fix rapid binary logs Fix rapid binary logs (IDFGH-17270) Feb 20, 2026
@peterdragun
Collaborator

Hi @jthoward64-lexcelon, thank you for contributing. By any chance, would you have an app that I could use to reproduce this issue?


Copilot AI left a comment


Pull request overview

Improves robustness of ESP-IDF Monitor’s binary log decoding during high-throughput bursts by making frame detection re-sync instead of bailing out on the first mismatch, reducing the chance of the console getting “stuck” printing binary garbage.

Changes:

  • Update binary log frame scanning to skip bytes on CRC/control-structure mismatches and keep searching for the next valid frame.
  • Simplify binary-log handling in the serial input path (removes fallback behavior on decoding errors).
  • Adjust binary log message encoding to use default string encoding when producing output bytes.
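The CRC check relies on a general CRC property. If a frame ends with the CRC of everything before it, running the same CRC over the whole frame yields zero, so a single "CRC of the frame is zero" test is enough to validate a candidate. Below is a minimal sketch that assumes CRC-8 with polynomial 0x07, initial value 0x00 and no final XOR (the CRC-8/SMBUS parameters). Whether ESP-IDF's binary log uses these exact parameters is not stated in this thread, and seal and is_intact are hypothetical helper names.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0x00, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def seal(body: bytes) -> bytes:
    """Append the CRC of body so that the CRC of the whole frame becomes zero."""
    return body + bytes([crc8(body)])


def is_intact(frame: bytes) -> bool:
    """A sealed frame is accepted only if its full-length CRC is zero."""
    return len(frame) > 1 and crc8(frame) == 0


if __name__ == "__main__":
    frame = seal(b"I (304) example: hello")
    print(is_intact(frame))      # True
    corrupted = bytes([frame[0] ^ 0x10]) + frame[1:]
    print(is_intact(corrupted))  # False: any single flipped bit is detected
```

Because the check needs no knowledge of where the CRC byte sits, a scanner can apply it to any candidate slice. When it fails, the scanner moves on to the next byte.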

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
esp_idf_monitor/base/serial_handler.py | Removes the ValueError fallback around binary log decoding and always returns after attempting binlog conversion.
esp_idf_monitor/base/binlog.py | Changes find_frames to re-sync by scanning forward on invalid CRC/control parsing, and tweaks message encoding.


@jthoward64-lexcelon
Author

> Hi @jthoward64-lexcelon, thank you for contributing. By any chance, would you have an app that I could use to reproduce this issue?

No, unfortunately all my ESP-IDF repos are proprietary, but I can reliably reproduce the issue by enabling V2 binary logs and either logging very long lines or logging very rapidly.

@peterdragun
Collaborator

@jthoward64-lexcelon I am sorry, but I am struggling to reproduce the issue. Without that, it is hard to evaluate if the code is correct and covers all the corner cases - it would also be nice to include some tests to avoid having such issues in the future.

Would you be able to create a minimal app that would be able to reproduce this issue without using any of your proprietary code? It would be helpful to also mention what your OS is, which chip you are using, or any other details which might be relevant (like using VM/WSL, etc.)

Here is the example I tried it on (using the latest IDF on ESP32-C5 v1.0, using the UART port, with binlog turned on for the app only):

#include <stdio.h>
#include "esp_log.h"
#include "esp_system.h"

static const char *TAG = "example";

/* One paragraph of filler text. Adjacent string literals are concatenated by the
   compiler, so a line break inside a single literal (a compile error) is avoided. */
#define LOREM "Lorem ipsum dolor sit amet consectetur adipiscing elit. Quisque faucibus ex sapien vitae pellentesque sem placerat. " \
    "In id cursus mi pretium tellus duis convallis. Tempus leo eu aenean sed diam urna tempor. " \
    "Pulvinar vivamus fringilla lacus nec metus bibendum egestas. Iaculis massa nisl malesuada lacinia integer nunc posuere. " \
    "Ut hendrerit semper vel class aptent taciti sociosqu. Ad litora torquent per conubia nostra inceptos himenaeos."

void app_main(void)
{
    /* Emit 1000 very long log lines (the paragraph repeated five times) back to back */
    for (int i = 0; i < 1000; i++) {
        ESP_LOGI(TAG, LOREM LOREM LOREM LOREM LOREM);
    }
    ESP_LOGI(TAG, "Restarting now.");
    fflush(stdout);
    esp_restart();
}

@jthoward64-lexcelon
Author

jthoward64-lexcelon commented Feb 23, 2026

Yeah, I'll throw something together. For reference, I'm on IDF 6, ESP32-C3, USB-JTAG, in a Linux dev container.

@jthoward64-lexcelon
Author

jthoward64-lexcelon commented Feb 23, 2026

OK, it's not quite the same presentation, and I don't have time to replicate the exact cause from my codebase, but this repo has two examples that share the same underlying cause and are both fixed by this PR. The first one reboots the ESP32 mid-log, which leaves a message stuck in the buffer:

I (225) repro: msg 294 val=2058 ratio=0.980000 addr=0x498 flag=even extra=43755
I (225) repro: msg 295 val=2065 ratio=0.983333 addr=0x49c flag=odd extra=43754
I (226) repro: msg 296 val=2072 ratio=0.986667 addr=0x4a0 flag=even extra=43749
I (226) repro: msg 297 val=2079 ratio=0.990000 addr=0x4a4 flag=odd extra=43748
I (227) repro: msg 298 val=2086 ratio=0.993333 addr=0x4a8 flag=even extra=43751

,R<?
�-?��~K��<?���ESP-ROM:esp32c3-api1-20210207
Build:Feb  7 2021
rst:0xc (RTC_SW_CPU_RST),boot:0xd (SPI_FAST_FLASH_BOOT)
Saved PC:0x4038425e
--- 0x4038425e: esp_restart_noos at /opt/esp/idf/components/esp_system/port/soc/esp32c3/system_internal.c:115
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fcd5830,len:0x1400
load:0x403cbf10,len:0x1240
--- 0x403cbf10: esp_bootloader_get_description at /opt/esp/idf/components/esp_bootloader_format/esp_bootloader_desc.c:40
load:0x403ce710,len:0x2f64
--- 0x403ce710: esp_flash_encryption_enabled at /opt/esp/idf/components/bootloader_support/src/flash_encrypt.c:93
entry 0x403cbf1a
--- 0x403cbf1a: call_start_cpu0 at /opt/esp/idf/components/bootloader/subproject/main/bootloader_start.c:27
I (24) boot: ESP-IDF v6.0-beta1-1452-g48e7e0618d 2nd stage bootloader
I (24) boot: compile time Feb 20 2026 01:26:26
I (25) boot: chip revision: v0.4
I (25) boot: efuse block revision: v1.3
I (26) qio_mode: Enabling default flash chip QIO
I (26) boot.esp32c3: SPI Speed      : 80MHz
I (27) boot.esp32c3: SPI Mode       : QIO
I (27) boot.esp32c3: SPI Flash Size : 4MB

The second one just spams logs from multiple tasks all at once (it almost always breaks around message 8000). It takes longer than the first repro, but is much closer to how I see the issue in my codebase:

[a long run of undecoded binary frames, each beginning with the same $6<? _�:� byte pattern, printed in a staircase across the console]
I (24574) repro: burst message 15044 value=44 ratio=0.8800 ptr=0x2c
I (24574) repro: burst message 15045 value=45 ratio=0.9000 ptr=0x2d
I (24575) repro: burst message 15046 value=46 ratio=0.9200 ptr=0x2e
I (24575) repro: burst message 15047 value=47 ratio=0.9400 ptr=0x2f
I (24575) repro: burst message 15048 value=48 ratio=0.9600 ptr=0x30
I (24575) repro: burst message 15049 value=49 ratio=0.9800 ptr=0x31

This one also crashes a bit (I don't particularly care why, tbh), but it does show the issue (eventually).

And, while I can't share it, I have been using this patch in my own development for the past couple days and haven't seen a single desync, whereas it was usually very frequent.

@jthoward64-lexcelon
Author

Anything else I need to do to have this looked at? Or is that repro not working?

@peterdragun
Collaborator

peterdragun commented Mar 2, 2026

Yes, I managed to reproduce the issue (with a small hack, see below). Thank you for providing it.

Regarding the first log you posted: I was able to reproduce it, but with your fix, the start of the boot log is missing:

I (100) repro: msg 299 val=2093 ratio=0.996667 addr=0x4ac flag=odd extra=43750
-api1-20210207
Build:Feb  7 2021
rst:0xc (RTC_SW_CPU_RST),boot:0xd (SPI_FAST_FLASH_BOOT)
Saved PC:0x40384418

Here, the second line should be: ESP-ROM:esp32c3-api1-20210207. IMO, missing data is a bigger issue than the fact that there is some undecoded/garbage data.

Regarding the second issue, I struggled a bit to reproduce it. It looks like it depends heavily on the system used: I tried it on macOS and on two colleagues' Linux machines, and all the messages were always decoded correctly, even without this fix. In the end, I was able to reproduce it by adding a small delay (time.sleep(0.01)) right after the read in serial_reader.py. (This should simulate your issue, where the read buffer is overflowing with data coming from the chip.)

If we solve issues like the one I mentioned above, I think we would be able to accept this. So I would suggest still printing the "garbage" data, as has been done until now. The new logic regarding the resync command LGTM.
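The bookkeeping this suggestion calls for can be sketched independently of the frame parser. Once the parser knows where each valid frame starts and ends, every byte outside those spans is ordinary text that should be printed rather than dropped. The function below is a simplified, hypothetical helper; split_leaked is not a name from esp-idf-monitor. It assumes at most one trailing partial frame, whose bytes are carried forward instead of printed.

```python
from typing import List, Optional, Tuple


def split_leaked(
    data: bytes, spans: List[Tuple[int, int]], partial_start: Optional[int]
) -> Tuple[bytes, bytes]:
    """Given the ordered [start, end) spans of valid frames in data and the start of a
    trailing partial frame (or None), return (leaked_text, remaining)."""
    leaked: List[bytes] = []
    cursor = 0  # index just after the last valid frame seen so far
    for start, end in spans:
        if start > cursor:
            leaked.append(data[cursor:start])  # bytes between frames are plain text
        cursor = end
    stop = len(data) if partial_start is None else partial_start
    if stop > cursor:
        leaked.append(data[cursor:stop])  # text after the last frame
    remaining = b"" if partial_start is None else data[partial_start:]
    return b"".join(leaked), remaining


if __name__ == "__main__":
    # "[F1]", "[F2]" stand in for binary frames and "[P" for a partial frame
    data = b"ESP-ROM:" + b"[F1]" + b"boot\n" + b"[F2]" + b"rst:" + b"[P"
    f1, f2, p = data.index(b"[F1]"), data.index(b"[F2]"), data.index(b"[P")
    leaked, rest = split_leaked(data, [(f1, f1 + 4), (f2, f2 + 4)], p)
    print(leaked, rest)  # b'ESP-ROM:boot\nrst:' b'[P'
```

Tracking a single "end of last frame" cursor keeps the accounting linear in the buffer size. It also guarantees that text such as the ESP-ROM: banner survives even when it sits directly next to a frame.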

@peterdragun
Collaborator

I have tried to solve the issue with missing text in the log. Now all data that is not part of a successful binary frame is printed (including some garbage bytes). Please let me know if the following works for you as well:

diff --git a/esp_idf_monitor/base/binlog.py b/esp_idf_monitor/base/binlog.py
index efd7d5c413..9ec11465eb 100644
--- a/esp_idf_monitor/base/binlog.py
+++ b/esp_idf_monitor/base/binlog.py
@@ -239,9 +239,11 @@ class BinaryLog:
         which we reject so that scan-ahead continues instead of prematurely breaking."""
         return 1 <= control.level <= 5 and control.version == 0 and control.pkg_len >= 15
 
-    def find_frames(self, data: bytes) -> Tuple[List[bytes], bytes]:
+    def find_frames(self, data: bytes) -> Tuple[List[bytes], bytes, bytes]:
         frames: List[bytes] = []
+        leaked_chunks: List[bytes] = []
         i = 0
+        idx_after_last_frame = 0
         idx_partial_frame = None  # start index of a plausible but incomplete frame
 
         while i < len(data):
@@ -264,7 +266,11 @@ class BinaryLog:
                         continue
                     frame = data[start_idx : start_idx + control.pkg_len]
                     if control.pkg_len != 0 and self.crc8(frame) == 0:
+                        # Collect non-frame bytes before this frame
+                        if start_idx > idx_after_last_frame:
+                            leaked_chunks.append(data[idx_after_last_frame:start_idx])
                         frames.append(frame)
+                        idx_after_last_frame = start_idx + control.pkg_len
                         i += control.pkg_len - 1
                     else:
                         # CRC mismatch – skip this byte and try to re-sync
@@ -278,24 +284,29 @@ class BinaryLog:
 
             i += 1
 
-        # Decide what to carry forward for the next call:
-        # 1. Stopped early on a plausible partial frame → stash from that point so
-        #    the next read can complete it.
-        # 2. Broke at the < 15-byte minimum-length check and the remaining bytes
-        #    start with a binary frame marker → stash them for the same reason.
-        #    If they don't start with a marker they are noise/text and must not be
-        #    carried forward as binary data.
-        # 3. Everything else → return b'' so stale data doesn't keep the caller
-        #    locked in binary mode across a device reset.
+        # Trailing non-frame data (e.g. boot log text after last valid frame)
         if idx_partial_frame is not None:
-            return frames, data[idx_partial_frame:]
-        if i < len(data) and self.detected(data[i]):
-            return frames, data[i:]
-        return frames, b''
+            # Stopped on partial frame: leaked text is from last frame to partial start
+            if idx_partial_frame > idx_after_last_frame:
+                leaked_chunks.append(data[idx_after_last_frame:idx_partial_frame])
+            remaining = data[idx_partial_frame:]
+        elif i < len(data) and self.detected(data[i]):
+            # Remaining bytes start with binary marker; include up to there as leaked
+            if i > idx_after_last_frame:
+                leaked_chunks.append(data[idx_after_last_frame:i])
+            remaining = data[i:]
+        else:
+            # No partial frame; everything after last frame is non-binary
+            if len(data) > idx_after_last_frame:
+                leaked_chunks.append(data[idx_after_last_frame:])
+            remaining = b''
+
+        leaked_text = b''.join(leaked_chunks)
+        return frames, remaining, leaked_text
 
-    def convert_to_text(self, data: bytes) -> Tuple[List[bytes], bytes]:
+    def convert_to_text(self, data: bytes) -> Tuple[List[bytes], bytes, bytes]:
         messages: List[bytes] = []
-        frames, incomplete_fragment = self.find_frames(data)
+        frames, incomplete_fragment, leaked_text = self.find_frames(data)
         for pkg_msg in frames:
             elf_path = self.source_of_message(pkg_msg[0])
             msg = Message(self.debug, elf_path, pkg_msg)
@@ -303,7 +314,7 @@ class BinaryLog:
                 messages += self.format_buffer_message(msg)
             else:
                 messages.append(self.format_message(msg))
-        return messages, incomplete_fragment
+        return messages, incomplete_fragment, leaked_text
 
     def format_message(self, message: Message) -> bytes:
         try:
diff --git a/esp_idf_monitor/base/serial_handler.py b/esp_idf_monitor/base/serial_handler.py
index 606d39ef7d..79c01fdd16 100644
--- a/esp_idf_monitor/base/serial_handler.py
+++ b/esp_idf_monitor/base/serial_handler.py
@@ -204,11 +204,23 @@ class SerialHandler:
 
         sp = self.splitdata(data)
         if self.binary_log_detected:
-            text_lines, self._last_line_part = self.binlog.convert_to_text(sp[0])
+            text_lines, self._last_line_part, leaked_text = self.binlog.convert_to_text(sp[0])
             for line in text_lines:
                 self.print_colored(line)
                 self.logger.handle_possible_pc_address_in_line(line)
                 self.monitor_cmd_executor.execute_from_log_line(line)
+            if leaked_text:
+                leaked_lines = leaked_text.splitlines(keepends=True) or [b'']
+                if leaked_lines and not (
+                    leaked_lines[-1].endswith(b'\n') or leaked_lines[-1].endswith(b'\r')
+                ):
+                    incomplete = leaked_lines.pop()
+                    if not self._last_line_part:
+                        self._last_line_part = incomplete
+                for line in leaked_lines:
+                    self.print_colored(line)
+                    self.logger.handle_possible_pc_address_in_line(line)
+                    self.monitor_cmd_executor.execute_from_log_line(line)
             return
 
         for line in sp:

This essentially combines the old approach with the new logic that you added.
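On the serial_handler side, the leaked bytes still have to be cut into lines. A trailing fragment without a line ending should not be printed yet, or a line such as "Saved PC:0x..." could come out in two pieces. Below is a minimal sketch of that idea. split_complete_lines is a hypothetical helper, and unlike the patch it simply prepends any pending fragment rather than storing it in _last_line_part.

```python
from typing import List, Tuple


def split_complete_lines(text: bytes, pending: bytes = b"") -> Tuple[List[bytes], bytes]:
    """Split pending + text into complete lines (line endings kept) and an unfinished tail."""
    lines = (pending + text).splitlines(keepends=True)
    tail = b""
    if lines and not lines[-1].endswith((b"\n", b"\r")):
        tail = lines.pop()  # no line ending yet: hold it until the next read
    return lines, tail


if __name__ == "__main__":
    lines, tail = split_complete_lines(b"rst:0xc (RTC_SW_CPU_RST)\r\nSaved PC:0x4038")
    print(lines, tail)  # [b'rst:0xc (RTC_SW_CPU_RST)\r\n'] b'Saved PC:0x4038'
    lines, tail = split_complete_lines(b"4418\n", tail)
    print(lines, tail)  # [b'Saved PC:0x40384418\n'] b''
```

Holding back only the last, unterminated piece means complete lines are printed immediately. Address lookups and log-line commands then see whole lines.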

@jthoward64-lexcelon
Author

Sorry, I'm on US time. Thanks for the patch! And yes, it does seem to work for me.

@peterdragun
Collaborator

@jthoward64-lexcelon Thank you again for contributing and helping out with the reproducer. I had to change a couple of things because of our internal review process and release notes generation from commit messages, so the PR was marked as "closed", but the changes have been merged. Thanks for understanding.

@jthoward64-lexcelon
Author

jthoward64-lexcelon commented Mar 4, 2026

No worries, thank you! Any idea if this will get a release in the next couple weeks?

@peterdragun
Collaborator

Not sure yet, I would like to include a couple of things in the next release that are not done yet, but I will try to do it at the end of March, maybe early April.
