MDEV-39660 Replica Crashes on Malformed Partial_rows_log_event#5093
MDEV-39660 Replica Crashes on Malformed Partial_rows_log_event#5093bnestere wants to merge 1 commit into
Conversation
If the first Partial_rows_log_event in a group provides an original_event_size that is smaller than the actual size of the underlying Rows_log_event, the slave will not allocate enough memory to hold the actual Rows_log_event content, but still try to copy that much memory, resulting in the slave crashing. This patch adds validation checking for this condition, and will error with code ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM when detected. Additionally, the error message for ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM is generalized to better support various types of broken streams. The previous message was hard-coded to be about detecting out-of-sequence fragments. The new message is updated to allow general "found %s" and "expected %s" conditions. Reviewed-by: Claude Code Signed-off-by: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
There was a problem hiding this comment.
Code Review
This pull request improves the error reporting for broken Partial_rows_log_event streams by making the messages more descriptive and adding a new test case for corrupt event sizes. It also introduces a bounds check to prevent buffer overflows when reassembling fragments. Feedback highlights potential integer truncation and overflow issues on 32-bit systems during memory allocation and bounds checking, as well as a suggestion to increase the maximum length for event type strings to prevent truncation in logs.
| : (char *) my_malloc(PSI_INSTRUMENT_ME, | ||
| partial_ev->original_event_size, MYF(MY_WME)); |
There was a problem hiding this comment.
On 32-bit systems, partial_ev->original_event_size (a uint64_t) could be larger than SIZE_MAX. Passing it directly to my_malloc (which takes size_t) will result in truncation. For example, a size of 4GB + 100 bytes would be truncated to 100 bytes. This would lead to a buffer overflow during memcpy even with the new bounds checks, because rows_ev_buf_len would still be the large 64-bit value. Consider adding a check to ensure original_event_size fits in size_t on the current platform.
| if (this->ev_len + partial_ev->get_rows_size() > this->rows_ev_buf_len) | ||
| { | ||
| char found_buf[PARTIAL_ROWS_EVENT_BAD_ORIG_SIZE_ERRSTR_LEN]; | ||
| char expect_buf[PARTIAL_ROWS_EVENT_BAD_ORIG_SIZE_ERRSTR_LEN]; | ||
| my_snprintf(found_buf, sizeof(found_buf), "row data size %" PRIu64, | ||
| this->ev_len + partial_ev->get_rows_size()); | ||
| my_snprintf(expect_buf, sizeof(expect_buf), "row data size %" PRIu64, | ||
| this->rows_ev_buf_len); | ||
| rgi->rli->report(ERROR_LEVEL, ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM, | ||
| rgi->gtid_info(), | ||
| ER_THD(rgi->thd, ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM), | ||
| found_buf, expect_buf); | ||
| return ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM; | ||
| } |
There was a problem hiding this comment.
The check this->ev_len + partial_ev->get_rows_size() > this->rows_ev_buf_len can suffer from integer overflow on 32-bit systems if ev_len and the fragment size are 32-bit types (like ulong and size_t), as the addition is performed in 32-bit arithmetic before being compared to the 64-bit rows_ev_buf_len. Using subtraction (partial_ev->get_rows_size() > this->rows_ev_buf_len - this->ev_len) is safer. Also, ensure the sum in the error message is calculated using 64-bit arithmetic to avoid reporting a truncated value.
if (partial_ev->get_rows_size() > this->rows_ev_buf_len - this->ev_len)
{
char found_buf[PARTIAL_ROWS_EVENT_BAD_ORIG_SIZE_ERRSTR_LEN];
char expect_buf[PARTIAL_ROWS_EVENT_BAD_ORIG_SIZE_ERRSTR_LEN];
my_snprintf(found_buf, sizeof(found_buf), "row data size %" PRIu64,
(uint64_t)this->ev_len + partial_ev->get_rows_size());
my_snprintf(expect_buf, sizeof(expect_buf), "row data size %" PRIu64,
this->rows_ev_buf_len);
rgi->rli->report(ERROR_LEVEL, ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM,
rgi->gtid_info(),
ER_THD(rgi->thd, ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM),
found_buf, expect_buf);
return ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM;
}| (sizeof("row data size ") + 20 + 1) | ||
|
|
||
| // Len of "type %s" where %s is Log_event::get_type_str() | ||
| constexpr size_t MAX_LOG_EVENT_TYPE_STR_LEN= 25; // Log_event::get_type_str() |
There was a problem hiding this comment.
The constant MAX_LOG_EVENT_TYPE_STR_LEN is set to 25, which is exactly the length of some MariaDB event type strings (e.g., Update_rows_compressed_v1). Increasing this to 32 would provide better headroom and avoid potential truncation in error messages if longer event names are added in the future.
constexpr size_t MAX_LOG_EVENT_TYPE_STR_LEN= 32; // Log_event::get_type_str()
If the first Partial_rows_log_event in a group provides an original_event_size that is smaller than the actual size of the underlying Rows_log_event, the slave will not allocate enough memory to hold the actual Rows_log_event content, but still try to copy that much memory, resulting in the slave crashing.
This patch adds validation checking for this condition, and will error with code ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM when detected.
Additionally, the error message for ER_PARTIAL_ROWS_LOG_EVENT_BAD_STREAM is generalized to better support various types of broken streams. The previous message was hard-coded to be about detecting out-of-sequence fragments. The new message is updated to allow general "found %s" and "expected %s" conditions.