Fix lexer infinite loop / abort on invalid UTF-8 byte #2973

Open
ksss wants to merge 1 commit into ruby:master from ksss:ksss/fix-lexer-invalid-utf8
ksss (Collaborator) commented May 25, 2026

Summary

rbs_next_char left byte_len = 0 when the active encoding's char_width() rejected a byte. Depending on where the byte appeared, the lexer either:

  • looped forever inside a comment (GVL held, SIGINT ineffective), or
  • tripped RBS_ASSERT(current_character_bytes > 0, ...) in rbs_skip at top level and called exit(1).

This PR makes rbs_next_char advance one byte in that case so the lexer always makes progress; the invalid byte then flows into the usual parser error path.

Reproducer

buf = RBS::Buffer.new(content: "# \xC2".force_encoding("UTF-8"), name: "x.rbs")
RBS::Parser._parse_signature(buf, 0, buf.content.bytesize)
# hangs forever; only `kill -9` stops the process
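
Why this input is invalid can be checked with plain Ruby string methods, without loading RBS. The byte 0xC2 is a UTF-8 lead byte that announces a two-byte sequence, but here it is the last byte of the buffer, so no continuation byte follows. The variable name `content` below is only for illustration.

```ruby
# The reproducer's buffer content: "#", " ", then a lone lead byte 0xC2.
content = "# \xC2".force_encoding("UTF-8")

size  = content.bytesize         # 3 bytes, matching the (0, 3) range passed to the parser
valid = content.valid_encoding?  # false: 0xC2 needs a continuation byte that is missing
last  = content.bytes.last       # 0xC2 (194), the byte that char_width() rejects
```

Because the invalid byte sits after `#`, the lexer is inside a comment when it reaches it, which is the hanging case described in the summary.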

Fix

src/lexstate.c (rbs_next_char): when encoding->char_width() returns 0, treat the byte as a 1-byte garbage character and advance one byte. The lexer's invariant "cursor advances by at least one byte per step" is now preserved on every code path. Valid UTF-8 input is unaffected because char_width() never returns 0 for valid sequences.
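
The progress invariant can be illustrated with a small Ruby model. This is a sketch, not the code in src/lexstate.c: `char_width`, `next_char_width`, and `scan_widths` are hypothetical helpers, and the width check is a simplified UTF-8 validator (it ignores overlong and surrogate cases) that stands in for the encoding's char_width() callback.

```ruby
# Hypothetical stand-in for encoding->char_width(): returns the byte length
# of the UTF-8 sequence starting at pos, or 0 when the bytes are rejected.
# Simplified: does not check overlong encodings or surrogates.
def char_width(bytes, pos)
  b = bytes[pos]
  len =
    if b < 0x80 then 1
    elsif b.between?(0xC2, 0xDF) then 2
    elsif b.between?(0xE0, 0xEF) then 3
    elsif b.between?(0xF0, 0xF4) then 4
    else 0
    end
  return 0 if len.zero? || pos + len > bytes.size

  # Every trailing byte must be a continuation byte (0x80..0xBF).
  ok = (1...len).all? { |i| (bytes[pos + i] & 0xC0) == 0x80 }
  ok ? len : 0
end

# Model of the fix: a rejected byte is consumed as a 1-byte garbage character,
# so the returned width is never 0.
def next_char_width(bytes, pos)
  width = char_width(bytes, pos)
  width.zero? ? 1 : width
end

# Walks the whole input and records how far the cursor moved at each step.
def scan_widths(source)
  bytes = source.bytes
  pos = 0
  widths = []
  while pos < bytes.size
    width = next_char_width(bytes, pos)
    widths << width
    pos += width
  end
  widths
end
```

In this model, `scan_widths("# \xC2".b)` terminates because the lone 0xC2 is consumed with width 1. Without the fallback in `next_char_width`, the loop would add 0 to `pos` forever, which mirrors the hang inside a comment.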


This PR description was written by Claude Code.

When the active encoding's `char_width` returned 0 for a byte,
`rbs_next_char` left `byte_len = 0`. The lexer then either looped
forever (when the byte was inside a comment) or tripped
`RBS_ASSERT(current_character_bytes > 0, ...)` in `rbs_skip` at top
level.

Treat such a byte as a 1-byte garbage character so the lexer always
advances at least one byte. The invalid byte then surfaces as a
regular parsing error through the existing error path.

Minimal reproducer that used to hang the host process indefinitely
with the GVL held:

  RBS::Parser._parse_signature(
    RBS::Buffer.new(content: "# \xC2".force_encoding("UTF-8"),
                    name: "x.rbs"),
    0, 3
  )

Found by fuzzing the parser entry points with random byte mutations
of the existing seed RBS files.
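
The fuzzer itself is not shown in this PR. As a rough sketch of what "random byte mutations" of a seed file could look like, the hypothetical helper below overwrites a few randomly chosen bytes; the name `mutate_bytes`, the `flips:` parameter, and the inline seed text are all assumptions for illustration.

```ruby
# Hypothetical mutation helper: returns a binary copy of seed with `flips`
# randomly chosen positions overwritten by random byte values.
def mutate_bytes(seed, rng, flips: 1)
  bytes = seed.b.bytes
  return seed.b if bytes.empty?

  flips.times do
    bytes[rng.rand(bytes.size)] = rng.rand(256)
  end
  bytes.pack("C*")
end

# A fixed seed makes any interesting mutant reproducible.
rng = Random.new(42)
seed = "class Foo\n  def bar: () -> void\nend\n"
mutant = mutate_bytes(seed, rng, flips: 3)
```

Since the hang described above happened with the GVL held, an in-process timeout would probably not be able to interrupt it; a harness built on this idea would likely need to run each mutant in a child process and kill it from outside.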

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ksss ksss added this to the RBS 4.1 milestone May 25, 2026
@ksss ksss force-pushed the ksss/fix-lexer-invalid-utf8 branch from 4d292ab to e9612b4 Compare May 25, 2026 06:55