Skip to content
15 changes: 14 additions & 1 deletion integration_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -502,4 +502,17 @@ def test_define(record_property, tmpdir): # #589

assert exitcode == 0
assert stderr == "test.cpp:1: syntax error: failed to expand 'TEST_P', Invalid ## usage when expanding 'TEST_P': Unexpected token ')'\n"
assert stdout == '\n'
assert stdout == '\n'

def test_utf16_bom(tmpdir):
test_file = os.path.join(tmpdir, "test.cpp")
with open(test_file, 'wb') as f:
f.write(b'\xFF\xFE\x3B\x00')

args = [test_file]

exitcode, stdout, stderr = simplecpp(args, cwd=tmpdir)

assert exitcode == 0
assert stderr == ''
assert stdout == ';\n'
10 changes: 6 additions & 4 deletions simplecpp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -275,8 +275,10 @@ class simplecpp::TokenList::Stream {
return ch;
}

unsigned char peekChar() {
auto ch = static_cast<unsigned char>(peek());
int peekChar() {
int ch = peek();
if (ch == EOF)
return ch;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate that we handle EOF. However, spontanously it does not seem optimal to return some abitrary ch value. It's implementation defined what value of EOF is right? If we would return 0xFF we at least know what the return value is. Or do we want to have the implementation defined return value?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to return EOF since we are a custom layer on top of getc(). And since we use the EOF macro it is not "arbitrary" but the one defined by the libc implementation.

Copy link
Copy Markdown
Collaborator

@danmar danmar May 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well.. we are not returning EOF. That is why I said it returns a arbitrary value. I.e. if EOF is FFFFFF41 then we return a 'A' char (0x41) here which is not equal to EOF.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to return EOF since we are a custom layer on top of getc().

spontanously this sounds best to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EOF is int, but the return type is unsigned char. We already return EOF cast to unsigned char.

Copy link
Copy Markdown
Collaborator

@danmar danmar May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more proper that peekChar() returns the int value.

If we assume that EOF is always -1 then as far as I see our usage will be safe currently. However it could become unsafe in the future.

    if (num && ch=='\'' && isNameChar(stream.peekChar()))

isNameChar returns false for FF and it should return false for EOF.

If we assume that EOF returns 'A' on some machine then that code will suddenly not work properly. It will think that 'A' is a name char.


// For UTF-16 encoded files the BOM is 0xfeff/0xfffe. If the
// character is non-ASCII character then replace it with 0xff
Expand All @@ -285,7 +287,7 @@ class simplecpp::TokenList::Stream {
const auto ch2 = static_cast<unsigned char>(peek());
unget();
const int ch16 = makeUtf16Char(ch, ch2);
ch = static_cast<unsigned char>(((ch16 >= 0x80) ? 0xff : ch16));
ch = (ch16 >= 0x80) ? 0xff : ch16;
}

// Handling of newlines..
Expand Down Expand Up @@ -598,7 +600,7 @@ std::string simplecpp::TokenList::stringify(bool linenrs) const
return ret.str();
}

static bool isNameChar(unsigned char ch)
static bool isNameChar(int ch)
{
return std::isalnum(ch) || ch == '_' || ch == '$';
}
Expand Down
Loading