Affected page
https://www.php.net/manual/en/function.htmlspecialchars.php
Current issue
The htmlspecialchars() manual documents ENT_SUBSTITUTE and explains that it replaces invalid code unit sequences with the Unicode replacement character. The manual also warns that ENT_IGNORE may have security implications.
However, the Examples section does not show a minimal example of this behavior. As a result, readers may not immediately understand what ENT_SUBSTITUTE returns when the input contains an invalid UTF-8 byte.
Suggested improvement
Add a simple example using an invalid UTF-8 byte.
<?php
// "\x80" is not a valid UTF-8 byte sequence.
$input = "\x80";
echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'));
Expected output:
efbfbd
Add an explanation:
This example uses bin2hex() so that the result does not depend on terminal rendering.
The output efbfbd is the UTF-8 byte sequence for U+FFFD, the Unicode replacement character.
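For contrast, it may also help to show what happens when neither ENT_SUBSTITUTE nor ENT_IGNORE is passed. The following is only a sketch, not proposed manual wording; the input string and variable names are chosen for illustration. It relies on the documented behavior that an invalid code unit sequence makes htmlspecialchars() return an empty string unless one of those two flags is set. Note that since PHP 8.1 the default flags already include ENT_SUBSTITUTE, so the empty result appears only when flags are passed explicitly without it.

```php
<?php
// Illustrative input: valid ASCII followed by one invalid UTF-8 byte.
$input = "abc\x80";

// Flags passed explicitly without ENT_SUBSTITUTE or ENT_IGNORE:
// the whole result is an empty string, not just the invalid byte.
$withoutFlag = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the valid part is kept and the invalid byte
// becomes U+FFFD (UTF-8 bytes ef bf bd).
$withSubstitute = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($withoutFlag === '');      // bool(true)
echo bin2hex($withSubstitute), "\n"; // 616263efbfbd
```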
Optionally, add a short comparison with ENT_IGNORE:
<?php
// "\x80" is not a valid UTF-8 byte sequence.
$input = "atta\x80ck";
echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8')), "\n";
echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')), "\n";
Expected output:
61747461636b
61747461efbfbd636b
With ENT_IGNORE, the invalid byte is discarded. With ENT_SUBSTITUTE, the invalid byte is replaced by U+FFFD, whose UTF-8 byte sequence is efbfbd. Replacing invalid byte sequences is generally preferable to silently deleting them, because deletion can join surrounding text and change its meaning.
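To make the "joined text" concern concrete, here is a minimal sketch. It assumes a hypothetical application that checks the raw input against a blocked word before escaping it; contains_blocked_word() is an invented helper for illustration and is not part of PHP or the manual.

```php
<?php
// Hypothetical helper, for illustration only: a naive substring check.
function contains_blocked_word(string $text): bool
{
    return strpos($text, 'attack') !== false;
}

// The invalid byte "\x80" splits the word, so a check on the raw input misses it.
$input = "atta\x80ck";

$ignored     = htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8');
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump(contains_blocked_word($input));       // bool(false): raw input passes the check
var_dump(contains_blocked_word($ignored));     // bool(true):  deletion rejoined "attack"
var_dump(contains_blocked_word($substituted)); // bool(false): U+FFFD keeps the halves apart
```

The point of the sketch is only that ENT_IGNORE can produce output containing text that was not present as a contiguous run in the input, while ENT_SUBSTITUTE leaves a visible marker where the invalid byte was.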
Additional context (optional)
The current manual already warns that using ENT_IGNORE may have security implications. A short example would make that warning easier to understand.
Unicode Technical Report 36 explains that replacing ill-formed subsequences is generally safer than deleting them, because deletion can join text that would otherwise remain separate. The report is stabilized and no longer maintained, so it should be used as historical background rather than as the latest Unicode security guidance.