Add an example showing how ENT_SUBSTITUTE replaces invalid UTF-8 bytes #5561

@masakielastic

Description

Affected page

https://www.php.net/manual/en/function.htmlspecialchars.php

Current issue

The htmlspecialchars() manual documents ENT_SUBSTITUTE and explains that it replaces invalid code unit sequences with the Unicode replacement character. The manual also warns that ENT_IGNORE may have security implications.

However, the Examples section does not include a minimal example of this behavior. As a result, readers may not immediately understand what htmlspecialchars() returns with ENT_SUBSTITUTE when the input contains an invalid UTF-8 byte.

Suggested improvement

Add a simple example using an invalid UTF-8 byte.

<?php
// "\x80" is a lone continuation byte, which is not a valid UTF-8 byte sequence on its own.
$input = "\x80";

echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'));

Expected output:

efbfbd

Add an explanation:

This example uses bin2hex() so that the result does not depend on terminal rendering.
The output efbfbd is the UTF-8 byte sequence for U+FFFD, the Unicode replacement character.
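
As a cross-check of the explanation above, the following minimal sketch compares the result against PHP's Unicode codepoint escape. It assumes PHP 7.0 or later for the "\u{FFFD}" syntax; the variable names are illustrative only and are not part of the proposed manual text.

```php
<?php
// "\u{FFFD}" expands to the UTF-8 bytes of U+FFFD (requires PHP 7.0+).
$replacement = "\u{FFFD}";

// Same call as in the suggested example: a lone "\x80" byte is invalid UTF-8.
$output = htmlspecialchars("\x80", ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo bin2hex($replacement), "\n";   // efbfbd
var_dump($output === $replacement); // bool(true)
```

Comparing against the escape sequence, rather than a literal character pasted into the source, keeps the check independent of the source file's encoding.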

Optionally, add a short comparison with ENT_IGNORE:

<?php
// The "\x80" byte in the middle of the string is not a valid UTF-8 byte sequence.
$input = "atta\x80ck";

echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8')), "\n";
echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')), "\n";

Expected output:

61747461636b
61747461efbfbd636b

With ENT_IGNORE, the invalid byte is discarded. With ENT_SUBSTITUTE, the invalid byte is replaced by U+FFFD, whose UTF-8 byte sequence is efbfbd. Replacing invalid byte sequences is generally preferable to silently deleting them, because deletion can join surrounding text and change its meaning.

Additional context (optional)

The current manual already warns that using ENT_IGNORE may have security implications. A short example would make that warning easier to understand.

Unicode Technical Report #36 (Unicode Security Considerations) explains that replacing ill-formed subsequences is generally safer than deleting them, because deletion can join text that would otherwise remain separate. The report is stabilized and no longer maintained, so it should be used as historical background rather than as the latest Unicode security guidance.
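
The joining risk described above could be sketched as follows. The scenario is hypothetical: it assumes a naive filter that rejects input containing the word "attack" before the string is escaped. The filter and the variable names are illustrative assumptions, not something taken from the manual.

```php
<?php
// Hypothetical input: an invalid UTF-8 byte splits the word "attack".
$input = "atta\x80ck";

// A naive substring check on the raw input does not find the word.
$passesFilter = strpos($input, 'attack') === false;

$ignored     = htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8');
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE drops the byte, so the two halves join into "attack".
$ignoredContainsWord = strpos($ignored, 'attack') !== false;

// ENT_SUBSTITUTE keeps the halves apart with U+FFFD between them.
$substitutedContainsWord = strpos($substituted, 'attack') !== false;

var_dump($passesFilter);            // bool(true)
var_dump($ignoredContainsWord);     // bool(true)
var_dump($substitutedContainsWord); // bool(false)
```

In this sketch the word only appears after escaping with ENT_IGNORE, which is the kind of change in meaning that the manual's warning refers to.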
