Add an example showing how ENT_SUBSTITUTE replaces invalid UTF-8 bytes

### Affected page

https://www.php.net/manual/en/function.htmlspecialchars.php

### Current issue

The `htmlspecialchars()` manual documents `ENT_SUBSTITUTE` and explains that it replaces invalid code unit sequences with the Unicode replacement character. The manual also warns that `ENT_IGNORE` may have security implications.

However, the Examples section does not show a minimal example of this behavior. As a result, readers may not immediately understand what `ENT_SUBSTITUTE` returns when the input contains an invalid UTF-8 byte.

### Suggested improvement

Add a simple example using an invalid UTF-8 byte.

```php
<?php
// "\x80" is a lone continuation byte, so it is not a valid UTF-8 byte sequence.
$input = "\x80";

echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'));
```

Expected output:

```text
efbfbd
```

Add an explanation:


This example uses `bin2hex()` so that the result does not depend on terminal rendering. The output `efbfbd` is the UTF-8 byte sequence for `U+FFFD`, the Unicode replacement character.
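
The byte sequence can also be checked independently of `htmlspecialchars()`. The following is only a sketch, assuming PHP 7.0 or later for the `\u{...}` code point escape; the variable name `$replacement` is illustrative and not taken from the manual.

```php
<?php
// "\u{FFFD}" is PHP's Unicode code point escape (PHP 7.0+); it yields the UTF-8 bytes for U+FFFD.
$replacement = "\u{FFFD}";

echo bin2hex($replacement), "\n"; // efbfbd

// The substituted result for a lone invalid byte is exactly that character.
var_dump(htmlspecialchars("\x80", ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') === $replacement); // bool(true)
```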

Optionally, add a short comparison with `ENT_IGNORE`:

```php
<?php
// The lone "\x80" byte in the middle makes this string invalid UTF-8.
$input = "atta\x80ck";

echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8')), "\n";
echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')), "\n";
```

Expected output:

```text
61747461636b
61747461efbfbd636b
```

With `ENT_IGNORE`, the invalid byte is discarded. With `ENT_SUBSTITUTE`, the invalid byte is replaced by `U+FFFD`, whose UTF-8 byte sequence is `efbfbd`. Replacing invalid byte sequences is generally preferable to silently deleting them, because deletion can join surrounding text and change its meaning.
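
A third case may be worth showing for contrast: when neither flag is passed explicitly, the manual states that an empty string is returned for input containing an invalid code unit sequence. The sketch below passes `ENT_QUOTES` alone on purpose, because as of PHP 8.1.0 the default value of `$flags` already includes `ENT_SUBSTITUTE`; the variable names are illustrative only.

```php
<?php
// Same invalid input as in the comparison: a lone "\x80" inside ASCII text.
$input = "atta\x80ck";

// Flags are given explicitly so that neither ENT_IGNORE nor ENT_SUBSTITUTE is set.
$result = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump($result); // string(0) ""
```

Here the whole string is lost rather than only the invalid byte, which is one reason `ENT_SUBSTITUTE` tends to be the more practical choice.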

### Additional context (optional)

The current manual already warns that using `ENT_IGNORE` may have security implications. A short example would make that warning easier to understand.

Unicode Technical Report #36 (Unicode Security Considerations) explains that replacing ill-formed subsequences is generally safer than deleting them, because deletion can join text that would otherwise remain separate. The report is stabilized and no longer maintained, so it should be treated as historical background rather than as the latest Unicode security guidance.
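
The joining effect can be illustrated with a deliberately naive check. This is a hypothetical scenario, not something taken from the manual: it assumes an application that looks for the substring `javascript:` in the raw input before escaping it. The variable names and the filter itself are made up for illustration.

```php
<?php
// Hypothetical input: an invalid byte splits a substring that a naive filter looks for.
$input = "java\x80script:alert(1)";

// Naive, byte-based check on the raw input: the substring is not found, so the input "passes".
$passesFilter = stripos($input, 'javascript:') === false;

$ignored     = htmlspecialchars($input, ENT_QUOTES | ENT_IGNORE, 'UTF-8');
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($passesFilter);                                   // bool(true)
var_dump(stripos($ignored, 'javascript:') !== false);      // bool(true): deletion joined the text
var_dump(stripos($substituted, 'javascript:') !== false);  // bool(false): U+FFFD keeps it apart
```

With `ENT_IGNORE`, the output contains a substring that was absent from the input that was checked. With `ENT_SUBSTITUTE`, the two halves stay separated by `U+FFFD`.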

Add an example showing how ENT_SUBSTITUTE replaces invalid UTF-8 bytes #5561