Add basic examples for mb_scrub() #5562

@masakielastic

Description

Affected page

https://www.php.net/manual/en/function.mb-scrub.php

Current issue

The mb_scrub() manual page explains that the function replaces ill-formed byte sequences with the substitute character, but it currently has no examples.

Without an example, it is difficult to see that the replacement happens inside the PHP string itself. Visual output alone can also be misleading, because terminals and browsers may display malformed byte sequences using a replacement character.

As the original author of mb_scrub(), I apologize that the manual page has remained without examples for a long time.

Suggested improvement

Add a basic example using bin2hex() to show the byte-level replacement performed by mb_scrub().

Also add a small example showing that the scrubbed string can be passed to a UTF-8 aware PCRE pattern using the u modifier.

Proposed example: byte-level replacement

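<?php

// mb_scrub() inserts the current mbstring substitute character, which is
// "?" unless configured otherwise. Select U+FFFD explicitly for this example.
mb_substitute_character(0xFFFD);

$input = "A\xFFB";

echo bin2hex($input), "\n";

$clean = mb_scrub($input, 'UTF-8');

echo bin2hex($clean), "\n";

Expected output:

41ff42
41efbfbd42

Suggested explanation:

The byte ff is never valid in UTF-8. mb_scrub() replaces it with the substitute character set by mb_substitute_character(); here that is U+FFFD REPLACEMENT CHARACTER, whose UTF-8 encoding is efbfbd.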

This example uses bin2hex() so that the result does not depend on how a terminal, browser, or font displays malformed byte sequences.
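The bytes that mb_scrub() writes depend on the substitute character configured through mb_substitute_character(). The following is a minimal sketch of that dependency, not proposed manual text; scrub_hex() is a hypothetical helper introduced only for this illustration, and it restores the previous setting so that it does not affect other code.

```php
<?php

// Hypothetical helper for this sketch: scrub $input as UTF-8 using the given
// substitute character and return the result as a hex string.
function scrub_hex(string $input, int $substitute): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character($substitute);
    $hex = bin2hex(mb_scrub($input, 'UTF-8'));
    mb_substitute_character($previous);      // restore it
    return $hex;
}

$input = "A\xFFB";

$withQuestionMark = scrub_hex($input, 0x3F);   // U+003F QUESTION MARK
$withReplacement  = scrub_hex($input, 0xFFFD); // U+FFFD REPLACEMENT CHARACTER

echo $withQuestionMark, "\n"; // 413f42
echo $withReplacement, "\n";  // 41efbfbd42
```

Setting the substitute character explicitly keeps the hex output reproducible regardless of the mbstring configuration on the machine running the example.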

Proposed example: use before UTF-8 aware PCRE processing

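<?php

$input = "A\xFFB";

// Replace the ill-formed byte so the string is valid UTF-8.
$clean = mb_scrub($input, 'UTF-8');

// Count code points: "u" enables UTF-8 mode, "s" lets "." match newlines too.
$count = preg_match_all('/./us', $clean);

// preg_match_all() returns false on failure, e.g. for malformed UTF-8.
// preg_last_error_msg() is available as of PHP 8.0.
if ($count === false) {
    echo preg_last_error_msg(), "\n";
    exit;
}

echo $count, "\n";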

Expected output:

3

Suggested explanation:

mb_scrub() can be useful before passing external input to UTF-8 aware text processing functions. In this example, the malformed byte sequence is replaced before the string is passed to a PCRE pattern using the u modifier.

The result of preg_match_all() is the number of full pattern matches. In this example, /./us matches each UTF-8 code point, including the substitute character inserted by mb_scrub(), so the count is 3: A, the substitute character, and B.
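For contrast, here is a minimal sketch of what happens when the same input reaches the pattern without scrubbing. It is an illustration rather than proposed manual text, and the variable names are chosen only for this example. With the u modifier, PCRE rejects a malformed subject, so preg_match_all() returns false instead of a count.

```php
<?php

$input = "A\xFFB";

// Without scrubbing: the subject is not valid UTF-8, so the call fails.
$rawCount = preg_match_all('/./us', $input);
$rawError = preg_last_error(); // read it before the next preg_* call resets it

// With scrubbing: the ill-formed byte is replaced first, so the call succeeds.
$cleanCount = preg_match_all('/./us', mb_scrub($input, 'UTF-8'));

var_dump($rawCount);                         // bool(false)
var_dump($rawError === PREG_BAD_UTF8_ERROR); // bool(true)
var_dump($cleanCount);                       // int(3)
```

The count of 3 does not depend on which substitute character is configured, as long as it is a single code point.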

Additional context (optional)

This issue intentionally focuses only on simple examples for mb_scrub().

Related topics, such as how PCRE functions and grapheme_* functions handle invalid UTF-8 input, how streamed input should be handled, and how substitution differs from validation, escaping, or database storage behavior, can be discussed separately.
