gh-111089: Add PyUnicode_AsUTF8Unsafe() function #111672
vstinner wants to merge 1 commit into python:main
Moreover, PyUnicode_AsUTF8AndSize(str, NULL) now raises an exception if the string contains embedded null characters.
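To illustrate why a missing size matters, here is a minimal sketch in plain C. It does not call the CPython API: the hypothetical helper `utf8_has_embedded_null()` only models the kind of check discussed here, assuming a buffer and size shaped like the ones `PyUnicode_AsUTF8AndSize(unicode, &size)` returns (a trailing null byte that is not counted in the size).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the CPython API.
 * buf and size model the result of PyUnicode_AsUTF8AndSize(unicode, &size):
 * buf carries a trailing null byte that is not counted in size.
 * Returns 1 if a null byte occurs inside the data, 0 otherwise. */
static int
utf8_has_embedded_null(const char *buf, size_t size)
{
    assert(buf != NULL);
    /* strlen() stops at the first null byte, so it differs from size
     * exactly when the data contains an embedded null. */
    return strlen(buf) != size;
}
```

A caller that only receives the `const char *` and treats it as a C string sees `"ab"` for the 5-byte value `"ab\0cd"`, which is the silent truncation the exception is meant to prevent.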
@serhiy-storchaka suggested in private that if The change adds
Apparently, this is a disagreement on the PyUnicode_AsUTF8() change which rejects null characters: #111091 (comment)
serhiy-storchaka left a comment
It is more consistent with PyUnicode_AsWideCharString() and PyBytes_AsStringAndSize().
In general LGTM (besides some nitpicks), but I would wait until the ongoing discussion has been finished.
An alternative is to restore the PyUnicode_AsUTF8() behavior and introduce PyUnicode_AsUTF8Safe(). Then PyUnicode_AsUTF8() can be removed from the Limited C API and deprecated as it was initially planned.
    returned buffer always has an extra null byte appended (not included in
    *size*), regardless of whether there are any other null code points.

    If *size* is NULL and the *unicode* string contains embedded null
The wording differs from the one for PyUnicode_AsWideCharString(). It would be better to use the same wording for the same behavior, so that users do not need to search for non-existent differences.
    If *size* is NULL and the *unicode* string contains embedded null
    characters, raise an exception. To accept embedded null characters and
    truncate on purpose at the first null byte, :c:func:`PyUnicode_AsUTF8Unsafe`
    and :c:func:`PyUnicode_AsUTF8AndSize(unicode, &size)
This is a self-reference. It is unlikely to be useful.
    Similar to :c:func:`PyUnicode_AsUTF8AndSize(unicode, NULL)
    <PyUnicode_AsUTF8AndSize>`, but does not store the size.
PyUnicode_AsUTF8AndSize(unicode, NULL) does not store size either.
Maybe just say that it is equivalent to PyUnicode_AsUTF8AndSize(unicode, NULL)? And no more explanations will be needed.
    #if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030D0000
    PyAPI_FUNC(const char*) PyUnicode_AsUTF8Unsafe(PyObject *unicode);
    #endif
Maybe do not add it to the Limited C API? PyUnicode_AsUTF8() was not in the Limited C API before 3.13.
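For readers unfamiliar with the guard in the hunk above, the following sketch models it in plain C. It assumes the usual CPython hex version layout (major version in the top byte, minor version in the next byte), so `0x030D0000` corresponds to 3.13. The macros `HEX_MAJOR` and `HEX_MINOR` and the function `guard_passes()` are hypothetical names used only for illustration; they are not defined by CPython.

```c
#include <assert.h>

/* Hypothetical macros: extract the major and minor version from a
 * hex version number such as 0x030D0000. */
#define HEX_MAJOR(hex) (((hex) >> 24) & 0xFF)
#define HEX_MINOR(hex) (((hex) >> 16) & 0xFF)

/* Hypothetical runtime model of the preprocessor condition
 *   !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030D0000
 * has_limited_api: 1 if Py_LIMITED_API is defined, 0 otherwise.
 * limited_api_hex: its value (0 models a macro defined with no value,
 * which is what the "+0" in the real condition evaluates to). */
static int
guard_passes(int has_limited_api, unsigned long limited_api_hex)
{
    assert(has_limited_api == 0 || has_limited_api == 1);
    return !has_limited_api || limited_api_hex >= 0x030D0000UL;
}
```

Under this model the declaration is visible to code that does not opt into the Limited C API, and to Limited C API code that targets 3.13 or newer, but not to code that targets 3.12 (`0x030C0000`).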
    // and subsequent calls will return the same string. The memory is released
    // when the Unicode object is deallocated.
    PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);
    PyAPI_FUNC(const char*) PyUnicode_AsUTF8(PyObject *unicode);
BTW, this function should only be available in the Limited C API 3.13.
I abandon this PR in favor of the opposite approach: add PyUnicode_AsUTF8Safe(), PR #111688.
📚 Documentation preview 📚: https://cpython-previews--111672.org.readthedocs.build/