gh-111089: Add PyUnicode_AsUTF8Unsafe() function by vstinner · Pull Request #111672 · python/cpython

vstinner · 2023-11-03T01:10:22Z

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) now raises an exception if the string contains embedded null characters.

Issue: [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

📚 Documentation preview 📚: https://cpython-previews--111672.org.readthedocs.build/


vstinner · 2023-11-03T01:13:28Z

@serhiy-storchaka suggested in private that if PyUnicode_AsUTF8(str) raises an exception on embedded null characters, PyUnicode_AsUTF8AndSize(str, NULL) should also raise. So I wrote this draft PR to implement this idea.

The change adds PyUnicode_AsUTF8Unsafe() (name open for bikeshedding) which is like PyUnicode_AsUTF8() but doesn't reject null characters.
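To make the proposed semantics concrete, here is a minimal sketch in plain C that models them without using the CPython API. The `utf8_view` type and the `view_as_utf8_and_size()` and `view_as_utf8_unsafe()` helpers are hypothetical names invented for this illustration. The sketch assumes that an embedded null is detected by comparing `strlen()` against the stored byte length, and it is not taken from the PR's implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string's cached UTF-8 buffer:
   `data` holds `size` bytes followed by one extra terminating '\0'. */
typedef struct {
    const char *data;
    size_t size;
} utf8_view;

/* Models the proposed PyUnicode_AsUTF8AndSize(str, size) behavior:
   - size != NULL: the caller receives the length, so embedded nulls
     are accepted;
   - size == NULL: the caller can only rely on the terminator, so a
     buffer with an embedded null is rejected. This sketch returns NULL;
     the real function would also set an exception. */
static const char *
view_as_utf8_and_size(const utf8_view *v, size_t *size)
{
    if (size != NULL) {
        *size = v->size;
        return v->data;
    }
    if (strlen(v->data) != v->size) {
        return NULL;  /* embedded null character */
    }
    return v->data;
}

/* Models the "unsafe" variant: it never rejects, so a caller that
   treats the result as a C string sees it truncated at the first
   null byte. */
static const char *
view_as_utf8_unsafe(const utf8_view *v)
{
    return v->data;
}
```

With a view over the 7 bytes `"abc\0def"`, the NULL-size call is rejected, the call with a size pointer reports 7 bytes, and `strlen()` on the unsafe result is 3.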

vstinner · 2023-11-03T01:14:44Z

Apparently, there is a disagreement on the PyUnicode_AsUTF8() change which rejects null characters: #111091 (comment)

serhiy-storchaka

It is more consistent with PyUnicode_AsWideCharString() and PyBytes_AsStringAndSize().

In general LGTM (besides some nitpicks), but I would wait until the ongoing discussion has been finished.

An alternative is to restore the PyUnicode_AsUTF8() behavior and introduce PyUnicode_AsUTF8Safe(). Then PyUnicode_AsUTF8() can be removed from the Limited C API and deprecated as it was initially planned.

serhiy-storchaka · 2023-11-03T08:02:03Z

Doc/c-api/unicode.rst

   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

+   If *size* is NULL and the *unicode* string contains embedded null


The wording differs from the one for PyUnicode_AsWideCharString(). It would be better to have the same wording for the same behavior, so that users do not need to search for non-existent differences.

serhiy-storchaka · 2023-11-03T08:02:54Z

Doc/c-api/unicode.rst

+   If *size* is NULL and the *unicode* string contains embedded null
+   characters, raise an exception. To accept embedded null characters and
+   truncate on purpose at the first null byte, :c:func:`PyUnicode_AsUTF8Unsafe`
+   and :c:func:`PyUnicode_AsUTF8AndSize(unicode, &size)


This is a self-reference. It is unlikely to be useful.

serhiy-storchaka · 2023-11-03T08:06:41Z

Doc/c-api/unicode.rst

+   Similar to :c:func:`PyUnicode_AsUTF8AndSize(unicode, NULL)
+   <PyUnicode_AsUTF8AndSize>`, but does not store the size.


PyUnicode_AsUTF8AndSize(unicode, NULL) does not store the size either.

Maybe just say that it is equivalent to PyUnicode_AsUTF8AndSize(unicode, NULL)? Then no further explanation will be needed.

serhiy-storchaka · 2023-11-03T08:09:17Z

Include/unicodeobject.h

+#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030D0000
+PyAPI_FUNC(const char*) PyUnicode_AsUTF8Unsafe(PyObject *unicode);
+#endif


Maybe do not add it to the Limited C API? PyUnicode_AsUTF8() was not in the Limited C API before 3.13.
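For readers unfamiliar with the guard in the hunk above: `0x030D0000` is a version hex in which the top byte is the major version and the next byte is the minor version, so the declaration is visible when `Py_LIMITED_API` is undefined or targets 3.13 or newer. The following sketch decodes those two fields in plain C. The `version_major()` and `version_minor()` helpers are hypothetical names for this illustration and are not part of CPython.

```c
/* Hypothetical helpers: unpack the major and minor fields of a
   CPython-style version hex (major in bits 24-31, minor in bits 16-23). */
static unsigned
version_major(unsigned long hex)
{
    return (unsigned)((hex >> 24) & 0xFF);
}

static unsigned
version_minor(unsigned long hex)
{
    return (unsigned)((hex >> 16) & 0xFF);
}
```

Applied to `0x030D0000`, these return 3 and 13 (hexadecimal `0D`), which corresponds to Python 3.13.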

serhiy-storchaka · 2023-11-03T08:13:36Z

Include/unicodeobject.h

 // and subsequent calls will return the same string. The memory is released
 // when the Unicode object is deallocated.
-PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);
+PyAPI_FUNC(const char*) PyUnicode_AsUTF8(PyObject *unicode);


BTW, this function should only be available in the Limited C API 3.13.

vstinner · 2023-11-03T11:08:20Z

I abandon this PR in favor of the opposite approach: add PyUnicode_AsUTF8Safe(), PR #111688.

pythongh-111089: Add PyUnicode_AsUTF8Unsafe() function

cb87653

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) now raises an exception if the string contains embedded null characters.

bedevere-app bot mentioned this pull request Nov 3, 2023

[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

Closed

serhiy-storchaka reviewed Nov 3, 2023


vstinner closed this Nov 3, 2023

vstinner deleted the asutf8_unsafe branch November 3, 2023 11:08

vstinner wants to merge 1 commit into python:main from vstinner:asutf8_unsafe
