Member

serhiy-storchaka commented Dec 22, 2025

Add the unicodedata.iter_graphemes() function to iterate over grapheme clusters according to rules defined in Unicode Standard Annex #29.

Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break() and unicodedata.extended_pictographic() functions to get the properties of the character which are related to the above algorithm.
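A minimal sketch of how the new iterator might be used, assuming an interpreter that already includes this change. The helper name `grapheme_spans` is hypothetical; the `.start` and `.end` attributes and the optional positional start argument are taken from the tests in this PR. On interpreters without `iter_graphemes()` the helper returns `None` instead of failing.

```python
import unicodedata


def grapheme_spans(text, start=0):
    """Return (start, end) index pairs of grapheme clusters, or None if unsupported."""
    # iter_graphemes() is the function added by this PR; older interpreters lack it.
    iter_graphemes = getattr(unicodedata, "iter_graphemes", None)
    if iter_graphemes is None:
        return None
    # Each yielded segment exposes .start and .end, as the PR's tests rely on.
    return [(seg.start, seg.end) for seg in iter_graphemes(text, start)]


text = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT, followed by 'x'
print(grapheme_spans(text))
```

Under the UAX #29 rules, the combining accent should stay in the same cluster as the preceding `e`, so the string splits into two clusters rather than three code points.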

Co-authored-by: Guillaume "Vermeille" Sanchez <guillaume.v.sanchez@gmail.com>
``False`` otherwise.

.. versionadded:: next

Member

The order of functions in this file doesn’t seem to be alphabetical or topical.
I think another ticket should be created to add a quick links table at the top.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can split it into sections by type and order the entries alphabetically within each section.

.. data:: ucd_3_2_0

- This is an object that has the same methods as the entire module, but uses the
+ This is an object that has most of the methods of the entire module, but uses the
Member

This sentence is not fully right, but I can’t find the right suggestion with both «most of» and «same as».

Member

merwok commented Dec 23, 2025

Do these functions help compute width?

@serhiy-storchaka
Member Author

At least two implementations (Perl's Unicode::GCString and the one built into C++) use graphemes. The naive implementation in C's wcwidth() does not work well with complex characters and emoji.
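To illustrate the problem with per-code-point width, here is a rough sketch. `naive_width` is a hypothetical helper written for this example, not `wcwidth()` itself or anything in this PR; it only uses the existing `unicodedata.east_asian_width()` lookup and treats every code point as its own column.

```python
import unicodedata


def naive_width(text):
    """Hypothetical width estimate: 2 columns for Wide/Fullwidth code points, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


accented = "e\u0301"  # one user-perceived character, two code points
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # three emoji joined with ZWJ

print(naive_width(accented))  # the combining mark is counted as a separate column
print(naive_width(family))    # every emoji and every joiner is counted separately
```

A width function that first splits the text into grapheme clusters can assign one width per cluster instead, which is the motivation given in this thread.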

serhiy-storchaka and others added 2 commits December 23, 2025 11:37
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Member

merwok commented Dec 23, 2025

Sorry if my question was not clear.
Are these functions building blocks that will be used to compute character widths, as needed by the other tickets about pyrepl, str justify, etc?
(maybe iter_graphemes?)

@serhiy-storchaka
Member Author

Yes, I think that _PyGraphemeBreak, _Py_InitGraphemeBreak() and _Py_NextGraphemeBreak() will be used in the implementation of unicodedata.width().

self.assertEqual(chunks.pop(), '', line)
input = ''.join(chunks)
with self.subTest(line):
    result = list(unicodedata.iter_graphemes(input))
Member

Did you mean to use the passed ucd argument?

Suggested change
- result = list(unicodedata.iter_graphemes(input))
+ result = list(ucd.iter_graphemes(input))

self.assertEqual([x.start for x in result], breaks[:-1], comment)
self.assertEqual([x.end for x in result], breaks[1:], comment)
for i in range(1, len(breaks) - 1):
    result = list(unicodedata.iter_graphemes(input, breaks[i]))
Member

Suggested change
- result = list(unicodedata.iter_graphemes(input, breaks[i]))
+ result = list(ucd.iter_graphemes(input, breaks[i]))

Same issue as above.

Member Author

No, it is a module-only function.
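The distinction can be checked with a small sketch, assuming the behaviour described in this thread: `ucd_3_2_0` keeps the lookup methods it already had, such as `normalize()`, while `iter_graphemes()` exists only on the module itself. The variable name `legacy` is just for illustration.

```python
import unicodedata

# ucd_3_2_0 exposes most of the module's functions for the Unicode 3.2.0 database.
legacy = unicodedata.ucd_3_2_0

# An existing method that the legacy object shares with the module.
print(hasattr(legacy, "normalize"))

# Per the reply above, the grapheme iterator is not available on the legacy object.
print(hasattr(legacy, "iter_graphemes"))

# On the module, it is present only on interpreters that include this change.
print(hasattr(unicodedata, "iter_graphemes"))
```

This is why the test calls `unicodedata.iter_graphemes()` directly rather than going through the `ucd` argument.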

}


/* XXX Add doc strings. */
Member

Don't the above functions already have docstrings?

hdr = testfile.readline()
return unicodedata.unidata_version in hdr

@requires_resource('network')
Member

Should it not be the urlfetch resource?

Member Author

Maybe. The other test (for normalization) uses the network resource.

serhiy-storchaka enabled auto-merge (squash) January 14, 2026 14:16
serhiy-storchaka merged commit bab1d7a into python:main Jan 14, 2026
47 checks passed
@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x Fedora Stable LTO 3.x (tier-3) has failed when building commit bab1d7a.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1654/builds/1940) and take a look at the build logs.
  4. Check if the failure is related to this commit (bab1d7a) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1654/builds/1940

Summary of the results of the build (if available):

==

Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-fedora-stable-s390x.lto/build/Lib/tempfile.py", line 484, in __del__
    _warnings.warn(self.warn_message, ResourceWarning)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ResourceWarning: Implicitly cleaning up <HTTPError 403: 'Forbidden'>

@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 CentOS9 NoGIL Refleaks 3.x (tier-1) has failed when building commit bab1d7a.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1610/builds/2750) and take a look at the build logs.
  4. Check if the failure is related to this commit (bab1d7a) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1610/builds/2750

Failed tests:

  • test_unicodedata

Test leaking resources:

  • test_unicodedata: memory blocks
  • test_unicodedata: references

Summary of the results of the build (if available):

==

remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 24 (delta 5), reused 5 (delta 5), pack-reused 8 (from 2)        
From https://github.com/python/cpython
 * branch                    main       -> FETCH_HEAD
Note: switching to 'bab1d7a561ab015dd6bb97e255fd12a8ce367edf'.

HEAD is now at bab1d7a561a gh-74902: Add Unicode Grapheme Cluster Break algorithm (GH-143076)
Switched to and reset branch 'main'

configure: WARNING: no system libmpdec found; falling back to pure-Python version for the decimal module

make: *** [Makefile:2503: buildbottest] Error 2

@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 FreeBSD Refleaks 3.x (tier-3) has failed when building commit bab1d7a.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1613/builds/2681) and take a look at the build logs.
  4. Check if the failure is related to this commit (bab1d7a) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1613/builds/2681

Failed tests:

  • test_unicodedata

Test leaking resources:

  • test_unicodedata: memory blocks
  • test_unicodedata: references

Summary of the results of the build (if available):

==

