
Unicode lookup perf#227

Closed
benoitkugler wants to merge 60 commits into main from unicode-lookup-perf

@benoitkugler
Contributor

benoitkugler commented Feb 11, 2026

This PR is a follow-up to #225 and includes three performance improvements:

  • it uses the packTab algorithm (also used by HarfBuzz) to generate better Unicode lookup tables. The resulting tables are both smaller in the binary and faster to query. (Full credit to Behdad for this algorithm, it is quite baffling.)
  • it slightly rewrites the segmenter to avoid applying unnecessary rules
  • it removes some indirection in the HarfBuzz Unicode functions
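
To illustrate why a packed table can beat a range-based one, here is a minimal sketch of a two-level lookup table with shared blocks. It is not the packTab algorithm itself (which, as far as I understand, also picks the number of levels and the bit widths automatically), and the names `twoLevel`, `build`, `lookup` and `isLetter` are hypothetical and do not come from this PR.

```go
package main

import (
	"fmt"
	"unicode"
)

// blockBits is an arbitrary choice for this sketch: 256 code points per block.
const (
	blockBits = 8
	blockSize = 1 << blockBits
)

// twoLevel maps a code point to a small property value through two arrays:
// index selects a block, blocks stores the deduplicated block contents.
type twoLevel struct {
	index  []uint16 // one entry per block of 256 code points
	blocks []uint8  // concatenation of the unique blocks
}

// build compiles a property function into a twoLevel table.
// Identical blocks are stored only once, which is where the size saving comes from.
func build(prop func(rune) uint8) *twoLevel {
	t := &twoLevel{}
	seen := map[[blockSize]uint8]uint16{}
	for start := rune(0); start <= unicode.MaxRune; start += blockSize {
		var blk [blockSize]uint8
		for i := range blk {
			blk[i] = prop(start + rune(i))
		}
		id, ok := seen[blk]
		if !ok {
			id = uint16(len(t.blocks) / blockSize)
			seen[blk] = id
			t.blocks = append(t.blocks, blk[:]...)
		}
		t.index = append(t.index, id)
	}
	return t
}

// lookup costs two array reads, with no binary search over ranges.
func (t *twoLevel) lookup(r rune) uint8 {
	if r < 0 || r > unicode.MaxRune {
		return 0
	}
	block := int(t.index[r>>blockBits])
	return t.blocks[block<<blockBits|int(r&(blockSize-1))]
}

// isLetter is the example property: 1 for letters, 0 otherwise.
func isLetter(r rune) uint8 {
	if unicode.Is(unicode.Letter, r) {
		return 1
	}
	return 0
}

func main() {
	tbl := build(isLetter)
	fmt.Println(tbl.lookup('é'), tbl.lookup('7')) // prints "1 0"
	fmt.Printf("index entries: %d, unique blocks: %d\n",
		len(tbl.index), len(tbl.blocks)/blockSize)
}
```

In a real generator the table would be built ahead of time and emitted as Go source, so only `lookup` would run at shaping time.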

See the two benchmarks below (the comparison is against #225):

goos: linux
goarch: amd64
pkg: github.com/go-text/typesetting/segmenter
cpu: Intel(R) Core(TM) i7-4610M CPU @ 3.00GHz

                                                     old.txt         new.txt
                                                     sec/op          sec/op          vs base
SegmentUnicodeReference-4                            1471.8m ± 1%    271.7m ± 12%    -81.54% (p=0.000 n=10)
Shaping/fa-thelittleprince.txt_-_Amiri-4             141.1m ± 12%    133.2m ± 4%      -5.63% (p=0.002 n=6)
Shaping/fa-thelittleprince.txt_-_NotoNastaliqUrdu-4  519.7m ± 12%    498.3m ± 16%          ~ (p=0.937 n=6)
Shaping/fa-monologue.txt_-_Amiri-4                   24.44m ± 8%     22.85m ± 9%      -6.50% (p=0.004 n=6)
Shaping/fa-monologue.txt_-_NotoNastaliqUrdu-4        87.45m ± 9%     78.91m ± 6%      -9.76% (p=0.002 n=6)
Shaping/en-thelittleprince.txt_-_Roboto-4            89.53m ± 7%     81.36m ± 8%      -9.13% (p=0.026 n=6)
Shaping/en-words.txt_-_Roboto-4                      85.88m ± 12%    81.90m ± 8%           ~ (p=0.065 n=6)
geomean                                              103.2m          96.29m           -6.66%

@benoitkugler
Contributor Author

benoitkugler commented Feb 11, 2026

Side note: several test files in unicodedata contain the old data tables (as *unicode.RangeTable), so that the benchmark BenchmarkLookups is self-contained.

I propose to delete these files once you have reviewed and reproduced the benchmark.
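
For readers who want a feel for this kind of comparison without checking out the branch, here is a rough, self-contained sketch. It is not the BenchmarkLookups from this PR: `flat`, `lookupRangeTable`, `lookupFlat` and `benchLookup` are hypothetical names, and the one-byte-per-code-point array is only a stand-in for the generated tables, which are packed much more tightly.

```go
package main

import (
	"fmt"
	"testing"
	"unicode"
)

// flat is a stand-in for a generated table: one byte per code point.
var flat = func() []uint8 {
	t := make([]uint8, unicode.MaxRune+1)
	for r := rune(0); r <= unicode.MaxRune; r++ {
		if unicode.Is(unicode.Letter, r) {
			t[r] = 1
		}
	}
	return t
}()

// lookupRangeTable is the "old" style: a search over a *unicode.RangeTable.
func lookupRangeTable(r rune) bool { return unicode.Is(unicode.Letter, r) }

// lookupFlat is the "new" style: a direct array read.
// The caller must pass a valid code point (0 to unicode.MaxRune).
func lookupFlat(r rune) bool { return flat[r] != 0 }

// sink keeps the compiler from optimizing the lookups away.
var sink bool

// benchLookup runs f over a cycling sequence of valid code points.
func benchLookup(f func(rune) bool) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = f(rune(i % (unicode.MaxRune + 1)))
		}
	})
}

func main() {
	fmt.Println("RangeTable:", benchLookup(lookupRangeTable))
	fmt.Println("flat table:", benchLookup(lookupFlat))
}
```

The absolute numbers depend heavily on the machine and on the access pattern, so they should not be compared with the figures reported above.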

@andydotxyz
Contributor

I was about to review this too, but realised that it has all the #225 changes as well. Once that has landed, I guess the diff will be readable?

@benoitkugler
Contributor Author

> I was about to review this too, but realised that it has all the #225 changes as well. Once that has landed, I guess the diff will be readable?

Yes, that was precisely my intent.

@benoitkugler
Contributor Author

I've merged #225, so the diff on this branch should be easier to review.

There are still a large number of changes, but many of them are only renames.

@whereswaldon whereswaldon changed the base branch from main to within-word February 25, 2026 12:53
@whereswaldon whereswaldon changed the base branch from within-word to main February 25, 2026 12:53
@whereswaldon
Member

I tried to get the changeset to stop reflecting commits from #225 by changing the PR base branch, but it didn't help. @benoitkugler, would you rebase this on main?

@benoitkugler
Contributor Author

Hmm... rebasing seems quite painful here.
I would rather cherry-pick the new commits. I'll close this PR and re-open it on a new branch (perhaps not today).

@benoitkugler
Contributor Author

Closing in favor of #236


3 participants