OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4)#1104
Draft
krickert wants to merge 2 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
This was referenced Jun 20, 2026
Pull request overview
Adds the next slice of OPENNLP-1850 by introducing a Unicode UAX #29 word segmenter/tokenizer implementation and a new layered normalization “Term” model (Term + TermAnalyzer), plus a language-to-normalization profile registry and the associated Unicode data/license attributions.
Changes:
- Implement UAX #29 word boundary segmentation (WordSegmenter) and a word tokenizer (WordTokenizer) with typed tokens (WordType, WordToken), including Extended_Pictographic support.
- Introduce the layered Term normalization stack (Term, TermAnalyzer, Dimension) and a language-based registry (NormalizationProfile, NormalizationProfiles).
- Add comprehensive JUnit tests (including official Unicode conformance) and update NOTICE/LICENSE/RAT exclusions for bundled Unicode data.
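For readers unfamiliar with word-boundary segmentation, the idea can be sketched with the JDK's own `java.text.BreakIterator`. This is not the PR's WordSegmenter or WordTokenizer, and the JDK rules are not guaranteed to match UAX #29 or the bundled WordBreakProperty data; the class name `WordBoundarySketch` and the "keep segments containing a letter or digit" filter are assumptions made for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: uses the JDK BreakIterator, not the PR's WordSegmenter.
final class WordBoundarySketch {

  // Walks consecutive word boundaries and keeps segments that look like words.
  static List<String> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String segment = text.substring(start, end);
      // Assumption: whitespace and punctuation segments are dropped.
      if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
        out.add(segment);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(words("Hello, world"));
  }
}
```

A segmenter reports every boundary, while a tokenizer decides which segments to keep; the filter in `words` stands in for that second step.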
Reviewed changes
Copilot reviewed 25 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/license/NOTICE.template | Expands Unicode data attribution text for additional bundled UCD/UTS resources. |
| rat-excludes | Excludes newly bundled Unicode data files from RAT header checks. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TermAnalyzerTest.java | Tests for TermAnalyzer layering, ordering, lazy dimensions, and tokenization behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/NormalizationProfilesTest.java | Tests language-to-profile resolution and search analyzer behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/ConfusablesTest.java | Tests confusable skeleton folding behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordTokenizerTest.java | Tests tokenizer output, typed tokens, and max-length chopping behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordSegmenterTest.java | Tests segmentation boundaries on representative UAX #29 cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBreakPropertyTest.java | Tests Word_Break property lookup behavior and edge cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java | Runs the official Unicode WordBreakTest.txt conformance suite against WordSegmenter. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/ExtendedPictographicTest.java | Tests Extended_Pictographic membership checks and bounds safety. |
| opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/tokenize/uax29/ExtendedPictographic.txt | Bundled derived Unicode data for Extended_Pictographic property membership. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TermAnalyzer.java | Implements configurable token segmentation + ordered normalization dimension pipeline. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Term.java | Represents a token with cached/lazy normalization layers. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfiles.java | Registry mapping language codes to normalization/stemming profiles with detection dispatch. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java | Per-language profile record and searchAnalyzer() builder. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java | Javadoc updates aligning Dimension docs with the new Term/TermAnalyzer model. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordType.java | Adds token categorization for downstream handling (scripts, numeric, emoji, etc.). |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java | Implements UAX #29-based word tokenization with spans and optional typed streaming. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordToken.java | Typed token record (span + type) produced by WordTokenizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordSegmenter.java | Implements the UAX #29 word boundary algorithm with fast-path transition tables. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java | Loads and looks up Unicode Word_Break property values from bundled data. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreak.java | Enum for Word_Break property values + parser for property names in the data file. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java | Loads Extended_Pictographic membership from bundled data for WB3c behavior. |
| NOTICE | Updates top-level NOTICE with expanded Unicode attribution details. |
| LICENSE | Updates top-level LICENSE to include Unicode License V3 applicability for added data files. |
Builds on the normalization foundation.
- opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic data, validated against the official WordBreakTest conformance suite (1944/1944).
- The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per token over the Dimension ladder, the per-language NormalizationProfile registry, and the confusable-fold coverage.
- Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE, rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest data files, and restores Dimension's javadoc cross-links now that the Term layer is present.
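The "lazy, cached layers over an ordered ladder" idea behind the Term model might look roughly like the sketch below. It does not reproduce the PR's Term, TermAnalyzer, or Dimension classes: `LayeredTermSketch`, the layer names `nfkc` and `lowercase`, and the rule that each layer builds on the one before it are all assumptions chosen to illustrate the pattern.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a token with lazily computed, cached normalization layers.
final class LayeredTermSketch {
  private final String surface;
  private final Map<String, UnaryOperator<String>> ladder; // insertion-ordered
  private final Map<String, String> cache = new HashMap<>();
  int computations = 0; // exposed only to make the laziness observable

  LayeredTermSketch(String surface, Map<String, UnaryOperator<String>> ladder) {
    this.surface = surface;
    this.ladder = new LinkedHashMap<>(ladder);
  }

  // Returns the requested layer, computing only the rungs up to it, once each.
  String layer(String name) {
    if (!ladder.containsKey(name)) {
      throw new IllegalArgumentException("Unknown layer: " + name);
    }
    String value = surface;
    for (Map.Entry<String, UnaryOperator<String>> rung : ladder.entrySet()) {
      String cached = cache.get(rung.getKey());
      if (cached == null) {
        cached = rung.getValue().apply(value);
        cache.put(rung.getKey(), cached);
        computations++;
      }
      value = cached;
      if (rung.getKey().equals(name)) {
        break;
      }
    }
    return value;
  }

  // Assumed two-rung ladder: Unicode NFKC folding, then locale-neutral lowercasing.
  static LayeredTermSketch of(String surface) {
    Map<String, UnaryOperator<String>> ladder = new LinkedHashMap<>();
    ladder.put("nfkc", s -> Normalizer.normalize(s, Normalizer.Form.NFKC));
    ladder.put("lowercase", s -> s.toLowerCase(Locale.ROOT));
    return new LayeredTermSketch(surface, ladder);
  }

  public static void main(String[] args) {
    LayeredTermSketch term = of("\uFF28ELLO"); // fullwidth H followed by "ELLO"
    System.out.println(term.layer("lowercase"));
  }
}
```

Caching per rung means that asking for a higher layer after a lower one reuses the earlier result instead of renormalizing the surface form.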
- WordBoundaryConformanceTest: guard the conformance resource stream with Objects.requireNonNull and a clear message instead of an opaque NPE in InputStreamReader, and remove the unused NO_BOUNDARY constant.
- NormalizationProfiles.forLanguage: fail loud on a null language argument at the public entry point, with a null-rejection test.
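The two guards described above follow a common fail-fast pattern, sketched here under assumptions: `FailLoudSketch`, `openResource`, the message strings, and the lowercasing in `forLanguage` are hypothetical and are not taken from the PR's code. Only `Objects.requireNonNull` and `Class.getResourceAsStream` are real JDK calls.

```java
import java.io.InputStream;
import java.util.Locale;
import java.util.Objects;

// Hypothetical sketch of the fail-loud guards; not the PR's actual code.
final class FailLoudSketch {

  // getResourceAsStream returns null for a missing resource, so check it
  // here rather than letting a later reader constructor throw a bare NPE.
  static InputStream openResource(String name) {
    return Objects.requireNonNull(
        FailLoudSketch.class.getResourceAsStream(name),
        () -> "Missing classpath resource: " + name);
  }

  // Rejects null at the public entry point with a descriptive message.
  static String forLanguage(String language) {
    Objects.requireNonNull(language, "language must not be null");
    return language.toLowerCase(Locale.ROOT); // placeholder for a real lookup
  }

  public static void main(String[] args) {
    try {
      forLanguage(null);
    } catch (NullPointerException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The exception type is still NullPointerException in both cases; what changes is that the message names the missing resource or argument. A caller of `openResource` remains responsible for closing the returned stream.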
dab5605 to 67c922a
Part 2/4 of OPENNLP-1850. Stacked on the foundation branch (base is OPENNLP-1850-1-foundation, so the diff is only this slice).
This slice adds the UAX #29 word segmenter and Tokenizer implementation with bundled WordBreakProperty/ExtendedPictographic data (WordBreakTest conformance 1944/1944), the layered Term model (Term, TermAnalyzer), the NormalizationProfile registry, and the Unicode License V3 attribution for the WordBreak data.