OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) by krickert · Pull Request #1104 · apache/opennlp

krickert · 2026-06-20T12:36:49Z

Part 2/4 of OPENNLP-1850. Stacked on the foundation branch (base is OPENNLP-1850-1-foundation, so the diff is only this slice).

Adds the UAX #29 word segmenter and a Tokenizer implementation with bundled WordBreakProperty/ExtendedPictographic data (WordBreakTest conformance 1944/1944), the layered Term model (Term, TermAnalyzer), the NormalizationProfile registry, and the Unicode License V3 attribution for the bundled word-break data.

krickert · 2026-06-20T12:37:30Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

- #1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
- #1104 (2/4): UAX #29 word tokenizer and the layered Term model
- #1105 (3/4): Offset-safe input normalization in the DL components
- #1106 (4/4): Documentation of Unicode normalization and the UAX #29 tokenizer

Supersedes #1101.

Copilot

Pull request overview

Adds the next slice of OPENNLP-1850 by introducing a Unicode UAX #29 word segmenter/tokenizer implementation and a new layered normalization “Term” model (Term + TermAnalyzer), plus a language-to-normalization profile registry and the associated Unicode data/license attributions.

Changes:

- Implement UAX #29 word boundary segmentation (WordSegmenter) and a word tokenizer (WordTokenizer) with typed tokens (WordType, WordToken), including Extended_Pictographic support.
- Introduce the layered Term normalization stack (Term, TermAnalyzer, Dimension) and a language-based registry (NormalizationProfile, NormalizationProfiles).
- Add comprehensive JUnit tests (including official Unicode conformance) and update NOTICE/LICENSE/RAT exclusions for bundled Unicode data.
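For readers unfamiliar with the conformance data: each line of Unicode's `WordBreakTest.txt` lists hex code points separated by `÷` (boundary expected) or `×` (no boundary), followed by a `#` comment. The following is a minimal sketch of how such a line could be turned into a test case. The class `WordBreakTestLine` and its `Case` record are hypothetical names used for illustration only; they are not taken from this PR, and the PR's `WordBoundaryConformanceTest` may parse the file differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the PR): parses one line of WordBreakTest.txt. */
final class WordBreakTestLine {

  /** The sample text plus the UTF-16 offsets at which a word boundary is expected. */
  record Case(String text, List<Integer> boundaries) {}

  static Case parse(String line) {
    // Everything after '#' is a human-readable comment.
    int hash = line.indexOf('#');
    String data = (hash >= 0 ? line.substring(0, hash) : line).trim();

    StringBuilder text = new StringBuilder();
    List<Integer> boundaries = new ArrayList<>();
    for (String field : data.split("\\s+")) {
      if (field.equals("\u00F7")) {
        // ÷ marks an expected boundary at the current offset.
        boundaries.add(text.length());
      } else if (field.equals("\u00D7")) {
        // × marks a position where no boundary is allowed; nothing to record.
      } else if (!field.isEmpty()) {
        // Any other field is a hex code point such as 0061 or 1F1E6.
        text.appendCodePoint(Integer.parseInt(field, 16));
      }
    }
    return new Case(text.toString(), boundaries);
  }
}
```

A conformance check would then compare `boundaries` with the offsets reported by the segmenter under test for `text`.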

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
src/license/NOTICE.template	Expands Unicode data attribution text for additional bundled UCD/UTS resources.
rat-excludes	Excludes newly bundled Unicode data files from RAT header checks.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TermAnalyzerTest.java	Tests for `TermAnalyzer` layering, ordering, lazy dimensions, and tokenization behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/NormalizationProfilesTest.java	Tests language-to-profile resolution and search analyzer behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/ConfusablesTest.java	Tests confusable skeleton folding behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordTokenizerTest.java	Tests tokenizer output, typed tokens, and max-length chopping behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordSegmenterTest.java	Tests segmentation boundaries on representative UAX #29 cases.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBreakPropertyTest.java	Tests Word_Break property lookup behavior and edge cases.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java	Runs the official Unicode `WordBreakTest.txt` conformance suite against `WordSegmenter`.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/ExtendedPictographicTest.java	Tests Extended_Pictographic membership checks and bounds safety.
opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/tokenize/uax29/ExtendedPictographic.txt	Bundled derived Unicode data for Extended_Pictographic property membership.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TermAnalyzer.java	Implements configurable token segmentation + ordered normalization dimension pipeline.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Term.java	Represents a token with cached/lazy normalization layers.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfiles.java	Registry mapping language codes to normalization/stemming profiles with detection dispatch.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java	Per-language profile record and `searchAnalyzer()` builder.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java	Javadoc updates aligning Dimension docs with the new `Term`/`TermAnalyzer` model.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordType.java	Adds token categorization for downstream handling (scripts, numeric, emoji, etc.).
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java	Implements UAX #29-based word tokenization with spans and optional typed streaming.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordToken.java	Typed token record (span + type) produced by `WordTokenizer`.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordSegmenter.java	Implements the UAX #29 word boundary algorithm with fast-path transition tables.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java	Loads and looks up Unicode Word_Break property values from bundled data.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreak.java	Enum for Word_Break property values + parser for property names in the data file.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java	Loads Extended_Pictographic membership from bundled data for WB3c behavior.
NOTICE	Updates top-level NOTICE with expanded Unicode attribution details.
LICENSE	Updates top-level LICENSE to include Unicode License V3 applicability for added data files.
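The summary above describes `Term` as a token with cached/lazy normalization layers. As an illustration of that general pattern only, here is a small sketch. `LazyTerm`, its constructor, and the `at` method are hypothetical and do not reflect the actual `Term` or `TermAnalyzer` API in this PR; the sketch also assumes that each layer is computed from the one before it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a token whose normalization layers are computed on demand. */
final class LazyTerm {
  private final String surface;
  // Insertion-ordered: each layer is assumed to build on the previous one.
  private final Map<String, UnaryOperator<String>> layers;
  private final Map<String, String> cache = new LinkedHashMap<>();

  LazyTerm(String surface, Map<String, UnaryOperator<String>> layers) {
    this.surface = surface;
    this.layers = new LinkedHashMap<>(layers);
  }

  String surface() {
    return surface;
  }

  /** Returns the text at the named layer, computing and caching earlier layers as needed. */
  String at(String layerName) {
    if (!layers.containsKey(layerName)) {
      throw new IllegalArgumentException("Unknown layer: " + layerName);
    }
    String current = surface;
    for (Map.Entry<String, UnaryOperator<String>> e : layers.entrySet()) {
      String previous = current;
      // Compute this layer only if it has not been requested before.
      current = cache.computeIfAbsent(e.getKey(), k -> e.getValue().apply(previous));
      if (e.getKey().equals(layerName)) {
        return current;
      }
    }
    throw new AssertionError("unreachable: layer presence was checked above");
  }

  /** Number of layers computed so far; useful for checking laziness in tests. */
  int computedLayers() {
    return cache.size();
  }
}
```

Layers that are never requested are never computed, so a caller that only needs the surface form pays nothing for the deeper ones.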


Builds on the normalization foundation.

- opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic data, validated against the official WordBreakTest conformance suite (1944/1944).
- The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per token over the Dimension ladder, the per-language NormalizationProfile registry, and the confusable-fold coverage.
- Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE, rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest data files, and restores Dimension's javadoc cross-links now that the Term layer is present.
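To give a feel for what word-boundary tokenization with spans produces, here is a sketch that uses the JDK's `java.text.BreakIterator` as a stand-in. It is not the PR's `WordSegmenter` or `WordTokenizer`, and the JDK's rules may differ from the Unicode version bundled here. `JdkWordSpans` and its `Span` record are hypothetical names, and the "keep segments containing a letter or digit" filter is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: word spans via the JDK's BreakIterator, not the PR's segmenter. */
final class JdkWordSpans {

  /** Half-open [start, end) offsets into the original text. */
  record Span(int start, int end) {}

  static List<Span> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);

    List<Span> spans = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      // Segments between boundaries include whitespace and punctuation;
      // keep only those that contain at least one letter or digit.
      boolean wordLike = text.substring(start, end)
          .codePoints()
          .anyMatch(Character::isLetterOrDigit);
      if (wordLike) {
        spans.add(new Span(start, end));
      }
    }
    return spans;
  }
}
```

Because the spans index into the unmodified input, the original text for each token can always be recovered with `text.substring(span.start(), span.end())`.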

- WordBoundaryConformanceTest: guard the conformance resource stream with Objects.requireNonNull and a clear message instead of an opaque NPE in InputStreamReader, and remove the unused NO_BOUNDARY constant.
- NormalizationProfiles.forLanguage: fail loud on a null language argument at the public entry point, with a null-rejection test.
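The two fixes in this commit follow a common fail-loud pattern, sketched below with JDK calls only. The class `FailLoud` and the methods `openResource` and `requireLanguage` are hypothetical names for illustration; the actual code in `WordBoundaryConformanceTest` and `NormalizationProfiles.forLanguage` is not shown on this page and may look different.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

/** Hypothetical helpers showing the null-guard pattern; not code from the PR. */
final class FailLoud {

  /** Opens a classpath resource, naming the missing resource if it cannot be found. */
  static BufferedReader openResource(Class<?> anchor, String name) {
    // getResourceAsStream returns null for a missing resource. Without the guard,
    // the failure would surface later as an unexplained NPE inside InputStreamReader.
    InputStream in = Objects.requireNonNull(
        anchor.getResourceAsStream(name),
        () -> "Missing classpath resource: " + name);
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }

  /** Rejects a null argument at the public entry point instead of deeper in the call chain. */
  static String requireLanguage(String language) {
    return Objects.requireNonNull(language, "language must not be null");
  }
}
```

In both cases the exception is still a `NullPointerException`, but the message identifies the resource or argument at fault.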

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57

krickert requested review from mawiesne and rzo1 June 20, 2026 14:58

Copilot AI reviewed Jun 20, 2026


krickert added 2 commits June 20, 2026 16:14

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a June 20, 2026 20:16

krickert wants to merge 2 commits into OPENNLP-1850-1-foundation from OPENNLP-1850-2-tokenizer
