OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Draft
krickert wants to merge 2 commits into OPENNLP-1850-1-foundation from OPENNLP-1850-2-tokenizer

@krickert

Part 2/4 of OPENNLP-1850. Stacked on the foundation branch (base is OPENNLP-1850-1-foundation, so the diff is only this slice).

Adds the UAX #29 word segmenter and Tokenizer implementation with bundled WordBreakProperty/ExtendedPictographic data (conformance 1944/1944), the layered Term model (Term, TermAnalyzer), the NormalizationProfile registry, and the Unicode License V3 attribution for the bundled WordBreak data.
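For readers who have not worked with UAX #29 word boundaries, the idea can be illustrated with the JDK's `java.text.BreakIterator`. This is only a stand-in: it is not the WordSegmenter or WordTokenizer from this PR (their API is not shown in this conversation), the JDK's rules may differ from the conformance-tested behavior here, and the class name `WordBoundarySketch` and the letter-or-digit filter are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only: uses the JDK BreakIterator, not this PR's WordSegmenter.
final class WordBoundarySketch {

    /** Returns the segments between word boundaries that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption for this sketch: drop whitespace and punctuation-only segments.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world")); // prints [Hello, world]
    }
}
```

A segmenter reports every boundary; deciding which segments count as tokens (here, anything with a letter or digit) is a separate policy layered on top.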

@krickert

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. #1103: Unicode normalization foundation (CharClass engine, rungs, Dimension) (1/4)
  2. #1104: UAX #29 word tokenizer + layered Term model (2/4)
  3. #1105: Offset-safe input normalization in the DL components (3/4)
  4. #1106: Documentation of Unicode normalization and the UAX #29 tokenizer (4/4)

Supersedes #1101.

Copilot AI left a comment

Pull request overview

Adds the next slice of OPENNLP-1850 by introducing a Unicode UAX #29 word segmenter/tokenizer implementation and a new layered normalization “Term” model (Term + TermAnalyzer), plus a language-to-normalization profile registry and the associated Unicode data/license attributions.

Changes:

  • Implement UAX #29 word boundary segmentation (WordSegmenter) and a word tokenizer (WordTokenizer) with typed tokens (WordType, WordToken), including Extended_Pictographic support.
  • Introduce the layered Term normalization stack (Term, TermAnalyzer, Dimension) and a language-based registry (NormalizationProfile, NormalizationProfiles).
  • Add comprehensive JUnit tests (including official Unicode conformance) and update NOTICE/LICENSE/RAT exclusions for bundled Unicode data.
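The "cached/lazy normalization layers" idea behind the Term model can be sketched roughly as follows. This is not the Term class from this PR: `LazyTermSketch`, its two layers (NFKC, then lowercase), and the `computations` counter are hypothetical names chosen for illustration, and the sketch is not thread-safe.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of a token whose normalization layers are computed on demand.
final class LazyTermSketch {
    private final String surface;
    private String nfkc;    // filled in on first call to nfkc()
    private String folded;  // filled in on first call to folded()
    int computations = 0;   // counts how many layers were actually computed

    LazyTermSketch(String surface) {
        this.surface = surface;
    }

    /** The original token text, never modified. */
    String surface() {
        return surface;
    }

    /** First layer: NFKC compatibility normalization, computed once. */
    String nfkc() {
        if (nfkc == null) {
            nfkc = Normalizer.normalize(surface, Normalizer.Form.NFKC);
            computations++;
        }
        return nfkc;
    }

    /** Second layer: lowercase of the NFKC layer, computed once. */
    String folded() {
        if (folded == null) {
            folded = nfkc().toLowerCase(Locale.ROOT);
            computations++;
        }
        return folded;
    }

    public static void main(String[] args) {
        LazyTermSketch term = new LazyTermSketch("\uFF21\uFF22\uFF23"); // fullwidth "ABC"
        System.out.println(term.folded());      // prints abc
        System.out.println(term.computations);  // prints 2
    }
}
```

Each layer builds on the one below it, so a caller that only needs the surface form pays nothing, and repeated reads of a deeper layer reuse the cached value.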

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/license/NOTICE.template | Expands Unicode data attribution text for additional bundled UCD/UTS resources. |
| rat-excludes | Excludes newly bundled Unicode data files from RAT header checks. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TermAnalyzerTest.java | Tests for TermAnalyzer layering, ordering, lazy dimensions, and tokenization behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/NormalizationProfilesTest.java | Tests language-to-profile resolution and search analyzer behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/ConfusablesTest.java | Tests confusable skeleton folding behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordTokenizerTest.java | Tests tokenizer output, typed tokens, and max-length chopping behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordSegmenterTest.java | Tests segmentation boundaries on representative UAX #29 cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBreakPropertyTest.java | Tests Word_Break property lookup behavior and edge cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java | Runs the official Unicode WordBreakTest.txt conformance suite against WordSegmenter. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/ExtendedPictographicTest.java | Tests Extended_Pictographic membership checks and bounds safety. |
| opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/tokenize/uax29/ExtendedPictographic.txt | Bundled derived Unicode data for Extended_Pictographic property membership. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TermAnalyzer.java | Implements configurable token segmentation + ordered normalization dimension pipeline. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Term.java | Represents a token with cached/lazy normalization layers. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfiles.java | Registry mapping language codes to normalization/stemming profiles with detection dispatch. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java | Per-language profile record and searchAnalyzer() builder. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java | Javadoc updates aligning Dimension docs with the new Term/TermAnalyzer model. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordType.java | Adds token categorization for downstream handling (scripts, numeric, emoji, etc.). |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java | Implements UAX #29-based word tokenization with spans and optional typed streaming. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordToken.java | Typed token record (span + type) produced by WordTokenizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordSegmenter.java | Implements the UAX #29 word boundary algorithm with fast-path transition tables. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java | Loads and looks up Unicode Word_Break property values from bundled data. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreak.java | Enum for Word_Break property values + parser for property names in the data file. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java | Loads Extended_Pictographic membership from bundled data for WB3c behavior. |
| NOTICE | Updates top-level NOTICE with expanded Unicode attribution details. |
| LICENSE | Updates top-level LICENSE to include Unicode License V3 applicability for added data files. |
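For context on the bundled data files, Unicode property files such as WordBreakProperty.txt use a line format of the shape `0041..005A ; ALetter # comment`, with a single code point or a `first..last` range before the semicolon. A minimal parsing sketch is below; it is not the loader in WordBreakProperty.java or ExtendedPictographic.java, and `UcdLineSketch` and its `Range` record are hypothetical names used only for illustration.

```java
import java.util.Optional;

// Hypothetical sketch of parsing one line of a Unicode property data file.
final class UcdLineSketch {

    /** An inclusive code point range and the property value assigned to it. */
    record Range(int first, int last, String property) {}

    /** Parses one line; returns empty for blank or comment-only lines. */
    static Optional<Range> parse(String line) {
        // Everything after '#' is a comment.
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return Optional.empty();
        }
        String[] fields = data.split(";");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected 'codepoints ; property': " + line);
        }
        String codePoints = fields[0].trim();
        String property = fields[1].trim();
        // Either a single hex code point or a "first..last" range.
        int dots = codePoints.indexOf("..");
        int first = Integer.parseInt(dots >= 0 ? codePoints.substring(0, dots) : codePoints, 16);
        int last = dots >= 0 ? Integer.parseInt(codePoints.substring(dots + 2), 16) : first;
        return Optional.of(new Range(first, last, property));
    }

    public static void main(String[] args) {
        System.out.println(parse("0041..005A    ; ALetter # LATIN CAPITAL LETTER A..Z"));
    }
}
```

Stripping the comment before looking for `..` matters, because the comment text itself often contains a `..` between character names.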

krickert added 2 commits June 20, 2026 16:14
Builds on the normalization foundation.

- opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer
  implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary
  engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic
  data, validated against the official WordBreakTest conformance suite (1944/1944).
- The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per
  token over the Dimension ladder, the per-language NormalizationProfile registry,
  and the confusable-fold coverage.
- Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE,
  rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest
  data files, and restores Dimension's javadoc cross-links now that the Term
  layer is present.
- WordBoundaryConformanceTest: guard the conformance resource stream with
  Objects.requireNonNull and a clear message instead of an opaque NPE in
  InputStreamReader, and remove the unused NO_BOUNDARY constant.
- NormalizationProfiles.forLanguage: fail loud on a null language argument at the
  public entry point, with a null-rejection test.
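The resource-stream guard described in the commit message follows a common pattern, sketched here under assumptions: `ResourceGuardSketch`, its `open` method, and the message wording are hypothetical and are not the code in WordBoundaryConformanceTest. Only standard JDK calls are used.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical sketch: fail with a clear message when a classpath resource is missing.
final class ResourceGuardSketch {

    /** Opens a classpath resource as UTF-8 text, or throws a descriptive NullPointerException. */
    static BufferedReader open(String resourcePath) {
        // getResourceAsStream returns null when the resource is absent; check it here
        // so the failure names the resource instead of surfacing inside InputStreamReader.
        InputStream in = Objects.requireNonNull(
            ResourceGuardSketch.class.getResourceAsStream(resourcePath),
            () -> "Missing classpath resource: " + resourcePath);
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        try {
            open("/no/such/resource.txt");
        } catch (NullPointerException e) {
            System.out.println(e.getMessage()); // prints Missing classpath resource: /no/such/resource.txt
        }
    }
}
```

The same `Objects.requireNonNull` call with a message is the usual way to reject a null argument at a public entry point, as the `forLanguage` change describes.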
krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a June 20, 2026 20:16