OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) by krickert · Pull Request #1106 · apache/opennlp

krickert · 2026-06-20T12:36:52Z

Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.

krickert · 2026-06-20T12:37:31Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

#1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
#1104 (2/4): UAX #29 word tokenizer and the layered Term model
#1105 (3/4): Offset-safe input normalization in the DL components
#1106 (4/4): Document Unicode normalization and the UAX #29 tokenizer (documentation)

Supersedes #1101.

Copilot

Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.
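
For readers unfamiliar with the two ideas the chapters cover, here is a minimal sketch of Unicode normalization followed by word segmentation. It uses only JDK classes (java.text.Normalizer and java.text.BreakIterator) as stand-ins; it is not the OpenNLP normalizer or UAX #29 tokenizer API added in this PR stack, and the class name UnicodeTokenSketch is made up for illustration. The JDK word iterator only approximates UAX #29 word boundaries.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of OpenNLP.
public class UnicodeTokenSketch {

  /** Applies NFC so a decomposed sequence such as "e" + U+0301 becomes the single code point U+00E9. */
  static String normalize(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC);
  }

  /** Splits text at word boundaries and keeps only segments that contain a letter or digit. */
  static List<String> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String segment = text.substring(start, end);
      // Boundary segments made of whitespace or punctuation are skipped.
      if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
        out.add(segment);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // "Cafe" followed by a combining acute accent: five chars before NFC, four after.
    String raw = "Cafe\u0301 opens at nine";
    System.out.println(words(normalize(raw)));
  }
}
```

Normalizing before segmenting means that visually identical inputs produce identical tokens, which is the motivation for the Unicode preprocessing guidance in the tokenizer chapter.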

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

File	Description
opennlp-docs/src/docbkx/tokenizer.xml	Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs.
opennlp-docs/src/docbkx/opennlp.xml	Includes the new normalizer chapter in the book build.
opennlp-docs/src/docbkx/normalizer.xml	New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data.
opennlp-docs/src/docbkx/namefinder.xml	Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options.
opennlp-docs/src/docbkx/introduction.xml	Links DL inference Unicode handling to the normalizer documentation.
opennlp-docs/src/docbkx/doccat.xml	Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates.

Adds the normalizer manual chapter and updates the tokenizer, doccat, namefinder, and introduction chapters (and the master opennlp.xml) to cover the new normalization pipeline and word tokenizer.

Addresses Copilot review of the documentation examples.
- namefinder.xml: the ONNX examples now define ids2Labels and the SentenceDetector, drop the unused categories map, and use the offset-safe findInOriginal(...) (the deprecated find(...) is no longer shown); the normalization example is now self-contained.
- doccat.xml: the ONNX examples now use the current DocumentCategorizerDL constructor (categories + ClassificationScoringStrategy + InferenceOptions) and construct AverageClassificationScoringStrategy inline instead of referencing an undefined scoringStrategy.
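
To illustrate what "offset-safe" means here, the following is a minimal sketch, not the findInOriginal(...) implementation from this PR stack. It assumes a simple normalization (collapsing runs of Unicode whitespace to a single ASCII space) and records, for every normalized character, the index of the original character it came from, so that a span located in the normalized text can be reported against the original input. The names OffsetSafeSketch, Normalized, and originalSpan are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of OpenNLP.
public class OffsetSafeSketch {

  /** Normalized text plus, for each normalized char, the index of the original char it came from. */
  record Normalized(String text, int[] toOriginal) {

    /** Maps a non-empty [start, end) span in the normalized text to a [start, end) span in the original. */
    int[] originalSpan(int start, int end) {
      if (start < 0 || end > toOriginal.length || start >= end) {
        throw new IllegalArgumentException("span must be non-empty and within the normalized text");
      }
      // The exclusive end is one past the original position of the last covered char.
      return new int[] {toOriginal[start], toOriginal[end - 1] + 1};
    }
  }

  /** Collapses each run of whitespace (including U+00A0) to one ASCII space, tracking offsets. */
  static Normalized normalize(String original) {
    StringBuilder sb = new StringBuilder();
    int[] map = new int[original.length()];
    boolean previousWasSpace = false;
    for (int i = 0; i < original.length(); i++) {
      char c = original.charAt(i);
      boolean space = Character.isWhitespace(c) || Character.isSpaceChar(c);
      if (space && previousWasSpace) {
        continue; // drop the extra whitespace char; no normalized position is created for it
      }
      map[sb.length()] = i;
      sb.append(space ? ' ' : c);
      previousWasSpace = space;
    }
    return new Normalized(sb.toString(), Arrays.copyOf(map, sb.length()));
  }

  public static void main(String[] args) {
    String original = "John \u00A0 Smith lives here";
    Normalized n = normalize(original);
    int start = n.text().indexOf("John Smith");
    int[] span = n.originalSpan(start, start + "John Smith".length());
    // The reported span indexes the original string, including its extra whitespace.
    System.out.println(original.substring(span[0], span[1]));
  }
}
```

Without such a map, a span found after normalization would point at the wrong characters whenever normalization changes the text length.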

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103

Draft

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Draft

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57

krickert requested review from mawiesne and rzo1 June 20, 2026 14:58

Copilot AI reviewed Jun 20, 2026

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml

krickert added 2 commits June 20, 2026 16:15

OPENNLP-1850 Document Unicode normalization and the UAX #29 tokenizer

0ba03e4

krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 on June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 on June 20, 2026 20:16

krickert wants to merge 2 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs

2 participants
