OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106
Draft
krickert wants to merge 2 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
This was referenced Jun 20, 2026
Pull request overview
Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.
Changes:
- Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
- Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
- Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.
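For readers unfamiliar with the two preprocessing steps these chapters document, here is a minimal sketch using only JDK classes. It does not use the OpenNLP normalizer or UAX #29 tokenizer APIs added in this PR series, whose class names are not shown here; `java.text.Normalizer` and `java.text.BreakIterator` stand in for them as an assumption, and the JDK's word boundaries are not guaranteed to match the documented tokenizer. The class and method names (`UnicodePreprocessingSketch`, `normalize`, `words`) are hypothetical.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: JDK stand-ins, not the OpenNLP classes documented in this PR.
public class UnicodePreprocessingSketch {

    /** Applies NFC so that canonically equivalent character sequences compare equal. */
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /**
     * Splits text at the word boundaries reported by the JDK and keeps only
     * segments that contain at least one letter or digit (drops whitespace
     * and punctuation-only segments).
     */
    static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent) composes to U+00E9 under NFC.
        String decomposed = "Cafe\u0301 ouvert";
        String composed = normalize(decomposed);
        System.out.println(decomposed.length() + " -> " + composed.length());
        System.out.println(words(composed));
    }
}
```

Normalizing before segmenting keeps a base letter and its combining mark from being treated as separate units downstream, which is the kind of ordering concern the tokenizer chapter's preprocessing guidance addresses.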
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| opennlp-docs/src/docbkx/tokenizer.xml | Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs. |
| opennlp-docs/src/docbkx/opennlp.xml | Includes the new normalizer chapter in the book build. |
| opennlp-docs/src/docbkx/normalizer.xml | New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data. |
| opennlp-docs/src/docbkx/namefinder.xml | Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options. |
| opennlp-docs/src/docbkx/introduction.xml | Links DL inference Unicode handling to the normalizer documentation. |
| opennlp-docs/src/docbkx/doccat.xml | Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates. |
Adds the normalizer manual chapter and updates the tokenizer, doccat, namefinder, and introduction chapters (and the master opennlp.xml) to cover the new normalization pipeline and word tokenizer.
Addresses Copilot review of the documentation examples.
- namefinder.xml: the ONNX examples now define ids2Labels and the SentenceDetector, drop the unused categories map, and use the offset-safe findInOriginal(...) (the deprecated find(...) is no longer shown); the normalization example is made self-contained.
- doccat.xml: update the ONNX examples to the current DocumentCategorizerDL constructor (categories + ClassificationScoringStrategy + InferenceOptions) and construct AverageClassificationScoringStrategy inline instead of referencing an undefined scoringStrategy.
1c17110 to 8534bb3
3037db7 to 9a71f28
Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.