OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106

Draft
krickert wants to merge 2 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs

@krickert

Part 4/4 of OPENNLP-1850. Adds a new normalizer manual chapter, updates the tokenizer, doccat, namefinder, and introduction chapters, and includes the new chapter in the master opennlp.xml.

@krickert

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. #1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. #1104 (2/4): UAX #29 word tokenizer and the layered Term model
  3. #1105 (3/4): Offset-safe input normalization in the DL components
  4. #1106 (4/4): Documentation for Unicode normalization and the UAX #29 tokenizer

Supersedes #1101.

Copilot AI left a comment

Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

  • Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
  • Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
  • Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.
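
For readers unfamiliar with the two ideas these chapters cover, here is a minimal sketch using only the JDK. It is not the API added in this PR series: `java.text.Normalizer` stands in for the normalization step, and `java.text.BreakIterator` stands in for the word segmenter (its word boundaries are in the spirit of UAX #29 but are not guaranteed to match the tokenizer documented here). The class name `UnicodeSketch` and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: JDK stand-ins, not the OpenNLP normalizer or tokenizer. */
public class UnicodeSketch {

  /** Composes to NFC so canonically equivalent strings become identical. */
  public static String nfc(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFC);
  }

  /** Splits at the JDK's word boundaries and keeps segments containing a letter or digit. */
  public static List<String> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String segment = text.substring(start, end);
      // Drop pure whitespace and punctuation segments.
      if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
        out.add(segment);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // "e" followed by U+0301 (combining acute) is composed to U+00E9 before segmentation.
    String raw = "Cafe\u0301 re\u0301sume\u0301, please.";
    System.out.println(words(nfc(raw)));
  }
}
```

Normalizing before segmenting means the downstream component sees one representation of each character, which is the kind of preprocessing guidance the tokenizer chapter is described as adding.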

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| opennlp-docs/src/docbkx/tokenizer.xml | Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs. |
| opennlp-docs/src/docbkx/opennlp.xml | Includes the new normalizer chapter in the book build. |
| opennlp-docs/src/docbkx/normalizer.xml | New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data. |
| opennlp-docs/src/docbkx/namefinder.xml | Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options. |
| opennlp-docs/src/docbkx/introduction.xml | Links DL inference Unicode handling to the normalizer documentation. |
| opennlp-docs/src/docbkx/doccat.xml | Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates. |

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated
Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated
Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated
Comment thread opennlp-docs/src/docbkx/doccat.xml
krickert added 2 commits June 20, 2026 16:15
Adds the normalizer manual chapter and updates the tokenizer, doccat, namefinder,
and introduction chapters (and the master opennlp.xml) to cover the new
normalization pipeline and word tokenizer.

Addresses Copilot review of the documentation examples.

- namefinder.xml: the ONNX examples now define ids2Labels and the SentenceDetector,
  drop the unused categories map, and use the offset-safe findInOriginal(...) (the
  deprecated find(...) is no longer shown); the normalization example is made
  self-contained.
- doccat.xml: update the ONNX examples to the current DocumentCategorizerDL
  constructor (categories + ClassificationScoringStrategy + InferenceOptions) and
  construct AverageClassificationScoringStrategy inline instead of referencing an
  undefined scoringStrategy.
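
The commit message calls `findInOriginal(...)` offset-safe but does not show how that works. As a rough illustration of one common way to keep offsets valid, the sketch below normalizes a string while recording, for every normalized character, its index in the original, so a span found in the normalized text can be mapped back. Everything here is an assumption for illustration: `OffsetMapSketch`, `toOriginal`, and `originalSpan` are hypothetical names, whitespace collapsing is a stand-in for real normalization, and none of it is taken from the OpenNLP implementation.

```java
import java.util.Arrays;

/** Hypothetical sketch: whitespace-collapsing normalization with an offset map. */
public class OffsetMapSketch {

  /** The normalized text: every run of whitespace becomes a single space. */
  public final String normalized;

  /** toOriginal[i] is the index in the original string of normalized char i. */
  public final int[] toOriginal;

  public OffsetMapSketch(String original) {
    StringBuilder sb = new StringBuilder();
    int[] map = new int[original.length()];
    boolean inWhitespace = false;
    for (int i = 0; i < original.length(); i++) {
      char c = original.charAt(i);
      if (Character.isWhitespace(c)) {
        if (!inWhitespace) {
          // The emitted space maps to the first character of the whitespace run.
          map[sb.length()] = i;
          sb.append(' ');
        }
        inWhitespace = true;
      } else {
        map[sb.length()] = i;
        sb.append(c);
        inWhitespace = false;
      }
    }
    this.normalized = sb.toString();
    this.toOriginal = Arrays.copyOf(map, sb.length());
  }

  /** Maps a non-empty span [start, end) in the normalized text to [start, end) in the original. */
  public int[] originalSpan(int start, int end) {
    return new int[] {toOriginal[start], toOriginal[end - 1] + 1};
  }

  public static void main(String[] args) {
    String original = "New   York\tCity";
    OffsetMapSketch m = new OffsetMapSketch(original);
    int start = m.normalized.indexOf("York");
    int[] span = m.originalSpan(start, start + "York".length());
    // The mapped span indexes into the original, not the normalized, string.
    System.out.println(original.substring(span[0], span[1]));
  }
}
```

The design point is that a model can run on the normalized text while callers still receive spans that index into the text they supplied.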
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 Compare June 20, 2026 20:16