OPENNLP-1850: Improve Whitespace UTF normalization by krickert · Pull Request #1101 · apache/opennlp

krickert · 2026-06-19T17:42:56Z

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

…decoding NameFinderDL only decoded B-PER/I-PER and put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically and harden it:

- Any B-<TYPE> begins a span whose type is the label minus the B- prefix (B-ORG -> ORG), extending while the following labels are I-<same type>. Span.getType() now reports the entity label (PER, ORG, LOC, ...) and ids2Labels fully drives recognition for any BIO-tagged model.
- isBeginLabel() requires a non-empty type after "B-", so a malformed "B-" label no longer starts an empty-type span. An argmax index with no entry in ids2Labels fails loudly instead of being silently skipped.
- Span.getProb() is now a numerically stable softmax over the token's label scores (bounded to [0,1]) instead of the raw max logit; handles +Inf, all-(-Inf) and NaN edge cases.
- find() inference is fail-loud and consistent with the sibling DocumentCategorizerDL: failures surface as IllegalStateException (cause preserved) and an unexpected/empty model-output shape is its own loud failure, rather than a bare RuntimeException or raw ClassCastException.
- Floor the character-search cursor at each sentence's start (via sentPosDetect) and thread it forward across that sentence's chunks, so a repeated entity surface form is located at its own occurrence instead of being re-matched against an earlier one -- which previously emitted duplicate or mis-located spans for multi-sentence/multi-chunk input.
- Span text reconstruction matches the source with flexible whitespace (\s*), so entities whose wordpiece tokenization splits internal punctuation or "&" apart (U.S.A, AT&T) are still located instead of silently dropped.
- Remove the now-unused SpanEnd record.
- Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and expose labelProbability()/maxIndex() for unit testing without an ONNX model; add NameFinderDLTest coverage for entity types, bounded and edge-case probabilities, malformed begin labels, wordpiece reconstruction, internal-punctuation and case-insensitive matching, missing labels, and cursor-threaded span location.
- Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new all-types output (the George-Washington input now yields PER + LOC) and assert span types and covered text.
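As a rough illustration of the generic BIO decode and the bounded probability described in this commit, here is a minimal sketch. It is not the NameFinderDL source: the class name, the `TypedSpan` record, the method signatures, and the values returned for the NaN, all-(-Inf) and +Inf cases are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; names echo the commit message, bodies are assumptions. */
final class BioDecodeSketch {

  /** Token-index span [start, end) tagged with its entity type (hypothetical type). */
  record TypedSpan(int start, int end, String type) {}

  /** True only for "B-<TYPE>" with a non-empty type, so a bare "B-" is rejected. */
  static boolean isBeginLabel(String label) {
    return label != null && label.startsWith("B-") && label.length() > 2;
  }

  /** Groups B-X followed by I-X labels into spans; orphan I- labels start nothing. */
  static List<TypedSpan> decodeSpans(String[] labels) {
    List<TypedSpan> spans = new ArrayList<>();
    int i = 0;
    while (i < labels.length) {
      if (!isBeginLabel(labels[i])) {
        i++;
        continue;
      }
      String type = labels[i].substring(2);
      int end = i + 1;
      while (end < labels.length && ("I-" + type).equals(labels[end])) {
        end++;
      }
      spans.add(new TypedSpan(i, end, type));
      i = end;
    }
    return spans;
  }

  /** Softmax probability of the arg-max score, shifted by the max for stability. */
  static double labelProbability(double[] scores) {
    double max = Double.NEGATIVE_INFINITY;
    for (double s : scores) {
      if (Double.isNaN(s)) {
        return 0.0; // assumption: a NaN score yields probability 0
      }
      max = Math.max(max, s);
    }
    if (max == Double.NEGATIVE_INFINITY) {
      return 0.0; // assumption: empty or all-(-Inf) input yields 0
    }
    if (max == Double.POSITIVE_INFINITY) {
      int ties = 0;
      for (double s : scores) {
        if (s == Double.POSITIVE_INFINITY) {
          ties++;
        }
      }
      return 1.0 / ties; // assumption: +Inf scores share the mass equally
    }
    double sum = 0.0;
    for (double s : scores) {
      sum += Math.exp(s - max); // every exponent is <= 0, so no overflow
    }
    return 1.0 / sum; // exp(max - max) / sum, always in (0, 1]
  }

  public static void main(String[] args) {
    System.out.println(decodeSpans(new String[] {"B-PER", "I-PER", "O", "B-LOC"}));
    System.out.println(labelProbability(new double[] {1000.0, 1000.0}));
  }
}
```

Subtracting the maximum before exponentiating is what keeps large logits such as 1000.0 from overflowing to infinity.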

Keep unmapped label ids graceful, bound decoded span lookup to the current sentence, add diagnostics for unlocated decoded spans, and tighten exception types/messages plus helper documentation.

…d labels

Make the public no-space token constants immutable (Set.of instead of mutable arrays) while keeping them public for third-party use. Fail loud on an unmapped model output index: predictLabel now throws an IllegalStateException naming the index instead of degrading the token to "O", and the constructors document that ids2Labels must be exhaustive over the model's output indices. Also document the IllegalArgumentException that find() can raise on a vocabulary/model mismatch.

Add edge-case decoding tests: token/score count mismatch, orphan I- labels, adjacent entities of different types, multi-token minimum-probability semantics, repeated entities at distinct offsets within one call, regex metacharacters in span text, and search-start clamping past end of text.
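The span-location cases tested here (regex metacharacters in span text, repeated entities at distinct offsets, search-start clamping) can be pictured with a small sketch of the flexible-whitespace matching from the earlier commit. The class and method names below are hypothetical and the body is an assumption, not the NameFinderDL implementation.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: locate wordpiece-split entity text in the source string. */
final class SpanLocatorSketch {

  /**
   * Returns {start, end} of the first occurrence of the pieces at or after
   * {@code from}, allowing optional whitespace between pieces, or null if absent.
   */
  static int[] locate(String text, List<String> pieces, int from) {
    StringBuilder regex = new StringBuilder();
    for (String piece : pieces) {
      if (regex.length() > 0) {
        regex.append("\\s*"); // pieces may or may not be separated in the source
      }
      regex.append(Pattern.quote(piece)); // ".", "&", "(" etc. match literally
    }
    Matcher m = Pattern
        .compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
        .matcher(text);
    // Clamp the cursor so a start past the end of the text cannot throw.
    int start = Math.max(0, Math.min(from, text.length()));
    return m.find(start) ? new int[] {m.start(), m.end()} : null;
  }

  public static void main(String[] args) {
    String text = "AT&T sued AT&T.";
    int[] first = locate(text, List.of("AT", "&", "T"), 0);
    // Threading the cursor forward finds the second occurrence, not the first again.
    int[] second = locate(text, List.of("AT", "&", "T"), first[1]);
    System.out.println(first[0] + ".." + first[1] + ", " + second[0] + ".." + second[1]);
  }
}
```

Passing the previous match end as `from` is the cursor threading the commit describes: a repeated surface form is matched at its own occurrence instead of the earlier one.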

…and tests

Co-authored-by: Junie <junie@jetbrains.com>
Signed-off-by: Kristian Rickert <krickert@gmail.com>

… the TextNormalizer pipeline, and offset-preserving TextAnalyzer

Quote, digit, decimal, invisible-control, ellipsis, and bullet normalizers, all reusing the cursor-based CharClass engine (O(1) membership, no regex). TextNormalizer is a fluent builder that composes the rungs into an AggregateCharSequenceNormalizer, with a conservative searchDefault() chain. TextAnalyzer/AnalyzedToken tokenize and normalize per token while keeping each token's source span, the offset-preserving building block for BM25 matching.

…components

InferenceOptions gains setNormalizeWhitespace and setNormalizeDashes (both off by default). When enabled, NameFinderDL and DocumentCategorizerDL fold input whitespace and/or dashes to their ASCII forms before inference via a shared AbstractDL.normalizeInput helper. The mapping is one code point to one ASCII character, so it is offset preserving for the Basic Multilingual Plane and any spans the model produces still align with the input.
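A minimal sketch of why a one-code-point-to-one-ASCII-character fold keeps offsets aligned. The class, its methods, and the choice of `Character.isSpaceChar` and the `DASH_PUNCTUATION` category as the fold sets are assumptions for illustration; the PR's own UnicodeWhitespace and UnicodeDash tables may cover different code points.

```java
/** Illustrative only: fold Unicode spaces and dashes to ASCII, one code point at a time. */
final class AsciiFoldSketch {

  /** Maps a single code point to its ASCII stand-in, or returns it unchanged. */
  static int fold(int codePoint) {
    if (Character.isSpaceChar(codePoint)) {
      return ' '; // Zs/Zl/Zp, e.g. U+00A0 NO-BREAK SPACE
    }
    if (Character.getType(codePoint) == Character.DASH_PUNCTUATION) {
      return '-'; // Pd, e.g. U+2013 EN DASH
    }
    return codePoint;
  }

  /** Applies the fold to every code point; BMP input keeps its length. */
  static String normalize(CharSequence input) {
    StringBuilder out = new StringBuilder(input.length());
    input.codePoints().forEach(cp -> out.appendCodePoint(fold(cp)));
    return out.toString();
  }

  public static void main(String[] args) {
    String in = "New\u00A0York\u2013based";
    String out = normalize(in);
    // Same length, so a span computed on 'out' indexes the same characters in 'in'.
    System.out.println(out + " " + (in.length() == out.length()));
  }
}
```

The Basic Multilingual Plane caveat follows from the encoding: a folded supplementary code point occupies two UTF-16 chars in the input but one in the output, so offsets after it would shift by one.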

Add a Text Normalization chapter to the developer manual covering the normalizer family, the TextNormalizer pipeline, script-gated diacritic folding and its multilingual safety, the CharClass engine and user-defined code point sets, offset-preserving analysis, and the Unicode reference data.

Brings in actions/checkout v7.0.0 CI updates. NameFinderDL conflicts resolved in favor of the 1850 Unicode cursor matcher and chunking.

Additive Unicode text handling for matching, search, and tokenization preprocessing (new types only, no breaking changes).

UAX #29 word tokenizer (opennlp.tools.tokenize.uax29):
- WordSegmenter, WordTokenizer (implements opennlp.tools.tokenize.Tokenizer), and WordType. A single-pass, table-driven engine with O(1) Word_Break lookups and no regular expression; 100% conformant on the official Unicode 17.0 WordBreakTest suite (1944/1944). Offset-preserving spans and a zero-allocation streaming API.

Text normalization (opennlp.tools.util.normalizer):
- The layered Term model (Dimension, Term, TermAnalyzer): a token as a stack of normalization layers (NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma) with eager configured layers, lazy memoized extras, and O(1) peel; integrates the UAX #29 tokenizer and the existing Stemmer/Lemmatizer as the token-level layers.
- Confusable (homoglyph) skeleton folding per UTS #39, from the bundled Unicode security data.
- Per-language profiles (NormalizationProfile, NormalizationProfiles) mirroring the Snowball algorithm set with LanguageDetector fallback, including a German DIN 5007-2 umlaut fold (a-umlaut to ae, eszett to ss).
- First-class builder configuration: whitespace/dash fold targets, locale case folding, accent-fold script scope, and max token length, over a general transform(dimension, normalizer) hook.

Documentation: a Text Normalization chapter and a UAX #29 tokenizer section in the manual; the bundled Unicode data files (WordBreakProperty, emoji-data, WordBreakTest, confusables) are attributed in NOTICE.

Tests: UAX #29 boundary conformance and unit tests, and unit tests for the normalizer engine, term model, confusables, language profiles, and German fold.
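To make the German fold concrete, and to show why a length-changing fold needs an offset map, here is a small sketch. It is not the PR's profile code: the class, the `Folded` record and the uppercase mappings are assumptions, and the `int[]` merely stands in for whatever OffsetMap actually records.

```java
import java.util.Arrays;

/** Illustrative only: DIN 5007-2 style fold that remembers where each output char came from. */
final class GermanFoldSketch {

  /** Folded text plus, for each output char, the index of its source char (hypothetical type). */
  record Folded(String text, int[] sourceIndex) {}

  static Folded fold(String s) {
    StringBuilder out = new StringBuilder(s.length() + 8);
    int[] map = new int[s.length() * 2]; // each char expands to at most two
    for (int i = 0; i < s.length(); i++) {
      String r = replacement(s.charAt(i));
      for (int k = 0; k < r.length(); k++) {
        map[out.length()] = i; // both chars of an expansion point at the same source char
        out.append(r.charAt(k));
      }
    }
    return new Folded(out.toString(), Arrays.copyOf(map, out.length()));
  }

  private static String replacement(char c) {
    return switch (c) {
      case '\u00E4' -> "ae";
      case '\u00F6' -> "oe";
      case '\u00FC' -> "ue";
      case '\u00C4' -> "Ae"; // assumption: uppercase umlauts keep an initial capital
      case '\u00D6' -> "Oe";
      case '\u00DC' -> "Ue";
      case '\u00DF' -> "ss";
      default -> String.valueOf(c);
    };
  }

  public static void main(String[] args) {
    Folded f = fold("Stra\u00DFe");
    System.out.println(f.text() + " " + Arrays.toString(f.sourceIndex()));
  }
}
```

Unlike the whitespace and dash folds, this one changes the string length, so a match found in the folded text has to be mapped back through the index array before it can be reported against the original.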

rzo1 · 2026-06-19T18:16:56Z

I know it is currently a draft state, but here are some thoughts after an intermediate pass, mostly around licensing, architecture and reviewability.

Licensing / release plumbing

The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt and the WordBreakTest.txt under test) are all under the Unicode license, which is ASF Category A, so the content itself is fine to ship. A few things still need fixing before a release build is happy though:

The Unicode attribution was added to the generated NOTICE. It needs to go into src/license/NOTICE.template instead, otherwise it gets dropped the next time NOTICE is regenerated (same as we did for the stopword lists and the spellchecker in OPENNLP-1832).
rat-excludes is not updated for the four bundled .txt files. The default build hides this, but the apache-release profile runs RAT with the excludes, and RAT will not recognize the Unicode header. Please add the four paths with an OPENNLP-1850 comment.
LICENSE should carry the actual Unicode license text, the way it already does for the bundled stopword lists. The newer Unicode files only link to terms_of_use.html in their header rather than embedding the text, so a URL in NOTICE is not enough on its own.
ExtendedPictographic.txt is described in NOTICE as an unmodified emoji-data.txt, but it is actually a filtered subset: only the Extended_Pictographic property is kept (451 lines), and the file was renamed. The upstream emoji-data.txt carries six properties. The wording should say it is derived from emoji-data.txt by extracting that one property, not "unmodified".

Architecture

The bigger thing I would like to align on before this freezes as public API: there are now several overlapping ways to normalize text living next to each other. We already had the CharSequenceNormalizer family (Aggregate, Shrink, Number, Url, Twitter, Emoji). This PR adds about 14 more of those, plus a TextNormalizer builder over them, plus a TextAnalyzer/AnalyzedToken offset model in opennlp-api, plus a separate TermAnalyzer/Term/Dimension layered model in opennlp-runtime. The same rungs (nfc, nfkc, whitespace, dash, case fold, accent fold) end up declared in three places that can drift. TextAnalyzer and TermAnalyzer also do basically the same job (tokenize, normalize per token, keep the source span) but with different names and different tokenizers. I think we should pick one token-analysis entry point and one place that defines the steps before this ships, otherwise unifying it later is a breaking change.

Related: the api vs runtime split is worth a second look. The classes placed in opennlp-api are concrete algorithms plus Unicode tables, not contracts, and the only cross-module consumer is opennlp-dl (which uses CharClass and already depends on opennlp-runtime). The feature is currently split across both modules for no clear reason.

The DL changes (InferenceOptions normalize flags, AbstractDL.normalizeInput, and the NameFinderDL/DocumentCategorizerDL span decode rework) are good, but note these are behavioral changes to existing components rather than purely additive, so they deserve their own focused review.

Reviewability

This is around 70 files and 20k+ lines covering four fairly independent subsystems, which makes it hard to review as one unit. Would you be open to splitting it into stacked PRs under an OPENNLP-1850 umbrella, roughly: (1) the Unicode primitives in opennlp-api, (2) the UAX 29 tokenizer plus its data files and the rat/license plumbing, (3) the new CharSequenceNormalizer rungs, (4) the Term/Dimension layered model plus confusables, (5) the DL integration, and (6) the docs? That would let the foundational pieces go in quickly and give the contested parts (the overlapping abstractions, and the DL behavior change) the attention they need on their own.

Happy to help with the NOTICE/rat/LICENSE fixes if useful.

krickert · 2026-06-19T19:25:24Z

Here's what I'll do now before splitting up -

Move the portions out of the API
Work on the obvious parts pointed out that need to land regardless
Focus first on the API - that will help us triage how to split up the work

…nalysis on Term

- Relocate CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash (and their tests) from opennlp-api to opennlp-runtime, so the API keeps only contracts (CharSequenceNormalizer) and table-free value types (NormalizedText, OffsetMap).
- opennlp-dl now depends on opennlp-runtime (compile) for the CharClass it uses to chunk input (AbstractDL).
- Delete TextAnalyzer/AnalyzedToken: TermAnalyzer/Term is the single token-analysis entry point; original/normalized/span are read from Term. Manual updated.

Each character-level Dimension now carries its default CharSequenceNormalizer (resolved lazily via a Supplier, so the confusables table is not loaded on enum init). TermAnalyzer drops its parallel defaultTransforms() map and reads the default from the dimension (builder overrides still win); TextNormalizer's nfc, nfkc, whitespace, dash, case-fold, and accent-fold methods delegate to the same source instead of re-listing the normalizers. The shared rungs are now defined once. TextNormalizer-only cleanup steps (quotes, digits, ellipsis, bullets, strip-invisible) stay standalone.
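The "default resolved lazily via a Supplier" idea can be sketched as below. This is a hedged illustration, not the PR's Dimension enum: `Dim`, `Lazy`, the `UnaryOperator<String>` stand-in for CharSequenceNormalizer, and the one-character confusable mapping are all assumptions.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

/** Illustrative only: enum constants that own a lazily built default normalizer. */
final class LazyDimensionSketch {

  /** Counts how often the (pretend) expensive table is built. */
  static int tableLoads = 0;

  /** Memoizing supplier: runs the initializer on first get(), then caches the result. */
  static final class Lazy<T> implements Supplier<T> {
    private Supplier<T> init;
    private T value;

    Lazy(Supplier<T> init) {
      this.init = init;
    }

    @Override
    public synchronized T get() {
      if (init != null) {
        value = init.get();
        init = null; // release the initializer once resolved
      }
      return value;
    }
  }

  enum Dim {
    NFC(() -> s -> Normalizer.normalize(s, Normalizer.Form.NFC)),
    CASE_FOLD(() -> s -> s.toLowerCase(Locale.ROOT)),
    CONFUSABLE(() -> {
      tableLoads++; // stands in for loading the confusables data
      return s -> s.replace('\u0430', 'a'); // Cyrillic small a to Latin a, as a toy mapping
    });

    private final Lazy<UnaryOperator<String>> defaultNormalizer;

    Dim(Supplier<UnaryOperator<String>> init) {
      this.defaultNormalizer = new Lazy<>(init);
    }

    /** The single definition both an analyzer and a builder could read from. */
    UnaryOperator<String> defaultNormalizer() {
      return defaultNormalizer.get();
    }
  }
}
```

Initializing the enum only stores the suppliers; the confusable initializer runs on the first `Dim.CONFUSABLE.defaultNormalizer()` call and never again, which is the property that keeps the table from being loaded just by touching the enum.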

The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt, and the WordBreakTest.txt test fixture) ship under the Unicode License V3 (ASF Category A). Make the release plumbing reflect that:

- Add the Unicode attribution to src/license/NOTICE.template so it survives NOTICE regeneration; it previously lived only in the generated NOTICE.
- Embed the full Unicode License V3 text in LICENSE, as is already done for the bundled stopword lists. The newer Unicode headers only link to terms_of_use.html rather than embedding the text, so the NOTICE link alone is not enough.
- Exclude the four bundled .txt files in rat-excludes so apache-release RAT does not flag their non-Apache headers.
- Correct the ExtendedPictographic.txt description: it is a filtered subset of emoji-data.txt (only the Extended_Pictographic property, renamed), not an unmodified copy.

krickert · 2026-06-20T00:24:08Z

Thanks for the thorough pass.
Below I touch on every point.

It's all pushed now (through 5ab7f873), and I did a first pass over it. If you are OK with it, I can do 2 smaller PRs on top of this (I think that's easiest, but if you want it broken down further I can figure it out).

Licensing / release plumbing - done

NOTICE.template, not just NOTICE. The Unicode attribution now lives in
src/license/NOTICE.template, so it survives regeneration (same as the stopword /
spellchecker entries). The generated NOTICE is updated to match.
rat-excludes. Added the four bundled .txt paths under an OPENNLP-1850 comment
(WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt, and the
WordBreakTest.txt test fixture).
LICENSE carries the actual text. Embedded the full Unicode License V3 in LICENSE,
the way the stopword lists embed their BSD text - since the newer Unicode headers only
link to terms_of_use.html rather than inlining it, a URL in NOTICE wasn't enough on
its own.
ExtendedPictographic.txt wording. Corrected - it's described as derived from
emoji-data.txt by keeping only the Extended_Pictographic property (451 lines,
renamed), not an unmodified copy. The wording also notes the upstream file carries the
other five emoji properties that aren't retained.

Architecture

One token-analysis entry point, one place that defines the steps - done.
This wasn't bad at all, it also shrinks the surface:

TextAnalyzer / AnalyzedToken are deleted. TermAnalyzer / Term is the single
entry point - Term.original() / normalized() / span() give you exactly what
AnalyzedToken did, on the UAX #29 tokenizer instead of a second one.
Each Dimension now carries its own default CharSequenceNormalizer (lazily, so the
confusables table isn't loaded just by touching the enum). TermAnalyzer dropped its
parallel defaultTransforms() map, and TextNormalizer's nfc / nfkc / whitespace /
dash / case-fold / accent-fold rungs now read from Dimension. The six shared rungs are
defined once. (TextNormalizer's standalone cleanup steps - quotes, digits, ellipsis,
bullets, strip-invisible - I left as-is for now since they already live in one place;
promoting them to Dimensions is an easy follow-up if you'd rather everything be
layer-able. LMK)

api vs runtime split is completed. opennlp-api/util/normalizer is now exactly three files:
CharSequenceNormalizer (the contract) plus the table-free value types NormalizedText
and OffsetMap. Everything with data or an engine - CharClass, CodePointSet,
UnicodeWhitespace, UnicodeDash, the whole normalizer family - moved down to
opennlp-runtime. Nothing table-shaped is left in the API. One consequence to sanity-check:
AbstractDL uses CharClass for input chunking, so opennlp-dl now compile-depends on
opennlp-runtime (it was test-scope before). That's the minimal fix; if you'd rather DL
stay api-thin, the alternative is to inject the normalizer through InferenceOptions. Just LMK.
I don't have an opinion in either direction on that one.

DL changes are behavioral. The InferenceOptions flags, normalizeInput, and
the span-decode rework are behavioral changes to existing components, not purely additive, so
I'll go ahead and make that a separate review. The reason they arrived together: spans have to point into the original text, not the normalized buffer. The whitespace/dash fold maps one code point to one ASCII character, so it is offset preserving for the Basic Multilingual Plane, and NameFinderDL spans can only drift where the fold changes the input length. Both flags are off by default, so the normalization itself changes nothing unless it is enabled.

Reviewability splitting it up

Rather than stack onto the in-flight PRs (this is self-contained and would just tangle with those), I'd split this along its natural dependency seam into two stacks:

Stack 1 - normalization foundation: the API contracts + the runtime normalizer engine
(CharClass, the normalizer family, Dimension, TextNormalizer, the offset model) + the
DL opt-ins.
Stack 2 - UAX #29 tokenizer + advanced: the tokenizer and its boundary/perf engine (where
the lookup tables live), the layered Term model, confusable folding (UTS #39), and the
per-language profiles.

That contains the perf/lookup-table conversation entirely in Stack 2 and lets the
foundation land clean (the DL-behavior conversation travels with the DL opt-ins in Stack 1).
Your six-way split would also work; I leaned to two because the seam
between foundation and tokenizer is the one real dependency boundary, and the rest are more
"commits within a stack" than independently-landable units.

The speed framing (for context, not a competition)

Numbers from a JMH harness (2 forks, warmed up), compared against Lucene 10.3.2
StandardTokenizer and today's SimpleTokenizer on a Latin corpus (Mchars/s):

| tokenizer | Mchars/s | vs OpenNLP today | vs Lucene 10 |
| --- | --- | --- | --- |
| OpenNLP `SimpleTokenizer` (today) | 282 | 1.00× | 0.88× |
| Lucene 10.3.2 `StandardTokenizer` | 320 | 1.13× | 1.00× |
| new: boundary scan | 388 | 1.38× | 1.21× |
| new: streaming tokenize | 335 | 1.19× | 1.05× |

The boundary output is byte-identical before and after the perf work — 1944/1944 on the official
UAX #29 conformance suite throughout; the transition table is derived from the readable rule
cascade at class-load, so the rules stay the source of truth. The public surface (the Tokenizer
impl, the streaming handler, the Term / Dimension layering) is independent of the engine
internals, so we can retune or swap the table without touching the interface. The Lucene
head-to-head benchmark stays out of the repo (it'd pull a Lucene dependency -
I don't think we should but I was just curious), but I'll share the harness in a temporary commit.

OK to cut the two stacks?

krickert and others added 10 commits June 16, 2026 09:15

OPENNLP-1846 - Address NameFinderDL review feedback

a3c423a

Keep unmapped label ids graceful, bound decoded span lookup to the current sentence, add diagnostics for unlocated decoded spans, and tighten exception types/messages plus helper documentation.

Merge branch 'OPENNLP-1846' into OPENNLP-1850_Whitespace-UTF-Normalizae

d17eb84

OPENNLP-1850 - Add robust character sequence normalization utilities …

0d53e31

…and tests Co-authored-by: Junie <junie@jetbrains.com> Signed-off-by: Kristian Rickert <krickert@gmail.com>

Merge upstream/main into OPENNLP-1850_Whitespace-UTF-Normalizae

4b9d5cb

Brings in actions/checkout v7.0.0 CI updates. NameFinderDL conflicts resolved in favor of the 1850 Unicode cursor matcher and chunking.

mawiesne changed the title ~~Opennlp 1850 whitespace utf normalizae~~ OPENNLP-1850: Improve Whitespace UTF normalization Jun 19, 2026

krickert added 3 commits June 19, 2026 16:48

krickert wants to merge 13 commits into main from OPENNLP-1850_Whitespace-UTF-Normalizae
