
OPENNLP-1850: Improve Whitespace UTF normalization#1101

Draft
krickert wants to merge 13 commits into main from OPENNLP-1850_Whitespace-UTF-Normalizae

Conversation

@krickert

Contributor

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

krickert and others added 10 commits June 16, 2026 09:15
…decoding

NameFinderDL only decoded B-PER/I-PER and put the matched text in
Span.getType() instead of the entity label. Decode the BIO sequence
generically and harden it:

- Any B-<TYPE> begins a span whose type is the label minus the B- prefix
  (B-ORG -> ORG), extending while the following labels are I-<same type>.
  Span.getType() now reports the entity label (PER, ORG, LOC, ...) and
  ids2Labels fully drives recognition for any BIO-tagged model.
- isBeginLabel() requires a non-empty type after "B-", so a malformed "B-"
  label no longer starts an empty-type span. An argmax index with no entry
  in ids2Labels fails loudly instead of being silently skipped.
- Span.getProb() is now a numerically stable softmax over the token's label
  scores (bounded to [0,1]) instead of the raw max logit; handles +Inf,
  all-(-Inf) and NaN edge cases.
- find() inference is fail-loud and consistent with the sibling
  DocumentCategorizerDL: failures surface as IllegalStateException (cause
  preserved) and an unexpected/empty model-output shape is its own loud
  failure, rather than a bare RuntimeException or raw ClassCastException.
- Floor the character-search cursor at each sentence's start (via
  sentPosDetect) and thread it forward across that sentence's chunks, so a
  repeated entity surface form is located at its own occurrence instead of
  being re-matched against an earlier one -- which previously emitted
  duplicate or mis-located spans for multi-sentence/multi-chunk input.
- Span text reconstruction matches the source with flexible whitespace
  (\s*), so entities whose wordpiece tokenization splits internal
  punctuation or "&" apart (U.S.A, AT&T) are still located instead of
  silently dropped.
- Remove the now-unused SpanEnd record.
- Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and
  expose labelProbability()/maxIndex() for unit testing without an ONNX
  model; add NameFinderDLTest coverage for entity types, bounded and
  edge-case probabilities, malformed begin labels, wordpiece
  reconstruction, internal-punctuation and case-insensitive matching,
  missing labels, and cursor-threaded span location.
- Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new
  all-types output (the George-Washington input now yields PER + LOC) and
  assert span types and covered text.
Keep unmapped label ids graceful, bound decoded span lookup to the current sentence, add diagnostics for unlocated decoded spans, and tighten exception types/messages plus helper documentation.
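
The span-location behavior described in these commits (flexible whitespace between wordpieces, a forward-moving search cursor, and safe handling of regex metacharacters) might look roughly like the sketch below. `SpanLocatorSketch` and `locate` are hypothetical names, and the pieces are assumed to have had any wordpiece prefix already stripped; the actual matcher in the PR may differ.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpanLocatorSketch {

  /**
   * Finds the pieces in text at or after cursor, allowing any amount of
   * whitespace (including none) between consecutive pieces.
   * Returns {start, end} character offsets, or null when nothing matches.
   */
  static int[] locate(String text, List<String> pieces, int cursor) {
    StringBuilder regex = new StringBuilder();
    for (String piece : pieces) {
      if (regex.length() > 0) {
        regex.append("\\s*");
      }
      regex.append(Pattern.quote(piece)); // neutralizes metacharacters such as '.'
    }
    Matcher m = Pattern
        .compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
        .matcher(text);
    int from = Math.min(Math.max(cursor, 0), text.length()); // clamp the search start
    return m.find(from) ? new int[] {m.start(), m.end()} : null;
  }
}
```

Calling `locate` a second time with the previous match's end as the cursor returns the next occurrence instead of re-matching the first one.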
…d labels

Make the public no-space token constants immutable (Set.of instead of
mutable arrays) while keeping them public for third-party use.

Fail loud on an unmapped model output index: predictLabel now throws an
IllegalStateException naming the index instead of degrading the token to
"O", and the constructors document that ids2Labels must be exhaustive over
the model's output indices. Also document the IllegalArgumentException that
find() can raise on a vocabulary/model mismatch.

Add edge-case decoding tests: token/score count mismatch, orphan I- labels,
adjacent entities of different types, multi-token minimum-probability
semantics, repeated entities at distinct offsets within one call, regex
metacharacters in span text, and search-start clamping past end of text.
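
For reviewers who want the decoding rules in one place, here is a minimal sketch of generic BIO decoding as the commit messages describe it. It works on already-predicted label strings; `BioDecodeSketch` and `TypedSpan` are hypothetical stand-ins for the PR's `decodeSpans()` and `Span`, and `isBeginLabel` only borrows the name of the method mentioned above, with an assumed signature.

```java
import java.util.ArrayList;
import java.util.List;

final class BioDecodeSketch {

  /** Token-index span [start, end) plus its entity type. */
  record TypedSpan(int start, int end, String type) {}

  static boolean isBeginLabel(String label) {
    return label.startsWith("B-") && label.length() > 2; // a bare "B-" has no type
  }

  static List<TypedSpan> decode(List<String> labels) {
    List<TypedSpan> spans = new ArrayList<>();
    int i = 0;
    while (i < labels.size()) {
      String label = labels.get(i);
      if (!isBeginLabel(label)) {
        i++; // "O", an orphan "I-..." and a malformed "B-" never open a span
        continue;
      }
      String type = label.substring(2); // B-ORG -> ORG
      int end = i + 1;
      while (end < labels.size() && labels.get(end).equals("I-" + type)) {
        end++; // extend only across I- labels of the same type
      }
      spans.add(new TypedSpan(i, end, type));
      i = end;
    }
    return spans;
  }
}
```

For the labels `B-PER, I-PER, O, I-LOC, B-ORG, B-LOC, I-LOC, B-` this yields three spans: PER over tokens 0-1, ORG over token 4, and LOC over tokens 5-6.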
…and tests

Co-authored-by: Junie <junie@jetbrains.com>
Signed-off-by: Kristian Rickert <krickert@gmail.com>
… the TextNormalizer pipeline, and offset-preserving TextAnalyzer

Quote, digit, decimal, invisible-control, ellipsis, and bullet normalizers,
all reusing the cursor-based CharClass engine (O(1) membership, no regex).
TextNormalizer is a fluent builder that composes the rungs into an
AggregateCharSequenceNormalizer, with a conservative searchDefault() chain.
TextAnalyzer/AnalyzedToken tokenize and normalize per token while keeping each
token's source span, the offset-preserving building block for BM25 matching.
…components

InferenceOptions gains setNormalizeWhitespace and setNormalizeDashes (both off
by default). When enabled, NameFinderDL and DocumentCategorizerDL fold input
whitespace and/or dashes to their ASCII forms before inference via a shared
AbstractDL.normalizeInput helper. The mapping is one code point to one ASCII
character, so it is offset preserving for the Basic Multilingual Plane and any
spans the model produces still align with the input.
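
To illustrate why a one-code-point-to-one-ASCII-character mapping keeps offsets stable in the Basic Multilingual Plane, here is a small sketch. It is not `AbstractDL.normalizeInput`: the class name and the handful of code points are illustrative assumptions, and the PR's real tables are larger.

```java
final class AsciiFoldSketch {

  /** Maps a small, illustrative subset of Unicode spaces and dashes to ASCII. */
  static int fold(int codePoint) {
    switch (codePoint) {
      case 0x00A0: // NO-BREAK SPACE
      case 0x2003: // EM SPACE
      case 0x3000: // IDEOGRAPHIC SPACE
        return ' ';
      case 0x2010: // HYPHEN
      case 0x2013: // EN DASH
      case 0x2014: // EM DASH
        return '-';
      default:
        return codePoint;
    }
  }

  /** Folds code point by code point; unmapped code points are copied unchanged. */
  static String normalize(CharSequence input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int cp = Character.codePointAt(input, i);
      out.appendCodePoint(fold(cp));
      i += Character.charCount(cp);
    }
    return out.toString();
  }
}
```

Every mapped code point here is a single Java `char` and so is its replacement, so the string length and all character offsets are unchanged. A supplementary code point occupies two `char`s, so folding one to a single ASCII character would shorten the string, which is the reason for the BMP caveat.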
Add a Text Normalization chapter to the developer manual covering the
normalizer family, the TextNormalizer pipeline, script-gated diacritic folding
and its multilingual safety, the CharClass engine and user-defined code point
sets, offset-preserving analysis, and the Unicode reference data.
Brings in actions/checkout v7.0.0 CI updates. NameFinderDL conflicts
resolved in favor of the 1850 Unicode cursor matcher and chunking.
Additive Unicode text handling for matching, search, and tokenization
preprocessing (new types only, no breaking changes).

UAX #29 word tokenizer (opennlp.tools.tokenize.uax29):
- WordSegmenter, WordTokenizer (implements opennlp.tools.tokenize.Tokenizer),
  and WordType. A single-pass, table-driven engine with O(1) Word_Break lookups
  and no regular expression; 100% conformant on the official Unicode 17.0
  WordBreakTest suite (1944/1944). Offset-preserving spans and a zero-allocation
  streaming API.
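
As a point of comparison for what "offset-preserving spans" from a word-boundary tokenizer look like, the sketch below uses the JDK's `java.text.BreakIterator` as a stand-in. It is not the PR's `WordSegmenter` or `WordTokenizer`, makes no claim about Unicode 17.0 conformance, and the filter that keeps only segments containing a letter or digit is an assumption made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSpanSketch {

  /** Returns {start, end} offsets of boundary segments that contain a letter or digit. */
  static List<int[]> wordSpans(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<int[]> spans = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      boolean wordLike = text.substring(start, end).codePoints()
          .anyMatch(Character::isLetterOrDigit);
      if (wordLike) {
        spans.add(new int[] {start, end}); // offsets into the original text
      }
    }
    return spans;
  }
}
```

For `"Hello, world"` this produces the spans [0, 5) and [7, 12), so each token can be traced back to its exact source characters.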

Text normalization (opennlp.tools.util.normalizer):
- The layered Term model (Dimension, Term, TermAnalyzer): a token as a stack of
  normalization layers (NFC, NFKC, whitespace, dash, case fold, accent fold,
  confusable fold, stem, lemma) with eager configured layers, lazy memoized
  extras, and O(1) peel; integrates the UAX #29 tokenizer and the existing
  Stemmer/Lemmatizer as the token-level layers.
- Confusable (homoglyph) skeleton folding per UTS #39, from the bundled Unicode
  security data.
- Per-language profiles (NormalizationProfile, NormalizationProfiles) mirroring
  the Snowball algorithm set with LanguageDetector fallback, including a German
  DIN 5007-2 umlaut fold (a-umlaut to ae, eszett to ss).
- First-class builder configuration: whitespace/dash fold targets, locale case
  folding, accent-fold script scope, and max token length, over a general
  transform(dimension, normalizer) hook.
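
The German DIN 5007-2 fold mentioned above can be illustrated with a few lines. This is a sketch under assumptions: `GermanFoldSketch` is a hypothetical name, the o-umlaut and u-umlaut mappings follow the same DIN 5007-2 convention but are not listed in the commit message, and the title-case treatment of upper-case umlauts (`Ae`, `Oe`, `Ue`) is a guess at what the profile does.

```java
final class GermanFoldSketch {

  /** Expands German umlauts and eszett to their two-letter ASCII forms. */
  static String fold(CharSequence input) {
    StringBuilder out = new StringBuilder(input.length() + 4);
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      switch (c) {
        case '\u00E4': out.append("ae"); break; // a-umlaut
        case '\u00F6': out.append("oe"); break; // o-umlaut
        case '\u00FC': out.append("ue"); break; // u-umlaut
        case '\u00DF': out.append("ss"); break; // eszett
        case '\u00C4': out.append("Ae"); break; // assumed upper-case handling
        case '\u00D6': out.append("Oe"); break;
        case '\u00DC': out.append("Ue"); break;
        default: out.append(c);
      }
    }
    return out.toString();
  }
}
```

Unlike the one-to-one whitespace and dash folds, this mapping lengthens the text, so any spans computed on the folded form need an offset map to point back into the source.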

Documentation: a Text Normalization chapter and a UAX #29 tokenizer section in
the manual; the bundled Unicode data files (WordBreakProperty, emoji-data,
WordBreakTest, confusables) are attributed in NOTICE.

Tests: UAX #29 boundary conformance and unit tests, and unit tests for the
normalizer engine, term model, confusables, language profiles, and German fold.
@rzo1

rzo1 commented Jun 19, 2026

Contributor

I know it is currently in draft state, but here are some thoughts after an intermediate pass, mostly around licensing, architecture and reviewability.

Licensing / release plumbing

The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt and the WordBreakTest.txt under test) are all under the Unicode license, which is ASF Category A, so the content itself is fine to ship. A few things still need fixing before a release build is happy though:

  • The Unicode attribution was added to the generated NOTICE. It needs to go into src/license/NOTICE.template instead, otherwise it gets dropped the next time NOTICE is regenerated (same as we did for the stopword lists and the spellchecker in OPENNLP-1832).
  • rat-excludes is not updated for the four bundled .txt files. The default build hides this, but the apache-release profile runs RAT with the excludes, and RAT will not recognize the Unicode header. Please add the four paths with an OPENNLP-1850 comment.
  • LICENSE should carry the actual Unicode license text, the way it already does for the bundled stopword lists. The newer Unicode files only link to terms_of_use.html in their header rather than embedding the text, so a URL in NOTICE is not enough on its own.
  • ExtendedPictographic.txt is described in NOTICE as an unmodified emoji-data.txt, but it is actually a filtered subset: only the Extended_Pictographic property is kept (451 lines), and the file was renamed. The upstream emoji-data.txt carries six properties. The wording should say it is derived from emoji-data.txt by extracting that one property, not "unmodified".

Architecture

The bigger thing I would like to align on before this freezes as public API: there are now several overlapping ways to normalize text living next to each other. We already had the CharSequenceNormalizer family (Aggregate, Shrink, Number, Url, Twitter, Emoji). This PR adds about 14 more of those, plus a TextNormalizer builder over them, plus a TextAnalyzer/AnalyzedToken offset model in opennlp-api, plus a separate TermAnalyzer/Term/Dimension layered model in opennlp-runtime. The same rungs (nfc, nfkc, whitespace, dash, case fold, accent fold) end up declared in three places that can drift. TextAnalyzer and TermAnalyzer also do basically the same job (tokenize, normalize per token, keep the source span) but with different names and different tokenizers. I think we should pick one token-analysis entry point and one place that defines the steps before this ships, otherwise unifying it later is a breaking change.

Related: the api vs runtime split is worth a second look. The classes placed in opennlp-api are concrete algorithms plus Unicode tables, not contracts, and the only cross-module consumer is opennlp-dl (which uses CharClass and already depends on opennlp-runtime). The feature is currently split across both modules for no clear reason.

The DL changes (InferenceOptions normalize flags, AbstractDL.normalizeInput, and the NameFinderDL/DocumentCategorizerDL span decode rework) are good, but note these are behavioral changes to existing components rather than purely additive, so they deserve their own focused review.

Reviewability

This is around 70 files and 20k+ lines covering four fairly independent subsystems, which makes it hard to review as one unit. Would you be open to splitting it into stacked PRs under an OPENNLP-1850 umbrella, roughly: (1) the Unicode primitives in opennlp-api, (2) the UAX #29 tokenizer plus its data files and the rat/license plumbing, (3) the new CharSequenceNormalizer rungs, (4) the Term/Dimension layered model plus confusables, (5) the DL integration, and (6) the docs? That would let the foundational pieces go in quickly and give the contested parts (the overlapping abstractions, and the DL behavior change) the attention they need on their own.

Happy to help with the NOTICE/rat/LICENSE fixes if useful.

@krickert

Contributor Author

Here's what I'll do now before splitting up -

  1. Move the portions out of the API
  2. Work on the obvious parts pointed out that need to land regardless
  3. Focus first on the API - that will help us triage how to split up the work

@mawiesne mawiesne changed the title Opennlp 1850 whitespace utf normalizae OPENNLP-1850: Improve Whitespace UTF normalization Jun 19, 2026
krickert added 3 commits June 19, 2026 16:48
…nalysis on Term

- Relocate CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash (and their tests)
  from opennlp-api to opennlp-runtime, so the API keeps only contracts
  (CharSequenceNormalizer) and table-free value types (NormalizedText, OffsetMap).
- opennlp-dl now depends on opennlp-runtime (compile) for the CharClass it uses to
  chunk input (AbstractDL).
- Delete TextAnalyzer/AnalyzedToken: TermAnalyzer/Term is the single token-analysis
  entry point; original/normalized/span are read from Term. Manual updated.
Each character-level Dimension now carries its default CharSequenceNormalizer
(resolved lazily via a Supplier, so the confusables table is not loaded on enum
init). TermAnalyzer drops its parallel defaultTransforms() map and reads the
default from the dimension (builder overrides still win); TextNormalizer's nfc,
nfkc, whitespace, dash, case-fold, and accent-fold methods delegate to the same
source instead of re-listing the normalizers. The shared rungs are now defined
once. TextNormalizer-only cleanup steps (quotes, digits, ellipsis, bullets,
strip-invisible) stay standalone.
The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt,
confusables.txt, and the WordBreakTest.txt test fixture) ship under the Unicode
License V3 (ASF Category A). Make the release plumbing reflect that:

- Add the Unicode attribution to src/license/NOTICE.template so it survives
  NOTICE regeneration; it previously lived only in the generated NOTICE.
- Embed the full Unicode License V3 text in LICENSE, as is already done for the
  bundled stopword lists. The newer Unicode headers only link to
  terms_of_use.html rather than embedding the text, so the NOTICE link alone is
  not enough.
- Exclude the four bundled .txt files in rat-excludes so apache-release RAT does
  not flag their non-Apache headers.
- Correct the ExtendedPictographic.txt description: it is a filtered subset of
  emoji-data.txt (only the Extended_Pictographic property, renamed), not an
  unmodified copy.
@krickert

Contributor Author

Thanks for the thorough pass.
Below I touch on every point.

It's all pushed now (through 5ab7f873), and I did a first pass. If you are OK with it, I can do 2 smaller PRs on top of this (I think that's easiest, but if you want it broken down further I can figure it out).

Licensing / release plumbing - done

  • NOTICE.template, not just NOTICE. The Unicode attribution now lives in
    src/license/NOTICE.template, so it survives regeneration (same as the stopword /
    spellchecker entries). The generated NOTICE is updated to match.
  • rat-excludes. Added the four bundled .txt paths under an OPENNLP-1850 comment
    (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt, and the
    WordBreakTest.txt test fixture).
  • LICENSE carries the actual text. Embedded the full Unicode License V3 in LICENSE,
    the way the stopword lists embed their BSD text - since the newer Unicode headers only
    link to terms_of_use.html rather than inlining it, a URL in NOTICE wasn't enough on
    its own.
  • ExtendedPictographic.txt wording. Corrected - it's described as derived from
    emoji-data.txt by keeping only the Extended_Pictographic property (451 lines,
    renamed), not an unmodified copy. The wording also notes the upstream file carries the
    other five emoji properties that aren't retained.

Architecture

One token-analysis entry point, one place that defines the steps - done.
This wasn't bad at all, and it also shrinks the surface:

  • TextAnalyzer / AnalyzedToken are deleted. TermAnalyzer / Term is the single
    entry point - Term.original() / normalized() / span() give you exactly what
    AnalyzedToken did, on the UAX #29 tokenizer instead of a second one.
  • Each Dimension now carries its own default CharSequenceNormalizer (lazily, so the
    confusables table isn't loaded just by touching the enum). TermAnalyzer dropped its
    parallel defaultTransforms() map, and TextNormalizer's nfc / nfkc / whitespace /
    dash / case-fold / accent-fold rungs now read from Dimension. The six shared rungs are
    defined once. (TextNormalizer's standalone cleanup steps - quotes, digits, ellipsis,
    bullets, strip-invisible - I left as-is for now since they already live in one place;
    promoting them to Dimensions is an easy follow-up if you'd rather everything be
    layer-able. LMK)
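
The "default normalizer resolved lazily via a Supplier" idea can be sketched as an enum whose constants hold a factory and build the normalizer only on first use. Everything here is hypothetical (`DimensionSketch`, the `Loads` counter, the toy Cyrillic-to-Latin replacement standing in for the confusables table); it shows the pattern, not the PR's `Dimension` enum.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

enum DimensionSketch {
  CASE_FOLD(() -> s -> s.toLowerCase(Locale.ROOT)),
  CONFUSABLE(() -> {
    Loads.COUNT.incrementAndGet(); // stands in for loading an expensive table
    return s -> s.replace('\u0430', 'a'); // Cyrillic small a -> Latin a (toy example)
  });

  /** Counts how often the expensive factory ran; only here to make laziness visible. */
  static final class Loads {
    static final AtomicInteger COUNT = new AtomicInteger();
  }

  private final Supplier<UnaryOperator<String>> factory;
  private volatile UnaryOperator<String> cached;

  DimensionSketch(Supplier<UnaryOperator<String>> factory) {
    this.factory = factory; // storing the factory does not run it
  }

  /** Builds the normalizer on first call and reuses it afterwards. */
  UnaryOperator<String> defaultNormalizer() {
    UnaryOperator<String> local = cached;
    if (local == null) {
      synchronized (this) {
        local = cached;
        if (local == null) {
          local = factory.get();
          cached = local;
        }
      }
    }
    return local;
  }
}
```

Touching the enum, or using `CASE_FOLD`, leaves the counter at zero; only the first `CONFUSABLE.defaultNormalizer()` call runs the factory, and later calls return the cached instance.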

api vs runtime split is completed. opennlp-api/util/normalizer is now exactly three files:
CharSequenceNormalizer (the contract) plus the table-free value types NormalizedText
and OffsetMap. Everything with data or an engine - CharClass, CodePointSet,
UnicodeWhitespace, UnicodeDash, the whole normalizer family - moved down to
opennlp-runtime. Nothing table-shaped is left in the API. One consequence to sanity-check:
AbstractDL uses CharClass for input chunking, so opennlp-dl now compile-depends on
opennlp-runtime (it was test-scope before). That's the minimal fix; if you'd rather DL
stay api-thin, the alternative is to inject the normalizer through InferenceOptions; just LMK.
I don't have an opinion either way on that one.

DL changes are behavioral. The InferenceOptions flags, normalizeInput, and
the span-decode rework are behavioral changes to existing components, not purely additive, so
I'll go ahead and make that a separate review. There were good reasons for the rework: feeding offset-shifting normalization into the DL models requires the span decode to map predictions back to the original text rather than to the normalized buffer. Without it, NameFinderDL/DocumentCategorizerDL would emit spans pointing at the wrong characters whenever normalization changed the input length. However, the output remains unchanged.
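
To make the "map predictions back to the original text" point concrete, here is a minimal sketch of a length-changing normalization (collapsing whitespace runs to one space) that records, for each normalized character, where it came from. `OffsetMapSketch` is hypothetical and much simpler than the PR's `NormalizedText` / `OffsetMap`, whose API is not shown in this thread.

```java
import java.util.Arrays;

final class OffsetMapSketch {

  final String normalized;
  private final int[] toOriginal; // toOriginal[i] = source offset of normalized char i

  OffsetMapSketch(String original) {
    StringBuilder out = new StringBuilder(original.length());
    int[] map = new int[original.length()];
    boolean inRun = false;
    for (int i = 0; i < original.length(); i++) {
      char c = original.charAt(i);
      boolean ws = Character.isWhitespace(c);
      if (ws && inRun) {
        continue; // the run is already represented by a single space
      }
      map[out.length()] = i;
      out.append(ws ? ' ' : c);
      inRun = ws;
    }
    normalized = out.toString();
    toOriginal = Arrays.copyOf(map, out.length());
  }

  /** Translates a non-empty [start, end) span on the normalized text to source offsets. */
  int[] toOriginalSpan(int start, int end) {
    if (start >= end) {
      throw new IllegalArgumentException("span must be non-empty");
    }
    return new int[] {toOriginal[start], toOriginal[end - 1] + 1};
  }
}
```

For the input `"New   York  City"` the normalized text is `"New York City"`; the span [4, 8) covering "York" there maps back to [6, 10) in the source, whereas using [4, 8) directly on the source would point at the wrong characters.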

Reviewability - splitting it up

Rather than stack onto the in-flight PRs (this is self-contained and would just tangle with those), I'd split this along its natural dependency seam into two stacks:

  • Stack 1 - normalization foundation: the API contracts + the runtime normalizer engine
    (CharClass, the normalizer family, Dimension, TextNormalizer, the offset model) + the
    DL opt-ins.
  • Stack 2 - UAX #29 tokenizer + advanced: the tokenizer and its boundary/perf engine (where
    the lookup tables live), the layered Term model, confusable folding (UTS #39), and the
    per-language profiles

That contains the perf/lookup-table and DL-behavior conversations entirely in Stack 2 and lets
the foundation land clean. Your six-way split would also work; I leaned to two because the seam
between foundation and tokenizer is the one real dependency boundary, and the rest are more
"commits within a stack" than independently-landable units.

The speed framing (for context, not a competition)

Numbers from a JMH harness (2 forks, warmed up), compared against the Lucene 10.3.2
StandardTokenizer and today's SimpleTokenizer on a Latin corpus (Mchars/s):

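| tokenizer                       | Mchars/s | vs OpenNLP today | vs Lucene 10 |
|---------------------------------|----------|------------------|--------------|
| OpenNLP SimpleTokenizer (today) | 282      | 1.00×            | 0.88×        |
| Lucene 10.3.2 StandardTokenizer | 320      | 1.13×            | 1.00×        |
| new - boundary scan             | 388      | 1.38×            | 1.21×        |
| new - streaming tokenize        | 335      | 1.19×            | 1.05×        |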

The boundary output is byte-identical before and after the perf work: 1944/1944 on the official
UAX #29 conformance suite throughout; the transition table is derived from the readable rule
cascade at class-load, so the rules stay the source of truth. The public surface (the Tokenizer
impl, the streaming handler, the Term / Dimension layering) is independent of the engine
internals, so we can retune or swap the table without touching the interface. The Lucene
head-to-head benchmark stays out of the repo (it would pull in a Lucene dependency, which I
don't think we should take on; I was just curious), but I'll share the harness in a temporary commit.

OK to cut the two stacks?
