OPENNLP-1850: Improve Whitespace UTF normalization #1101
…decoding

NameFinderDL only decoded B-PER/I-PER and put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically and harden it:

- Any B-<TYPE> begins a span whose type is the label minus the B- prefix (B-ORG -> ORG), extending while the following labels are I-<same type>. Span.getType() now reports the entity label (PER, ORG, LOC, ...) and ids2Labels fully drives recognition for any BIO-tagged model.
- isBeginLabel() requires a non-empty type after "B-", so a malformed "B-" label no longer starts an empty-type span. An argmax index with no entry in ids2Labels fails loudly instead of being silently skipped.
- Span.getProb() is now a numerically stable softmax over the token's label scores (bounded to [0,1]) instead of the raw max logit; handles +Inf, all-(-Inf) and NaN edge cases.
- find() inference is fail-loud and consistent with the sibling DocumentCategorizerDL: failures surface as IllegalStateException (cause preserved) and an unexpected/empty model-output shape is its own loud failure, rather than a bare RuntimeException or raw ClassCastException.
- Floor the character-search cursor at each sentence's start (via sentPosDetect) and thread it forward across that sentence's chunks, so a repeated entity surface form is located at its own occurrence instead of being re-matched against an earlier one -- which previously emitted duplicate or mis-located spans for multi-sentence/multi-chunk input.
- Span text reconstruction matches the source with flexible whitespace (\s*), so entities whose wordpiece tokenization splits internal punctuation or "&" apart (U.S.A, AT&T) are still located instead of silently dropped.
- Remove the now-unused SpanEnd record.
- Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and expose labelProbability()/maxIndex() for unit testing without an ONNX model; add NameFinderDLTest coverage for entity types, bounded and edge-case probabilities, malformed begin labels, wordpiece reconstruction, internal-punctuation and case-insensitive matching, missing labels, and cursor-threaded span location.
- Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new all-types output (the George-Washington input now yields PER + LOC) and assert span types and covered text.
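The probability change described above can be illustrated with a minimal sketch of a max-shifted softmax. The method names labelProbability() and maxIndex() come from the commit message, but their signatures, the use of double[], and the edge-case choices below (0.0 for a NaN score or an all-(-Inf) row, an even split across +Inf entries) are assumptions made for illustration, not the code in this PR.

```java
final class LabelProbabilitySketch {

    /** Index of the largest non-NaN score, or -1 if there is none. */
    static int maxIndex(double[] scores) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            if (Double.isNaN(scores[i])) {
                continue; // NaN never wins the argmax
            }
            if (best < 0 || scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Softmax probability of scores[index], always within [0, 1]. */
    static double labelProbability(double[] scores, int index) {
        int best = maxIndex(scores);
        if (best < 0 || Double.isNaN(scores[index])) {
            return 0.0; // assumption: no usable evidence maps to 0
        }
        double max = scores[best];
        if (max == Double.NEGATIVE_INFINITY) {
            return 0.0; // assumption: an all-(-Inf) row carries no probability mass
        }
        if (max == Double.POSITIVE_INFINITY) {
            // Share the mass evenly between the +Inf entries.
            int infinite = 0;
            for (double s : scores) {
                if (s == Double.POSITIVE_INFINITY) {
                    infinite++;
                }
            }
            return scores[index] == Double.POSITIVE_INFINITY ? 1.0 / infinite : 0.0;
        }
        // Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
        double sum = 0.0;
        for (double s : scores) {
            if (!Double.isNaN(s)) {
                sum += Math.exp(s - max);
            }
        }
        return Math.exp(scores[index] - max) / sum;
    }
}
```

Shifting by the maximum is what makes two logits of 1000 come out as 0.5 each, where a naive exp(1000) would overflow to infinity.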
Keep unmapped label ids graceful, bound decoded span lookup to the current sentence, add diagnostics for unlocated decoded spans, and tighten exception types/messages plus helper documentation.
…d labels

Make the public no-space token constants immutable (Set.of instead of mutable arrays) while keeping them public for third-party use.

Fail loud on an unmapped model output index: predictLabel now throws an IllegalStateException naming the index instead of degrading the token to "O", and the constructors document that ids2Labels must be exhaustive over the model's output indices. Also document the IllegalArgumentException that find() can raise on a vocabulary/model mismatch.

Add edge-case decoding tests: token/score count mismatch, orphan I- labels, adjacent entities of different types, multi-token minimum-probability semantics, repeated entities at distinct offsets within one call, regex metacharacters in span text, and search-start clamping past end of text.
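The decoding rules in these commits can be sketched as follows. This is a simplified illustration, not the NameFinderDL code: the Entity record and all method signatures are hypothetical, spans are expressed as token index ranges rather than character offsets, and skipping an orphan I- label is an assumption (the commit only says that case is tested).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BioDecodeSketch {

    /** Hypothetical result type: token range [start, end) plus the entity type. */
    record Entity(int start, int end, String type) {}

    /** A begin label needs a non-empty type after "B-". */
    static boolean isBeginLabel(String label) {
        return label.startsWith("B-") && label.length() > 2;
    }

    /** Fail loud when the model emits an index that ids2Labels does not cover. */
    static String predictLabel(Map<Integer, String> ids2Labels, int id) {
        String label = ids2Labels.get(id);
        if (label == null) {
            throw new IllegalStateException("No label mapped for model output index " + id);
        }
        return label;
    }

    /** Decodes one label per token into typed entity ranges. */
    static List<Entity> decodeSpans(List<String> labels) {
        List<Entity> entities = new ArrayList<>();
        int i = 0;
        while (i < labels.size()) {
            String label = labels.get(i);
            if (!isBeginLabel(label)) {
                i++; // "O", a malformed "B-", or an orphan "I-X" (assumed to be skipped)
                continue;
            }
            String type = label.substring(2); // B-ORG -> ORG
            int end = i + 1;
            // Extend only while the continuation label carries the same type.
            while (end < labels.size() && labels.get(end).equals("I-" + type)) {
                end++;
            }
            entities.add(new Entity(i, end, type));
            i = end;
        }
        return entities;
    }
}
```

With this shape, the labels B-PER, B-ORG, I-ORG decode to a one-token PER entity followed by a two-token ORG entity, which is the "adjacent entities of different types" case named above.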
…and tests

Co-authored-by: Junie <junie@jetbrains.com>
Signed-off-by: Kristian Rickert <krickert@gmail.com>
… the TextNormalizer pipeline, and offset-preserving TextAnalyzer

Quote, digit, decimal, invisible-control, ellipsis, and bullet normalizers, all reusing the cursor-based CharClass engine (O(1) membership, no regex). TextNormalizer is a fluent builder that composes the rungs into an AggregateCharSequenceNormalizer, with a conservative searchDefault() chain. TextAnalyzer/AnalyzedToken tokenize and normalize per token while keeping each token's source span, the offset-preserving building block for BM25 matching.
…components

InferenceOptions gains setNormalizeWhitespace and setNormalizeDashes (both off by default). When enabled, NameFinderDL and DocumentCategorizerDL fold input whitespace and/or dashes to their ASCII forms before inference via a shared AbstractDL.normalizeInput helper. The mapping is one code point to one ASCII character, so it is offset preserving for the Basic Multilingual Plane and any spans the model produces still align with the input.
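A rough sketch of such a one-to-one fold is shown below. It is not AbstractDL.normalizeInput: the class and method names are hypothetical, and it approximates the whitespace and dash sets with java.lang.Character properties instead of the CharClass tables the PR uses. It also folds tabs and newlines to a space, which the real helper may not do.

```java
final class AsciiFoldSketch {

    /**
     * Replaces every whitespace code point with ' ' and every dash code point with '-'.
     * Each code point maps to exactly one char, so for BMP input the result has the
     * same length and every char offset still points at the same position.
     */
    static String foldWhitespaceAndDashes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                out.append(' '); // covers NBSP and the Unicode space separators
            } else if (Character.getType(cp) == Character.DASH_PUNCTUATION) {
                out.append('-'); // en dash, em dash, and other Pd characters
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

The BMP caveat follows from UTF-16: a supplementary code point occupies two chars, so replacing it with a single ASCII char would shift every later offset by one.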
Add a Text Normalization chapter to the developer manual covering the normalizer family, the TextNormalizer pipeline, script-gated diacritic folding and its multilingual safety, the CharClass engine and user-defined code point sets, offset-preserving analysis, and the Unicode reference data.
Brings in actions/checkout v7.0.0 CI updates. NameFinderDL conflicts resolved in favor of the 1850 Unicode cursor matcher and chunking.
Additive Unicode text handling for matching, search, and tokenization preprocessing (new types only, no breaking changes).

UAX #29 word tokenizer (opennlp.tools.tokenize.uax29):
- WordSegmenter, WordTokenizer (implements opennlp.tools.tokenize.Tokenizer), and WordType. A single-pass, table-driven engine with O(1) Word_Break lookups and no regular expression; 100% conformant on the official Unicode 17.0 WordBreakTest suite (1944/1944). Offset-preserving spans and a zero-allocation streaming API.

Text normalization (opennlp.tools.util.normalizer):
- The layered Term model (Dimension, Term, TermAnalyzer): a token as a stack of normalization layers (NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma) with eager configured layers, lazy memoized extras, and O(1) peel; integrates the UAX #29 tokenizer and the existing Stemmer/Lemmatizer as the token-level layers.
- Confusable (homoglyph) skeleton folding per UTS #39, from the bundled Unicode security data.
- Per-language profiles (NormalizationProfile, NormalizationProfiles) mirroring the Snowball algorithm set with LanguageDetector fallback, including a German DIN 5007-2 umlaut fold (a-umlaut to ae, eszett to ss).
- First-class builder configuration: whitespace/dash fold targets, locale case folding, accent-fold script scope, and max token length, over a general transform(dimension, normalizer) hook.

Documentation: a Text Normalization chapter and a UAX #29 tokenizer section in the manual; the bundled Unicode data files (WordBreakProperty, emoji-data, WordBreakTest, confusables) are attributed in NOTICE.

Tests: UAX #29 boundary conformance and unit tests, and unit tests for the normalizer engine, term model, confusables, language profiles, and German fold.
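As an illustration of the German fold mentioned above, here is a minimal sketch of a DIN 5007-2 style expansion. The class and method names are hypothetical, it handles precomposed characters only, and the uppercase mappings (for example A-umlaut to "Ae") are an assumption; the description only names the a-umlaut and eszett cases.

```java
final class GermanFoldSketch {

    /** Expands umlauts and eszett: ae, oe, ue, ss (precomposed characters only). */
    static String foldGerman(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00E4' -> out.append("ae"); // a-umlaut
                case '\u00F6' -> out.append("oe"); // o-umlaut
                case '\u00FC' -> out.append("ue"); // u-umlaut
                case '\u00DF' -> out.append("ss"); // eszett
                case '\u00C4' -> out.append("Ae"); // assumption: title-case expansion
                case '\u00D6' -> out.append("Oe");
                case '\u00DC' -> out.append("Ue");
                default -> out.append(c);
            }
        }
        return out.toString();
    }
}
```

Unlike the whitespace and dash folds, this expansion changes the string length, so a caller that needs source offsets has to track them separately.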
I know it is currently in a draft state, but here are some thoughts after an intermediate pass, mostly around licensing, architecture, and reviewability.

Licensing / release plumbing

The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt and the WordBreakTest.txt under test) are all under the Unicode license, which is ASF Category A, so the content itself is fine to ship. A few things still need fixing before a release build is happy though:
Architecture

The bigger thing I would like to align on before this freezes as public API: there are now several overlapping ways to normalize text living next to each other. We already had the

Related: the api vs runtime split is worth a second look. The classes placed in opennlp-api are concrete algorithms plus Unicode tables, not contracts, and the only cross-module consumer is opennlp-dl (which uses CharClass and already depends on opennlp-runtime). The feature is currently split across both modules for no clear reason.

The DL changes (InferenceOptions normalize flags, AbstractDL.normalizeInput, and the NameFinderDL/DocumentCategorizerDL span decode rework) are good, but note these are behavioral changes to existing components rather than purely additive, so they deserve their own focused review.

Reviewability

This is around 70 files and 20k+ lines covering four fairly independent subsystems, which makes it hard to review as one unit. Would you be open to splitting it into stacked PRs under an OPENNLP-1850 umbrella, roughly: (1) the Unicode primitives in opennlp-api, (2) the UAX 29 tokenizer plus its data files and the rat/license plumbing, (3) the new CharSequenceNormalizer rungs, (4) the Term/Dimension layered model plus confusables, (5) the DL integration, and (6) the docs? That would let the foundational pieces go in quickly and give the contested parts (the overlapping abstractions, and the DL behavior change) the attention they need on their own.

Happy to help with the NOTICE/rat/LICENSE fixes if useful.
Here's what I'll do now before splitting up -
…nalysis on Term

- Relocate CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash (and their tests) from opennlp-api to opennlp-runtime, so the API keeps only contracts (CharSequenceNormalizer) and table-free value types (NormalizedText, OffsetMap).
- opennlp-dl now depends on opennlp-runtime (compile) for the CharClass it uses to chunk input (AbstractDL).
- Delete TextAnalyzer/AnalyzedToken: TermAnalyzer/Term is the single token-analysis entry point; original/normalized/span are read from Term. Manual updated.
Each character-level Dimension now carries its default CharSequenceNormalizer (resolved lazily via a Supplier, so the confusables table is not loaded on enum init). TermAnalyzer drops its parallel defaultTransforms() map and reads the default from the dimension (builder overrides still win); TextNormalizer's nfc, nfkc, whitespace, dash, case-fold, and accent-fold methods delegate to the same source instead of re-listing the normalizers. The shared rungs are now defined once. TextNormalizer-only cleanup steps (quotes, digits, ellipsis, bullets, strip-invisible) stay standalone.
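The lazy-default idea can be sketched with a small enum whose constants hold a memoizing Supplier. Everything here is illustrative: the memoize() helper, the tableLoads counter, and the use of UnaryOperator<String> in place of CharSequenceNormalizer are assumptions, and the one-character confusable mapping stands in for the real table.

```java
import java.util.Locale;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

final class LazyDimensionSketch {

    /** Counts how often the stand-in "expensive table" is built, to make laziness visible. */
    static int tableLoads = 0;

    /** Wraps a factory so it runs at most once, on first use. */
    static <T> Supplier<T> memoize(Supplier<T> factory) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded;

            @Override
            public synchronized T get() {
                if (!loaded) {
                    value = factory.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    enum Dimension {
        CASE_FOLD(() -> s -> s.toLowerCase(Locale.ROOT)),
        CONFUSABLE_FOLD(() -> {
            tableLoads++; // stands in for loading the confusables data
            return s -> s.replace('\u0430', 'a'); // Cyrillic a -> Latin a
        });

        private final Supplier<UnaryOperator<String>> defaultNormalizer;

        Dimension(Supplier<UnaryOperator<String>> factory) {
            // Only the factory is stored at enum init; nothing is built yet.
            this.defaultNormalizer = memoize(factory);
        }

        /** The single definition of this dimension's default, built on first call. */
        UnaryOperator<String> defaultNormalizer() {
            return defaultNormalizer.get();
        }
    }
}
```

Because both an analyzer and a pipeline builder would read the default from the enum constant, the normalizer is defined in one place, and an override only needs to replace the value it looked up.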
The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt, and the WordBreakTest.txt test fixture) ship under the Unicode License V3 (ASF Category A). Make the release plumbing reflect that:

- Add the Unicode attribution to src/license/NOTICE.template so it survives NOTICE regeneration; it previously lived only in the generated NOTICE.
- Embed the full Unicode License V3 text in LICENSE, as is already done for the bundled stopword lists. The newer Unicode headers only link to terms_of_use.html rather than embedding the text, so the NOTICE link alone is not enough.
- Exclude the four bundled .txt files in rat-excludes so apache-release RAT does not flag their non-Apache headers.
- Correct the ExtendedPictographic.txt description: it is a filtered subset of emoji-data.txt (only the Extended_Pictographic property, renamed), not an unmodified copy.
Thanks for the thorough pass. It's all pushed now (through

Licensing / release plumbing - done

Architecture

One token-analysis entry point, one place that defines the steps - done.

api vs runtime split is completed. DL changes are behavioral. The

Reviewability splitting it up

Rather than stack onto the in-flight PRs (this is self-contained and would just tangle with those), I'd split this along its natural dependency seam into two stacks:

That contains the perf/lookup-table and DL-behavior conversations entirely in Stack 2 and lets

The speed framing (for context, not a competition)

Numbers from JMH harness (2 forks, warmed up), vs Lucene 10.3.2

The boundary output is byte-identical before and after the perf work: 1944/1944 on the official

OK to cut the two stacks?