OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105
Draft
krickert wants to merge 2 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
This was referenced Jun 20, 2026
Pull request overview
This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.
Changes:
- Add Unicode-aware whitespace chunking in `AbstractDL` and use it in `NameFinderDL`/`DocumentCategorizerDL` instead of `text.split("\\s+")`.
- Replace regex-based span localization in `NameFinderDL` with a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively.
- Introduce opt-in `InferenceOptions` toggles for whitespace/dash folding and document the behavior; add targeted regression tests.
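To illustrate why the first change matters, here is a minimal sketch of code-point-based whitespace chunking next to `split("\\s+")`. It is not the PR's implementation: the PR uses `CharClass` from opennlp-runtime, whose API is not shown here, so the class name `UnicodeChunkSketch`, the `chunk` helper, and the whitespace definition (`Character.isWhitespace` or `Character.isSpaceChar`) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the AbstractDL code from this PR.
final class UnicodeChunkSketch {

  // Assumption: a code point counts as whitespace if either JDK predicate
  // accepts it. isSpaceChar covers NBSP (U+00A0), which isWhitespace excludes.
  static boolean isUnicodeWhitespace(int cp) {
    return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
  }

  // Splits text into maximal runs of non-whitespace code points.
  static List<String> chunk(String text) {
    List<String> chunks = new ArrayList<>();
    int start = -1; // start of the current chunk, or -1 when between chunks
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (isUnicodeWhitespace(cp)) {
        if (start >= 0) {
          chunks.add(text.substring(start, i));
          start = -1;
        }
      } else if (start < 0) {
        start = i;
      }
      i += Character.charCount(cp); // advance by whole code points
    }
    if (start >= 0) {
      chunks.add(text.substring(start));
    }
    return chunks;
  }

  public static void main(String[] args) {
    // NBSP between "New" and "York", ideographic space before "City".
    String text = "New\u00A0York\u3000City is big";
    System.out.println(chunk(text));                // [New, York, City, is, big]
    System.out.println(text.split("\\s+").length);  // 3
  }
}
```

Without the `UNICODE_CHARACTER_CLASS` flag, Java's `\s` only matches ASCII whitespace, so the regex split leaves `New York City` glued together as a single chunk.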
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java | Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java | Adds opt-in flags to normalize whitespace and dashes before inference. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java | Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java | Applies optional normalization and switches chunking to Unicode whitespace. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java | New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java | Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher. |
| opennlp-core/opennlp-ml/opennlp-dl/README.md | Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options. |
| opennlp-core/opennlp-ml/opennlp-dl/pom.xml | Makes opennlp-runtime a compile dependency to use CharClass at runtime. |
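The cursor-based span matcher described for `NameFinderDL` can be sketched roughly as follows. This is a hypothetical illustration of the technique (a space in the decoded span matches a run of one or more Unicode whitespace code points, every other code point is compared case-insensitively), not the code in this PR; `CursorSpanMatchSketch`, `matchAt`, and `find` are names invented for the example.

```java
// Hypothetical sketch; not the NameFinderDL matcher from this PR.
final class CursorSpanMatchSketch {

  // Assumption: same whitespace definition as a plain JDK check would give.
  static boolean isUnicodeWhitespace(int cp) {
    return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
  }

  static boolean sameIgnoringCase(int a, int b) {
    return a == b
        || Character.toLowerCase(a) == Character.toLowerCase(b)
        || Character.toUpperCase(a) == Character.toUpperCase(b);
  }

  // Returns the exclusive end offset if span matches source at start, else -1.
  static int matchAt(String source, int start, String span) {
    int s = start; // cursor into source
    int p = 0;     // cursor into span
    while (p < span.length()) {
      int pc = span.codePointAt(p);
      if (pc == ' ') {
        // A span space consumes one or more whitespace code points in source.
        int runStart = s;
        while (s < source.length()) {
          int sc = source.codePointAt(s);
          if (!isUnicodeWhitespace(sc)) {
            break;
          }
          s += Character.charCount(sc);
        }
        if (s == runStart) {
          return -1;
        }
        p += 1;
      } else {
        if (s >= source.length()) {
          return -1;
        }
        int sc = source.codePointAt(s);
        if (!sameIgnoringCase(sc, pc)) {
          return -1;
        }
        s += Character.charCount(sc);
        p += Character.charCount(pc);
      }
    }
    return s;
  }

  // Returns {start, end} of the first match at or after from, or null.
  static int[] find(String source, String span, int from) {
    if (span.isEmpty()) {
      return null;
    }
    for (int i = from; i < source.length(); ) {
      int end = matchAt(source, i, span);
      if (end >= 0) {
        return new int[] {i, end};
      }
      i += Character.charCount(source.codePointAt(i));
    }
    return null;
  }

  public static void main(String[] args) {
    int[] hit = find("Visit new\u00A0 york today", "New York", 0);
    System.out.println(hit[0] + ".." + hit[1]); // 6..15
  }
}
```

Because no pattern is compiled, span text containing regex metacharacters (for example `C++ (beta)`) needs no quoting, which is one plausible reason to prefer a cursor over a regex here.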
The behavioral DL integration, isolated for focused review.
- opennlp-dl now compile-depends on opennlp-runtime for the CharClass engine used to chunk model input on Unicode whitespace/dash boundaries (was test scope).
- InferenceOptions gains the input-normalization opt-ins; AbstractDL applies them offset-safely so that NameFinderDL and DocumentCategorizerDL decode predicted spans back to the original, un-normalized text.
Input folding can change UTF-16 length (a supplementary dash collapses to one hyphen), which shifted the spans NameFinderDL returned so they no longer aligned with the original input. Carry an offset map instead of restricting what folds.
- AbstractDL.normalizeInputMapped returns the folded text plus an OffsetMap back to the original, so positions map correctly across any length change (shrink, or a future expansion such as ellipsis). The plain normalizeInput stays for DocumentCategorizerDL, which returns no positions and is unaffected.
- NameFinderDL adds findInOriginal(String[]), returning spans in original-input coordinates via that map; TokenNameFinder find(String[]) is preserved but deprecated (equivalent when no fold changes length).
- Model-free tests cover the supplementary-dash shrink and the length-preserving identity case.
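The shrink case above can be made concrete with a small sketch of folding that records an offset map. This is only an illustration under stated assumptions: it does not use the PR's `normalizeInputMapped` or `OffsetMap` API (their signatures are not shown in this conversation), the `Folded` type and `fold` helper are hypothetical, and the set of folded dashes is a hand-picked example (en dash, em dash, and the supplementary U+10EAD).

```java
import java.util.Arrays;

// Hypothetical sketch; not the AbstractDL.normalizeInputMapped implementation.
final class FoldWithOffsetsSketch {

  // Folded text plus, for each folded UTF-16 index (and the end position),
  // the corresponding UTF-16 index in the original text.
  static final class Folded {
    final String text;
    final int[] toOriginal;

    Folded(String text, int[] toOriginal) {
      this.text = text;
      this.toOriginal = toOriginal;
    }

    // Works for both inclusive starts and exclusive ends.
    int toOriginalOffset(int foldedOffset) {
      return toOriginal[foldedOffset];
    }
  }

  // Assumption: an explicit, illustrative dash set rather than the PR's CharClass.
  static boolean isFoldableDash(int cp) {
    return cp == 0x2013 || cp == 0x2014 || cp == 0x10EAD;
  }

  static Folded fold(String original) {
    StringBuilder out = new StringBuilder(original.length());
    // This sketch only shrinks or preserves length, so length + 1 slots suffice.
    int[] map = new int[original.length() + 1];
    for (int i = 0; i < original.length(); ) {
      int cp = original.codePointAt(i);
      int n = Character.charCount(cp);
      if (isFoldableDash(cp)) {
        map[out.length()] = i; // one hyphen stands for n original units
        out.append('-');
      } else {
        for (int k = 0; k < n; k++) {
          map[out.length() + k] = i + k;
        }
        out.appendCodePoint(cp);
      }
      i += n;
    }
    map[out.length()] = original.length(); // map the end position too
    return new Folded(out.toString(), Arrays.copyOf(map, out.length() + 1));
  }

  public static void main(String[] args) {
    String dash = new String(Character.toChars(0x10EAD)); // two UTF-16 units
    String original = "A" + dash + "B Paris";
    Folded folded = fold(original);
    int start = folded.text.indexOf("Paris");             // 4 in folded text
    int end = start + "Paris".length();                   // 9 in folded text
    System.out.println(original.substring(
        folded.toOriginalOffset(start), folded.toOriginalOffset(end))); // Paris
  }
}
```

Using the folded offsets 4..9 directly against the original string would return `Pari` preceded by a space, which is the misalignment the commit message describes.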
Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.
opennlp-dl compile-depends on opennlp-runtime for CharClass (input chunking on Unicode whitespace/dash); InferenceOptions opt-ins; AbstractDL applies them offset-safely so NameFinderDL/DocumentCategorizerDL decode spans back to the original text.
Note: this depends only on the foundation (#1103), not the tokenizer. Once #1103 merges, this PR can be re-targeted to main and merged independently of #2.