OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) by krickert · Pull Request #1105 · apache/opennlp

krickert · 2026-06-20T12:36:51Z

Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.

opennlp-dl now compile-depends on opennlp-runtime for CharClass (input chunking on Unicode whitespace/dash). InferenceOptions gains the normalization opt-ins, and AbstractDL applies them offset-safely so that NameFinderDL and DocumentCategorizerDL decode spans back to the original text.

Note: this only depends on the foundation (#1103), not the tokenizer. Once #1103 merges, this PR can be re-targeted to main and merged independently of part 2 (#1104).

krickert · 2026-06-20T12:37:31Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

#1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
#1104 (2/4): UAX #29 word tokenizer + layered Term model
#1105 (3/4): Offset-safe input normalization in the DL components
#1106 (4/4): Documentation

Supersedes #1101.

Copilot

Pull request overview

This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.

Changes:

Add Unicode-aware whitespace chunking in AbstractDL and use it in NameFinderDL / DocumentCategorizerDL instead of text.split("\\s+").
Replace regex-based span localization in NameFinderDL with a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively.
Introduce opt-in InferenceOptions toggles for whitespace/dash folding and document the behavior; add targeted regression tests.
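
To make the cursor-based matching idea in the list above concrete, here is a minimal sketch under stated assumptions. It is not the code in NameFinderDL: the class `CursorSpanMatcherSketch` and its methods are hypothetical names, and the whitespace test is a JDK-only stand-in for the CharClass engine, whose API is not shown in this PR conversation. A space in the decoded span consumes one or more Unicode whitespace code points in the source text, and every other code point is compared case-insensitively.

```java
import java.util.Arrays;

class CursorSpanMatcherSketch {

  // Assumption: JDK stand-in for the PR's CharClass whitespace test.
  // isSpaceChar covers NBSP (U+00A0), which isWhitespace excludes.
  static boolean isUnicodeSpace(int cp) {
    return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
  }

  static boolean sameIgnoringCase(int a, int b) {
    return a == b
        || Character.toLowerCase(a) == Character.toLowerCase(b)
        || Character.toUpperCase(a) == Character.toUpperCase(b);
  }

  /** Returns the exclusive end offset if span matches text at start, else -1. */
  static int matchAt(String text, int start, String span) {
    int t = start; // cursor in the source text
    int s = 0;     // cursor in the decoded span
    while (s < span.length()) {
      int sc = span.codePointAt(s);
      if (sc == ' ') {
        // One span space matches a run of one or more whitespace code points.
        int before = t;
        while (t < text.length() && isUnicodeSpace(text.codePointAt(t))) {
          t += Character.charCount(text.codePointAt(t));
        }
        if (t == before) {
          return -1;
        }
        s++;
      } else {
        if (t >= text.length()) {
          return -1;
        }
        int tc = text.codePointAt(t);
        if (!sameIgnoringCase(sc, tc)) {
          return -1;
        }
        t += Character.charCount(tc);
        s += Character.charCount(sc);
      }
    }
    return t;
  }

  /** Returns {start, end} of the first match at or after from, or null. */
  static int[] find(String text, String span, int from) {
    if (span.isEmpty()) {
      return null;
    }
    int i = from;
    while (i < text.length()) {
      int end = matchAt(text, i, span);
      if (end >= 0) {
        return new int[] {i, end};
      }
      i += Character.charCount(text.codePointAt(i));
    }
    return null;
  }

  public static void main(String[] args) {
    String text = "Flights to New\u00A0York";
    // The decoded span uses a plain space and lower case; the source has NBSP.
    System.out.println(Arrays.toString(find(text, "new york", 0))); // prints [11, 19]
  }
}
```

Because the cursor walks the original text directly, the returned offsets are already in source coordinates and no regex escaping of the span is needed.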

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java | Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java | Adds opt-in flags to normalize whitespace and dashes before inference. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java | Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java | Applies optional normalization and switches chunking to Unicode whitespace. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java | New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java | Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher. |
| opennlp-core/opennlp-ml/opennlp-dl/README.md | Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options. |
| opennlp-core/opennlp-ml/opennlp-dl/pom.xml | Makes `opennlp-runtime` a compile dependency to use `CharClass` at runtime. |

The behavioral DL integration, isolated for focused review.

- opennlp-dl now compile-depends on opennlp-runtime for the CharClass engine used to chunk model input on Unicode whitespace/dash boundaries (was test scope).
- InferenceOptions gains the input-normalization opt-ins; AbstractDL applies them offset-safely so that NameFinderDL and DocumentCategorizerDL decode predicted spans back to the original, un-normalized text.
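
For readers wondering why chunking needs a Unicode-aware test at all, the following sketch contrasts `text.split("\\s+")` with a code-point scan. It is an illustration under assumptions, not the AbstractDL helper: `UnicodeChunkingSketch` and `chunk` are hypothetical names, the whitespace test uses only JDK `Character` methods in place of CharClass, and dash boundaries are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class UnicodeChunkingSketch {

  // Assumption: JDK stand-in for the CharClass whitespace class.
  static boolean isUnicodeSpace(int cp) {
    return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
  }

  /** Splits text into maximal runs of non-whitespace code points. */
  static List<String> chunk(String text) {
    List<String> chunks = new ArrayList<>();
    int start = -1; // start of the current chunk, or -1 when between chunks
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      if (isUnicodeSpace(cp)) {
        if (start >= 0) {
          chunks.add(text.substring(start, i));
          start = -1;
        }
      } else if (start < 0) {
        start = i;
      }
      i += Character.charCount(cp);
    }
    if (start >= 0) {
      chunks.add(text.substring(start));
    }
    return chunks;
  }

  public static void main(String[] args) {
    String text = "New\u00A0York\u3000City";
    // Java's default \s is ASCII-only, so NBSP and U+3000 do not split.
    System.out.println(text.split("\\s+").length); // prints 1
    System.out.println(chunk(text));               // prints [New, York, City]
  }
}
```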

Input folding can change UTF-16 length (a supplementary dash collapses to one hyphen), which shifted the spans NameFinderDL returned so they no longer aligned with the original input. Carry an offset map instead of restricting what folds.

- AbstractDL.normalizeInputMapped returns the folded text plus an OffsetMap back to the original, so positions map correctly across any length change (shrink, or a future expansion such as ellipsis). The plain normalizeInput stays for DocumentCategorizerDL, which returns no positions and is unaffected.
- NameFinderDL adds findInOriginal(String[]), returning spans in original-input coordinates via that map; TokenNameFinder find(String[]) is preserved but deprecated (equivalent when no fold changes length).
- Model-free tests cover the supplementary-dash shrink and the length-preserving identity case.
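
The offset-map approach described in this commit can be sketched roughly as follows. This is a simplified illustration, not the PR's OffsetMap or normalizeInputMapped: `OffsetMapFoldSketch`, `Folded`, and `isDash` are hypothetical, the dash set is a small hand-picked sample, and only shrinking folds are handled. The map stores, for each UTF-16 index in the folded text, the index it came from in the original, plus one trailing entry so that exclusive end offsets also map.

```java
import java.util.Arrays;

class OffsetMapFoldSketch {

  /** Folded text plus a folded-index to original-index table. */
  static final class Folded {
    final String text;
    final int[] toOriginal; // length is text.length() + 1

    Folded(String text, int[] toOriginal) {
      this.text = text;
      this.toOriginal = toOriginal;
    }

    /** Maps a folded [start, end) span back to original coordinates. */
    int[] toOriginalSpan(int start, int end) {
      return new int[] {toOriginal[start], toOriginal[end]};
    }
  }

  // Assumption: illustrative dash set only. U+10EAD is a supplementary-plane
  // dash (YEZIDI HYPHENATION MARK) that occupies two UTF-16 units.
  static boolean isDash(int cp) {
    return (cp >= 0x2010 && cp <= 0x2015) || cp == 0x2212 || cp == 0x10EAD;
  }

  static Folded fold(String original) {
    StringBuilder out = new StringBuilder(original.length());
    int[] map = new int[original.length() + 1];
    int i = 0;
    while (i < original.length()) {
      int cp = original.codePointAt(i);
      int width = Character.charCount(cp);
      if (isDash(cp)) {
        map[out.length()] = i; // one hyphen stands for the whole code point
        out.append('-');
      } else {
        for (int k = 0; k < width; k++) {
          map[out.length()] = i + k;
          out.append(original.charAt(i + k));
        }
      }
      i += width;
    }
    map[out.length()] = original.length(); // sentinel for end offsets
    return new Folded(out.toString(), Arrays.copyOf(map, out.length() + 1));
  }

  public static void main(String[] args) {
    String original = "A" + new String(Character.toChars(0x10EAD)) + "B Paris";
    Folded folded = fold(original);
    int start = folded.text.indexOf("Paris"); // 4 in folded coordinates
    int[] span = folded.toOriginalSpan(start, start + 5);
    System.out.println(Arrays.toString(span));                 // prints [5, 10]
    System.out.println(original.substring(span[0], span[1]));  // prints Paris
  }
}
```

Without the map, the folded span [4, 9) applied to the original string would cut one UTF-16 unit early, which is the misalignment the commit message describes.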

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103

Draft

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Draft

OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57

Copilot AI reviewed Jun 20, 2026

Comment thread opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java

krickert added 2 commits June 20, 2026 16:15

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16

krickert wants to merge 2 commits into OPENNLP-1850-2-tokenizer from OPENNLP-1850-3-dl
