OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105

Draft
krickert wants to merge 2 commits into OPENNLP-1850-2-tokenizer from OPENNLP-1850-3-dl

@krickert

Contributor

Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.

opennlp-dl now compile-depends on opennlp-runtime for CharClass, which is used to chunk input on Unicode whitespace/dash boundaries. InferenceOptions gains the input-normalization opt-ins, and AbstractDL applies them offset-safely so that NameFinderDL and DocumentCategorizerDL decode spans back to the original text.

Note: this depends only on the foundation (#1103), not on the tokenizer. Once #1103 merges, this PR can be re-targeted to main and merged independently of part 2 (#1104).

@krickert

Contributor Author

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. #1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. #1104 (2/4): UAX #29 word tokenizer + layered Term model
  3. #1105 (3/4): Offset-safe input normalization in the DL components
  4. #1106 (4/4): Documentation (Unicode normalization and the UAX #29 tokenizer)

Supersedes #1101.

Copilot AI left a comment

Contributor

Pull request overview

This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.
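
To illustrate why chunking on Unicode whitespace matters, here is a minimal sketch. It is not the AbstractDL code from this PR: the class `UnicodeChunkingSketch`, the helper `isUnicodeWhitespace`, and its definition of whitespace (Java whitespace plus all Unicode space separators) are assumptions made for illustration, and the CharClass engine may classify code points differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the AbstractDL implementation from this PR. */
final class UnicodeChunkingSketch {

  /**
   * Assumption: "Unicode whitespace" means Java whitespace plus every
   * Unicode space separator, which includes NBSP (U+00A0).
   */
  static boolean isUnicodeWhitespace(int cp) {
    return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
  }

  /** Splits text into maximal runs of non-whitespace code points. */
  static List<String> chunk(String text) {
    List<String> chunks = new ArrayList<>();
    int start = -1; // start of the current chunk, or -1 when between chunks
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      if (isUnicodeWhitespace(cp)) {
        if (start >= 0) {
          chunks.add(text.substring(start, i));
          start = -1;
        }
      } else if (start < 0) {
        start = i;
      }
      i += Character.charCount(cp); // step over surrogate pairs as one unit
    }
    if (start >= 0) {
      chunks.add(text.substring(start));
    }
    return chunks;
  }

  public static void main(String[] args) {
    // NBSP between "New" and "York", ideographic space before "City".
    String text = "New\u00A0York\u3000City is big";
    // Default \s is ASCII-only, so only the two plain spaces split.
    System.out.println(text.split("\\s+").length); // prints 3
    System.out.println(chunk(text)); // prints [New, York, City, is, big]
  }
}
```

With the default `\s` (ASCII only), the NBSP and the ideographic space stay inside the first chunk, so the model would see "New York City" as a single piece of input.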

Changes:

  • Add Unicode-aware whitespace chunking in AbstractDL and use it in NameFinderDL / DocumentCategorizerDL instead of text.split("\\s+").
  • Replace regex-based span localization in NameFinderDL with a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively.
  • Introduce opt-in InferenceOptions toggles for whitespace/dash folding and document the behavior; add targeted regression tests.
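
The cursor-based matching described in the second bullet could look roughly like the sketch below. This is an illustration under assumptions, not the NameFinderDL code: `CursorSpanMatcherSketch`, `matchAt`, `find`, and the whitespace definition are hypothetical. A space in the decoded span matches one or more Unicode whitespace code points in the source text, and every other code point is compared case-insensitively.

```java
/** Hypothetical sketch of cursor-based span localization; not the PR's code. */
final class CursorSpanMatcherSketch {

  /** Assumption: Java whitespace plus all Unicode space separators. */
  static boolean isUnicodeWhitespace(int cp) {
    return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
  }

  static boolean sameIgnoringCase(int a, int b) {
    return a == b
        || Character.toLowerCase(a) == Character.toLowerCase(b)
        || Character.toUpperCase(a) == Character.toUpperCase(b);
  }

  /** Returns the exclusive end offset if span matches text at start, else -1. */
  static int matchAt(String text, int start, String span) {
    int t = start;
    int s = 0;
    while (s < span.length()) {
      int sc = span.codePointAt(s);
      if (sc == ' ') {
        // A span space consumes a run of one or more whitespace code points.
        int ws = t;
        while (ws < text.length() && isUnicodeWhitespace(text.codePointAt(ws))) {
          ws += Character.charCount(text.codePointAt(ws));
        }
        if (ws == t) {
          return -1;
        }
        t = ws;
      } else {
        if (t >= text.length()) {
          return -1;
        }
        int tc = text.codePointAt(t);
        if (!sameIgnoringCase(sc, tc)) {
          return -1;
        }
        t += Character.charCount(tc);
      }
      s += Character.charCount(sc);
    }
    return t;
  }

  /** Returns {start, end} of the first match at or after cursor, or null. */
  static int[] find(String text, int cursor, String span) {
    int i = cursor;
    while (i < text.length()) {
      int end = matchAt(text, i, span);
      if (end >= 0) {
        return new int[] {i, end};
      }
      i += Character.charCount(text.codePointAt(i));
    }
    return null;
  }

  public static void main(String[] args) {
    String text = "George\u00A0Washington met george   washington.";
    int[] first = find(text, 0, "george washington");
    int[] second = find(text, first[1], "george washington"); // advance the cursor
    System.out.println(first[0] + ".." + first[1]);   // prints 0..17
    System.out.println(second[0] + ".." + second[1]); // prints 22..41
  }
}
```

Passing the previous match's end as the next cursor keeps repeated mentions from resolving to the same position, and the span text never has to be escaped the way it would when embedded in a regular expression.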

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java | Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java | Adds opt-in flags to normalize whitespace and dashes before inference.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java | Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java | Applies optional normalization and switches chunking to Unicode whitespace.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java | New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java | Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher.
opennlp-core/opennlp-ml/opennlp-dl/README.md | Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options.
opennlp-core/opennlp-ml/opennlp-dl/pom.xml | Makes opennlp-runtime a compile dependency to use CharClass at runtime.

krickert added 2 commits June 20, 2026 16:15
The behavioral DL integration, isolated for focused review.

- opennlp-dl now compile-depends on opennlp-runtime for the CharClass engine
  used to chunk model input on Unicode whitespace/dash boundaries (was test scope).
- InferenceOptions gains the input-normalization opt-ins; AbstractDL applies them
  offset-safely so that NameFinderDL and DocumentCategorizerDL decode predicted
  spans back to the original, un-normalized text.

Input folding can change UTF-16 length (a supplementary dash collapses to one
hyphen), which shifted the spans NameFinderDL returned so they no longer aligned
with the original input. Carry an offset map instead of restricting what folds.

- AbstractDL.normalizeInputMapped returns the folded text plus an OffsetMap back
  to the original, so positions map correctly across any length change (shrink, or
  a future expansion such as ellipsis). The plain normalizeInput stays for
  DocumentCategorizerDL, which returns no positions and is unaffected.
- NameFinderDL adds findInOriginal(String[]), returning spans in original-input
  coordinates via that map; TokenNameFinder find(String[]) is preserved but
  deprecated (equivalent when no fold changes length).
- Model-free tests cover the supplementary-dash shrink and the length-preserving
  identity case.
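
The length-change problem and the offset-map remedy from the second commit can be sketched as follows. This is a hedged illustration, not the PR's `normalizeInputMapped` or `OffsetMap`: the class `FoldWithOffsetsSketch`, the `Folded` holder, `foldDashes`, and the small dash set in `isDash` are hypothetical. The example uses U+10EAD (YEZIDI HYPHENATION MARK), a supplementary dash that occupies two UTF-16 units and folds to a one-unit hyphen.

```java
import java.util.Arrays;

/** Hypothetical sketch of folding with an offset map; not the PR's code. */
final class FoldWithOffsetsSketch {

  /** Folded text plus a map from each folded UTF-16 index to its original index. */
  static final class Folded {
    final String text;
    final int[] toOriginal; // length text.length() + 1; last entry is the end sentinel

    Folded(String text, int[] toOriginal) {
      this.text = text;
      this.toOriginal = toOriginal;
    }

    int originalOffset(int foldedOffset) {
      return toOriginal[foldedOffset];
    }
  }

  /** Assumption: a small illustrative dash set; the real CharClass set may differ. */
  static boolean isDash(int cp) {
    return (cp >= 0x2010 && cp <= 0x2015) || cp == 0x10EAD;
  }

  static Folded foldDashes(String original) {
    StringBuilder out = new StringBuilder(original.length());
    // Folding here only shrinks, so original.length() + 1 entries are enough.
    int[] map = new int[original.length() + 1];
    int i = 0;
    while (i < original.length()) {
      int cp = original.codePointAt(i);
      int units = Character.charCount(cp);
      if (isDash(cp)) {
        map[out.length()] = i; // one hyphen stands in for the whole dash
        out.append('-');
      } else {
        for (int j = 0; j < units; j++) {
          map[out.length()] = i + j; // copied units keep their relative position
          out.append(original.charAt(i + j));
        }
      }
      i += units;
    }
    map[out.length()] = original.length(); // sentinel for exclusive end offsets
    return new Folded(out.toString(), Arrays.copyOf(map, out.length() + 1));
  }

  public static void main(String[] args) {
    String original = "Jean\uD803\uDEADLuc went"; // U+10EAD between Jean and Luc
    Folded folded = foldDashes(original);
    // A span [0, 8) over "Jean-Luc" in folded text maps to [0, 9) in the original.
    int start = folded.originalOffset(0);
    int end = folded.originalOffset(8);
    System.out.println(folded.text);         // prints Jean-Luc went
    System.out.println(start + ".." + end);  // prints 0..9
  }
}
```

Applying the folded offsets `[0, 8)` directly to the original would cut the name one UTF-16 unit short, which is the misalignment the commit message describes. When no fold changes length the map is the identity, so both coordinate systems agree.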
@krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a June 20, 2026 20:16
@krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 June 20, 2026 20:16
2 participants