OPENNLP-1846: Fix NameFinderDL only worked with Person, expand to all types by krickert · Pull Request #1086 · apache/opennlp

krickert · 2026-06-14T00:38:09Z

NameFinderDL only decoded B-PER/I-PER and ignored the remaining labels emitted by the model, and NameFinderDL put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically:

any B- begins a span whose type is the label minus the B- prefix (B-ORG -> ORG), and the span extends while the following labels are I-<same type> (findSpanEnd generalized from I-PER to I-<type>);
Span.getType() now reports the entity label (PER, ORG, LOC, ...) rather than the covered text.

The ids2Labels map now fully drives recognition for any BIO-tagged token classification model. The B_PER/I_PER constants are retained for reference.

Update NameFinderDLEval: assert span types (PER/LOC) and covered text, and the additional location entity that the person-only decoder previously dropped ("United States" in the George Washington sentence).

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

Copilot

Pull request overview

This PR fixes NameFinderDL’s BIO decoding so it recognizes all entity types emitted by a BIO-tagged token-classification ONNX model (not just PER), and ensures Span.getType() contains the entity label (e.g., PER, LOC) rather than the matched text. It also updates the DL eval tests to assert entity types and to cover the previously dropped location entity.

Changes:

Generalize BIO span decoding in NameFinderDL from B-PER/I-PER to B-<TYPE>/I-<TYPE> and set Span type to <TYPE>.
Update NameFinderDLEval assertions to validate span types and covered text, including an added LOC entity case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java	Generalizes BIO decoding and changes `Span` type to be the entity label.
opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java	Updates tests to assert entity labels in `Span.getType()` and checks an additional `LOC` span.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

krickert · 2026-06-14T10:22:36Z

I'll wait for #1085 and #1084 before I move this out of draft.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

…decoding NameFinderDL only decoded B-PER/I-PER and put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically and harden it: - Any B-<TYPE> begins a span whose type is the label minus the B- prefix (B-ORG -> ORG), extending while the following labels are I-<same type>. Span.getType() now reports the entity label (PER, ORG, LOC, ...) and ids2Labels fully drives recognition for any BIO-tagged model. - isBeginLabel() requires a non-empty type after "B-", so a malformed "B-" label no longer starts an empty-type span. An argmax index with no entry in ids2Labels fails loudly instead of being silently skipped. - Span.getProb() is now a numerically stable softmax over the token's label scores (bounded to [0,1]) instead of the raw max logit; handles +Inf, all-(-Inf) and NaN edge cases. - find() inference is fail-loud and consistent with the sibling DocumentCategorizerDL: failures surface as IllegalStateException (cause preserved) and an unexpected/empty model-output shape is its own loud failure, rather than a bare RuntimeException or raw ClassCastException. - Floor the character-search cursor at each sentence's start (via sentPosDetect) and thread it forward across that sentence's chunks, so a repeated entity surface form is located at its own occurrence instead of being re-matched against an earlier one -- which previously emitted duplicate or mis-located spans for multi-sentence/multi-chunk input. - Span text reconstruction matches the source with flexible whitespace (\s*), so entities whose wordpiece tokenization splits internal punctuation or "&" apart (U.S.A, AT&T) are still located instead of silently dropped. - Remove the now-unused SpanEnd record. - Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and expose labelProbability()/maxIndex() for unit testing without an ONNX model; add NameFinderDLTest coverage for entity types, bounded and edge-case probabilities, malformed begin labels, wordpiece reconstruction, internal-punctuation and case-insensitive matching, missing labels, and cursor-threaded span location. - Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new all-types output (the George-Washington input now yields PER + LOC) and assert span types and covered text.

mawiesne · 2026-06-17T11:03:22Z

@krickert Review of this PR planned for Thursday (CEST)

rzo1

Thanks for the PR. I have the following questions / comments. None are blockers, but they're behavior changes that should be a conscious decision:

Robustness regression? -> unmapped label ids
predictLabel now throws IllegalArgumentException when ids2Labels.get(argmax) is null, and this runs for every token. Previously an unmapped index degraded gracefully to O. Now a single token mispredicting to an unmapped index throws away the entire find() result for the whole document. This is harmless with a complete label map, but a partial map (an easy user mistake) turns into a hard failure. I'd suggest either keeping the graceful "treat as O" behavior for unmapped indices, or documenting on the constructor that ids2Labels must be exhaustive over the model's output indices.

Edge cases (minor)

findByRegex searches [searchStart, text.length()), i.e. it is not bounded by the sentence end. If a reconstructed span fails to match within its own sentence, it can mislocate onto the same surface form in a later sentence. The per-sentence flooring comment implies tighter bounding than the code actually delivers; consider capping the region at the sentence end.
A decoded entity that can't be located is now silently dropped (vs. the old fallback that returned the original text). This is cleaner, but a logger.debug when a match fails would help diagnose reconstruction mismatches.

Note that these (together with the getType()/getProb() changes) are behavior changes, so they break the existing contract which is likely fine for a new release line, as long as we document them in the release notes.

mawiesne

Thx @krickert for the PR. I've left comments for clarification of inconsistencies and quality improvements, mostly missing comments / Javadoc.

Keep unmapped label ids graceful, bound decoded span lookup to the current sentence, add diagnostics for unlocated decoded spans, and tighten exception types/messages plus helper documentation.

…d labels Make the public no-space token constants immutable (Set.of instead of mutable arrays) while keeping them public for third-party use. Fail loud on an unmapped model output index: predictLabel now throws an IllegalStateException naming the index instead of degrading the token to "O", and the constructors document that ids2Labels must be exhaustive over the model's output indices. Also document the IllegalArgumentException that find() can raise on a vocabulary/model mismatch. Add edge-case decoding tests: token/score count mismatch, orphan I- labels, adjacent entities of different types, multi-token minimum-probability semantics, repeated entities at distinct offsets within one call, regex metacharacters in span text, and search-start clamping past end of text.

mawiesne · 2026-06-19T11:06:04Z

Ping @rzo1 - anything left from your side?

rzo1 · 2026-06-19T11:09:07Z

No.

krickert requested a review from Copilot June 14, 2026 00:38

Copilot started reviewing on behalf of krickert June 14, 2026 00:38 View session

Copilot AI reviewed Jun 14, 2026

View reviewed changes

Comment thread opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java Outdated

Comment thread opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java Outdated

krickert mentioned this pull request Jun 16, 2026

OPENNLP-1844 - Make opennlp-dl components thread-safe #1084

Merged

10 tasks

krickert force-pushed the OPENNLP-1846 branch from b829dd3 to 73e8da5 Compare June 16, 2026 12:20

krickert requested a review from Copilot June 16, 2026 12:22

Copilot started reviewing on behalf of krickert June 16, 2026 12:24 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

krickert force-pushed the OPENNLP-1846 branch from 73e8da5 to 4372555 Compare June 16, 2026 12:40

krickert force-pushed the OPENNLP-1846 branch from 4372555 to c22a7f5 Compare June 16, 2026 13:22

krickert marked this pull request as ready for review June 16, 2026 21:05

krickert requested review from atarora, jzonthemtn, mawiesne and rzo1 June 16, 2026 21:06

krickert changed the title ~~OPENNLP-1846 - Recognize all entity types in NameFinderDL, not only p…~~ OPENNLP-1846 - NameFinderDL only worked with Person, expand to all types Jun 17, 2026

mawiesne changed the title ~~OPENNLP-1846 - NameFinderDL only worked with Person, expand to all types~~ OPENNLP-1846: NameFinderDL only worked with Person, expand to all types Jun 17, 2026

mawiesne changed the title ~~OPENNLP-1846: NameFinderDL only worked with Person, expand to all types~~ OPENNLP-1846: Fix NameFinderDL only worked with Person, expand to all types Jun 17, 2026

mawiesne assigned krickert Jun 17, 2026

mawiesne added java Pull requests that update Java code tests Pull requests that add or update test code labels Jun 17, 2026

rzo1 reviewed Jun 17, 2026

View reviewed changes

mawiesne requested changes Jun 18, 2026

View reviewed changes

krickert added 2 commits June 18, 2026 04:23

OPENNLP-1846 - Address NameFinderDL review feedback

a3c423a

Keep unmapped label ids graceful, bound decoded span lookup to the current sentence, add diagnostics for unlocated decoded spans, and tighten exception types/messages plus helper documentation.

krickert requested review from mawiesne and rzo1 June 18, 2026 21:50

mawiesne approved these changes Jun 19, 2026

View reviewed changes

mawiesne requested review from rzo1 and removed request for rzo1 June 19, 2026 11:05

mawiesne merged commit 4dcf7ef into apache:main Jun 19, 2026
9 checks passed

Conversation

krickert commented Jun 14, 2026 • edited by mawiesne Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

For all changes:

For code changes:

For documentation related changes:

Note:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

krickert commented Jun 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mawiesne commented Jun 17, 2026

Uh oh!

rzo1 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mawiesne left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mawiesne commented Jun 19, 2026

Uh oh!

rzo1 commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

krickert commented Jun 14, 2026 •

edited by mawiesne

Loading

rzo1 left a comment •

edited

Loading