OPENNLP-910: Add checkstyle #29

Closed. smarthi wants to merge 1 commit into apache:trunk from smarthi:OPENNLP-910

Conversation

smarthi (Member) commented Jan 3, 2017

No description provided.

@smarthi smarthi changed the title OPENNLP-910: Add checkstyle OPENNLP-910: [WIP: Do Not Merge] Add checkstyle Jan 3, 2017
Comment thread checkstyle.xml
</module>
<module name="OverloadMethodsDeclarationOrder"/>
<module name="VariableDeclarationUsageDistance"/>
<module name="CustomImportOrder">

Member

These days the IDE takes care of adding imports for us. Do we need to enforce an order? Does that make sense?

Member Author

Yes, most projects I have seen enforce an order on the imports to keep them consistent throughout. If anything, this will ensure a uniform standard across the project.
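
For illustration, an import-order rule of the kind discussed here could be configured roughly as follows. This is a minimal sketch: the group order and the property values are assumptions, not the settings from this PR's checkstyle.xml.

```xml
<module name="CustomImportOrder">
  <!-- Assumed grouping: static imports first, then java/javax, then everything else. -->
  <property name="customImportOrderRules"
            value="STATIC###STANDARD_JAVA_PACKAGE###THIRD_PARTY_PACKAGE"/>
  <!-- Assumed: keep imports sorted inside each group so IDE output stays uniform. -->
  <property name="sortImportsInGroupAlphabetically" value="true"/>
  <!-- Assumed: require a blank line between groups. -->
  <property name="separateLineBetweenGroups" value="true"/>
</module>
```

Most IDEs can be set to produce the same grouping, so the check mainly catches editors that are not configured that way.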

Comment thread checkstyle.xml
<message key="ws.notPreceded"
value="GenericWhitespace ''{0}'' is not preceded with whitespace."/>
</module>
<module name="Indentation">

Member

+1. This one is important and will make sure people configure their editors correctly before modifying our code.
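
As a rough sketch of what such an Indentation rule can pin down, the fragment below uses two-space offsets. The values are assumptions chosen for illustration and are not confirmed by this thread.

```xml
<module name="Indentation">
  <!-- Assumed values: two spaces per nesting level, four for wrapped lines. -->
  <property name="basicOffset" value="2"/>
  <property name="braceAdjustment" value="0"/>
  <property name="caseIndent" value="2"/>
  <property name="throwsIndent" value="4"/>
  <property name="lineWrappingIndentation" value="4"/>
  <property name="arrayInitIndent" value="2"/>
</module>
```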

Comment thread checkstyle.xml Outdated
<message key="name.invalidPattern"
value="Parameter name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="CatchParameterName">

Member

This does not allow e as a parameter name? Is it really better if people have to write ex or similar? I would remove this one.

Member Author

Agreed, we can remove this.
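
The thread settles on removing the check. For comparison only, an alternative would have been to keep it with a relaxed pattern that accepts single-letter names such as e. The pattern below is an assumption for illustration, not something proposed in this PR.

```xml
<module name="CatchParameterName">
  <!-- Hypothetical relaxed pattern: any lowerCamelCase name, including a single letter like "e". -->
  <property name="format" value="^[a-z][a-zA-Z0-9]*$"/>
</module>
```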

Comment thread checkstyle.xml Outdated
<property name="allowByTailComment" value="true"/>
<property name="allowNonPrintableEscapes" value="true"/>
</module>
<module name="LineLength">

Member

The code conventions say 80 to 100 as maximum line length. Maybe we should define a hard limit at 110 or 120 rather than 140?

smarthi (Member Author), Jan 3, 2017

We can set that to 110.
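
A sketch of the agreed limit is shown below. Only the 110 comes from this thread. The ignorePattern is an assumption, added so that long import lines and URLs would not trip the check.

```xml
<module name="LineLength">
  <!-- Hard limit of 110 characters, as agreed above. -->
  <property name="max" value="110"/>
  <!-- Assumed exemption for package/import statements and URLs. -->
  <property name="ignorePattern" value="^package.*|^import.*|http://|https://|ftp://"/>
</module>
```

Where this module sits in the file depends on the Checkstyle version. Older releases expect it inside TreeWalker, while releases from 8.24 on expect it directly under Checker.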

Comment thread checkstyle.xml Outdated
<property name="tagOrder" value="@param, @return, @throws, @deprecated"/>
<property name="target" value="CLASS_DEF, INTERFACE_DEF, ENUM_DEF, METHOD_DEF, CTOR_DEF, VARIABLE_DEF"/>
</module>
<module name="JavadocMethod">

Member

There are many places where JavaDoc is missing, and JavaDoc only has value if someone actually takes the time to write something. I suggest we either leave this check disabled for now or really improve the JavaDoc. But I am against adding lots of empty, no-value JavaDoc just to make this check happy.
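
One way to keep the module in the file while leaving it switched off is to set its severity to ignore. This is a sketch of that option, not necessarily what the PR ended up doing. Deleting or commenting out the module would have the same effect.

```xml
<module name="JavadocMethod">
  <!-- Assumption: keep the rule listed but silent until the JavaDoc itself is improved. -->
  <property name="severity" value="ignore"/>
</module>
```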

kottmann (Member) commented Jan 3, 2017

I suggest excluding the Porter and Snowball stemmers from checkstyle for now.

kottmann (Member) commented Jan 6, 2017

It looks like the current config is not excluding the stemmers. This seems to work:
<excludes>**/stemmer/**/*</excludes>
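
For context, the snippet below sketches where such an excludes element would typically sit, assuming the build runs Checkstyle through maven-checkstyle-plugin. The plugin coordinates and configLocation value are assumptions. Only the excludes pattern is taken from the comment above.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-checkstyle-plugin</artifactId>
  <configuration>
    <!-- Assumed location of the rules file added by this PR. -->
    <configLocation>checkstyle.xml</configLocation>
    <!-- Skip everything under any stemmer package, as suggested above. -->
    <excludes>**/stemmer/**/*</excludes>
  </configuration>
</plugin>
```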

@smarthi smarthi changed the title OPENNLP-910: [WIP: Do Not Merge] Add checkstyle OPENNLP-910: Add checkstyle Jan 6, 2017
@asfgit asfgit closed this in a83ea28 Jan 6, 2017
@smarthi smarthi deleted the OPENNLP-910 branch January 6, 2017 22:19
asfgit pushed a commit that referenced this pull request Apr 16, 2017
asfgit pushed a commit that referenced this pull request Apr 20, 2017
krickert added a commit that referenced this pull request Jun 19, 2026
Additive Unicode text handling for matching, search, and tokenization
preprocessing (new types only, no breaking changes).

UAX #29 word tokenizer (opennlp.tools.tokenize.uax29):
- WordSegmenter, WordTokenizer (implements opennlp.tools.tokenize.Tokenizer),
  and WordType. A single-pass, table-driven engine with O(1) Word_Break lookups
  and no regular expression; 100% conformant on the official Unicode 17.0
  WordBreakTest suite (1944/1944). Offset-preserving spans and a zero-allocation
  streaming API.

Text normalization (opennlp.tools.util.normalizer):
- The layered Term model (Dimension, Term, TermAnalyzer): a token as a stack of
  normalization layers (NFC, NFKC, whitespace, dash, case fold, accent fold,
  confusable fold, stem, lemma) with eager configured layers, lazy memoized
  extras, and O(1) peel; integrates the UAX #29 tokenizer and the existing
  Stemmer/Lemmatizer as the token-level layers.
- Confusable (homoglyph) skeleton folding per UTS #39, from the bundled Unicode
  security data.
- Per-language profiles (NormalizationProfile, NormalizationProfiles) mirroring
  the Snowball algorithm set with LanguageDetector fallback, including a German
  DIN 5007-2 umlaut fold (a-umlaut to ae, eszett to ss).
- First-class builder configuration: whitespace/dash fold targets, locale case
  folding, accent-fold script scope, and max token length, over a general
  transform(dimension, normalizer) hook.

Documentation: a Text Normalization chapter and a UAX #29 tokenizer section in
the manual; the bundled Unicode data files (WordBreakProperty, emoji-data,
WordBreakTest, confusables) are attributed in NOTICE.

Tests: UAX #29 boundary conformance and unit tests, and unit tests for the
normalizer engine, term model, confusables, language profiles, and German fold.
krickert added a commit that referenced this pull request Jun 20, 2026
Builds on the normalization foundation.

- opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer
  implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary
  engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic
  data, validated against the official WordBreakTest conformance suite (1944/1944).
- The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per
  token over the Dimension ladder, the per-language NormalizationProfile registry,
  and the confusable-fold coverage.
- Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE,
  rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest
  data files, and restores Dimension's javadoc cross-links now that the Term
  layer is present.
krickert added a commit that referenced this pull request Jun 20, 2026
- WordBoundaryConformanceTest: guard the conformance resource stream with
  Objects.requireNonNull and a clear message instead of an opaque NPE in
  InputStreamReader, and remove the unused NO_BOUNDARY constant.
- NormalizationProfiles.forLanguage: fail loud on a null language argument at the
  public entry point, with a null-rejection test.
krickert added a commit that referenced this pull request Jun 20, 2026
Adds the normalizer manual chapter and updates the tokenizer, doccat, namefinder,
and introduction chapters (and the master opennlp.xml) to cover the new
normalization pipeline and word tokenizer.

2 participants