OPENNLP-910: Add checkstyle#29
Conversation
| </module> | ||
| <module name="OverloadMethodsDeclarationOrder"/> | ||
| <module name="VariableDeclarationUsageDistance"/> | ||
| <module name="CustomImportOrder"> |
There was a problem hiding this comment.
These days the IDE takes care of adding imports for us, do we need to enforce an order? Does that make sense?
There was a problem hiding this comment.
yes, most projects i had seen enforce an order on the imports to keep them consistent through out. If anything this will ensure a uniform standard across the project.
| <message key="ws.notPreceded" | ||
| value="GenericWhitespace ''{0}'' is not preceded with whitespace."/> | ||
| </module> | ||
| <module name="Indentation"> |
There was a problem hiding this comment.
+1 This one is important, and will make sure people configure their editor correctly to modify our code.
| <message key="name.invalidPattern" | ||
| value="Parameter name ''{0}'' must match pattern ''{1}''."/> | ||
| </module> | ||
| <module name="CatchParameterName"> |
There was a problem hiding this comment.
This does not allow for e as parameter name? Is it better if people have to write ex or so ? I would remove this one.
| <property name="allowByTailComment" value="true"/> | ||
| <property name="allowNonPrintableEscapes" value="true"/> | ||
| </module> | ||
| <module name="LineLength"> |
There was a problem hiding this comment.
The code conventions say 80 to 100 as maximum line length. Maybe we should define a hard limit at 110 or 120 rather than 140?
There was a problem hiding this comment.
We can set that to 110.
| <property name="tagOrder" value="@param, @return, @throws, @deprecated"/> | ||
| <property name="target" value="CLASS_DEF, INTERFACE_DEF, ENUM_DEF, METHOD_DEF, CTOR_DEF, VARIABLE_DEF"/> | ||
| </module> | ||
| <module name="JavadocMethod"> |
There was a problem hiding this comment.
There are many places where JavaDoc is missing, and JavaDoc only has value if someone actually takes time to write something. I suggest to not enable that for now or we really improve the JavaDoc. But I am against adding lots of empty no-value JavaDoc just to make this check happy.
|
I suggest to exclude the porter and snowball stemmer from checkstyle for now. |
|
Looks like the current config is not excluding the stemmers, this seems to work: |
Additive Unicode text handling for matching, search, and tokenization preprocessing (new types only, no breaking changes). UAX #29 word tokenizer (opennlp.tools.tokenize.uax29): - WordSegmenter, WordTokenizer (implements opennlp.tools.tokenize.Tokenizer), and WordType. A single-pass, table-driven engine with O(1) Word_Break lookups and no regular expression; 100% conformant on the official Unicode 17.0 WordBreakTest suite (1944/1944). Offset-preserving spans and a zero-allocation streaming API. Text normalization (opennlp.tools.util.normalizer): - The layered Term model (Dimension, Term, TermAnalyzer): a token as a stack of normalization layers (NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma) with eager configured layers, lazy memoized extras, and O(1) peel; integrates the UAX #29 tokenizer and the existing Stemmer/Lemmatizer as the token-level layers. - Confusable (homoglyph) skeleton folding per UTS #39, from the bundled Unicode security data. - Per-language profiles (NormalizationProfile, NormalizationProfiles) mirroring the Snowball algorithm set with LanguageDetector fallback, including a German DIN 5007-2 umlaut fold (a-umlaut to ae, eszett to ss). - First-class builder configuration: whitespace/dash fold targets, locale case folding, accent-fold script scope, and max token length, over a general transform(dimension, normalizer) hook. Documentation: a Text Normalization chapter and a UAX #29 tokenizer section in the manual; the bundled Unicode data files (WordBreakProperty, emoji-data, WordBreakTest, confusables) are attributed in NOTICE. Tests: UAX #29 boundary conformance and unit tests, and unit tests for the normalizer engine, term model, confusables, language profiles, and German fold.
Builds on the normalization foundation. - opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic data, validated against the official WordBreakTest conformance suite (1944/1944). - The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per token over the Dimension ladder, the per-language NormalizationProfile registry, and the confusable-fold coverage. - Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE, rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest data files, and restores Dimension's javadoc cross-links now that the Term layer is present.
- WordBoundaryConformanceTest: guard the conformance resource stream with Objects.requireNonNull and a clear message instead of an opaque NPE in InputStreamReader, and remove the unused NO_BOUNDARY constant. - NormalizationProfiles.forLanguage: fail loud on a null language argument at the public entry point, with a null-rejection test.
Adds the normalizer manual chapter and updates the tokenizer, doccat, namefinder, and introduction chapters (and the master opennlp.xml) to cover the new normalization pipeline and word tokenizer.
No description provided.