ADFA-4033 | Improve OCR sanitization, value cleaners, and widget support by jatezzz · Pull Request #1333 · appdevforall/CodeOnTheGo

jatezzz · 2026-05-22T21:44:13Z

Description

This PR enhances the OCR parsing engine by introducing new widget grammars (TextView, Button) and refining extraction rules. It simplifies the logic in NumberCleaner and DimensionCleaner, improves regex matching in OcrSanitizerRules for layout dimensions and colors, and handles more OCR edge cases (e.g., spurious trailing zeros in dimensions and letters misread as digits).

Details

Added ButtonGrammar and TextViewGrammar to UiGrammarValidator and WidgetGrammar.
Updated sanitization regex rules in OcrSanitizerRules to better catch ids, dimensions, and backgrounds.
Refactored ValueCleanersImpl to use mapped character replacements for numbers and streamlined dimension extraction.
Updated FuzzyAttributeParser to properly identify color tokens and numeric values.
Expanded WidgetTagParser to support new tag prefixes and made its OCR numeric-like character checks case-insensitive.

Ticket

ADFA-4033

Observation

ValueCleanersImpl was significantly refactored for readability and efficiency, particularly in how it cleans numbers and translates common OCR misreads. The fuzzy-matching threshold for colors was lowered from 75 to 70 to accommodate higher error rates in OCR scanning.

coderabbitai · 2026-05-22T21:52:59Z

📝 Walkthrough

Release Notes - ADFA-4033: Improve OCR Sanitization, Value Cleaners, and Widget Support

New widget grammars: added ButtonGrammar and TextViewGrammar and registered them in the UiGrammarValidator.
Widget tag parsing improvements: WidgetTagParser expanded tag prefix recognition (adds support for variants like "8", "8W", "S8"), relaxed capture/validation logic, and made OCR numeric-like character checks case-insensitive.
Grammar & attribute updates:
- LayoutGrammar now validates BACKGROUND and BACKGROUND_TINT in addition to VISIBILITY.
- ImageViewGrammar simplified to only add SRC validation (removed explicit LAYOUT_GRAVITY, BACKGROUND, BACKGROUND_TINT entries).
OCR sanitization enhancements: OcrSanitizerRules updated with more permissive/robust regexes to normalize id/src tokens, layout_width/layout_height, and background/color tokens to a consistent "key: " format.
Value cleaning refactor:
- NumberCleaner: replaced regex-based letter→digit heuristics with a deterministic ocrCharMap character mapping and simplified extraction to the first signed integer token.
- DimensionCleaner: reworked to extract units independently, clean only the numeric token via NumberCleaner, reattach units, and trim an OCR-introduced trailing "0" for large values (>= 1000) to mitigate common misreads.
- ColorCleaner: expanded colorMap with additional OCR variants/synonyms (notably for red/blue), normalizes to lowercase/underscore tokens, and lowered fuzzy-matching threshold for color matching from 75 to 70.
FuzzyAttributeParser: centralized numeric-type detection, added an isColorToken helper, simplified shouldTreatTokenAsValue logic, and changed flushAttribute to preserve the first mapped value (no longer overwrites existing attribute entries).
Attribute models: added DimensionValueSet (wrap_content / match_parent and related keywords) to support dimension parsing.
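
As an illustration of the NumberCleaner change described above, here is a minimal sketch of a per-character OCR mapping followed by extraction of the first signed integer token. The map contents and the function name `cleanNumber` are assumptions made for this example; the actual `ocrCharMap` entries are not shown in this thread.

```kotlin
// Hypothetical mapping of letters that OCR commonly confuses with digits.
// The PR's real ocrCharMap may contain different entries.
val ocrCharMap: Map<Char, Char> = mapOf(
    'O' to '0', 'o' to '0',
    'l' to '1', 'I' to '1',
    'S' to '5', 'B' to '8',
)

// Maps each character through ocrCharMap, then returns the first signed
// integer token found, or null when no digits remain.
fun cleanNumber(raw: String): String? {
    val mapped = raw.map { ocrCharMap[it] ?: it }.joinToString("")
    return Regex("-?\\d+").find(mapped)?.value
}
```

With this sketch, `cleanNumber("l6dp")` yields `"16"` and `cleanNumber("abc")` yields `null`. A fixed map makes each substitution predictable, which is the property the release notes call "deterministic".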

Risks & best-practice notes

Lowered color fuzzy threshold (75 → 70) increases risk of false-positive color matches on noisy OCR output.
colorMap visibility changed from private to public (val); this exposes internal mapping and may create unintended external dependencies—consider restoring encapsulation if not intended.
flushAttribute now preserves the first value for a given attribute key; this may hide downstream conflicts where later values should override earlier ones.
ImageViewGrammar no longer validates several attributes it previously did (LAYOUT_GRAVITY, BACKGROUND, BACKGROUND_TINT). Verify this aligns with expected behavior and UI generation requirements.
WidgetTagParser acceptance logic changed—test thoroughly with existing OCR samples to ensure no regression in tag detection and to confirm the new prefix variants don't create false positives.
DimensionCleaner’s trailing-zero heuristic (trim when numeric >= 1000) could wrongly alter legitimately large values in edge cases—validate against real-world dimension data.
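
The first-value-wins behavior noted for flushAttribute can be sketched as follows. The signature below is hypothetical and only models the accumulation rule; it is not the function from FuzzyAttributeParser.

```kotlin
// Hypothetical stand-in for flushAttribute: stores a value only when the
// XML attribute name has not been seen yet, so later values never overwrite.
fun flushAttributeSketch(
    attributes: MutableMap<String, String>,
    xmlName: String,
    value: String,
) {
    if (xmlName !in attributes) {
        attributes[xmlName] = value
    }
}
```

Calling it twice with `android:background` and the values `red` then `blue` leaves `red` in the map, which is the case the risk note asks reviewers to confirm is intended.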

Testing recommendations

Run full OCR-to-XML regression tests with representative OCR samples (including noisy/degraded images) focusing on:
- Tag extraction and widget recognition (especially new prefixes and Button/TextView).
- Color detection precision/recall after fuzzy-threshold change.
- Dimension parsing accuracy, including unit handling and trailing-zero trimming for >=1000.
- Number normalization and typical OCR misread cases (letter/digit confusions).
- Attribute accumulation behavior where multiple tokens map to the same XML attribute (ensure preserving-first-value is correct).
Review code owners/consumers for any reliance on ColorCleaner.colorMap and adjust visibility if needed.

Walkthrough

Broadens OCR-tolerant tag extraction; adds TextView/Button grammars and layout/background validators; improves fuzzy attribute token classification and preserves first-seen attribute values; refactors number/dimension/color cleaners for OCR noise; and widens OCR sanitization rules.

Changes

OCR Parsing & Attribute Grammar Enhancement

Layer / File(s)	Summary
OCR Tag Extraction & Validation `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`	Tag extraction regex now recognizes additional OCR prefix variants (`8`, `8W`, `S8`), validation accepts non-empty separators or numeric-like tokens, and OCR character checks use uppercase normalization.
Widget Grammar Registry & Definitions `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt`, `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt`	Grammar registry reordered to include `SliderGrammar`, `ButtonGrammar`, `TextViewGrammar`; `LayoutGrammar` gains `BACKGROUND` and `BACKGROUND_TINT`; `ImageViewGrammar` restricted to `SRC`; adds `TextViewGrammar` and `ButtonGrammar`.
Dimension Value Set `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.kt`	Adds `DimensionValueSet` with `match`/`wrap` keyword sets and combined `allKeywords`.
Fuzzy Attribute Parsing & Token Classification `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`	Introduces `numericTypes` set, refactors `shouldTreatTokenAsValue` to use a single lowercased token and `isColorToken`, and changes `flushAttribute` to preserve the first mapped XML attribute value.
Value Cleaning & OCR Normalization `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`	`NumberCleaner` uses per-character OCR mapping, `DimensionCleaner` separates unit and cleans numeric part (with trailing-zero heuristic), `ColorCleaner` expands synonyms and lowers fuzzy match threshold to 70; `colorMap` visibility relaxed.
OCR Sanitization Patterns `cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`	Sanitizer rules widened and replacements standardized to use colon+space for color (`background: red`), dimensions (`layout_width:` / `layout_height:` ), and structure (`id:` / `src:` ).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

appdevforall/CodeOnTheGo#1310: Overlaps with WidgetTagParser tag extraction and OCR tokenization adjustments.
appdevforall/CodeOnTheGo#1289: Related changes to ValueCleanersImpl Dimension parsing and trailing-zero normalization.
appdevforall/CodeOnTheGo#1220: Related updates to UI grammar validation and attribute parsing behavior.

Suggested reviewers

avestaadfa
Daniel-ADFA
hal-eisen-adfa

Poem

🐰 I hop through pixels, read each fuzzy tag and code,
I map mistyped colors, trim numbers on the road,
I keep the first value found and tidy grammar rows,
I patch confused digits up where OCR wind blows,
Hooray — widgets parsed, and XML softly grows!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly matches the main objectives: improving OCR sanitization, value cleaners, and adding widget support (TextView, Button grammars).
Description check	✅ Passed	The description is well-aligned with the changeset, detailing enhancements to OCR parsing, new widget grammars, refactored value cleaners, and improved sanitization rules.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt`:
- Around line 111-117: TextViewGrammar and ButtonGrammar inherit TextGrammar but
TextGrammar lacks inclusion of AttributeKey.TEXT.xmlName, so android:text is
being stripped by UiGrammarValidator.enforceGrammar; to fix, update the grammar
used by TextViewGrammar and ButtonGrammar to include TEXT validation (either add
AttributeKey.TEXT.xmlName to TextGrammar's allowed attributes or override
validators in TextViewGrammar/ButtonGrammar to include
AttributeKey.TEXT.xmlName) and ensure UiGrammarValidator.enforceGrammar
recognizes that key for these tags when validating.
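
A rough sketch of why a missing entry causes `android:text` to be stripped, and how an override restores it. All class and property names here are hypothetical stand-ins; the real TextGrammar, TextViewGrammar and UiGrammarValidator.enforceGrammar are not reproduced in this thread.

```kotlin
// Hypothetical base grammar that does not list android:text.
open class TextGrammarSketch {
    open val allowedAttributes: Set<String> =
        setOf("android:textSize", "android:textColor")
}

// Hypothetical subclass that adds the text attribute on top of the base set.
class TextViewGrammarSketch : TextGrammarSketch() {
    override val allowedAttributes: Set<String> =
        super.allowedAttributes + "android:text"
}

// Simplified enforcement: keep only attributes the grammar allows.
fun enforceGrammarSketch(
    grammar: TextGrammarSketch,
    attrs: Map<String, String>,
): Map<String, String> = attrs.filterKeys { it in grammar.allowedAttributes }
```

Under these assumptions, the base grammar drops `android:text` while the subclass keeps it.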

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`:
- Around line 85-89: The parser currently only accepts dimension values when a
digit is present, causing tokens like "wrap_content" or "match_parent" to be
misclassified; update the when branch in FuzzyAttributeParser (the check using
currentKey, currentKey?.valueType, numericTypes and lowerToken) to also accept
known non-numeric dimension keywords by adding a check for currentKey?.valueType
== ValueType.DIMENSION && lowerToken in a new dimensionKeywords set (e.g.,
"wrap_content","match_parent","match_constraint","fill_parent"), so that
dimension tokens without digits are treated as values rather than being
fuzzy-matched as keys.
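
A minimal sketch of the suggested check, assuming a simplified `ValueType` enum and a standalone helper named `looksLikeValue` (both invented for this example). The keyword list is the one proposed in the comment above.

```kotlin
// Simplified stand-in for the parser's value types.
enum class ValueType { DIMENSION, NUMBER, COLOR, TEXT }

val numericTypes = setOf(ValueType.DIMENSION, ValueType.NUMBER)
val dimensionKeywords =
    setOf("wrap_content", "match_parent", "match_constraint", "fill_parent")

// Treats a token as a value when it is a known dimension keyword, or when
// the current key is numeric and the token contains at least one digit.
fun looksLikeValue(valueType: ValueType, token: String): Boolean {
    val lowerToken = token.lowercase()
    return when {
        valueType == ValueType.DIMENSION && lowerToken in dimensionKeywords -> true
        valueType in numericTypes && lowerToken.any { it.isDigit() } -> true
        else -> false
    }
}
```

With the keyword branch in place, `wrap_content` following a dimension key is accepted as a value even though it has no digits.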

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 36-37: The two sanitizer regex entries in OcrSanitizerRules.kt
that target the id and src keys currently allow matches inside longer words
because they don't enforce a token boundary after the key; update the two map
entries that begin with the id-pattern and src-pattern to require a clear key
boundary after the token (e.g., add a trailing word-boundary or otherwise
require the separator to be present) so replacements only occur for standalone
keys (adjust the patterns that currently start with the id-variant and the
src-variant to include the boundary check).

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Around line 62-64: The numeric cleanup currently calls num.toLong() inside
ValueCleanersImpl (the regex match -> num block), which can throw on oversized
OCR digit strings; replace the unsafe conversion with a safe one (e.g.,
num.toLongOrNull()) and only perform the >= 1000L check when the conversion
succeeds, otherwise fall back to the non-truncated branch (do not call dropLast
on failure). Update the conditional that computes finalNum to use the safe
conversion result and ensure no exception path remains when parsing extremely
long numeric OCR outputs.
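
The safe conversion can be sketched like this. The helper name `trimOcrTrailingZero` and the explicit `endsWith("0")` guard are assumptions for the example; only the `toLongOrNull()` call and the `>= 1000L` threshold come from the comment above.

```kotlin
// Drops one trailing "0" from large values, but only when the digits fit in
// a Long. Oversized or non-numeric input is returned unchanged instead of
// throwing, which is what toLong() would do.
fun trimOcrTrailingZero(num: String): String {
    val parsed = num.toLongOrNull()
    return if (parsed != null && parsed >= 1000L && num.endsWith("0")) {
        num.dropLast(1)
    } else {
        num
    }
}
```

For example, `"1000"` becomes `"100"`, `"160"` is left alone, and a 23-digit string passes through untouched because `toLongOrNull()` returns `null` for it.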

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1b6f78cb-de6f-4d14-8084-b4616e2bbfa2

📥 Commits

Reviewing files that changed from the base of the PR and between a479086 and cf88aba.

📒 Files selected for processing (6)

cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt (1)
37-37: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

src key regex is too permissive and can rewrite unrelated words (e.g., set).

At Line 37, \\bS[ec][rt]\\b also matches normal tokens like set, which can be incorrectly normalized to src: .

Suggested fix

-        "\\bS[ec][rt]\\b\\s*[:;]?" to "src: "
+        "\\b(?:src|src\\.|s[ec]r[ct])\\b\\s*[:;]?" to "src: "

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`
at line 37, The regex rule mapping "\\bS[ec][rt]\\b\\s*[:;]?" => "src: " is too
permissive (matches words like "set"); update the rule in OcrSanitizerRules.kt
to require a 'c' or 'r' as the second character so ordinary words with 'e' don't
match — for example replace the pattern with "\\bS[cr][rt]\\b\\s*[:;]?" (or an
equivalent pattern that ensures the second char is not 'e') while keeping the
replacement "src: ".
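
To make the over-match concrete, the sketch below applies both patterns to sample text. It assumes the rule is applied case-insensitively (otherwise lowercase `set` could not match an uppercase `S`); the helper name `sanitizeSrc` is invented for the example.

```kotlin
// Pattern quoted in the finding, and the narrower variant proposed above.
// IGNORE_CASE is an assumption about how the real rules are compiled.
val permissiveSrc = Regex("\\bS[ec][rt]\\b\\s*[:;]?", RegexOption.IGNORE_CASE)
val narrowerSrc = Regex("\\bS[cr][rt]\\b\\s*[:;]?", RegexOption.IGNORE_CASE)

// Applies one sanitizer rule with the standardized "src: " replacement.
fun sanitizeSrc(rule: Regex, text: String): String = rule.replace(text, "src: ")
```

The permissive rule rewrites `set the id` to `src: the id`, while the narrower one leaves it unchanged and still normalizes `scr:icon`. The narrower pattern also stops accepting `ser`, so it is worth checking against real OCR samples whether that misread occurs.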

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Around line 47-48: The current checks in ValueCleanersImpl (using
DimensionValueSet.matchKeywords and wrapKeywords against the normalized string)
perform substring matches and can false-positive on words like "parent" or
"content"; change the logic to match whole tokens instead (e.g., split
normalized into whitespace/punctuation-delimited tokens or use word-boundary
regex) and test each token for equality against DimensionValueSet.matchKeywords
/ wrapKeywords rather than using "in" substring checks so only exact keyword
tokens map to MATCH_PARENT or WRAP_CONTENT.
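
The difference between substring and whole-token matching can be sketched as below. The keyword sets are hypothetical (the real DimensionValueSet contents are not listed in this thread), and both helper names are invented for the example.

```kotlin
// Hypothetical keyword sets standing in for DimensionValueSet.
val matchKeywords = setOf("match_parent", "match", "parent")
val wrapKeywords = setOf("wrap_content", "wrap", "content")

// Substring check: true if any keyword appears anywhere in the text.
fun matchesBySubstring(normalized: String, keywords: Set<String>): Boolean =
    keywords.any { it in normalized }

// Token check: split on anything that is not a lowercase letter, digit or
// underscore, then require an exact keyword match on a whole token.
fun matchesByToken(normalized: String, keywords: Set<String>): Boolean {
    val tokens = normalized.split(Regex("[^a-z0-9_]+")).filter { it.isNotEmpty() }
    return tokens.any { it in keywords }
}
```

With these sets, the substring check accepts `apparent` because it contains `parent`, whereas the token check rejects it and still accepts `match_parent`.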

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`:
- Line 9: The alternation in tagExtractRegex allows shorter prefixes to match
before longer ones (e.g., "S" before "S8", "8" before "8W"), causing wrong
prefix extraction; reorder the alternatives in the Regex declared as
tagExtractRegex so multi-character prefixes (SW, 8W, S8, etc.) appear before
single-character ones (S, 8) while keeping the (?i) flag and the rest of the
pattern intact so longer prefixes are matched first.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76296306-8d7f-40ed-a630-a77155c4879d

📥 Commits

Reviewing files that changed from the base of the PR and between cf88aba and 911cc55.

📒 Files selected for processing (7)

cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt (1)

9-9: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reorder prefix alternatives so longer OCR variants match first.

At Line 9, shorter alternatives (S, 8) can match before (S8, 8W), which can mis-normalize tags like S8-12/8W-12.

Suggested fix

-    private val tagExtractRegex = Regex("^(?i)(B|P|D|T|C|R|SW|S|8|8W|S8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")
+    private val tagExtractRegex = Regex("^(?i)(8W|S8|SW|B|P|D|T|C|R|S|8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`
at line 9, The prefix alternation in tagExtractRegex in WidgetTagParser.kt lists
short tokens before longer ones so inputs like "S8-12" or "8W-12" can be
misparsed; update the Regex pattern used in the tagExtractRegex property to list
longer OCR variants first (e.g., "SW", "8W", "S8" before "S" and "8") so the
engine matches the longer prefixes prior to the single-character alternatives
and preserves correct normalization.
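
The effect of alternation order can be checked directly with the two patterns quoted in the suggested fix. The helper `extractPrefix` is invented for this example and only returns the first capture group.

```kotlin
// Patterns copied from the suggested fix: original order and longest-first.
val originalOrder =
    Regex("^(?i)(B|P|D|T|C|R|SW|S|8|8W|S8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")
val longestFirst =
    Regex("^(?i)(8W|S8|SW|B|P|D|T|C|R|S|8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")

// Returns the captured tag prefix (group 1), or null when nothing matches.
fun extractPrefix(regex: Regex, tag: String): String? =
    regex.find(tag)?.groupValues?.get(1)
```

Because alternation tries branches left to right and the separator group may be empty, the original order captures `S` for `S8-12` and pushes `8-12` into the third group. The reordered pattern captures `S8`.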

🧹 Nitpick comments (2)

cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt (2)

82-82: 💤 Low value

colorMap visibility changed from private to public.

This exposes the internal color mapping. If external access is intentional (e.g., for testing or use in FuzzyAttributeParser), consider documenting the intent. Otherwise, restore private visibility.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`
at line 82, The colorMap property in ValueCleanersImpl was made public but
should remain private unless intentionally exposed; revert its visibility back
to private on the colorMap declaration in ValueCleanersImpl (or if external
access is required for FuzzyAttributeParser or tests, add a clear
comment/docstring explaining the intentional exposure and provide a controlled
accessor method instead). Ensure the symbol name colorMap in class
ValueCleanersImpl is either marked private or accompanied by documented
justification and a dedicated getter used by FuzzyAttributeParser/tests.

98-98: 💤 Low value

Redundant .replace(" ", "_") call.

The regex [^a-z_] already removes spaces (along with all non-letter/underscore characters), making the subsequent .replace(" ", "_") a no-op.

Proposed fix

-        val normalizedValue = rawValue.lowercase().replace(Regex("[^a-z_]"), "").replace(" ", "_")
+        val normalizedValue = rawValue.lowercase().replace(Regex("[^a-z_]"), "")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`
at line 98, The normalizedValue computation in ValueCleanersImpl.kt redundantly
calls .replace(" ", "_") after using replace(Regex("[^a-z_]"), "") which already
strips spaces; remove the trailing .replace(" ", "_") so normalizedValue is
produced by rawValue.lowercase().replace(Regex("[^a-z_]"), "") (or, if you
intended to convert spaces to underscores, change the regex to preserve spaces
instead and then perform the replace). Update the expression that computes
normalizedValue and run tests to confirm behavior.
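
Both options mentioned in the nitpick can be compared side by side. The function names are invented for this example; the first mirrors the proposed fix and the second is the alternative of keeping spaces so they can become underscores.

```kotlin
// Proposed fix: spaces are removed by the character class, so no further
// replace is needed.
fun normalizeDroppingSpaces(rawValue: String): String =
    rawValue.lowercase().replace(Regex("[^a-z_]"), "")

// Alternative: allow spaces through the filter, then turn them into
// underscores. Only appropriate if underscore-joined names are intended.
fun normalizeKeepingSpaces(rawValue: String): String =
    rawValue.lowercase().replace(Regex("[^a-z_ ]"), "").replace(" ", "_")
```

For the input `Light Blue`, the first returns `lightblue` and the second returns `light_blue`, so the choice depends on which form the colorMap keys use.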

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`:
- Line 88: The current predicate uses substring matching
(DimensionValueSet.allKeywords.any { it in lowerToken }) which wrongly accepts
tokens containing a keyword as part of a larger word; change it to
exact/word-boundary matching by testing equality or using a regex with word
boundaries against lowerToken (e.g., DimensionValueSet.allKeywords.any { kw ->
lowerToken == kw ||
Regex("\\b${Regex.escape(kw)}\\b").containsMatchIn(lowerToken) }) so that
currentKey?.valueType == ValueType.DIMENSION only returns true for exact
dimension tokens; update the check where currentKey?.valueType,
ValueType.DIMENSION, DimensionValueSet.allKeywords and lowerToken are referenced
in FuzzyAttributeParser.kt.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 20-21: The regex rules that replace detected keys into
"layout_width" and "layout_height" (the map entries for "[il]ay..." and
"[il]ay...hei...") can match inside longer words like "display", so update those
two keys to include a leading word boundary (e.g., prepend \b) to ensure they
only match standalone or word-start occurrences; modify the entries in
OcrSanitizerRules.kt that map the
"[il]ay[a-z]*[-_.\\s]*w[a-z0-9]*\\.?\\s*[:;]\\s*" and
"[il]ay[a-z]*[-_.\\s]*hei[a-z0-9]*\\.?\\s*[:;]\\s*" patterns to add the
word-boundary anchor before the pattern.
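
A small check of the width pattern with and without the leading `\b`. The replacement string follows the "key: " format described in the walkthrough, and the helper `sanitizeWidth` is invented for the example; the surrounding rule map is not shown here.

```kotlin
// Width pattern as quoted in the comment, and the same pattern with a
// leading word boundary.
val unanchoredWidth = Regex("[il]ay[a-z]*[-_.\\s]*w[a-z0-9]*\\.?\\s*[:;]\\s*")
val anchoredWidth = Regex("\\b[il]ay[a-z]*[-_.\\s]*w[a-z0-9]*\\.?\\s*[:;]\\s*")

// Applies one rule with the standardized replacement.
fun sanitizeWidth(rule: Regex, text: String): String =
    rule.replace(text, "layout_width: ")
```

Without the anchor, `display width: 10` is rewritten to `displayout_width: 10` because the match starts at the `lay` inside `display`. With the anchor that input is left unchanged, and `layout width: 10` still becomes `layout_width: 10`.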

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9050152-a3cf-4e8c-a85b-7838b030492b

📥 Commits

Reviewing files that changed from the base of the PR and between 911cc55 and 1b527b5.

📒 Files selected for processing (7)

cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

jatezzz requested review from a team, Daniel-ADFA and avestaadfa May 22, 2026 21:47

coderabbitai Bot reviewed May 22, 2026

hal-eisen-adfa approved these changes May 22, 2026

jatezzz force-pushed the feat/ADFA-4033-improve-ocr-sanitization-experimental branch from cbc5549 to 911cc55 Compare May 26, 2026 14:10

coderabbitai Bot reviewed May 26, 2026

Comment thread ...src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt

Comment thread ...e-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt

jatezzz added 2 commits May 26, 2026 13:38

feat(parser): improve OCR sanitization, cleaners, and widget support

06494ad

Refine regex rules, optimize value cleaners, and add support for TextView and Button grammars.

fix: add some attributes validations, fix some OCR rules and centrali…

1b527b5

…ze some validations

jatezzz force-pushed the feat/ADFA-4033-improve-ocr-sanitization-experimental branch from 911cc55 to 1b527b5 Compare May 26, 2026 18:38

coderabbitai Bot reviewed May 26, 2026

Comment thread .../main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt

Comment thread ...ava/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

jatezzz merged commit 99d896f into stage May 26, 2026
2 checks passed

jatezzz deleted the feat/ADFA-4033-improve-ocr-sanitization-experimental branch May 26, 2026 18:51
