Skip to content

ADFA-4033 | Improve OCR sanitization, value cleaners, and widget support#1333

Merged
jatezzz merged 2 commits into
stagefrom
feat/ADFA-4033-improve-ocr-sanitization-experimental
May 26, 2026
Merged

ADFA-4033 | Improve OCR sanitization, value cleaners, and widget support#1333
jatezzz merged 2 commits into
stagefrom
feat/ADFA-4033-improve-ocr-sanitization-experimental

Conversation

@jatezzz
Copy link
Copy Markdown
Collaborator

@jatezzz jatezzz commented May 22, 2026

Description

This PR enhances the OCR parsing engine by introducing new widget grammars (TextView, Button) and refining extraction rules. It simplifies the logic in NumberCleaner and DimensionCleaner, improves regex matching in OcrSanitizerRules for layout dimensions and colors, and adds better handling for OCR edge cases (e.g., trailing zeros in dimensions, confusing letters with numbers).

Details

  • Added ButtonGrammar and TextViewGrammar to UiGrammarValidator and WidgetGrammar.
  • Updated sanitization regex rules in OcrSanitizerRules to better catch ids, dimensions, and backgrounds.
  • Refactored ValueCleanersImpl to use mapped character replacements for numbers and streamlined dimension extraction.
  • Updated FuzzyAttributeParser to properly identify color tokens and numeric values.
  • Expanded WidgetTagParser to support new prefixes and improved OCR character numeric checks.
document_4936224381652174808.mp4

Ticket

ADFA-4033

Observation

ValueCleanersImpl was significantly refactored for readability and efficiency, particularly in how it cleans numbers and translates common OCR misreads. The fuzzy matching threshold for colors was slightly relaxed to accommodate higher error rates in OCR scanning.

@jatezzz jatezzz requested review from a team, Daniel-ADFA and avestaadfa May 22, 2026 21:47
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Review Change Stack

📝 Walkthrough

Release Notes - ADFA-4033: Improve OCR Sanitization, Value Cleaners, and Widget Support

  • New widget grammars: added ButtonGrammar and TextViewGrammar and registered them in the UiGrammarValidator.
  • Widget tag parsing improvements: WidgetTagParser expanded tag prefix recognition (adds support for variants like "8", "8W", "S8"), relaxed capture/validation logic, and made OCR numeric-like character checks case-insensitive.
  • Grammar & attribute updates:
    • LayoutGrammar now validates BACKGROUND and BACKGROUND_TINT in addition to VISIBILITY.
    • ImageViewGrammar simplified to only add SRC validation (removed explicit LAYOUT_GRAVITY, BACKGROUND, BACKGROUND_TINT entries).
  • OCR sanitization enhancements: OcrSanitizerRules updated with more permissive/robust regexes to normalize id/src tokens, layout_width/layout_height, and background/color tokens to a consistent "key: " format.
  • Value cleaning refactor:
    • NumberCleaner: replaced regex-based letter→digit heuristics with a deterministic ocrCharMap character mapping and simplified extraction to the first signed integer token.
    • DimensionCleaner: reworked to extract units independently, clean only the numeric token via NumberCleaner, reattach units, and trim an OCR-introduced trailing "0" for large values (>= 1000) to mitigate common misreads.
    • ColorCleaner: expanded colorMap with additional OCR variants/synonyms (notably for red/blue), normalizes to lowercase/underscore tokens, and lowered fuzzy-matching threshold for color matching from 75 to 70.
  • FuzzyAttributeParser: centralized numeric-type detection, added an isColorToken helper, simplified shouldTreatTokenAsValue logic, and changed flushAttribute to preserve the first mapped value (no longer overwrites existing attribute entries).
  • Attribute models: added DimensionValueSet (wrap_content / match_parent and related keywords) to support dimension parsing.

Risks & best-practice notes

  • Lowered color fuzzy threshold (75 → 70) increases risk of false-positive color matches on noisy OCR output.
  • colorMap visibility changed from private to public (val); this exposes internal mapping and may create unintended external dependencies—consider restoring encapsulation if not intended.
  • flushAttribute now preserves the first value for a given attribute key; this may hide downstream conflicts where later values should override earlier ones.
  • ImageViewGrammar no longer validates several attributes it previously did (LAYOUT_GRAVITY, BACKGROUND, BACKGROUND_TINT). Verify this aligns with expected behavior and UI generation requirements.
  • WidgetTagParser acceptance logic changed—test thoroughly with existing OCR samples to ensure no regression in tag detection and to confirm the new prefix variants don't create false positives.
  • DimensionCleaner’s trailing-zero heuristic (trim when numeric >= 1000) could wrongly alter legitimately large values in edge cases—validate against real-world dimension data.

Testing recommendations

  • Run full OCR-to-XML regression tests with representative OCR samples (including noisy/degraded images) focusing on:
    • Tag extraction and widget recognition (especially new prefixes and Button/TextView).
    • Color detection precision/recall after fuzzy-threshold change.
    • Dimension parsing accuracy, including unit handling and trailing-zero trimming for >=1000.
    • Number normalization and typical OCR misread cases (letter/digit confusions).
    • Attribute accumulation behavior where multiple tokens map to the same XML attribute (ensure preserving-first-value is correct).
  • Review code owners/consumers for any reliance on ColorCleaner.colorMap and adjust visibility if needed.

Walkthrough

Broadens OCR-tolerant tag extraction; adds TextView/Button grammars and layout/background validators; improves fuzzy attribute token classification and preserves first-seen attribute values; refactors number/dimension/color cleaners for OCR noise; and widens OCR sanitization rules.

Changes

OCR Parsing & Attribute Grammar Enhancement

Layer / File(s) Summary
OCR Tag Extraction & Validation
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
Tag extraction regex now recognizes additional OCR prefix variants (8, 8W, S8), validation accepts non-empty separators or numeric-like tokens, and OCR character checks use uppercase normalization.
Widget Grammar Registry & Definitions
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt, cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
Grammar registry reordered to include SliderGrammar, ButtonGrammar, TextViewGrammar; LayoutGrammar gains BACKGROUND and BACKGROUND_TINT; ImageViewGrammar restricted to SRC; adds TextViewGrammar and ButtonGrammar.
Dimension Value Set
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.kt
Adds DimensionValueSet with match/wrap keyword sets and combined allKeywords.
Fuzzy Attribute Parsing & Token Classification
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
Introduces numericTypes set, refactors shouldTreatTokenAsValue to use a single lowercased token and isColorToken, and changes flushAttribute to preserve the first mapped XML attribute value.
Value Cleaning & OCR Normalization
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
NumberCleaner uses per-character OCR mapping, DimensionCleaner separates unit and cleans numeric part (with trailing-zero heuristic), ColorCleaner expands synonyms and lowers fuzzy match threshold to 70; colorMap visibility relaxed.
OCR Sanitization Patterns
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt
Sanitizer rules widened and replacements standardized to use colon+space for color (background: red), dimensions (layout_width: / layout_height: ), and structure (id: / src: ).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • avestaadfa
  • Daniel-ADFA
  • hal-eisen-adfa

Poem

🐰 I hop through pixels, read each fuzzy tag and code,
I map mistyped colors, trim numbers on the road,
I keep the first value found and tidy grammar rows,
I patch confused digits up where OCR wind blows,
Hooray — widgets parsed, and XML softly grows!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly matches the main objectives: improving OCR sanitization, value cleaners, and adding widget support (TextView, Button grammars).
Description check ✅ Passed The description is well-aligned with the changeset, detailing enhancements to OCR parsing, new widget grammars, refactored value cleaners, and improved sanitization rules.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/ADFA-4033-improve-ocr-sanitization-experimental

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt`:
- Around line 111-117: TextViewGrammar and ButtonGrammar inherit TextGrammar but
TextGrammar lacks inclusion of AttributeKey.TEXT.xmlName, so android:text is
being stripped by UiGrammarValidator.enforceGrammar; to fix, update the grammar
used by TextViewGrammar and ButtonGrammar to include TEXT validation (either add
AttributeKey.TEXT.xmlName to TextGrammar's allowed attributes or override
validators in TextViewGrammar/ButtonGrammar to include
AttributeKey.TEXT.xmlName) and ensure UiGrammarValidator.enforceGrammar
recognizes that key for these tags when validating.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`:
- Around line 85-89: The parser currently only accepts dimension values when a
digit is present, causing tokens like "wrap_content" or "match_parent" to be
misclassified; update the when branch in FuzzyAttributeParser (the check using
currentKey, currentKey?.valueType, numericTypes and lowerToken) to also accept
known non-numeric dimension keywords by adding a check for currentKey?.valueType
== ValueType.DIMENSION && lowerToken in a new dimensionKeywords set (e.g.,
"wrap_content","match_parent","match_constraint","fill_parent"), so that
dimension tokens without digits are treated as values rather than being
fuzzy-matched as keys.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 36-37: The two sanitizer regex entries in OcrSanitizerRules.kt
that target the id and src keys currently allow matches inside longer words
because they don't enforce a token boundary after the key; update the two map
entries that begin with the id-pattern and src-pattern to require a clear key
boundary after the token (e.g., add a trailing word-boundary or otherwise
require the separator to be present) so replacements only occur for standalone
keys (adjust the patterns that currently start with the id-variant and the
src-variant to include the boundary check).

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Around line 62-64: The numeric cleanup currently calls num.toLong() inside
ValueCleanersImpl (the regex match -> num block), which can throw on oversized
OCR digit strings; replace the unsafe conversion with a safe one (e.g.,
num.toLongOrNull()) and only perform the >= 1000L check when the conversion
succeeds, otherwise fall back to the non-truncated branch (do not call dropLast
on failure). Update the conditional that computes finalNum to use the safe
conversion result and ensure no exception path remains when parsing extremely
long numeric OCR outputs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1b6f78cb-de6f-4d14-8084-b4616e2bbfa2

📥 Commits

Reviewing files that changed from the base of the PR and between a479086 and cf88aba.

📒 Files selected for processing (6)
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

@jatezzz jatezzz force-pushed the feat/ADFA-4033-improve-ocr-sanitization-experimental branch from cbc5549 to 911cc55 Compare May 26, 2026 14:10
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt (1)

37-37: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

src key regex is too permissive and can rewrite unrelated words (e.g., set).

At Line 37, \\bS[ec][rt]\\b also matches normal tokens like set, which can be incorrectly normalized to src: .

Suggested fix
-        "\\bS[ec][rt]\\b\\s*[:;]?" to "src: "
+        "\\b(?:src|src\\.|s[ec]r[ct])\\b\\s*[:;]?" to "src: "
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`
at line 37, The regex rule mapping "\\bS[ec][rt]\\b\\s*[:;]?" => "src: " is too
permissive (matches words like "set"); update the rule in OcrSanitizerRules.kt
to require a 'c' or 'r' as the second character so ordinary words with 'e' don't
match — for example replace the pattern with "\\bS[cr][rt]\\b\\s*[:;]?" (or an
equivalent pattern that ensures the second char is not 'e') while keeping the
replacement "src: ".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Around line 47-48: The current checks in ValueCleanersImpl (using
DimensionValueSet.matchKeywords and wrapKeywords against the normalized string)
perform substring matches and can false-positive on words like "parent" or
"content"; change the logic to match whole tokens instead (e.g., split
normalized into whitespace/punctuation-delimited tokens or use word-boundary
regex) and test each token for equality against DimensionValueSet.matchKeywords
/ wrapKeywords rather than using "in" substring checks so only exact keyword
tokens map to MATCH_PARENT or WRAP_CONTENT.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`:
- Line 9: The alternation in tagExtractRegex allows shorter prefixes to match
before longer ones (e.g., "S" before "S8", "8" before "8W"), causing wrong
prefix extraction; reorder the alternatives in the Regex declared as
tagExtractRegex so multi-character prefixes (SW, 8W, S8, etc.) appear before
single-character ones (S, 8) while keeping the (?i) flag and the rest of the
pattern intact so longer prefixes are matched first.

---

Duplicate comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Line 37: The regex rule mapping "\\bS[ec][rt]\\b\\s*[:;]?" => "src: " is too
permissive (matches words like "set"); update the rule in OcrSanitizerRules.kt
to require a 'c' or 'r' as the second character so ordinary words with 'e' don't
match — for example replace the pattern with "\\bS[cr][rt]\\b\\s*[:;]?" (or an
equivalent pattern that ensures the second char is not 'e') while keeping the
replacement "src: ".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76296306-8d7f-40ed-a630-a77155c4879d

📥 Commits

Reviewing files that changed from the base of the PR and between cf88aba and 911cc55.

📒 Files selected for processing (7)
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

jatezzz added 2 commits May 26, 2026 13:38
Refine regex rules, optimize value cleaners, and add support for TextView and Button grammars.
@jatezzz jatezzz force-pushed the feat/ADFA-4033-improve-ocr-sanitization-experimental branch from 911cc55 to 1b527b5 Compare May 26, 2026 18:38
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt (1)

9-9: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reorder prefix alternatives so longer OCR variants match first.

At Line 9, shorter alternatives (S, 8) can match before (S8, 8W), which can mis-normalize tags like S8-12/8W-12.

Suggested fix
-    private val tagExtractRegex = Regex("^(?i)(B|P|D|T|C|R|SW|S|8|8W|S8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")
+    private val tagExtractRegex = Regex("^(?i)(8W|S8|SW|B|P|D|T|C|R|S|8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`
at line 9, The prefix alternation in tagExtractRegex in WidgetTagParser.kt lists
short tokens before longer ones so inputs like "S8-12" or "8W-12" can be
misparsed; update the Regex pattern used in the tagExtractRegex property to list
longer OCR variants first (e.g., "SW", "8W", "S8" before "S" and "8") so the
engine matches the longer prefixes prior to the single-character alternatives
and preserves correct normalization.
🧹 Nitpick comments (2)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt (2)

82-82: 💤 Low value

colorMap visibility changed from private to public.

This exposes the internal color mapping. If external access is intentional (e.g., for testing or use in FuzzyAttributeParser), consider documenting the intent. Otherwise, restore private visibility.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`
at line 82, The colorMap property in ValueCleanersImpl was made public but
should remain private unless intentionally exposed; revert its visibility back
to private on the colorMap declaration in ValueCleanersImpl (or if external
access is required for FuzzyAttributeParser or tests, add a clear
comment/docstring explaining the intentional exposure and provide a controlled
accessor method instead). Ensure the symbol name colorMap in class
ValueCleanersImpl is either marked private or accompanied by documented
justification and a dedicated getter used by FuzzyAttributeParser/tests.

98-98: 💤 Low value

Redundant .replace(" ", "_") call.

The regex [^a-z_] already removes spaces (along with all non-letter/underscore characters), making the subsequent .replace(" ", "_") a no-op.

Proposed fix
-        val normalizedValue = rawValue.lowercase().replace(Regex("[^a-z_]"), "").replace(" ", "_")
+        val normalizedValue = rawValue.lowercase().replace(Regex("[^a-z_]"), "")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`
at line 98, The normalizedValue computation in ValueCleanersImpl.kt redundantly
calls .replace(" ", "_") after using replace(Regex("[^a-z_]"), "") which already
strips spaces; remove the trailing .replace(" ", "_") so normalizedValue is
produced by rawValue.lowercase().replace(Regex("[^a-z_]"), "") (or, if you
intended to convert spaces to underscores, change the regex to preserve spaces
instead and then perform the replace). Update the expression that computes
normalizedValue and run tests to confirm behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`:
- Line 88: The current predicate uses substring matching
(DimensionValueSet.allKeywords.any { it in lowerToken }) which wrongly accepts
tokens containing a keyword as part of a larger word; change it to
exact/word-boundary matching by testing equality or using a regex with word
boundaries against lowerToken (e.g., DimensionValueSet.allKeywords.any { kw ->
lowerToken == kw ||
Regex("\\b${Regex.escape(kw)}\\b").containsMatchIn(lowerToken) }) so that
currentKey?.valueType == ValueType.DIMENSION only returns true for exact
dimension tokens; update the check where currentKey?.valueType,
ValueType.DIMENSION, DimensionValueSet.allKeywords and lowerToken are referenced
in FuzzyAttributeParser.kt.

In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 20-21: The regex rules that replace detected keys into
"layout_width" and "layout_height" (the map entries for "[il]ay..." and
"[il]ay...hei...") can match inside longer words like "display", so update those
two keys to include a leading word boundary (e.g., prepend \b) to ensure they
only match standalone or word-start occurrences; modify the entries in
OcrSanitizerRules.kt that map the
"[il]ay[a-z]*[-_.\\s]*w[a-z0-9]*\\.?\\s*[:;]\\s*" and
"[il]ay[a-z]*[-_.\\s]*hei[a-z0-9]*\\.?\\s*[:;]\\s*" patterns to add the
word-boundary anchor before the pattern.

---

Duplicate comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`:
- Line 9: The prefix alternation in tagExtractRegex in WidgetTagParser.kt lists
short tokens before longer ones so inputs like "S8-12" or "8W-12" can be
misparsed; update the Regex pattern used in the tagExtractRegex property to list
longer OCR variants first (e.g., "SW", "8W", "S8" before "S" and "8") so the
engine matches the longer prefixes prior to the single-character alternatives
and preserves correct normalization.

---

Nitpick comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Line 82: The colorMap property in ValueCleanersImpl was made public but should
remain private unless intentionally exposed; revert its visibility back to
private on the colorMap declaration in ValueCleanersImpl (or if external access
is required for FuzzyAttributeParser or tests, add a clear comment/docstring
explaining the intentional exposure and provide a controlled accessor method
instead). Ensure the symbol name colorMap in class ValueCleanersImpl is either
marked private or accompanied by documented justification and a dedicated getter
used by FuzzyAttributeParser/tests.
- Line 98: The normalizedValue computation in ValueCleanersImpl.kt redundantly
calls .replace(" ", "_") after using replace(Regex("[^a-z_]"), "") which already
strips spaces; remove the trailing .replace(" ", "_") so normalizedValue is
produced by rawValue.lowercase().replace(Regex("[^a-z_]"), "") (or, if you
intended to convert spaces to underscores, change the regex to preserve spaces
instead and then perform the replace). Update the expression that computes
normalizedValue and run tests to confirm behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9050152-a3cf-4e8c-a85b-7838b030492b

📥 Commits

Reviewing files that changed from the base of the PR and between 911cc55 and 1b527b5.

📒 Files selected for processing (7)
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt
  • cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt

@jatezzz jatezzz merged commit 99d896f into stage May 26, 2026
2 checks passed
@jatezzz jatezzz deleted the feat/ADFA-4033-improve-ocr-sanitization-experimental branch May 26, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants