ADFA-4033 | Improve OCR sanitization, value cleaners, and widget support#1333
Conversation
📝 WalkthroughRelease Notes - ADFA-4033: Improve OCR Sanitization, Value Cleaners, and Widget Support
Risks & best-practice notes
Testing recommendations
WalkthroughBroadens OCR-tolerant tag extraction; adds TextView/Button grammars and layout/background validators; improves fuzzy attribute token classification and preserves first-seen attribute values; refactors number/dimension/color cleaners for OCR noise; and widens OCR sanitization rules. ChangesOCR Parsing & Attribute Grammar Enhancement
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.kt`:
- Around line 111-117: TextViewGrammar and ButtonGrammar inherit TextGrammar but
TextGrammar lacks inclusion of AttributeKey.TEXT.xmlName, so android:text is
being stripped by UiGrammarValidator.enforceGrammar; to fix, update the grammar
used by TextViewGrammar and ButtonGrammar to include TEXT validation (either add
AttributeKey.TEXT.xmlName to TextGrammar's allowed attributes or override
validators in TextViewGrammar/ButtonGrammar to include
AttributeKey.TEXT.xmlName) and ensure UiGrammarValidator.enforceGrammar
recognizes that key for these tags when validating.
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`:
- Around line 85-89: The parser currently only accepts dimension values when a
digit is present, causing tokens like "wrap_content" or "match_parent" to be
misclassified; update the when branch in FuzzyAttributeParser (the check using
currentKey, currentKey?.valueType, numericTypes and lowerToken) to also accept
known non-numeric dimension keywords by adding a check for currentKey?.valueType
== ValueType.DIMENSION && lowerToken in a new dimensionKeywords set (e.g.,
"wrap_content","match_parent","match_constraint","fill_parent"), so that
dimension tokens without digits are treated as values rather than being
fuzzy-matched as keys.
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 36-37: The two sanitizer regex entries in OcrSanitizerRules.kt
that target the id and src keys currently allow matches inside longer words
because they don't enforce a token boundary after the key; update the two map
entries that begin with the id-pattern and src-pattern to require a clear key
boundary after the token (e.g., add a trailing word-boundary or otherwise
require the separator to be present) so replacements only occur for standalone
keys (adjust the patterns that currently start with the id-variant and the
src-variant to include the boundary check).
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Around line 62-64: The numeric cleanup currently calls num.toLong() inside
ValueCleanersImpl (the regex match -> num block), which can throw on oversized
OCR digit strings; replace the unsafe conversion with a safe one (e.g.,
num.toLongOrNull()) and only perform the >= 1000L check when the conversion
succeeds, otherwise fall back to the non-truncated branch (do not call dropLast
on failure). Update the conditional that computes finalNum to use the safe
conversion result and ensure no exception path remains when parsing extremely
long numeric OCR outputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1b6f78cb-de6f-4d14-8084-b4616e2bbfa2
📒 Files selected for processing (6)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt
cbc5549 to
911cc55
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt (1)
37-37:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
srckey regex is too permissive and can rewrite unrelated words (e.g.,set).At Line 37,
\\bS[ec][rt]\\balso matches normal tokens likeset, which can be incorrectly normalized tosrc:.Suggested fix
- "\\bS[ec][rt]\\b\\s*[:;]?" to "src: " + "\\b(?:src|src\\.|s[ec]r[ct])\\b\\s*[:;]?" to "src: "🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt` at line 37, The regex rule mapping "\\bS[ec][rt]\\b\\s*[:;]?" => "src: " is too permissive (matches words like "set"); update the rule in OcrSanitizerRules.kt to require a 'c' or 'r' as the second character so ordinary words with 'e' don't match — for example replace the pattern with "\\bS[cr][rt]\\b\\s*[:;]?" (or an equivalent pattern that ensures the second char is not 'e') while keeping the replacement "src: ".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Around line 47-48: The current checks in ValueCleanersImpl (using
DimensionValueSet.matchKeywords and wrapKeywords against the normalized string)
perform substring matches and can false-positive on words like "parent" or
"content"; change the logic to match whole tokens instead (e.g., split
normalized into whitespace/punctuation-delimited tokens or use word-boundary
regex) and test each token for equality against DimensionValueSet.matchKeywords
/ wrapKeywords rather than using "in" substring checks so only exact keyword
tokens map to MATCH_PARENT or WRAP_CONTENT.
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`:
- Line 9: The alternation in tagExtractRegex allows shorter prefixes to match
before longer ones (e.g., "S" before "S8", "8" before "8W"), causing wrong
prefix extraction; reorder the alternatives in the Regex declared as
tagExtractRegex so multi-character prefixes (SW, 8W, S8, etc.) appear before
single-character ones (S, 8) while keeping the (?i) flag and the rest of the
pattern intact so longer prefixes are matched first.
---
Duplicate comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Line 37: The regex rule mapping "\\bS[ec][rt]\\b\\s*[:;]?" => "src: " is too
permissive (matches words like "set"); update the rule in OcrSanitizerRules.kt
to require a 'c' or 'r' as the second character so ordinary words with 'e' don't
match — for example replace the pattern with "\\bS[cr][rt]\\b\\s*[:;]?" (or an
equivalent pattern that ensures the second char is not 'e') while keeping the
replacement "src: ".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 76296306-8d7f-40ed-a630-a77155c4879d
📒 Files selected for processing (7)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt
Refine regex rules, optimize value cleaners, and add support for TextView and Button grammars.
…ze some validations
911cc55 to
1b527b5
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt (1)
9-9:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReorder prefix alternatives so longer OCR variants match first.
At Line 9, shorter alternatives (
S,8) can match before (S8,8W), which can mis-normalize tags likeS8-12/8W-12.Suggested fix
- private val tagExtractRegex = Regex("^(?i)(B|P|D|T|C|R|SW|S|8|8W|S8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)") + private val tagExtractRegex = Regex("^(?i)(8W|S8|SW|B|P|D|T|C|R|S|8)([\\s\\-_.,|/]*)([A-Z0-9_\\-]+)")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt` at line 9, The prefix alternation in tagExtractRegex in WidgetTagParser.kt lists short tokens before longer ones so inputs like "S8-12" or "8W-12" can be misparsed; update the Regex pattern used in the tagExtractRegex property to list longer OCR variants first (e.g., "SW", "8W", "S8" before "S" and "8") so the engine matches the longer prefixes prior to the single-character alternatives and preserves correct normalization.
🧹 Nitpick comments (2)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt (2)
82-82: 💤 Low value
colorMapvisibility changed fromprivateto public.This exposes the internal color mapping. If external access is intentional (e.g., for testing or use in
FuzzyAttributeParser), consider documenting the intent. Otherwise, restoreprivatevisibility.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt` at line 82, The colorMap property in ValueCleanersImpl was made public but should remain private unless intentionally exposed; revert its visibility back to private on the colorMap declaration in ValueCleanersImpl (or if external access is required for FuzzyAttributeParser or tests, add a clear comment/docstring explaining the intentional exposure and provide a controlled accessor method instead). Ensure the symbol name colorMap in class ValueCleanersImpl is either marked private or accompanied by documented justification and a dedicated getter used by FuzzyAttributeParser/tests.
98-98: 💤 Low valueRedundant
.replace(" ", "_")call.The regex
[^a-z_]already removes spaces (along with all non-letter/underscore characters), making the subsequent.replace(" ", "_")a no-op.Proposed fix
- val normalizedValue = rawValue.lowercase().replace(Regex("[^a-z_]"), "").replace(" ", "_") + val normalizedValue = rawValue.lowercase().replace(Regex("[^a-z_]"), "")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt` at line 98, The normalizedValue computation in ValueCleanersImpl.kt redundantly calls .replace(" ", "_") after using replace(Regex("[^a-z_]"), "") which already strips spaces; remove the trailing .replace(" ", "_") so normalizedValue is produced by rawValue.lowercase().replace(Regex("[^a-z_]"), "") (or, if you intended to convert spaces to underscores, change the regex to preserve spaces instead and then perform the replace). Update the expression that computes normalizedValue and run tests to confirm behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt`:
- Line 88: The current predicate uses substring matching
(DimensionValueSet.allKeywords.any { it in lowerToken }) which wrongly accepts
tokens containing a keyword as part of a larger word; change it to
exact/word-boundary matching by testing equality or using a regex with word
boundaries against lowerToken (e.g., DimensionValueSet.allKeywords.any { kw ->
lowerToken == kw ||
Regex("\\b${Regex.escape(kw)}\\b").containsMatchIn(lowerToken) }) so that
currentKey?.valueType == ValueType.DIMENSION only returns true for exact
dimension tokens; update the check where currentKey?.valueType,
ValueType.DIMENSION, DimensionValueSet.allKeywords and lowerToken are referenced
in FuzzyAttributeParser.kt.
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 20-21: The regex rules that replace detected keys into
"layout_width" and "layout_height" (the map entries for "[il]ay..." and
"[il]ay...hei...") can match inside longer words like "display", so update those
two keys to include a leading word boundary (e.g., prepend \b) to ensure they
only match standalone or word-start occurrences; modify the entries in
OcrSanitizerRules.kt that map the
"[il]ay[a-z]*[-_.\\s]*w[a-z0-9]*\\.?\\s*[:;]\\s*" and
"[il]ay[a-z]*[-_.\\s]*hei[a-z0-9]*\\.?\\s*[:;]\\s*" patterns to add the
word-boundary anchor before the pattern.
---
Duplicate comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`:
- Line 9: The prefix alternation in tagExtractRegex in WidgetTagParser.kt lists
short tokens before longer ones so inputs like "S8-12" or "8W-12" can be
misparsed; update the Regex pattern used in the tagExtractRegex property to list
longer OCR variants first (e.g., "SW", "8W", "S8" before "S" and "8") so the
engine matches the longer prefixes prior to the single-character alternatives
and preserves correct normalization.
---
Nitpick comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.kt`:
- Line 82: The colorMap property in ValueCleanersImpl was made public but should
remain private unless intentionally exposed; revert its visibility back to
private on the colorMap declaration in ValueCleanersImpl (or if external access
is required for FuzzyAttributeParser or tests, add a clear comment/docstring
explaining the intentional exposure and provide a controlled accessor method
instead). Ensure the symbol name colorMap in class ValueCleanersImpl is either
marked private or accompanied by documented justification and a dedicated getter
used by FuzzyAttributeParser/tests.
- Line 98: The normalizedValue computation in ValueCleanersImpl.kt redundantly
calls .replace(" ", "_") after using replace(Regex("[^a-z_]"), "") which already
strips spaces; remove the trailing .replace(" ", "_") so normalizedValue is
produced by rawValue.lowercase().replace(Regex("[^a-z_]"), "") (or, if you
intended to convert spaces to underscores, change the regex to preserve spaces
instead and then perform the replace). Update the expression that computes
normalizedValue and run tests to confirm behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a9050152-a3cf-4e8c-a85b-7838b030492b
📒 Files selected for processing (7)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/UiGrammarValidator.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/grammar/WidgetGrammar.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/AttributeModels.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/ValueCleanersImpl.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt
Description
This PR enhances the OCR parsing engine by introducing new widget grammars (TextView, Button) and refining extraction rules. It simplifies the logic in
NumberCleanerandDimensionCleaner, improves regex matching inOcrSanitizerRulesfor layout dimensions and colors, and adds better handling for OCR edge cases (e.g., trailing zeros in dimensions, confusing letters with numbers).Details
ButtonGrammarandTextViewGrammartoUiGrammarValidatorandWidgetGrammar.OcrSanitizerRulesto better catch ids, dimensions, and backgrounds.ValueCleanersImplto use mapped character replacements for numbers and streamlined dimension extraction.FuzzyAttributeParserto properly identify color tokens and numeric values.WidgetTagParserto support new prefixes and improved OCR character numeric checks.document_4936224381652174808.mp4
Ticket
ADFA-4033
Observation
ValueCleanersImplwas significantly refactored for readability and efficiency, particularly in how it cleans numbers and translates common OCR misreads. The fuzzy matching threshold for colors was slightly relaxed to accommodate higher error rates in OCR scanning.