ADFA-3911 | Fix widget tag parsing and centralize OCR correction rules#1310
Conversation
📝 WalkthroughRelease NotesFeatures & Improvements
Technical Changes
|
| Layer / File(s) | Summary |
|---|---|
OCR sanitizer framework setup cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizer.kt, CompositeOcrSanitizer.kt, OcrSanitizerFactory.kt, OcrSanitizerRules.kt |
OcrSanitizer interface and DictionaryRegexSanitizer base class establish regex-replacement sanitizers. CompositeOcrSanitizer composes multiple sanitizers. OcrSanitizerFactory builds a default composite from ColorSanitizer, TextAttributeSanitizer, DimensionSanitizer, MarginPaddingSanitizer, and StructureSanitizer. |
FuzzyAttributeParser sanitizer integration cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt |
FuzzyAttributeParser obtains a default sanitizer from OcrSanitizerFactory and applies sanitizer.sanitize(annotation) before tokenization, replacing inline regex cleanup. |
WidgetTagParser extraction and validation cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt |
tagExtractRegex now captures prefix, separator, and token; parseTagParts returns nullable parsed parts and validates prefixes against VALID_PREFIXES. normalizeTagText and extractTag use parsed parts with uppercase fallback. |
MarginAnnotationParser unified margin handling cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/MarginAnnotationParser.kt |
parse now delegates to parseMarginsGlobally which pools explicit blocks from both margins, resolves implicit blocks against the shared canvasTags pool, and merges annotations; duplicate widget values are joined with " |
Sanitizer rules tests cv-image-to-xml/src/test/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRulesTest.kt |
Unit tests validate each sanitizer class and a test confirms the default composite sanitizer applies multiple rules in order on combined input. |
Sequence Diagram(s)
sequenceDiagram
participant FA as FuzzyAttributeParser
participant OSF as OcrSanitizerFactory
participant COS as CompositeOcrSanitizer
participant CS as ColorSanitizer
participant TS as TextAttributeSanitizer
participant DS as DimensionSanitizer
participant MPS as MarginPaddingSanitizer
participant SS as StructureSanitizer
FA->>OSF: createDefaultSanitizer()
OSF->>COS: new CompositeOcrSanitizer([CS, TS, DS, MPS, SS])
COS-->>FA: sanitizer instance
FA->>COS: sanitize(annotation)
COS->>CS: sanitize(annotation)
CS-->>COS: color-normalized
COS->>TS: sanitize(previous)
TS-->>COS: text-style-normalized
COS->>DS: sanitize(previous)
DS-->>COS: dimension-normalized
COS->>MPS: sanitize(previous)
MPS-->>COS: margin-padding-normalized
COS->>SS: sanitize(previous)
SS-->>COS: structure-normalized
COS-->>FA: fully sanitized annotation
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
- appdevforall/CodeOnTheGo#1261: Related changes to MarginAnnotationParser's explicit-block assignment and side handling.
- appdevforall/CodeOnTheGo#1233: Related to FuzzyAttributeParser OCR normalization/refactoring.
- appdevforall/CodeOnTheGo#1047: Prior changes that modified MarginAnnotationParser's parse flow.
Suggested reviewers
- avestaadfa
- Daniel-ADFA
"I'm a rabbit with a tidy mind,
I hop through OCR crumbs I find.
I stitch punctuation, color, size,
and join the margins, neat and wise.
Hooray — clean tags and saner lines!"
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | Docstring coverage is 19.05% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the two main changes: widget tag parsing fix and centralized OCR correction rules. |
| Description check | ✅ Passed | The description provides detailed explanation of all major changes across four files, including the specific issues fixed and the rationale for centralization. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
📝 Generate docstrings
- Create stacked PR
- Commit on current branch
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Commit unit tests in branch
fix/ADFA-3911-widget-tag-parsing-and-ocr-rules-experimental
Tip
💬 Introducing Slack Agent: The best way for teams to turn conversations into code.
Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
- Generate code and open pull requests
- Plan features and break down work
- Investigate incidents and troubleshoot customer tickets together
- Automate recurring tasks and respond to alerts with triggers
- Summarize progress and report instantly
Built for teams:
- Shared memory across your entire org—no repeating context
- Per-thread sandboxes to safely plan and execute work
- Governance built-in—scoped access, auditability, and budget controls
One agent for your entire SDLC. Right inside Slack.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerFactory.kt (1)
5-15: ⚡ Quick winConsider documenting the sanitizer ordering rationale.
The order in which sanitizers are applied may affect the final output if rules overlap. Consider adding a comment explaining why this specific sequence (Color → TextAttribute → Dimension → MarginPadding → Structure) was chosen.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerFactory.kt` around lines 5 - 15, Add a brief comment above createDefaultSanitizer explaining the rationale for the sanitizer application order: why ColorSanitizer runs before TextAttributeSanitizer, then DimensionSanitizer, MarginPaddingSanitizer, and finally StructureSanitizer; reference CompositeOcrSanitizer and the list of sanitizers (ColorSanitizer, TextAttributeSanitizer, DimensionSanitizer, MarginPaddingSanitizer, StructureSanitizer) and describe any dependencies or side-effects that justify the sequence (e.g., color normalization first so later attribute merging/structure decisions operate on normalized values).cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt (2)
8-10: ⚡ Quick winAvoid duplicating the valid-prefix list between
tagRegexandVALID_PREFIXES.
tagRegex(line 8) andVALID_PREFIXES(line 10) both encode the same prefix set (B|P|D|T|C|R|SW|S). A future addition (e.g., a new widget prefix) is easy to apply to one and forget the other, silently desynchronizing extraction validation frommatches(tagRegex)checks. Consider derivingtagRegexfromVALID_PREFIXESso they stay in lockstep.♻️ Proposed derivation
- private val tagRegex = Regex("^(?i)(B|P|D|T|C|R|SW|S)-[A-Z0-9_]+$") - private val tagExtractRegex = Regex("^(?i)([A-Z0-9\\s]+)([\\s\\-_.]+)([A-Z0-9_\\-]+)") - private val VALID_PREFIXES = setOf("B", "P", "D", "T", "C", "R", "SW", "S") + private val VALID_PREFIXES = setOf("B", "P", "D", "T", "C", "R", "SW", "S") + private val tagRegex = Regex( + "^(?i)(${VALID_PREFIXES.joinToString("|")})-[A-Z0-9_]+$" + ) + private val tagExtractRegex = Regex("^(?i)([A-Z0-9\\s]+)([\\s\\-_.]+)([A-Z0-9_\\-]+)")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt` around lines 8 - 10, The prefix list is duplicated between VALID_PREFIXES and the hardcoded tagRegex; update the implementation so tagRegex is constructed from VALID_PREFIXES instead of repeating the literals: build a regex alternation by joining VALID_PREFIXES (sort by length desc to ensure multi-char prefixes like "SW" match before "S"), escape any regex-special chars if needed, prepend the case-insensitive flag (?i) and then compile that into tagRegex; keep tagExtractRegex unchanged but ensure any logic that relies on the prefix set uses VALID_PREFIXES (or the dynamically-built regex) so the two stay in sync.
75-79: 💤 Low value
isValidTagMatchis now effectively dead-code-protected.With the updated
tagExtractRegex, group 2 is[\s\-_.]+(one or more), soseparator.isEmpty()can never be true when the regex matches. The guard therefore always returnstrue, and the three call sites (isTag,normalizeTagText,extractTag) perform a no-op check. Either tighten the validation to do something meaningful (e.g., reject single-character ambiguous separators that aren't- _ .), or drop the helper to reduce noise.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt` around lines 75 - 79, The current isValidTagMatch guard is useless because group 2 is always non-empty; update isValidTagMatch to reject ambiguous single-character separators (e.g., a single whitespace or other punctuation) while allowing "-", "_" and ".". Concretely, in isValidTagMatch(match: MatchResult) inspect val separator = match.groupValues[2] and val rawToken = match.groupValues[3] and return false when separator.length == 1 && separator !in listOf("-", "_", ".") && rawToken.firstOrNull()?.isLetter() == true; otherwise return true. This keeps the three call sites (isTag, normalizeTagText, extractTag) meaningful by filtering out tokens that start with a letter preceded by an ambiguous one-char separator.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/MarginAnnotationParser.kt`:
- Around line 27-33: The merge is allowing the same canvas tag to be assigned by
both parseMarginGroup(distribution.leftMargin, canvasTags) and
parseMarginGroup(distribution.rightMargin, canvasTags) because splitCanvasTags
was removed and each resolveImplicitBlocks maintains independent
unresolvedTagsByPrefix; fix by reintroducing shared consumption or re-splitting:
either (A) restore splitCanvasTags and pass side-specific tag lists into
parseMarginGroup so left/right only see their half, or (B) keep the unified
canvasTags but add a shared consumedTags set that both parseMarginGroup calls
consult/mark and have resolveImplicitBlocks check against (and against
existingAnnotations across both margins) before popping a tag; update
annotationMap merging logic only if you intentionally want multi-side
annotations.
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt`:
- Around line 17-23: Add unit tests that exercise the OCR sanitizer
implementations (e.g., DimensionSanitizer, ColorSanitizer and any other classes
extending DictionaryRegexSanitizer) to assert that their rawRules produce the
expected normalized strings for common OCR misreads; specifically, create tests
that instantiate each sanitizer (or call the public normalization method on
DictionaryRegexSanitizer if available), feed representative malformed inputs
like "ilayout wioidth", "layout hei5ht.", "mwa tch-parent" and assert outputs
equal "layout_width:", "layout_height:", "match_parent" (and analogous expected
results for ColorSanitizer), and include negative/edge cases to ensure rules
don't overreplace. Ensure tests reference the class names DimensionSanitizer and
ColorSanitizer (or other sanitizer class names) so failures point directly to
the right implementation.
---
Nitpick comments:
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerFactory.kt`:
- Around line 5-15: Add a brief comment above createDefaultSanitizer explaining
the rationale for the sanitizer application order: why ColorSanitizer runs
before TextAttributeSanitizer, then DimensionSanitizer, MarginPaddingSanitizer,
and finally StructureSanitizer; reference CompositeOcrSanitizer and the list of
sanitizers (ColorSanitizer, TextAttributeSanitizer, DimensionSanitizer,
MarginPaddingSanitizer, StructureSanitizer) and describe any dependencies or
side-effects that justify the sequence (e.g., color normalization first so later
attribute merging/structure decisions operate on normalized values).
In
`@cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt`:
- Around line 8-10: The prefix list is duplicated between VALID_PREFIXES and the
hardcoded tagRegex; update the implementation so tagRegex is constructed from
VALID_PREFIXES instead of repeating the literals: build a regex alternation by
joining VALID_PREFIXES (sort by length desc to ensure multi-char prefixes like
"SW" match before "S"), escape any regex-special chars if needed, prepend the
case-insensitive flag (?i) and then compile that into tagRegex; keep
tagExtractRegex unchanged but ensure any logic that relies on the prefix set
uses VALID_PREFIXES (or the dynamically-built regex) so the two stay in sync.
- Around line 75-79: The current isValidTagMatch guard is useless because group
2 is always non-empty; update isValidTagMatch to reject ambiguous
single-character separators (e.g., a single whitespace or other punctuation)
while allowing "-", "_" and ".". Concretely, in isValidTagMatch(match:
MatchResult) inspect val separator = match.groupValues[2] and val rawToken =
match.groupValues[3] and return false when separator.length == 1 && separator
!in listOf("-", "_", ".") && rawToken.firstOrNull()?.isLetter() == true;
otherwise return true. This keeps the three call sites (isTag, normalizeTagText,
extractTag) meaningful by filtering out tokens that start with a letter preceded
by an ambiguous one-char separator.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c1d0639a-5974-4ad6-96df-f9c177f786ae
📒 Files selected for processing (7)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/MarginAnnotationParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/CompositeOcrSanitizer.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizer.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerFactory.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt
Addresses mixed up Button/Switch tags, improves margin parsing, and refactors OCR corrections.
62cb964 to
7fc9832
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cv-image-to-xml/src/test/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRulesTest.kt (1)
8-67: ⚡ Quick winConsider adding edge case tests.
The current tests verify the main sanitization paths well. Consider adding tests for:
- Empty/blank strings (verify graceful handling)
- Already-correct input (verify sanitizers don't break valid data)
- Idempotency (verify
sanitizer.sanitize(sanitizer.sanitize(x)) == sanitizer.sanitize(x))These would increase confidence that sanitizers handle unexpected input safely and won't corrupt valid data.
📋 Example edge case tests
`@Test` fun `sanitizers handle empty input gracefully`() { val sanitizers = listOf( ColorSanitizer(), TextAttributeSanitizer(), DimensionSanitizer() ) sanitizers.forEach { sanitizer -> assertEquals("", sanitizer.sanitize("")) } } `@Test` fun `sanitizers preserve already-correct input`() { val sanitizer = OcrSanitizerFactory.createDefaultSanitizer() val validInput = "background red | text_style: bold | layout_width: match_parent" assertEquals(validInput, sanitizer.sanitize(validInput)) } `@Test` fun `sanitizers are idempotent`() { val sanitizer = OcrSanitizerFactory.createDefaultSanitizer() val input = "backgroundired | text stjle" val firstPass = sanitizer.sanitize(input) val secondPass = sanitizer.sanitize(firstPass) assertEquals(firstPass, secondPass) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cv-image-to-xml/src/test/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRulesTest.kt` around lines 8 - 67, Add edge-case unit tests to the existing OcrSanitizerRulesTest: verify empty/blank input handling by calling sanitize("") on ColorSanitizer, TextAttributeSanitizer, DimensionSanitizer (and others) and expecting "" (or the defined safe output); add a test that passes an already-correct string to OcrSanitizerFactory.createDefaultSanitizer().sanitize(...) and asserts it remains unchanged; add an idempotency test that runs sanitizer.sanitize(input) twice and asserts the first and second results are equal using the default sanitizer; use the existing classes/methods (ColorSanitizer, TextAttributeSanitizer, DimensionSanitizer, MarginPaddingSanitizer, StructureSanitizer, OcrSanitizerFactory.createDefaultSanitizer(), sanitizer.sanitize) so tests live alongside the current cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@cv-image-to-xml/src/test/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRulesTest.kt`:
- Line 16: Fix inconsistent indentation in OcrSanitizerRulesTest.kt by replacing
tab characters with 4 space characters for the annotated test lines;
specifically update the `@Test` annotation and surrounding lines where tabs are
used (lines containing the `@Test` marker and the subsequent test method
declarations in OcrSanitizerRulesTest) so they use spaces only—ensure
occurrences around the `@Test` annotation and the related test method
signatures/body (the blocks referenced at lines 16, 24-25, 32-33, 41, and 59 in
the diff) are converted to 4-space indentation to match the rest of the file.
---
Nitpick comments:
In
`@cv-image-to-xml/src/test/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRulesTest.kt`:
- Around line 8-67: Add edge-case unit tests to the existing
OcrSanitizerRulesTest: verify empty/blank input handling by calling sanitize("")
on ColorSanitizer, TextAttributeSanitizer, DimensionSanitizer (and others) and
expecting "" (or the defined safe output); add a test that passes an
already-correct string to
OcrSanitizerFactory.createDefaultSanitizer().sanitize(...) and asserts it
remains unchanged; add an idempotency test that runs sanitizer.sanitize(input)
twice and asserts the first and second results are equal using the default
sanitizer; use the existing classes/methods (ColorSanitizer,
TextAttributeSanitizer, DimensionSanitizer, MarginPaddingSanitizer,
StructureSanitizer, OcrSanitizerFactory.createDefaultSanitizer(),
sanitizer.sanitize) so tests live alongside the current cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e50bdc0b-c4a0-4292-a4c9-2f81aeb70a98
📒 Files selected for processing (8)
cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/MarginAnnotationParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/CompositeOcrSanitizer.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizer.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerFactory.ktcv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.ktcv-image-to-xml/src/test/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRulesTest.kt
🚧 Files skipped from review as they are similar to previous changes (7)
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerFactory.kt
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/CompositeOcrSanitizer.kt
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizer.kt
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/MarginAnnotationParser.kt
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/WidgetTagParser.kt
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/FuzzyAttributeParser.kt
- cv-image-to-xml/src/main/java/org/appdevforall/codeonthego/computervision/domain/parser/sanitizer/OcrSanitizerRules.kt
Description
Fixed an issue where Button widget values were being mixed up with Switch widget values due to improper parsing of separator characters. Additionally, centralized and expanded OCR sanitization rules to better handle common misreads (such as "bsld" or "text stjle" to "text_style") to improve overall attribute parsing accuracy.
Details
WidgetTagParser.kt: Added the dot (.) character to the separator regex group intagExtractRegexto correctly parse and isolate widget tags, preventing Button/Switch collision.OcrSanitizationRules.kt: Created a new dedicated file containing a list ofOcrCorrectionRuleto centralize regex patterns and replacements for colors, dimensions, margins, and text attributes.FuzzyAttributeParser.kt: RefactoredtokenizeAnnotationto fold over the newpreTokenizationRuleslist instead of using hardcoded chained string replacements.MarginAnnotationParser.kt: Updated margin parsing logic to safely concatenate overlapping right margin annotations using a pipe (|) delimiter instead of overwriting existing map keys.Screen.Recording.2026-05-14.at.11.56.39.AM.mov
Ticket
ADFA-3911
Observation
Moving the OCR sanitization rules into their own file (
OcrSanitizationRules.kt) significantly cleans up theFuzzyAttributeParserand makes it much easier to scale and maintain new OCR correction patterns as more edge cases are discovered in the future.