Validate translation worker on real pages#620
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a real-page translation probe and runtime verifier that validate translated body segments across multiple pages, refactors HTML segment extraction to mark body segments and limit JSON-mode batch size, adds a parser verification script and package check step, and relaxes the deploy workflow’s queue-creation idempotency grep. ChangesTranslation Worker Real-Page Probing
Deployment Workflow Idempotency
Sequence DiagramsequenceDiagram
participant CI as CI Workflow
participant Verify as verify-real-ai.ts
participant Worker as Translation Worker
participant Origin as Origin Server
CI->>Verify: start verify-real-ai
Verify->>Worker: GET /__translation-test__/real-runtime
Worker-->>Verify: runtime probe payload
Verify->>Verify: validate runtime probe
loop for each REAL_PAGE_PROBE
Verify->>Worker: GET /__translation-test__/real-page?path=...&locale=es&batches=...
Worker->>Origin: fetch origin HTML for path
Origin-->>Worker: return HTML
Worker->>Worker: extract segments (mark inBody)
Worker->>Worker: build batches (max 12 items)
Worker->>Worker: translate selected batches via JSON-mode
Worker->>Worker: compute diffs, counts, samples
Worker-->>Verify: page probe payload (page metadata + samples)
Verify->>Verify: validate page payload (path, locale, counts, samples)
end
Verify->>CI: report success/failure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 3/5 reviews remaining, refill in 23 minutes. Comment |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/translation-worker/src/index.ts (1)
1987-1987: 💤 Low valueProperty
sourceBytesreports character count, not byte count.
sourceHtml.lengthreturns the number of UTF-16 code units (character count), not bytes. For multi-byte characters, this differs from byte count. Since this is informational metadata only, consider renaming tosourceCharsor usingnew TextEncoder().encode(sourceHtml).lengthfor actual bytes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/translation-worker/src/index.ts` at line 1987, The property sourceBytes is currently assigned using sourceHtml.length (UTF-16 code units), which reports character count not actual byte length; update the assignment in the object that sets sourceBytes to either rename the field to sourceChars (if only informational) or compute actual byte length via new TextEncoder().encode(sourceHtml).length (if true byte count is needed); locate the assignment to sourceBytes and replace sourceHtml.length accordingly or rename the property consistently wherever used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@apps/translation-worker/src/index.ts`:
- Line 1987: The property sourceBytes is currently assigned using
sourceHtml.length (UTF-16 code units), which reports character count not actual
byte length; update the assignment in the object that sets sourceBytes to either
rename the field to sourceChars (if only informational) or compute actual byte
length via new TextEncoder().encode(sourceHtml).length (if true byte count is
needed); locate the assignment to sourceBytes and replace sourceHtml.length
accordingly or rename the property consistently wherever used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2d8c3933-f851-433f-9449-ae93518442b5
📒 Files selected for processing (3)
.github/workflows/deploy-translation.ymlapps/translation-worker/scripts/verify-real-ai.tsapps/translation-worker/src/index.ts
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/translation-worker/scripts/verify-real-ai.ts`:
- Around line 36-39: The REAL_PAGE_PROBES array in verify-real-ai.ts pins checks
to exact marketing/docs copy (REAL_PAGE_PROBES), causing flakiness; replace
these hard-coded strings with stable structural assertions or selector-based
checks (e.g., verify presence of main landmark, nav items, or headings) or
derive the expected snippets at runtime from the fetched page before asserting.
Update the verification logic that consumes REAL_PAGE_PROBES to use CSS
selectors/ARIA landmarks or dynamic extraction instead of exact text matches so
copy edits won't break the gate.
In `@apps/translation-worker/src/index.ts`:
- Around line 2025-2070: The probe currently uses findBatchText(batches, ...)
which scans every batch entry (including title/meta/attrs), letting a non-body
occurrence satisfy a requiredChecks entry; update the logic so requiredChecks
are matched only against body segments: either (preferred) filter segments to
body-only before calling buildBatches (use collectSegments()'s segment.type or
equivalent) so batches contains only body text, or change findBatchText to
accept segment metadata and verify the matched source came from a segment
flagged as body (reference functions/symbols: collectSegments, segments,
buildBatches, batches, findBatchText, probeRealPageTranslation, requiredChecks).
Ensure selectedBatchIndexes and checkSources still use the body-limited matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 231d3e3e-cfb2-423b-adcb-826d20edfe1d
📒 Files selected for processing (4)
apps/translation-worker/package.jsonapps/translation-worker/scripts/verify-parser.tsapps/translation-worker/scripts/verify-real-ai.tsapps/translation-worker/src/index.ts
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
apps/translation-worker/src/index.ts (1)
2009-2055:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep required probe checks scoped to visible body text.
findBatchText()still searches every batch entry, so a phrase duplicated in<title>, meta content, or an attribute can satisfy acheck=without translating the body text this probe is supposed to validate. Restrict the match tosegment.inBody && segment.mode === 'text', or build the probe batches from that filtered segment set before selecting batches.Suggested fix
-function findBatchText(batches: string[][], expectedText: string): { batchIndex: number; textIndex: number; source: string } | null { +function findBatchText( + segments: Segment[], + batches: string[][], + expectedText: string, +): { batchIndex: number; textIndex: number; source: string } | null { + let segmentIndex = 0 for (let batchIndex = 0; batchIndex < batches.length; batchIndex += 1) { const batch = batches[batchIndex] - for (let textIndex = 0; textIndex < batch.length; textIndex += 1) { + for (let textIndex = 0; textIndex < batch.length; textIndex += 1, segmentIndex += 1) { const source = batch[textIndex] - if (source.includes(expectedText)) return { batchIndex, textIndex, source } + const segment = segments[segmentIndex] + if (segment?.inBody && segment.mode === 'text' && source.includes(expectedText)) { + return { batchIndex, textIndex, source } + } } } return null }- const found = findBatchText(batches, check) + const found = findBatchText(segments, batches, check)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/translation-worker/src/index.ts` around lines 2009 - 2055, The probe is matching required checks against any segment (including title/meta/attributes) because findBatchText scans all batches built from collectSegments; restrict checks to visible body text by either (A) filtering segments before calling buildBatches (use only segments where segment.inBody && segment.mode === 'text' when creating batches in probeRealPageTranslation) or (B) change findBatchText to accept and check segment metadata and only match when segment.inBody && segment.mode === 'text'; update calls in probeRealPageTranslation (and selectedBatchIndexes/checkSources logic) to use the filtered batches so requiredChecks only validate actual body text translations (reference symbols: findBatchText, probeRealPageTranslation, collectSegments, buildBatches, segments, segment.inBody, segment.mode).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 727-729: The current findClosingTag function returns the first
occurrence of `</{tagName}` which breaks on nested same-name elements; update it
to perform a depth-aware scan: starting from `startIndex`, iterate through the
HTML searching for both `<{tagName}` (opening) and `</{tagName}` (closing),
incrementing a depth counter on each opening and decrementing on each closing,
and only return when depth reaches zero (returning the closing match index, end,
and tag string). Apply the same depth-aware logic to the other similar helper
used around the 739-750 region (the corresponding call/path that currently
relies on a single-match `findNamedTag`) so nested `<svg>`, `<code>`, etc., are
skipped correctly. Ensure you reuse or extend `findNamedTag` behavior or create
a new helper (referencing `findClosingTag` and `findNamedTag`) to avoid
duplicating scanning logic.
---
Duplicate comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 2009-2055: The probe is matching required checks against any
segment (including title/meta/attributes) because findBatchText scans all
batches built from collectSegments; restrict checks to visible body text by
either (A) filtering segments before calling buildBatches (use only segments
where segment.inBody && segment.mode === 'text' when creating batches in
probeRealPageTranslation) or (B) change findBatchText to accept and check
segment metadata and only match when segment.inBody && segment.mode === 'text';
update calls in probeRealPageTranslation (and selectedBatchIndexes/checkSources
logic) to use the filtered batches so requiredChecks only validate actual body
text translations (reference symbols: findBatchText, probeRealPageTranslation,
collectSegments, buildBatches, segments, segment.inBody, segment.mode).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7dbb06e-ad8e-40f2-80e9-f03ce15ee5e8
📒 Files selected for processing (1)
apps/translation-worker/src/index.ts
|



Summary
/and/docs/through Workers AI JSON mode before deployVerification
capgo-translation-refreshqueueSummary by CodeRabbit
New Features
Improvements
Bug Fixes
Tests