Validate translation worker on real pages by riderx · Pull Request #620 · Cap-go/website

riderx · 2026-05-01T23:07:54Z

Summary

- Fix the translation deploy queue setup so it accepts Wrangler's "already taken" message when the queue already exists.
- Extend the real Wrangler remote probe to translate actual capgo.app pages, not only synthetic strings.
- Validate Spanish real-page batches for / and /docs/ through Workers AI JSON mode before deploy.

Verification

bun run ci:verify:translation
cd apps/translation-worker && bunx wrangler deploy --dry-run
cd apps/translation-worker && bunx wrangler deploy --dry-run -c wrangler.real-test.jsonc
bun run verify:real-translation
local deploy queue setup check against existing capgo-translation-refresh queue

Summary by CodeRabbit

New Features
- Real-page translation probe endpoint that runs per-page translation checks across multiple pages and requires stricter validation before marking runs as passed.
Improvements
- Tracks and validates translations specifically for body content; reduced JSON batch size and bumped cache version for finer-grained batching. Increased probe timeouts for reliability.
Bug Fixes
- Deployment workflow now treats more “already exists/already taken” queue messages as non-fatal.
Tests
- Added parser verification script and expanded probe/test helpers for more thorough validation.

coderabbitai · 2026-05-01T23:08:02Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: deca8565-ddaa-4282-85cd-1a47ca292f89

📥 Commits

Reviewing files that changed from the base of the PR and between 5a26254 and 668e428.

📒 Files selected for processing (3)

apps/translation-worker/scripts/verify-parser.ts
apps/translation-worker/scripts/verify-real-ai.ts
apps/translation-worker/src/index.ts

🚧 Files skipped from review as they are similar to previous changes (1)

apps/translation-worker/scripts/verify-parser.ts

Walkthrough

Adds a real-page translation probe and runtime verifier that validate translated body segments across multiple pages, refactors HTML segment extraction to mark body segments and limit JSON-mode batch size, adds a parser verification script and package check step, and relaxes the deploy workflow’s queue-creation idempotency grep.
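
The body-scoped validation described above can be sketched roughly as follows. This is a minimal illustration, not the worker's code: only the `inBody` flag comes from the PR, while the `SketchSegment` shape, the `normalizeText` helper, and the returned field names are assumptions.

```typescript
// Hypothetical, simplified segment shape; only `inBody` is named in the PR.
interface SketchSegment {
  source: string
  inBody: boolean
}

interface SketchBodyStats {
  candidateCount: number
  changedCount: number
}

// Collapse whitespace and case so trivial differences do not count as a change.
function normalizeText(value: string): string {
  return value.replace(/\s+/g, ' ').trim().toLowerCase()
}

// Count body segments and how many of them differ after translation.
// Segments outside <body> (title, meta, and so on) are ignored entirely.
function sketchBodyTranslationStats(segments: SketchSegment[], translated: string[]): SketchBodyStats {
  let candidateCount = 0
  let changedCount = 0
  segments.forEach((segment, index) => {
    if (!segment.inBody) return
    candidateCount += 1
    const output = translated[index] ?? segment.source
    if (normalizeText(output) !== normalizeText(segment.source)) changedCount += 1
  })
  return { candidateCount, changedCount }
}

const sketchStats = sketchBodyTranslationStats(
  [
    { source: 'Capgo', inBody: false }, // e.g. <title>, not counted
    { source: 'Live updates', inBody: true },
    { source: 'Docs', inBody: true },
  ],
  ['Capgo ES', 'Actualizaciones en vivo', 'Docs'],
)
console.log(sketchStats) // { candidateCount: 2, changedCount: 1 }
```

A gate built on these numbers would fail a page whose head metadata was translated but whose visible body text came back unchanged.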

Changes

Translation Worker Real-Page Probing

| Layer / File(s) | Summary |
| --- | --- |
| Data Shape / Probe Payload `apps/translation-worker/scripts/verify-real-ai.ts` | `ProbePayload` gains optional `page` with `path`, `locale`, numeric counts (`segmentCount`, `bodySegmentCount`, `batchCount`, `translatedBatchCount`, `translatedSegmentCount`, `changedCount`) and validation inputs (`bodyChecks`, `samples`). |
| Probe Timing & Config `apps/translation-worker/scripts/verify-real-ai.ts` | Increased `TIMEOUT_MS`, relaxed `REQUEST_TIMEOUT_MS`, and added `REAL_PAGE_PROBES` list for multiple page probes. |
| Probe Fetch & Validation `apps/translation-worker/scripts/verify-real-ai.ts` | Introduced `fetchJsonProbe`; split runtime validation into `fetchRuntimeProbe` and `fetchRealPageProbe` (enforces `payload.ok`, expected model, `page` presence, exact `page.path`, `page.locale === 'es'`, minimum counts, non-empty `bodyChecks`/`samples`). |
| Probe Orchestration `apps/translation-worker/scripts/verify-real-ai.ts` | Probe flow now calls runtime probe first, then sequentially validates each `REAL_PAGE_PROBES` `real-page` URL; run passes only if all page probes succeed. |
| Segment Model & Batching Constants `apps/translation-worker/src/index.ts` | `Segment` adds `inBody: boolean`; `TRANSLATION_CACHE_VERSION` bumped; `MAX_BATCH_ITEMS` reduced from 32 to 12. |
| HTML Parsing / Segment Collection `apps/translation-worker/src/index.ts` | `collectSegments`, `appendTag`, and `addSegment` track `insideBody` and propagate `inBody`; added `RAW_TEXT_SKIP_TAGS` and `findNamedTag` for case-insensitive tag handling. |
| Body-scoped Validation `apps/translation-worker/src/index.ts` | Added `bodyTranslationStats` and `assertTranslatedBody` to compute candidate/changed counts using only `inBody` segments; finalization calls `assertTranslatedBody(...)`. |
| Real-Page Probe Implementation `apps/translation-worker/src/index.ts` | Added `/__translation-test__/real-page` handling via `probeRealPageTranslation(...)`: fetches origin HTML, extracts segments (marking body), builds/clamps batches, translates selected batches with JSON-mode, computes normalized `changedCount`, and returns `page` metadata plus samples. |
| Test Routing & Exports `apps/translation-worker/src/index.ts` | `handleTranslationTestRequest` routes `.../real-page`; `__translationWorkerTest` export expanded to include `bodyTranslationStats`, `buildBatches`, `collectSegments`, and `renderTranslatedHtml`. |
| Parser Verification Script `apps/translation-worker/scripts/verify-parser.ts` | New script uses `__translationWorkerTest` with an embedded HTML fixture to assert body-only segment collection, compute body translation stats, render translated HTML, and verify skipped-script content remains unchanged. |
| Package Scripts `apps/translation-worker/package.json` | `check` now runs `tsc --noEmit && bun run test:parser`; added `test:parser` = `bun run scripts/verify-parser.ts`. |
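
To illustrate what the smaller `MAX_BATCH_ITEMS` means in practice, here is a rough batching sketch. Only the limit of 12 items comes from the PR; the character budget, the function name `sketchBuildBatches`, and its signature are assumptions, and the real `buildBatches` may work differently.

```typescript
const SKETCH_MAX_BATCH_ITEMS = 12 // item limit stated in the PR
const SKETCH_MAX_BATCH_CHARS = 2000 // hypothetical character budget, not from the PR

// Greedily pack texts into batches, starting a new batch when either limit would be exceeded.
function sketchBuildBatches(
  texts: string[],
  maxItems: number = SKETCH_MAX_BATCH_ITEMS,
  maxChars: number = SKETCH_MAX_BATCH_CHARS,
): string[][] {
  const batches: string[][] = []
  let current: string[] = []
  let currentChars = 0
  for (const text of texts) {
    const full = current.length >= maxItems
    const tooLong = current.length > 0 && currentChars + text.length > maxChars
    if (full || tooLong) {
      batches.push(current)
      current = []
      currentChars = 0
    }
    current.push(text)
    currentChars += text.length
  }
  if (current.length > 0) batches.push(current)
  return batches
}

const sketchTexts = Array.from({ length: 30 }, (_, i) => `segment ${i}`)
console.log(sketchBuildBatches(sketchTexts).map(batch => batch.length)) // [ 12, 12, 6 ]
```

Smaller batches mean more model calls per page, but each JSON-mode response is shorter, so a single malformed response affects fewer segments.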

Deployment Workflow Idempotency

| Layer / File(s) | Summary |
| --- | --- |
| Workflow Step `.github/workflows/deploy-translation.yml` | `ensure_queue` helper’s post-create grep pattern expanded to include `already.*${queue}`, treating additional “already …” output variants as non-fatal when the target queue already exists. |
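
The workflow itself uses `grep` in a shell helper, but the idea behind the relaxed check can be sketched in TypeScript. The function name `isQueueAlreadyPresent` and the sample messages are hypothetical; they are not verbatim Wrangler output.

```typescript
// Escape regex metacharacters so the queue name is matched literally.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Return true when any single output line mentions "already" followed by the queue name,
// mirroring a line-based `already.*${queue}` grep.
function isQueueAlreadyPresent(output: string, queue: string): boolean {
  const pattern = new RegExp(`already.*${escapeRegExp(queue)}`, 'i')
  return output.split(/\r?\n/).some(line => pattern.test(line))
}

const sketchQueue = 'capgo-translation-refresh'
// Hypothetical message shapes, for illustration only.
console.log(isQueueAlreadyPresent(`Queue name is already taken: ${sketchQueue}`, sketchQueue)) // true
console.log(isQueueAlreadyPresent('Authentication error', sketchQueue)) // false
```

Requiring the queue name on the same line keeps an unrelated "already" message from masking a genuine creation failure.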

Sequence Diagram

sequenceDiagram
    participant CI as CI Workflow
    participant Verify as verify-real-ai.ts
    participant Worker as Translation Worker
    participant Origin as Origin Server

    CI->>Verify: start verify-real-ai
    Verify->>Worker: GET /__translation-test__/real-runtime
    Worker-->>Verify: runtime probe payload
    Verify->>Verify: validate runtime probe

    loop for each REAL_PAGE_PROBE
        Verify->>Worker: GET /__translation-test__/real-page?path=...&locale=es&batches=...
        Worker->>Origin: fetch origin HTML for path
        Origin-->>Worker: return HTML
        Worker->>Worker: extract segments (mark inBody)
        Worker->>Worker: build batches (max 12 items)
        Worker->>Worker: translate selected batches via JSON-mode
        Worker->>Worker: compute diffs, counts, samples
        Worker-->>Verify: page probe payload (page metadata + samples)
        Verify->>Verify: validate page payload (path, locale, counts, samples)
    end

    Verify->>CI: report success/failure
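
The "validate page payload" step in the diagram might look something like the sketch below. The field names under `page` follow the change summary, but the `model` field, the minimum thresholds of 1, the placeholder model name, and the function name `validatePageProbe` are assumptions made for illustration.

```typescript
interface SketchPageProbe {
  path: string
  locale: string
  bodySegmentCount: number
  translatedSegmentCount: number
  changedCount: number
  bodyChecks: unknown[]
  samples: unknown[]
}

interface SketchProbePayload {
  ok: boolean
  model?: string // assumed field name for the reported model
  page?: SketchPageProbe
}

// Collect every validation failure instead of stopping at the first one,
// so a CI log shows everything that is wrong with a page probe.
function validatePageProbe(payload: SketchProbePayload, expectedPath: string, expectedModel: string): string[] {
  const errors: string[] = []
  if (!payload.ok) errors.push('payload.ok is false')
  if (payload.model !== expectedModel) errors.push(`unexpected model: ${String(payload.model)}`)
  const page = payload.page
  if (!page) {
    errors.push('missing page metadata')
    return errors
  }
  if (page.path !== expectedPath) errors.push(`path mismatch: ${page.path}`)
  if (page.locale !== 'es') errors.push(`locale mismatch: ${page.locale}`)
  if (page.bodySegmentCount < 1) errors.push('no body segments')
  if (page.translatedSegmentCount < 1) errors.push('no translated segments')
  if (page.changedCount < 1) errors.push('no changed segments')
  if (page.bodyChecks.length === 0) errors.push('bodyChecks is empty')
  if (page.samples.length === 0) errors.push('samples is empty')
  return errors
}

const sketchPayload: SketchProbePayload = {
  ok: true,
  model: 'example-model', // placeholder, not the worker's real model name
  page: {
    path: '/docs/',
    locale: 'es',
    bodySegmentCount: 40,
    translatedSegmentCount: 12,
    changedCount: 11,
    bodyChecks: ['Documentation'],
    samples: [{ source: 'Documentation', translated: 'Documentación' }],
  },
}
console.log(validatePageProbe(sketchPayload, '/docs/', 'example-model')) // []
```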

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Fix translation worker HTML parsing and HEAD cache refresh #608: overlaps changes to HTML parsing and tag-scanning used by segment extraction.
Add translation fallback tests #618: touches translation-worker exports and package test scripts similar to this PR’s additions.
[codex] Ensure development translation queue #614: modifies deploy workflow queue-creation idempotency logic related to the expanded grep handling.

Poem

🐰 I hop through pages, marking what’s in the body,

batch a few lines, translate them a little shoddy.
I fetch the site, sample, count, and test,
keep skipped code safe and let queues rest.
Happy hops — small checks that serve the nest.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Validate translation worker on real pages' directly matches the main objective: extending the real Wrangler remote probe to translate actual capgo.app pages and validate Spanish real-page batches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

chatgpt-codex-connector · 2026-05-01T23:11:48Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

coderabbitai

🧹 Nitpick comments (1)

apps/translation-worker/src/index.ts (1)
1987-1987: 💤 Low value

Property sourceBytes reports character count, not byte count.

sourceHtml.length returns the number of UTF-16 code units (character count), not bytes. For multi-byte characters, this differs from byte count. Since this is informational metadata only, consider renaming to sourceChars or using new TextEncoder().encode(sourceHtml).length for actual bytes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` at line 1987, The property sourceBytes
is currently assigned using sourceHtml.length (UTF-16 code units), which reports
character count not actual byte length; update the assignment in the object that
sets sourceBytes to either rename the field to sourceChars (if only
informational) or compute actual byte length via new
TextEncoder().encode(sourceHtml).length (if true byte count is needed); locate
the assignment to sourceBytes and replace sourceHtml.length accordingly or
rename the property consistently wherever used.
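
The difference between the two measurements is easy to see with a short example. This is a standalone sketch; `utf8ByteLength` is a hypothetical helper name, not a function from the worker.

```typescript
// Count UTF-8 bytes rather than UTF-16 code units.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length
}

// "café" followed by a space and an emoji, written with escapes to avoid encoding ambiguity.
const sketchSample = 'caf\u00e9 \u{1F600}'
console.log(sketchSample.length) // 7 UTF-16 code units
console.log(utf8ByteLength(sketchSample)) // 10 UTF-8 bytes
```

For pure ASCII HTML the two numbers agree, which is why the mismatch tends to go unnoticed until a page contains accented or non-Latin text.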

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d8c3933-f851-433f-9449-ae93518442b5

📥 Commits

Reviewing files that changed from the base of the PR and between d9e998b and 160582e.

📒 Files selected for processing (3)

.github/workflows/deploy-translation.yml
apps/translation-worker/scripts/verify-real-ai.ts
apps/translation-worker/src/index.ts

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/scripts/verify-real-ai.ts`:
- Around line 36-39: The REAL_PAGE_PROBES array in verify-real-ai.ts pins checks
to exact marketing/docs copy (REAL_PAGE_PROBES), causing flakiness; replace
these hard-coded strings with stable structural assertions or selector-based
checks (e.g., verify presence of main landmark, nav items, or headings) or
derive the expected snippets at runtime from the fetched page before asserting.
Update the verification logic that consumes REAL_PAGE_PROBES to use CSS
selectors/ARIA landmarks or dynamic extraction instead of exact text matches so
copy edits won't break the gate.

In `@apps/translation-worker/src/index.ts`:
- Around line 2025-2070: The probe currently uses findBatchText(batches, ...)
which scans every batch entry (including title/meta/attrs), letting a non-body
occurrence satisfy a requiredChecks entry; update the logic so requiredChecks
are matched only against body segments: either (preferred) filter segments to
body-only before calling buildBatches (use collectSegments()'s segment.type or
equivalent) so batches contains only body text, or change findBatchText to
accept segment metadata and verify the matched source came from a segment
flagged as body (reference functions/symbols: collectSegments, segments,
buildBatches, batches, findBatchText, probeRealPageTranslation, requiredChecks).
Ensure selectedBatchIndexes and checkSources still use the body-limited matches.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 231d3e3e-cfb2-423b-adcb-826d20edfe1d

📥 Commits

Reviewing files that changed from the base of the PR and between 82885d3 and ac984a1.

📒 Files selected for processing (4)

apps/translation-worker/package.json
apps/translation-worker/scripts/verify-parser.ts
apps/translation-worker/scripts/verify-real-ai.ts
apps/translation-worker/src/index.ts

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

apps/translation-worker/src/index.ts (1)

2009-2055: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep required probe checks scoped to visible body text.

findBatchText() still searches every batch entry, so a phrase duplicated in <title>, meta content, or an attribute can satisfy a check= without translating the body text this probe is supposed to validate. Restrict the match to segment.inBody && segment.mode === 'text', or build the probe batches from that filtered segment set before selecting batches.

Suggested fix

-function findBatchText(batches: string[][], expectedText: string): { batchIndex: number; textIndex: number; source: string } | null {
+function findBatchText(
+  segments: Segment[],
+  batches: string[][],
+  expectedText: string,
+): { batchIndex: number; textIndex: number; source: string } | null {
+  let segmentIndex = 0
   for (let batchIndex = 0; batchIndex < batches.length; batchIndex += 1) {
     const batch = batches[batchIndex]
-    for (let textIndex = 0; textIndex < batch.length; textIndex += 1) {
+    for (let textIndex = 0; textIndex < batch.length; textIndex += 1, segmentIndex += 1) {
       const source = batch[textIndex]
-      if (source.includes(expectedText)) return { batchIndex, textIndex, source }
+      const segment = segments[segmentIndex]
+      if (segment?.inBody && segment.mode === 'text' && source.includes(expectedText)) {
+        return { batchIndex, textIndex, source }
+      }
     }
   }
   return null
 }

-    const found = findBatchText(batches, check)
+    const found = findBatchText(segments, batches, check)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` around lines 2009 - 2055, The probe is
matching required checks against any segment (including title/meta/attributes)
because findBatchText scans all batches built from collectSegments; restrict
checks to visible body text by either (A) filtering segments before calling
buildBatches (use only segments where segment.inBody && segment.mode === 'text'
when creating batches in probeRealPageTranslation) or (B) change findBatchText
to accept and check segment metadata and only match when segment.inBody &&
segment.mode === 'text'; update calls in probeRealPageTranslation (and
selectedBatchIndexes/checkSources logic) to use the filtered batches so
requiredChecks only validate actual body text translations (reference symbols:
findBatchText, probeRealPageTranslation, collectSegments, buildBatches,
segments, segment.inBody, segment.mode).
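
Option (A) from the prompt above, filtering before batching, could be sketched like this. The `ProbeSegment` shape, the `'attr'` mode value, and the helper names are assumptions; only `inBody` and `mode === 'text'` come from the review comment.

```typescript
// Hypothetical segment shape; the review comment names only `inBody` and `mode === 'text'`.
interface ProbeSegment {
  source: string
  inBody: boolean
  mode: 'text' | 'attr'
}

// Keep only visible body text before any batching happens.
function bodyTextSources(segments: ProbeSegment[]): string[] {
  return segments.filter(segment => segment.inBody && segment.mode === 'text').map(segment => segment.source)
}

// Simple fixed-size chunking, standing in for the worker's batch builder.
function chunkIntoBatches(texts: string[], size: number): string[][] {
  const step = Math.max(1, size)
  const batches: string[][] = []
  for (let i = 0; i < texts.length; i += step) batches.push(texts.slice(i, i + step))
  return batches
}

// Same search as the original helper, but it only ever sees body-text batches.
function findBatchTextSketch(
  batches: string[][],
  expectedText: string,
): { batchIndex: number; textIndex: number; source: string } | null {
  for (let batchIndex = 0; batchIndex < batches.length; batchIndex += 1) {
    const batch = batches[batchIndex]
    for (let textIndex = 0; textIndex < batch.length; textIndex += 1) {
      const source = batch[textIndex]
      if (source.includes(expectedText)) return { batchIndex, textIndex, source }
    }
  }
  return null
}

const probeSegments: ProbeSegment[] = [
  { source: 'Capgo live updates', inBody: false, mode: 'text' }, // e.g. <title>
  { source: 'Capgo live updates', inBody: true, mode: 'attr' }, // e.g. an alt attribute
  { source: 'Ship Capgo live updates today', inBody: true, mode: 'text' },
]
const bodyBatches = chunkIntoBatches(bodyTextSources(probeSegments), 12)
console.log(findBatchTextSketch(bodyBatches, 'live updates'))
// { batchIndex: 0, textIndex: 0, source: 'Ship Capgo live updates today' }
```

Compared with the suggested diff, which threads segment metadata through the search, filtering up front keeps `findBatchText` unchanged but means the probe batches no longer line up with the batches used for full-page translation.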

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 727-729: The current findClosingTag function returns the first
occurrence of `</{tagName}` which breaks on nested same-name elements; update it
to perform a depth-aware scan: starting from `startIndex`, iterate through the
HTML searching for both `<{tagName}` (opening) and `</{tagName}` (closing),
incrementing a depth counter on each opening and decrementing on each closing,
and only return when depth reaches zero (returning the closing match index, end,
and tag string). Apply the same depth-aware logic to the other similar helper
used around the 739-750 region (the corresponding call/path that currently
relies on a single-match `findNamedTag`) so nested `<svg>`, `<code>`, etc., are
skipped correctly. Ensure you reuse or extend `findNamedTag` behavior or create
a new helper (referencing `findClosingTag` and `findNamedTag`) to avoid
duplicating scanning logic.
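
A depth-aware scan along the lines described above might look like the following sketch. The name `findMatchingClosingTag`, the `TagMatch` shape, and the regex-based approach are assumptions rather than the worker's implementation. It assumes `startIndex` points just past the opening tag and that `tagName` is a plain alphanumeric name; it does not handle `>` inside attribute values or raw-text elements such as `<script>`.

```typescript
interface TagMatch {
  index: number // where the matching closing tag starts
  end: number // index just past the closing tag
  tag: string // the closing tag text as found in the HTML
}

// Scan forward from just after an opening tag, tracking nesting depth of same-name elements.
function findMatchingClosingTag(html: string, tagName: string, startIndex: number): TagMatch | null {
  // The lookahead keeps `<code>` from matching longer names such as `<codeblock>`.
  const pattern = new RegExp(`<(/?)${tagName}(?=[\\s/>])[^>]*>`, 'gi')
  pattern.lastIndex = startIndex
  let depth = 1 // the opening tag before startIndex is already open
  let match: RegExpExecArray | null
  while ((match = pattern.exec(html)) !== null) {
    const isClosing = match[1] === '/'
    const isSelfClosing = !isClosing && match[0].endsWith('/>')
    if (isClosing) depth -= 1
    else if (!isSelfClosing) depth += 1
    if (depth === 0) {
      return { index: match.index, end: match.index + match[0].length, tag: match[0] }
    }
  }
  return null // unbalanced markup: no matching close found
}

const nestedHtml = '<svg><g><svg></svg></g></svg><p>after</p>'
const outerClose = findMatchingClosingTag(nestedHtml, 'svg', '<svg>'.length)
console.log(outerClose) // { index: 23, end: 29, tag: '</svg>' }
console.log(outerClose ? nestedHtml.slice(outerClose.end) : null) // <p>after</p>
```

A first-match search would stop at the inner `</svg>` at index 13 and treat the rest of the outer element as translatable content.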

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7dbb06e-ad8e-40f2-80e9-f03ce15ee5e8

📥 Commits

Reviewing files that changed from the base of the PR and between ac984a1 and 5a26254.

📒 Files selected for processing (1)

apps/translation-worker/src/index.ts

sonarqubecloud · 2026-05-02T00:06:12Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Validate translation worker on real pages

160582e

riderx marked this pull request as ready for review May 1, 2026 23:11

coderabbitai Bot reviewed May 1, 2026

riderx added 3 commits May 2, 2026 01:17

Fix translation probe byte count

82885d3

Translate body text after skipped scripts

ac984a1

Deduplicate translation tag scanning

5a26254

coderabbitai Bot reviewed May 1, 2026

Comment thread apps/translation-worker/scripts/verify-real-ai.ts Outdated

Comment thread apps/translation-worker/src/index.ts Outdated

coderabbitai Bot reviewed May 2, 2026

Comment thread apps/translation-worker/src/index.ts

Address translation probe review feedback

668e428

riderx merged commit 2fafb32 into main May 2, 2026
10 checks passed

This was referenced May 2, 2026

[codex] Fix translation worker queue progress #624

Merged

[codex] fix translated internal links #644

Merged

riderx merged 5 commits into main from codex/translation-deploy-real-page-probe
