Validate translation worker on real pages #620

Merged: riderx merged 5 commits into main from codex/translation-deploy-real-page-probe on May 2, 2026

@riderx
Member

@riderx riderx commented May 1, 2026

Summary

  • fix translation deploy queue setup to accept Wrangler's "already taken" message when the queue already exists
  • extend the real Wrangler remote probe to translate actual capgo.app pages, not only synthetic strings
  • validate Spanish real-page batches for / and /docs/ through Workers AI JSON mode before deploy

Verification

  • bun run ci:verify:translation
  • cd apps/translation-worker && bunx wrangler deploy --dry-run
  • cd apps/translation-worker && bunx wrangler deploy --dry-run -c wrangler.real-test.jsonc
  • bun run verify:real-translation
  • local deploy queue setup check against existing capgo-translation-refresh queue

Summary by CodeRabbit

  • New Features

    • Real-page translation probe endpoint that runs per-page translation checks across multiple pages and requires stricter validation before marking runs as passed.
  • Improvements

    • Tracks and validates translations specifically for body content; reduced JSON batch size and bumped cache version for finer-grained batching. Increased probe timeouts for reliability.
  • Bug Fixes

    • Deployment workflow now treats more “already exists/already taken” queue messages as non-fatal.
  • Tests

    • Added parser verification script and expanded probe/test helpers for more thorough validation.

@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: deca8565-ddaa-4282-85cd-1a47ca292f89

📥 Commits

Reviewing files that changed from the base of the PR and between 5a26254 and 668e428.

📒 Files selected for processing (3)
  • apps/translation-worker/scripts/verify-parser.ts
  • apps/translation-worker/scripts/verify-real-ai.ts
  • apps/translation-worker/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/translation-worker/scripts/verify-parser.ts

Walkthrough

Adds a real-page translation probe and runtime verifier that validate translated body segments across multiple pages, refactors HTML segment extraction to mark body segments and limit JSON-mode batch size, adds a parser verification script and package check step, and relaxes the deploy workflow’s queue-creation idempotency grep.

Changes

Translation Worker Real-Page Probing

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data Shape / Probe Payload | apps/translation-worker/scripts/verify-real-ai.ts | ProbePayload gains optional page with path, locale, numeric counts (segmentCount, bodySegmentCount, batchCount, translatedBatchCount, translatedSegmentCount, changedCount) and validation inputs (bodyChecks, samples). |
| Probe Timing & Config | apps/translation-worker/scripts/verify-real-ai.ts | Increased TIMEOUT_MS, relaxed REQUEST_TIMEOUT_MS, and added REAL_PAGE_PROBES list for multiple page probes. |
| Probe Fetch & Validation | apps/translation-worker/scripts/verify-real-ai.ts | Introduced fetchJsonProbe; split runtime validation into fetchRuntimeProbe and fetchRealPageProbe (enforces payload.ok, expected model, page presence, exact page.path, page.locale === 'es', minimum counts, non-empty bodyChecks/samples). |
| Probe Orchestration | apps/translation-worker/scripts/verify-real-ai.ts | Probe flow now calls runtime probe first, then sequentially validates each REAL_PAGE_PROBES real-page URL; run passes only if all page probes succeed. |
| Segment Model & Batching Constants | apps/translation-worker/src/index.ts | Segment adds inBody: boolean; TRANSLATION_CACHE_VERSION bumped; MAX_BATCH_ITEMS reduced from 32 to 12. |
| HTML Parsing / Segment Collection | apps/translation-worker/src/index.ts | collectSegments, appendTag, and addSegment track insideBody and propagate inBody; added RAW_TEXT_SKIP_TAGS and findNamedTag for case-insensitive tag handling. |
| Body-scoped Validation | apps/translation-worker/src/index.ts | Added bodyTranslationStats and assertTranslatedBody to compute candidate/changed counts using only inBody segments; finalization calls assertTranslatedBody(...). |
| Real-Page Probe Implementation | apps/translation-worker/src/index.ts | Added /__translation-test__/real-page handling via probeRealPageTranslation(...): fetches origin HTML, extracts segments (marking body), builds/clamps batches, translates selected batches with JSON-mode, computes normalized changedCount, and returns page metadata plus samples. |
| Test Routing & Exports | apps/translation-worker/src/index.ts | handleTranslationTestRequest routes .../real-page; __translationWorkerTest export expanded to include bodyTranslationStats, buildBatches, collectSegments, and renderTranslatedHtml. |
| Parser Verification Script | apps/translation-worker/scripts/verify-parser.ts | New script uses __translationWorkerTest with an embedded HTML fixture to assert body-only segment collection, compute body translation stats, render translated HTML, and verify skipped-script content remains unchanged. |
| Package Scripts | apps/translation-worker/package.json | check now runs tsc --noEmit && bun run test:parser; added test:parser = bun run scripts/verify-parser.ts. |
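
The real-page validation rules summarized for fetchRealPageProbe can be sketched as a pure function. This is a minimal illustration, not the code in verify-real-ai.ts: the top-level `ok` and `model` field names, the `Expectation` type, the `minBodySegments` threshold, and the `validatePagePayload` helper are assumptions. Only the `page` field names come from the summary.

```typescript
// Hypothetical payload shape. The page fields follow the walkthrough;
// `ok` and `model` are assumed names for the top-level fields.
interface PageProbe {
  path: string
  locale: string
  segmentCount: number
  bodySegmentCount: number
  batchCount: number
  translatedBatchCount: number
  translatedSegmentCount: number
  changedCount: number
  bodyChecks: unknown[]
  samples: unknown[]
}

interface ProbePayload {
  ok: boolean
  model?: string
  page?: PageProbe
}

interface Expectation {
  path: string
  model: string
  minBodySegments: number // assumed threshold, not taken from the PR
}

// Returns every rule the payload violates; an empty array means the probe passed.
function validatePagePayload(payload: ProbePayload, expected: Expectation): string[] {
  const errors: string[] = []
  if (!payload.ok) errors.push('payload.ok is false')
  if (payload.model !== expected.model) errors.push(`unexpected model: ${payload.model}`)
  const page = payload.page
  if (!page) return [...errors, 'page is missing']
  if (page.path !== expected.path) errors.push(`unexpected path: ${page.path}`)
  if (page.locale !== 'es') errors.push(`unexpected locale: ${page.locale}`)
  if (page.bodySegmentCount < expected.minBodySegments) errors.push('too few body segments')
  if (page.translatedBatchCount < 1) errors.push('no batch was translated')
  if (page.changedCount < 1) errors.push('no segment changed after translation')
  if (page.bodyChecks.length === 0) errors.push('bodyChecks is empty')
  if (page.samples.length === 0) errors.push('samples is empty')
  return errors
}
```

Collecting all violations rather than throwing on the first one makes a failed probe easier to diagnose from a single CI log.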

Deployment Workflow Idempotency

| Layer | File(s) | Summary |
| --- | --- | --- |
| Workflow Step | .github/workflows/deploy-translation.yml | ensure_queue helper’s post-create grep pattern expanded to include already.*${queue}, treating additional “already …” output variants as non-fatal when the target queue already exists. |
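
The idempotency check amounts to pattern-matching the command output. The workflow does this with grep in shell; the sketch below only restates the `already.*${queue}` idea in TypeScript for illustration. The function names are hypothetical, case-insensitive matching is an assumption, and no real Wrangler output is reproduced here.

```typescript
// Escape regex metacharacters so the queue name is matched literally.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// True when some line of the output mentions "already" followed by the
// queue name. `.` does not cross newlines, which matches grep's line-based scan.
function looksLikeExistingQueue(output: string, queue: string): boolean {
  const pattern = new RegExp(`already.*${escapeRegExp(queue)}`, 'i')
  return pattern.test(output)
}
```

Escaping the queue name matters because names may contain characters such as `.` that a regex would otherwise treat as wildcards.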

Sequence Diagram

sequenceDiagram
    participant CI as CI Workflow
    participant Verify as verify-real-ai.ts
    participant Worker as Translation Worker
    participant Origin as Origin Server

    CI->>Verify: start verify-real-ai
    Verify->>Worker: GET /__translation-test__/real-runtime
    Worker-->>Verify: runtime probe payload
    Verify->>Verify: validate runtime probe

    loop for each REAL_PAGE_PROBE
        Verify->>Worker: GET /__translation-test__/real-page?path=...&locale=es&batches=...
        Worker->>Origin: fetch origin HTML for path
        Origin-->>Worker: return HTML
        Worker->>Worker: extract segments (mark inBody)
        Worker->>Worker: build batches (max 12 items)
        Worker->>Worker: translate selected batches via JSON-mode
        Worker->>Worker: compute diffs, counts, samples
        Worker-->>Verify: page probe payload (page metadata + samples)
        Verify->>Verify: validate page payload (path, locale, counts, samples)
    end

    Verify->>CI: report success/failure
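
Each iteration of the loop in the diagram issues a GET carrying `path`, `locale`, and `batches` query parameters. A minimal sketch of building such a URL with the standard URL API follows. The worker origin, the `RealPageProbe` shape, the numeric type of `batches`, and the `buildRealPageProbeUrl` helper are hypothetical; the worker may accept further parameters not shown here.

```typescript
// Hypothetical probe descriptor; the real REAL_PAGE_PROBES entries may differ.
interface RealPageProbe {
  path: string
  batches: number
}

// Builds the real-page probe URL; URLSearchParams handles percent-encoding.
function buildRealPageProbeUrl(workerOrigin: string, probe: RealPageProbe, locale = 'es'): string {
  const url = new URL('/__translation-test__/real-page', workerOrigin)
  url.searchParams.set('path', probe.path)
  url.searchParams.set('locale', locale)
  url.searchParams.set('batches', String(probe.batches))
  return url.toString()
}
```

Going through `searchParams.set` rather than string concatenation keeps paths such as `/docs/` correctly encoded inside the query string.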

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I hop through pages, marking what’s in the body,

batch a few lines, translate them a little shoddy.
I fetch the site, sample, count, and test,
keep skipped code safe and let queues rest.
Happy hops — small checks that serve the nest.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Validate translation worker on real pages' directly matches the main objective: extending the real Wrangler remote probe to translate actual capgo.app pages and validate Spanish real-page batches. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Review rate limit: 3/5 reviews remaining, refill in 23 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@riderx riderx marked this pull request as ready for review May 1, 2026 23:11
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
apps/translation-worker/src/index.ts (1)

1987-1987: 💤 Low value

Property sourceBytes reports character count, not byte count.

sourceHtml.length returns the number of UTF-16 code units, not bytes. For any non-ASCII character the UTF-8 byte count is larger (for example, "é" is one code unit but two bytes). Since this is informational metadata only, consider renaming to sourceChars or using new TextEncoder().encode(sourceHtml).length for actual bytes.
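
The difference between the two measurements is easy to demonstrate. The sketch below is illustrative only; `measureHtml` and its field names are hypothetical and do not appear in the worker.

```typescript
// Reports both measurements so neither has to be mislabeled.
function measureHtml(sourceHtml: string): { codeUnits: number; utf8Bytes: number } {
  return {
    codeUnits: sourceHtml.length, // UTF-16 code units, what String.length counts
    utf8Bytes: new TextEncoder().encode(sourceHtml).length, // encoded UTF-8 byte length
  }
}
```

For pure ASCII input the two numbers agree, which is why the mismatch tends to go unnoticed until translated, accented, or emoji-bearing pages are measured.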

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` at line 1987, The property sourceBytes
is currently assigned using sourceHtml.length (UTF-16 code units), which reports
character count not actual byte length; update the assignment in the object that
sets sourceBytes to either rename the field to sourceChars (if only
informational) or compute actual byte length via new
TextEncoder().encode(sourceHtml).length (if true byte count is needed); locate
the assignment to sourceBytes and replace sourceHtml.length accordingly or
rename the property consistently wherever used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d8c3933-f851-433f-9449-ae93518442b5

📥 Commits

Reviewing files that changed from the base of the PR and between d9e998b and 160582e.

📒 Files selected for processing (3)
  • .github/workflows/deploy-translation.yml
  • apps/translation-worker/scripts/verify-real-ai.ts
  • apps/translation-worker/src/index.ts

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/scripts/verify-real-ai.ts`:
- Around line 36-39: The REAL_PAGE_PROBES array in verify-real-ai.ts pins checks
to exact marketing/docs copy (REAL_PAGE_PROBES), causing flakiness; replace
these hard-coded strings with stable structural assertions or selector-based
checks (e.g., verify presence of main landmark, nav items, or headings) or
derive the expected snippets at runtime from the fetched page before asserting.
Update the verification logic that consumes REAL_PAGE_PROBES to use CSS
selectors/ARIA landmarks or dynamic extraction instead of exact text matches so
copy edits won't break the gate.

In `@apps/translation-worker/src/index.ts`:
- Around line 2025-2070: The probe currently uses findBatchText(batches, ...)
which scans every batch entry (including title/meta/attrs), letting a non-body
occurrence satisfy a requiredChecks entry; update the logic so requiredChecks
are matched only against body segments: either (preferred) filter segments to
body-only before calling buildBatches (use collectSegments()'s segment.type or
equivalent) so batches contains only body text, or change findBatchText to
accept segment metadata and verify the matched source came from a segment
flagged as body (reference functions/symbols: collectSegments, segments,
buildBatches, batches, findBatchText, probeRealPageTranslation, requiredChecks).
Ensure selectedBatchIndexes and checkSources still use the body-limited matches.
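
The preferred option in the second comment, filtering to body segments before batching, might look like the sketch below. It is an illustration under stated assumptions rather than the worker's code: the reduced `Segment` shape, the `'attr'` mode value, and the `buildBodyBatches` and `findBodyCheck` helpers are hypothetical, and the worker's own buildBatches may apply limits beyond an item count.

```typescript
// Hypothetical, reduced segment shape; the worker's real Segment has more fields.
interface Segment {
  text: string
  inBody: boolean
  mode: 'text' | 'attr'
}

const MAX_BATCH_ITEMS = 12

// Keep only visible body text, then chunk it into batches of at most maxItems.
function buildBodyBatches(segments: Segment[], maxItems = MAX_BATCH_ITEMS): string[][] {
  const bodyTexts = segments
    .filter((segment) => segment.inBody && segment.mode === 'text')
    .map((segment) => segment.text)
  const batches: string[][] = []
  for (let start = 0; start < bodyTexts.length; start += maxItems) {
    batches.push(bodyTexts.slice(start, start + maxItems))
  }
  return batches
}

// Locate a required check phrase; every candidate is body text by construction.
function findBodyCheck(
  batches: string[][],
  expectedText: string,
): { batchIndex: number; textIndex: number } | null {
  for (const [batchIndex, batch] of batches.entries()) {
    const textIndex = batch.findIndex((source) => source.includes(expectedText))
    if (textIndex !== -1) return { batchIndex, textIndex }
  }
  return null
}
```

Filtering before batching keeps the lookup simple, because no segment metadata has to be carried alongside each batch entry.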

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 231d3e3e-cfb2-423b-adcb-826d20edfe1d

📥 Commits

Reviewing files that changed from the base of the PR and between 82885d3 and ac984a1.

📒 Files selected for processing (4)
  • apps/translation-worker/package.json
  • apps/translation-worker/scripts/verify-parser.ts
  • apps/translation-worker/scripts/verify-real-ai.ts
  • apps/translation-worker/src/index.ts

Comment thread apps/translation-worker/scripts/verify-real-ai.ts Outdated
Comment thread apps/translation-worker/src/index.ts Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
apps/translation-worker/src/index.ts (1)

2009-2055: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep required probe checks scoped to visible body text.

findBatchText() still searches every batch entry, so a phrase duplicated in <title>, meta content, or an attribute can satisfy a check= without translating the body text this probe is supposed to validate. Restrict the match to segment.inBody && segment.mode === 'text', or build the probe batches from that filtered segment set before selecting batches.

Suggested fix
-function findBatchText(batches: string[][], expectedText: string): { batchIndex: number; textIndex: number; source: string } | null {
+function findBatchText(
+  segments: Segment[],
+  batches: string[][],
+  expectedText: string,
+): { batchIndex: number; textIndex: number; source: string } | null {
+  let segmentIndex = 0
   for (let batchIndex = 0; batchIndex < batches.length; batchIndex += 1) {
     const batch = batches[batchIndex]
-    for (let textIndex = 0; textIndex < batch.length; textIndex += 1) {
+    for (let textIndex = 0; textIndex < batch.length; textIndex += 1, segmentIndex += 1) {
       const source = batch[textIndex]
-      if (source.includes(expectedText)) return { batchIndex, textIndex, source }
+      const segment = segments[segmentIndex]
+      if (segment?.inBody && segment.mode === 'text' && source.includes(expectedText)) {
+        return { batchIndex, textIndex, source }
+      }
     }
   }
   return null
 }
-    const found = findBatchText(batches, check)
+    const found = findBatchText(segments, batches, check)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` around lines 2009 - 2055, The probe is
matching required checks against any segment (including title/meta/attributes)
because findBatchText scans all batches built from collectSegments; restrict
checks to visible body text by either (A) filtering segments before calling
buildBatches (use only segments where segment.inBody && segment.mode === 'text'
when creating batches in probeRealPageTranslation) or (B) change findBatchText
to accept and check segment metadata and only match when segment.inBody &&
segment.mode === 'text'; update calls in probeRealPageTranslation (and
selectedBatchIndexes/checkSources logic) to use the filtered batches so
requiredChecks only validate actual body text translations (reference symbols:
findBatchText, probeRealPageTranslation, collectSegments, buildBatches,
segments, segment.inBody, segment.mode).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 727-729: The current findClosingTag function returns the first
occurrence of `</{tagName}` which breaks on nested same-name elements; update it
to perform a depth-aware scan: starting from `startIndex`, iterate through the
HTML searching for both `<{tagName}` (opening) and `</{tagName}` (closing),
incrementing a depth counter on each opening and decrementing on each closing,
and only return when depth reaches zero (returning the closing match index, end,
and tag string). Apply the same depth-aware logic to the other similar helper
used around the 739-750 region (the corresponding call/path that currently
relies on a single-match `findNamedTag`) so nested `<svg>`, `<code>`, etc., are
skipped correctly. Ensure you reuse or extend `findNamedTag` behavior or create
a new helper (referencing `findClosingTag` and `findNamedTag`) to avoid
duplicating scanning logic.
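
A depth-aware scan of the kind described above could be sketched as follows. This is not the worker's findClosingTag: the `TagMatch` shape and the `findClosingTagDepthAware` name are hypothetical, and the regex assumes attribute values contain no `>` character, which a production parser would have to handle.

```typescript
interface TagMatch {
  index: number // position of the '<' that starts the matching closing tag
  end: number // position just past the closing '>'
  tag: string // raw closing tag text
}

// Scans from startIndex (just after the opening tag) and returns the closing
// tag that balances it, counting nested same-name elements along the way.
function findClosingTagDepthAware(html: string, tagName: string, startIndex: number): TagMatch | null {
  const name = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  // Matches opening or closing tags with this exact name, case-insensitively.
  const pattern = new RegExp(`<(/?)${name}(?=[\\s/>])[^>]*>`, 'gi')
  pattern.lastIndex = startIndex
  let depth = 1
  let match: RegExpExecArray | null
  while ((match = pattern.exec(html)) !== null) {
    const isClosing = match[1] === '/'
    if (isClosing) {
      depth -= 1
      if (depth === 0) {
        return { index: match.index, end: match.index + match[0].length, tag: match[0] }
      }
    } else if (!match[0].endsWith('/>')) {
      // Self-closing tags such as <svg/> do not open a new nesting level.
      depth += 1
    }
  }
  return null // unbalanced input: no matching closing tag
}
```

Depth counting is only meaningful for elements that can nest, such as `<svg>` or `<code>`. Raw-text elements such as `<script>` end at their first closing tag, so a first-match search remains correct for them.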

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7dbb06e-ad8e-40f2-80e9-f03ce15ee5e8

📥 Commits

Reviewing files that changed from the base of the PR and between ac984a1 and 5a26254.

📒 Files selected for processing (1)
  • apps/translation-worker/src/index.ts

Comment thread apps/translation-worker/src/index.ts
@sonarqubecloud

sonarqubecloud Bot commented May 2, 2026

@riderx riderx merged commit 2fafb32 into main May 2, 2026
10 checks passed