fix(translate): preserve fenced code blocks across translation by rainxchzed · Pull Request #642 · OpenHub-Store/GitHub-Store

rainxchzed · 2026-05-19T05:44:05Z

Summary

Stop translating content inside ``` fenced code blocks. Translators rewrite identifiers, mangle string literals, and break example commands.

How

Before chunking + sending to the translator, extractCodeFences walks the body and replaces every ```…``` block with a marker ⟦CF_<n>_END⟧. Markers chosen empirically — Unicode math brackets + ALL_CAPS underscore token survive Google + Youdao verbatim across the 33 supported targets.
After all chunks return, restoreCodeFences splices the original fence bodies back at each marker. Tolerates whitespace padding (Youdao occasionally inserts NBSP).
Cache key still uses the original text, so cache hit/miss behavior is unchanged.
Unfenced inline code (single backticks) is intentionally untouched — short snippets in prose translate fine.
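
A minimal sketch of the extract/restore round trip described above. The function names come from this description; the bodies, the `Extracted` holder type, and the `FENCE` constant are illustrative assumptions, not the code in TranslationRepositoryImpl.kt.

```kotlin
// Hypothetical sketch: names mirror the PR description, bodies are assumed.
data class Extracted(val masked: String, val fences: List<String>)

// Matches an opening ``` through the next closing ``` (non-greedy, spans newlines).
val FENCE = Regex("```[\\s\\S]*?```")

fun extractCodeFences(body: String): Extracted {
    val fences = mutableListOf<String>()
    val masked = FENCE.replace(body) { match ->
        fences += match.value
        "⟦CF_${fences.size - 1}_END⟧" // index of the fence just stored
    }
    return Extracted(masked, fences)
}

fun restoreCodeFences(translated: String, fences: List<String>): String {
    var result = translated
    fences.forEachIndexed { i, original ->
        // Tolerate whitespace, including NBSP (U+00A0), inside the marker.
        val ws = "[\\s\\u00A0]*"
        val pattern = Regex("⟦${ws}CF_${ws}${i}${ws}_END${ws}⟧")
        // escapeReplacement keeps '$' and '\' in the fence body literal.
        result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
    }
    return result
}

fun main() {
    val body = "Run this:\n```sh\necho \$HOME\n```\nDone."
    val (masked, fences) = extractCodeFences(body)
    println(masked)
    // Simulate a translator that rewrites prose and pads the marker with NBSP.
    val translated = masked.replace("Run this:", "运行:").replace("⟦CF_0", "⟦\u00A0CF_0")
    println(restoreCodeFences(translated, fences))
}
```

The fence body here contains `$HOME`, which is why the replacement goes through `Regex.escapeReplacement`: without it, `$` would be read as a group reference in the replacement string.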

Test plan

Open a repo with a code-heavy README (e.g. kotlinx.coroutines), enable auto-translate to Chinese, expand About. Code fences stay verbatim; prose around them is translated.
Same on a What's New section that pastes shell commands in fences — commands intact.
Compile both targets — ✓ verified.

Summary by CodeRabbit

Bug Fixes
- Translation now masks machine-readable spans (fenced code, HTML fragments, and URLs) before chunking to prevent accidental modification.
- Chunked translation operates on masked text, then restores the original spans in the final output.
- Ensures large texts and multi-chunk workflows preserve protected content integrity.

coderabbitai · 2026-05-19T05:44:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5ab58e8b-86c0-41a9-99cb-fccb41c1501c

📥 Commits

Reviewing files that changed from the base of the PR and between 56e4cd9 and f2c539b.

📒 Files selected for processing (1)

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

🚧 Files skipped from review as they are similar to previous changes (1)

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

Walkthrough

The PR masks machine-readable spans (fenced code, HTML, markdown/inline/bare URLs, GFM alerts) with opaque markers before chunking and translation, translates the masked chunks, then restores the original spans into the combined translated text.

Changes

Span protection for translation

Layer / File(s)	Summary
Protection utilities `feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`	Adds `TranslationProtection`, `protectFromTranslation`, `replaceWithMarker`, and `restoreProtectedSpans` to extract and mask fenced code blocks, HTML elements/bodies and self-closing tags, markdown link/image URL tails, bare URLs, and specific GFM alert markers; returns masked text plus the original spans.
Integration with translate flow `feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`	`translate` now derives chunking input from the masked text produced by `protectFromTranslation`, translates chunks, joins translated chunks, and then calls `restoreProtectedSpans` with the captured spans to reinsert original fenced/HTML/URL content into `TranslationResult.translatedText`.

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I hid the code and HTML tight,
Masked them safe before the flight,
Chunks translate while markers stay,
Then I stitch the originals back today,
A tidy hop — restored just right. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately describes the main change: preserving fenced code blocks during translation, which is the primary focus of the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

greptile-apps · 2026-05-19T05:46:41Z

Greptile Summary

This PR expands the translation pipeline in TranslationRepositoryImpl to protect machine-readable spans — fenced code blocks, HTML tags, markdown/bare URLs, and GFM alert markers — from being mangled by the translator. Each span is replaced with a unique ⟦TR_N_END⟧ marker before chunking, and restored from a spans list after all translated chunks are joined.

protectFromTranslation runs six sequential regex passes (code fences → self-closing HTML → paired HTML → markdown URLs → bare URLs → GFM alerts) and accumulates the original spans in an indexed list; restoreProtectedSpans iterates that list in order and uses replaceFirst with whitespace-tolerant patterns to splice them back.
The existing chunkText / chunkLargeParagraph machinery now operates on the masked text; the cache key is unchanged so hit/miss behaviour is unaffected.

Confidence Score: 3/5

The masking/restoration round-trip is correct for the common case, but the character-level line splitter can silently bisect a marker, leaving raw token fragments in the output instead of the original code block.

The character-offset split in chunkLargeParagraph predates the new marker scheme. Now that masked text can contain ⟦TR_N_END⟧ tokens, the split point can fall mid-token. Because restoreProtectedSpans silently no-ops on unmatched markers, this failure mode produces wrong output with no indication anything went wrong.

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt — specifically the interaction between chunkLargeParagraph's character-level fallback path and the new span markers.

Important Files Changed

Filename	Overview
feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt	Adds `protectFromTranslation` / `restoreProtectedSpans` to mask code fences, HTML, URLs, and GFM alert markers before chunked translation; the restoration loop can fail silently when `chunkLargeParagraph`'s raw character-offset split bisects a marker.

Sequence Diagram

sequenceDiagram
    participant C as translate()
    participant P as protectFromTranslation()
    participant CH as chunkText()
    participant TR as Translator
    participant R as restoreProtectedSpans()

    C->>P: raw text
    P->>P: mask code fences → ⟦TR_0_END⟧
    P->>P: mask HTML tags → ⟦TR_1_END⟧
    P->>P: mask markdown URLs → ⟦TR_2_END⟧
    P->>P: mask bare URLs → ⟦TR_3_END⟧
    P->>P: mask GFM alerts → ⟦TR_4_END⟧
    P-->>C: maskedText + spans[]

    C->>CH: maskedText
    CH-->>C: chunks[]

    loop each chunk
        C->>TR: chunk (with markers)
        TR-->>C: translated chunk (markers pass-through)
    end

    C->>C: join translated chunks → joined
    C->>R: joined + spans[]
    R->>R: replaceFirst ⟦TR_0_END⟧ → spans[0]
    R->>R: replaceFirst ⟦TR_1_END⟧ → spans[1]
    R->>R: ...
    R-->>C: fully restored translated text
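
The flow in the diagram can be sketched end to end as follows. This is an illustration under stated assumptions: the function names follow the walkthrough, but the bodies, the two-pass `PASSES` list (the PR describes more passes), the paragraph-based `chunkText` stand-in, and the `translator` lambda parameter are all hypothetical.

```kotlin
// Hypothetical sketch: names follow the walkthrough, bodies are assumptions.
data class TranslationProtection(val masked: String, val spans: List<String>)

// Two illustrative passes only. Fences go first so URLs inside a fence
// are already hidden behind a marker when the URL pass runs.
val PASSES = listOf(
    Regex("```[\\s\\S]*?```"),      // fenced code blocks
    Regex("https?://[^\\s⟦⟧)]+"),   // bare URLs
)

fun protectFromTranslation(text: String): TranslationProtection {
    val spans = mutableListOf<String>()
    var masked = text
    for (pass in PASSES) {
        masked = pass.replace(masked) { m ->
            spans += m.value
            "⟦TR_${spans.size - 1}_END⟧"
        }
    }
    return TranslationProtection(masked, spans)
}

fun restoreProtectedSpans(translated: String, spans: List<String>): String {
    var result = translated
    val ws = "[\\s\\u00A0]*" // whitespace or NBSP padding inside the marker
    spans.forEachIndexed { i, original ->
        val pattern = Regex("⟦${ws}TR_${ws}${i}${ws}_END${ws}⟧")
        result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
    }
    return result
}

// Stand-in for the real chunker: one chunk per blank-line-separated paragraph.
fun chunkText(masked: String): List<String> = masked.split("\n\n")

fun translate(text: String, translator: (String) -> String): String {
    val (masked, spans) = protectFromTranslation(text)
    val joined = chunkText(masked).joinToString("\n\n") { translator(it) }
    return restoreProtectedSpans(joined, spans)
}

fun main() {
    val readme = "Install from https://example.com/app now.\n\n" +
        "```sh\ncurl https://example.com/install.sh\n```"
    // Fake translator: rewrites prose only, passes markers through untouched.
    val fake: (String) -> String = { it.replace("Install from", "从此处安装") }
    println(translate(readme, fake))
}
```

Span indices are assigned in the order spans are captured, not in the order they appear in the text, so the URL in the first paragraph ends up as span 1 and the fence as span 0.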

Comments Outside Diff (1)

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt, line 263-268 (link)

Marker bisection at character-split boundary

chunkLargeParagraph slices a line at raw character offsets when line.length > maxChunkSize. A ⟦TR_N_END⟧ marker is at least 10 characters wide (14 bytes in UTF-8); if one straddles the split point (e.g. a long prose line whose marker begins four characters before the cut), the first chunk ends with ⟦TR_ and the second begins with N_END⟧. Both halves get independently translated, the translator garbles the partial token, and restoreProtectedSpans can no longer match the full marker pattern, leaving a raw fragment like ⟦TR_ in the final output instead of the original code block.

A safe fix is to find the nearest marker boundary before the cut-point, or to guarantee no marker can appear in a line that would trigger the character split (e.g. by only masking content that is already on its own line).
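
The first option (pull the cut back to the nearest marker boundary) could look roughly like this. It is a sketch only: `safeCut` and `splitMarkerSafe` are hypothetical helpers, and the real chunkLargeParagraph body is not shown in this thread.

```kotlin
// Hypothetical helpers: keep a character-offset split from bisecting a marker.
val MARKER = Regex("⟦TR_\\d+_END⟧")

// If `cut` falls strictly inside a marker, move it back to that marker's start.
fun safeCut(text: String, cut: Int): Int {
    val hit = MARKER.findAll(text)
        .firstOrNull { cut > it.range.first && cut <= it.range.last }
    return hit?.range?.first ?: cut
}

fun splitMarkerSafe(line: String, maxChunkSize: Int): List<String> {
    val chunks = mutableListOf<String>()
    var start = 0
    while (line.length - start > maxChunkSize) {
        var cut = safeCut(line, start + maxChunkSize)
        if (cut <= start) {
            // The marker at `start` is longer than the budget:
            // emit it whole in one oversized chunk rather than split it.
            cut = MARKER.find(line, start)!!.range.last + 1
        }
        chunks += line.substring(start, cut)
        start = cut
    }
    if (start < line.length) chunks += line.substring(start)
    return chunks
}

fun main() {
    val line = "a".repeat(8) + "⟦TR_0_END⟧" + "b".repeat(8)
    // A naive cut at offset 12 would land inside the marker (offsets 8..17).
    println(splitMarkerSafe(line, 12))
}
```

Scanning all markers on every cut is quadratic in the worst case; that is acceptable for a sketch but a real implementation would likely walk the matches once.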

Reviews (3): Last reviewed commit: "fix(translate): preserve GFM alert marke..."

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`:
- Around line 121-129: The restoration regex in restoreCodeFences uses \s* which
doesn't match NBSP on the JVM; update the Regex to explicitly include NBSP (or
other space separators) in each whitespace slot used around the marker so
NBSP-padded markers are matched. For example, replace occurrences of "\s*"
inside the pattern with a character-class that includes both regular whitespace
and \u00A0 (e.g. "[\\s\\u00A0]*") so the Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
becomes a pattern that tolerates NBSP padding before calling replaceFirst and
leaving the existing use of Regex.escapeReplacement(original) intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 063ab584-aeec-4af0-96e6-6e94679db4ce

📥 Commits

Reviewing files that changed from the base of the PR and between 62642ef and 5c36395.

📒 Files selected for processing (1)

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

coderabbitai · 2026-05-19T05:47:30Z

+    private fun restoreCodeFences(translated: String, fences: List<String>): String {
+        if (fences.isEmpty()) return translated
+        var result = translated
+        fences.forEachIndexed { i, original ->
+            // Tolerate translator inserting whitespace around the marker
+            // (Youdao occasionally pads with NBSP). Falls back to leaving
+            // any unmatched marker in place rather than corrupting prose.
+            val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
+            result = pattern.replaceFirst(result, Regex.escapeReplacement(original))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In Kotlin/JVM regular expressions, does \s include U+00A0 (NO-BREAK SPACE), and what is the recommended pattern to match both normal whitespace and NBSP?

💡 Result:

In Kotlin/JVM, the regular expression shorthand \s does not include the non-breaking space (U+00A0) by default [1][2][3]. The \s character class is defined as [ \t\n\x0B\f\r] unless the UNICODE_CHARACTER_CLASS flag is enabled [4][5][6]. To match both standard whitespace and the non-breaking space (NBSP), there are three recommended approaches:

1. Explicitly include the character: the character class [\s\u00A0] matches standard whitespace and the NBSP explicitly [1][2].
2. Use horizontal whitespace (\h): since Java 8, the \h shorthand matches horizontal whitespace, which explicitly includes U+00A0 (along with other characters like the thin space and ideographic space) [4][5][6]. This is often the cleanest solution if you specifically want to target horizontal spacing [7].
3. Use Unicode character classes: \p{Z} (any Unicode separator) or \p{Zs} (space separator) match a broader range of Unicode whitespace characters, including U+00A0 [1][2].

Example usage in Kotlin (backslashes are doubled because these are regular string literals):

// Using explicit inclusion
val explicit = Regex("[\\s\\u00A0]+")
// Using horizontal whitespace shorthand
val horizontal = Regex("\\h+")

Note that while \p{javaWhitespace} is sometimes suggested, it does not include U+00A0, so it is generally not the correct choice for this specific requirement [8][2].

Citations:

1: https://stackoverflow.com/questions/1822772/java-regular-expression-to-match-all-whitespace-characters/45243501

2: https://stackoverflow.com/questions/1822772/java-regular-expression-to-match-all-whitespace-characters

3: https://thelinuxcode.com/use-regex-whitespace-in-java/

4: https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/regex/Pattern.html

5: https://docs.oracle.com/en/java/javase/26/docs/api/java.base/java/util/regex/Pattern.html

6: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

7: https://slack-chats.kotlinlang.org/t/12646580/how-can-i-make-this-regex-work-on-kotlin-js-the-same-way-it-

8: https://stackoverflow.com/questions/56235645/how-to-split-on-unicode-whitespace-in-kotlin

🏁 Script executed:

# Find and read the file to verify the exact content at line 128
fd -t f "TranslationRepositoryImpl.kt" feature/details/

Repository: OpenHub-Store/GitHub-Store

Length of output: 178

🏁 Script executed:

# Read the file around lines 121-129 to verify the pattern
sed -n '121,129p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | cat -n

Repository: OpenHub-Store/GitHub-Store

Length of output: 720

🏁 Script executed:

# Check the entire restoreCodeFences function and look for other similar patterns
sed -n '121,137p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | cat -n

Repository: OpenHub-Store/GitHub-Store

Length of output: 927

🏁 Script executed:

# Check if there are other similar Regex patterns with \s in the same file
rg "Regex\(" feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt -A 1

Repository: OpenHub-Store/GitHub-Store

Length of output: 346

\s on Line 128 won't reliably match NBSP, so marker restoration can fail.

Line 128 uses \s* but the comment explicitly states NBSP padding is tolerated. On JVM regex, NBSP (\u00A0) is not covered by \s, so markers like ⟦CF_0_END⟧ can remain unreplaced in the translated output when Youdao or other translators pad with NBSP.

Suggested fix

- val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
+ val pattern = Regex("⟦[\\s\\u00A0]*CF_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
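
A small check of the difference between the two patterns, written as a sketch. The `markerPattern` helper and the constant names are hypothetical; the two whitespace classes are the ones from the diff above. The `false` result for the plain `\s` pattern is specific to the JVM regex engine without the UNICODE_CHARACTER_CLASS flag.

```kotlin
// Hypothetical helper: builds the marker pattern with a configurable whitespace class.
fun markerPattern(i: Int, ws: String): Regex =
    Regex("⟦${ws}*CF_${ws}*${i}${ws}*_END${ws}*⟧")

const val PLAIN = "\\s"                // class used by the original pattern
const val NBSP_AWARE = "[\\s\\u00A0]"  // class used by the suggested fix

fun main() {
    val spacePadded = "⟦ CF_0_END ⟧"
    val nbspPadded = "⟦\u00A0CF_0_END\u00A0⟧"

    println(markerPattern(0, PLAIN).containsMatchIn(spacePadded))      // true
    println(markerPattern(0, PLAIN).containsMatchIn(nbspPadded))       // false on the JVM
    println(markerPattern(0, NBSP_AWARE).containsMatchIn(nbspPadded))  // true
}
```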

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     private fun restoreCodeFences(translated: String, fences: List<String>): String {
         if (fences.isEmpty()) return translated
         var result = translated
         fences.forEachIndexed { i, original ->
             // Tolerate translator inserting whitespace around the marker
             // (Youdao occasionally pads with NBSP). Falls back to leaving
             // any unmatched marker in place rather than corrupting prose.
-            val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
+            val pattern = Regex("⟦[\\s\\u00A0]*CF_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
             result = pattern.replaceFirst(result, Regex.escapeReplacement(original))

coderabbitai

♻️ Duplicate comments (1)

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt (1)

169-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

\s* won't match NBSP padding from Youdao.

The comment states this tolerates whitespace, but JVM's \s doesn't include \u00A0 (NBSP). Markers padded with NBSP will remain unreplaced.

Suggested fix

-            val pattern = Regex("⟦\\s*TR_\\s*${i}\\s*_END\\s*⟧")
+            val pattern = Regex("⟦[\\s\\u00A0]*TR_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`
around lines 169 - 173, The Regex used in TranslationRepositoryImpl to tolerate
whitespace around the marker (pattern = Regex("⟦\\s*TR_\\s*${i}\\s*_END\\s*⟧"))
doesn't match NBSP (U+00A0), so markers padded by Youdao stay unreplaced; update
the pattern to include NBSP (or the Unicode separator class) wherever \\s* is
used — e.g. replace each "\\s*" in the literal with "[\\s\\u00A0]*" or use
"\\p{Zs}*" so the Regex (used in pattern and subsequently in
pattern.replaceFirst(...)) will match non-breaking spaces as well.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 18084f34-0dfa-4c8a-846c-d06ecc1723db

📥 Commits

Reviewing files that changed from the base of the PR and between 5c36395 and 56e4cd9.

📒 Files selected for processing (1)

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

fix(translate): preserve fenced code blocks across translation

5c36395

greptile-apps Bot reviewed May 19, 2026

Comment thread .../data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt Outdated

Comment thread .../data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt Outdated

fix(translate): also preserve HTML tags, markdown URLs, bare URLs

56e4cd9

coderabbitai Bot reviewed May 19, 2026

rainxchzed mentioned this pull request May 19, 2026

fix(markdown): stop inline badge image overlap (SVG / unknown intrinsic size) #643

Merged

4 tasks

fix(translate): preserve GFM alert markers ([!NOTE] etc.)

f2c539b

rainxchzed merged commit 5f1d63f into main May 19, 2026
1 check passed

rainxchzed deleted the fix/translate-preserve-code-fences branch May 19, 2026 06:49

rainxchzed merged 3 commits into main from fix/translate-preserve-code-fences
