
fix(translate): preserve fenced code blocks across translation#642

Merged
rainxchzed merged 3 commits into main from fix/translate-preserve-code-fences (May 19, 2026)

rainxchzed (Member) commented May 19, 2026

Summary

Stop translating content inside ``` fenced code blocks. Translators rewrite identifiers, mangle string literals, and break example commands.

How

  • Before chunking + sending to the translator, extractCodeFences walks the body and replaces every ```…``` block with a marker ⟦CF_<n>_END⟧. Markers chosen empirically — Unicode math brackets + ALL_CAPS underscore token survive Google + Youdao verbatim across the 33 supported targets.
  • After all chunks return, restoreCodeFences splices the original fence bodies back at each marker. Tolerates whitespace padding (Youdao occasionally inserts NBSP).
  • Cache key still uses the original text, so cache hit/miss behavior is unchanged.
  • Unfenced inline code (single backticks) is intentionally untouched — short snippets in prose translate fine.
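
The extract-and-restore round trip described in the list above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function names come from the description, but the fence regex, the return type, and the NBSP-tolerant restore pattern are assumptions made for illustration.

```kotlin
// Sketch only: names follow the PR description, bodies are assumed.

// Three backticks, built at runtime so this example can sit inside a Markdown fence.
val TICKS = "`".repeat(3)

// Assumed fence pattern: an opening fence, lazily up to the next closing fence.
val FENCE = Regex("${TICKS}[\\s\\S]*?${TICKS}")

// Replaces each fenced block with ⟦CF_<n>_END⟧ and returns the masked text
// together with the original blocks, indexed by <n>.
fun extractCodeFences(body: String): Pair<String, List<String>> {
    val fences = mutableListOf<String>()
    val masked = FENCE.replace(body) { match ->
        fences += match.value
        "⟦CF_${fences.size - 1}_END⟧"
    }
    return masked to fences
}

// Splices the original blocks back in. Whitespace or NBSP inside the marker is tolerated;
// a marker that cannot be found is left in place.
fun restoreCodeFences(translated: String, fences: List<String>): String {
    var result = translated
    val gap = "[\\s\\u00A0]*"
    fences.forEachIndexed { i, original ->
        val pattern = Regex("⟦${gap}CF_${gap}${i}${gap}_END${gap}⟧")
        // escapeReplacement keeps `$` and `\` in code from being read as group references.
        result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
    }
    return result
}

fun main() {
    val readme = "Install:\n${TICKS}sh\nbrew install foo\n${TICKS}\nThen run it."
    val (masked, fences) = extractCodeFences(readme)
    println(masked)
    // Stand-in for a translator that padded the marker with spaces.
    val translated = masked.replace("⟦CF_0_END⟧", "⟦ CF_0_END ⟧")
    println(restoreCodeFences(translated, fences))
}
```

Keeping the originals in an indexed list means the translator only ever sees the short marker, so the fence body cannot be altered regardless of what the service does to the surrounding prose.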

Test plan

  • Open a repo with a code-heavy README (e.g. kotlinx.coroutines), enable auto-translate to Chinese, expand About. Code fences stay verbatim; prose around them is translated.
  • Same on a What's New section that pastes shell commands in fences — commands intact.
  • Compile both targets — ✓ verified.

Summary by CodeRabbit

  • Bug Fixes
    • Translation now masks machine-readable spans (fenced code, HTML fragments, and URLs) before chunking to prevent accidental modification.
    • Chunked translation operates on masked text, then restores the original spans in the final output.
    • Ensures large texts and multi-chunk workflows preserve protected content integrity.

coderabbitai Bot (Contributor) commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5ab58e8b-86c0-41a9-99cb-fccb41c1501c

📥 Commits

Reviewing files that changed from the base of the PR and between 56e4cd9 and f2c539b.

📒 Files selected for processing (1)
  • feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt
🚧 Files skipped from review as they are similar to previous changes (1)
  • feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

Walkthrough

The PR masks machine-readable spans (fenced code, HTML, markdown/inline/bare URLs, GFM alerts) with opaque markers before chunking and translation, translates the masked chunks, then restores the original spans into the combined translated text.

Changes

Span protection for translation

| Layer / File(s) | Summary |
|---|---|
| Protection utilities (feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt) | Adds TranslationProtection, protectFromTranslation, replaceWithMarker, and restoreProtectedSpans to extract and mask fenced code blocks, HTML elements/bodies and self-closing tags, markdown link/image URL tails, bare URLs, and specific GFM alert markers; returns masked text plus the original spans. |
| Integration with translate flow (feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt) | translate now derives chunking input from the masked text produced by protectFromTranslation, translates chunks, joins translated chunks, and then calls restoreProtectedSpans with the captured spans to reinsert original fenced/HTML/URL content into TranslationResult.translatedText. |

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I hid the code and HTML tight,
Masked them safe before the flight,
Chunks translate while markers stay,
Then I stitch the originals back today,
A tidy hop — restored just right. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: preserving fenced code blocks during translation, which is the primary focus of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR expands the translation pipeline in TranslationRepositoryImpl to protect machine-readable spans — fenced code blocks, HTML tags, markdown/bare URLs, and GFM alert markers — from being mangled by the translator. Each span is replaced with a unique ⟦TR_N_END⟧ marker before chunking, and restored from a spans list after all translated chunks are joined.

  • protectFromTranslation runs six sequential regex passes (code fences → self-closing HTML → paired HTML → markdown URLs → bare URLs → GFM alerts) and accumulates original spans in an indexed list; restoreProtectedSpans iterates that list in order and uses replaceFirst with whitespace-tolerant patterns to splice them back.
  • The existing chunkText / chunkLargeParagraph machinery now operates on the masked text; the cache key is unchanged so hit/miss behaviour is unaffected.
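
The sequential masking passes summarized above might be structured as in this sketch. Only the name protectFromTranslation and the ⟦TR_N_END⟧ marker shape come from the review; the SpanMasker helper and the three simplified patterns are hypothetical stand-ins, and the real implementation covers more span types.

```kotlin
// Sketch only: simplified stand-in patterns, not the regexes from the PR.

// Three backticks, built at runtime so this example can sit inside a Markdown fence.
val TICKS = "`".repeat(3)

// Hypothetical helper: one shared span list, so marker numbers stay unique across passes.
class SpanMasker {
    val spans = mutableListOf<String>()

    // Replaces every match of `pattern` with the next marker and records the original text.
    fun mask(text: String, pattern: Regex): String =
        pattern.replace(text) { match ->
            spans += match.value
            "⟦TR_${spans.size - 1}_END⟧"
        }
}

fun protectFromTranslation(text: String): Pair<String, List<String>> {
    val masker = SpanMasker()
    // Order matters: fences go first so URLs and tags inside code are hidden with the fence.
    val passes = listOf(
        Regex("${TICKS}[\\s\\S]*?${TICKS}"), // fenced code
        Regex("<[A-Za-z][^>]*/>"),           // self-closing HTML tags
        Regex("https?://[^\\s)>\\]]+"),      // bare URLs
    )
    val masked = passes.fold(text) { acc, pattern -> masker.mask(acc, pattern) }
    return masked to masker.spans
}

fun main() {
    val text = "See https://example.com/docs and:\n${TICKS}\ncurl https://example.com/api\n${TICKS}\n<br/>"
    val (masked, spans) = protectFromTranslation(text)
    println(masked)
    spans.forEachIndexed { i, span -> println("$i -> ${span.replace("\n", "\\n")}") }
}
```

Because marker numbers are assigned in pass order rather than document order, the markers in the masked text are not necessarily ascending; restoration by index does not depend on their position.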

Confidence Score: 3/5

The masking/restoration round-trip is correct for the common case, but the character-level line splitter can silently bisect a marker, leaving raw token fragments in the output instead of the original code block.

The character-offset split in chunkLargeParagraph predates the new marker scheme. Now that masked text can contain ⟦TR_N_END⟧ tokens, the split point can fall mid-token. Because restoreProtectedSpans silently no-ops on unmatched markers, this failure mode produces wrong output with no indication anything went wrong.

feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt — specifically the interaction between chunkLargeParagraph's character-level fallback path and the new span markers.
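
One way to make the silent failure visible is to have restoration report which markers it could not find. The sketch below is an assumption about how that might be done; RestoreResult and restoreWithReport are hypothetical names and do not appear in the PR.

```kotlin
// Hypothetical variant of the restore step that reports unmatched markers
// instead of skipping them silently.
data class RestoreResult(val text: String, val missing: List<Int>)

fun restoreWithReport(translated: String, spans: List<String>): RestoreResult {
    var result = translated
    val missing = mutableListOf<Int>()
    val gap = "[\\s\\u00A0]*" // tolerate whitespace or NBSP padding inside the marker
    spans.forEachIndexed { i, original ->
        val pattern = Regex("⟦${gap}TR_${gap}${i}${gap}_END${gap}⟧")
        if (pattern.containsMatchIn(result)) {
            result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
        } else {
            // The marker was altered or bisected; remember its index for the caller.
            missing += i
        }
    }
    return RestoreResult(result, missing)
}

fun main() {
    // Marker 1 was bisected and text ended up between its halves.
    val translated = "A ⟦TR_0_END⟧ B ⟦TR_ mid 1_END⟧"
    val outcome = restoreWithReport(translated, listOf("<x>", "<y>"))
    println(outcome.text)
    println("unrestored span indices: ${outcome.missing}")
}
```

A caller could use a non-empty missing list to log the problem or fall back to the untranslated text, rather than showing marker fragments to the user.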

Important Files Changed

| Filename | Overview |
|---|---|
| feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | Adds protectFromTranslation / restoreProtectedSpans to mask code fences, HTML, URLs, and GFM alert markers before chunked translation; the restoration loop can fail silently when chunkLargeParagraph's raw character-offset split bisects a marker. |

Sequence Diagram

sequenceDiagram
    participant C as translate()
    participant P as protectFromTranslation()
    participant CH as chunkText()
    participant TR as Translator
    participant R as restoreProtectedSpans()

    C->>P: raw text
    P->>P: mask code fences → ⟦TR_0_END⟧
    P->>P: mask HTML tags → ⟦TR_1_END⟧
    P->>P: mask markdown URLs → ⟦TR_2_END⟧
    P->>P: mask bare URLs → ⟦TR_3_END⟧
    P->>P: mask GFM alerts → ⟦TR_4_END⟧
    P-->>C: maskedText + spans[]

    C->>CH: maskedText
    CH-->>C: chunks[]

    loop each chunk
        C->>TR: chunk (with markers)
        TR-->>C: translated chunk (markers pass-through)
    end

    C->>C: join translated chunks → joined
    C->>R: joined + spans[]
    R->>R: replaceFirst ⟦TR_0_END⟧ → spans[0]
    R->>R: replaceFirst ⟦TR_1_END⟧ → spans[1]
    R->>R: ...
    R-->>C: fully restored translated text

Comments Outside Diff (1)

  1. feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt, line 263-268 (link)

    P1 Marker bisection at character-split boundary

    chunkLargeParagraph slices a line at raw character offsets when line.length > maxChunkSize. A ⟦TR_N_END⟧ marker is at least 10 characters wide (14 bytes or more in UTF-8); if one straddles the split point (e.g. a 5 000-char prose line whose marker begins at offset 4 994), the first chunk ends with ⟦TR_ and the second begins with N_END⟧. Both halves get independently translated, the translator garbles the partial token, and restoreProtectedSpans can no longer match the full marker pattern, leaving a raw fragment like ⟦TR_ in the final output instead of the original code block.

    A safe fix is to find the nearest marker boundary before the cut-point, or to guarantee no marker can appear in a line that would trigger the character split (e.g. by only masking content that is already on its own line).
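
    The first option, moving the cut to the nearest marker boundary, could be sketched as follows. This is an illustration under assumptions, not the project's chunkLargeParagraph: splitAvoidingMarkers is a hypothetical name, and the marker regex assumes unpadded ⟦TR_N_END⟧ tokens in the masked text.

```kotlin
// Assumed marker shape in the masked (pre-translation) text.
val MARKER = Regex("⟦TR_\\d+_END⟧")

// Splits `line` into pieces of at most maxChunkSize characters, moving a cut
// to a marker boundary whenever it would otherwise land inside a marker.
fun splitAvoidingMarkers(line: String, maxChunkSize: Int): List<String> {
    require(maxChunkSize > 0) { "maxChunkSize must be positive" }
    val markers = MARKER.findAll(line).map { it.range }.toList()
    val pieces = mutableListOf<String>()
    var start = 0
    while (start < line.length) {
        var cut = minOf(start + maxChunkSize, line.length)
        // A cut at index `cut` bisects a marker when first < cut <= last.
        val hit = markers.firstOrNull { cut > it.first && cut <= it.last }
        if (hit != null) {
            // Prefer cutting just before the marker. If the marker opens this piece,
            // keep it whole even though the piece then exceeds maxChunkSize.
            cut = if (hit.first > start) hit.first else hit.last + 1
        }
        pieces += line.substring(start, cut)
        start = cut
    }
    return pieces
}

fun main() {
    val line = "a".repeat(6) + "⟦TR_0_END⟧" + "b".repeat(4)
    // A naive split at offset 10 would cut the marker in two.
    splitAvoidingMarkers(line, 10).forEach(::println)
}
```

    The trade-off is that a piece may be shorter than maxChunkSize, or slightly longer when a marker alone exceeds the limit, in exchange for every marker reaching the translator intact.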

Reviews (3): Last reviewed commit: "fix(translate): preserve GFM alert marke..."

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`:
- Around line 121-129: The restoration regex in restoreCodeFences uses \s* which
doesn't match NBSP on the JVM; update the Regex to explicitly include NBSP (or
other space separators) in each whitespace slot used around the marker so
NBSP-padded markers are matched. For example, replace occurrences of "\s*"
inside the pattern with a character-class that includes both regular whitespace
and \u00A0 (e.g. "[\\s\\u00A0]*") so the Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
becomes a pattern that tolerates NBSP padding before calling replaceFirst and
leaving the existing use of Regex.escapeReplacement(original) intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 063ab584-aeec-4af0-96e6-6e94679db4ce

📥 Commits

Reviewing files that changed from the base of the PR and between 62642ef and 5c36395.

📒 Files selected for processing (1)
  • feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

Comment on lines +121 to +129
private fun restoreCodeFences(translated: String, fences: List<String>): String {
    if (fences.isEmpty()) return translated
    var result = translated
    fences.forEachIndexed { i, original ->
        // Tolerate translator inserting whitespace around the marker
        // (Youdao occasionally pads with NBSP). Falls back to leaving
        // any unmatched marker in place rather than corrupting prose.
        val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
        result = pattern.replaceFirst(result, Regex.escapeReplacement(original))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In Kotlin/JVM regular expressions, does \s include U+00A0 (NO-BREAK SPACE), and what is the recommended pattern to match both normal whitespace and NBSP?

💡 Result:

In Kotlin/JVM, the regular expression shorthand \s does not include the non-breaking space (U+00A0) by default [1][2][3]. The \s character class is defined as [ \t\n\x0B\f\r] unless the UNICODE_CHARACTER_CLASS flag is enabled [4][5][6]. To match both standard whitespace and the non-breaking space (NBSP), there are a few recommended approaches:

1. Explicitly include the character: the character class [\s\u00A0] matches standard whitespace and the NBSP explicitly [1][2].
2. Use horizontal whitespace (\h): since Java 8, the \h shorthand matches horizontal whitespace, which explicitly includes U+00A0 (along with other characters like the thin space and ideographic space) [4][5][6]. This is often the cleanest solution if you specifically want to target horizontal spacing [7].
3. Use Unicode character classes: \p{Z} (any Unicode separator) or \p{Zs} (space separator) matches a broader range of Unicode whitespace characters, including U+00A0 [1][2].

Example usage in Kotlin (backslashes are doubled because these are ordinary string literals):

// Using explicit inclusion
val explicit = Regex("[\\s\\u00A0]+")
// Using horizontal whitespace shorthand
val horizontal = Regex("\\h+")

Note that while \p{javaWhitespace} is sometimes suggested, it does not include U+00A0, so it is generally not the correct choice for this specific requirement [8][2].

Citations:


🏁 Script executed:

# Find and read the file to verify the exact content at line 128
fd -t f "TranslationRepositoryImpl.kt" feature/details/

Repository: OpenHub-Store/GitHub-Store

Length of output: 178


🏁 Script executed:

# Read the file around lines 121-129 to verify the pattern
sed -n '121,129p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | cat -n

Repository: OpenHub-Store/GitHub-Store

Length of output: 720


🏁 Script executed:

# Check the entire restoreCodeFences function and look for other similar patterns
sed -n '121,137p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | cat -n

Repository: OpenHub-Store/GitHub-Store

Length of output: 927


🏁 Script executed:

# Check if there are other similar Regex patterns with \s in the same file
rg "Regex\(" feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt -A 1

Repository: OpenHub-Store/GitHub-Store

Length of output: 346


\s on Line 128 won't reliably match NBSP, so marker restoration can fail.

Line 128 uses \s* but the comment explicitly states NBSP padding is tolerated. On JVM regex, NBSP (\u00A0) is not covered by \s, so markers like ⟦CF_0_END⟧ can remain unreplaced in the translated output when Youdao or other translators pad with NBSP.

Suggested fix
-            val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
+            val pattern = Regex("⟦[\\s\\u00A0]*CF_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
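
The difference between the two patterns can be checked with a small experiment. This is a sketch for a desktop JVM (OpenJDK), where \s is expected not to match U+00A0; regex engines on other Kotlin targets may classify NBSP differently, which is one more reason to spell the character out. The helper names strictMarker and tolerantMarker are hypothetical.

```kotlin
// Pattern as originally written: \s only.
fun strictMarker(i: Int): Regex =
    Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")

// Pattern from the suggested fix: \s plus an explicit NBSP.
fun tolerantMarker(i: Int): Regex {
    val gap = "[\\s\\u00A0]*"
    return Regex("⟦${gap}CF_${gap}${i}${gap}_END${gap}⟧")
}

fun main() {
    val spacePadded = "⟦ CF_0_END ⟧"            // ordinary spaces
    val nbspPadded = "⟦\u00A0CF_0_END\u00A0⟧"  // NBSP on both sides

    println("strict / spaces:   ${strictMarker(0).containsMatchIn(spacePadded)}")
    println("strict / NBSP:     ${strictMarker(0).containsMatchIn(nbspPadded)}")
    println("tolerant / spaces: ${tolerantMarker(0).containsMatchIn(spacePadded)}")
    println("tolerant / NBSP:   ${tolerantMarker(0).containsMatchIn(nbspPadded)}")
}
```

On OpenJDK only the "strict / NBSP" line should print false, which is the case the review comment describes.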
Suggested change
 private fun restoreCodeFences(translated: String, fences: List<String>): String {
     if (fences.isEmpty()) return translated
     var result = translated
     fences.forEachIndexed { i, original ->
         // Tolerate translator inserting whitespace around the marker
         // (Youdao occasionally pads with NBSP). Falls back to leaving
         // any unmatched marker in place rather than corrupting prose.
-        val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
+        val pattern = Regex("⟦[\\s\\u00A0]*CF_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
         result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
coderabbitai Bot (Contributor) left a comment

♻️ Duplicate comments (1)
feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt (1)

169-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

\s* won't match NBSP padding from Youdao.

The comment states this tolerates whitespace, but JVM's \s doesn't include \u00A0 (NBSP). Markers padded with NBSP will remain unreplaced.

Suggested fix
-            val pattern = Regex("⟦\\s*TR_\\s*${i}\\s*_END\\s*⟧")
+            val pattern = Regex("⟦[\\s\\u00A0]*TR_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`
around lines 169 - 173, The Regex used in TranslationRepositoryImpl to tolerate
whitespace around the marker (pattern = Regex("⟦\\s*TR_\\s*${i}\\s*_END\\s*⟧"))
doesn't match NBSP (U+00A0), so markers padded by Youdao stay unreplaced; update
the pattern to include NBSP (or the Unicode separator class) wherever \\s* is
used — e.g. replace each "\\s*" in the literal with "[\\s\\u00A0]*" or use
"\\p{Zs}*" so the Regex (used in pattern and subsequently in
pattern.replaceFirst(...)) will match non-breaking spaces as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 18084f34-0dfa-4c8a-846c-d06ecc1723db

📥 Commits

Reviewing files that changed from the base of the PR and between 5c36395 and 56e4cd9.

📒 Files selected for processing (1)
  • feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt

@rainxchzed rainxchzed merged commit 5f1d63f into main May 19, 2026
1 check passed
@rainxchzed rainxchzed deleted the fix/translate-preserve-code-fences branch May 19, 2026 06:49