fix(translate): preserve fenced code blocks across translation #642
Walkthrough
The PR masks machine-readable spans (fenced code, HTML, markdown/inline/bare URLs, GFM alerts) with opaque markers before chunking and translation, translates the masked chunks, then restores the original spans into the combined translated text.
Changes
Span protection for translation
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Greptile Summary
This PR expands the translation pipeline in
Confidence Score: 3/5
The masking/restoration round-trip is correct for the common case, but the character-level line splitter can silently bisect a marker, leaving raw token fragments in the output instead of the original code block. The character-offset split in feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt, specifically the interaction between
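One way to keep a length-based splitter from cutting through a marker is sketched below. This is a minimal illustration under assumptions: `splitAvoidingMarkers` and `maxLen` are hypothetical names, not the PR's `chunkText`, and the sketch only assumes that markers are delimited by `⟦` and `⟧` as in the PR.

```kotlin
// Hypothetical helper, not the PR's chunkText(): splits `text` into pieces of
// at most `maxLen` characters, moving a cut when it would land inside a ⟦…⟧
// marker so that no marker is ever split across two chunks.
fun splitAvoidingMarkers(text: String, maxLen: Int): List<String> {
    require(maxLen > 0) { "maxLen must be positive" }
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        var end = minOf(start + maxLen, text.length)
        if (end < text.length) {
            // Last opening bracket inside the candidate chunk, if any.
            val open = text.lastIndexOf('⟦', end - 1)
            if (open >= start) {
                val close = text.indexOf('⟧', open)
                // The marker is bisected when its closing bracket lies at or
                // beyond the cut. A stray ⟦ with no ⟧ is treated as plain text.
                if (close >= end) {
                    // Back up to the marker start; if the marker begins the
                    // chunk, emit it whole even though it exceeds maxLen.
                    end = if (open > start) open else close + 1
                }
            }
        }
        chunks += text.substring(start, end)
        start = end
    }
    return chunks
}
```

With an 8-character limit, `splitAvoidingMarkers("aaaa⟦TR_0_END⟧bbbb", 8)` yields three chunks with the marker intact in the middle one. The trade-off is that a chunk holding only a marker may exceed `maxLen`, which seems preferable to emitting marker fragments that can never be restored.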
Sequence Diagram
sequenceDiagram
participant C as translate()
participant P as protectFromTranslation()
participant CH as chunkText()
participant TR as Translator
participant R as restoreProtectedSpans()
C->>P: raw text
P->>P: mask code fences → ⟦TR_0_END⟧
P->>P: mask HTML tags → ⟦TR_1_END⟧
P->>P: mask markdown URLs → ⟦TR_2_END⟧
P->>P: mask bare URLs → ⟦TR_3_END⟧
P->>P: mask GFM alerts → ⟦TR_4_END⟧
P-->>C: maskedText + spans[]
C->>CH: maskedText
CH-->>C: chunks[]
loop each chunk
C->>TR: chunk (with markers)
TR-->>C: translated chunk (markers pass-through)
end
C->>C: join translated chunks → joined
C->>R: joined + spans[]
R->>R: replaceFirst ⟦TR_0_END⟧ → spans[0]
R->>R: replaceFirst ⟦TR_1_END⟧ → spans[1]
R->>R: ...
R-->>C: fully restored translated text
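The flow in the diagram can be sketched roughly as follows. This is an illustration only: `Protected`, `protect`, `restore`, and both regular expressions are assumptions made for the example, not the PR's actual code, and only two of the span kinds (code fences and bare URLs) are masked.

```kotlin
// Illustrative only: names, signatures, and patterns are assumptions.
data class Protected(val masked: String, val spans: List<String>)

// Built with repeat() so the source contains no literal triple backtick.
val tick3 = "`".repeat(3)
val FENCE = Regex("${tick3}[\\s\\S]*?${tick3}")
val BARE_URL = Regex("https?://[^\\s)>\\]]+")

// Replaces each protected span with ⟦TR_<n>_END⟧ and records the original.
fun protect(text: String): Protected {
    val spans = mutableListOf<String>()
    var masked = text
    for (pattern in listOf(FENCE, BARE_URL)) {
        masked = pattern.replace(masked) { match ->
            spans += match.value
            "⟦TR_${spans.size - 1}_END⟧"
        }
    }
    return Protected(masked, spans)
}

// Splices the recorded spans back in place of their markers.
fun restore(translated: String, spans: List<String>): String {
    var result = translated
    spans.forEachIndexed { i, original ->
        // Literal (non-regex) replacement: `$` and `\` inside the original
        // span are inserted as-is rather than read as replacement syntax.
        result = result.replaceFirst("⟦TR_${i}_END⟧", original)
    }
    return result
}
```

Fences are masked before URLs so that a URL inside a code block is captured as part of the fence rather than as a separate span. This sketch matches markers exactly; the tolerance for translator-inserted padding discussed in the review is left out for brevity.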
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt`:
- Around line 121-129: The restoration regex in restoreCodeFences uses \s* which
doesn't match NBSP on the JVM; update the Regex to explicitly include NBSP (or
other space separators) in each whitespace slot used around the marker so
NBSP-padded markers are matched. For example, replace occurrences of "\s*"
inside the pattern with a character-class that includes both regular whitespace
and \u00A0 (e.g. "[\\s\\u00A0]*") so the Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
becomes a pattern that tolerates NBSP padding before calling replaceFirst and
leaving the existing use of Regex.escapeReplacement(original) intact.
private fun restoreCodeFences(translated: String, fences: List<String>): String {
    if (fences.isEmpty()) return translated
    var result = translated
    fences.forEachIndexed { i, original ->
        // Tolerate translator inserting whitespace around the marker
        // (Youdao occasionally pads with NBSP). Falls back to leaving
        // any unmatched marker in place rather than corrupting prose.
        val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
        result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
🧩 Analysis chain
🌐 Web query:
In Kotlin/JVM regular expressions, does \s include U+00A0 (NO-BREAK SPACE), and what is the recommended pattern to match both normal whitespace and NBSP?
💡 Result:
In Kotlin/JVM, the regular expression shorthand \s does not include the non-breaking space (U+00A0) by default [1][2][3]. The \s character class is defined as [ \t\n\x0B\f\r] unless the UNICODE_CHARACTER_CLASS flag is enabled [4][5][6]. To match both standard whitespace and the non-breaking space (NBSP), there are a few recommended approaches:
1. Explicitly include the character: use the character class [\s\u00A0] to match standard whitespace and NBSP explicitly [1][2].
2. Use horizontal whitespace (\h): since Java 8, the \h shorthand matches horizontal whitespace, which explicitly includes U+00A0 (along with other characters like the thin space and ideographic space) [4][5][6]. This is often the cleanest solution if you specifically want to target horizontal spacing [7].
3. Use Unicode character classes: \p{Z} (any Unicode separator) or \p{Zs} (space separator) match a broader range of Unicode whitespace characters, including U+00A0 [1][2].
Example usage in Kotlin (backslashes are doubled because these are ordinary string literals):
    // Using explicit inclusion
    val explicit = Regex("[\\s\\u00A0]+")
    // Using horizontal whitespace shorthand
    val horizontal = Regex("\\h+")
Note that while \p{javaWhitespace} is sometimes suggested, it does not include U+00A0, so it is generally not the correct choice for this specific requirement [8][2].
Citations:
- 1: https://stackoverflow.com/questions/1822772/java-regular-expression-to-match-all-whitespace-characters/45243501
- 2: https://stackoverflow.com/questions/1822772/java-regular-expression-to-match-all-whitespace-characters
- 3: https://thelinuxcode.com/use-regex-whitespace-in-java/
- 4: https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/regex/Pattern.html
- 5: https://docs.oracle.com/en/java/javase/26/docs/api/java.base/java/util/regex/Pattern.html
- 6: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
- 7: https://slack-chats.kotlinlang.org/t/12646580/how-can-i-make-this-regex-work-on-kotlin-js-the-same-way-it-
- 8: https://stackoverflow.com/questions/56235645/how-to-split-on-unicode-whitespace-in-kotlin
🏁 Script executed:
# Find and read the file to verify the exact content at line 128
fd -t f "TranslationRepositoryImpl.kt" feature/details/
Repository: OpenHub-Store/GitHub-Store
Length of output: 178
🏁 Script executed:
# Read the file around lines 121-129 to verify the pattern
sed -n '121,129p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | cat -n
Repository: OpenHub-Store/GitHub-Store
Length of output: 720
🏁 Script executed:
# Check the entire restoreCodeFences function and look for other similar patterns
sed -n '121,137p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt | cat -n
Repository: OpenHub-Store/GitHub-Store
Length of output: 927
🏁 Script executed:
# Check if there are other similar Regex patterns with \s in the same file
rg "Regex\(" feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt -A 1
Repository: OpenHub-Store/GitHub-Store
Length of output: 346
\s on Line 128 won't reliably match NBSP, so marker restoration can fail.
Line 128 uses \s* but the comment explicitly states NBSP padding is tolerated. On JVM regex, NBSP (\u00A0) is not covered by \s, so markers like ⟦CF_0_END⟧ can remain unreplaced in the translated output when Youdao or other translators pad with NBSP.
Suggested fix
- val pattern = Regex("⟦\\s*CF_\\s*${i}\\s*_END\\s*⟧")
+ val pattern = Regex("⟦[\\s\\u00A0]*CF_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
📝 Committable suggestion
private fun restoreCodeFences(translated: String, fences: List<String>): String {
    if (fences.isEmpty()) return translated
    var result = translated
    fences.forEachIndexed { i, original ->
        // Tolerate translator inserting whitespace around the marker
        // (Youdao occasionally pads with NBSP). Falls back to leaving
        // any unmatched marker in place rather than corrupting prose.
        val pattern = Regex("⟦[\\s\\u00A0]*CF_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
        result = pattern.replaceFirst(result, Regex.escapeReplacement(original))
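The NBSP-tolerant pattern can be exercised in isolation with a small sketch. `restoreMarker` is a hypothetical helper that mirrors a single iteration of the restore loop; it is not part of the PR. The behavior described assumes a stock JVM regex engine, and other Kotlin targets may treat `\s` differently.

```kotlin
// Hypothetical helper mirroring one step of the restore loop: replaces the
// first CF marker for index `i`, tolerating whitespace or NBSP padding
// inside the brackets.
fun restoreMarker(text: String, i: Int, original: String): String {
    // Regular whitespace plus U+00A0, which \s alone misses on a stock JVM.
    val pad = "[\\s\\u00A0]*"
    val pattern = Regex("⟦${pad}CF_${pad}${i}${pad}_END${pad}⟧")
    // escapeReplacement keeps `$` and `\` in the original code literal.
    return pattern.replaceFirst(text, Regex.escapeReplacement(original))
}
```

Calling `restoreMarker` on text containing a marker padded with U+00A0 on both sides replaces the whole padded marker with the original span, whereas the `\s*` variant of the pattern would leave it in place on a stock JVM.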
♻️ Duplicate comments (1)
feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt (1)
169-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
\s* won't match NBSP padding from Youdao.
The comment states this tolerates whitespace, but JVM's \s doesn't include \u00A0 (NBSP). Markers padded with NBSP will remain unreplaced.
Suggested fix
- val pattern = Regex("⟦\\s*TR_\\s*${i}\\s*_END\\s*⟧")
+ val pattern = Regex("⟦[\\s\\u00A0]*TR_[\\s\\u00A0]*${i}[\\s\\u00A0]*_END[\\s\\u00A0]*⟧")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/repository/TranslationRepositoryImpl.kt` around lines 169 - 173, The Regex used in TranslationRepositoryImpl to tolerate whitespace around the marker (pattern = Regex("⟦\\s*TR_\\s*${i}\\s*_END\\s*⟧")) doesn't match NBSP (U+00A0), so markers padded by Youdao stay unreplaced; update the pattern to include NBSP (or the Unicode separator class) wherever \\s* is used, e.g. replace each "\\s*" in the literal with "[\\s\\u00A0]*" or use "\\p{Zs}*" so the Regex (used in pattern and subsequently in pattern.replaceFirst(...)) will match non-breaking spaces as well.
Summary
Stop translating content inside ``` fenced code blocks. Translators rewrite identifiers, mangle string literals, and break example commands.
How
- `extractCodeFences` walks the body and replaces every ```…``` block with a marker ⟦CF_<n>_END⟧. Markers chosen empirically: Unicode math brackets + ALL_CAPS underscore token survive Google + Youdao verbatim across the 33 supported targets.
- `restoreCodeFences` splices the original fence bodies back at each marker. Tolerates whitespace padding (Youdao occasionally inserts NBSP).
- text, so cache hit/miss behavior is unchanged.
Test plan
Summary by CodeRabbit