feat(markdown): polish — sub/sup, HTML entities, blockquote, pre-code#620
Conversation
There was a problem hiding this comment.
rainxchzed has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (14)
WalkthroughThis PR extends Markdown rendering capabilities by implementing HTML-to-Markdown conversions in the preprocessor and expands HTML entity decoding to support a wider Unicode range and named entity references. The feature is documented across twelve localized release notes describing improved blockquote/code-fence handling, Unicode sub/superscript rendering, and HTML entity decoding. ChangesMarkdown Rendering Polish and Release Notes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`:
- Line 341: The regex in preprocessMarkdown.kt currently uses language-(\w+)
which fails to capture common language IDs like c++, f#, objective-c, or
shell-session; update that capture group to accept plus signs, hashes and
hyphens (e.g. replace language-(\w+) with language-([A-Za-z0-9_+#-]+) or more
generally language-([^"'\s>]+)) so the regex on the line containing
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>"""
correctly preserves all common fenced-code language IDs.
- Line 347: The current Markdown code-fence generation always emits triple
backticks and can break if the variable code contains backticks; update the
block in preprocessMarkdown.kt where you build the fence for a code snippet (the
section referencing lang and code) to compute the longest run of backticks
inside code (use Regex like `\`{3,}\`` to find matches), default to 2 if none
found, then create a fence string of one more backtick (e.g., fence =
"`".repeat(maxTicks+1)) and use that fence instead of the hardcoded "```" so the
emitted "\n$fence$lang\n$code\n$fence\n" remains well-formed.
- Around line 494-505: The Unicode entity decoding replaces numeric entities by
calling code.toChar().toString(), which truncates supplementary-plane code
points; in the numeric-decimal and numeric-hex replace blocks (the lambda used
to build processed via processed.replace(...){ match -> ... }), replace
code.toChar().toString() with a construct that preserves full Unicode code
points (e.g., create a String from the code point such as
String(intArrayOf(code))) so both decimal and hex entity decoding handle values
up to 0x10FFFF correctly; keep the existing range check (32..0x10FFFF) and apply
the same change in both places.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cc90bac9-daf8-4250-91c7-fff281eb8e8e
📒 Files selected for processing (14)
core/presentation/src/commonMain/composeResources/files/whatsnew/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/ar/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/bn/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/es/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/fr/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/hi/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/it/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/ja/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/ko/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/pl/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/ru/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/tr/18.jsoncore/presentation/src/commonMain/composeResources/files/whatsnew/zh-CN/18.jsonfeature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt
| processed = | ||
| processed.replace( | ||
| Regex( | ||
| """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""", |
There was a problem hiding this comment.
Language capture is too restrictive for common fenced-code language IDs.
Line 341 uses language-(\w+), which misses values like c++, objective-c, f#, and shell-session. This silently drops syntax hints.
💡 Proposed fix
- """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",
+ """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-([A-Za-z0-9_+#.-]+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""", | |
| """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-([A-Za-z0-9_+#.-]+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""", |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`
at line 341, The regex in preprocessMarkdown.kt currently uses language-(\w+)
which fails to capture common language IDs like c++, f#, objective-c, or
shell-session; update that capture group to accept plus signs, hashes and
hyphens (e.g. replace language-(\w+) with language-([A-Za-z0-9_+#-]+) or more
generally language-([^"'\s>]+)) so the regex on the line containing
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>"""
correctly preserves all common fenced-code language IDs.
| ) { match -> | ||
| val lang = match.groupValues[1] | ||
| val code = match.groupValues[2] | ||
| "\n```$lang\n$code\n```\n" |
There was a problem hiding this comment.
Code-fence generation can break when code content contains triple backticks.
Line 347 always emits ``` fences. If code already includes triple backticks, the generated Markdown block becomes malformed.
💡 Proposed fix
) { match ->
val lang = match.groupValues[1]
val code = match.groupValues[2]
- "\n```$lang\n$code\n```\n"
+ val maxTicks = Regex("""`{3,}""").findAll(code).maxOfOrNull { it.value.length } ?: 2
+ val fence = "`".repeat(maxTicks + 1)
+ "\n$fence$lang\n$code\n$fence\n"
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`
at line 347, The current Markdown code-fence generation always emits triple
backticks and can break if the variable code contains backticks; update the
block in preprocessMarkdown.kt where you build the fence for a code snippet (the
section referencing lang and code) to compute the longest run of backticks
inside code (use Regex like `\`{3,}\`` to find matches), default to 2 if none
found, then create a fence string of one more backtick (e.g., fence =
"`".repeat(maxTicks+1)) and use that fence instead of the hardcoded "```" so the
emitted "\n$fence$lang\n$code\n$fence\n" remains well-formed.
| if (code != null && code in 32..0x10FFFF) { | ||
| code.toChar().toString() | ||
| } else { | ||
| match.value | ||
| } | ||
| } | ||
| // Numeric HTML entities (hex): &#xHHHH; → char. | ||
| processed = | ||
| processed.replace(Regex("""&#x([0-9A-Fa-f]+);""")) { match -> | ||
| val code = match.groupValues[1].toIntOrNull(16) | ||
| if (code != null && code in 32..0x10FFFF) { | ||
| code.toChar().toString() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the potentially lossy conversion path and where it is used.
rg -n -C2 '0x10FFFF|toChar\(\)' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt
# Demonstrate why toChar is unsafe for supplementary code points.
python - <<'PY'
cp = 0x1F600 # 😀
print("Expected code point:", hex(cp), chr(cp))
print("Truncated via 16-bit lower unit:", hex(cp & 0xFFFF), chr(cp & 0xFFFF))
PYRepository: OpenHub-Store/GitHub-Store
Length of output: 735
🏁 Script executed:
# Check if Character.toChars() is available in Kotlin
rg -n 'Character\.' feature/details/data/ || echo "No existing Character usage found"
# Check Kotlin version and available APIs
head -20 feature/details/data/build.gradle.kts || head -20 feature/details/data/build.gradle
# Verify the context: is this commonMain code (multiplatform)?
head -5 feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt
# Look for other potentially unsafe toChar() conversions
rg -n 'toChar()' feature/details/data/src/commonMain/kotlin/Repository: OpenHub-Store/GitHub-Store
Length of output: 1094
🏁 Script executed:
# Check what Kotlin version constraint says (from coding guidelines: Kotlin 2.3.10)
# Search for existing surrogate pair or Unicode handling in the codebase
rg -n 'surrogate|Surrogate|\\\\u|toChars|fromCodePoint' feature/details/data/src/ | head -20
# Check if there's a Kotlin extension or utility for code point to string
rg -n 'codePoint|CodePoint' feature/ | head -20
# Look for similar HTML entity handling or Unicode operations in related files
fd -t f -e kt feature/details/ -exec grep -l 'Regex\|replace\|entity' {} \; | head -10Repository: OpenHub-Store/GitHub-Store
Length of output: 328
🏁 Script executed:
# Search for existing multiplatform Unicode handling
rg -n 'codePoint|CodePoint|surrogate' feature/details/
# Check what's available in Kotlin stdlib for code point handling
# In Kotlin, String constructor can take IntArray of code points
rg -n 'String(' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt | head -20
# Look at the full context of the entity decoding to understand better
sed -n '490,510p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.ktRepository: OpenHub-Store/GitHub-Store
Length of output: 882
🏁 Script executed:
# Verify String constructor with code points exists in Kotlin stdlib
# by checking if it's used elsewhere in the project
rg -n 'String(intArrayOf|String(.*codePoint' feature/
# Also check the Kotlin version to confirm modern stdlib
cat feature/details/data/build.gradle.kts | grep -i kotlinRepository: OpenHub-Store/GitHub-Store
Length of output: 321
🏁 Script executed:
# Simple search for intArrayOf usage
rg 'intArrayOf' feature/
# Search for code point handling
rg 'codePoint' feature/
# Check if there's any String construction with arrays
rg 'String.*Array' feature/details/data/src/commonMain/Repository: OpenHub-Store/GitHub-Store
Length of output: 52
🏁 Script executed:
# Let's check the actual Kotlin stdlib version available
rg 'stdlib' feature/details/data/build.gradle.kts -A 2 -B 2
# Search more broadly for how Unicode strings are created in this codebase
rg 'String\(' feature/details/data/src/commonMain/kotlin/ | head -20
# Let's verify the issue one more time with a simpler approach
# Check the exact problematic code and its context
sed -n '492,508p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.ktRepository: OpenHub-Store/GitHub-Store
Length of output: 4160
Unicode entity decoding truncates supplementary plane code points.
Lines 495 and 505 use toChar(), which only preserves the lower 16 bits despite the range check allowing up to 0x10FFFF. This corrupts supplementary plane characters (emoji, rare scripts, historic symbols).
Use String(intArrayOf(code)) instead, which properly handles all valid Unicode code points in multiplatform Kotlin:
💡 Fix
if (code != null && code in 32..0x10FFFF) {
- code.toChar().toString()
+ String(intArrayOf(code))
} else {
match.value
}Apply to both decimal (line 495) and hexadecimal (line 505) entity decoding blocks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (code != null && code in 32..0x10FFFF) { | |
| code.toChar().toString() | |
| } else { | |
| match.value | |
| } | |
| } | |
| // Numeric HTML entities (hex): &#xHHHH; → char. | |
| processed = | |
| processed.replace(Regex("""&#x([0-9A-Fa-f]+);""")) { match -> | |
| val code = match.groupValues[1].toIntOrNull(16) | |
| if (code != null && code in 32..0x10FFFF) { | |
| code.toChar().toString() | |
| if (code != null && code in 32..0x10FFFF) { | |
| String(intArrayOf(code)) | |
| } else { | |
| match.value | |
| } | |
| } | |
| // Numeric HTML entities (hex): &`#xHHHH`; → char. | |
| processed = | |
| processed.replace(Regex("""&`#x`([0-9A-Fa-f]+);""")) { match -> | |
| val code = match.groupValues[1].toIntOrNull(16) | |
| if (code != null && code in 32..0x10FFFF) { | |
| String(intArrayOf(code)) | |
| } else { | |
| match.value | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`
around lines 494 - 505, The Unicode entity decoding replaces numeric entities by
calling code.toChar().toString(), which truncates supplementary-plane code
points; in the numeric-decimal and numeric-hex replace blocks (the lambda used
to build processed via processed.replace(...){ match -> ... }), replace
code.toChar().toString() with a construct that preserves full Unicode code
points (e.g., create a String from the code point such as
String(intArrayOf(code))) so both decimal and hex entity decoding handle values
up to 0x10FFFF correctly; keep the existing range check (32..0x10FFFF) and apply
the same change in both places.
3eb2ffb to
485b5d1
Compare
There was a problem hiding this comment.
rainxchzed has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
Preprocessor bundle for the long tail of HTML-in-markdown:
Test plan
Summary by CodeRabbit