Skip to content

feat(markdown): polish — sub/sup, HTML entities, blockquote, pre-code#620

Merged
rainxchzed merged 1 commit into
mainfrom
feat/markdown-html-polish
May 16, 2026
Merged

feat(markdown): polish — sub/sup, HTML entities, blockquote, pre-code#620
rainxchzed merged 1 commit into
mainfrom
feat/markdown-html-polish

Conversation

@rainxchzed
Copy link
Copy Markdown
Member

@rainxchzed rainxchzed commented May 16, 2026

Preprocessor bundle for the long tail of HTML-in-markdown:

  • `
    ` → ```XYZ … ``` fence (preserves lang for the syntax-highlighter PR's slot).
  • `
    ` → line-prefixed `> ` markdown blockquote.
  • `X` / `X` → Unicode superscript / subscript per-char (H₂O, x², ⁿ). Unmappable chars stay literal; markdown lib has no inline BaselineShift hook so this is the best we can do without a custom span renderer.
  • Expanded HTML entity decoder: ©, ®, ™, —, –, …, ‘curly’ + “quotes”, «guillemets», ×, ÷, ±, °, µ, €, £, ¥, ¢, ←→↑↓, ≤≥≠, etc. Plus hex `&#xHHHH;` entities.

Test plan

  • Compile clean
  • Device: open repo with footnoted `1`, copyright `©`, em-dashes, embedded `
    ` — verify all render correctly.

Summary by CodeRabbit

  • Improved
    • Enhanced Markdown rendering: better HTML blockquote and code-block formatting with language hints
    • Added Unicode subscripts/superscripts (e.g., H₂O, x²)
    • Expanded HTML entity decoding for more symbols (©, ™, em/en dashes, curly quotes, etc.)

Review Change Stack

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rainxchzed has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 03b9187c-028e-40cb-8fca-6d6a3253d390

📥 Commits

Reviewing files that changed from the base of the PR and between 3eb2ffb and 485b5d1.

📒 Files selected for processing (14)
  • core/presentation/src/commonMain/composeResources/files/whatsnew/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ar/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/bn/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/es/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/fr/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/hi/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/it/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ja/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ko/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/pl/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ru/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/tr/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/zh-CN/18.json
  • feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt

Walkthrough

This PR extends Markdown rendering capabilities by implementing HTML-to-Markdown conversions in the preprocessor and expands HTML entity decoding to support a wider Unicode range and named entity references. The feature is documented across twelve localized release notes describing improved blockquote/code-fence handling, Unicode sub/superscript rendering, and HTML entity decoding.

Changes

Markdown Rendering Polish and Release Notes

Layer / File(s) Summary
HTML-to-Markdown block conversions
feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt
<pre><code> blocks are converted to Markdown fences with optional language hints; <blockquote> elements become blockquote-prefixed lines.
Character mapping and expanded entity decoding
feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt
Superscript/subscript characters map to Unicode equivalents via lookup tables; HTML entity decoding uses a centralized HTML_ENTITIES map and handles decimal/hex entities across a broader Unicode range (32..0x10FFFF).
Localized release notes for v1.8.3
core/presentation/src/commonMain/composeResources/files/whatsnew/*/18.json
Thirteen localized files (EN, AR, BN, ES, FR, HI, IT, JA, KO, PL, RU, TR, ZH-CN) update the "IMPROVED" section to describe Markdown rendering fixes: blockquote/code-fence correctness, Unicode sub/superscript handling, and named HTML entity decoding.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Poem

A rabbit hops through Markdown's grace,
With blockquotes, quotes, and code's embrace.
H₂O now shines in Unicode bright,
And entities decode just right.
Thirteen tongues now tell the story true! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding Markdown rendering polish with support for sub/superscripts, HTML entities, blockquotes, and pre-code blocks.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/markdown-html-polish

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`:
- Line 341: The regex in preprocessMarkdown.kt currently uses language-(\w+)
which fails to capture common language IDs like c++, f#, objective-c, or
shell-session; update that capture group to accept plus signs, hashes and
hyphens (e.g. replace language-(\w+) with language-([A-Za-z0-9_+#-]+) or more
generally language-([^"'\s>]+)) so the regex on the line containing
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>"""
correctly preserves all common fenced-code language IDs.
- Line 347: The current Markdown code-fence generation always emits triple
backticks and can break if the variable code contains backticks; update the
block in preprocessMarkdown.kt where you build the fence for a code snippet (the
section referencing lang and code) to compute the longest run of backticks
inside code (use Regex like `\`{3,}\`` to find matches), default to 2 if none
found, then create a fence string of one more backtick (e.g., fence =
"`".repeat(maxTicks+1)) and use that fence instead of the hardcoded "```" so the
emitted "\n$fence$lang\n$code\n$fence\n" remains well-formed.
- Around line 494-505: The Unicode entity decoding replaces numeric entities by
calling code.toChar().toString(), which truncates supplementary-plane code
points; in the numeric-decimal and numeric-hex replace blocks (the lambda used
to build processed via processed.replace(...){ match -> ... }), replace
code.toChar().toString() with a construct that preserves full Unicode code
points (e.g., create a String from the code point such as
String(intArrayOf(code))) so both decimal and hex entity decoding handle values
up to 0x10FFFF correctly; keep the existing range check (32..0x10FFFF) and apply
the same change in both places.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cc90bac9-daf8-4250-91c7-fff281eb8e8e

📥 Commits

Reviewing files that changed from the base of the PR and between 01226dc and 3eb2ffb.

📒 Files selected for processing (14)
  • core/presentation/src/commonMain/composeResources/files/whatsnew/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ar/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/bn/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/es/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/fr/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/hi/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/it/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ja/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ko/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/pl/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/ru/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/tr/18.json
  • core/presentation/src/commonMain/composeResources/files/whatsnew/zh-CN/18.json
  • feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt

processed =
processed.replace(
Regex(
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Language capture is too restrictive for common fenced-code language IDs.

Line 341 uses language-(\w+), which misses values like c++, objective-c, f#, and shell-session. This silently drops syntax hints.

💡 Proposed fix
-                """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",
+                """<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-([A-Za-z0-9_+#.-]+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-([A-Za-z0-9_+#.-]+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>""",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`
at line 341, The regex in preprocessMarkdown.kt currently uses language-(\w+)
which fails to capture common language IDs like c++, f#, objective-c, or
shell-session; update that capture group to accept plus signs, hashes and
hyphens (e.g. replace language-(\w+) with language-([A-Za-z0-9_+#-]+) or more
generally language-([^"'\s>]+)) so the regex on the line containing
"""<pre[^>]*>\s*<code(?:\s+[^>]*?class\s*=\s*["'][^"']*?language-(\w+)[^"']*?["'])?[^>]*>(.*?)</code>\s*</pre>"""
correctly preserves all common fenced-code language IDs.

) { match ->
val lang = match.groupValues[1]
val code = match.groupValues[2]
"\n```$lang\n$code\n```\n"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Code-fence generation can break when code content contains triple backticks.

Line 347 always emits ``` fences. If code already includes triple backticks, the generated Markdown block becomes malformed.

💡 Proposed fix
         ) { match ->
             val lang = match.groupValues[1]
             val code = match.groupValues[2]
-            "\n```$lang\n$code\n```\n"
+            val maxTicks = Regex("""`{3,}""").findAll(code).maxOfOrNull { it.value.length } ?: 2
+            val fence = "`".repeat(maxTicks + 1)
+            "\n$fence$lang\n$code\n$fence\n"
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`
at line 347, The current Markdown code-fence generation always emits triple
backticks and can break if the variable code contains backticks; update the
block in preprocessMarkdown.kt where you build the fence for a code snippet (the
section referencing lang and code) to compute the longest run of backticks
inside code (use Regex like `\`{3,}\`` to find matches), default to 2 if none
found, then create a fence string of one more backtick (e.g., fence =
"`".repeat(maxTicks+1)) and use that fence instead of the hardcoded "```" so the
emitted "\n$fence$lang\n$code\n$fence\n" remains well-formed.

Comment on lines +494 to 505
if (code != null && code in 32..0x10FFFF) {
code.toChar().toString()
} else {
match.value
}
}
// Numeric HTML entities (hex): &#xHHHH; → char.
processed =
processed.replace(Regex("""&#x([0-9A-Fa-f]+);""")) { match ->
val code = match.groupValues[1].toIntOrNull(16)
if (code != null && code in 32..0x10FFFF) {
code.toChar().toString()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the potentially lossy conversion path and where it is used.
rg -n -C2 '0x10FFFF|toChar\(\)' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt

# Demonstrate why toChar is unsafe for supplementary code points.
python - <<'PY'
cp = 0x1F600  # 😀
print("Expected code point:", hex(cp), chr(cp))
print("Truncated via 16-bit lower unit:", hex(cp & 0xFFFF), chr(cp & 0xFFFF))
PY

Repository: OpenHub-Store/GitHub-Store

Length of output: 735


🏁 Script executed:

# Check if Character.toChars() is available in Kotlin
rg -n 'Character\.' feature/details/data/ || echo "No existing Character usage found"

# Check Kotlin version and available APIs
head -20 feature/details/data/build.gradle.kts || head -20 feature/details/data/build.gradle

# Verify the context: is this commonMain code (multiplatform)?
head -5 feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt

# Look for other potentially unsafe toChar() conversions
rg -n 'toChar()' feature/details/data/src/commonMain/kotlin/

Repository: OpenHub-Store/GitHub-Store

Length of output: 1094


🏁 Script executed:

# Check what Kotlin version constraint says (from coding guidelines: Kotlin 2.3.10)
# Search for existing surrogate pair or Unicode handling in the codebase
rg -n 'surrogate|Surrogate|\\\\u|toChars|fromCodePoint' feature/details/data/src/ | head -20

# Check if there's a Kotlin extension or utility for code point to string
rg -n 'codePoint|CodePoint' feature/ | head -20

# Look for similar HTML entity handling or Unicode operations in related files
fd -t f -e kt feature/details/ -exec grep -l 'Regex\|replace\|entity' {} \; | head -10

Repository: OpenHub-Store/GitHub-Store

Length of output: 328


🏁 Script executed:

# Search for existing multiplatform Unicode handling
rg -n 'codePoint|CodePoint|surrogate' feature/details/

# Check what's available in Kotlin stdlib for code point handling
# In Kotlin, String constructor can take IntArray of code points
rg -n 'String(' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt | head -20

# Look at the full context of the entity decoding to understand better
sed -n '490,510p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt

Repository: OpenHub-Store/GitHub-Store

Length of output: 882


🏁 Script executed:

# Verify String constructor with code points exists in Kotlin stdlib
# by checking if it's used elsewhere in the project
rg -n 'String(intArrayOf|String(.*codePoint' feature/

# Also check the Kotlin version to confirm modern stdlib
cat feature/details/data/build.gradle.kts | grep -i kotlin

Repository: OpenHub-Store/GitHub-Store

Length of output: 321


🏁 Script executed:

# Simple search for intArrayOf usage
rg 'intArrayOf' feature/

# Search for code point handling
rg 'codePoint' feature/

# Check if there's any String construction with arrays
rg 'String.*Array' feature/details/data/src/commonMain/

Repository: OpenHub-Store/GitHub-Store

Length of output: 52


🏁 Script executed:

# Let's check the actual Kotlin stdlib version available
rg 'stdlib' feature/details/data/build.gradle.kts -A 2 -B 2

# Search more broadly for how Unicode strings are created in this codebase
rg 'String\(' feature/details/data/src/commonMain/kotlin/ | head -20

# Let's verify the issue one more time with a simpler approach
# Check the exact problematic code and its context
sed -n '492,508p' feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt

Repository: OpenHub-Store/GitHub-Store

Length of output: 4160


Unicode entity decoding truncates supplementary plane code points.

Lines 495 and 505 use toChar(), which only preserves the lower 16 bits despite the range check allowing up to 0x10FFFF. This corrupts supplementary plane characters (emoji, rare scripts, historic symbols).

Use String(intArrayOf(code)) instead, which properly handles all valid Unicode code points in multiplatform Kotlin:

💡 Fix
             if (code != null && code in 32..0x10FFFF) {
-                code.toChar().toString()
+                String(intArrayOf(code))
             } else {
                 match.value
             }

Apply to both decimal (line 495) and hexadecimal (line 505) entity decoding blocks.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (code != null && code in 32..0x10FFFF) {
code.toChar().toString()
} else {
match.value
}
}
// Numeric HTML entities (hex): &#xHHHH; → char.
processed =
processed.replace(Regex("""&#x([0-9A-Fa-f]+);""")) { match ->
val code = match.groupValues[1].toIntOrNull(16)
if (code != null && code in 32..0x10FFFF) {
code.toChar().toString()
if (code != null && code in 32..0x10FFFF) {
String(intArrayOf(code))
} else {
match.value
}
}
// Numeric HTML entities (hex): &`#xHHHH`; → char.
processed =
processed.replace(Regex("""&`#x`([0-9A-Fa-f]+);""")) { match ->
val code = match.groupValues[1].toIntOrNull(16)
if (code != null && code in 32..0x10FFFF) {
String(intArrayOf(code))
} else {
match.value
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@feature/details/data/src/commonMain/kotlin/zed/rainxch/details/data/utils/preprocessMarkdown.kt`
around lines 494 - 505, The Unicode entity decoding replaces numeric entities by
calling code.toChar().toString(), which truncates supplementary-plane code
points; in the numeric-decimal and numeric-hex replace blocks (the lambda used
to build processed via processed.replace(...){ match -> ... }), replace
code.toChar().toString() with a construct that preserves full Unicode code
points (e.g., create a String from the code point such as
String(intArrayOf(code))) so both decimal and hex entity decoding handle values
up to 0x10FFFF correctly; keep the existing range check (32..0x10FFFF) and apply
the same change in both places.

@rainxchzed rainxchzed force-pushed the feat/markdown-html-polish branch from 3eb2ffb to 485b5d1 Compare May 16, 2026 15:45
@rainxchzed rainxchzed merged commit 83d4d92 into main May 16, 2026
1 check was pending
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rainxchzed has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant