fix: prefer HTTP header charset and fallback to UTF-8 when missing#1196
fix: prefer HTTP header charset and fallback to UTF-8 when missing#1196Ashinch merged 2 commits intoReadYouApp:mainfrom
Conversation
|
What Content-Type do those problematic feeds return? I think the parser should prefer the charset provided by the feed (via the HTTP Content-Type header) and fall back to UTF-8 only when it’s missing. This PR forces UTF-8 for all cases, which may break feeds that legitimately specify a different encoding. |
|
Thanks for the clarification @Ashinch ! Here's what I found while debugging the problematic feeds. Some feeds (for example: with no According to RFC 3023 rules,
The feed itself is UTF-8, so decoding it as ASCII causes the broken characters. Proposed solutionAs you mentioned earlier, the parser should prefer the charset from the HTTP Content-Type header and fall back to UTF-8 only when missing, the safest approach is:
val httpContentType =
contentType?.let {
if (it.contains("charset=", ignoreCase = true)) it
else "$it; charset=UTF-8"
} ?: "text/xml; charset=UTF-8"
This keeps ROME’s charset detection intact, avoids forcing UTF-8 in cases where a different charset is explicitly declared, and fixes feeds that omit the charset but are actually UTF-8. Happy to adjust further if needed! |
Thanks for following up! This looks good, and we can go ahead with it. It might not cover every encoding issue, but it should work fine for most feeds. |
|
Feel free to @Ashinch when it's ready. |
|
Hey @Ashinch |
|
Btw, the CI builds are consistently failing due to some kind of OOM error, which doesn't seem to be related to the changes in the PRs:
It may be the GitHub Actions runner running out of RAM during Compose/Kotlin compilation. GitHub has recently updated repository cache limits, which might affect build caching or memory usage. This could be worth checking: Maybe you could review the CI configuration to see if anything needs adjustment. |
Fixes #1193
This fixes incorrectly decoded characters in some feeds that do not report a correct charset in their Content-Type header.
Rome was guessing the charset and decoding wrongly for certain sources
(e.g. feeds containing Chinese characters).
The fix is to force UTF-8 in XmlReader, which resolves the issue across feeds.
Tested with multiple feeds, including the problematic one.
Before / After