fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration#1526
fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration#1526ahmedqasid wants to merge 2 commits into
Conversation
|
I think this PR needs a clearer scope statement, because in its current form it does more than fix Arabic-only titles. By adding
So this is not only an Arabic fix. It changes the behavior for Thai, Japanese, Korean, Hebrew, Cyrillic, and other scripts as well. I think the PR description and tests should reflect that broader impact explicitly. The second concern is about transliteration quality. What this PR introduces is a generic ASCII approximation, not linguistically correct multi-language romanization. That may be acceptable as a pragmatic fallback to avoid collapsing to If the goal of this PR is “avoid empty/topic fallback for non-Latin titles by generating a usable ASCII slug”, then I think that should be stated much more explicitly in the PR description and test coverage. |
Pure-Arabic (and other non-Latin) titles previously got stripped by slugify and collapsed to the "topic" fallback, so every Arabic question landed at /questions/<id>/topic. Mirror the existing convertChinese pre-step using go-unidecode so titles in Arabic, Cyrillic, Hebrew, Thai etc. produce a readable ASCII slug. Latin-only and Chinese-only inputs short-circuit and remain byte-identical to the previous output. Gated by a package-level atomic flag (default on) exposed via SetTransliterateNonLatin so an admin toggle can be wired up in a follow-up PR without re-plumbing call sites.
Reviewer pointed out the fix changes slug generation for many non-Latin scripts, not just Arabic. Pin the actual behavior across Thai, Japanese hiragana, Korean, Hebrew, and Cyrillic so the test surface matches the real scope of the change. Also pin the pre-existing Japanese-kanji-via-pinyin path so reviewers can see it is unchanged by this PR.
b89301a to
3f124c8
Compare
|
Sorry for the late reply — thanks for the review, both points are fair. Scope: You're right, this isn't Arabic-specific — it fixes the topic fallback for every non-Latin, non-CJK script slugify was stripping (Thai, Japanese hiragana, Korean, Hebrew, Cyrillic, …). I've leaned into that: retitled Transliteration quality: Agreed — go-unidecode is a generic Unicode→ASCII approximation, not a per-language romanizer (こんにちは → konnichiha, ไทย → aithy). I've documented this as an explicit non-goal rather than implying Tests: Expanded into a table-driven matrix — one case per affected script, plus pins for the unchanged paths (Latin, Chinese/pinyin, Japanese kanji, emoji, empty) so existing behavior stays byte-identical. If you'd rather scope this to Arabic only, containsNonLatin is the one place to tighten — but that leaves the other scripts collapsing to topic. I think the broader fix is the better call, happy to narrow if you disagree. |
fix: avoid
topicfallback for non-Latin titles via pragmatic ASCII transliterationWhat this PR is (and isn't)
Goal: when a question title contains characters outside Basic Latin / Latin Extended / CJK Han, generate a URL slug that is a deterministic ASCII approximation instead of letting
slugifystrip everything and falling back to the literal"topic".Non-goal: this is not a linguistically correct multi-language romanizer. The output is a machine-acceptable ASCII slug, not what a native speaker would choose. For example,
こんにちは→konnichiha(not the more naturalkon'nichiwa),ไทย→aithy(notthai). Treat the slug as an opaque, stable, indexable identifier — the path-after-/questions/<id>/is for SEO and shareability, the canonical reference is always the ID.The bug
Pure non-Latin titles previously got stripped by
slugify.Slugify, hit the empty-result fallback inhtmltext.UrlTitle, and collapsed to the literal slug"topic". On a live multilingual site, every Arabic / Thai / Japanese-hiragana / Korean / Hebrew / Cyrillic question ended up at/questions/<id>/topic.The fix
UrlTitle()gets aconvertNonLatinpre-step that mirrors the existingconvertChinesepre-step pattern, usinggithub.com/mozillazg/go-unidecode(same author asgo-pinyinalready in the repo, to minimise new-dep friction).The non-Latin detector skips ASCII, Latin-1 Supplement, Latin Extended-A/B, and CJK Han. Inputs that hit none of those non-Latin letter categories short-circuit and return unchanged, so Latin-only and Chinese-only inputs remain byte-identical (pinned by tests).
Scope — what scripts are affected
This PR changes behavior for any title containing letters in scripts that
slugifydoesn't handle. Confirmed by tests inpkg/htmltext/htmltext_test.go:كيف حالكtopickyf-hlkمرحبا hellohellomrhb-helloไทย ไทยtopicaithy-aithyこんにちはtopickonnichiha안녕하세요topicannyeonghaseyoשלום עולםtopicshlvm-vlmПривет мирtopicprivet-mirUnchanged:
hello world)hello-world这是一个,标题,title)zhe-shi-yi-ge-biao-ti(pinyin path)日本)ri-ben(caught by pre-existing pinyin path; treated as Chinese reading, not Japanese — a pre-existing limitation, not introduced by this PR)😂😂😂)topictopicTransliteration quality — explicit acknowledgement
go-unidecodeis a generic Unicode → ASCII approximation. It is not a per-language romanization library. Specifically:ใ→ai(Thai romanization isioraidepending on standard),한→han,語→Yu(Chinese pinyin reading even when used in Japanese), etc.url_titleis recomputed on every request.If maintainers prefer to scope this PR more narrowly (e.g. Arabic only, and reject Thai/Hebrew/Cyrillic/etc.), the detector in
containsNonLatincan be tightened to specific Unicode blocks — but that means the other scripts continue to collapse totopic, which is the bug we're trying to fix. I'd argue the broader fix is preferable to a piecemeal one, but happy to narrow if you want.Live deployment / real-world verification
This patch has been running in production on ask.namasoft.com (an Apache Answer instance we operate) since deployment, built directly from this branch via
docker compose build. The site hosts Arabic-language questions, so the fix exercises the affected code path on every page load.Sample question URL on the deployed instance:
The slug in the URL is the transliterated Arabic title rather than
topic. No data migration was needed sinceurl_titleis computed on every request fromTitleand never persisted (see Why this is safe to ship below).Admin-configurable
The transliteration is gated by a package-level
atomic.Bool(default on, since the current behavior is objectively broken for affected users):htmltext.SetTransliterateNonLatin(enabled bool)htmltext.IsTransliterateNonLatinEnabled() boolThis is deliberately the minimum surface needed to satisfy "the setting must be readable from
UrlTitle()". A follow-up PR can add an admin UI section that callsSetTransliterateNonLatinon save and on startup, without having to re-plumb everyhtmltext.UrlTitlecall site throughcontext.Context.Default choice — please confirm: I picked default-on because the existing
topicbehavior is a bug for affected users. If you'd prefer default-off for strict backward compat on existing installs, flip theinit()inpkg/htmltext/htmltext.gotoStore(false)and surface the toggle as opt-in.Why this is safe to ship
url_titleis not a persisted column. It's not on theQuestionentity ininternal/entity/question_entity.go, no migration has ever added/dropped it, and every call site (question_service.go,revision_service.go,vote_service.go, search/report/review/rank/comment services, controllers, repos) recomputes it fromTitleat response-build time viahtmltext.UrlTitle(...).Test coverage
pkg/htmltext/htmltext_test.go:TestUrlTitleTable— table-driven, one case per affected script (the full matrix above), plus:empty→topicpure latin unchanged→ byte-identical to pre-fixpure chinese unchanged→ byte-identical to pre-fix (pins existing pinyin behavior)japanese kanji goes through pinyin path unchanged→ documents the pre-existing Han-block limitationemoji only falls back to topic→ unchangedlong arabic truncates at cutLongTitle boundary→ exercises the 150-byte cap and UTF-8 boundary safetyTestUrlTitleTransliterationToggle— with the toggle off, non-Latin titles collapse totopic(pre-fix behavior); with it on, they transliterate.TestUrlTitleleft untouched.Test plan for reviewers:
go test ./pkg/htmltext/...— all passtopicmain(covered by table tests)Out of scope (intentionally)
Non-Latin Languages Handlingadmin page +SiteType+ service / controller / migration in a follow-up if maintainers want it."topic"empty-result fallback.convertChinesepre-step pattern instead.Issues / discussion
I didn't find an existing upstream issue covering this — happy to be pointed at one if there is.
🤖 Generated with Claude Code