
fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration #1526

Open
ahmedqasid wants to merge 2 commits into apache:main from ahmedqasid:fix/arabic-url-slug

@ahmedqasid

@ahmedqasid ahmedqasid commented May 15, 2026

fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration

Scope update (in response to review): this PR is intentionally broader than its original "Arabic-only" framing. The implementation changes URL slug generation for every non-Latin, non-CJK script that slugify previously stripped — see Scope below for the explicit list. The goal is not linguistically correct romanization; it is "avoid collapsing to /topic by producing a usable ASCII slug."

What this PR is (and isn't)

Goal: when a question title contains characters outside Basic Latin / Latin Extended / CJK Han, generate a URL slug that is a deterministic ASCII approximation instead of letting slugify strip everything and falling back to the literal "topic".

Non-goal: this is not a linguistically correct multi-language romanizer. The output is a machine-acceptable ASCII slug, not what a native speaker would choose. For example, こんにちは → konnichiha (not the more natural kon'nichiwa) and ไทย → aithy (not thai). Treat the slug as an opaque, stable, indexable identifier: the path segment after /questions/<id>/ is for SEO and shareability, while the canonical reference is always the ID.

The bug

Pure non-Latin titles previously got stripped by slugify.Slugify, hit the empty-result fallback in htmltext.UrlTitle, and collapsed to the literal slug "topic". On a live multilingual site, every Arabic / Thai / Japanese-hiragana / Korean / Hebrew / Cyrillic question ended up at /questions/<id>/topic.

The fix

UrlTitle() gets a convertNonLatin pre-step that mirrors the existing convertChinese pre-step pattern, using github.com/mozillazg/go-unidecode (same author as go-pinyin already in the repo, to minimise new-dep friction).

UrlTitle(title)
  → convertChinese(title)        // pre-existing: Han-block → pinyin
  → convertNonLatin(title)       // NEW: detect non-Latin letters → unidecode to ASCII
  → clearEmoji / slugify / url.QueryEscape / cutLongTitle (unchanged)

The non-Latin detector skips ASCII, Latin-1 Supplement, Latin Extended-A/B, and CJK Han. Inputs that contain no letters outside those ranges short-circuit and are returned unchanged, so Latin-only and Chinese-only inputs remain byte-identical (pinned by tests).
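
For readers who want a concrete picture of the detector described above, here is a minimal sketch using only the standard library. It is an assumption about how such a check could look, not the code in this branch: the function body, the U+024F cutoff for "Latin through Latin Extended-B", and the use of unicode.Han for the CJK check are all illustrative choices.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsNonLatin reports whether s has at least one letter outside the
// ranges the description says are skipped: ASCII, Latin-1 Supplement,
// Latin Extended-A/B (everything at or below U+024F), and CJK Han.
// Hypothetical reconstruction; the real detector in the PR may differ.
func containsNonLatin(s string) bool {
	for _, r := range s {
		if r <= 0x024F {
			continue // Basic Latin through Latin Extended-B
		}
		if unicode.Is(unicode.Han, r) {
			continue // Han is left to the existing pinyin pre-step
		}
		if unicode.IsLetter(r) {
			return true // a letter from some other script
		}
	}
	return false
}

func main() {
	titles := []string{"hello world", "这是一个标题", "كيف حالك", "Привет мир", "😂😂😂"}
	for _, title := range titles {
		fmt.Printf("%q -> %v\n", title, containsNonLatin(title))
	}
}
```

Because emoji are symbols rather than letters, an emoji-only title does not trigger the detector in this sketch, which matches the "emoji only falls back to topic" row in the scope tables.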

Scope — what scripts are affected

This PR changes behavior for any title containing letters in scripts that slugify doesn't handle. Confirmed by tests in pkg/htmltext/htmltext_test.go:

| Script | Example title | Before | After |
|---|---|---|---|
| Arabic | كيف حالك | topic | kyf-hlk |
| Mixed Latin + Arabic | مرحبا hello | hello | mrhb-hello |
| Thai | ไทย ไทย | topic | aithy-aithy |
| Japanese hiragana | こんにちは | topic | konnichiha |
| Korean | 안녕하세요 | topic | annyeonghaseyo |
| Hebrew | שלום עולם | topic | shlvm-vlm |
| Cyrillic | Привет мир | topic | privet-mir |

Unchanged:

| Case | Behavior |
|---|---|
| Pure Latin (hello world) | unchanged → hello-world |
| Pure Chinese (这是一个,标题,title) | unchanged → zhe-shi-yi-ge-biao-ti (pinyin path) |
| Japanese with Han-block kanji (日本) | unchanged → ri-ben (caught by pre-existing pinyin path; treated as Chinese reading, not Japanese; a pre-existing limitation, not introduced by this PR) |
| Emoji only (😂😂😂) | unchanged → topic |
| Empty / whitespace | unchanged → topic |

Transliteration quality — explicit acknowledgement

go-unidecode is a generic Unicode → ASCII approximation. It is not a per-language romanization library. Specifically:

  • It will pick one approximation per codepoint regardless of language context. ai (Thai romanization is i or ai depending on standard), han, Yu (Chinese pinyin reading even when used in Japanese), etc.
  • The result is good enough to be a stable, URL-safe, human-recognizable handle, but speakers of the source language will not consider it "correct."
  • It is deterministic, so the same title always produces the same slug — important since url_title is recomputed on every request.

If maintainers prefer to scope this PR more narrowly (e.g. Arabic only, and reject Thai/Hebrew/Cyrillic/etc.), the detector in containsNonLatin can be tightened to specific Unicode blocks — but that means the other scripts continue to collapse to topic, which is the bug we're trying to fix. I'd argue the broader fix is preferable to a piecemeal one, but happy to narrow if you want.
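
To make the "tighten to specific Unicode blocks" option concrete, here is a hedged sketch of what a narrowed detector might look like. The name containsAllowedScript and the allowedScripts list are hypothetical and do not appear in this branch; only the standard-library unicode tables are real.

```go
package main

import (
	"fmt"
	"unicode"
)

// allowedScripts is a hypothetical allow-list. With only unicode.Arabic in
// it, Thai, Hebrew, Cyrillic, etc. would keep collapsing to "topic".
var allowedScripts = []*unicode.RangeTable{unicode.Arabic}

// containsAllowedScript reports whether s has any rune from the allow-list.
// Illustrative only; not the detector shipped in the PR.
func containsAllowedScript(s string) bool {
	for _, r := range s {
		if unicode.In(r, allowedScripts...) {
			return true
		}
	}
	return false
}

func main() {
	for _, title := range []string{"كيف حالك", "مرحبا hello", "Привет мир", "hello world"} {
		fmt.Printf("%q -> %v\n", title, containsAllowedScript(title))
	}
}
```

Widening the scope later would then be a matter of appending tables such as unicode.Hebrew or unicode.Cyrillic to the list.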

Live deployment / real-world verification

This patch has been running in production on ask.namasoft.com (an Apache Answer instance we operate), built directly from this branch via docker compose build. The site hosts Arabic-language questions, so the fix exercises the affected code path on every page load.

Sample question URL on the deployed instance:

https://ask.namasoft.com/questions/10010000000000115

The slug in the URL is the transliterated Arabic title rather than topic. No data migration was needed since url_title is computed on every request from Title and never persisted (see Why this is safe to ship below).

Admin-configurable

The transliteration is gated by a package-level atomic.Bool (default on, since the current behavior is objectively broken for affected users):

  • htmltext.SetTransliterateNonLatin(enabled bool)
  • htmltext.IsTransliterateNonLatinEnabled() bool

This is deliberately the minimum surface needed to satisfy "the setting must be readable from UrlTitle()". A follow-up PR can add an admin UI section that calls SetTransliterateNonLatin on save and on startup, without having to re-plumb every htmltext.UrlTitle call site through context.Context.

Default choice — please confirm: I picked default-on because the existing topic behavior is a bug for affected users. If you'd prefer default-off for strict backward compat on existing installs, flip the init() in pkg/htmltext/htmltext.go to Store(false) and surface the toggle as opt-in.
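
As a rough illustration of the gate described in this section, the sketch below shows one way a package-level atomic.Bool with a default-on init() could be wired. The two exported function names come from the description above; the variable name transliterateNonLatin is an assumption, and the real file may be organised differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// transliterateNonLatin is a hypothetical name for the package-level flag.
// atomic.Bool (Go 1.19+) makes reads and writes safe across goroutines.
var transliterateNonLatin atomic.Bool

// Default on, as proposed; Store(false) here would make the feature opt-in.
func init() {
	transliterateNonLatin.Store(true)
}

// SetTransliterateNonLatin switches the behavior at runtime, for example
// from a future admin setting.
func SetTransliterateNonLatin(enabled bool) {
	transliterateNonLatin.Store(enabled)
}

// IsTransliterateNonLatinEnabled reports the current setting.
func IsTransliterateNonLatinEnabled() bool {
	return transliterateNonLatin.Load()
}

func main() {
	fmt.Println("enabled:", IsTransliterateNonLatinEnabled())
	SetTransliterateNonLatin(false)
	fmt.Println("enabled:", IsTransliterateNonLatinEnabled())
}
```

A package-level flag keeps every existing UrlTitle call site untouched, at the cost of being process-wide state rather than per-request configuration.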

Why this is safe to ship

  • url_title is not a persisted column. It's not on the Question entity in internal/entity/question_entity.go, no migration has ever added/dropped it, and every call site (question_service.go, revision_service.go, vote_service.go, search/report/review/rank/comment services, controllers, repos) recomputes it from Title at response-build time via htmltext.UrlTitle(...).
  • That means the fix is read-only: existing rows light up with correct slugs on the next request, with no migration and no data rewrite.
  • Rollback is just redeploying the prior image; nothing on disk changes.

Test coverage

pkg/htmltext/htmltext_test.go:

  • TestUrlTitleTable — table-driven, one case per affected script (the full matrix above), plus:
    • empty → topic
    • pure latin unchanged → byte-identical to pre-fix
    • pure chinese unchanged → byte-identical to pre-fix (pins existing pinyin behavior)
    • japanese kanji goes through pinyin path unchanged → documents the pre-existing Han-block limitation
    • emoji only falls back to topic → unchanged
    • long arabic truncates at cutLongTitle boundary → exercises the 150-byte cap and UTF-8 boundary safety
  • TestUrlTitleTransliterationToggle — with the toggle off, non-Latin titles collapse to topic (pre-fix behavior); with it on, they transliterate.
  • Existing TestUrlTitle left untouched.
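
For reviewers unfamiliar with what "UTF-8 boundary safety" means in the truncation case, here is a small sketch of cutting a string at a byte cap without splitting a multi-byte character. This is not the repository's cutLongTitle; the function name truncateAtRuneBoundary is hypothetical and the real implementation may work differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateAtRuneBoundary returns at most maxBytes bytes of s, backing up
// to the nearest rune start so the result stays valid UTF-8.
// Hypothetical helper for illustration only.
func truncateAtRuneBoundary(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// s[cut] is the first byte that would be dropped; if it is a
	// continuation byte, the cut falls inside a character, so move left.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	title := "مرحبا" // five Arabic letters, two bytes each
	short := truncateAtRuneBoundary(title, 3)
	fmt.Println(len(short), utf8.ValidString(short))
}
```

With transliteration enabled the slug is already ASCII by the time it is capped, so the boundary check mainly matters for inputs that reach truncation with multi-byte characters still present.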

Test plan for reviewers:

  • go test ./pkg/htmltext/... — all pass
  • Visit the live sample URL above and confirm slug is transliterated, not topic
  • Verify Chinese / Latin / emoji-only / empty behavior is byte-identical to main (covered by table tests)

Out of scope (intentionally)

  • No admin UI / site setting plumbing in this PR — see Admin-configurable above. Happy to do the React Non-Latin Languages Handling admin page + SiteType + service / controller / migration in a follow-up if maintainers want it.
  • No change to the "topic" empty-result fallback.
  • No plugin interface for slug generation — mirrored the existing convertChinese pre-step pattern instead.
  • No per-language romanization library — this is an explicit non-goal; see Transliteration quality above.

Issues / discussion

I didn't find an existing upstream issue covering this — happy to be pointed at one if there is.

🤖 Generated with Claude Code

@LinkinStars
Member

LinkinStars commented May 19, 2026

I think this PR needs a clearer scope statement, because in its current form it does more than fix Arabic-only titles.

By adding go-unidecode before slugify, it changes slug generation for many non-Latin scripts, not just Arabic. For example, on my local reproduction:

  • مرحبا hello changed from hello to mrhb-hello
  • ไทย ไทย changed from topic to aithy-aithy
  • こんにちは changed from topic to konnichiha
  • 안녕하세요 changed from topic to annyeonghaseyo

So this is not only an Arabic fix. It changes the behavior for Thai, Japanese, Korean, Hebrew, Cyrillic, and other scripts as well. I think the PR description and tests should reflect that broader impact explicitly.

The second concern is about transliteration quality. What this PR introduces is a generic ASCII approximation, not linguistically correct multi-language romanization. That may be acceptable as a pragmatic fallback to avoid collapsing to topic, but it is different from saying the generated slug is “correct” for each language. For example, こんにちは -> konnichiha is machine-acceptable as an ASCII slug, but it is not necessarily the natural or expected romanization users would want. The same concern applies to other languages as well.

If the goal of this PR is “avoid empty/topic fallback for non-Latin titles by generating a usable ASCII slug”, then I think that should be stated much more explicitly in the PR description and test coverage.

Pure-Arabic (and other non-Latin) titles previously got stripped by
slugify and collapsed to the "topic" fallback, so every Arabic question
landed at /questions/<id>/topic. Mirror the existing convertChinese
pre-step using go-unidecode so titles in Arabic, Cyrillic, Hebrew, Thai
etc. produce a readable ASCII slug. Latin-only and Chinese-only inputs
short-circuit and remain byte-identical to the previous output.

Gated by a package-level atomic flag (default on) exposed via
SetTransliterateNonLatin so an admin toggle can be wired up in a
follow-up PR without re-plumbing call sites.

Reviewer pointed out the fix changes slug generation for many non-Latin
scripts, not just Arabic. Pin the actual behavior across Thai, Japanese
hiragana, Korean, Hebrew, and Cyrillic so the test surface matches the
real scope of the change.

Also pin the pre-existing Japanese-kanji-via-pinyin path so reviewers
can see it is unchanged by this PR.
@ahmedqasid ahmedqasid force-pushed the fix/arabic-url-slug branch from b89301a to 3f124c8 May 23, 2026 19:29
@ahmedqasid ahmedqasid changed the title from "fix: transliterate non-Latin titles in URL slugs" to "fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration" May 23, 2026
@ahmedqasid
Author

Sorry for the late reply — thanks for the review, both points are fair.

Scope: You're right, this isn't Arabic-specific; it fixes the topic fallback for every non-Latin, non-CJK script slugify was stripping (Thai, Japanese hiragana, Korean, Hebrew, Cyrillic, …). I've leaned into that: retitled the PR, added a "What this is / isn't" section, and added a scope table with before/after per script.

Transliteration quality: Agreed. go-unidecode is a generic Unicode→ASCII approximation, not a per-language romanizer (こんにちは → konnichiha, ไทย → aithy). I've documented this as an explicit non-goal rather than implying correctness. The slug is just a stable, URL-safe handle; the canonical reference is the question ID.

Tests: Expanded into a table-driven matrix — one case per affected script, plus pins for the unchanged paths (Latin, Chinese/pinyin, Japanese kanji, emoji, empty) so existing behavior stays byte-identical.

If you'd rather scope this to Arabic only, containsNonLatin is the one place to tighten — but that leaves the other scripts collapsing to topic. I think the broader fix is the better call, happy to narrow if you disagree.

@LinkinStars LinkinStars self-requested a review May 25, 2026 02:49
@LinkinStars LinkinStars self-assigned this May 25, 2026