fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration by ahmedqasid · Pull Request #1526 · apache/answer

ahmedqasid · 2026-05-15T13:34:57Z

fix: avoid `topic` fallback for non-Latin titles via pragmatic ASCII transliteration

Scope update (in response to review): this PR is intentionally broader than its original "Arabic-only" framing. The implementation changes URL slug generation for every non-Latin, non-CJK script that slugify previously stripped — see Scope below for the explicit list. The goal is not linguistically correct romanization; it is "avoid collapsing to /topic by producing a usable ASCII slug."

What this PR is (and isn't)

Goal: when a question title contains characters outside Basic Latin / Latin Extended / CJK Han, generate a URL slug that is a deterministic ASCII approximation instead of letting slugify strip everything and falling back to the literal "topic".

Non-goal: this is not a linguistically correct multi-language romanizer. The output is a machine-acceptable ASCII slug, not what a native speaker would choose. For example, こんにちは → konnichiha (not the more natural kon'nichiwa), ไทย → aithy (not thai). Treat the slug as an opaque, stable, indexable identifier: the path segment after /questions/<id>/ exists for SEO and shareability, while the canonical reference is always the ID.

The bug

Pure non-Latin titles previously got stripped by slugify.Slugify, hit the empty-result fallback in htmltext.UrlTitle, and collapsed to the literal slug "topic". On a live multilingual site, every Arabic / Thai / Japanese-hiragana / Korean / Hebrew / Cyrillic question ended up at /questions/<id>/topic.

The fix

UrlTitle() gets a convertNonLatin pre-step that mirrors the existing convertChinese pre-step pattern, using github.com/mozillazg/go-unidecode (same author as go-pinyin already in the repo, to minimise new-dep friction).

UrlTitle(title)
  → convertChinese(title)        // pre-existing: Han-block → pinyin
  → convertNonLatin(title)       // NEW: detect non-Latin letters → unidecode to ASCII
  → clearEmoji / slugify / url.QueryEscape / cutLongTitle (unchanged)

The non-Latin detector skips ASCII, Latin-1 Supplement, Latin Extended-A/B, and CJK Han. Inputs with no letters outside those ranges short-circuit and are returned unchanged, so Latin-only and Chinese-only inputs remain byte-identical (pinned by tests).
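As a rough illustration of the detector described above, the sketch below uses only the standard `unicode` package. It is an assumption about how such a check could look, not the PR's actual `containsNonLatin`: the U+024F cutoff (the end of Latin Extended-B) and the use of `unicode.Han` and `unicode.IsLetter` are illustrative choices.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsNonLatin reports whether s holds at least one letter outside the
// ranges the existing pipeline already handles.
// Hypothetical reimplementation; the real detector in the PR may differ.
func containsNonLatin(s string) bool {
	for _, r := range s {
		switch {
		case r <= 0x024F:
			// Basic Latin, Latin-1 Supplement, Latin Extended-A/B: skip.
			continue
		case unicode.Is(unicode.Han, r):
			// Han ideographs are left to the pinyin pre-step: skip.
			continue
		case unicode.IsLetter(r):
			// Any other letter (Arabic, Thai, kana, Hangul, Hebrew, Cyrillic, ...).
			return true
		}
		// Non-letters such as emoji, digits, and punctuation fall through.
	}
	return false
}

func main() {
	titles := []string{"hello world", "这是一个标题", "كيف حالك", "😂😂😂", "Привет мир"}
	for _, title := range titles {
		fmt.Printf("%q -> %v\n", title, containsNonLatin(title))
	}
}
```

Under this sketch, Latin-only, Han-only, and emoji-only titles report false and would be passed through untouched, while the Arabic and Cyrillic titles report true and would be handed to the transliteration step.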

Scope — what scripts are affected

This PR changes behavior for any title containing letters in scripts that slugify doesn't handle. The before/after values below are confirmed by tests in pkg/htmltext/htmltext_test.go:

Script	Example title	Before	After
Arabic	`كيف حالك`	`topic`	`kyf-hlk`
Mixed Latin + Arabic	`مرحبا hello`	`hello`	`mrhb-hello`
Thai	`ไทย ไทย`	`topic`	`aithy-aithy`
Japanese hiragana	`こんにちは`	`topic`	`konnichiha`
Korean	`안녕하세요`	`topic`	`annyeonghaseyo`
Hebrew	`שלום עולם`	`topic`	`shlvm-vlm`
Cyrillic	`Привет мир`	`topic`	`privet-mir`

Unchanged:

Case	Behavior
Pure Latin (`hello world`)	unchanged → `hello-world`
Pure Chinese (`这是一个，标题，title`)	unchanged → `zhe-shi-yi-ge-biao-ti` (pinyin path)
Japanese with Han-block kanji (`日本`)	unchanged → `ri-ben` (caught by pre-existing pinyin path; treated as Chinese reading, not Japanese — a pre-existing limitation, not introduced by this PR)
Emoji only (`😂😂😂`)	unchanged → `topic`
Empty / whitespace	unchanged → `topic`

Transliteration quality — explicit acknowledgement

go-unidecode is a generic Unicode → ASCII approximation. It is not a per-language romanization library. Specifically:

It will pick one approximation per codepoint regardless of language context. ใ → ai (Thai romanization is i or ai depending on standard), 한 → han, 語 → Yu (Chinese pinyin reading even when used in Japanese), etc.
The result is good enough to be a stable, URL-safe, human-recognizable handle, but speakers of the source language will not consider it "correct."
It is deterministic, so the same title always produces the same slug — important since url_title is recomputed on every request.
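To make the "one approximation per codepoint" point concrete, here is a minimal sketch of a context-free lookup. The table and the `approximate` function are hypothetical stand-ins written for this illustration; they are not go-unidecode's API or data, and they cover only the seven Cyrillic letters needed for one example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// approxTable is a tiny, hypothetical stand-in for the kind of
// per-code-point lookup a generic Unicode-to-ASCII library performs.
var approxTable = map[rune]string{
	'\u041f': "P", // П
	'\u0440': "r", // р
	'\u0438': "i", // и
	'\u0432': "v", // в
	'\u0435': "e", // е
	'\u0442': "t", // т
	'\u043c': "m", // м
}

// approximate maps each rune independently: ASCII passes through, mapped
// runes are replaced, and unmapped runes are dropped. No language context
// is consulted, so the same input always yields the same output.
func approximate(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r < utf8.RuneSelf {
			b.WriteRune(r)
			continue
		}
		b.WriteString(approxTable[r])
	}
	return b.String()
}

func main() {
	title := "\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440" // Привет мир
	first := approximate(title)
	second := approximate(title)
	fmt.Println(first, first == second) // prints "Privet mir true"
}
```

Because each rune is looked up in isolation, the result is stable across requests, but it also cannot pick a different reading for the same character depending on the language of the surrounding text.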

If maintainers prefer to scope this PR more narrowly (e.g. Arabic only, and reject Thai/Hebrew/Cyrillic/etc.), the detector in containsNonLatin can be tightened to specific Unicode blocks — but that means the other scripts continue to collapse to topic, which is the bug we're trying to fix. I'd argue the broader fix is preferable to a piecemeal one, but happy to narrow if you want.

Live deployment / real-world verification

This patch is running in production on ask.namasoft.com (an Apache Answer instance we operate), built directly from this branch via docker compose build. The site hosts Arabic-language questions, so the fix exercises the affected code path on every page load.

Sample question URL on the deployed instance:

https://ask.namasoft.com/questions/10010000000000115

The slug in the URL is the transliterated Arabic title rather than topic. No data migration was needed since url_title is computed on every request from Title and never persisted (see Why this is safe to ship below).

Admin-configurable

The transliteration is gated by a package-level atomic.Bool (default on, since the current behavior is objectively broken for affected users):

htmltext.SetTransliterateNonLatin(enabled bool)
htmltext.IsTransliterateNonLatinEnabled() bool

This is deliberately the minimum surface needed to satisfy "the setting must be readable from UrlTitle()". A follow-up PR can add an admin UI section that calls SetTransliterateNonLatin on save and on startup, without having to re-plumb every htmltext.UrlTitle call site through context.Context.

Default choice — please confirm: I picked default-on because the existing topic behavior is a bug for affected users. If you'd prefer default-off for strict backward compat on existing installs, flip the init() in pkg/htmltext/htmltext.go to Store(false) and surface the toggle as opt-in.
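The toggle described in this section could look roughly like the following. This is a sketch under assumptions: the two exported function names come from the signatures above, but the variable name `transliterateNonLatin` is hypothetical, and `atomic.Bool` requires Go 1.19 or newer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// transliterateNonLatin is a hypothetical name for the package-level flag.
// atomic.Bool makes reads and writes safe across goroutines without a mutex.
var transliterateNonLatin atomic.Bool

func init() {
	// Default on. Using Store(false) here instead would make it opt-in.
	transliterateNonLatin.Store(true)
}

// SetTransliterateNonLatin turns the transliteration pre-step on or off.
func SetTransliterateNonLatin(enabled bool) {
	transliterateNonLatin.Store(enabled)
}

// IsTransliterateNonLatinEnabled reports the current state of the flag.
func IsTransliterateNonLatinEnabled() bool {
	return transliterateNonLatin.Load()
}

func main() {
	fmt.Println(IsTransliterateNonLatinEnabled()) // prints "true"
	SetTransliterateNonLatin(false)
	fmt.Println(IsTransliterateNonLatinEnabled()) // prints "false"
	SetTransliterateNonLatin(true)                // restore the default
}
```

A package-level flag keeps every existing call site unchanged, at the cost of being process-wide state: an admin setting would need to call the setter once at startup and again whenever the setting is saved.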

Why this is safe to ship

url_title is not a persisted column. It's not on the Question entity in internal/entity/question_entity.go, no migration has ever added/dropped it, and every call site (question_service.go, revision_service.go, vote_service.go, search/report/review/rank/comment services, controllers, repos) recomputes it from Title at response-build time via htmltext.UrlTitle(...).
That means the fix is read-only: existing rows light up with correct slugs on the next request, with no migration and no data rewrite.
Rollback is just redeploying the prior image; nothing on disk changes.

Test coverage

pkg/htmltext/htmltext_test.go:

TestUrlTitleTable — table-driven, one case per affected script (the full matrix above), plus:
- empty → topic
- pure latin unchanged → byte-identical to pre-fix
- pure chinese unchanged → byte-identical to pre-fix (pins existing pinyin behavior)
- japanese kanji goes through pinyin path unchanged → documents the pre-existing Han-block limitation
- emoji only falls back to topic → unchanged
- long arabic truncates at cutLongTitle boundary → exercises the 150-byte cap and UTF-8 boundary safety
TestUrlTitleTransliterationToggle — with the toggle off, non-Latin titles collapse to topic (pre-fix behavior); with it on, they transliterate.
Existing TestUrlTitle left untouched.
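For readers unfamiliar with what "UTF-8 boundary safety" means for a byte cap, the sketch below shows one common way to truncate a string to a byte limit without splitting a multi-byte character. `truncateUTF8` is a hypothetical helper written for this illustration; the actual implementation of cutLongTitle is not shown in this thread and may work differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 returns s cut to at most maxBytes bytes, backing up as
// needed so the cut never lands in the middle of a multi-byte rune.
// Hypothetical helper; not the repository's cutLongTitle.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// s[cut] is the first byte that would be dropped. If it is a
	// continuation byte, the cut is inside a rune, so move left.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	arabic := "\u0643\u064a\u0641" // كيف: three letters, two bytes each
	out := truncateUTF8(arabic, 3)
	fmt.Println(len(out), utf8.ValidString(out)) // prints "2 true"
}
```

With a 3-byte cap, a naive `s[:3]` would keep half of the second letter and produce invalid UTF-8; backing up to the previous rune start keeps only the first letter.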

Test plan for reviewers:

go test ./pkg/htmltext/... — all pass
Visit the live sample URL above and confirm slug is transliterated, not topic
Verify Chinese / Latin / emoji-only / empty behavior is byte-identical to main (covered by table tests)

Out of scope (intentionally)

No admin UI / site setting plumbing in this PR — see Admin-configurable above. Happy to do the React Non-Latin Languages Handling admin page + SiteType + service / controller / migration in a follow-up if maintainers want it.
No change to the "topic" empty-result fallback.
No plugin interface for slug generation — mirrored the existing convertChinese pre-step pattern instead.
No per-language romanization library — this is an explicit non-goal; see Transliteration quality above.

Issues / discussion

I didn't find an existing upstream issue covering this — happy to be pointed at one if there is.

🤖 Generated with Claude Code

LinkinStars · 2026-05-19T06:27:48Z

I think this PR needs a clearer scope statement, because in its current form it does more than fix Arabic-only titles.

By adding go-unidecode before slugify, it changes slug generation for many non-Latin scripts, not just Arabic. For example, on my local reproduction:

مرحبا hello changed from hello to mrhb-hello
ไทย ไทย changed from topic to aithy-aithy
こんにちは changed from topic to konnichiha
안녕하세요 changed from topic to annyeonghaseyo

So this is not only an Arabic fix. It changes the behavior for Thai, Japanese, Korean, Hebrew, Cyrillic, and other scripts as well. I think the PR description and tests should reflect that broader impact explicitly.

The second concern is about transliteration quality. What this PR introduces is a generic ASCII approximation, not linguistically correct multi-language romanization. That may be acceptable as a pragmatic fallback to avoid collapsing to topic, but it is different from saying the generated slug is “correct” for each language. For example, こんにちは -> konnichiha is machine-acceptable as an ASCII slug, but it is not necessarily the natural or expected romanization users would want. The same concern applies to other languages as well.

If the goal of this PR is “avoid empty/topic fallback for non-Latin titles by generating a usable ASCII slug”, then I think that should be stated much more explicitly in the PR description and test coverage.

Commit messages (the two commits on this branch, by ahmedqasid):

Pure-Arabic (and other non-Latin) titles previously got stripped by slugify and collapsed to the "topic" fallback, so every Arabic question landed at /questions/<id>/topic. Mirror the existing convertChinese pre-step using go-unidecode so titles in Arabic, Cyrillic, Hebrew, Thai etc. produce a readable ASCII slug. Latin-only and Chinese-only inputs short-circuit and remain byte-identical to the previous output. Gated by a package-level atomic flag (default on) exposed via SetTransliterateNonLatin so an admin toggle can be wired up in a follow-up PR without re-plumbing call sites.

Reviewer pointed out the fix changes slug generation for many non-Latin scripts, not just Arabic. Pin the actual behavior across Thai, Japanese hiragana, Korean, Hebrew, and Cyrillic so the test surface matches the real scope of the change. Also pin the pre-existing Japanese-kanji-via-pinyin path so reviewers can see it is unchanged by this PR.

ahmedqasid · 2026-05-23T19:36:44Z

Sorry for the late reply — thanks for the review, both points are fair.

Scope: You're right, this isn't Arabic-specific. It fixes the topic fallback for every non-Latin, non-CJK script slugify was stripping (Thai, Japanese hiragana, Korean, Hebrew, Cyrillic, …). I've leaned into that: retitled the PR, added a "What this is / isn't" section, and a scope table with before/after per script.

Transliteration quality: Agreed. go-unidecode is a generic Unicode→ASCII approximation, not a per-language romanizer (こんにちは → konnichiha, ไทย → aithy). I've documented this as an explicit non-goal rather than implying correctness. The slug is just a stable, URL-safe handle; the canonical reference is the question ID.

Tests: Expanded into a table-driven matrix — one case per affected script, plus pins for the unchanged paths (Latin, Chinese/pinyin, Japanese kanji, emoji, empty) so existing behavior stays byte-identical.

If you'd rather scope this to Arabic only, containsNonLatin is the one place to tighten — but that leaves the other scripts collapsing to topic. I think the broader fix is the better call, happy to narrow if you disagree.

ahmedqasid added 2 commits May 23, 2026 22:29

ahmedqasid force-pushed the fix/arabic-url-slug branch from b89301a to 3f124c8 May 23, 2026 19:29

ahmedqasid changed the title ~~fix: transliterate non-Latin titles in URL slugs~~ fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration May 23, 2026

LinkinStars self-requested a review May 25, 2026 02:49

LinkinStars self-assigned this May 25, 2026

ahmedqasid wants to merge 2 commits into apache:main from ahmedqasid:fix/arabic-url-slug

2 participants
