
Proposal: I18n: Support non-English hyphenation dictionaries #232

@na4zagin3

This proposal adds hyphenation support for non-English languages, as a first step toward internationalization.

Proposal

  • Add a new type:
    • hyphen-dict : a hyphenation dictionary. The underlying OCaml representation is LoadHyph.t.
  • Add new primitives:
    • load-hyphen-dict : string -> hyphen-dict
    • set-hyphen-dict : hyphen-dict -> ctx -> ctx
    • get-hyphen-dict : ctx -> hyphen-dict
  • Use a BCP 47 language tag or a UTS #35 language identifier as the filename of each hyphenation dictionary file.
    • The current hyphenation file english.satysfi-hyph needs to be renamed to en.satysfi-hyph.

load-hyphen-dict language loads a hyphenation dictionary from hyph/<language>.satysfi-hyph. It raises an exception when the file is not found.

set-hyphen-dict hyph ctx returns ctx with ctx.hyphen_dictionary set to the hyphenation dictionary hyph.

get-hyphen-dict ctx returns the hyphenation dictionary ctx.hyphen_dictionary.
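
To make the intended behaviour of the three primitives concrete, here is a minimal OCaml sketch. It is not the actual implementation: the record types are hypothetical stand-ins for LoadHyph.t and for the real context, the exception name Hyphen_dict_not_found and the font_size field are invented for illustration, and the existence check is passed in as a parameter instead of going through the library path resolution.

```ocaml
(* Hypothetical stand-in for LoadHyph.t; the real type holds parsed patterns. *)
type hyphen_dict = { source : string }

(* Simplified context; font_size is a placeholder for the many other fields. *)
type ctx = {
  hyphen_dictionary : hyphen_dict;
  font_size : float;
}

(* Assumed exception name; the proposal only says "raises an exception". *)
exception Hyphen_dict_not_found of string

(* Build the relative path hyph/<language>.satysfi-hyph. *)
let dict_path (language : string) : string =
  Filename.concat "hyph" (language ^ ".satysfi-hyph")

(* load-hyphen-dict : string -> hyphen-dict
   [exists] is injectable so the sketch can run without real files. *)
let load_hyphen_dict ?(exists = Sys.file_exists) (language : string) : hyphen_dict =
  let path = dict_path language in
  if exists path then { source = path }
  else raise (Hyphen_dict_not_found path)

(* set-hyphen-dict : hyphen-dict -> ctx -> ctx *)
let set_hyphen_dict (hyph : hyphen_dict) (ctx : ctx) : ctx =
  { ctx with hyphen_dictionary = hyph }

(* get-hyphen-dict : ctx -> hyphen-dict *)
let get_hyphen_dict (ctx : ctx) : hyphen_dict =
  ctx.hyphen_dictionary
```

Because set_hyphen_dict returns a new context instead of mutating a global, a document could switch dictionaries for a single paragraph and keep the outer context unchanged.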

Current Implementation

  • The English hyphenation dictionary is located at lib-satysfi/dist/hyph/english.satysfi-hyph.
  • english.satysfi-hyph is loaded by:
    default_hyphen_dictionary := LoadHyph.main (Config.resolve_lib_file_exn (make_lib_path "dist/hyph/english.satysfi-hyph"));
  • The only operation that sets hyphen_dictionary is get_pdf_mode_initial_context, in:
    hyphen_dictionary = !default_hyphen_dictionary;

Alternative Options

Activate multiple hyphen-dicts at the same time

This proposal is based on a design in which users replace the English hyphenation dictionary with that of another language. If we decide to extend the multi-language system, where English and Japanese are detected automatically by script type, it may be more natural to assign a hyphenation dictionary to each language or script (i.e., set-hyphen-dict : language-tag -> hyphen-dict -> ctx -> ctx or set-hyphen-dict : hyphen-dict language-tag-map -> ctx -> ctx) rather than applying a single dictionary globally.
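
As a rough illustration of the per-language variant, the following OCaml sketch stores one dictionary per language tag in the context. The names hyphen_dictionaries, find_hyphen_dict and font_size are hypothetical, and hyphen_dict is again only a stand-in for LoadHyph.t; how tags would be matched against detected scripts is left open here.

```ocaml
module LangMap = Map.Make (String)

(* Hypothetical stand-in for LoadHyph.t. *)
type hyphen_dict = { source : string }

(* Simplified context holding one dictionary per language tag. *)
type ctx = {
  hyphen_dictionaries : hyphen_dict LangMap.t;
  font_size : float; (* placeholder for the other context fields *)
}

(* set-hyphen-dict : language-tag -> hyphen-dict -> ctx -> ctx *)
let set_hyphen_dict (tag : string) (hyph : hyphen_dict) (ctx : ctx) : ctx =
  { ctx with hyphen_dictionaries = LangMap.add tag hyph ctx.hyphen_dictionaries }

(* Look up the dictionary for a detected language.
   None means no dictionary has been registered for that tag. *)
let find_hyphen_dict (tag : string) (ctx : ctx) : hyphen_dict option =
  LangMap.find_opt tag ctx.hyphen_dictionaries
```

With this shape, registering a German dictionary would leave the English one active for English text, at the cost of a more complex context.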

Introducing new type hyphen-dict

Instead of introducing hyphen-dict and having users handle hyphenation dictionaries explicitly, we could provide primitives that get and set a string naming the language (e.g., set-hyphen-dict : string -> ctx -> ctx).

However, a dedicated hyphen-dict type leaves room for future extensions (e.g., tweaking hyphenation patterns or adding exceptional words ad hoc).

load-hyphen-dict throwing exceptions

load-hyphen-dict could instead have the signature load-hyphen-dict : string -> hyphen-dict option. I don't have a strong opinion on this. I am thinking of shipping a separate package for each language, so specifying a wrong filename should be unlikely.
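
For comparison, an option-returning loader might look like the OCaml sketch below. As before, hyphen_dict is a hypothetical stand-in for LoadHyph.t, the existence check is injected so the code runs without real files, and load_or_default is an invented helper that shows what callers would have to write.

```ocaml
(* Hypothetical stand-in for LoadHyph.t. *)
type hyphen_dict = { source : string }

(* load-hyphen-dict : string -> hyphen-dict option
   Returns None instead of raising when the file is missing. *)
let load_hyphen_dict_opt ?(exists = Sys.file_exists) (language : string)
    : hyphen_dict option =
  let path = Filename.concat "hyph" (language ^ ".satysfi-hyph") in
  if exists path then Some { source = path } else None

(* Invented helper: fall back to a default dictionary when loading fails. *)
let load_or_default ?exists ~(default : hyphen_dict) (language : string)
    : hyphen_dict =
  match load_hyphen_dict_opt ?exists language with
  | Some dict -> dict
  | None -> default
```

The option version forces every caller to handle the missing-file case, which is more verbose when dictionaries are installed by a package and expected to exist.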

Having a primitive to get available hyphenation dictionary files

I could include another primitive, get-hyph-dict-list, that returns the dictionaries available under hyph/ (for example, returning [ "en" ]). This primitive is not mandatory.
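
One possible shape for such a primitive is sketched below in OCaml. It assumes the directory is given as a plain path and that every file ending in .satysfi-hyph counts as a dictionary; tags_of_filenames is a hypothetical helper, and the sorting is an arbitrary choice made to keep the result deterministic.

```ocaml
let suffix = ".satysfi-hyph"

(* Keep only names ending in the dictionary suffix and strip it,
   so "en.satysfi-hyph" becomes "en". *)
let tags_of_filenames (names : string list) : string list =
  names
  |> List.filter (fun name -> Filename.check_suffix name suffix)
  |> List.map (fun name -> Filename.chop_suffix name suffix)
  |> List.sort compare

(* get-hyph-dict-list: list the language tags found in [dir].
   A missing directory yields the empty list rather than an error. *)
let get_hyph_dict_list (dir : string) : string list =
  if Sys.file_exists dir && Sys.is_directory dir then
    tags_of_filenames (Array.to_list (Sys.readdir dir))
  else []

(* tags_of_filenames ["en.satysfi-hyph"; "README"; "de.satysfi-hyph"]
   evaluates to ["de"; "en"] *)
```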

Renaming english.satysfi-hyph to en.satysfi-hyph

We could leave the filename as is. However, given that even TeX has already adopted a naming scheme based on BCP 47 language tags, there is no reason to stick to the traditional scheme of English language names.

And, as it happens, the hyphenation files of that thing called TeX are now named like this too. #TeX pic.twitter.com/48vtJFz8G7

— 某ZR(ざんねん🙃) (@zr_tex8r) January 11, 2020
