Number formatting for many languages and sorting (collation) with ICU #1632

Merged
merged 5 commits into from
Nov 30, 2022


Omikhleia
Member

@Omikhleia Omikhleia commented Nov 27, 2022

Rationale

  • SILE should be able to format numbers correctly, in the appropriate numbering system and with the appropriate format style, for as many languages as possible, and not just a selected few ("en", "eo", "tr"):
    • in "default" form (en: 1234 or ar: ١٢٣٤ or zh: 1234)
    • in "decimal" form, i.e. usually with thousands separator etc. (1,234 or ١٬٢٣٤ or 1,234)
    • in "ordinal" form, i.e. the "nth's" (1,234th or ١٬٢٣٤. or 第1,234)
    • as an extra in spelt-out form (one thousand two hundred etc., or some stuff I can't copy, or 一千二百三十四)
    • EDIT (N.B.): I picked the most neutral names I could think of for these. What the typographic rules of a given language say may be different; e.g. some French typographers distinguish a cardinal use (= "decimal", with a separator, in that case a space) from an ordinal use (as in a year, a page number, etc. = "default", without separator). The terminology is sometimes ambiguous, it seems, because there are of course also the actual ordinals (with spaces at thousands, the raised "e", etc.).
  • SILE should be able to sort strings according to the rules of the (main) language: the indexer and packages of that vein need this, or their sorted entries will be garbage.
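
To make the three numeric format styles above concrete, here is a minimal pure-Lua sketch for English only. It is an illustration, not the code in this PR (which delegates to ICU); the names `groupThousands`, `englishOrdinalSuffix` and `formatNumber` are hypothetical and exist only in this example.

```lua
-- Toy illustration (English rules hard-coded); the PR itself relies on ICU.

-- Insert `sep` between groups of three digits: 1234567 -> "1,234,567".
-- Assumes `sep` contains no "%" character (it is used in a gsub replacement).
local function groupThousands(n, sep)
  local out = string.format("%d", n)
  while true do
    local replaced, count = out:gsub("^(%d+)(%d%d%d)", "%1" .. sep .. "%2")
    out = replaced
    if count == 0 then break end
  end
  return out
end

-- English ordinal suffix: 1st, 2nd, 3rd, 4th, 11th, 12th, 13th, 21st...
local function englishOrdinalSuffix(n)
  local lastTwo = n % 100
  if lastTwo >= 11 and lastTwo <= 13 then return "th" end
  local last = n % 10
  if last == 1 then return "st" end
  if last == 2 then return "nd" end
  if last == 3 then return "rd" end
  return "th"
end

-- Hypothetical dispatcher over the style names used in this PR description.
local function formatNumber(n, style)
  if style == "default" then
    return string.format("%d", n)
  elseif style == "decimal" then
    return groupThousands(n, ",")
  elseif style == "ordinal" then
    return groupThousands(n, ",") .. englishOrdinalSuffix(n)
  end
  error("unsupported style: " .. tostring(style))
end

print(formatNumber(1234, "default"), formatNumber(1234, "decimal"), formatNumber(1234, "ordinal"))
```

Hard-coding one language like this is exactly what does not scale, which is the motivation for using a library that carries the rules for many locales.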

First part (number formatting) = Closes #1630

Notes

  • Deprecations are currently set to 0.14.6 for removal in 0.16.0
  • For counters, this is slightly breaking if we consider what was in tests/counters.sil to be a "reference" of sorts, though it was fully undocumented. I tend to think this is acceptable (for the main intended use of counters, i.e. sectioning and lists). It's unlikely to affect many users anyway, and the benefits are worth it: we get "default" (cardinal), "decimal" (with thousands separator, etc.), "ordinal" (our former "nth"), as well as "string" (spelt-out) support, for most languages (i.e. as long as they are supported by ICU).

In French, one writes « en l'an 1984 » ("in the year 1984"), but « dans 4 500 ans » ("in 4,500 years").
English and other languages probably have similar rules for when to use 1984 or 1,984.

N.B. I haven't considered all format style options from ICU (e.g. "duration", "currency", etc.) because I felt they were unlikely to be needed. They could be added if someone really wants them, but I'm not convinced they would have a real use in SILE (and in my tests at least, the ICU library is somewhat inconsistent in how it honors these).

Second part (string sorting) = Add language-dependent sorting with ICU collation options

As noted, an imperative for indexes (= relates to #1339), etc.

> SILE.call("language", { main = "fr" })
> t = { "Jean100", "Jean2", "Alinoé", "Alinéa" }
> table.sort(t)
> print(table.concat(t, ", "))

--> Alinoé, Alinéa, Jean100, Jean2 = bad indexing order

> t = { "Jean100", "Jean2", "Alinoé", "Alinéa" }
> SU.collatedSort(t)
> print(table.concat(t, ", "))

--> Alinéa, Alinoé, Jean2, Jean100 = good indexing order, yay!

Options can be passed directly to SU.collatedSort, or defined in SU.collatedSort.xx as a language-specific override of the default values, if need be.
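
To illustrate why the plain `table.sort` above gives a bad order and what a collation step has to do, here is a toy pure-Lua sketch. It is not how SU.collatedSort works (that goes through ICU); `FOLD`, `sortKey` and `toySort` are hypothetical names, the accent table covers only a handful of letters, and the source file is assumed to be UTF-8 encoded.

```lua
-- Toy collation: compare strings through a derived sort key instead of raw bytes.

-- Tiny accent-folding table (assumption: nowhere near complete).
local FOLD = {
  ["é"] = "e", ["è"] = "e", ["ê"] = "e",
  ["à"] = "a", ["ç"] = "c", ["ô"] = "o",
}

local function sortKey(s)
  -- Replace each multi-byte UTF-8 sequence found in FOLD by its base letter;
  -- sequences not in the table are left untouched (the callback returns nil).
  local folded = s:gsub("[\192-\255][\128-\191]*", function (ch) return FOLD[ch] end)
  folded = folded:lower()
  -- Zero-pad digit runs so that "2" sorts before "100" (numeric ordering).
  return (folded:gsub("%d+", function (digits)
    return string.rep("0", 10 - #digits) .. digits
  end))
end

local function toySort(t)
  table.sort(t, function (a, b)
    local ka, kb = sortKey(a), sortKey(b)
    if ka ~= kb then return ka < kb end
    return a < b -- tie-break on raw bytes to keep the order deterministic
  end)
  return t
end

local t = { "Jean100", "Jean2", "Alinoé", "Alinéa" }
toySort(t)
print(table.concat(t, ", "))
```

Byte-wise comparison puts "o" before the first byte of "é" and "1" before "2", hence the bad order; folding accents and padding digit runs removes both problems for this small input. Real collation rules are far richer and language-dependent, which is why the PR uses ICU rather than anything like this.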

Mentioned Roman (title case) but actually needed ROMAN (upper case)
Update deprecated calls.

Remove display formats that were undocumented and mixed the
concepts of numbering system and format style. Slightly breaking,
possibly, but if this was supposed to be a real feature, a syntax
extension to counters would be better...

Note that the arabic-indic example was using the decimal form ١٬٩٨٤ and
now uses the default form ١٩٨٤. Such a large value is unlikely to be a
common case for counters... But this seems more consistent, e.g. with
what CSS "list-style: arabic-indic;" also does, so likely better.
@Omikhleia Omikhleia changed the title from "Extend number formatting to many languages with ICU" to "Number formatting for many languages and sorting (collation) with ICU" Nov 28, 2022
@Omikhleia Omikhleia marked this pull request as ready for review November 28, 2022 19:49
@alerque
Member

alerque commented Nov 29, 2022

Cardinal vs. ordinal is also very applicable in Turkish, and to a lesser degree in English. It's probably a distinction we should make, supplying both rather than assuming one or the other based on the most common use case.

Otherwise 💯 to the analysis here.

@alerque alerque added the enhancement Software improvement or feature request label Nov 29, 2022
@alerque alerque added this to the v0.14.6 milestone Nov 29, 2022
core/utilities-numbers.lua (review thread resolved)
@alerque
Member

alerque commented Nov 29, 2022

I'm looking into dropping the bits of our implementations that are now duplicated. Somewhat amusingly our "string" (spellout) implementation in English handles bigger numbers than the ICU one which starts failing at 32 bits. It leaves off at quadrillions, ours goes on to handle quintillion, sextillion, septillion, and octillion!

Similarly for the Turkish implementation, ours is more complete.

The English apostrophe thing is probably wrong according to most style guides though.

@Omikhleia
Member Author

(our implementation) is more complete.

Yep, or even better (with tunable settings, in the case of Esperanto). I'm OK with SILE having a mechanism for bypassing ICU when it can do better; I didn't intend for ICU to necessarily replace our implementations. The main point is to address other languages generically with decent defaults. [1]

Footnotes

  1. That is what we already do e.g. for node breakers, defaulting to ICU breakpoints but possibly overriding the logic, as in SILE.nodeMakers.fr, SILE.nodeMakers.ja, etc. (to handle, respectively, punctuation, some JIS class things, etc.).
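
The "generic default, with per-language override when we can do better" idea discussed above can be sketched as a simple lookup with fallback. This is only an illustration of the pattern, not SILE's actual dispatch code: `overrides`, `genericSpellout` and `spellout` are hypothetical names, and the generic branch is a stand-in for whatever the ICU-backed path would return.

```lua
-- Sketch of "language-specific override, else generic default".

-- Esperanto words for 0..9, used as a deliberately tiny override.
local EO_DIGITS = {
  [0] = "nul", "unu", "du", "tri", "kvar", "kvin", "ses", "sep", "ok", "naŭ",
}

-- Per-language overrides; an override may return nil to defer to the default.
local overrides = {
  eo = function (n) return EO_DIGITS[n] end, -- nil outside 0..9
}

-- Stand-in for the generic (ICU-backed) path; it only returns a placeholder.
local function genericSpellout(n, lang)
  return string.format("<generic:%d>", n)
end

local function spellout(n, lang)
  local override = overrides[lang]
  local result = override and override(n)
  return result or genericSpellout(n, lang)
end

print(spellout(3, "eo"), spellout(42, "eo"), spellout(3, "fr"))
```

Letting an override return nil keeps the generic path as the safety net, so a language-specific implementation only needs to cover the cases where it does better.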

* TR output should be identical for small numbers, better with thousands
  separators for large numbers.
* EN output is always different but arguably better. We had invalid
  apostrophes after numbers (correct in ICU) and also didn't have
  thousands separators.
@alerque alerque merged commit cd329e8 into sile-typesetter:master Nov 30, 2022
Development

Successfully merging this pull request may close these issues.

Number formatting in foreign languages