Preserving clusters with Common script bases in script segmentation #13

Open
NorbertLindenberg opened this issue Jul 24, 2019 · 2 comments

Comments

@NorbertLindenberg

The algorithms documented in Script Segmentation have one problem in common: They do not guarantee that clusters (in the OpenType sense) whose base has the script property Common are kept together. The most important such bases are U+00A0 NO-BREAK SPACE, which The Unicode Standard (page 60) recommends as the base for showing nonspacing marks in isolation, and U+25CC DOTTED CIRCLE, which is commonly used in code charts or on keyboards for the same purpose.

The Script Segmentation document discusses this issue, but concludes:

The fact that the run breaking algorithm may miscategorise the script of a common character is not a problem unless that character undergoes specific script only styling. If the C characters here should be rendered/shaped differently according to whether they resolve to script A or B, then their correct categorisation becomes important.

I believe this underestimates the problem. The failure to keep such clusters together leads to the following issues:

  • OpenType shaping engines or AAT fonts may insert additional dotted circles before combining marks that reach them without the base character that the author has already provided. For example, if I write ◌ៀ (that’s U+25CC DOTTED CIRCLE and U+17C0 KHMER VOWEL SIGN IE), you’ll very likely see two dotted circles where the document contains only one. However, this behavior depends on both the textual context and the specific implementations of OpenType shaping engines and AAT fonts, so you can’t rely on it even if you do want a dotted circle. This issue has been extensively documented by Richard Ishida (“The Combining Character Conundrum”) and by Marc Durdin.
  • If a cluster has more than one combining mark, OpenType shaping engines or AAT fonts may break it up even further by inserting a dotted circle before every single mark. This is especially toxic for some Brahmic scripts, where a syllable may include virama characters that are not known outside of Unicode but are used to encode subjoined consonants. For example, if I try to show the subjoined Myanmar consonant wa using ◌္ဝ (that’s U+25CC DOTTED CIRCLE, U+1039 MYANMAR SIGN VIRAMA, U+101D MYANMAR LETTER WA), you’ll very likely see a character ္ that doesn’t occur outside of Unicode, and the consonant ဝ to the right rather than below the base.
  • If a font is designed to properly position combining marks relative to the no-break space or dotted circle, or to use wider variants of them to accommodate wider marks, or to raise the base line of all base glyphs to increase the space available for below-base marks, this doesn’t work if the cluster is broken up. (Such features would be the “rendered/shaped differently according to whether they resolve to script A or B” mentioned in the document, and I’ve seen all of them applied in actual fonts.)
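
To make the cluster structure of the two examples above concrete, here is a small Python sketch that lists the code points involved together with their general categories. The helper name `describe` is made up for illustration; it only relies on the standard `unicodedata` module, which exposes the general category but not the script property.

```python
import unicodedata

def describe(text):
    """List (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

# Dotted circle + Khmer vowel sign: a Common-script base followed by a Khmer mark.
for row in describe("\u25CC\u17C0"):
    print(*row)

# Dotted circle + Myanmar virama + letter wa: the virama is a mark,
# the letter that follows it is not.
for row in describe("\u25CC\u1039\u101D"):
    print(*row)
```

In both sequences the dotted circle is immediately followed by a character with general category M, which is the situation the segmentation algorithms need to detect.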

To solve this problem, algorithms that segment text into script runs should check whether each Common script base character is followed by combining marks, and, if so, give any script that can be determined from such marks (other than Common or Inherited) preference over any script determined from preceding characters.

The set of Common script characters that should be considered bases for this purpose needs to be determined. A candidate set would be those Common script characters that the Universal Shaping Engine classifies as BASE_OTHER.

A full solution to the issues described above requires similar care in breaking text into font runs (e.g., when using fallback fonts), but let's start here.

@mhosken
Collaborator

mhosken commented Jul 24, 2019

Might it be argued that there should be no script change before a character with general category M (that is any M: Mn, Mc, etc.)?

@NorbertLindenberg
Author

NorbertLindenberg commented Jul 25, 2019

@mhosken I'm not sure what exactly you’re proposing. If it’s a clarification for the phrase “combining mark” in my proposed solution, then yes.

More precisely, for a string

AAACC◌MMM

where A are characters with script 𝑨, C are characters with script Common, ◌ is dotted circle or another Common script base, M are characters with general category M[cen] and script 𝑴, 𝑴≠𝑨, then the script boundary should be placed at

AAACC|◌MMM

instead of

AAACC◌|MMM
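
The boundary placement above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the Script Segmentation document's algorithm: `script_of` is a toy lookup covering just the ranges needed for the examples (real code would use the Unicode Script property data), `COMMON_BASES` holds only the two bases named in this issue (the actual set is still to be determined), and `segment` with its `keep_clusters` flag is a made-up name. Common characters simply resolve to the preceding script, which is a simplification of what real run-breaking algorithms do.

```python
import unicodedata
from itertools import groupby

NEUTRAL = {"Zyyy", "Zinh"}            # Common and Inherited
COMMON_BASES = {"\u00A0", "\u25CC"}   # ASSUMPTION: placeholder set, see issue text

def script_of(ch):
    """Toy script lookup (ASSUMPTION): only the ranges used in the examples."""
    cp = ord(ch)
    if 0x1780 <= cp <= 0x17FF:
        return "Khmr"
    if 0x1000 <= cp <= 0x109F:
        return "Mymr"
    if 0x0300 <= cp <= 0x036F:
        return "Zinh"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"

def segment(text, keep_clusters=True):
    """Split text into (substring, script) runs.

    With keep_clusters=True, a Common base followed by marks takes the first
    specific script found among those marks, instead of the preceding script.
    """
    resolved, current, i = [], None, 0
    while i < len(text):
        j = i + 1
        sc = script_of(text[i])
        if sc not in NEUTRAL:
            current = sc
        elif keep_clusters and text[i] in COMMON_BASES:
            # Look ahead over all following characters with general category M*.
            while j < len(text) and unicodedata.category(text[j]).startswith("M"):
                j += 1
            specific = [s for s in map(script_of, text[i + 1:j]) if s not in NEUTRAL]
            if specific:
                current = specific[0]
        resolved.extend([current] * (j - i))
        i = j
    # Leading Common characters take the first resolved script, if any.
    first = next((s for s in resolved if s is not None), "Zyyy")
    resolved = [s or first for s in resolved]
    runs, pos = [], 0
    for sc, group in groupby(resolved):
        n = len(list(group))
        runs.append((text[pos:pos + n], sc))
        pos += n
    return runs

print(segment("abc: \u25CC\u17C0"))                       # boundary before the base
print(segment("abc: \u25CC\u17C0", keep_clusters=False))  # boundary after the base
```

With `keep_clusters=False` the dotted circle stays in the Latin run and the Khmer vowel sign starts a run of its own, which is the AAACC◌|MMM case; with the lookahead enabled the base and its mark end up in the same Khmer run.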
