Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support lots of ligatures #53

Merged
merged 17 commits into from
Jun 6, 2024
Merged

Conversation

Jules-Bertholet
Copy link
Contributor

@Jules-Bertholet Jules-Bertholet commented May 28, 2024

Arabic Lam-Alef ligature

Unicode:

The lam-alef type of ligatures are extremely common in the Arabic script. These ligatures occur in almost all font designs, except for a few modern styles. When supported by the style of the font, lam-alef ligatures are considered obligatory. This means that all character sequences rendered in that font, which match the rules specified in the following discussion, must form these ligatures.

In practice, the three monospace Arabic fonts that seem to be most widespread (Courier New, Simplified Arabic Fixed, and Kawkab Mono) all render this ligature with width 1.

Buginese <a, -i> ya ligature

Documented in Unicode.

Hebrew Alef-Lamed ligature

This one is not documented in the Unicode standard, but all the Hebrew monospace fonts I could find have it. And it requires an explicit ZWJ to get, so there is no real risk in including it.

Khmer coeng signs

Documented in Unicode.

Old Turkic ligature

Documented in Unicode.

Tifinagh bi-consonants

Documented in Unicode.

Emoji modifiers and ZWJ sequences

Documented in UTS 51.

U+17D8 KHMER SIGN BEYYAL

According to the charts, this character should be equivalent to U+17D4 U+179B U+17D4, so give it the same width as that sequence.


This PR also fixes a bug with canonical equivalence for the CJK width of certain strings containing '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. I can try to split that out if you'd prefer.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a huge fan of the python getting super complex but it's fine

class EffectiveWidth(enum.IntEnum):
"""Represents the width of a Unicode character. All East Asian Width classes resolve into
either `EffectiveWidth.NARROW`, `EffectiveWidth.WIDE`, or `EffectiveWidth.AMBIGUOUS`.
def to_sorted_ranges(iter: Iterable[Codepoint]) -> list[tuple[Codepoint, Codepoint]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: given the amount of logic that is ending up in Python here I think it may make sense to switch to a rust script for this. unfortunately that will make CI much slower since a rust script would need HTTP deps

Perhaps we can instead put more work into documenting and cleaning up the python. you've been doing pretty well so far!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a rust script would need HTTP deps

Or we could just shell out to curl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not opposed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(let's not do that in this PR, though)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we could also just check in the fetched file and have a CI job ensure it's up to date.

@Manishearth Manishearth merged commit afab363 into unicode-rs:master Jun 6, 2024
2 checks passed
@Jules-Bertholet Jules-Bertholet deleted the ligatures branch June 6, 2024 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants