Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locale not canonicalizing tlangs to lowercase #1925

Closed
jedel1043 opened this issue May 22, 2022 · 4 comments · Fixed by #4134
Closed

Locale not canonicalizing tlangs to lowercase #1925

jedel1043 opened this issue May 22, 2022 · 4 comments · Fixed by #4134
Labels
C-locale Component: Locale identifiers, BCP47 help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) T-bug Type: Bad behavior, security, privacy

Comments

@jedel1043
Copy link
Contributor

jedel1043 commented May 22, 2022

Locale canonicalizes tlangs (the "-t-en-latn" part inside the "en-t-en-latn" locale) incorrectly, preserving the canonicalized case instead of converting all to lowercase.

The RFC6497 from the IETF that describes the extension T (and tlangs) specifies on one of its paragraphs:

2.3.  Canonicalization

   As required by [BCP47], the use of uppercase or lowercase letters is
   not significant in the subtags used in this extension.  The canonical
   form for all subtags in the extension is lowercase, with the fields
   ordered by the separators, alphabetically.  The order of subtags
   within a field is significant, and MUST NOT be changed in the process
   of canonicalizing.

Emphasis on

The canonical form for all subtags in the extension is lowercase ...

Minimized snippet with the bug

let test: Locale = "en-t-en-latn-ca-emodeng".parse().unwrap();
println!("{test}"); // prints "en-t-en-Latn-CA-emodeng" instead of "en-t-en-latn-ca-emodeng"

Haven't checked if it would be necessary to fix only Locale::from_bytes or also Locale::canonicalize and LocaleCanonicalizer::canonicalize.

@sffc sffc added T-bug Type: Bad behavior, security, privacy C-locale Component: Locale identifiers, BCP47 S-medium Size: Less than a week (larger bug fix or enhancement) labels May 23, 2022
@sffc
Copy link
Member

sffc commented May 23, 2022

Good catch.

CC @zbraniecki @dminor

@zbraniecki
Copy link
Member

Damn, that's unfortunate. There has been a lot of good honest work put into optimizing all the codepaths to canonicalize to one form.

We can either rewrite quite a bit of helper code to handle "normal" vs "lowercase" scenarios or bit the bullet and allocate tlang, lowercase it, and move on.

I'm leaning toward the latter as much as it pains me.

@sffc
Copy link
Member

sffc commented May 24, 2022

It's a nice property that I can get a real LanguageIdentifier out of the tlang field. If we make tlang a separate tree, then it should at least have helper functions to perform the conversion.

@zbraniecki
Copy link
Member

Let's wait for #1932 before we look into this. I think we may either want to introduce a concept of write_to_lowercase to Region and Script.

@sffc sffc added this to the ICU4X 1.1 milestone May 26, 2022
@sffc sffc added the help wanted Issue needs an assignee label May 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-locale Component: Locale identifiers, BCP47 help wanted Issue needs an assignee S-medium Size: Less than a week (larger bug fix or enhancement) T-bug Type: Bad behavior, security, privacy
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants