Regular subtag APIs #1932

robertbastian · 2022-05-23T22:27:37Z

No description provided.

zbraniecki

praise: generally this looks awesome. It will make it easier to maintain the code. There's some cost to readability (macros are by nature a bit more challenging), but I think it's a good ROI.

My only larger concern is about performance of LSR subtags. I handcrafted parsing of each to a cycle. My reasoning is that by far the hottest paths in parsing of Locale are:

Are the first couple characters a valid Language?
Is the second subtag a Script or not?
Is the second/third subtag a Region?

You may notice that each of the three has different codepath to attempt to always check length first and bail early. Then check first character if possible and bail early.
I also tested (criterion, non iai tho!) whether it is faster to first check ascii and conditionally convert to tinystr, or convert and then test - testing tinystr casing is much faster, but parsing to tinystr costs.

This whole microoptimization may be moot tho and maybe the actual result is not going to be affected in a significant way, but please, test it. With iai if possible.

Particularly, rejecting invalid first characters, and fast-skipping Script as commonly omitted - the handcrafting allowed us to not have separate Script::is_valid from parsing, because parsing was instruction-optimized for that.

You also have a couple places where you, for normalization reasons, repeat validation calls where they weren't before. It may or may not be needed, and I'm ok with some overhead in all subtags except of LID so please, test those. Maybe no-ops will be enough to avoid complicating the macro?

components/locid/src/helpers.rs

components/locid/src/subtags/region.rs

robertbastian · 2022-06-15T12:30:17Z

Parsing now does:

raw slice length check
parse TinyStr
~~- TinyStr length check (new, this can be smaller than the raw length)~~
validation
normalization

For Language and Script this is the same as before. Variant failed early if the slice length is 4 the first character is alpha, which it doesn't do anymore. Region needs to branch on the TinyStr's length for both validation and normalization, that used to be one branch and now it's two consecutive ones, hopefully the compiler can optimize it out. These can be joined if needed but it would complicate the macro for all subtags.

The from-raw path does:

parse TinyStr
TinyStr length check
validation+normalization check

This is identical to before for LSRV. For other tags this path might be new and validation and normalization are sometimes two TinyStr calls, as the combined predicate doesn't exist.

robertbastian · 2022-06-15T12:51:29Z

Benchmarks look bad though and I don't know why...

robertbastian · 2022-06-17T08:06:51Z

^ rebased on #2068

sffc

Looks great! I'm requesting changes because we shouldn't land unsafe code that isn't documented.

components/locid/src/helpers.rs

components/locid/src/macros.rs

utils/tinystr/src/ascii.rs

Manishearth

don't have strong opinions on this component but looks good 👍

sffc · 2022-07-22T14:06:01Z

@robertbastian Can you determine whether this change can be decoupled from 1.0? If there are any methods that are being removed or renamed, you can do that in a standalone PR first

robertbastian · 2022-08-30T15:40:44Z

This is purely additive:

https://github.com/unicode-org/icu4x/pull/1932/files#diff-161e594b1bdf9bd23ef2db902ea1fc2a231d94d56e4c6f8dec2cf6a135f5636e

CLAassistant · 2022-08-30T17:13:10Z

All committers have signed the CLA.

robertbastian · 2022-08-31T13:01:05Z

bench_langid_constr
  Instructions:                2946 (+4.394047%)
  L1 Accesses:                 3715 (+3.481894%)
  L2 Accesses:                    9 (No change)
  RAM Accesses:                  53 (-1.851852%)
  Estimated Cycles:            5615 (+1.628959%)

bench_langid_compare_components
  Instructions:                  14 (No change)
  L1 Accesses:                   19 (No change)
  L2 Accesses:                    3 (No change)
  RAM Accesses:                   1 (No change)
  Estimated Cycles:              69 (No change)

bench_langid_compare_components_str
  Instructions:                  32 (No change)
  L1 Accesses:                   41 (No change)
  L2 Accesses:                    3 (No change)
  RAM Accesses:                   2 (No change)
  Estimated Cycles:             126 (No change)

bench_langid_strict_cmp
  Instructions:                 661 (No change)
  L1 Accesses:                  762 (No change)
  L2 Accesses:                    9 (No change)
  RAM Accesses:                  16 (No change)
  Estimated Cycles:            1367 (No change)

bench_langid_matching
  Instructions:                 126 (No change)
  L1 Accesses:                  163 (No change)
  L2 Accesses:                    9 (No change)
  RAM Accesses:                  10 (No change)
  Estimated Cycles:             558 (No change)

bench_langid_matching_str
  Instructions:                1523 (+2.146211%)
  L1 Accesses:                 1785 (+1.825442%)
  L2 Accesses:                    7 (No change)
  RAM Accesses:                  27 (No change)
  Estimated Cycles:            2765 (+1.170874%)

bench_langid_serialize
  Instructions:                4563 (No change)
  L1 Accesses:                 6423 (+0.015571%)
  L2 Accesses:                   17 (No change)
  RAM Accesses:                  52 (-1.886792%)
  Estimated Cycles:            8328 (-0.406601%)

bench_langid_canonicalize
  Instructions:                7306 (+1.726539%)
  L1 Accesses:                 9928 (+1.275120%)
  L2 Accesses:                   18 (+5.882353%)
  RAM Accesses:                  71 (-2.739726%)
  Estimated Cycles:           12503 (+0.482199%)

langid/overview         time:   [3.4459 us 3.4749 us 3.5056 us]
                        change: [-1.4087% -0.4060% +0.6212%] (p = 0.44 > 0.05)
                        No change in performance detected.

langid/construct/langid time:   [990.92 ns 1.0023 us 1.0170 us]
                        change: [+5.3227% +6.5564% +8.1515%] (p = 0.00 < 0.05)
                        Performance has regressed.

langid/to_string/langid time:   [1.0699 us 1.0786 us 1.0886 us]
                        change: [-31.112% -29.571% -28.160%] (p = 0.00 < 0.05)
                        Performance has improved.
langid/to_string/langid/writeable
                        time:   [685.64 ns 689.68 ns 693.66 ns]
                        change: [+3.1572% +4.1634% +5.1570%] (p = 0.00 < 0.05)
                        Performance has regressed.

langid/compare/struct/langid
                        time:   [140.11 ns 140.99 ns 141.92 ns]
                        change: [-4.3621% -3.1981% -2.0392%] (p = 0.00 < 0.05)
                        Performance has improved.
langid/compare/str/langid
                        time:   [1.0888 us 1.0950 us 1.1018 us]
                        change: [+10.561% +11.239% +11.883%] (p = 0.00 < 0.05)
                        Performance has regressed.
langid/compare/strict_cmp/langid
                        time:   [456.62 ns 459.14 ns 461.62 ns]
                        change: [-2.8353% -1.6745% -0.7075%] (p = 0.00 < 0.05)
                        Change within noise threshold.

langid/canonicalize/langid
                        time:   [2.5731 us 2.5853 us 2.5983 us]
                        change: [-5.8140% -4.7544% -3.6143%] (p = 0.00 < 0.05)
                        Performance has improved.

locale/overview         time:   [4.6733 us 4.7106 us 4.7473 us]
                        change: [+2.3118% +3.3342% +4.3377%] (p = 0.00 < 0.05)
                        Performance has regressed.

locale/construct/locale time:   [1.9567 us 1.9715 us 1.9893 us]
                        change: [-2.1714% -0.9435% +0.2710%] (p = 0.13 > 0.05)
                        No change in performance detected.

locale/to_string/locale time:   [1.3082 us 1.3209 us 1.3366 us]
                        change: [-4.3982% -2.9053% -1.4882%] (p = 0.00 < 0.05)
                        Performance has improved.
locale/to_string/locale/writeable
                        time:   [1.0715 us 1.0770 us 1.0828 us]
                        change: [+0.3527% +1.3658% +2.3336%] (p = 0.01 < 0.05)
                        Change within noise threshold.

locale/compare/struct/locale
                        time:   [265.12 ns 266.57 ns 268.18 ns]
                        change: [-3.9578% -3.1652% -2.3480%] (p = 0.00 < 0.05)
                        Performance has improved.
locale/compare/str/locale
                        time:   [2.1938 us 2.2045 us 2.2159 us]
                        change: [+4.0290% +5.0860% +6.1479%] (p = 0.00 < 0.05)
                        Performance has regressed.
locale/compare/strict_cmp/locale
                        time:   [637.09 ns 644.99 ns 655.71 ns]
                        change: [-2.6526% -1.3131% +0.0698%] (p = 0.05 > 0.05)
                        No change in performance detected.

locale/canonicalize/locale
                        time:   [3.8947 us 3.9168 us 3.9424 us]
                        change: [+1.7613% +2.6642% +3.5496%] (p = 0.00 < 0.05)
                        Performance has regressed.

subtags/language/parse  time:   [196.31 ns 197.31 ns 198.32 ns]
                        change: [-0.6625% +0.3728% +1.4782%] (p = 0.50 > 0.05)
                        No change in performance detected.

subtags/script/parse    time:   [134.40 ns 135.13 ns 135.89 ns]
                        change: [-0.7296% +0.2002% +1.1192%] (p = 0.69 > 0.05)
                        No change in performance detected.

subtags/region/parse    time:   [164.77 ns 165.65 ns 166.63 ns]
                        change: [-21.710% -21.058% -20.438%] (p = 0.00 < 0.05)
                        Performance has improved.

subtags/variant/parse   time:   [212.03 ns 213.21 ns 214.52 ns]
                        change: [+10.139% +11.212% +12.197%] (p = 0.00 < 0.05)
                        Performance has regressed.

sffc · 2022-08-31T16:28:18Z

Benchmarks look OK to me.

sffc

Praise: Nice work!

sffc · 2022-09-02T17:32:10Z

components/locid/src/helpers.rs

+            /// Compare with a potentially unnormalized BCP-47 string.
+            ///
+            /// The return value is equivalent to what would happen if you first parsed the
+            /// BCP-47 string and then performed a structucal comparison.


Suggested change

/// BCP-47 string and then performed a structucal comparison.

/// BCP-47 string and then performed a structural comparison.

components/locid/src/helpers.rs

sffc · 2022-09-02T17:44:11Z

utils/tinystr/src/ascii.rs

+    pub const fn as_bytes(&self) -> &[u8] {
+        // Safe because `self.bytes.as_slice()` pointer-casts to `&[u8]`,
+        // and changing the length of that slice to self.len() < N is safe.
+        unsafe { core::mem::transmute((self.bytes.as_slice().as_ptr(), self.len())) }


Ping @Manishearth to review this unsafe block

In general I want to move away from relying on the slice layout matching tuples (miri will eventually not like this, but also we'll have better slice APIs by then)

Can you instead take self.bytes.as_slice() and just transmute that reference? It should work.

Also, explicitly include the from/to types on the transmute::<_,_>() turbofish (and perhaps import core::mem somewhere)

sffc · 2022-09-02T18:12:19Z

components/locid/src/subtags/variant.rs

+    s.to_ascii_lowercase(),
+    s.is_ascii_lowercase()
+        && s.is_ascii_alphanumeric()
+        && (s.len() != 4 || s.all_bytes()[0].is_ascii_digit()),


Praise: This is a clearer way to express the invariant than we had before

variant = 5*8alphanum ; registered variants / (DIGIT 3alphanum)

https://www.rfc-editor.org/rfc/bcp/bcp47.txt

sffc · 2022-09-02T22:15:26Z

I want this so I'm going to merge it.

robertbastian force-pushed the locale branch from b2796ef to 52a0daa Compare May 23, 2022 23:41

zbraniecki reviewed May 24, 2022

View reviewed changes

components/locid/src/helpers.rs Outdated Show resolved Hide resolved

components/locid/src/helpers.rs Outdated Show resolved Hide resolved

components/locid/src/subtags/region.rs Outdated Show resolved Hide resolved

zbraniecki mentioned this pull request May 24, 2022

Locale not canonicalizing tlangs to lowercase #1925

Closed

robertbastian force-pushed the locale branch from 52a0daa to 200b266 Compare June 4, 2022 04:49

robertbastian mentioned this pull request Jun 14, 2022

normalizing_eq, strict_cmp for LSRV subtags #2048

Merged

robertbastian force-pushed the locale branch 2 times, most recently from 1d2be8d to 4baa586 Compare June 14, 2022 14:43

unicode-org deleted a comment from jira-pull-request-webhook bot Jun 14, 2022

robertbastian force-pushed the locale branch 2 times, most recently from 72df063 to 5ae99df Compare June 14, 2022 16:08

unicode-org deleted a comment from jira-pull-request-webhook bot Jun 14, 2022

robertbastian force-pushed the locale branch 4 times, most recently from b3c0de1 to 44da72d Compare June 15, 2022 12:21

robertbastian force-pushed the locale branch 4 times, most recently from 74306f5 to e24979e Compare June 17, 2022 08:04

unicode-org deleted a comment from jira-pull-request-webhook bot Jun 17, 2022

robertbastian requested review from Manishearth and removed request for nciric June 17, 2022 15:49

sffc requested changes Jun 23, 2022

View reviewed changes

comments

e7e337f

sffc added this to the ICU4X 1.0 (Polish) milestone Jun 23, 2022

Manishearth previously approved these changes Jun 23, 2022

View reviewed changes

Merge branch 'main' into locale

3dd89f5

robertbastian dismissed Manishearth’s stale review via 3dd89f5 August 30, 2022 17:13

robertbastian added 3 commits August 30, 2022 19:19

recover macros mod

358c066

fix

b911133

Merge branch 'main' into locale

cc58354

unicode-org deleted a comment from jira-pull-request-webhook bot Aug 31, 2022

robertbastian requested a review from zbraniecki August 31, 2022 13:01

robertbastian requested a review from sffc August 31, 2022 16:28

Merge branch 'main' into locale

3263f71

sffc approved these changes Sep 2, 2022

View reviewed changes

sffc requested a review from Manishearth September 2, 2022 18:30

sffc mentioned this pull request Sep 2, 2022

Fix unsafe code block in utils/tinystr/src/ascii.rs #2504

Closed

sffc merged commit 6e6ea16 into unicode-org:main Sep 2, 2022

sffc mentioned this pull request Sep 2, 2022

Fix typo in locid/src/helpers.rs #2505

Closed

robertbastian deleted the locale branch September 5, 2022 12:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regular subtag APIs #1932

Regular subtag APIs #1932

robertbastian commented May 23, 2022

zbraniecki left a comment

robertbastian commented Jun 15, 2022 •

edited

Loading

robertbastian commented Jun 15, 2022

robertbastian commented Jun 17, 2022

sffc left a comment

Manishearth left a comment

sffc commented Jul 22, 2022

robertbastian commented Aug 30, 2022 •

edited

Loading

CLAassistant commented Aug 30, 2022 •

edited

Loading

robertbastian commented Aug 31, 2022

sffc commented Aug 31, 2022

sffc left a comment

sffc Sep 2, 2022

sffc Sep 3, 2022

sffc Sep 2, 2022

Manishearth Sep 2, 2022

sffc Sep 2, 2022

sffc Sep 2, 2022

sffc commented Sep 2, 2022

	/// BCP-47 string and then performed a structucal comparison.
	/// BCP-47 string and then performed a structural comparison.

Regular subtag APIs #1932

Regular subtag APIs #1932

Conversation

robertbastian commented May 23, 2022

zbraniecki left a comment

Choose a reason for hiding this comment

robertbastian commented Jun 15, 2022 • edited Loading

robertbastian commented Jun 15, 2022

robertbastian commented Jun 17, 2022

sffc left a comment

Choose a reason for hiding this comment

Manishearth left a comment

Choose a reason for hiding this comment

sffc commented Jul 22, 2022

robertbastian commented Aug 30, 2022 • edited Loading

CLAassistant commented Aug 30, 2022 • edited Loading

robertbastian commented Aug 31, 2022

sffc commented Aug 31, 2022

sffc left a comment

Choose a reason for hiding this comment

sffc Sep 2, 2022

Choose a reason for hiding this comment

sffc Sep 3, 2022

Choose a reason for hiding this comment

sffc Sep 2, 2022

Choose a reason for hiding this comment

Manishearth Sep 2, 2022

Choose a reason for hiding this comment

sffc Sep 2, 2022

Choose a reason for hiding this comment

sffc Sep 2, 2022

Choose a reason for hiding this comment

sffc commented Sep 2, 2022

robertbastian commented Jun 15, 2022 •

edited

Loading

robertbastian commented Aug 30, 2022 •

edited

Loading

CLAassistant commented Aug 30, 2022 •

edited

Loading