Make ASCII case conversions more than 4× faster #59283

Merged
merged 11 commits into from
Mar 28, 2019

@SimonSapin SimonSapin commented Mar 18, 2019

Reformatted output of ./x.py bench src/libcore --test-args ascii is below. The libcore benchmark calls [u8]::make_ascii_lowercase. The lookup benchmark has code (effectively) identical to libcore before this PR, and mask_shifted_bool_match_range has code (effectively) identical to libcore after this PR.

See code comments in u8::to_ascii_uppercase in src/libcore/num/mod.rs for an explanation of the original branchless algorithm (superseded by the update below).

Update: the algorithm was simplified while keeping the performance. See the branchless vs. mask_shifted_bool_match_range benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on u32 to convert four bytes at a time. The fake_simd_u32 benchmark implements this with let (before, aligned, after) = bytes.align_to_mut::<u32>(). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).
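A minimal sketch of the “fake SIMD” idea on u32, for illustration only. The function names and the add-then-mask range test are assumptions, not the code of the fake_simd_u32 benchmark. The helper is only correct when every byte of the word is ASCII, which is exactly the limitation described above.

```rust
/// Uppercases four packed bytes at once. Hypothetical helper: only correct
/// when every byte is ASCII (< 0x80), because the additions below must not
/// carry from one byte into the next.
fn to_ascii_uppercase_u32_ascii_only(word: u32) -> u32 {
    // Per byte: adding 0x1f sets the high bit iff byte >= b'a' (0x61).
    let at_least_a = word.wrapping_add(0x1f1f_1f1f);
    // Per byte: adding 0x05 sets the high bit iff byte > b'z' (0x7a).
    let above_z = word.wrapping_add(0x0505_0505);
    // The high bit of each byte is now set iff that byte is in b'a'..=b'z'.
    let is_lower = at_least_a & !above_z & 0x8080_8080;
    // Move 0x80 down to 0x20 and clear that bit in the lowercase bytes.
    word & !(is_lower >> 2)
}

/// Applies the helper to one group of four bytes.
fn uppercase_4(bytes: [u8; 4]) -> [u8; 4] {
    to_ascii_uppercase_u32_ascii_only(u32::from_le_bytes(bytes)).to_le_bytes()
}
```

With ASCII input such as *b"aZz{" this yields *b"AZZ{". With [0xFF, b'`', 0, 0], the carry out of the 0xFF byte makes the next byte look like a lowercase letter, so b'`' is wrongly turned into b'@'.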

This could be fixed (to optimize [u8]::make_ascii_lowercase and [u8]::make_ascii_uppercase in src/libcore/slice/mod.rs) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. However, I did not pursue this because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen at https://rust.godbolt.org/z/anKtbR.
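For reference, the one-byte-at-a-time shape in question looks roughly like the loop below. The function name is hypothetical, and whether the compiler actually vectorizes it depends on the target and optimization level.

```rust
/// Hypothetical stand-in for a slice-wide conversion: a plain loop that
/// converts each byte independently, with no state carried between iterations.
fn make_upper_bytewise(bytes: &mut [u8]) {
    for byte in bytes.iter_mut() {
        *byte = byte.to_ascii_uppercase();
    }
}
```

Because each iteration touches only its own byte, an optimizer is free to process several bytes per instruction when the per-byte conversion is branchless.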

Benchmark results on Linux x64 with Intel i7-7700K: (updated from #59283 (comment))

6830 bytes string:

alloc_only                          ... bench:    112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench:  1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench:  1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:    417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:    401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:    365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:    367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:    361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:    361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench:  6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench:  4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:    339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:    339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:


alloc_only                          ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench:     29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench:     24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench:     35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:

alloc_only                          ... bench:     14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench:     18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench:     21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench:     19 ns/iter (+/- 0) = 368 MB/s

@SimonSapin added the I-slow (Issue: Problems and improvements with respect to performance of generated code) and T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue) labels on Mar 18, 2019
@rust-highfive
Collaborator

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) label on Mar 18, 2019

Contributor

@raphlinus raphlinus left a comment

This looks good to me, modulo the tidy warnings.

I like that the explanation is so much longer than the code :)

@joshtriplett
Member

Looks good to me. r=me as soon as CI passes.

@ollie27
Member

ollie27 commented Mar 18, 2019

Might this be slower on platforms without SIMD, which can't take advantage of auto-vectorization, or does that not matter?

@raphlinus
Contributor

It's probably still faster than the status quo on those platforms because it does the computation without branches. If one cared deeply about those platforms, then the pseudo-SIMD approach could be resurrected. However, I think this is a pretty good compromise.

@SimonSapin
Contributor Author

I guess it depends on whether LLVM can auto-vectorize based on "classic" u32 operations. But either way it’s likely still faster than the current lookup table.

I also just realized that when doing one byte at a time, instead of the convoluted add-then-mask to emulate comparison, we can use an actual comparison to obtain a bool, then cast it to u8 to obtain a 1 or 0, then multiply that by a mask:

byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8)

This even turns out to be slightly faster! I’ll update the PR.
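The one-liner above, wrapped in a function as a minimal sketch (the function name is an assumption, not a name from the PR):

```rust
/// Uppercases one byte without branching on its value.
fn to_upper_mask_mult(byte: u8) -> u8 {
    // The comparison yields a bool; casting it gives 1 or 0, so the product
    // is 0x20 for lowercase ASCII letters and 0 for every other byte.
    let mask = 0x20 * ((b'a' <= byte && byte <= b'z') as u8);
    // Clearing bit 0x20 maps b'a'..=b'z' onto b'A'..=b'Z'; a zero mask
    // leaves the byte unchanged.
    byte & !mask
}
```

Checking it against u8::to_ascii_uppercase for all 256 byte values is a cheap way to confirm that the two agree.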

@SimonSapin
Contributor Author

If instead of b'a' <= byte && byte <= b'z' in the above I use byte.is_ascii_lowercase(), the performance is completely destroyed and becomes several times slower than before this PR. So I also changed the implementations of all u8::is_ascii_* methods to use match expressions with range patterns instead of the ASCII_CHARACTER_CLASS lookup table. When benchmarking black_box(bytes.iter().all(u8::is_ascii_FOO)), the change is small, possibly noise.
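A sketch of what predicates written as match expressions with range patterns can look like. The function names are hypothetical and the bodies are an illustration, not a copy of the libcore source.

```rust
/// Range-pattern version of a lowercase test: no table, just comparisons.
fn is_lower_match_range(byte: u8) -> bool {
    match byte {
        b'a'..=b'z' => true,
        _ => false,
    }
}

/// Several ranges can be combined in a single arm.
fn is_hexdigit_match_range(byte: u8) -> bool {
    match byte {
        b'0'..=b'9' | b'A'..=b'F' | b'a'..=b'f' => true,
        _ => false,
    }
}
```

Both compile down to a handful of comparisons, which keeps the case-conversion expression that calls them free of memory loads.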

Benchmark results in GIF for "visual diff":

[GIF: before/after benchmark results]

Benchmark results in text

Before:

test ascii::long::is_ascii                                 ... bench:         187 ns/iter (+/- 0) = 37379 MB/s
test ascii::long::is_ascii_alphabetic                      ... bench:          94 ns/iter (+/- 0) = 74361 MB/s
test ascii::long::is_ascii_alphanumeric                    ... bench:         125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_control                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit                           ... bench:         125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_graphic                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit                        ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation                     ... bench:         124 ns/iter (+/- 1) = 56370 MB/s
test ascii::long::is_ascii_uppercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace                      ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::medium::is_ascii                               ... bench:          28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic                    ... bench:          24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_alphanumeric                  ... bench:          24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_control                       ... bench:          23 ns/iter (+/- 1) = 1391 MB/s
test ascii::medium::is_ascii_digit                         ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_graphic                       ... bench:          24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_hexdigit                      ... bench:          23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_lowercase                     ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_punctuation                   ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase                     ... bench:          22 ns/iter (+/- 2) = 1454 MB/s
test ascii::medium::is_ascii_whitespace                    ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::short::is_ascii                                ... bench:          23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_alphabetic                     ... bench:          24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_alphanumeric                   ... bench:          24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_control                        ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_digit                          ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_graphic                        ... bench:          25 ns/iter (+/- 0) = 280 MB/s
test ascii::short::is_ascii_hexdigit                       ... bench:          24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_lowercase                      ... bench:          23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_punctuation                    ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase                      ... bench:          24 ns/iter (+/- 1) = 291 MB/s
test ascii::short::is_ascii_whitespace                     ... bench:          22 ns/iter (+/- 0) = 318 MB/s

After:

test ascii::long::is_ascii                                 ... bench:         186 ns/iter (+/- 0) = 37580 MB/s
test ascii::long::is_ascii_alphabetic                      ... bench:          96 ns/iter (+/- 0) = 72812 MB/s
test ascii::long::is_ascii_alphanumeric                    ... bench:         119 ns/iter (+/- 0) = 58739 MB/s
test ascii::long::is_ascii_control                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit                           ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_graphic                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit                        ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation                     ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_uppercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace                      ... bench:         134 ns/iter (+/- 0) = 52164 MB/s
test ascii::medium::is_ascii                               ... bench:          28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic                    ... bench:          23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_alphanumeric                  ... bench:          23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_control                       ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_digit                         ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_graphic                       ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_hexdigit                      ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_lowercase                     ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_punctuation                   ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase                     ... bench:          21 ns/iter (+/- 0) = 1523 MB/s
test ascii::medium::is_ascii_whitespace                    ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::short::is_ascii                                ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphabetic                     ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphanumeric                   ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_control                        ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_digit                          ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_graphic                        ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_hexdigit                       ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_lowercase                      ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_punctuation                    ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase                      ... bench:          21 ns/iter (+/- 0) = 333 MB/s
test ascii::short::is_ascii_whitespace                     ... bench:          20 ns/iter (+/- 0) = 350 MB/s

@SimonSapin
Contributor Author

Benchmark results from the original PR description, in case they end up being relevant:

6830 bytes string:

alloc_only                ... bench:    109 ns/iter (+/- 0) = 62660 MB/s
black_box_read_each_byte  ... bench:  1,708 ns/iter (+/- 5) = 3998 MB/s
lookup                    ... bench:  1,725 ns/iter (+/- 2) = 3959 MB/s
branch_and_subtract       ... bench:    413 ns/iter (+/- 1) = 16537 MB/s
branch_and_mask           ... bench:    411 ns/iter (+/- 2) = 16618 MB/s
branchless                ... bench:    377 ns/iter (+/- 2) = 18116 MB/s
libcore                   ... bench:    378 ns/iter (+/- 2) = 18068 MB/s
fake_simd_u32             ... bench:    373 ns/iter (+/- 1) = 18310 MB/s
fake_simd_u64             ... bench:    374 ns/iter (+/- 0) = 18262 MB/s

32 bytes string:

alloc_only                ... bench:     13 ns/iter (+/- 0) = 2461 MB/s
black_box_read_each_byte  ... bench:     28 ns/iter (+/- 0) = 1142 MB/s
lookup                    ... bench:     25 ns/iter (+/- 0) = 1280 MB/s
branch_and_subtract       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask           ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
libcore                   ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
fake_simd_u32             ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64             ... bench:     17 ns/iter (+/- 0) = 1882 MB/s

7 bytes string:

alloc_only                ... bench:     13 ns/iter (+/- 0) = 538 MB/s
black_box_read_each_byte  ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup                    ... bench:     17 ns/iter (+/- 0) = 411 MB/s
branch_and_subtract       ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask           ... bench:     17 ns/iter (+/- 0) = 411 MB/s
branchless                ... bench:     21 ns/iter (+/- 0) = 333 MB/s
libcore                   ... bench:     21 ns/iter (+/- 0) = 333 MB/s
fake_simd_u32             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u64             ... bench:     23 ns/iter (+/- 0) = 304 MB/s

@@ -3794,7 +3794,8 @@ impl u8 {
     #[stable(feature = "ascii_methods_on_intrinsics", since = "1.23.0")]
     #[inline]
     pub fn to_ascii_uppercase(&self) -> u8 {
-        ASCII_UPPERCASE_MAP[*self as usize]
+        // Unset the fifth bit if this is a lowercase letter
+        *self & !((self.is_ascii_lowercase() as u8) << 5)
Member

Suggested change
- *self & !((self.is_ascii_lowercase() as u8) << 5)
+ *self - ((self.is_ascii_lowercase() as u8) << 5)

Using subtract is slightly faster for me:

test long::case12_mask_shifted_bool_match_range         ... bench:         776 ns/iter (+/- 26) = 9007 MB/s
test long::case13_sub_shifted_bool_match_range          ... bench:         734 ns/iter (+/- 49) = 9523 MB/s
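The two forms compute the same result, because every lowercase ASCII letter already has bit 0x20 set, so subtracting 0x20 and clearing that bit coincide. A small sketch with hypothetical function names:

```rust
/// Clears bit 5 (0x20) when the byte is a lowercase ASCII letter.
fn upper_by_mask(byte: u8) -> u8 {
    byte & !((byte.is_ascii_lowercase() as u8) << 5)
}

/// Subtracts 0x20 when the byte is a lowercase ASCII letter. This cannot
/// underflow: lowercase letters are >= 0x61, and other bytes subtract 0.
fn upper_by_sub(byte: u8) -> u8 {
    byte - ((byte.is_ascii_lowercase() as u8) << 5)
}
```

Any speed difference between them therefore comes from instruction selection, not from different behavior.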

Contributor Author

This is also an improvement for me, but smaller:

test ascii::long::case12_mask_shifted_bool_match_range         ... bench:         352 ns/iter (+/- 0) = 19857 MB/s
test ascii::long::case13_subtract_shifted_bool_match_range     ... bench:         350 ns/iter (+/- 1) = 19971 MB/s
test ascii::medium::case12_mask_shifted_bool_match_range       ... bench:          15 ns/iter (+/- 0) = 2133 MB/s
test ascii::medium::case13_subtract_shifted_bool_match_range   ... bench:          15 ns/iter (+/- 0) = 2133 MB/s
test ascii::short::case12_mask_shifted_bool_match_range        ... bench:          19 ns/iter (+/- 0) = 368 MB/s
test ascii::short::case13_subtract_shifted_bool_match_range    ... bench:          18 ns/iter (+/- 0) = 388 MB/s

@ollie27
Member

ollie27 commented Mar 19, 2019

A quick benchmark using the i586-pc-windows-msvc target gets me:

test long::case00_alloc_only                            ... bench:         291 ns/iter (+/- 46) = 24020 MB/s
test long::case01_black_box_read_each_byte              ... bench:       4,214 ns/iter (+/- 163) = 1658 MB/s
test long::case02_lookup_table                          ... bench:       6,158 ns/iter (+/- 226) = 1135 MB/s
test long::case03_branch_and_subtract                   ... bench:      17,402 ns/iter (+/- 641) = 401 MB/s
test long::case04_branch_and_mask                       ... bench:      17,748 ns/iter (+/- 1,242) = 393 MB/s
test long::case05_branchless                            ... bench:      10,757 ns/iter (+/- 390) = 649 MB/s
test long::case06_libcore                               ... bench:       6,165 ns/iter (+/- 401) = 1133 MB/s
test long::case07_fake_simd_u32                         ... bench:       2,790 ns/iter (+/- 138) = 2505 MB/s
test long::case08_fake_simd_u64                         ... bench:       2,816 ns/iter (+/- 166) = 2482 MB/s
test long::case09_mask_mult_bool_branchy_lookup_table   ... bench:      11,366 ns/iter (+/- 353) = 614 MB/s
test long::case10_mask_mult_bool_lookup_table           ... bench:       9,793 ns/iter (+/- 486) = 713 MB/s
test long::case11_mask_mult_bool_match_range            ... bench:       8,949 ns/iter (+/- 330) = 781 MB/s
test long::case12_mask_shifted_bool_match_range         ... bench:       8,938 ns/iter (+/- 478) = 782 MB/s
test long::case13_sub_shifted_bool_match_range          ... bench:       8,136 ns/iter (+/- 363) = 859 MB/s
test medium::case00_alloc_only                          ... bench:          64 ns/iter (+/- 1) = 500 MB/s
test medium::case01_black_box_read_each_byte            ... bench:          73 ns/iter (+/- 2) = 438 MB/s
test medium::case02_lookup_table                        ... bench:          66 ns/iter (+/- 4) = 484 MB/s
test medium::case03_branch_and_subtract                 ... bench:          63 ns/iter (+/- 2) = 507 MB/s
test medium::case04_branch_and_mask                     ... bench:          64 ns/iter (+/- 2) = 500 MB/s
test medium::case05_branchless                          ... bench:         110 ns/iter (+/- 3) = 290 MB/s
test medium::case06_libcore                             ... bench:          62 ns/iter (+/- 4) = 516 MB/s
test medium::case07_fake_simd_u32                       ... bench:          79 ns/iter (+/- 2) = 405 MB/s
test medium::case08_fake_simd_u64                       ... bench:          80 ns/iter (+/- 2) = 400 MB/s
test medium::case09_mask_mult_bool_branchy_lookup_table ... bench:         118 ns/iter (+/- 5) = 271 MB/s
test medium::case10_mask_mult_bool_lookup_table         ... bench:          64 ns/iter (+/- 5) = 500 MB/s
test medium::case11_mask_mult_bool_match_range          ... bench:          62 ns/iter (+/- 3) = 516 MB/s
test medium::case12_mask_shifted_bool_match_range       ... bench:          62 ns/iter (+/- 2) = 516 MB/s
test medium::case13_sub_shifted_bool_match_range        ... bench:          61 ns/iter (+/- 3) = 524 MB/s
test short::case00_alloc_only                           ... bench:          62 ns/iter (+/- 3) = 112 MB/s
test short::case01_black_box_read_each_byte             ... bench:          65 ns/iter (+/- 4) = 107 MB/s
test short::case02_lookup_table                         ... bench:          61 ns/iter (+/- 1) = 114 MB/s
test short::case03_branch_and_subtract                  ... bench:          63 ns/iter (+/- 4) = 111 MB/s
test short::case04_branch_and_mask                      ... bench:          61 ns/iter (+/- 1) = 114 MB/s
test short::case05_branchless                           ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case06_libcore                              ... bench:          61 ns/iter (+/- 4) = 114 MB/s
test short::case07_fake_simd_u32                        ... bench:          74 ns/iter (+/- 4) = 94 MB/s
test short::case08_fake_simd_u64                        ... bench:          74 ns/iter (+/- 3) = 94 MB/s
test short::case09_mask_mult_bool_branchy_lookup_table  ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case10_mask_mult_bool_lookup_table          ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case11_mask_mult_bool_match_range           ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case12_mask_shifted_bool_match_range        ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case13_sub_shifted_bool_match_range         ... bench:          61 ns/iter (+/- 2) = 114 MB/s

Which shows that this can be slower than the lookup for a target without SIMD.
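For comparison, a lookup-table conversion of the kind these numbers refer to can be sketched as follows. The table and function names are hypothetical; this is not the ASCII_UPPERCASE_MAP definition from libcore.

```rust
/// Builds a 256-entry table mapping every byte to its ASCII uppercase form.
const fn build_uppercase_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let byte = i as u8;
        table[i] = if byte >= b'a' && byte <= b'z' {
            byte - 0x20
        } else {
            byte
        };
        i += 1;
    }
    table
}

/// Table computed at compile time; one entry per possible byte value.
static UPPERCASE_TABLE: [u8; 256] = build_uppercase_table();

/// Converts with a single indexed load instead of arithmetic.
fn to_upper_lookup(byte: u8) -> u8 {
    UPPERCASE_TABLE[byte as usize]
}
```

Each conversion costs one memory load, which does not vectorize well but also does not depend on SIMD being available.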

@SimonSapin
Contributor Author

What commit were these i586 results on? Because the libcore performs exactly like lookup_table, which seems surprising.

@ollie27
Member

ollie27 commented Mar 19, 2019

> What commit were these i586 results on? Because the libcore performs exactly like lookup_table, which seems surprising.

It was just a recent nightly, so that's why libcore is the same as lookup_table.

@SimonSapin
Contributor Author

@joshtriplett I pushed several changes since your review, could you have another look?

@joshtriplett
Member

@bors r+

@bors
Contributor

bors commented Mar 26, 2019

📌 Commit 7fad370 has been approved by joshtriplett

@bors added the S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) label and removed the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) label on Mar 26, 2019
Centril added a commit to Centril/rust that referenced this pull request Mar 27, 2019
…=joshtriplett

Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup` has code (effectively) identical to that before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` after this PR.

~~See [code comments](rust-lang@ce933f7#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~

**Update:** the algorithm was simplified while keeping the performance. See the `branchless` vs. `mask_shifted_bool_match_range` benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmark implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this however because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with Intel i7-7700K: (updated from rust-lang#59283 (comment))

```text
6830 bytes string:

alloc_only                          ... bench:    112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench:  1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench:  1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:    417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:    401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:    365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:    367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:    361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:    361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench:  6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench:  4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:    339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:    339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:

alloc_only                          ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench:     29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench:     24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench:     35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:

alloc_only                          ... bench:     14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench:     18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench:     21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench:     19 ns/iter (+/- 0) = 368 MB/s
```
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench:     18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench:     21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench:     19 ns/iter (+/- 0) = 368 MB/s
```
Centril added a commit to Centril/rust that referenced this pull request Mar 27, 2019
…=joshtriplett

Make ASCII case conversions more than 4× faster

cuviper added a commit to cuviper/rust that referenced this pull request Mar 28, 2019
…=joshtriplett

Make ASCII case conversions more than 4× faster

bors added a commit that referenced this pull request Mar 28, 2019
Rollup of 18 pull requests

Successful merges:

 - #57293 (Make some lints incremental)
 - #57565 (syntax: Remove warning for unnecessary path disambiguators)
 - #58253 (librustc_driver => 2018)
 - #58837 (librustc_interface => 2018)
 - #59268 (Add suggestion to use `&*var` when `&str: From<String>` is expected)
 - #59283 (Make ASCII case conversions more than 4× faster)
 - #59284 (adjust MaybeUninit API to discussions)
 - #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
 - #59390 (Make `ptr::eq` documentation mention fat-pointer behavior)
 - #59393 (Refactor tuple comparison tests)
 - #59420 ([CI] record docker image info for reuse)
 - #59421 (Reject integer suffix when tuple indexing)
 - #59430 (Renames `EvalContext` to `InterpretCx`)
 - #59439 (Generalize diagnostic for `x = y` where `bool` is the expected type)
 - #59449 (fix: Make incremental artifact deletion more robust)
 - #59451 (Add `Default` to `std::alloc::System`)
 - #59459 (Add some tests)
 - #59460 (Include id in Thread's Debug implementation)

Failed merges:

r? @ghost
@bors bors merged commit 7fad370 into rust-lang:master Mar 28, 2019
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request May 31, 2019
Version 1.35.0 (2019-05-23)
==========================

Language
--------
- [`FnOnce`, `FnMut`, and the `Fn` traits are now implemented for `Box<FnOnce>`,
  `Box<FnMut>`, and `Box<Fn>` respectively.][59500]
- [You can now coerce closures into unsafe function pointers.][59580] e.g.
  ```rust
  unsafe fn call_unsafe(func: unsafe fn()) {
      func()
  }

  pub fn main() {
      unsafe { call_unsafe(|| {}); }
  }
  ```


Compiler
--------
- [Added the `armv6-unknown-freebsd-gnueabihf` and
  `armv7-unknown-freebsd-gnueabihf` targets.][58080]
- [Added the `wasm32-unknown-wasi` target.][59464]


Libraries
---------
- [`Thread` will now show its ID in `Debug` output.][59460]
- [`StdinLock`, `StdoutLock`, and `StderrLock` now implement `AsRawFd`.][59512]
- [`alloc::System` now implements `Default`.][59451]
- [Expanded `Debug` output (`{:#?}`) for structs now has a trailing comma on the
  last field.][59076]
- [`char::{ToLowercase, ToUppercase}` now
  implement `ExactSizeIterator`.][58778]
- [All `NonZero` numeric types now implement `FromStr`.][58717]
- [Removed the `Read` trait bounds
  on the `BufReader::{get_ref, get_mut, into_inner}` methods.][58423]
- [You can now call the `dbg!` macro without any parameters to print the file
  and line where it is called.][57847]
- [In place ASCII case conversions are now up to 4× faster.][59283]
  e.g. `str::make_ascii_lowercase`
- [`hash_map::{OccupiedEntry, VacantEntry}` now implement `Sync`
  and `Send`.][58369]

Stabilized APIs
---------------
- [`f32::copysign`]
- [`f64::copysign`]
- [`RefCell::replace_with`]
- [`RefCell::map_split`]
- [`ptr::hash`]
- [`Range::contains`]
- [`RangeFrom::contains`]
- [`RangeTo::contains`]
- [`RangeInclusive::contains`]
- [`RangeToInclusive::contains`]
- [`Option::copied`]

Cargo
-----
- [You can now set `cargo:rustc-cdylib-link-arg` at build time to pass custom
  linker arguments when building a `cdylib`.][cargo/6298] Its usage is highly
  platform specific.

Misc
----
- [The Rust toolchain is now available natively for musl based distros.][58575]

[59460]: rust-lang/rust#59460
[59464]: rust-lang/rust#59464
[59500]: rust-lang/rust#59500
[59512]: rust-lang/rust#59512
[59580]: rust-lang/rust#59580
[59283]: rust-lang/rust#59283
[59451]: rust-lang/rust#59451
[59076]: rust-lang/rust#59076
[58778]: rust-lang/rust#58778
[58717]: rust-lang/rust#58717
[58369]: rust-lang/rust#58369
[58423]: rust-lang/rust#58423
[58080]: rust-lang/rust#58080
[57847]: rust-lang/rust#57847
[58575]: rust-lang/rust#58575
[cargo/6298]: rust-lang/cargo#6298
[`f32::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f32.html#method.copysign
[`f64::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f64.html#method.copysign
[`RefCell::replace_with`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.replace_with
[`RefCell::map_split`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.map_split
[`ptr::hash`]: https://doc.rust-lang.org/stable/std/ptr/fn.hash.html
[`Range::contains`]: https://doc.rust-lang.org/std/ops/struct.Range.html#method.contains
[`RangeFrom::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeFrom.html#method.contains
[`RangeTo::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeTo.html#method.contains
[`RangeInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeInclusive.html#method.contains
[`RangeToInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeToInclusive.html#method.contains
[`Option::copied`]: https://doc.rust-lang.org/std/option/enum.Option.html#method.copied
@SimonSapin SimonSapin deleted the branchless-ascii-case branch November 28, 2019 12:04
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.