Make [u8]::reverse() 5x faster #41764

Merged · 3 commits merged into rust-lang:master from scottmcm:faster-reverse on May 10, 2017
Conversation

scottmcm (Member) commented May 5, 2017

Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5.

Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16.

Thank you ptr::*_unaligned for making this easy :)
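
The core of the approach looks roughly like the sketch below (illustrative only, not the PR's exact code; the function name, element counts, and structure are made up here): walk inward from both ends one machine word at a time, byte-reverse each word with swap_bytes (LLVM's bswap), store the words crosswise, and fall back to plain element swaps for whatever is left in the middle.

```rust
use std::mem;
use std::ptr;

/// Reverse a byte slice by loading a word from each end, byte-swapping it,
/// and storing it back at the opposite end.
fn reverse_u8(slice: &mut [u8]) {
    const W: usize = mem::size_of::<u64>();
    // Stand-in for the PR's guard: only take the unaligned-word path on
    // targets where unaligned loads/stores are cheap.
    let fast_unaligned = cfg!(any(target_arch = "x86", target_arch = "x86_64"));

    let mut i = 0;
    let mut j = slice.len();
    if fast_unaligned {
        let p = slice.as_mut_ptr();
        // As long as a full word remains on each side, swap whole words.
        while j - i >= 2 * W {
            unsafe {
                let front = ptr::read_unaligned(p.add(i) as *const u64);
                let back = ptr::read_unaligned(p.add(j - W) as *const u64);
                ptr::write_unaligned(p.add(i) as *mut u64, back.swap_bytes());
                ptr::write_unaligned(p.add(j - W) as *mut u64, front.swap_bytes());
            }
            i += W;
            j -= W;
        }
    }
    // Finish the remaining middle (or everything, on other targets) bytewise.
    while i + 1 < j {
        slice.swap(i, j - 1);
        i += 1;
        j -= 1;
    }
}

fn main() {
    let mut v: Vec<u8> = (0u8..100).cycle().take(1000).collect();
    let mut expected = v.clone();
    expected.reverse();
    reverse_u8(&mut v);
    assert_eq!(v, expected);
}
```

For [u16] the same pattern applies, except each u32 load holds two elements and is flipped with rotate_left(16) (a ROT16, which swaps the two u16 halves) rather than fully byte-reversed.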

Benchmark results (from my i5-2500K):

```text
# Before
test slice::reverse_u8      ... bench:  273,836 ns/iter (+/- 15,592) =  3829 MB/s
test slice::reverse_u16     ... bench:  139,793 ns/iter (+/- 17,748) =  7500 MB/s
test slice::reverse_u32     ... bench:   74,997 ns/iter  (+/- 5,130) = 13981 MB/s
test slice::reverse_u64     ... bench:   47,452 ns/iter  (+/- 2,213) = 22097 MB/s

# After
test slice::reverse_u8      ... bench:   52,170 ns/iter (+/- 3,962) = 20099 MB/s
test slice::reverse_u16     ... bench:   93,330 ns/iter (+/- 4,412) = 11235 MB/s
test slice::reverse_u32     ... bench:   74,731 ns/iter (+/- 1,425) = 14031 MB/s
test slice::reverse_u64     ... bench:   47,556 ns/iter (+/- 3,025) = 22049 MB/s
```

If you're curious about the assembly, instead of doing this

```
movzx	eax, byte ptr [rdi]
movzx	ecx, byte ptr [rsi]
mov	byte ptr [rdi], cl
mov	byte ptr [rsi], al
```

it does this

```
mov	rax, qword ptr [rdx]
mov	rbx, qword ptr [r11 + rcx - 8]
bswap	rbx
mov	qword ptr [rdx], rbx
bswap	rax
mov	qword ptr [r11 + rcx - 8], rax
```

rust-highfive (Collaborator) commented:

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

brson added the relnotes label (Marks issues that should be documented in the release notes of the next release.) May 5, 2017
brson (Contributor) commented May 5, 2017

This is a neat patch!

It combines several low level abstractions to do something very precise and unusual, and it's written clearly and all the abstractions disappear into a blip of assembly. Good Rust.

If you think the compiler should be doing this optimization (or better) itself maybe add a comment saying so, in case someday it should be removed.

It looks right to me but somebody else should review. Maybe r? @bluss?

rust-highfive assigned bluss and unassigned brson May 5, 2017
shepmaster added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) May 5, 2017
```rust
reverse!(reverse_u8, u8);
reverse!(reverse_u16, u16);
reverse!(reverse_u32, u32);
reverse!(reverse_u64, u64);
```
frewsxcv (Member) commented May 6, 2017

Should u128 also be here?

scottmcm (Member, Author) commented May 6, 2017

Makes sense to have all the primitives. Also added [u8;3] and Simd<[f64;4]> while I was at it, to show more of the perf range.

Results, from fastest to slowest:

```text
test slice::reverse_simd_f64x4  ... bench:   36,818 ns/iter (+/-   924) = 28479 MB/s
test slice::reverse_u128        ... bench:   41,797 ns/iter (+/- 3,127) = 25087 MB/s
test slice::reverse_u64         ... bench:   47,062 ns/iter (+/-   898) = 22280 MB/s
test slice::reverse_u8          ... bench:   51,678 ns/iter (+/- 3,819) = 20290 MB/s
test slice::reverse_u32         ... bench:   74,404 ns/iter (+/-   387) = 14092 MB/s
test slice::reverse_u16         ... bench:   92,952 ns/iter (+/- 2,385) = 11280 MB/s
test slice::reverse_u8x3        ... bench:  181,223 ns/iter (+/- 6,541) =  5786 MB/s
```
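
For reference, the `reverse!` macro invoked above presumably expands to one #[bench] per element type along these lines (a sketch only; the element count, setup, and use of `b.bytes` are assumptions, not the PR's actual benchmark code, and the harness requires nightly's `test` crate):

```rust
#![feature(test)]
extern crate test;

macro_rules! reverse {
    ($name:ident, $ty:ty) => {
        #[bench]
        fn $name(b: &mut test::Bencher) {
            // Enough elements that per-iteration overhead is negligible.
            let mut v: Vec<$ty> = (0..100_000).map(|i| i as $ty).collect();
            // Report throughput (MB/s) based on bytes touched per iteration.
            b.bytes = (v.len() * std::mem::size_of::<$ty>()) as u64;
            b.iter(|| {
                v.reverse();
                test::black_box(&v);
            });
        }
    };
}

reverse!(reverse_u8, u8);
reverse!(reverse_u16, u16);
reverse!(reverse_u32, u32);
reverse!(reverse_u64, u64);
reverse!(reverse_u128, u128);
```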

arielb1 (Contributor) commented May 9, 2017

@bluss are you planning on reviewing this PR?

alexcrichton added the T-libs-api label (Relevant to the library API team, which will review and decide on the PR/issue.) May 9, 2017
brson (Contributor) commented May 9, 2017

@bors r+

bors (Contributor) commented May 9, 2017

📌 Commit da91361 has been approved by brson

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 10, 2017: Make [u8]::reverse() 5x faster
bors (Contributor) commented May 10, 2017

⌛ Testing commit da91361 with merge 2b97174...

bors added a commit that referenced this pull request May 10, 2017: Make [u8]::reverse() 5x faster
bors (Contributor) commented May 10, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: brson
Pushing 2b97174 to master...

bors merged commit da91361 into rust-lang:master May 10, 2017
scottmcm deleted the faster-reverse branch May 10, 2017 18:00
gnzlbg (Contributor) commented May 18, 2017

The LLVM byte-swap intrinsic works fine on ARM as well :/

scottmcm (Member, Author) commented May 18, 2017

@gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access, so I doubted that read_unaligned+swap_bytes+write_unaligned would be better there than just directly reversing the bytes. Hopefully someone has more hardware for benchmarking than I and can make a better cfg! check that applies this on the ARM versions where it's worthwhile, if any. (As a guess, maybe ARMv6+ that only accesses Normal, not Device, memory?)
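
A "better cfg! check" could simply widen the guard once benchmarks justified it; something like the hypothetical sketch below, where `aarch64` is only an illustrative guess, not a tested or proposed change:

```rust
/// Hypothetical: the current x86/x86_64 guard, with room to grow if ARM
/// benchmarks show the unaligned-word path is a win there too.
fn use_unaligned_word_swap() -> bool {
    cfg!(any(
        target_arch = "x86",
        // Candidates to evaluate on real hardware before enabling, e.g.:
        // target_arch = "aarch64",
        target_arch = "x86_64"
    ))
}

fn main() {
    println!("word-swap path enabled: {}", use_unaligned_word_swap());
}
```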

gnzlbg (Contributor) commented May 18, 2017

> @gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access,

Oh, I did not know that.

> Hopefully someone has more hardware for benchmarking than I and can make a better cfg! check that applies this on the ARM versions where it's worthwhile, if any. (As a guess, maybe ARMv6+ that only accesses Normal, not Device, memory?)

This makes sense; maybe we should file an issue and mark it with the easy tag + ARM or something. One would just need to try this on ARMv6/v7/v8 and choose an appropriate set of features.
