Make [u8]::reverse() 5x faster #41764

Merged · 3 commits merged into rust-lang:master from scottmcm:faster-reverse on May 10, 2017
Conversation

scottmcm (Member) commented May 5, 2017

Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5.

Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16.

Thank you ptr::*_unaligned for making this easy :)
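
The core of the approach looks roughly like the sketch below (illustrative only, not the PR's exact code; the function name, element counts, and structure are made up here): walk inward from both ends one machine word at a time, byte-reverse each word with swap_bytes (LLVM's bswap), store the words crosswise, and fall back to plain element swaps for whatever is left in the middle.

```rust
use std::mem;
use std::ptr;

/// Reverse a byte slice by loading a word from each end, byte-swapping it,
/// and storing it back at the opposite end.
fn reverse_u8(slice: &mut [u8]) {
    const W: usize = mem::size_of::<u64>();
    // Stand-in for the PR's guard: only take the unaligned-word path on
    // targets where unaligned loads/stores are cheap.
    let fast_unaligned = cfg!(any(target_arch = "x86", target_arch = "x86_64"));

    let mut i = 0;
    let mut j = slice.len();
    if fast_unaligned {
        let p = slice.as_mut_ptr();
        // As long as a full word remains on each side, swap whole words.
        while j - i >= 2 * W {
            unsafe {
                let front = ptr::read_unaligned(p.add(i) as *const u64);
                let back = ptr::read_unaligned(p.add(j - W) as *const u64);
                ptr::write_unaligned(p.add(i) as *mut u64, back.swap_bytes());
                ptr::write_unaligned(p.add(j - W) as *mut u64, front.swap_bytes());
            }
            i += W;
            j -= W;
        }
    }
    // Finish the remaining middle (or everything, on other targets) bytewise.
    while i + 1 < j {
        slice.swap(i, j - 1);
        i += 1;
        j -= 1;
    }
}

fn main() {
    let mut v: Vec<u8> = (0u8..100).cycle().take(1000).collect();
    let mut expected = v.clone();
    expected.reverse();
    reverse_u8(&mut v);
    assert_eq!(v, expected);
}
```

For [u16] the same pattern applies, except each u32 load holds two elements and is flipped with rotate_left(16) (a ROT16, which swaps the two u16 halves) rather than fully byte-reversed.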

Benchmark results (from my i5-2500K):

```text
# Before
test slice::reverse_u8      ... bench:  273,836 ns/iter (+/- 15,592) =  3829 MB/s
test slice::reverse_u16     ... bench:  139,793 ns/iter (+/- 17,748) =  7500 MB/s
test slice::reverse_u32     ... bench:   74,997 ns/iter  (+/- 5,130) = 13981 MB/s
test slice::reverse_u64     ... bench:   47,452 ns/iter  (+/- 2,213) = 22097 MB/s

# After
test slice::reverse_u8      ... bench:   52,170 ns/iter (+/- 3,962) = 20099 MB/s
test slice::reverse_u16     ... bench:   93,330 ns/iter (+/- 4,412) = 11235 MB/s
test slice::reverse_u32     ... bench:   74,731 ns/iter (+/- 1,425) = 14031 MB/s
test slice::reverse_u64     ... bench:   47,556 ns/iter (+/- 3,025) = 22049 MB/s
```

If you're curious about the assembly, instead of doing this

```
movzx	eax, byte ptr [rdi]
movzx	ecx, byte ptr [rsi]
mov	byte ptr [rdi], cl
mov	byte ptr [rsi], al
```

it does this

```
mov	rax, qword ptr [rdx]
mov	rbx, qword ptr [r11 + rcx - 8]
bswap	rbx
mov	qword ptr [rdx], rbx
bswap	rax
mov	qword ptr [r11 + rcx - 8], rax
```

rust-highfive (Collaborator) commented:

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

brson added the relnotes label (Marks issues that should be documented in the release notes of the next release.) May 5, 2017
brson (Contributor) commented May 5, 2017

This is a neat patch!

It combines several low level abstractions to do something very precise and unusual, and it's written clearly and all the abstractions disappear into a blip of assembly. Good Rust.

If you think the compiler should be doing this optimization (or better) itself maybe add a comment saying so, in case someday it should be removed.

It looks right to me but somebody else should review. Maybe r? @bluss?

rust-highfive assigned bluss and unassigned brson May 5, 2017
shepmaster added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) May 5, 2017
```rust
reverse!(reverse_u8, u8);
reverse!(reverse_u16, u16);
reverse!(reverse_u32, u32);
reverse!(reverse_u64, u64);
```
frewsxcv (Member) commented May 6, 2017

Should u128 also be here?

scottmcm (Member, Author) commented May 6, 2017

Makes sense to have all the primitives. Also added [u8;3] and Simd<[f64;4]> while I was at it, to show more of the perf range.

Results, from fastest to slowest:

```text
test slice::reverse_simd_f64x4  ... bench:   36,818 ns/iter (+/-   924) = 28479 MB/s
test slice::reverse_u128        ... bench:   41,797 ns/iter (+/- 3,127) = 25087 MB/s
test slice::reverse_u64         ... bench:   47,062 ns/iter (+/-   898) = 22280 MB/s
test slice::reverse_u8          ... bench:   51,678 ns/iter (+/- 3,819) = 20290 MB/s
test slice::reverse_u32         ... bench:   74,404 ns/iter (+/-   387) = 14092 MB/s
test slice::reverse_u16         ... bench:   92,952 ns/iter (+/- 2,385) = 11280 MB/s
test slice::reverse_u8x3        ... bench:  181,223 ns/iter (+/- 6,541) =  5786 MB/s
```
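
For reference, the `reverse!` macro invoked above presumably expands to one #[bench] per element type along these lines (a sketch only; the element count, setup, and use of `b.bytes` are assumptions, not the PR's actual benchmark code, and the harness requires nightly's `test` crate):

```rust
#![feature(test)]
extern crate test;

macro_rules! reverse {
    ($name:ident, $ty:ty) => {
        #[bench]
        fn $name(b: &mut test::Bencher) {
            // Enough elements that per-iteration overhead is negligible.
            let mut v: Vec<$ty> = (0..100_000).map(|i| i as $ty).collect();
            // Report throughput (MB/s) based on bytes touched per iteration.
            b.bytes = (v.len() * std::mem::size_of::<$ty>()) as u64;
            b.iter(|| {
                v.reverse();
                test::black_box(&v);
            });
        }
    };
}

reverse!(reverse_u8, u8);
reverse!(reverse_u16, u16);
reverse!(reverse_u32, u32);
reverse!(reverse_u64, u64);
reverse!(reverse_u128, u128);
```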

arielb1 (Contributor) commented May 9, 2017

@bluss are you planning on reviewing this PR?

alexcrichton added the T-libs-api label (Relevant to the library API team, which will review and decide on the PR/issue.) May 9, 2017
brson (Contributor) commented May 9, 2017

@bors r+

bors (Contributor) commented May 9, 2017

📌 Commit da91361 has been approved by brson

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 10, 2017: Make [u8]::reverse() 5x faster
bors (Contributor) commented May 10, 2017

⌛ Testing commit da91361 with merge 2b97174...

bors added a commit that referenced this pull request May 10, 2017: Make [u8]::reverse() 5x faster
bors (Contributor) commented May 10, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: brson
Pushing 2b97174 to master...

bors merged commit da91361 into rust-lang:master May 10, 2017
scottmcm deleted the faster-reverse branch May 10, 2017 18:00
gnzlbg (Contributor) commented May 18, 2017

The LLVM byte-swap intrinsic works fine on ARM as well :/

scottmcm (Member, Author) commented May 18, 2017

@gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access, so I doubted that read_unaligned+swap_bytes+write_unaligned would be better there than just directly reversing the bytes. Hopefully someone has more hardware for benchmarking than I and can make a better cfg! check that applies this on the ARM versions where it's worthwhile, if any. (As a guess, maybe ARMv6+ that only accesses Normal, not Device, memory?)
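
A "better cfg! check" could simply widen the guard once benchmarks justified it; something like the hypothetical sketch below, where `aarch64` is only an illustrative guess, not a tested or proposed change:

```rust
/// Hypothetical: the current x86/x86_64 guard, with room to grow if ARM
/// benchmarks show the unaligned-word path is a win there too.
fn use_unaligned_word_swap() -> bool {
    cfg!(any(
        target_arch = "x86",
        // Candidates to evaluate on real hardware before enabling, e.g.:
        // target_arch = "aarch64",
        target_arch = "x86_64"
    ))
}

fn main() {
    println!("word-swap path enabled: {}", use_unaligned_word_swap());
}
```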

gnzlbg (Contributor) commented May 18, 2017

> @gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access,

Oh, I did not know that.

> Hopefully someone has more hardware for benchmarking than I and can make a better cfg! check that applies this on the ARM versions where it's worthwhile, if any. (As a guess, maybe ARMv6+ that only accesses Normal, not Device, memory?)

This makes sense; maybe we should file an issue and mark it with the easy tag + ARM or something. One would just need to try this on ARMv6/v7/v8 and choose an appropriate set of features.
