
ARM NEON SIMD instruction set support #39

Closed
furuame opened this issue Mar 28, 2018 · 33 comments · Fixed by #193

@furuame
Member

furuame commented Mar 28, 2018

Reference:

@furuame furuame self-assigned this Mar 28, 2018
@furuame furuame added the enhancement New feature or request label Mar 28, 2018
@furuame
Member Author

furuame commented Apr 10, 2018

There is NEON implementation in iotaledger/giota

@furuame furuame added the help wanted Extra attention is needed label May 14, 2018
@furuame furuame removed the help wanted Extra attention is needed label May 24, 2018
@jserv
Member

jserv commented May 27, 2018

pow_cARM_ARM64.go does not actually make use of NEON intrinsics. Thus, it is expected to be slower than the C version.

@jserv
Member

jserv commented May 28, 2018

Since we have sse2neon to translate SSE intrinsics into corresponding NEON, we can concentrate on SSE implementation of PoW first.

@jserv
Member

jserv commented Jul 15, 2018

The branch wip/arm-neon illustrates the slowdown of the non-optimized NEON implementation on the AArch64 target, which was derived from the SSE port. It is worth validating the SSE PoW first.

@furuame
Member Author

furuame commented Jul 23, 2018

I will modify the SSE implementation to explicitly use SSE intrinsics and observe the performance with sse2neon.

@DLTcollab DLTcollab deleted a comment from furuame Jul 23, 2018
@jserv
Member

jserv commented Jul 23, 2018

@chenwei-tw, looking forward to your progress on SSE. In the meantime, @ajblane is preparing utilities to measure the hash rate.

@furuame
Member Author

furuame commented Jul 24, 2018

Commit 8843008, which explicitly uses SSE intrinsics, has been added.
More performance measurements are needed.

ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_c

real    0m1.265s
user    0m8.668s
sys     0m0.004s
ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_neon

real    0m2.563s
user    0m17.664s
sys     0m0.084s

@furuame
Member Author

furuame commented Jul 26, 2018

After running a thousand trytes, the NEON implementation still takes roughly twice as long as the C implementation.

ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_c

real    72m56.953s     
user    505m46.668s    
sys     0m24.384s      
ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_neon        
               
real    166m22.774s    
user    1158m42.368s   
sys     0m57.844s      

@jserv
Member

jserv commented Jul 26, 2018

What is the performance gain from commit 8843008 on bare Intel processors?

@furuame
Member Author

furuame commented Jul 26, 2018

It works well on my PC.

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Stepping:            3
CPU MHz:             3765.850
CPU max MHz:         3900.0000
CPU min MHz:         800.0000
BogoMIPS:            6784.56
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K

Result (running 1000 trytes sequentially, which is the same as the above case on ARM server)

cwei@workstation ~/W/dcurl>
time --format="real %E sec\nuser %U sec\nsys %S sec\n" python3 tests/arg-pow.py arg-pow_sse
real 22:59.92 sec
user 8089.31 sec
sys 2.76 sec

cwei@workstation ~/W/dcurl>
time --format="real %E sec\nuser %U sec\nsys %S sec\n" python3 tests/arg-pow.py arg-pow_c
real 35:23.62 sec
user 12515.96 sec
sys 3.13 sec

@marktwtn
Collaborator

@jserv, should we keep investigating the performance change before and after applying sse2neon?

@jserv
Member

jserv commented Jun 16, 2019

We don't have to rely on a NEON-optimized PoW since we can redirect to the FPGA-based computing cluster. Instead, I just wonder if the existing trit/trytes conversion routines can benefit from sse2neon.

@marktwtn
Collaborator

We don't have to rely on a NEON-optimized PoW since we can redirect to the FPGA-based computing cluster. Instead, I just wonder if the existing trit/trytes conversion routines can benefit from sse2neon.

I'll give it a try.

@marktwtn
Collaborator

The current sse2neon lacks conversions for some intrinsic functions:

  • SSSE3
    • _mm_shuffle_epi8
  • SSE4.1
    • _mm_test_all_zeros
  • SSE4.2
    • _mm_cmpistrm

_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in the trits/trytes conversion.

@jserv
Member

jserv commented Jun 17, 2019

_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in the trits/trytes conversion.

For _mm_shuffle_epi8, you can check the implementation in cm256cc. However, I don't think there is an equivalent instruction available on Arm that provides _mm_cmpistrm compatibility.

@jserv jserv assigned marktwtn and unassigned furuame Jun 17, 2019
@marktwtn
Collaborator

_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in the trits/trytes conversion.

For _mm_shuffle_epi8, you can check the implementation in cm256cc. However, I don't think there is an equivalent instruction available on Arm that provides _mm_cmpistrm compatibility.

In the NEON intrinsics and Helium MVE intrinsics pages, I cannot find any intrinsic function suitable to replace _mm_cmpistrm; it is too complicated.
I am not even sure it is possible to replace it with a combination of several NEON intrinsics.

@marktwtn
Collaborator

However, only validateTrytes_sse42() and trits_from_trytes_sse42() use the _mm_cmpistrm intrinsic function.
The other functions, such as validateTrits_sse42() and trytes_from_trits_sse42(), might work once the _mm_shuffle_epi8 conversion is added to sse2neon.

@jserv
Member

jserv commented Jun 24, 2019

Once sse2neon is usable, I will transfer the ownership to DLTcollab. Feel free to send pull requests.

@marktwtn marktwtn added this to the Watercress milestone Jun 25, 2019
@marktwtn
Collaborator

validateTrits_sse42() and trytes_from_trits_sse42() use intrinsic functions like _mm_shuffle_epi8 and _mm_test_all_zeros.
Their conversions are not supported in sse2neon.

I have written a conversion of _mm_test_all_zeros and copied the source code of the _mm_shuffle_epi8 conversion from the cm256cc implementation.

dcurl can be compiled successfully if we remove the source code related to validateTrytes_sse42() and trits_from_trytes_sse42().

I will do a brief performance measurement, and if the conversions behave well, I'll send pull requests to the sse2neon project.

@marktwtn
Collaborator

validateTrits()

graph

The performance measurement covers around 1150 data points with different input trit sizes.
The overall performance is good.
However, as shown in the chart, it may take more time in rare cases.
The ratio is around 3/1150.


trytes_from_trits()

graph

Unfortunately, it behaves worse.
Here are the possible reasons:

  1. The input trit size is small.
    Input trit sizes: 81 and 243.
  2. The corresponding sse2neon instruction is implemented in software.
    For example, trytes_from_trits_sse42() uses the _mm_shuffle_epi8 intrinsic function.
    However, the sse2neon version of _mm_shuffle_epi8 is mostly implemented in software,
    so we do not benefit from the ARM hardware.

@jserv
Member

jserv commented Jun 29, 2019

I noticed the performance slowdown of _mm_shuffle_epi8 in sse2neon. It might be the reason why NVIDIA (the original sse2neon author) did not provide this instruction.

@jserv
Member

jserv commented Jun 29, 2019

Another reference regarding NEON transition: https://lemire.me/blog/2017/10/27/fast-integer-compression-with-stream-vbyte-on-arm-neon/

@marktwtn
Collaborator

Another reference regarding NEON transition: https://lemire.me/blog/2017/10/27/fast-integer-compression-with-stream-vbyte-on-arm-neon/

It looks workable.
I'll give it a try and do the performance measurement again.

@marktwtn
Collaborator

marktwtn commented Jul 2, 2019

I changed the implementation of _mm_shuffle_epi8 from software emulation to a hardware instruction.

trytes_from_trits() performance chart with input size 81 and 243:
graph

The hardware implementation is much better than the software one.
However, it is still not good enough:
the trytes_from_trits() performance with the enhanced sse2neon conversion is still worse than the original implementation.
Since no other intrinsic function conversion used by trytes_from_trits() is implemented in software anymore, I suspect the remaining cause is the small input size.

@marktwtn
Collaborator

marktwtn commented Jul 2, 2019

Here is another interesting statistic:

The average time of trytes_from_trits()

| Input size (trits) | Average time (with sse2neon) | Average time (without sse2neon) |
|---|---|---|
| 81 | 2940.1 nsec | 2070.4 nsec |
| 243 | 1592.6 nsec | 1937.1 nsec |

The result looks odd at first, but it is reasonable.

Data can be accelerated with SIMD only when it is at least 128 bits (16 bytes) wide.
If the data is not large enough, it is handled by the original implementation, which is not accelerated at all.

81 trits = 27 trytes; 27 / 16 = 1 remainder 11 (16 trytes are accelerated, 11 are not)
243 trits = 81 trytes; 81 / 16 = 5 remainder 1 (80 trytes are accelerated, 1 is not)

Although I have not run extra experiments, the acceleration looks promising for large input sizes.

@jserv
Member

jserv commented Jul 2, 2019

The data can be accelerated with the SIMD only if the data is as large as 128-bit(16-byte).
If the data is not large enough, it will be handled with the original implementation, which does not accelerate at all.

This means an appropriate threshold is necessary to perform runtime dispatching between the C routine and the Arm NEON implementation.

@marktwtn
Collaborator

marktwtn commented Jul 8, 2019

The latest trytes_from_trits() performance chart:

graph

@jserv
Member

jserv commented Jul 8, 2019

Have you found the regular patterns when NEON-based operations are superior?

@marktwtn
Collaborator

marktwtn commented Jul 9, 2019

Have you found the regular patterns when NEON-based operations are superior?

Not yet. It needs more experiments.
For now I will focus on applying sse2neon to the trinary-related functions on the de10-nano board.

@marktwtn
Collaborator

Unfortunately, trytes_from_trits() does not work on the de10-nano board, since the NEON version of the implementation uses the intrinsic function vqtbl1q_s8().

According to the official documentation, vqtbl1q_s8() is only supported on the A64 instruction set.
Our de10-nano board has a Cortex-A9, which is a 32-bit architecture.

@jserv
Member

jserv commented Jul 10, 2019

According to the official documentation, vqtbl1q_s8() is only supported on the A64 instruction set.
Our de10-nano board has a Cortex-A9, which is a 32-bit architecture.

So we need to add a software fallback to sse2neon. I would expect enabling validateTrits with a NEON fast path; the other operations remain C code on both ARMv7-A and ARMv8-A.

@marktwtn
Collaborator

So we need to add a software fallback to sse2neon. I would expect enabling validateTrits with a NEON fast path; the other operations remain C code on both ARMv7-A and ARMv8-A.

Since we need to include the header file sse2neon.h, should we make sse2neon a submodule or just ask the user to download it before building dcurl on ARM hardware?

@jserv
Member

jserv commented Sep 13, 2019

Since we need to include the header file sse2neon.h, should we make sse2neon a submodule or just ask the user to download it before building dcurl on ARM hardware?

Let's do it by updating the submodule list.

marktwtn added a commit to marktwtn/dcurl that referenced this issue Sep 14, 2019
A new project sse2neon is added as git submodule to allow dcurl running
on ARM architecture to use SIMD acceleration without writing NEON
intrinsic functions.

Close DLTcollab#39.