
ARM NEON SIMD instruction set support #39

Closed
furuame opened this issue Mar 28, 2018 · 33 comments · Fixed by #193

@furuame
Member

furuame commented Mar 28, 2018

Reference:

@furuame furuame self-assigned this Mar 28, 2018
@furuame furuame added the enhancement New feature or request label Mar 28, 2018
@furuame
Member Author

furuame commented Apr 10, 2018

There is NEON implementation in iotaledger/giota

@furuame furuame added the help wanted Extra attention is needed label May 14, 2018
@furuame furuame removed the help wanted Extra attention is needed label May 24, 2018
@jserv
Member

jserv commented May 27, 2018

pow_cARM_ARM64.go does not actually make use of NEON intrinsics. Thus, it is expected to be slower than the C version.

@jserv
Member

jserv commented May 28, 2018

Since we have sse2neon to translate SSE intrinsics into corresponding NEON, we can concentrate on SSE implementation of PoW first.

@jserv
Member

jserv commented Jul 15, 2018

The branch wip/arm-neon illustrates the slowdown of the non-optimized NEON implementation on the AArch64 target, which was derived from the SSE port. It is worth validating the SSE PoW first.

@furuame
Member Author

furuame commented Jul 23, 2018

I will modify the SSE implementation to explicitly use SSE intrinsics and observe the performance with sse2neon.

@DLTcollab DLTcollab deleted a comment from furuame Jul 23, 2018
@jserv
Member

jserv commented Jul 23, 2018

@chenwei-tw, looking forward to your progress on SSE. In the meantime, @ajblane is preparing utilities to measure the hash rate.

@furuame
Member Author

furuame commented Jul 24, 2018

Commit 8843008, which explicitly uses SSE intrinsics, has been added.
More performance measurements are needed.

ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_c

real    0m1.265s
user    0m8.668s
sys     0m0.004s
ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_neon

real    0m2.563s
user    0m17.664s
sys     0m0.084s

@furuame
Member Author

furuame commented Jul 26, 2018

After running a thousand trytes, the NEON implementation still takes roughly twice as long as the C implementation.

ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_c

real    72m56.953s     
user    505m46.668s    
sys     0m24.384s      
ubuntu@node2-puyuma:~/dcurl$ time build/test-pow_neon        
               
real    166m22.774s    
user    1158m42.368s   
sys     0m57.844s      

@jserv
Member

jserv commented Jul 26, 2018

What is the performance gain from commit 8843008 on bare Intel processors?

@furuame
Member Author

furuame commented Jul 26, 2018

It works well on my PC.

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Stepping:            3
CPU MHz:             3765.850
CPU max MHz:         3900.0000
CPU min MHz:         800.0000
BogoMIPS:            6784.56
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K

Result (running 1000 trytes sequentially, which is the same as the above case on ARM server)

cwei@workstation ~/W/dcurl>
time --format="real %E sec\nuser %U sec\nsys %S sec\n" python3 tests/arg-pow.py arg-pow_sse
real 22:59.92 sec
user 8089.31 sec
sys 2.76 sec

cwei@workstation ~/W/dcurl>
time --format="real %E sec\nuser %U sec\nsys %S sec\n" python3 tests/arg-pow.py arg-pow_c
real 35:23.62 sec
user 12515.96 sec
sys 3.13 sec

@marktwtn
Collaborator

@jserv, should we keep investigating the performance change before and after applying sse2neon?

@jserv
Member

jserv commented Jun 16, 2019

We don't have to rely on a NEON-optimized PoW since we can redirect to the FPGA-based computing cluster. Instead, I just wonder if the existing trit/trytes conversion routines can benefit from sse2neon.

@marktwtn
Collaborator

We don't have to rely on a NEON-optimized PoW since we can redirect to the FPGA-based computing cluster. Instead, I just wonder if the existing trit/trytes conversion routines can benefit from sse2neon.

I'll give it a try.

@marktwtn
Collaborator

The current sse2neon lacks conversions for some intrinsic functions:

  • SSSE3
    • _mm_shuffle_epi8
  • SSE4.1
    • _mm_test_all_zeros
  • SSE4.2
    • _mm_cmpistrm

_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in the trits/trytes conversion.

@jserv
Member

jserv commented Jun 17, 2019

_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in the trits/trytes conversion.

For _mm_shuffle_epi8, you can check the implementation in cm256cc. However, I don't think there is an equivalent instruction available on Arm that provides _mm_cmpistrm compatibility.

@jserv jserv assigned marktwtn and unassigned furuame Jun 17, 2019
@marktwtn
Collaborator

_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in the trits/trytes conversion.

For _mm_shuffle_epi8, you can check the implementation in cm256cc. However, I don't think there is an equivalent instruction available on Arm that provides _mm_cmpistrm compatibility.

In the NEON intrinsics and Helium MVE intrinsics pages, I cannot find any intrinsic function suitable to replace _mm_cmpistrm; it is too complicated.
I am not even sure it is possible to replace it with a combination of several NEON intrinsics.

@marktwtn
Collaborator

However, only validateTrytes_sse42() and trits_from_trytes_sse42() use the _mm_cmpistrm intrinsic function.
The other functions, such as validateTrits_sse42() and trytes_from_trits_sse42(), might work once the _mm_shuffle_epi8 conversion is added to sse2neon.

@jserv
Member

jserv commented Jun 24, 2019

Once sse2neon is usable, I will transfer the ownership to DLTcollab. Feel free to send pull requests.

@marktwtn marktwtn added this to the Watercress milestone Jun 25, 2019
@marktwtn
Collaborator

validateTrits_sse42() and trytes_from_trits_sse42() use intrinsic functions like _mm_shuffle_epi8 and _mm_test_all_zeros.
Their conversions are not supported in sse2neon.

I have written a conversion of _mm_test_all_zeros and copied the source code of the _mm_shuffle_epi8 conversion from the cm256cc implementation.

dcurl can be compiled successfully if we remove the source code related to validateTrytes_sse42() and trits_from_trytes_sse42().

I will do a brief performance measurement, and if the conversions behave well, I'll send pull requests to the sse2neon project.

@marktwtn
Collaborator

validateTrits()

graph

The performance measurement covers around 1150 data points with different input trit sizes.
The overall performance is good.
However, as shown in the chart, it may take more time in rare cases.
The ratio is around 3/1150.


trytes_from_trits()

graph

Unfortunately, it behaves worse.
Here are the possible reasons:

  1. The input trit size is small.
    Input trit sizes: 81 and 243.
  2. The corresponding sse2neon instruction is implemented in software.
    For example, trytes_from_trits_sse42() uses the _mm_shuffle_epi8 intrinsic function.
    However, the sse2neon version of _mm_shuffle_epi8 is mostly implemented in software,
    so we do not benefit from the ARM hardware.

@jserv
Member

jserv commented Jun 29, 2019

I noticed the performance slowdown of _mm_shuffle_epi8 in sse2neon. It might be the reason why NVIDIA (the original sse2neon author) did not provide this instruction.

@jserv
Member

jserv commented Jun 29, 2019

Another reference regarding NEON transition: https://lemire.me/blog/2017/10/27/fast-integer-compression-with-stream-vbyte-on-arm-neon/

@marktwtn
Collaborator

Another reference regarding NEON transition: https://lemire.me/blog/2017/10/27/fast-integer-compression-with-stream-vbyte-on-arm-neon/

It looks workable.
I'll give it a try and do the performance measurement again.

@marktwtn
Collaborator

marktwtn commented Jul 2, 2019

I changed the implementation of _mm_shuffle_epi8 from software emulation to a hardware instruction.

trytes_from_trits() performance chart with input size 81 and 243:
graph

The hardware implementation is much better than the software one.
However, it is still not good enough:
the trytes_from_trits() performance with the enhanced sse2neon conversion is still worse than the original implementation.
Since no other intrinsic function conversion used by trytes_from_trits() is implemented in software anymore, I suspect the remaining cause is the small input size.

@marktwtn
Collaborator

marktwtn commented Jul 2, 2019

Here is another interesting statistic:

The average time of trytes_from_trits()

| Input size (trits) | Average time (with sse2neon) | Average time (without sse2neon) |
|---|---|---|
| 81 | 2940.1 nsec | 2070.4 nsec |
| 243 | 1592.6 nsec | 1937.1 nsec |

The result looks odd at first, but it is reasonable.

Data can be accelerated with SIMD only when it is at least 128 bits (16 bytes) wide.
If the data is not large enough, it is handled by the original implementation, which is not accelerated at all.

81 trits = 27 trytes; 27 / 16 = 1 remainder 11 (16 trytes are accelerated, 11 are not)
243 trits = 81 trytes; 81 / 16 = 5 remainder 1 (80 trytes are accelerated, 1 is not)

Although I have not run extra experiments, the acceleration looks promising for large input sizes.

@jserv
Member

jserv commented Jul 2, 2019

The data can be accelerated with the SIMD only if the data is as large as 128-bit(16-byte).
If the data is not large enough, it will be handled with the original implementation, which does not accelerate at all.

This means an appropriate threshold is necessary to perform runtime dispatching between the C routine and the Arm NEON implementation.

@marktwtn
Collaborator

marktwtn commented Jul 8, 2019

The latest trytes_from_trits() performance chart:

graph

@jserv
Member

jserv commented Jul 8, 2019

Have you found the regular patterns when NEON-based operations are superior?

@marktwtn
Collaborator

marktwtn commented Jul 9, 2019

Have you found the regular patterns when NEON-based operations are superior?

Not yet. It needs more experiments.
For now I will focus on applying sse2neon to the trinary-related functions on the de10-nano board.

@marktwtn
Collaborator

Unfortunately, trytes_from_trits() does not work on the de10-nano board, since the NEON version of the implementation uses the intrinsic function vqtbl1q_s8().

According to the official documentation, vqtbl1q_s8() is only supported on the A64 instruction set.
Our de10-nano board has a Cortex-A9, which is a 32-bit architecture.

@jserv
Member

jserv commented Jul 10, 2019

According to the official documentation, vqtbl1q_s8() is only supported on the A64 instruction set.
Our de10-nano board has a Cortex-A9, which is a 32-bit architecture.

So we need to add a software fallback to sse2neon. I would expect enabling validateTrits with a NEON fast path; the other operations remain C code on both ARMv7-A and ARMv8-A.

@marktwtn
Collaborator

So we need to add a software fallback to sse2neon. I would expect enabling validateTrits with a NEON fast path; the other operations remain C code on both ARMv7-A and ARMv8-A.

Since we need to include the header file sse2neon.h, should we make sse2neon a submodule or just ask the user to download it before building dcurl on ARM hardware?

@jserv
Member

jserv commented Sep 13, 2019

Since we need to include the header file sse2neon.h, should we make sse2neon a submodule or just ask the user to download it before building dcurl on ARM hardware?

Let's do it by updating the submodule list.

marktwtn added a commit to marktwtn/dcurl that referenced this issue Sep 14, 2019
A new project sse2neon is added as git submodule to allow dcurl running
on ARM architecture to use SIMD acceleration without writing NEON
intrinsic functions.

Close DLTcollab#39.