ARM NEON SIMD instruction set support #39
There is a NEON implementation in iotaledger/giota. |
pow_cARM_ARM64.go does not actually use NEON intrinsics, so it is expected to be slower than the C version. |
Since we have sse2neon to translate SSE intrinsics into the corresponding NEON intrinsics, we can concentrate on the SSE implementation of PoW first. |
The wip/arm-neon branch illustrates the slowdown of the non-optimized NEON implementation on the AArch64 target, which was derived from the SSE port. It is worth validating the SSE PoW first. |
I will modify the SSE implementation to use SSE intrinsics explicitly and observe its performance with sse2neon. |
@chenwei-tw, looking forward to your progress on SSE. In the meantime, @ajblane is preparing utilities to measure hashrate. |
Commit 8843008, which explicitly uses SSE intrinsics, has been added.
|
After running a thousand trytes ...... the NEON implementation still seems 1x slower than the C implementation.
|
What is the performance gain from commit 8843008 on bare Intel processors? |
It works well on my PC.
Result (running 1000 trytes sequentially, the same as the case above on the ARM server)
|
We don't have to rely on NEON-optimized PoW since we can redirect to an FPGA-based computing cluster. Instead, I just wonder if we can benefit from the existing trit/trytes conversion routines by means of sse2neon. |
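For readers unfamiliar with the conversion being discussed, here is a minimal scalar sketch of what a trits-to-trytes routine computes. It assumes the usual IOTA encoding (three balanced trits per tryte, least significant trit first, alphabet `9A..Z`). The name `trytes_from_trits_scalar` and its signature are hypothetical and are not taken from dcurl.

```c
#include <stddef.h>
#include <stdint.h>

/* IOTA tryte alphabet: index 0 is '9', indices 1..26 are 'A'..'Z'. */
static const char TRYTE_ALPHABET[] = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Hypothetical helper for illustration, not dcurl's API.
 * Converts n_trits balanced trits (-1, 0, 1) into n_trits / 3 trytes.
 * Returns the number of trytes written, or 0 if n_trits is not a
 * multiple of 3 or a trit is out of range. `out` is not NUL-terminated. */
size_t trytes_from_trits_scalar(const int8_t *trits, size_t n_trits, char *out)
{
    if (n_trits % 3 != 0)
        return 0;
    for (size_t i = 0; i < n_trits; i += 3) {
        for (size_t k = 0; k < 3; k++)
            if (trits[i + k] < -1 || trits[i + k] > 1)
                return 0;
        /* Little-endian balanced ternary: the value is in [-13, 13]. */
        int v = trits[i] + 3 * trits[i + 1] + 9 * trits[i + 2];
        if (v < 0)
            v += 27; /* negative values map to the end of the alphabet */
        out[i / 3] = TRYTE_ALPHABET[v];
    }
    return n_trits / 3;
}
```

A scalar version like this is also handy as a reference to compare a SIMD port against.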
I'll give it a try. |
The current sse2neon lacks conversions for some intrinsic functions:
_mm_shuffle_epi8 and _mm_cmpistrm are especially important intrinsic functions in trits/trytes conversion. |
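To make the requirement concrete, the following is a scalar reference model of the documented `_mm_shuffle_epi8` (PSHUFB) behaviour on 16-byte vectors. It is only a sketch: `shuffle_bytes_ref` is a hypothetical name, and this is not how sse2neon implements the conversion. A model like this could serve as a test oracle for any NEON translation.

```c
#include <stdint.h>

/* Scalar reference model of _mm_shuffle_epi8 on 16-byte vectors.
 * shuffle_bytes_ref is a hypothetical name used for illustration. */
void shuffle_bytes_ref(const uint8_t src[16], const uint8_t idx[16],
                       uint8_t dst[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++) {
        /* A set high bit in the index byte zeroes the output lane;
         * otherwise the low 4 bits select a source byte. */
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
    }
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i]; /* copy via tmp so dst may alias src or idx */
}
```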
For |
In the NEON intrinsic and Helium MVE intrinsic pages, I did not find any intrinsic function which is suitable to replace |
However, only validateTrytes_sse42() and trits_from_trytes_sse42() use |
Once |
validateTrits_sse42() and trytes_from_trits_sse42() use the intrinsic functions like
I have written a conversion of
dcurl can be compiled successfully if we remove the validateTrytes_sse42() and trits_from_trytes_sse42() related source code. I will do a brief performance measurement, and if they behave well, I'll send pull requests to the sse2neon project. |
validateTrits(): the performance measurement has around 1150 data points with different input trit sizes.
trytes_from_trits(): unfortunately, it behaves worse.
|
I noticed the performance slowdown of |
Another reference regarding NEON transition: https://lemire.me/blog/2017/10/27/fast-integer-compression-with-stream-vbyte-on-arm-neon/ |
It looks workable. |
Here is another interesting statistic: The average time of trytes_from_trits()
The result looks weird, but it is reasonable. SIMD can accelerate the processing only if the data is at least 128 bits (16 bytes) long.
Although I have not done extra experiments, the acceleration looks promising for large input sizes. |
This means an appropriate threshold is necessary to perform runtime dispatching between the C routine and the Arm NEON implementation. |
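A minimal sketch of such size-based dispatching, under stated assumptions: `SIMD_MIN_TRITS` is a made-up cut-over value that would have to come from measurements, all function names are hypothetical, and the block-wise routine is a portable stand-in for the real NEON path (it walks 16-byte blocks and finishes with a scalar tail, but uses no intrinsics).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cut-over point in trits; a real value must come from measurement. */
#define SIMD_MIN_TRITS 16

/* Plain C path: returns 1 if every trit is -1, 0, or 1. */
int validate_trits_scalar(const int8_t *trits, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (trits[i] < -1 || trits[i] > 1)
            return 0;
    return 1;
}

/* Stand-in for the NEON path: checks 16-byte blocks, then a scalar tail. */
int validate_trits_blocks(const int8_t *trits, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        int8_t block[16];
        memcpy(block, trits + i, sizeof block);
        int bad = 0;
        for (int k = 0; k < 16; k++)
            bad |= (block[k] < -1) | (block[k] > 1);
        if (bad)
            return 0;
    }
    return validate_trits_scalar(trits + i, n - i);
}

/* Dispatcher: small inputs stay on the C routine, larger ones go block-wise. */
int validate_trits(const int8_t *trits, size_t n)
{
    if (n < SIMD_MIN_TRITS)
        return validate_trits_scalar(trits, n);
    return validate_trits_blocks(trits, n);
}
```

Keeping the threshold in one macro makes it easy to tune once more benchmark data is available.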
Have you found any regular pattern for when NEON-based operations are superior? |
Not yet. It needs more experiments. |
Unfortunately, the trytes_from_trits() does not work on
|
So, we need to add software fallback for |
Since we need to include the header file |
Let's do it by updating submodule list. |
A new project sse2neon is added as git submodule to allow dcurl running on ARM architecture to use SIMD acceleration without writing NEON intrinsic functions. Close DLTcollab#39.