WIP: SVE intrinsics implementation of CSR SpMV and Merge-SpMV algorithms #1501

stanisic · 2023-12-07T10:49:47Z

This PR provides two implementations of CSR SpMV (the "traditional" one and Merge-SpMV from https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf ) using SVE intrinsics for double precision. The PR is far from integration-ready and should be considered an example of what such an implementation could look like. Eventually, the suggestions from PR #1497 about RHS, integration (a->get_strategy()), and OpenMP scheduling should also be applied. To ease testing, I put the current implementation in place of the OpenMP CSR SpMV, although it should probably live in a completely separate (possibly new) part of Ginkgo.
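For readers unfamiliar with Merge-SpMV, here is a minimal scalar sketch of its partitioning step, following the diagonal search described in the linked paper. It is not the code from this PR: `MergeCoord` and `merge_path_search` are hypothetical names, and 64-bit indices are assumed for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Position on the merge path: rows consumed and nonzeros consumed so far.
struct MergeCoord {
    std::int64_t row;
    std::int64_t nz;
};

// Find where the given diagonal crosses the merge path.
// row_end_offsets points at row_ptrs + 1 (num_rows entries); the second
// merge list is the implicit sequence 0, 1, ..., nnz - 1.
inline MergeCoord merge_path_search(std::int64_t diagonal,
                                    const std::int64_t* row_end_offsets,
                                    std::int64_t num_rows, std::int64_t nnz)
{
    assert(diagonal >= 0 && diagonal <= num_rows + nnz);
    std::int64_t lo = std::max<std::int64_t>(diagonal - nnz, 0);
    std::int64_t hi = std::min(diagonal, num_rows);
    // Binary search for the first row whose end offset lies beyond the
    // nonzero index paired with it on this diagonal.
    while (lo < hi) {
        const std::int64_t mid = lo + (hi - lo) / 2;
        if (row_end_offsets[mid] <= diagonal - mid - 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return {lo, diagonal - lo};
}
```

Each worker would call this for its start and end diagonals (equal-sized slices of the `num_rows + nnz` path) and then process the rows and nonzeros between the two coordinates, so the work is balanced even when row lengths vary widely.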

The motivation for using SVE intrinsics is performance. Hand-written SVE intrinsics can vectorize significantly better on Arm machines that support SVE (Fujitsu A64FX, Amazon Graviton, Nvidia Grace...), since GCC auto-vectorization of the CSR kernel seems to be poor. With this implementation we have measured performance improvements of up to 80% for bone010.mtx on Fujitsu A64FX and up to 36% for thermal2.mtx on Amazon Graviton3.

Unlike AVX intrinsics, SVE allows vector-length-agnostic implementations, which leads to cleaner code. The code in this PR works on both A64FX (512-bit vector length) and Graviton3 (256-bit vector length).
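To illustrate what vector-length-agnostic means in practice, here is a minimal sketch of a row-wise CSR SpMV in double precision. It is not the kernel from this PR: the standalone signature `csr_spmv_sketch` and the 64-bit index type are assumptions, and a scalar fallback is included so the snippet also builds on machines without SVE.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// y = A * x for a CSR matrix (hypothetical standalone signature,
// not Ginkgo's kernel interface).
void csr_spmv_sketch(std::int64_t num_rows, const std::int64_t* row_ptrs,
                     const std::int64_t* col_idxs, const double* values,
                     const double* x, double* y)
{
    for (std::int64_t row = 0; row < num_rows; ++row) {
        const std::int64_t begin = row_ptrs[row];
        const std::int64_t end = row_ptrs[row + 1];
        assert(begin <= end);
#if defined(__ARM_FEATURE_SVE)
        // svcntd() is the number of 64-bit lanes of the hardware vector, so
        // the same loop serves 256-bit and 512-bit implementations.
        const std::int64_t step = static_cast<std::int64_t>(svcntd());
        svfloat64_t acc = svdup_n_f64(0.0);
        for (std::int64_t k = begin; k < end; k += step) {
            // The predicate masks off lanes past the end of the row, so no
            // scalar remainder loop is needed.
            const svbool_t pg = svwhilelt_b64_s64(k, end);
            const svfloat64_t a = svld1_f64(pg, values + k);
            const svint64_t idx = svld1_s64(pg, col_idxs + k);
            const svfloat64_t xv = svld1_gather_s64index_f64(pg, x, idx);
            acc = svmla_f64_m(pg, acc, a, xv);  // acc += a * xv on active lanes
        }
        y[row] = svaddv_f64(svptrue_b64(), acc);  // horizontal sum
#else
        // Scalar fallback for targets without SVE.
        double sum = 0.0;
        for (std::int64_t k = begin; k < end; ++k) {
            sum += values[k] * x[col_idxs[k]];
        }
        y[row] = sum;
#endif
    }
}
```

Note that the vector path sums the products in a different order than the scalar loop, so results may differ in the last bits for general floating-point input.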

On the other hand, AFAIK there is no easy way to handle different data types (double, float, complex...) generically, so each one needs a separate intrinsics implementation. The code in this PR works only for double precision.

Finally, note that the OpenMP parallelization is commented out in the code. The reason is a known internal bug in the GCC compiler ( https://gcc.gnu.org/bugzilla//show_bug.cgi?id=101018 ) that sometimes occurs when OpenMP pragmas are combined with SVE intrinsics. I hope that other compilers do not have this issue and that the fix already committed to GCC is upstreamed soon. Once this problem is fixed, one should simply uncomment the OpenMP pragmas in this PR, and the code should run in parallel.
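As an alternative to commenting the pragmas in and out by hand, they could be guarded by the preprocessor. The sketch below shows the idea on a plain scalar row loop. `SPMV_SKETCH_NO_OPENMP` and `csr_spmv_rows` are hypothetical names, not existing Ginkgo options or functions, and the sketch makes no claim about whether a given GCC version is affected by the bug.

```cpp
#include <cassert>
#include <cstdint>

// Row-parallel CSR SpMV. Define SPMV_SKETCH_NO_OPENMP (hypothetical macro)
// to drop the pragma, e.g. on compilers affected by the bug above.
void csr_spmv_rows(std::int64_t num_rows, const std::int64_t* row_ptrs,
                   const std::int64_t* col_idxs, const double* values,
                   const double* x, double* y)
{
    assert(num_rows >= 0);
#if defined(_OPENMP) && !defined(SPMV_SKETCH_NO_OPENMP)
#pragma omp parallel for schedule(static)
#endif
    for (std::int64_t row = 0; row < num_rows; ++row) {
        double sum = 0.0;
        for (std::int64_t k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            sum += values[k] * x[col_idxs[k]];
        }
        // Each iteration writes a distinct y[row], so rows can be
        // processed independently.
        y[row] = sum;
    }
}
```

Without `-fopenmp` the macro `_OPENMP` is undefined and the loop simply runs serially.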

Implementation of traditional CSR SpMV and Merge-SpMV algorithms using SVE intrinsics (OpenMP for now disabled)

e309bdd
