WIP: SVE intrinsics implementation of CSR SpMV and Merge-SpMV algorithms #1501

Open: stanisic wants to merge 1 commit into base: develop

@stanisic stanisic commented Dec 7, 2023

This PR provides two implementations of CSR SpMV ("traditional" and Merge-SpMV, from https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf ) using SVE intrinsics for double precision. The PR is far from integration-ready and should be considered more of an example of what the implementation could look like. Eventually, one should also apply the suggestions from PR #1497 about RHS, integration (a->get_strategy()), and OpenMP scheduling. To ease testing, I put the current implementation in place of the OpenMP CSR SpMV, although it should probably live in a completely separate (completely new?) part of Ginkgo.
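For readers unfamiliar with Merge-SpMV, here is a minimal scalar sketch of the idea from the linked paper. It is not the code in this PR and uses no SVE. The names `merge_path_search` and `merge_spmv_sketch`, the 64-bit index type, and the `num_parts` parameter are all assumptions made for illustration. The work (rows plus nonzeros) is split into equal-sized chunks along the merge path, each chunk is processed independently, and partial row sums that straddle chunk boundaries are fixed up at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: find where diagonal `diag` crosses the merge path
// between the row end offsets (row_ptrs[1..num_rows]) and the nonzero
// indices 0..nnz-1. Returns (rows consumed, nonzeros consumed).
std::pair<std::int64_t, std::int64_t> merge_path_search(
    std::int64_t diag, const std::vector<std::int64_t>& row_ptrs,
    std::int64_t nnz)
{
    const auto num_rows = static_cast<std::int64_t>(row_ptrs.size()) - 1;
    std::int64_t lo = std::max<std::int64_t>(diag - nnz, 0);
    std::int64_t hi = std::min(diag, num_rows);
    while (lo < hi) {
        const std::int64_t mid = lo + (hi - lo) / 2;
        // Row `mid` ends at row_ptrs[mid + 1]; compare it with the nonzero
        // index that sits opposite to it on this diagonal.
        if (row_ptrs[mid + 1] <= diag - mid - 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return {lo, diag - lo};
}

// Hypothetical sketch: y = A * x, with the work split into `num_parts`
// chunks. The loop over `p` is the one that would run in parallel.
void merge_spmv_sketch(const std::vector<std::int64_t>& row_ptrs,
                       const std::vector<std::int64_t>& col_idxs,
                       const std::vector<double>& vals,
                       const std::vector<double>& x, std::vector<double>& y,
                       std::int64_t num_parts)
{
    assert(num_parts >= 1 && y.size() + 1 == row_ptrs.size());
    const auto num_rows = static_cast<std::int64_t>(row_ptrs.size()) - 1;
    const auto nnz = static_cast<std::int64_t>(vals.size());
    const std::int64_t total = num_rows + nnz;
    const std::int64_t per_part = (total + num_parts - 1) / num_parts;
    std::vector<std::int64_t> carry_row(num_parts);
    std::vector<double> carry_val(num_parts);

    for (std::int64_t p = 0; p < num_parts; ++p) {
        const std::int64_t d0 = std::min(p * per_part, total);
        const std::int64_t d1 = std::min(d0 + per_part, total);
        const auto start = merge_path_search(d0, row_ptrs, nnz);
        const auto stop = merge_path_search(d1, row_ptrs, nnz);
        std::int64_t k = start.second;
        // Rows whose end falls inside this chunk are written directly.
        for (std::int64_t row = start.first; row < stop.first; ++row) {
            double sum = 0.0;
            for (; k < row_ptrs[row + 1]; ++k) {
                sum += vals[k] * x[col_idxs[k]];
            }
            y[row] = sum;
        }
        // Leftover nonzeros belong to a row that ends in a later chunk.
        double partial = 0.0;
        for (; k < stop.second; ++k) {
            partial += vals[k] * x[col_idxs[k]];
        }
        carry_row[p] = stop.first;
        carry_val[p] = partial;
    }
    // Sequential fix-up: add the carried partial sums to their rows.
    for (std::int64_t p = 0; p < num_parts; ++p) {
        if (carry_row[p] < num_rows) {
            y[carry_row[p]] += carry_val[p];
        }
    }
}
```

Because every chunk holds the same number of merge-path steps, the load stays balanced even when row lengths vary widely, which is the main appeal over the row-wise "traditional" kernel.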

The motivation for having code with SVE intrinsics is performance. SVE intrinsics implementations can bring significantly better vectorization on Arm machines supporting SVE (Fujitsu A64FX, Amazon Graviton, Nvidia Grace...), since GCC auto-vectorization of the CSR kernel seems to be poor. We have measured up to 80% performance improvement for bone010.mtx on Fujitsu A64FX and up to 36% improvement for thermal2.mtx on an Amazon Graviton3 machine when using this implementation with SVE intrinsics.

Unlike AVX intrinsics, SVE allows vector-length-agnostic implementations, which leads to cleaner code. The code in this PR works on both A64FX (512-bit vector length) and Graviton3 (256-bit vector length).
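To illustrate what vector-length-agnostic means in practice, here is a minimal sketch of a row-wise CSR SpMV for double precision. It is not the code in this PR: the function name `csr_spmv_sve_sketch` and the use of 64-bit column indices are assumptions chosen to keep the gather simple. The loop never hard-codes a vector width; `svcntd()` reports the number of 64-bit lanes at run time, and the `svwhilelt` predicate masks off the row tail. A scalar fallback is included so the sketch also compiles on machines without SVE.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical sketch: y = A * x for a CSR matrix with double values and
// 64-bit indices (an assumption; 32-bit indices would need a widening load).
void csr_spmv_sve_sketch(const std::vector<std::int64_t>& row_ptrs,
                         const std::vector<std::int64_t>& col_idxs,
                         const std::vector<double>& vals,
                         const std::vector<double>& x, std::vector<double>& y)
{
    assert(y.size() + 1 == row_ptrs.size());
    const auto num_rows = static_cast<std::int64_t>(row_ptrs.size()) - 1;
    for (std::int64_t row = 0; row < num_rows; ++row) {
        const std::int64_t begin = row_ptrs[row];
        const std::int64_t end = row_ptrs[row + 1];
#if defined(__ARM_FEATURE_SVE)
        // Number of 64-bit lanes per vector, known only at run time.
        const auto lanes = static_cast<std::int64_t>(svcntd());
        svfloat64_t acc = svdup_n_f64(0.0);
        for (std::int64_t k = begin; k < end; k += lanes) {
            // Lanes with k + lane >= end are inactive in this iteration.
            const svbool_t pg = svwhilelt_b64_s64(k, end);
            const svfloat64_t a = svld1_f64(pg, &vals[k]);
            const svint64_t cols = svld1_s64(pg, &col_idxs[k]);
            // Gather x[col] for every active lane.
            const svfloat64_t xv =
                svld1_gather_s64index_f64(pg, x.data(), cols);
            // acc += a * xv on active lanes; inactive lanes keep acc.
            acc = svmla_f64_m(pg, acc, a, xv);
        }
        // Horizontal sum over all lanes gives the row result.
        y[row] = svaddv_f64(svptrue_b64(), acc);
#else
        // Scalar fallback for targets without SVE.
        double sum = 0.0;
        for (std::int64_t k = begin; k < end; ++k) {
            sum += vals[k] * x[col_idxs[k]];
        }
        y[row] = sum;
#endif
    }
}
```

The same binary should run unchanged on 256-bit and 512-bit SVE hardware, since the only width-dependent quantity is queried at run time. On GCC or Clang the SVE path would typically be enabled with a flag such as `-march=armv8-a+sve`.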

On the other hand, AFAIK there is no easy way to deal with different data types (double, float, complex...), and one needs a separate intrinsics implementation for each. The code in this PR works only for double precision.

Finally, note that the OpenMP parallelization is commented out in the code. The reason is a known internal bug in the GCC compiler ( https://gcc.gnu.org/bugzilla//show_bug.cgi?id=101018 ) that sometimes occurs when OpenMP pragmas are combined with SVE intrinsics. I hope that other compilers do not have this issue and that the fix already committed to GCC is upstreamed soon. Once this problem is fixed, one should simply uncomment the OpenMP pragmas in this PR, and the code should work in parallel.

…g SVE intrinsics (OpenMP for now disabled)