
OpenMP Merge-SpMV algorithm from Merrill & Garland for CSR SpMV #1497

Open · wants to merge 5 commits into base: develop

Conversation

@stanisic commented Dec 6, 2023

This PR brings the Merge-SpMV algorithm from Merrill & Garland for CSR SpMV. The implementation is based on the original article https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf and source code https://github.com/dumerrill/merge-spmv/blob/master/cpu_spmv.cpp from Merrill & Garland's 2016 work. It uses a "2D" approach to achieve better load balancing between the threads: instead of partitioning rows, it partitions the merged sequence of row boundaries and nonzeros evenly. In my extensive investigation on 6 Arm and x86 machines with 8 SuiteSparse matrices using Ginkgo 1.6, this algorithm performed between 4% worse and 80% better than the current CSR SpMV implementation.

The implementation in this PR was tested and optimized only for RHS=1, but in principle it should work for RHS>1 as well. Note that I have also tested the second variant of the Merge-SpMV algorithm (Algorithm 4 in https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf), but it always performed worse than the original Merge-SpMV algorithm (except for a few matrices on an A64FX machine, which I believe can be ignored).
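For reference, here is a minimal sketch of the merge-path coordinate search at the heart of the "2D" approach, paraphrased from the linked cpu_spmv.cpp (identifiers are illustrative, not this PR's exact code). List A is the row-end-offsets array, list B is the implicit counting sequence 0, 1, ..., nnz - 1 over the nonzeros; each thread binary-searches for the point where its diagonal crosses the merge path of the two lists.

```cpp
#include <algorithm>

// Find where the given diagonal crosses the merge path of
// A = row_end_offsets[0..num_rows) and B = 0..nnz-1.
template <typename IndexType>
void merge_path_search(IndexType diagonal, const IndexType* row_end_offsets,
                       IndexType num_rows, IndexType nnz,
                       IndexType& coord_x,  // rows consumed at the split
                       IndexType& coord_y)  // nonzeros consumed at the split
{
    auto x_min = std::max(diagonal - nnz, IndexType{0});
    auto x_max = std::min(diagonal, num_rows);
    while (x_min < x_max) {
        const auto pivot = x_min + (x_max - x_min) / 2;  // overflow-safe
        if (row_end_offsets[pivot] <= diagonal - pivot - 1) {
            x_min = pivot + 1;  // the crossing consumes more rows
        } else {
            x_max = pivot;  // the crossing consumes more nonzeros
        }
    }
    coord_x = x_min;
    coord_y = diagonal - x_min;
}
```

Splitting the total of num_rows + nnz merge items evenly over the threads is what gives each thread the same amount of work regardless of how the nonzeros are distributed across the rows.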

At the moment the code in this PR still lacks 3 things, which I leave to the experts to decide how best to handle:

  1. Integration with the rest of Ginkgo.
  2. If necessary, addition of any tests or comparisons with the default CSR SpMV.
  3. I guess my name and affiliation ("Luka Stanisic luka.stanisic@huawei.com, Huawei Technologies Duesseldorf GmbH") as well as the Merge-SpMV project (3-Clause BSD license, https://github.com/dumerrill/merge-spmv/blob/master/LICENSE.TXT) should be mentioned somewhere.

@upsj (Member) left a comment

First pass

auto value_carry_out_ptr = value_carry_out.get_data();

for (size_type j = 0; j < c->get_size()[1]; ++j) {
#pragma omp parallel for schedule(static)
@upsj (Member):

This should still be dynamic: merge path only balances the nonzeros, but doesn't take caching effects into account.

@stanisic (Author):

In theory yes, but in practice, based on my measurements on 5 different machines and 9 SuiteSparse matrices for several important Ginkgo kernels (including CSR SpMV), explicitly requesting a dynamic scheduling strategy decreases performance. So maybe we can find a middle ground and stick to the default #pragma omp parallel for without specifying a strategy?

@upsj (Member):

Without modifying the body of the loop, that makes sense, as you need to do more binary searches and have the scheduling overhead on top of that. I have an idea for removing the search overhead, though; I'll add a prototype soon.

@stanisic (Author):

I have done some performance measurements on a Huawei Kunpeng 920 with 8 SuiteSparse matrices, and the results show that using schedule(static) actually gives the best performance. Not specifying a schedule or using schedule(auto) is only slightly worse. This applies both to the "classical" CSR SpMV currently implemented in Ginkgo and to the proposed Merge-SpMV.

Dynamic scheduling performs terribly for the "classical" CSR SpMV (at least in my measurements on the Kunpeng 920). For Merge-SpMV, the gap between dynamic and the other scheduling policies is significantly smaller, but dynamic is still consistently the worst.

Therefore, I propose that this PR use either schedule(static) or no schedule clause at all. Any optimized scheduling approach (especially once the code is improved for multiple RHS) can be added later.
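For concreteness, the two proposed options written out on a toy loop (axpy-style placeholder body, not the PR's kernel):

```cpp
#include <cstddef>
#include <vector>

// Option 1: explicit static scheduling, which measured best above.
void axpy_static(std::vector<double>& y, const std::vector<double>& x)
{
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(y.size());
         ++i) {
        y[i] += 2.0 * x[i];
    }
}

// Option 2: no schedule clause, leaving the choice to the implementation.
void axpy_default(std::vector<double>& y, const std::vector<double>& x)
{
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(y.size());
         ++i) {
        y[i] += 2.0 * x[i];
    }
}
```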

/**
 * Computes the begin offsets into A and B for the specific diagonal
 */
template <typename IndexType>
inline void MergePathSearch(
@upsj (Member):

nit: snake_case

Suggested change
- inline void MergePathSearch(
+ inline void merge_path_search(

auto x_max = std::min(diagonal, A_len);

while (x_min < x_max) {
auto x_pivot = (x_min + x_max) >> 1;
@upsj (Member):

This can overflow, and it uses a shift that the compiler would emit anyway for a division; safer and clearer would be:

Suggested change
- auto x_pivot = (x_min + x_max) >> 1;
+ auto x_pivot = x_min + ((x_max - x_min) / 2);
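A toy demonstration of why the midpoint rewrite matters, assuming a 32-bit index type (not PR code):

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    std::int32_t x_min = 2'000'000'000;
    std::int32_t x_max = 2'100'000'000;
    // (x_min + x_max) >> 1 would first compute 4'100'000'000, which does
    // not fit in a 32-bit signed int: signed overflow, undefined behavior.
    // The rewritten form stays in range at every step:
    std::int32_t mid = x_min + (x_max - x_min) / 2;
    std::cout << mid << '\n';  // prints 2050000000
}
```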

auto row_carry_out_ptr = row_carry_out.get_data();
auto value_carry_out_ptr = value_carry_out.get_data();

for (size_type j = 0; j < c->get_size()[1]; ++j) {
@upsj (Member):

Ideally, we should handle multiple RHS by blocking inside the parallel loop; see the sketch below.
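A hypothetical sketch of what such blocking could look like, using a plain row-parallel CSR loop for brevity (the merge-path partitioning is omitted; all identifiers are illustrative, with b and c stored row-major with num_rhs columns):

```cpp
#include <cstddef>
#include <vector>

void csr_spmv_multi_rhs(std::size_t num_rows, std::size_t num_rhs,
                        const std::vector<int>& row_ptrs,
                        const std::vector<int>& col_idxs,
                        const std::vector<double>& vals,
                        const std::vector<double>& b, std::vector<double>& c)
{
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t row = 0; row < static_cast<std::ptrdiff_t>(num_rows);
         ++row) {
        for (std::size_t j = 0; j < num_rhs; ++j) {
            c[row * num_rhs + j] = 0.0;
        }
        for (auto nz = row_ptrs[row]; nz < row_ptrs[row + 1]; ++nz) {
            const auto val = vals[nz];
            const auto col = static_cast<std::size_t>(col_idxs[nz]);
            // innermost loop over the right-hand sides: each matrix entry
            // is loaded once and reused for all num_rhs columns
            for (std::size_t j = 0; j < num_rhs; ++j) {
                c[row * num_rhs + j] += val * b[col * num_rhs + j];
            }
        }
    }
}
```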

@stanisic (Author):

Do you suggest something similar to the approach in omp/matrix/coo_kernels.cpp? If yes, then it would require a bit more development and testing effort. Do you see it as part of this PR, or is this a comment for future optimizations?

Comment on lines 50 to 53
void merge_spmv(std::shared_ptr<const DefaultExecutor> exec, \
const matrix::Csr<MatrixValueType, IndexType>* a, \
const matrix::Dense<InputValueType>* b, \
matrix::Dense<OutputValueType>* c)
@upsj (Member):

The choice of which strategy to use is already present in a->get_strategy(), so we don't need a new kernel for it.

@stanisic (Author):

I see now, so you suggest removing this and adding branches in csr_kernels.cpp based on a->get_strategy(), similar to cuda/matrix/csr_kernels.template.cu, right?

@upsj (Member):

Exactly.
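A hedged sketch of that dispatch inside omp/matrix/csr_kernels.cpp, where the Ginkgo types are in scope (assuming the strategy object exposes get_name(); merge_spmv_impl and classical_spmv_impl are hypothetical placeholders for this PR's kernel and the existing one):

```cpp
template <typename ValueType, typename IndexType>
void spmv(std::shared_ptr<const DefaultExecutor> exec,
          const matrix::Csr<ValueType, IndexType>* a,
          const matrix::Dense<ValueType>* b, matrix::Dense<ValueType>* c)
{
    if (a->get_strategy()->get_name() == "merge_path") {
        merge_spmv_impl(exec, a, b, c);  // the kernel added by this PR
    } else {
        classical_spmv_impl(exec, a, b, c);  // existing row-parallel kernel
    }
}
```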

@yhmtsai (Member) left a comment

Thank you for implementing merge SpMV for the OpenMP CSR.
This is part 1/2 of my review, mainly covering formatting issues and some implementation comments.
If you have built Ginkgo with -DGINKGO_DEVEL_TOOLS=ON with pre-commit, you can run pre-commit run --from-ref "origin/develop" --to-ref HEAD to format this file; alternatively, installing clang-format-14 and formatting the file with it should be good enough.

Comment on lines +131 to +136
const int diagonal, ///< [in]The diagonal to search
const IndexType* A, ///< [in]List A
const int A_len, ///< [in]Length of A
const int B_len, ///< [in]Length of B
int& path_coordinate_x, ///< [out] (x) coordinate where diagonal intersects the merge path
int& path_coordinate_y) ///< [out] (y) coordinate where diagonal intersects the merge path
@yhmtsai (Member):

int -> gko::size_type or IndexType

Comment on lines +132 to +134
const IndexType* A, ///< [in]List A
const int A_len, ///< [in]Length of A
const int B_len, ///< [in]Length of B
@yhmtsai (Member):

Suggested change
- const IndexType* A, ///< [in]List A
- const int A_len, ///< [in]Length of A
- const int B_len, ///< [in]Length of B
+ const IndexType* a, ///< [in]List A
+ const int a_len, ///< [in]Length of A
+ const int b_len, ///< [in]Length of B

We use snake case for variables (https://github.com/ginkgo-project/ginkgo/wiki/Contributing-guidelines#variables).

int& path_coordinate_x, ///< [out] (x) coordinate where diagonal intersects the merge path
int& path_coordinate_y) ///< [out] (y) coordinate where diagonal intersects the merge path
{
auto x_min = std::max(diagonal - B_len, 0);
@yhmtsai (Member):

Suggested change
- auto x_min = std::max(diagonal - B_len, 0);
+ auto x_min = std::max(diagonal - B_len, zero<IndexType>());

nit

Comment on lines +188 to +189
const auto start_diagonal = std::min(items_per_thread * tid, num_merge_items);
const auto end_diagonal = std::min(start_diagonal + items_per_thread, num_merge_items);
@yhmtsai (Member):

Cast to IndexType to ensure the type used later.

for (size_type tid = 0; tid < num_threads; tid++) {
const auto start_diagonal = std::min(items_per_thread * tid, num_merge_items);
const auto end_diagonal = std::min(start_diagonal + items_per_thread, num_merge_items);
int thread_coord_x, thread_coord_y, thread_coord_end_x, thread_coord_end_y;
@yhmtsai (Member):

As https://github.com/ginkgo-project/ginkgo/wiki/Contributing-guidelines#variable-declarations mentions, we declare only one variable per line; see the example below.
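For example, the thread_coord_* declaration above would become (keeping the original int type; the earlier comment about int vs. IndexType applies here too):

```cpp
int thread_coord_x{};
int thread_coord_y{};
int thread_coord_end_x{};
int thread_coord_end_y{};
```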

@yhmtsai (Member) left a comment

LGTM in the other parts. We also need to add a note to the merge_path option in csr.hpp stating that the OpenMP merge_path is only supported in apply, not advanced_apply.
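One possible wording for that note (a sketch only, not final documentation text):

```cpp
/**
 * @note On the OpenMP backend, merge_path is only supported for apply,
 *       not for advanced apply.
 */
```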

Comment on lines +152 to +153
}

@yhmtsai (Member):
Suggested change
}
}

const size_type num_threads = omp_get_max_threads();
const IndexType* row_end_offsets = row_ptrs + 1; // Merge list A: row end offsets
const auto num_merge_items = num_rows + nnz; // Merge path total length
const auto items_per_thread = (num_merge_items + num_threads - 1) / num_threads; // Merge items per thread
@yhmtsai (Member):

You can use ceildiv here.
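For instance (assuming gko::ceildiv from core/base/math.hpp is available in this file):

```cpp
const auto items_per_thread = ceildiv(num_merge_items, num_threads);
```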

Comment on lines +219 to +220
// Carry-out fix-up (rows spanning multiple threads)
for (int tid = 0; tid < num_threads - 1; tid++) {
@yhmtsai (Member):

Maybe add a note, something like: the last thread does not carry out any partial result, because it must compute the result up to the last row end. A possible wording is sketched below.
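```cpp
// Carry-out fix-up (rows spanning multiple threads). The last thread has
// no carry-out: its merge range ends at the final row end, so its last
// partial sum is already complete.
```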

@stanisic (Author):

Thank you for the review, @yhmtsai; I agree with every single comment you have made. Unfortunately, I am losing access to the machine from which I can make public contributions. Therefore, I was hoping that the Ginkgo community could take over merging this PR.

@tcojean tcojean self-assigned this Feb 7, 2024
@MarcelKoch MarcelKoch added this to the Ginkgo 1.8.0 milestone Apr 5, 2024
@MarcelKoch MarcelKoch removed this from the Ginkgo 1.8.0 milestone May 3, 2024
@tcojean tcojean added this to the Ginkgo 1.9.0 milestone May 3, 2024