
OpenMP Merge-SpMV algorithm from Merrill & Garland for CSR SpMV #1497

Open · wants to merge 5 commits into base: develop

Conversation

@stanisic commented Dec 6, 2023

This PR brings the Merge-SpMV algorithm from Merrill & Garland for CSR SpMV. The implementation is based on the original article https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf and source code https://github.com/dumerrill/merge-spmv/blob/master/cpu_spmv.cpp from Merrill & Garland's 2016 work. It uses a "2D" approach to achieve better load balancing between the threads: instead of partitioning rows, it partitions the merged sequence of row boundaries and nonzeros evenly. In my extensive investigation on 6 Arm and x86 machines with 8 SuiteSparse matrices using Ginkgo 1.6, this algorithm performed between 4% worse and 80% better than the current CSR SpMV implementation.

The implementation in this PR was tested and optimized only for RHS=1, but in principle it should work for RHS>1 as well. Note that I have also tested the second variant of the Merge-SpMV algorithm (Algorithm 4 in https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf), but it always performed worse than the original Merge-SpMV algorithm (except for a few matrices on an A64FX machine, which I believe can be ignored).
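For reference, here is a minimal sketch of the merge-path coordinate search at the heart of the "2D" approach, paraphrased from the linked cpu_spmv.cpp (identifiers are illustrative, not this PR's exact code). List A is the row-end-offsets array, list B is the implicit counting sequence 0, 1, ..., nnz - 1 over the nonzeros; each thread binary-searches for the point where its diagonal crosses the merge path of the two lists.

```cpp
#include <algorithm>

// Find where the given diagonal crosses the merge path of
// A = row_end_offsets[0..num_rows) and B = 0..nnz-1.
template <typename IndexType>
void merge_path_search(IndexType diagonal, const IndexType* row_end_offsets,
                       IndexType num_rows, IndexType nnz,
                       IndexType& coord_x,  // rows consumed at the split
                       IndexType& coord_y)  // nonzeros consumed at the split
{
    auto x_min = std::max(diagonal - nnz, IndexType{0});
    auto x_max = std::min(diagonal, num_rows);
    while (x_min < x_max) {
        const auto pivot = x_min + (x_max - x_min) / 2;  // overflow-safe
        if (row_end_offsets[pivot] <= diagonal - pivot - 1) {
            x_min = pivot + 1;  // the crossing consumes more rows
        } else {
            x_max = pivot;  // the crossing consumes more nonzeros
        }
    }
    coord_x = x_min;
    coord_y = diagonal - x_min;
}
```

Splitting the total of num_rows + nnz merge items evenly over the threads is what gives each thread the same amount of work regardless of how the nonzeros are distributed across the rows.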

At the moment the code in this PR still lacks 3 things, which I leave to the experts to decide how best to handle:

  1. Integration with the rest of Ginkgo.
  2. If necessary, addition of any tests or comparisons with the default CSR SpMV.
  3. I guess my name and affiliation ("Luka Stanisic luka.stanisic@huawei.com, Huawei Technologies Duesseldorf GmbH") as well as the Merge-SpMV project (3-Clause BSD license, https://github.com/dumerrill/merge-spmv/blob/master/LICENSE.TXT) should be mentioned somewhere.

@upsj (Member) left a comment

First pass

auto value_carry_out_ptr = value_carry_out.get_data();

for (size_type j = 0; j < c->get_size()[1]; ++j) {
#pragma omp parallel for schedule(static)
@upsj (Member):

This should still be dynamic: merge path only balances the nonzeros, but doesn't take caching effects into account.

@stanisic (Author):

In theory yes, but in practice, based on my measurements on 5 different machines and 9 SuiteSparse matrices for several important Ginkgo kernels (including CSR SpMV), explicitly requesting a dynamic scheduling strategy decreases performance. So maybe we can find a middle ground and stick to the default #pragma omp parallel for without specifying a strategy?

@upsj (Member):

Without modifying the body of the loop, that makes sense, as you need to do more binary searches and have the scheduling overhead on top of that. I have an idea for removing the search overhead, though; I'll add a prototype soon.

@stanisic (Author):

I have done some performance measurements on a Huawei Kunpeng 920 with 8 SuiteSparse matrices, and the results show that using schedule(static) actually gives the best performance. Not specifying a schedule or using schedule(auto) is only slightly worse. This applies both to the "classical" CSR SpMV currently implemented in Ginkgo and to the proposed Merge-SpMV.

Dynamic scheduling performs terribly for the "classical" CSR SpMV (at least in my measurements on the Kunpeng 920). For Merge-SpMV, the gap between dynamic and the other scheduling policies is significantly smaller, but dynamic is still consistently the worst.

Therefore, I propose that this PR use either schedule(static) or no schedule clause at all. Any optimized scheduling approach (especially once the code is improved for multiple RHS) can be added later.
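For concreteness, the two proposed options written out on a toy loop (axpy-style placeholder body, not the PR's kernel):

```cpp
#include <cstddef>
#include <vector>

// Option 1: explicit static scheduling, which measured best above.
void axpy_static(std::vector<double>& y, const std::vector<double>& x)
{
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(y.size());
         ++i) {
        y[i] += 2.0 * x[i];
    }
}

// Option 2: no schedule clause, leaving the choice to the implementation.
void axpy_default(std::vector<double>& y, const std::vector<double>& x)
{
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(y.size());
         ++i) {
        y[i] += 2.0 * x[i];
    }
}
```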

/**
 * Computes the begin offsets into A and B for the specific diagonal
 */
template <typename IndexType>
inline void MergePathSearch(
@upsj (Member):

nit: snake_case

Suggested change
- inline void MergePathSearch(
+ inline void merge_path_search(

auto x_max = std::min(diagonal, A_len);

while (x_min < x_max) {
auto x_pivot = (x_min + x_max) >> 1;
@upsj (Member):

This can overflow, and it uses a shift that the compiler would emit anyway for a division; safer and clearer would be:

Suggested change
- auto x_pivot = (x_min + x_max) >> 1;
+ auto x_pivot = x_min + ((x_max - x_min) / 2);
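A toy demonstration of why the midpoint rewrite matters, assuming a 32-bit index type (not PR code):

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    std::int32_t x_min = 2'000'000'000;
    std::int32_t x_max = 2'100'000'000;
    // (x_min + x_max) >> 1 would first compute 4'100'000'000, which does
    // not fit in a 32-bit signed int: signed overflow, undefined behavior.
    // The rewritten form stays in range at every step:
    std::int32_t mid = x_min + (x_max - x_min) / 2;
    std::cout << mid << '\n';  // prints 2050000000
}
```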

auto row_carry_out_ptr = row_carry_out.get_data();
auto value_carry_out_ptr = value_carry_out.get_data();

for (size_type j = 0; j < c->get_size()[1]; ++j) {
@upsj (Member):

Ideally, we should handle multiple RHS by blocking inside the parallel loop; see the sketch below.
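A hypothetical sketch of what such blocking could look like, using a plain row-parallel CSR loop for brevity (the merge-path partitioning is omitted; all identifiers are illustrative, with b and c stored row-major with num_rhs columns):

```cpp
#include <cstddef>
#include <vector>

void csr_spmv_multi_rhs(std::size_t num_rows, std::size_t num_rhs,
                        const std::vector<int>& row_ptrs,
                        const std::vector<int>& col_idxs,
                        const std::vector<double>& vals,
                        const std::vector<double>& b, std::vector<double>& c)
{
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t row = 0; row < static_cast<std::ptrdiff_t>(num_rows);
         ++row) {
        for (std::size_t j = 0; j < num_rhs; ++j) {
            c[row * num_rhs + j] = 0.0;
        }
        for (auto nz = row_ptrs[row]; nz < row_ptrs[row + 1]; ++nz) {
            const auto val = vals[nz];
            const auto col = static_cast<std::size_t>(col_idxs[nz]);
            // innermost loop over the right-hand sides: each matrix entry
            // is loaded once and reused for all num_rhs columns
            for (std::size_t j = 0; j < num_rhs; ++j) {
                c[row * num_rhs + j] += val * b[col * num_rhs + j];
            }
        }
    }
}
```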

@stanisic (Author):

Do you suggest something similar to the approach in omp/matrix/coo_kernels.cpp? If yes, then it would require a bit more development and testing effort. Do you see it as part of this PR, or is this a comment for future optimizations?

Comment on lines 50 to 53
void merge_spmv(std::shared_ptr<const DefaultExecutor> exec, \
const matrix::Csr<MatrixValueType, IndexType>* a, \
const matrix::Dense<InputValueType>* b, \
matrix::Dense<OutputValueType>* c)
@upsj (Member):

The choice of which strategy to use is already present in a->get_strategy(), so we don't need a new kernel for it.

@stanisic (Author):

I see now, so you suggest removing this and adding branches in csr_kernels.cpp based on a->get_strategy(), similar to cuda/matrix/csr_kernels.template.cu, right?

@upsj (Member):

Exactly.
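A hedged sketch of that dispatch inside omp/matrix/csr_kernels.cpp, where the Ginkgo types are in scope (assuming the strategy object exposes get_name(); merge_spmv_impl and classical_spmv_impl are hypothetical placeholders for this PR's kernel and the existing one):

```cpp
template <typename ValueType, typename IndexType>
void spmv(std::shared_ptr<const DefaultExecutor> exec,
          const matrix::Csr<ValueType, IndexType>* a,
          const matrix::Dense<ValueType>* b, matrix::Dense<ValueType>* c)
{
    if (a->get_strategy()->get_name() == "merge_path") {
        merge_spmv_impl(exec, a, b, c);  // the kernel added by this PR
    } else {
        classical_spmv_impl(exec, a, b, c);  // existing row-parallel kernel
    }
}
```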

@yhmtsai (Member) left a comment

Thank you for implementing merge SpMV for the OpenMP CSR.
This is part 1/2 of my review, mainly covering formatting issues and some implementation comments.
If you have built Ginkgo with -DGINKGO_DEVEL_TOOLS=ON with pre-commit, you can run pre-commit run --from-ref "origin/develop" --to-ref HEAD to format this file; alternatively, installing clang-format-14 and formatting the file with it should be good enough.

Comment on lines +131 to +136
const int diagonal, ///< [in]The diagonal to search
const IndexType* A, ///< [in]List A
const int A_len, ///< [in]Length of A
const int B_len, ///< [in]Length of B
int& path_coordinate_x, ///< [out] (x) coordinate where diagonal intersects the merge path
int& path_coordinate_y) ///< [out] (y) coordinate where diagonal intersects the merge path
@yhmtsai (Member):

int -> gko::size_type or IndexType

Comment on lines +132 to +134
const IndexType* A, ///< [in]List A
const int A_len, ///< [in]Length of A
const int B_len, ///< [in]Length of B
@yhmtsai (Member):

Suggested change
- const IndexType* A, ///< [in]List A
- const int A_len, ///< [in]Length of A
- const int B_len, ///< [in]Length of B
+ const IndexType* a, ///< [in]List A
+ const int a_len, ///< [in]Length of A
+ const int b_len, ///< [in]Length of B

We use snake case for variables (https://github.com/ginkgo-project/ginkgo/wiki/Contributing-guidelines#variables).

int& path_coordinate_x, ///< [out] (x) coordinate where diagonal intersects the merge path
int& path_coordinate_y) ///< [out] (y) coordinate where diagonal intersects the merge path
{
auto x_min = std::max(diagonal - B_len, 0);
@yhmtsai (Member):

Suggested change
- auto x_min = std::max(diagonal - B_len, 0);
+ auto x_min = std::max(diagonal - B_len, zero<IndexType>());

nit

Comment on lines +188 to +189
const auto start_diagonal = std::min(items_per_thread * tid, num_merge_items);
const auto end_diagonal = std::min(start_diagonal + items_per_thread, num_merge_items);
@yhmtsai (Member):

Cast to IndexType to ensure the type used later.

for (size_type tid = 0; tid < num_threads; tid++) {
const auto start_diagonal = std::min(items_per_thread * tid, num_merge_items);
const auto end_diagonal = std::min(start_diagonal + items_per_thread, num_merge_items);
int thread_coord_x, thread_coord_y, thread_coord_end_x, thread_coord_end_y;
@yhmtsai (Member):

As https://github.com/ginkgo-project/ginkgo/wiki/Contributing-guidelines#variable-declarations mentions, we declare only one variable per line; see the example below.
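For example, the thread_coord_* declaration above would become (keeping the original int type; the earlier comment about int vs. IndexType applies here too):

```cpp
int thread_coord_x{};
int thread_coord_y{};
int thread_coord_end_x{};
int thread_coord_end_y{};
```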

@yhmtsai (Member) left a comment

LGTM in the other parts. We also need to add a note to the merge_path option in csr.hpp stating that the OpenMP merge_path is only supported in apply, not advanced_apply.
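One possible wording for that note (a sketch only, not final documentation text):

```cpp
/**
 * @note On the OpenMP backend, merge_path is only supported for apply,
 *       not for advanced apply.
 */
```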

Comment on lines +152 to +153
}

@yhmtsai (Member):
Suggested change
}
}

const size_type num_threads = omp_get_max_threads();
const IndexType* row_end_offsets = row_ptrs + 1; // Merge list A: row end offsets
const auto num_merge_items = num_rows + nnz; // Merge path total length
const auto items_per_thread = (num_merge_items + num_threads - 1) / num_threads; // Merge items per thread
@yhmtsai (Member):

You can use ceildiv here.
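For instance (assuming gko::ceildiv from core/base/math.hpp is available in this file):

```cpp
const auto items_per_thread = ceildiv(num_merge_items, num_threads);
```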

Comment on lines +219 to +220
// Carry-out fix-up (rows spanning multiple threads)
for (int tid = 0; tid < num_threads - 1; tid++) {
@yhmtsai (Member):

Maybe add a note, something like: the last thread does not carry out any partial result, because it must compute the result up to the last row end. A possible wording is sketched below.
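```cpp
// Carry-out fix-up (rows spanning multiple threads). The last thread has
// no carry-out: its merge range ends at the final row end, so its last
// partial sum is already complete.
```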

@stanisic (Author):

Thank you for the review, @yhmtsai; I agree with every single comment you have made. Unfortunately, I am losing access to the machine from which I can make public contributions. Therefore, I was hoping that the Ginkgo community could take over merging this PR.

@tcojean tcojean self-assigned this Feb 7, 2024
@MarcelKoch MarcelKoch added this to the Ginkgo 1.8.0 milestone Apr 5, 2024
@MarcelKoch MarcelKoch removed this from the Ginkgo 1.8.0 milestone May 3, 2024
@tcojean tcojean added this to the Ginkgo 1.9.0 milestone May 3, 2024