
[SPARSE] Update oneMKL backends to match new sparse API #500

Open · wants to merge 39 commits into develop
Conversation

Rbiessy (Contributor) commented May 27, 2024:

Description

Update sparse API to follow the specification change: uxlfoundation/oneAPI-spec#522
Also see related small changes needed in the specification: uxlfoundation/oneAPI-spec#542

This also adds some backend-specific documentation in the docs folder. You can see how it is rendered here: https://rbiessy.github.io/oneMKL/domains/sparse_linear_algebra.html

All Submissions

  • Do all unit tests pass locally? Attach a log.
  • Have you formatted the code using clang-format?

Rbiessy (Contributor, Author) commented May 27, 2024:

Test logs:
mklcpu_mklgpu_logs.txt

Test cases are marked as skipped when their configuration is not supported, but all the tests are run regardless.

Comment on lines +182 to +185
std::size_t dense_a_idx =
(!is_symmetric_or_hermitian_view && transpose_val != oneapi::mkl::transpose::nontrans)
? col * a_nrows + row
: row * a_ncols + col;
fpType val = opVal(a[a_idx], apply_conjugate);
dense_a[dense_a_idx] = val;
gajanan-choudhary (Contributor) commented Jun 5, 2024:

Just typing out cases to be verified. I'm mostly concerned about the hermitian case. I'll revisit this comment later.

| Sr. No. | transpose | matrix_descr | uplo | diag | Dense output | Logic verified? |
|---:|---|---|---|---|---|---|
| 1 | nontrans | triangular | lower | nonunit | $L + D + \mathbf{0}$ | Yes |
| 2 | trans | triangular | lower | nonunit | $\mathbf{0} + D + L^T$ | Yes |
| 3 | conjtrans | triangular | lower | nonunit | $\mathbf{0} + \overline{D} + \overline{L}^T$ | Yes |
| 4 | nontrans | triangular | upper | nonunit | $\mathbf{0} + D + U$ | Yes |
| 5 | trans | triangular | upper | nonunit | $U^T + D + \mathbf{0}$ | Yes |
| 6 | conjtrans | triangular | upper | nonunit | $\overline{U}^T + \overline{D} + \mathbf{0}$ | Yes |
| 7 | nontrans | symmetric | lower | nonunit | $L + D + L^T$ | TBD |
| 8 | trans | symmetric | lower | nonunit | $L + D + L^T$ | TBD |
| 9 | conjtrans | symmetric | lower | nonunit | $\overline{L} + \overline{D} + \overline{L}^T$ | TBD |
| 10 | nontrans | symmetric | upper | nonunit | $U^T + D + U$ | TBD |
| 11 | trans | symmetric | upper | nonunit | $U^T + D + U$ | TBD |
| 12 | conjtrans | symmetric | upper | nonunit | $\overline{U}^T + \overline{D} + \overline{U}$ | TBD |
| 13 | nontrans | hermitian | lower | nonunit | $L + Re(D) + \overline{L}^T$ | TBD |
| 14 | trans | hermitian | lower | nonunit | $\overline{L} + Re(D) + L^T$ | TBD |
| 15 | conjtrans | hermitian | lower | nonunit | $L + Re(D) + \overline{L}^T$ | TBD |
| 16 | nontrans | hermitian | upper | nonunit | $\overline{U}^T + Re(D) + U$ | TBD |
| 17 | trans | hermitian | upper | nonunit | $U^T + Re(D) + \overline{U}$ | TBD |
| 18 | conjtrans | hermitian | upper | nonunit | $\overline{U}^T + Re(D) + U$ | TBD |

Rbiessy (Contributor, Author):

I thought some more about this and figured that it makes more sense to prevent the user from using conjtrans with symmetric or Hermitian matrices. No backend would support that and it is probably not useful. spmv is the only operation affected by this so I have disabled that in ec72831. I've also clarified this in the spec.

gajanan-choudhary (Contributor) commented Aug 22, 2024:

That's actually partially incorrect.

  • conjtrans + symmetric (rows 9 and 12) are unique operations as opposed to nontrans/trans cases that are pairwise same (rows {7 and 8} and rows {10 and 11}).
  • trans + hermitian (rows 14 and 17) are unique operations as opposed to nontrans/conjtrans cases that are pairwise same (rows {13 and 15} and rows {16 and 18}).

gajanan-choudhary (Contributor) commented Aug 22, 2024:

The set of duplications are:

  • 8 duplicates 7
  • 11 duplicates 10
  • 15 duplicates 13
  • 18 duplicates 16

So other than 8, 11, 15, 18, all the rest of the rows are valid operations.

Rbiessy (Contributor, Author):

I'm not sure I follow: are you suggesting a change in this PR?

gajanan-choudhary (Contributor):

Yes, you mentioned you added restrictions on symmetric/hermitian in ec72831. All the combinations will eventually be supported, although no backends currently support hermitian views, so you can throw a not_implemented exception for that; symmetric views are supported.

You also mentioned in your first comment:

> I've also clarified this in the spec.

Could you please point me to where this was?

Rbiessy (Contributor, Author):

The related change in the spec was: uxlfoundation/oneAPI-spec@471a11d
I'm OK to revert this and throw unsupported "for now". Note that this means all backends will throw unsupported for some time.

Rbiessy (Contributor, Author):

I have changed it to throwing an unimplemented exception in c7a4420.
The symmetric + conjtrans issue that I described in my email "MKLCPU and MKLGPU mismatch for spmm using conjtrans and symmetric property" is still present with oneAPI 2024.2, so I am throwing unsupported for both symmetric + conjtrans and hermitian + conjtrans. I have reverted the spec change that I linked above.

Comment on lines +60 to +62
if ((A_view.type_view == oneapi::mkl::sparse::matrix_descr::symmetric ||
A_view.type_view == oneapi::mkl::sparse::matrix_descr::hermitian) &&
opA == oneapi::mkl::transpose::conjtrans) {
Rbiessy (Contributor, Author) commented Jun 6, 2024:

Related to this comment #500 (comment), I was wondering if it makes sense to allow matrix_descr::symmetric for complex types, or matrix_descr::hermitian and transpose::conjtrans for real types. My understanding is that these make sense mathematically but are probably not useful, and are often assumed to be restricted to real (resp. complex) types.
The Intel oneMKL product spec seems to restrict conjtrans to complex types, but the oneMKL SYCL interface one is a bit more vague. I think it's mostly a question of what would be more useful to the user: assume that there was a mistake, or trust that the user understands conjugating a real type is a no-op. What do you think @gajanan-choudhary?

gajanan-choudhary (Contributor):

> if it makes sense to allow matrix_descr::symmetric for complex types

Symmetric + complex is definitely needed as the operation/output is mathematically different from Hermitian + complex.

> Intel oneMKL product spec seems to restrict conjtrans to complex type

I was honestly not aware we had documented conjtrans that way; I'll get back to you after confirming what they are doing. For real types + hermitian/conjtrans, on the Intel oneMKL sparse BLAS side we either already dispatch, or will eventually dispatch, to the symmetric/trans code paths as applicable if the user calls hermitian/conjtrans on real types. This is particularly true for the C/Fortran CPU side.

Although a typical user would use a specific FP type in their application, we do have to care about HPC libraries such as PETSc or Kokkos that support multiple data types and operations. Our approach has been to generally support all combinations regardless of whether one is mathematically different from another or not, and deal with correct dispatching on our side.

IF BLAS/LAPACK support real + conjtrans/symmetric cases (to be confirmed), then my suggestion would be to directly forward the args to the backend. If the backend does not handle the case correctly, a top-level check on the oneMKL Interfaces side would be needed for real data types to replace hermitian with symmetric and conjtrans with trans.

Rbiessy (Contributor, Author) commented Jun 24, 2024:

Thanks for checking.
Currently I am not testing symmetric with complex types, nor hermitian or conjtrans with real types. From what you're saying, it sounds like these should be tested. I was mostly wondering whether I should add some error checking. I agree it makes sense to forward the values to the backend and add a top-level check if needed. I'll wait for your confirmation.

Rbiessy (Contributor, Author):

I went with the solution of adding more tests for symmetric with complex types, and hermitian and conjtrans with real types, in 82566e5.

gajanan-choudhary (Contributor):

Both the LAPACK and BLAS teams said they support real + conjtrans cases by just doing real + trans, as that is the least surprising behavior for users. Thanks for the change.

gajanan-choudhary (Contributor) commented Aug 23, 2024:

Unresolving this comment with respect to #500 (comment). The current implementation has symmetric + conjtrans throwing an exception; that should be lifted, as that case is or will be supported. For the hermitian + conjtrans case, we could just replace conjtrans with nontrans as listed in #500 (comment), but that is the backend's job anyway, not the spec's or oneMKL Interfaces'.

Rbiessy (Contributor, Author):

These configurations are throwing unimplemented now, see #500 (comment)

Rbiessy (Contributor, Author) commented Jun 7, 2024:

Had to force push on my branch to fix my author name and email; there were no code changes since my last comment.

Rbiessy (Contributor, Author) commented Jul 4, 2024:

I have merged with the develop branch and tested with oneAPI 2024.2.
FYI @gajanan-choudhary, I found a new issue with spsv, complex float, and conjtrans, which fails with incorrect results on GPU only. I have disabled this configuration in 9b9548a.
Recent test logs: intel_gpu_cpu_log.txt
Recent doc build: docs.zip

Rbiessy (Contributor, Author) commented Jul 15, 2024:

The diff in e8eac87 simply removes the unused macro TEST_RUN_CT_SELECT as there is one macro per domain now.

Rbiessy (Contributor, Author) commented Jul 19, 2024:

I've pushed 2f59edc to reduce the number of times get_pointer_type is called. In the other backends I was concerned this could introduce overhead for no reason, so I figured I would align the MKL backends with this.

gajanan-choudhary (Contributor) left a review:

I've finally reviewed the entire PR. This is fantastic work and top-notch code. Thanks a lot @Rbiessy for being patient with the review and feedback. I have some review comments; I'm ready to approve once those are resolved.

Resolved (outdated) review threads on:
  • src/sparse_blas/backends/mkl_common/mkl_spmv.cxx (2 threads)
  • src/sparse_blas/backends/mkl_common/mkl_spsv.cxx
  • tests/unit_tests/sparse_blas/include/test_common.hpp
  • tests/unit_tests/sparse_blas/source/sparse_spmv_buffer.cpp
oneapi::mkl::sparse::spmv_descr_t /*spmv_descr*/,
const std::vector<sycl::event> &dependencies,
bool is_alpha_host_accessible, bool is_beta_host_accessible) {
T host_alpha =
gajanan-choudhary (Contributor) commented Aug 22, 2024:

Thanks to @spencerpatty who brought up an important point here. The "Notes" section of the oneAPI Spec says that the spmv_buffer_size()/spmv_optimize() stages are compulsory. We need a way to check if they have been called, and if not, throw an exception. The fact that it works with the oneMKL backend is just luck, but we shouldn't rely on it.

gajanan-choudhary (Contributor):

Also, it seems we need to clarify explicitly in the oneAPI Spec that if either the optimize or buffer_size stage is not called, a specific exception (invalid_argument?) should be thrown.

Rbiessy (Contributor, Author):

I thought about it, but the issue is that for the check to be really useful we would need to verify that the *_buffer_size and *_optimize functions were called with the same arguments, and I think checking this would add too much overhead. I would suggest only clarifying in the spec that not doing so is undefined behavior.
I was also thinking of storing in the descriptor whether buffer_size or optimize has been called once, regardless of the arguments used, but this is only a partial check. We would need to justify in the spec that we can throw an exception for some cases while keeping it undefined behavior if different arguments are used. It's simpler to say not calling them is undefined behavior.

gajanan-choudhary (Contributor) commented Aug 23, 2024:

Undefined behavior is not good enough. We can't have users getting used to expecting that not having calls to _buffer_size or _optimize APIs is fine and things will work just because oneMKL backend works right now. It is okay for us to enforce restrictions first, and later lift some of them as needed, but it's not possible for us to do it the other way round because that would break users' codes. The overhead for a few if checks on the host is going to be far less than the actual operation itself, even for tiny matrices. Even in the unlikely event that overhead is large, that's the cost of doing business and having the APIs that we do in the oneAPI Spec.

  • Not calling the _buffer_size()/_optimize() stages is not allowed. This must be caught and an exception thrown. To do that, store flags for these in the <api>_descr and internally initialize them to false, setting them to true as and when the APIs are called.
  • Repeating calls to any of these functions with a different transpose operation or alg enum but the same <api>_descr is not allowed. This must be caught and an exception must be thrown, since that breaks all optimizations and the state of the workspace. If users need a different op/alg, they need to create a new <api>_descr with a different temporary workspace. To catch this, internally store the transpose and alg enums in <api>_descr and initialize them to invalid values not among any of the existing enums, e.g. static_cast<oneapi::mkl::transpose>(0xFF). That way, when any API is called the first time, the transpose/alg enums are set, and the user is no longer allowed to change them.
  • Repeating calls with a different sparsity pattern (including mathematically same but just differently ordered sparsity pattern) compared to the first call is also not allowed, but we will have to make it undefined behavior because there is no efficient way to check if the users provided us a new matrix with the same sparsity pattern in the same order or not.
  • Repeating calls with different alpha, beta is always allowed, of course.

We need to clarify these things in the oneAPI specification.

or we just lift the requirement and go back to it being optional :)

gajanan-choudhary (Contributor):

> or we just lift the requirement and go back to it being optional :)

As I mentioned in my previous comment:

> It is okay for us to enforce restrictions first, and later lift some of them as needed, but it's not possible for us to do it the other way round because that would break users' codes.

gajanan-choudhary (Contributor) commented Aug 23, 2024:

The other reason to prefer compulsory calls to the _buffer_size/_optimize APIs is that even though some backends (like the current Intel oneMKL) may not need a workspace associated with a particular op/alg combo, another backend may require a temporary workspace for some or all of its algorithms, thereby requiring these APIs to be called. When a user then tries to run the same code on two different hardware/backends, relying on the APIs being optional on one hardware, they are guaranteed to run into problems on the other hardware. The only way to prevent that would be to have non-intersecting algorithm enums or backend-specific enums in the oneAPI spec itself, which is likely not what we want.

Rbiessy (Contributor, Author):

Another issue that will need to be undefined behavior: checking whether the _buffer_size function has already been called before _optimize with the same arguments. Nothing prevents the user from calling _buffer_size multiple times in a row with different arguments (but the same descriptor), or from re-using the output of _buffer_size from a previous call. We would need to store a map of all the sets of arguments that _buffer_size has been called with, which would introduce too much overhead. We could check that buffer_size has been called at least once, but specifying this in the specification would look weird.
I will see if we can add some checks as you suggested, but I suspect this will lead to more discussion and delay. Remember we can always improve the spec later if we find this is an issue for users.

Rbiessy (Contributor, Author):

I have added some checks in 5f8e183 and the associated spec change is in uxlfoundation/oneAPI-spec@fda9f80.
I don't think I can add more checks that wouldn't require storing a map of the arguments used.


4 participants