
[SPARSE] Update oneMKL backends to match new sparse API #500

Open · wants to merge 39 commits into develop
Conversation

Rbiessy (Contributor) commented May 27, 2024:

Description

Update sparse API to follow the specification change: uxlfoundation/oneAPI-spec#522
Also see related small changes needed in the specification: uxlfoundation/oneAPI-spec#542

This also adds some backend-specific documentation in the docs folder. You can see how it is rendered here: https://rbiessy.github.io/oneMKL/domains/sparse_linear_algebra.html

All Submissions

  • Do all unit tests pass locally? Attach a log.
  • Have you formatted the code using clang-format?

Rbiessy (Contributor, Author) commented May 27, 2024:

Test logs:
mklcpu_mklgpu_logs.txt

Test cases are marked as skipped when their configuration is not supported, but all the tests are run regardless.

Comment on lines +182 to +185
std::size_t dense_a_idx =
(!is_symmetric_or_hermitian_view && transpose_val != oneapi::mkl::transpose::nontrans)
? col * a_nrows + row
: row * a_ncols + col;
fpType val = opVal(a[a_idx], apply_conjugate);
dense_a[dense_a_idx] = val;
gajanan-choudhary (Contributor) commented Jun 5, 2024:

Just typing out cases to be verified. I'm mostly concerned about the hermitian case. I'll revisit this comment later.

| Sr. No. | transpose | matrix_descr | uplo | diag | Dense output | Logic verified? |
|---:|---|---|---|---|---|---|
| 1 | nontrans | triangular | lower | nonunit | $L + D + \mathbf{0}$ | Yes |
| 2 | trans | triangular | lower | nonunit | $\mathbf{0} + D + L^T$ | Yes |
| 3 | conjtrans | triangular | lower | nonunit | $\mathbf{0} + \overline{D} + \overline{L}^T$ | Yes |
| 4 | nontrans | triangular | upper | nonunit | $\mathbf{0} + D + U$ | Yes |
| 5 | trans | triangular | upper | nonunit | $U^T + D + \mathbf{0}$ | Yes |
| 6 | conjtrans | triangular | upper | nonunit | $\overline{U}^T + \overline{D} + \mathbf{0}$ | Yes |
| 7 | nontrans | symmetric | lower | nonunit | $L + D + L^T$ | TBD |
| 8 | trans | symmetric | lower | nonunit | $L + D + L^T$ | TBD |
| 9 | conjtrans | symmetric | lower | nonunit | $\overline{L} + \overline{D} + \overline{L}^T$ | TBD |
| 10 | nontrans | symmetric | upper | nonunit | $U^T + D + U$ | TBD |
| 11 | trans | symmetric | upper | nonunit | $U^T + D + U$ | TBD |
| 12 | conjtrans | symmetric | upper | nonunit | $\overline{U}^T + \overline{D} + \overline{U}$ | TBD |
| 13 | nontrans | hermitian | lower | nonunit | $L + Re(D) + \overline{L}^T$ | TBD |
| 14 | trans | hermitian | lower | nonunit | $\overline{L} + Re(D) + L^T$ | TBD |
| 15 | conjtrans | hermitian | lower | nonunit | $L + Re(D) + \overline{L}^T$ | TBD |
| 16 | nontrans | hermitian | upper | nonunit | $\overline{U}^T + Re(D) + U$ | TBD |
| 17 | trans | hermitian | upper | nonunit | $U^T + Re(D) + \overline{U}$ | TBD |
| 18 | conjtrans | hermitian | upper | nonunit | $\overline{U}^T + Re(D) + U$ | TBD |

Rbiessy (Contributor, Author):

I thought some more about this and figured that it makes more sense to prevent the user from using conjtrans with symmetric or Hermitian matrices. No backend would support that and it is probably not useful. spmv is the only operation affected by this so I have disabled that in ec72831. I've also clarified this in the spec.

gajanan-choudhary (Contributor) commented Aug 22, 2024:

That's actually partially incorrect.

  • conjtrans + symmetric (rows 9 and 12) are unique operations as opposed to nontrans/trans cases that are pairwise same (rows {7 and 8} and rows {10 and 11}).
  • trans + hermitian (rows 14 and 17) are unique operations as opposed to nontrans/conjtrans cases that are pairwise same (rows {13 and 15} and rows {16 and 18}).

gajanan-choudhary (Contributor) commented Aug 22, 2024:

The set of duplications are:

  • 8 duplicates 7
  • 11 duplicates 10
  • 15 duplicates 13
  • 18 duplicates 16

So other than 8, 11, 15, 18, all the rest of the rows are valid operations.

Rbiessy (Contributor, Author):

I'm not sure I follow: are you suggesting a change in this PR?

gajanan-choudhary (Contributor):

Yes, you mentioned you added restrictions on symmetric/hermitian in ec72831. All the combinations will eventually be supported, although no backends currently support hermitian views, so you can throw a not_implemented exception for that; symmetric views are supported.

You also mentioned in your first comment:

> I've also clarified this in the spec.

Could you please point me to where this was?

Rbiessy (Contributor, Author):

The related change in the spec was: uxlfoundation/oneAPI-spec@471a11d
I'm OK to revert this and throw unsupported "for now". Note that this means all backends will throw unsupported for some time.

Rbiessy (Contributor, Author):

I have changed it to throwing an unimplemented exception in c7a4420.
The symmetric + conjtrans issue that I described in my email "MKLCPU and MKLGPU mismatch for spmm using conjtrans and symmetric property" is still present with oneAPI 2024.2, so I am throwing unsupported for both symmetric + conjtrans and hermitian + conjtrans. I have reverted the spec change that I linked above.

Comment on lines +60 to +62
if ((A_view.type_view == oneapi::mkl::sparse::matrix_descr::symmetric ||
A_view.type_view == oneapi::mkl::sparse::matrix_descr::hermitian) &&
opA == oneapi::mkl::transpose::conjtrans) {
Rbiessy (Contributor, Author) commented Jun 6, 2024:

Related to this comment #500 (comment), I was wondering if it makes sense to allow matrix_descr::symmetric for complex types, or matrix_descr::hermitian and transpose::conjtrans for real types. My understanding is that these make sense mathematically but are probably not useful, and are often assumed to be restricted to real (resp. complex) types.
The Intel oneMKL product spec seems to restrict conjtrans to complex types, but the oneMKL SYCL interface one is a bit more vague. I think it's mostly a question of what would be more useful to the user: assume that there was a mistake, or trust that the user understands conjugating a real type is a no-op. What do you think @gajanan-choudhary?

gajanan-choudhary (Contributor):

> if it makes sense to allow matrix_descr::symmetric for complex types

Symmetric + complex is definitely needed as the operation/output is mathematically different from Hermitian + complex.

> Intel oneMKL product spec seems to restrict conjtrans to complex type

I was honestly not aware we had documented conjtrans that way; I'll get back to you after confirming what they are doing. For real types + hermitian/conjtrans, on the Intel oneMKL sparse BLAS side we either already dispatch, or will eventually dispatch, to the symmetric/trans code paths as applicable if the user calls hermitian/conjtrans on real types. This is particularly true for the C/Fortran CPU side.

Although a typical user would use a specific FP type in their application, we do have to care about HPC libraries such as PETSc or Kokkos that support multiple data types and operations. Our approach has been to generally support all combinations regardless of whether one is mathematically different from another or not, and deal with correct dispatching on our side.

IF BLAS/LAPACK support real + conjtrans/symmetric cases (to be confirmed), then my suggestion would be to directly forward the args to the backend. If the backend does not handle the case correctly, a top-level check on the oneMKL Interfaces side would be needed for real data types to replace hermitian with symmetric and conjtrans with trans.

Rbiessy (Contributor, Author) commented Jun 24, 2024:

Thanks for checking.
Currently I am not testing symmetric with complex types, nor hermitian or conjtrans with real types. From what you're saying, it sounds like these should be tested. I was mostly wondering whether I should add some error checking. I agree it makes sense to forward the values to the backend and add a top-level check if needed. I'll wait for your confirmation.

Rbiessy (Contributor, Author):

I went with the solution of adding more tests for symmetric with complex types, and hermitian and conjtrans with real types, in 82566e5.

gajanan-choudhary (Contributor):

Both the LAPACK and BLAS teams said they support real + conjtrans cases by just doing real + trans, as that is the least surprising behavior for users. Thanks for the change.

gajanan-choudhary (Contributor) commented Aug 23, 2024:

Unresolving this comment with respect to #500 (comment). The current implementation has symmetric + conjtrans throwing an exception; that should be lifted, as that case is or will be supported. For the hermitian + conjtrans case, we could just replace conjtrans with nontrans as listed in #500 (comment), but that is the backend's job anyway, not the spec's or oneMKL Interfaces'.

Rbiessy (Contributor, Author):

These configurations are throwing unimplemented now, see #500 (comment)

Rbiessy (Contributor, Author) commented Jun 7, 2024:

Had to force push on my branch to fix my author name and email; there were no code changes since my last comment.

Rbiessy (Contributor, Author) commented Jul 4, 2024:

I have merged with the develop branch and tested with oneAPI 2024.2.
FYI @gajanan-choudhary, I found a new issue with spsv, complex float, and conjtrans, which fails with incorrect results on GPU only. I have disabled this configuration in 9b9548a.
Recent test logs: intel_gpu_cpu_log.txt
Recent doc build: docs.zip

Rbiessy (Contributor, Author) commented Jul 15, 2024:

The diff in e8eac87 simply removes the unused macro TEST_RUN_CT_SELECT as there is one macro per domain now.

Rbiessy (Contributor, Author) commented Jul 19, 2024:

I've pushed 2f59edc to reduce the number of times get_pointer_type is called. In the other backends I was concerned this could introduce overhead for no reason, so I figured I would align the MKL backends with this.

gajanan-choudhary (Contributor) left a review:

I've finally reviewed the entire PR. This is fantastic work and top-notch code. Thanks a lot @Rbiessy for being patient with the review and feedback. I have some review comments; I'm ready to approve once those are resolved.

Resolved (outdated) review threads on:
  • src/sparse_blas/backends/mkl_common/mkl_spmv.cxx (2 threads)
  • src/sparse_blas/backends/mkl_common/mkl_spsv.cxx
  • tests/unit_tests/sparse_blas/include/test_common.hpp
  • tests/unit_tests/sparse_blas/source/sparse_spmv_buffer.cpp
oneapi::mkl::sparse::spmv_descr_t /*spmv_descr*/,
const std::vector<sycl::event> &dependencies,
bool is_alpha_host_accessible, bool is_beta_host_accessible) {
T host_alpha =
gajanan-choudhary (Contributor) commented Aug 22, 2024:

Thanks to @spencerpatty who brought up an important point here. The "Notes" section of the oneAPI Spec says that the spmv_buffer_size()/spmv_optimize() stages are compulsory. We need a way to check if they have been called, and if not, throw an exception. The fact that it works with the oneMKL backend is just luck, but we shouldn't rely on it.

gajanan-choudhary (Contributor):

Also, it seems we need to clarify explicitly in the oneAPI Spec that if either the optimize or buffer_size stage is not called, a specific exception (invalid_argument?) should be thrown.

Rbiessy (Contributor, Author):

I thought about it, but the issue is that for the check to be really useful we would need to verify that the *_buffer_size and *_optimize functions were called with the same arguments, and I think checking this would add too much overhead. I would suggest only clarifying in the spec that not doing so is undefined behavior.
I was also thinking of storing in the descriptor whether buffer_size or optimize has been called once, regardless of the arguments used, but this is only a partial check. We would need to justify in the spec that we can throw an exception for some cases while keeping it undefined behavior if different arguments are used. It's simpler to say not calling them is undefined behavior.

gajanan-choudhary (Contributor) commented Aug 23, 2024:

Undefined behavior is not good enough. We can't have users getting used to expecting that not having calls to _buffer_size or _optimize APIs is fine and things will work just because oneMKL backend works right now. It is okay for us to enforce restrictions first, and later lift some of them as needed, but it's not possible for us to do it the other way round because that would break users' codes. The overhead for a few if checks on the host is going to be far less than the actual operation itself, even for tiny matrices. Even in the unlikely event that overhead is large, that's the cost of doing business and having the APIs that we do in the oneAPI Spec.

  • Not calling the _buffer_size()/_optimize() stages is not allowed. This must be caught and an exception thrown. To do that, store flags for these in the <api>_descr and internally initialize them to false, setting them to true as and when the APIs are called.
  • Repeating calls to any of these functions with a different transpose operation or alg enum but the same <api>_descr is not allowed. This must be caught and an exception must be thrown, since that breaks all optimizations and the state of the workspace. If users need a different op/alg, they need to create a new <api>_descr with a different temporary workspace. To catch this, internally store the transpose and alg enums in <api>_descr and initialize them to invalid values not among any of the existing enums, e.g. static_cast<oneapi::mkl::transpose>(0xFF). That way, when any API is called the first time, the transpose/alg enums are set, and the user is no longer allowed to change them.
  • Repeating calls with a different sparsity pattern (including mathematically same but just differently ordered sparsity pattern) compared to the first call is also not allowed, but we will have to make it undefined behavior because there is no efficient way to check if the users provided us a new matrix with the same sparsity pattern in the same order or not.
  • Repeating calls with different alpha, beta is always allowed, of course.

We need to clarify these things in the oneAPI specification.

or we just lift the requirement and go back to it being optional :)

gajanan-choudhary (Contributor):

> or we just lift the requirement and go back to it being optional :)

As I mentioned in my previous comment:

> It is okay for us to enforce restrictions first, and later lift some of them as needed, but it's not possible for us to do it the other way round because that would break users' codes.

gajanan-choudhary (Contributor) commented Aug 23, 2024:

The other reason to prefer compulsory calls to the _buffer_size/_optimize APIs is that even though some backends (like the current Intel oneMKL) may not need a workspace associated with a particular op/alg combo, another backend may require a temporary workspace for some or all of its algorithms, thereby requiring these APIs to be called. When a user then tries to run the same code on two different hardware/backends, relying on the APIs being optional on one hardware, they are guaranteed to run into problems on the other hardware. The only way to prevent that would be to have non-intersecting algorithm enums or backend-specific enums in the oneAPI spec itself, which is likely not what we want.

Rbiessy (Contributor, Author):

Another issue that will need to be undefined behavior: checking whether the _buffer_size function has already been called before _optimize with the same arguments. Nothing prevents the user from calling _buffer_size multiple times in a row with different arguments (but the same descriptor), or from re-using the output of _buffer_size from a previous call. We would need to store a map of all the sets of arguments that _buffer_size has been called with, which would introduce too much overhead. We could check that buffer_size has been called at least once, but specifying this in the specification would look weird.
I will see if we can add some checks as you suggested, but I suspect this will lead to more discussion and delay. Remember we can always improve the spec later if we find this is an issue for users.

Rbiessy (Contributor, Author):

I have added some checks in 5f8e183 and the associated spec change is in uxlfoundation/oneAPI-spec@fda9f80.
I don't think I can add more checks that wouldn't require storing a map of the arguments used.


4 participants