Block Jacobi not parallelized for matrix-matrix system #1391

blegouix · 2023-08-16T14:47:40Z

Hello,

This issue follows #1381, in which the problem was not correctly identified.

I am using Bicgstab to solve a test problem where A is a 1000x1000 band matrix with a bandwidth of 19, stored in Csr format, whose non-zeros in each row are 1 2 3 4 5 6 7 8 9 10 9 8 7 6 5 4 3 2 1. B is a 1000x1000 dense matrix filled with ones.
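For anyone who wants to reproduce the test matrix, here is a minimal sketch of how the band pattern described above could be assembled into raw CSR arrays. It uses only the standard library; `CsrMatrix` and `make_band_matrix` are hypothetical helper names, not Ginkgo classes, and the arrays would still need to be handed to the library's own Csr type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Plain CSR storage (hypothetical helper type, not a Ginkgo class).
struct CsrMatrix {
    int num_rows;
    std::vector<int> row_ptrs;   // size num_rows + 1
    std::vector<int> col_idxs;   // column index of each stored entry
    std::vector<double> values;  // value of each stored entry
};

// Builds an n x n band matrix with 2 * half_bw + 1 diagonals. The entry at
// offset d from the main diagonal is half_bw + 1 - |d|, so half_bw = 9 gives
// the row pattern 1 2 ... 9 10 9 ... 2 1 (truncated at the matrix borders).
CsrMatrix make_band_matrix(int n, int half_bw)
{
    assert(n > 0 && half_bw >= 0);
    CsrMatrix m;
    m.num_rows = n;
    m.row_ptrs.reserve(n + 1);
    m.row_ptrs.push_back(0);
    for (int row = 0; row < n; ++row) {
        const int first = std::max(0, row - half_bw);
        const int last = std::min(n - 1, row + half_bw);
        for (int col = first; col <= last; ++col) {
            m.col_idxs.push_back(col);
            m.values.push_back(half_bw + 1 - std::abs(col - row));
        }
        m.row_ptrs.push_back(static_cast<int>(m.col_idxs.size()));
    }
    return m;
}
```

Calling `make_band_matrix(1000, 9)` corresponds to the 1000x1000 case with 19 bands; rows near the top and bottom simply hold fewer entries.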

The number of columns in B (or X) is the batch size n_batch=1000 (but this is a batch in which all individual systems share the same A, which is why I don't use the new BatchDense or BatchCsr classes).

I compare the performance on GPU with and without a Jacobi preconditioner (block size 32):

Without Jacobi: 4700 iterations, 5s total execution time.
With Jacobi: 490 iterations, 20s total execution time.

So the execution time of one iteration is much longer with the preconditioner (~19ms vs ~1ms). This effect does not appear with n_batch=1 (matrix-vector system, total execution time 0.8s).

The reason is that gko::kernels::cuda::jacobi::kernel::apply is called n_batch times sequentially. Could this be improved?

Regards


MarcelKoch · 2023-08-16T14:50:46Z

Perhaps, to sidestep the issue with the block Jacobi, you could create the Jacobi preconditioner with .with_max_block_size(1u). Maybe that already helps you get a shorter runtime.

But of course, we need to fix our implementation.
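To illustrate why a block size of 1 could help: with 1x1 blocks, Jacobi reduces mathematically to scaling each row of the residual by the inverse of the matching diagonal entry, which can be done for all right-hand-side columns in one pass. The sketch below shows that operation in plain C++. It is not Ginkgo's kernel; `apply_scalar_jacobi` and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies Z = D^{-1} R in place, where D is the diagonal of A and R is an
// n x num_cols dense block stored row-major (hypothetical helper, not a
// Ginkgo function). Every column is handled in the same sweep, so there is
// no per-column call.
void apply_scalar_jacobi(const std::vector<double>& diag,
                         std::vector<double>& rhs, std::size_t num_cols)
{
    assert(num_cols > 0);
    assert(rhs.size() == diag.size() * num_cols);
    for (std::size_t row = 0; row < diag.size(); ++row) {
        assert(diag[row] != 0.0);
        const double inv = 1.0 / diag[row];
        for (std::size_t col = 0; col < num_cols; ++col) {
            rhs[row * num_cols + col] *= inv;
        }
    }
}
```

The trade-off is that a diagonal preconditioner is usually weaker than a block one, so the iteration count may sit somewhere between the two figures reported above.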
