Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize alignment algorithm for x86-64 #768

Open
rhpvorderman opened this issue Mar 15, 2024 · 1 comment
Open

Vectorize alignment algorithm for x86-64 #768

rhpvorderman opened this issue Mar 15, 2024 · 1 comment

Comments

@rhpvorderman
Copy link
Collaborator

rhpvorderman commented Mar 15, 2024

Ideally you would want to use in this case __m128i _mm_blendv_epi8(__m128i a, __m128i b, __m128i mask)
where the mask could be created with anything in the _mm_cmp**_epi8 range.

But blendv is a SSE4.1 instruction. Leading to compile headaches. However, this can be done using SSE2 instructions only:

<include "emmintrin.h">

static inline __m128i vector_blend_128(__m128i a, __m128i b, __m128i mask) {
     return _mm_or_si128(
         _mm_and_si128(mask, a);
         _mm_andnot_si128(mask, b);
     );
}

So this might open up opportunities for vectorization, using only #ifdef __SSE2__ compile guards.

EDIT: This would work for other than epi8 data types as well of course.

@rhpvorderman
Copy link
Collaborator Author

I did some further research, as sequali now calculates sequence identity using Smith-Waterman, and unfortunately it was bottlenecking report creation. The paper below highlights ways of parallelizing the Smith-Waterman algorithm:

https://link.springer.com/article/10.1007/s12539-021-00473-0

I went for the reverse diagonal approach, since it was immediately clear to me that I could use that in a way that I would only keep 3 diagonals in memory at any time. Striped looked to me more like the whole matrix needs to be in memory. (Although I haven't properly checked this, diagonals were much more obvious, so I simply started hastily implementing).

Since in Sequali queries are limited to 31 bp, I could take massive shortcuts using avx2 vectors. So that is what I did. The result is here: rhpvorderman/sequali#164.

In theory the same could be done for cutadapt, but it would be more work. No short cuts can be taken and probably 16-bit integer vectors need to be used instead of 8-bit integer vectors. Unfortunately it is not possible to write it in a way that the compiler auto vectorizes properly, so separate instructions need to be written for each architecture. In practice this means a fallback and a sse4.1 or avx2 implementation (ARM64 with Neon instructions in production is still quite a rarity I guess?) .

So a lot of work, much more code, with a lot of potential speed benefits. That it can be done does not mean it has to be done of course. But I figured the least I could do is dump some helpful resources in the case you like to fiddle with these things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant