Vectorize alignment algorithm for x86-64 #768

rhpvorderman · 2024-03-15T08:03:45Z

Ideally you would want to use in this case __m128i _mm_blendv_epi8(__m128i a, __m128i b, __m128i mask)
where the mask could be created with anything in the _mm_cmp**_epi8 range.

But blendv is a SSE4.1 instruction. Leading to compile headaches. However, this can be done using SSE2 instructions only:

<include "emmintrin.h">

static inline __m128i vector_blend_128(__m128i a, __m128i b, __m128i mask) {
     return _mm_or_si128(
         _mm_and_si128(mask, a);
         _mm_andnot_si128(mask, b);
     );
}

So this might open up opportunities for vectorization, using only #ifdef __SSE2__ compile guards.

EDIT: This would work for other than epi8 data types as well of course.

The text was updated successfully, but these errors were encountered:

rhpvorderman · 2024-05-27T07:33:08Z

I did some further research, as sequali now calculates sequence identity using Smith-Waterman, and unfortunately it was bottlenecking report creation. The paper below highlights ways of parallelizing the Smith-Waterman algorithm:

https://link.springer.com/article/10.1007/s12539-021-00473-0

I went for the reverse diagonal approach, since it was immediately clear to me that I could use that in a way that I would only keep 3 diagonals in memory at any time. Striped looked to me more like the whole matrix needs to be in memory. (Although I haven't properly checked this, diagonals were much more obvious, so I simply started hastily implementing).

Since in Sequali queries are limited to 31 bp, I could take massive shortcuts using avx2 vectors. So that is what I did. The result is here: rhpvorderman/sequali#164.

In theory the same could be done for cutadapt, but it would be more work. No short cuts can be taken and probably 16-bit integer vectors need to be used instead of 8-bit integer vectors. Unfortunately it is not possible to write it in a way that the compiler auto vectorizes properly, so separate instructions need to be written for each architecture. In practice this means a fallback and a sse4.1 or avx2 implementation (ARM64 with Neon instructions in production is still quite a rarity I guess?) .

So a lot of work, much more code, with a lot of potential speed benefits. That it can be done does not mean it has to be done of course. But I figured the least I could do is dump some helpful resources in the case you like to fiddle with these things.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize alignment algorithm for x86-64 #768

Vectorize alignment algorithm for x86-64 #768

rhpvorderman commented Mar 15, 2024 •

edited

Loading

rhpvorderman commented May 27, 2024

Vectorize alignment algorithm for x86-64 #768

Vectorize alignment algorithm for x86-64 #768

Comments

rhpvorderman commented Mar 15, 2024 • edited Loading

rhpvorderman commented May 27, 2024

rhpvorderman commented Mar 15, 2024 •

edited

Loading