Allow using SIMD intrinsics for SSE2, SSE4, AVX2 and NEON code #8209

homm · 2024-07-06T11:31:44Z

Current SIMD state

There is a separate Pillow-SIMD package on PyPI with modified Pillow code, which includes some rewritten routines using SSE4 and AVX2 compiler intrinsics. Any version of Pillow-SIMD has strictly the same functionality, behavior, and bugs as the corresponding Pillow version. There are many problems with this approach, including:

Pillow-SIMD versions are often outdated. This is because any Pillow-SIMD release requires some work to merge SIMD-accelerated code into every Pillow version, which involves resolving potential conflicts and testing all implementations.
SIMD-accelerated code is unconditionally platform-specific. Pillow-SIMD can be compiled and run only on x86. This means that it is very hard to add Pillow-SIMD as a dependency in cross-platform applications.
Pillow-SIMD is a different package, although it shares the same namespace (PIL), which means only one package (Pillow or Pillow-SIMD) should be installed at a time in one environment. It can be tricky to avoid Pillow installation if it is a dependency for some third-party libraries.
Pillow-SIMD itself can't be a dependency for third-party libraries, since it's not cross-platform.
Pillow-SIMD doesn't provide precompiled binaries since there are two types of SIMD-acceleration implemented: SSE4 and AVX2. Pillow-SIMD compiled with AVX2 is a bit faster, but can't be run on non-AVX2 CPUs (including Rosetta 2 on Apple silicon).

Proposal

I'd like to move SIMD-accelerated code to the upstream with some enhancements for cross-platform compilation. How it affects described problems:

Every released Pillow version will contain SIMD-accelerated code, so no additional actions will be required.
Pillow can be compiled on any platform, with or without SIMD-acceleration, if available. No separate package will be required.
Like the previous point, no separate package will be required.
Like the previous point, no separate package will be required.
Default precompiled binaries for Pillow will be provided with minimal available SIMD-acceleration for every platform: SSE2 for x86_64 and NEON for ARM aarch64 (please note, currently there is no SSE2 or NEON accelerated code). For advanced users who want acceleration, compilation documentation will be provided.

Challenges

Testing

The main challenge is testing. If some code is contributed to the upstream, it should be run during tests. There could be the following versions of the code:

Native code: Will be tested on platforms without acceleration support: 32-bit x86, s390x, and ppc64le. Are there any tests for any of these platforms?
SSE2: The most common, since it is the default for x86_64.
SSE4: Not currently tested. Should be added explicitly.
AVX2: Not currently tested. Should be added explicitly.
NEON: Mandatory extension for ARM aarch64, so it will be tested in many ARM configurations.

Detecting Current Acceleration

There should be a way to detect the current acceleration implementation. I think some property for the core object should be enough.

Criticism

The proposed solution doesn't provide first-class SIMD-acceleration for Pillow. When you are using any application or installing other libraries, like NumPy, there is only one compiled version available for a platform, and it contains all types of available acceleration. The right accelerated functions are dynamically called at runtime instead of recompiling for particular CPU capabilities during installation. However, such an approach requires much more effort, and this is not the aim of the current proposal.

homm · 2024-07-08T07:22:04Z

@python-pillow/pillow-team What do you think?

nulano · 2024-07-15T21:16:56Z

I don't see why testing x86_64 would be an issue - we already have special cases for PYTHONOPTIMIZE and could do the same for SSE4 / AVX2:

Pillow/.github/workflows/test.yml

Lines 53 to 54 in 2152a17

    
           - { python-version: "3.11", PYTHONOPTIMIZE: 1, REVERSE: "--reverse" } 
        
           - { python-version: "3.10", PYTHONOPTIMIZE: 2 }

I think the only x86 (32-bit) job we have is on AppVeyor (using SysWOW64):

Pillow/.appveyor.yml

Lines 21 to 23 in 2152a17

    
           - PYTHON: C:/Python312 
        
             ARCHITECTURE: x86 
        
             APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2022

However, it seems to have selected sse2: https://ci.appveyor.com/project/Python-pillow/pillow/builds/50165788/job/sv60thgf3ygibbv4?fullLog=true#L3667

s390x and ppc64le are tested using QEMU:

Pillow/.github/workflows/test-docker.yml

Lines 39 to 40 in 2152a17

    
           ubuntu-24.04-noble-ppc64le, 
        
           ubuntu-24.04-noble-s390x,

Detecting Current Acceleration

There should be a way to detect the current acceleration implementation. I think some property for the core object should be enough.

~~I anticipate that we will get questions about this, so I would propose we add this to PIL.features and the report summary PIL.features.pilinfo.~~

I see you've already done this 👍

The right accelerated functions are dynamically called at runtime instead of recompiling for particular CPU capabilities during installation.

I think we should ideally aim for implementing this at some point in the future, but starting with compile-time selection for now would be a good first step.

nulano · 2024-07-15T21:18:30Z

src/PIL/features.py

@@ -128,6 +128,7 @@ def get_supported_codecs() -> list[str]:
    "libjpeg_turbo": ("PIL._imaging", "HAVE_LIBJPEGTURBO", "libjpeg_turbo_version"),
    "libimagequant": ("PIL._imaging", "HAVE_LIBIMAGEQUANT", "imagequant_version"),
    "xcb": ("PIL._imaging", "HAVE_XCB", None),
+    "acceleration": ("PIL._imaging", "acceleration", "acceleration"),


I think this violates the type hint for check_feature by returning a string instead of a bool.

.github/workflows/test.yml

homm marked this pull request as draft July 6, 2024 11:34

homm force-pushed the simd-init branch from 0b4b3b0 to 5376120 Compare July 6, 2024 11:42

radarhere mentioned this pull request Jul 8, 2024

Updated test uploadcare/pillow-simd#137

Merged

nulano reviewed Jul 15, 2024

View reviewed changes

homm and others added 7 commits July 16, 2024 22:19

Add SIMD headers

86dc0ab

Add SIMD example

0472d3d

Add __SSE4_2__ for MSCC

c1f137a

Add core.acceleration attribute

3e19e2e

clang-format

ada8478

Define better name __NEON__

f6df27b

Updated test

43bf980

homm force-pushed the simd-init branch 3 times, most recently from bd65e1c to 7fa8fb8 Compare July 16, 2024 19:11

hugovk reviewed Jul 17, 2024

View reviewed changes

.github/workflows/test.yml Outdated Show resolved Hide resolved

.github/workflows/test.yml Outdated Show resolved Hide resolved

Add accelerated test builds

b979bb8

homm force-pushed the simd-init branch from 7fa8fb8 to b979bb8 Compare July 17, 2024 06:35

homm requested a review from hugovk July 17, 2024 11:36

homm marked this pull request as ready for review July 28, 2024 15:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow using SIMD intrinsics for SSE2, SSE4, AVX2 and NEON code #8209

Allow using SIMD intrinsics for SSE2, SSE4, AVX2 and NEON code #8209

homm commented Jul 6, 2024 •

edited

Loading

homm commented Jul 8, 2024

nulano commented Jul 15, 2024

nulano Jul 15, 2024

Allow using SIMD intrinsics for SSE2, SSE4, AVX2 and NEON code #8209

Are you sure you want to change the base?

Allow using SIMD intrinsics for SSE2, SSE4, AVX2 and NEON code #8209

Conversation

homm commented Jul 6, 2024 • edited Loading

Current SIMD state

Proposal

Challenges

Testing

Detecting Current Acceleration

Criticism

homm commented Jul 8, 2024

nulano commented Jul 15, 2024

nulano Jul 15, 2024

Choose a reason for hiding this comment

homm commented Jul 6, 2024 •

edited

Loading