Add support for b4_SSE2 batched mode. #1825

johnfea · 2024-06-07T14:52:21Z

Description

Add support for b4_SSE2 batched mode, enabling batched execution for all x86-64 CPUs that don't support AVX.

Is there interest in adding this into OSL? I still support CPUs with vector width of 4 in my project that uses OSL and it would be nice to not to have to resort to scalar processing.

Tests

Quick tests were run to see that output with procedural and texture based materials looked ok and proper SSE2 batched code was being generated for wide/ functions. EDIT: Additionally, all tests pass now.

Checklist:

I have read the contribution guidelines.
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
My code follows the prevailing code style of this project. If I haven't
already run clang-format v17 before submitting, I definitely will look at
the CI test that runs clang-format and fix anything that it highlights as
being nonconforming.

linux-foundation-easycla · 2024-06-07T14:52:26Z

The committers listed above are authorized under a signed CLA.

✅ login: johnfea (9e5c674, ae76938, c6d796a, b8480ab, e0197db, d13c467)
✅ login: lgritz / name: Larry Gritz (a6d89f2)

AlexMWells · 2024-06-07T16:31:17Z

@johnfea thanks, I think the question (possibly unanswered) was: With many of the noise and other functions doing internal SIMD using x,y,z or r,g,b,a to take advantage of SSE would running a Batched<4> be an improvement? The Batched<4> would execute the entire shader logic and other features in SIMD, so could be a faster. If you have any feedback on that we would love to hear it.

Also please attempt to add a testing configuration that exercises batched<4> to
.github/workflows/ci.yml
You should see some batch-b8avx2 config you can copy.

johnfea · 2024-06-07T20:10:13Z

Based on a quick benchmark, batched SSE2 was 20% to 40% faster in total render speed with low complexity material when a single material filled the screen.

20% was with a single 3D fBm and 40% was with mix of fBm and two different procedural shaders.

It seems definitely an improvement.

There was at least two maybe related issues, however. A string from a shader didn't end up with correct ustringhash in output closure and its string representation was empty string. Also, backfacing() in a shader didn't work correctly anymore.

AlexMWells · 2024-06-08T00:24:59Z

@johnfea great to hear it is speeding things up. Curious how Batched<8> does with AVX on the same workloads vs Batched<4> with SSE.

As far as backfacing() not working, it should just pull a varying value out of the BatchShaderGlobals<4>::varing.backfacing. So make sure you are populating it with value values before executing the shadernetwork.
Not sure this would affect this but take a look at:
#1827

As far as the closures, once you have added a b4_SSE2 workflow to the .github/workflows/ci.yml you can craft a failing unit test that reproduces the issue, you can update this pull request and we could take a look at the failing CI.

johnfea · 2024-06-08T13:36:16Z

b4_SSE2 ran at 60% the speed of b8_AVX2 mode with single 3D fBm when measured in total render time, but for this measurement all parts of rendering were switched and not just OSL part.
b4_SSE2 ran at 55% the speed of b8_AVX2 mode with mix of fBm and two different procedural shaders.

I updated the pull request with failing closure test unit, but I've no idea how to make the tests run so it probably won't run without some changes. Since testshade didn't seem to read output closures and testrender doesn't support batched, I added new testminimal. This issue doesn't happen with printf from a shader like done in some testshade tests.

It seems the backfacing() issue occurs with batched mode at any width but it doesn't affect output. When using backfacing() as input to a mix shader, non-batched mode doesn't output the cancelled shader in closures, but batched mode does with zero weights, which is missed optimization.

Also, the issue reproduced by this unit test is also generic to batched mode like #1801 and #1800 and not just with SSE2.

johnfea · 2024-06-17T11:05:09Z

I've updated the pull request, now "closure-string" test fails in batched mode only and for some reason also for non-batched icc/icx. This is the only test that fails for b8_AVX2 but for b4_SSE2 there's also "Invalid bitcast" <4x i1> to i8.

johnfea · 2024-06-17T13:49:28Z

I removed closure-string and testminimal from this PR and added those into a separate PR #1831.

Now for b4_SSE2 only "'Invalid bitcast' <4x i1> to i8" should fail in tests and b8_AVX2 should pass.

johnfea · 2024-06-25T10:44:36Z

All tests for this PR pass now.

I'm not sure why AVX512 needs cases for width 8 in src/liboslexec/llvm_util.cpp and why AVX needs cases for width 16. I tried removing them and EDIT: it made no difference for either b16 or b8 in tests. Anyway, I added cases in those places too for width 4. There's an alternative branch here with all of those cases removed. EDIT2: Apparently, b8 AVX512 is a path in liboslexec CMakeList.txt but b16 AVX is not. My take would be to not add any b4 paths/cases for AVX512 or AVX, but retain the existing non-typical width cases for AVX512 and AVX.

lgritz · 2024-08-15T18:02:50Z

@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now?

@AlexMWells Do you have any reservations about it?

AlexMWells · 2024-08-15T21:49:24Z

For the Files Changed it all looks good, just adds a 4 wide path utilizing all the same approaches. Want to see the CI run it though and make sure it passes the massive pile of regressions.

Great Work!

Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>

johnfea · 2024-08-16T11:18:35Z

@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now?

Yes, it can be merged now. One might want to change b4sse2 ci to be more different from b8avx2. Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though.

AlexMWells · 2024-08-16T17:39:30Z

@johnfea , can you elaborate or provide example of "old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though."

AlexMWells · 2024-08-16T17:49:19Z

Looking at CI action, I see
VFX2021 gcc9/C++17 llvm11 py3.7 exr2.5 oiio2.3 sse2 batch-b4sse2

which successfully executed in (SSE2 batch width 4) 60 different *.regress.batched.opt tests comparing results against scalar results, and another 161 *.batched.opt tests comparing against reference txt or images.

@johnfea what isn't being covered such that "Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though"?
Do we need more test suite entries to cover?

johnfea · 2024-08-16T20:06:52Z

I mean what I discussed above. Here's the code structure:

if (m_supports_avx512f) {
switch (m_vector_width) {
case 16:
break;
case 8;
// Removing this old case doesn't change test results
break;
case 4:
// Removing this new case doesn't change test results
break;
}
} else if (m_supports_avx) {
switch (m_vector_width) {
case 16:
// Removing this old case doesn't change test results
break;
case 8:
break;
case 4:
// Removing this new case doesn't change test results
break;
}
} else {
switch (m_vector_width) {
case 16:
// Removing this old case doesn't change test results
break;
case 8:
// Removing this old case doesn't change test results
break;
case 4:
break;
}
}

Here's maybe all affected functions:

llvm::Value* LLVM_Util::mask_as_int(llvm::Value* mask)
llvm::Value* LLVM_Util::op_gather(llvm::Type* src_type, llvm::Value* src_ptr,llvm::Value* wide_index)
void LLVM_Util::op_scatter(llvm::Value* wide_val, llvm::Type* src_type, llvm::Value* src_ptr, llvm::Value* wide_index)

I haven't looked more into it. For me these are not important since I don't currently use e.g. 8-wide AVX512. I added the completely untested cases for 4 there to not change existing code layout. I wouldn't worry about it but just as a note.

AlexMWells · 2024-08-16T21:29:04Z

src/liboslexec/llvm_util.cpp

+        llvm::Value* mask_as_int = mask4_as_int8(mask);
+
+        // Count trailing zeros, least significant
+        llvm::Type* types[] = { intMaskType };


This looks to repeat the code outside the switch statement, I think you can just reduce this down to

llvm::Value* mask = mask4_as_int8(mask); break;

and let the existing code after the switch handle the rest vs. repeating it.

AlexMWells · 2024-08-16T21:39:07Z

@johnfea , I got it, so CI doesn't execute all combinations of 4,8,16 and SSE, AVX, AVX2, AVX512 ISA's so portions of llvm_util.cpp maybe untested.

I do think llvm_util.cpp should support all combinations.
Due to turbo frequency differences when executing different ISA's at different widths, AVX512 8 wide is a useful/expected scenario. I could also imaging AVX/2 4 wide as well in some scenarios where top bin turbo frequencies are desired.
It would be easy to add more CI plans, but probably need a balance of CI plan time vs. coverage.
Users of batched should build OSL for their desired configurations and make sure the test suite passes (as the CI may not cover them all)
As noted 4 wide batched scenarios for other than SSE are not in the CI plan and may contain bugs until fully tested.

Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>

lgritz · 2024-08-17T16:11:32Z

I think the test failures are because of a merge that just happened in OIIO master. I merged a fix into OSL main. If you rebase on top of the current main, I think that will all get cleaned up.

lgritz · 2024-08-18T11:58:49Z

Fixed now, sorry about that. One more merge/rebase on top of the (new) main ought to clear up the CI for you.

lgritz · 2024-08-18T17:04:14Z

Looks like all the CI tests are coming in passing now.

I'm inclined to merge this in its current state, and any further enhancements or tests can be a follow-up.

Express objections now; in their absense, I will merge this tomorrow (Monday).

johnfea · 2024-09-03T07:04:13Z

This one needs to be accepted before I can work on the others. Due to changes there's again trivial merge conflict in ci.yml.

Signed-off-by: Larry Gritz <lg@larrygritz.com>

lgritz · 2024-09-04T00:06:01Z

I'm going to merge this, and then I have a fix for the CI that I will push immediately following that.

lgritz · 2024-09-04T00:14:54Z

Next time, please submit PRs from a branch other than your own main, that can make things very difficult.

Signed-off-by: Larry Gritz <lg@larrygritz.com>

johnfea force-pushed the main branch from 6ffdff8 to 52fb623 Compare June 8, 2024 12:23

johnfea force-pushed the main branch 14 times, most recently from 59297e6 to 450c96a Compare June 17, 2024 11:01

johnfea force-pushed the main branch 3 times, most recently from f79e03f to 6b1276a Compare June 17, 2024 13:46

johnfea force-pushed the main branch from 6b1276a to 9d83430 Compare June 17, 2024 15:56

johnfea force-pushed the main branch 3 times, most recently from c947bf3 to 6551e79 Compare June 25, 2024 10:07

johnfea force-pushed the main branch from 9997747 to ae76938 Compare July 12, 2024 16:19

johnfea closed this Aug 16, 2024

johnfea force-pushed the main branch from ae76938 to 28b71e7 Compare August 16, 2024 08:56

Merge branch 'main' of https://github.com/johnfea/OpenShadingLanguage

e0197db

Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>

johnfea reopened this Aug 16, 2024

Fix regression in batched SSE2 patch.

9e5c674

Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>

johnfea force-pushed the main branch from ae88298 to 9e5c674 Compare August 16, 2024 10:55

AlexMWells reviewed Aug 16, 2024

View reviewed changes

johnfea force-pushed the main branch 2 times, most recently from 8ed8976 to 7a08a8f Compare August 17, 2024 10:09

Fix unnecessary code repetition in SSE2 patch.

c6d796a

Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>

johnfea force-pushed the main branch from 7a08a8f to c6d796a Compare August 17, 2024 10:13

Merge branch 'AcademySoftwareFoundation:main' into main

b8480ab

Merge branch 'AcademySoftwareFoundation:main' into main

d13c467

Merge branch 'main' into main

a6d89f2

Signed-off-by: Larry Gritz <lg@larrygritz.com>

lgritz merged commit bd71665 into AcademySoftwareFoundation:main Sep 4, 2024
20 of 22 checks passed

lgritz added a commit to lgritz/OpenShadingLanguage that referenced this pull request Sep 4, 2024

Fix minor CI breaks after AcademySoftwareFoundation#1825

7738e23

Signed-off-by: Larry Gritz <lg@larrygritz.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for b4_SSE2 batched mode. #1825

Add support for b4_SSE2 batched mode. #1825

johnfea commented Jun 7, 2024 •

edited

Loading

linux-foundation-easycla bot commented Jun 7, 2024 •

edited

Loading

AlexMWells commented Jun 7, 2024

johnfea commented Jun 7, 2024

AlexMWells commented Jun 8, 2024

johnfea commented Jun 8, 2024 •

edited

Loading

johnfea commented Jun 17, 2024

johnfea commented Jun 17, 2024 •

edited

Loading

johnfea commented Jun 25, 2024 •

edited

Loading

lgritz commented Aug 15, 2024

AlexMWells commented Aug 15, 2024

johnfea commented Aug 16, 2024

AlexMWells commented Aug 16, 2024

AlexMWells commented Aug 16, 2024 •

edited

Loading

johnfea commented Aug 16, 2024

AlexMWells Aug 16, 2024

johnfea Aug 17, 2024

AlexMWells commented Aug 16, 2024

lgritz commented Aug 17, 2024

lgritz commented Aug 18, 2024

lgritz commented Aug 18, 2024

johnfea commented Sep 3, 2024

lgritz commented Sep 4, 2024

lgritz commented Sep 4, 2024

Add support for b4_SSE2 batched mode. #1825

Add support for b4_SSE2 batched mode. #1825

Conversation

johnfea commented Jun 7, 2024 • edited Loading

Description

Tests

Checklist:

linux-foundation-easycla bot commented Jun 7, 2024 • edited Loading

AlexMWells commented Jun 7, 2024

johnfea commented Jun 7, 2024

AlexMWells commented Jun 8, 2024

johnfea commented Jun 8, 2024 • edited Loading

johnfea commented Jun 17, 2024

johnfea commented Jun 17, 2024 • edited Loading

johnfea commented Jun 25, 2024 • edited Loading

lgritz commented Aug 15, 2024

AlexMWells commented Aug 15, 2024

johnfea commented Aug 16, 2024

AlexMWells commented Aug 16, 2024

AlexMWells commented Aug 16, 2024 • edited Loading

johnfea commented Aug 16, 2024

AlexMWells Aug 16, 2024

Choose a reason for hiding this comment

johnfea Aug 17, 2024

Choose a reason for hiding this comment

AlexMWells commented Aug 16, 2024

lgritz commented Aug 17, 2024

lgritz commented Aug 18, 2024

lgritz commented Aug 18, 2024

johnfea commented Sep 3, 2024

lgritz commented Sep 4, 2024

lgritz commented Sep 4, 2024

johnfea commented Jun 7, 2024 •

edited

Loading

linux-foundation-easycla bot commented Jun 7, 2024 •

edited

Loading

johnfea commented Jun 8, 2024 •

edited

Loading

johnfea commented Jun 17, 2024 •

edited

Loading

johnfea commented Jun 25, 2024 •

edited

Loading

AlexMWells commented Aug 16, 2024 •

edited

Loading