Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for b4_SSE2 batched mode. #1825

Merged
merged 7 commits into from
Sep 4, 2024

Conversation

johnfea
Copy link
Contributor

@johnfea johnfea commented Jun 7, 2024

Description

Add support for b4_SSE2 batched mode, enabling batched execution for all x86-64 CPUs that don't support AVX.

Is there interest in adding this into OSL? I still support CPUs with vector width of 4 in my project that uses OSL and it would be nice to not to have to resort to scalar processing.

Tests

Quick tests were run to see that output with procedural and texture based materials looked ok and proper SSE2 batched code was being generated for wide/ functions. EDIT: Additionally, all tests pass now.

Checklist:

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project. If I haven't
    already run clang-format v17 before submitting, I definitely will look at
    the CI test that runs clang-format and fix anything that it highlights as
    being nonconforming.

Copy link

linux-foundation-easycla bot commented Jun 7, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@AlexMWells
Copy link
Contributor

@johnfea thanks, I think the question (possibly unanswered) was: With many of the noise and other functions doing internal SIMD using x,y,z or r,g,b,a to take advantage of SSE would running a Batched<4> be an improvement? The Batched<4> would execute the entire shader logic and other features in SIMD, so could be a faster. If you have any feedback on that we would love to hear it.

Also please attempt to add a testing configuration that exercises batched<4> to
.github/workflows/ci.yml
You should see some batch-b8avx2 config you can copy.

@johnfea
Copy link
Contributor Author

johnfea commented Jun 7, 2024

Based on a quick benchmark, batched SSE2 was 20% to 40% faster in total render speed with low complexity material when a single material filled the screen.

20% was with a single 3D fBm and 40% was with mix of fBm and two different procedural shaders.

It seems definitely an improvement.

There was at least two maybe related issues, however. A string from a shader didn't end up with correct ustringhash in output closure and its string representation was empty string. Also, backfacing() in a shader didn't work correctly anymore.

@AlexMWells
Copy link
Contributor

@johnfea great to hear it is speeding things up. Curious how Batched<8> does with AVX on the same workloads vs Batched<4> with SSE.

As far as backfacing() not working, it should just pull a varying value out of the BatchShaderGlobals<4>::varing.backfacing. So make sure you are populating it with value values before executing the shadernetwork.
Not sure this would affect this but take a look at:
#1827

As far as the closures, once you have added a b4_SSE2 workflow to the .github/workflows/ci.yml you can craft a failing unit test that reproduces the issue, you can update this pull request and we could take a look at the failing CI.

@johnfea
Copy link
Contributor Author

johnfea commented Jun 8, 2024

b4_SSE2 ran at 60% the speed of b8_AVX2 mode with single 3D fBm when measured in total render time, but for this measurement all parts of rendering were switched and not just OSL part.
b4_SSE2 ran at 55% the speed of b8_AVX2 mode with mix of fBm and two different procedural shaders.

I updated the pull request with failing closure test unit, but I've no idea how to make the tests run so it probably won't run without some changes. Since testshade didn't seem to read output closures and testrender doesn't support batched, I added new testminimal. This issue doesn't happen with printf from a shader like done in some testshade tests.

It seems the backfacing() issue occurs with batched mode at any width but it doesn't affect output. When using backfacing() as input to a mix shader, non-batched mode doesn't output the cancelled shader in closures, but batched mode does with zero weights, which is missed optimization.

Also, the issue reproduced by this unit test is also generic to batched mode like #1801 and #1800 and not just with SSE2.

@johnfea johnfea force-pushed the main branch 14 times, most recently from 59297e6 to 450c96a Compare June 17, 2024 11:01
@johnfea
Copy link
Contributor Author

johnfea commented Jun 17, 2024

I've updated the pull request, now "closure-string" test fails in batched mode only and for some reason also for non-batched icc/icx. This is the only test that fails for b8_AVX2 but for b4_SSE2 there's also "Invalid bitcast" <4x i1> to i8.

@johnfea johnfea force-pushed the main branch 3 times, most recently from f79e03f to 6b1276a Compare June 17, 2024 13:46
@johnfea
Copy link
Contributor Author

johnfea commented Jun 17, 2024

I removed closure-string and testminimal from this PR and added those into a separate PR #1831.

Now for b4_SSE2 only "'Invalid bitcast' <4x i1> to i8" should fail in tests and b8_AVX2 should pass.

@johnfea
Copy link
Contributor Author

johnfea commented Jun 25, 2024

All tests for this PR pass now.

I'm not sure why AVX512 needs cases for width 8 in src/liboslexec/llvm_util.cpp and why AVX needs cases for width 16. I tried removing them and EDIT: it made no difference for either b16 or b8 in tests. Anyway, I added cases in those places too for width 4. There's an alternative branch here with all of those cases removed. EDIT2: Apparently, b8 AVX512 is a path in liboslexec CMakeList.txt but b16 AVX is not. My take would be to not add any b4 paths/cases for AVX512 or AVX, but retain the existing non-typical width cases for AVX512 and AVX.

@lgritz
Copy link
Collaborator

lgritz commented Aug 15, 2024

@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now?

@AlexMWells Do you have any reservations about it?

@AlexMWells
Copy link
Contributor

For the Files Changed it all looks good, just adds a 4 wide path utilizing all the same approaches. Want to see the CI run it though and make sure it passes the massive pile of regressions.

Great Work!

Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>
@johnfea johnfea reopened this Aug 16, 2024
Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>
@johnfea
Copy link
Contributor Author

johnfea commented Aug 16, 2024

@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now?

Yes, it can be merged now. One might want to change b4sse2 ci to be more different from b8avx2. Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though.

@AlexMWells
Copy link
Contributor

@johnfea , can you elaborate or provide example of "old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though."

@AlexMWells
Copy link
Contributor

AlexMWells commented Aug 16, 2024

Looking at CI action, I see
VFX2021 gcc9/C++17 llvm11 py3.7 exr2.5 oiio2.3 sse2 batch-b4sse2

which successfully executed in (SSE2 batch width 4) 60 different *.regress.batched.opt tests comparing results against scalar results, and another 161 *.batched.opt tests comparing against reference txt or images.

@johnfea what isn't being covered such that "Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though"?
Do we need more test suite entries to cover?

@johnfea
Copy link
Contributor Author

johnfea commented Aug 16, 2024

I mean what I discussed above. Here's the code structure:

if (m_supports_avx512f) {
switch (m_vector_width) {
case 16:
break;
case 8;
// Removing this old case doesn't change test results
break;
case 4:
// Removing this new case doesn't change test results
break;
}
} else if (m_supports_avx) {
switch (m_vector_width) {
case 16:
// Removing this old case doesn't change test results
break;
case 8:
break;
case 4:
// Removing this new case doesn't change test results
break;
}
} else {
switch (m_vector_width) {
case 16:
// Removing this old case doesn't change test results
break;
case 8:
// Removing this old case doesn't change test results
break;
case 4:
break;
}
}

Here's maybe all affected functions:

llvm::Value* LLVM_Util::mask_as_int(llvm::Value* mask)
llvm::Value* LLVM_Util::op_gather(llvm::Type* src_type, llvm::Value* src_ptr,llvm::Value* wide_index)
void LLVM_Util::op_scatter(llvm::Value* wide_val, llvm::Type* src_type, llvm::Value* src_ptr, llvm::Value* wide_index)

I haven't looked more into it. For me these are not important since I don't currently use e.g. 8-wide AVX512. I added the completely untested cases for 4 there to not change existing code layout. I wouldn't worry about it but just as a note.

llvm::Value* mask_as_int = mask4_as_int8(mask);

// Count trailing zeros, least significant
llvm::Type* types[] = { intMaskType };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to repeat the code outside the switch statement, I think you can just reduce this down to

llvm::Value* mask = mask4_as_int8(mask);
break;

and let the existing code after the switch handle the rest vs. repeating it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@AlexMWells
Copy link
Contributor

@johnfea , I got it, so CI doesn't execute all combinations of 4,8,16 and SSE, AVX, AVX2, AVX512 ISA's so portions of llvm_util.cpp maybe untested.

  1. I do think llvm_util.cpp should support all combinations.
  2. Due to turbo frequency differences when executing different ISA's at different widths, AVX512 8 wide is a useful/expected scenario. I could also imaging AVX/2 4 wide as well in some scenarios where top bin turbo frequencies are desired.
  3. It would be easy to add more CI plans, but probably need a balance of CI plan time vs. coverage.
  4. Users of batched should build OSL for their desired configurations and make sure the test suite passes (as the CI may not cover them all)
  5. As noted 4 wide batched scenarios for other than SSE are not in the CI plan and may contain bugs until fully tested.

@johnfea johnfea force-pushed the main branch 2 times, most recently from 8ed8976 to 7a08a8f Compare August 17, 2024 10:09
Signed-off-by: Tuomas Tonteri <e90@tuomastonteri.fi>
@lgritz
Copy link
Collaborator

lgritz commented Aug 17, 2024

I think the test failures are because of a merge that just happened in OIIO master. I merged a fix into OSL main. If you rebase on top of the current main, I think that will all get cleaned up.

@lgritz
Copy link
Collaborator

lgritz commented Aug 18, 2024

Fixed now, sorry about that. One more merge/rebase on top of the (new) main ought to clear up the CI for you.

@lgritz
Copy link
Collaborator

lgritz commented Aug 18, 2024

Looks like all the CI tests are coming in passing now.

I'm inclined to merge this in its current state, and any further enhancements or tests can be a follow-up.

Express objections now; in their absense, I will merge this tomorrow (Monday).

@johnfea
Copy link
Contributor Author

johnfea commented Sep 3, 2024

This one needs to be accepted before I can work on the others. Due to changes there's again trivial merge conflict in ci.yml.

Signed-off-by: Larry Gritz <lg@larrygritz.com>
@lgritz
Copy link
Collaborator

lgritz commented Sep 4, 2024

I'm going to merge this, and then I have a fix for the CI that I will push immediately following that.

@lgritz lgritz merged commit bd71665 into AcademySoftwareFoundation:main Sep 4, 2024
20 of 22 checks passed
@lgritz
Copy link
Collaborator

lgritz commented Sep 4, 2024

Next time, please submit PRs from a branch other than your own main, that can make things very difficult.

lgritz added a commit to lgritz/OpenShadingLanguage that referenced this pull request Sep 4, 2024
Signed-off-by: Larry Gritz <lg@larrygritz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants