Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix userdata binding corner case #1673

Merged
merged 1 commit into from
Apr 24, 2023

Conversation

aconty
Copy link
Contributor

@aconty aconty commented Apr 24, 2023

Description

This includes two fixes:

  1. When registering symbols that need user data, sort the entries in
    the set so the layer number is ignored. A needed udata iteam shouldn't
    depend on the layer and separating them makes find_userdata_index()
    sometimes find an index with different derivs status.

  2. osl_bind_interpolated_param() is memcpy'ing derivs that might not
    be there, yielding corrupted derivs and possibly a crash.

Checklist:

  • I have read the contribution guidelines.
  • I have previously submitted a Contributor License Agreement.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

This includes two fixes:

 1) When registering symbols that need user data, sort the entries in
 the set so the layer number is ignored. A needed udata iteam shouldn't
 depend on the layer and separating them makes find_userdata_index()
 sometimes find an index with different derivs status.

 2) osl_bind_interpolated_param() is memcpy'ing derivs that might not
 be there, yielding corrupted derivs and possibly a crash.

Signed-off-by: Alejandro Conty <aconty@imageworks.com>
@lgritz
Copy link
Collaborator

lgritz commented Apr 24, 2023

So this is the fix for the weird problem we were looking at last week?

Background: For those outside SPI, we had stumbled across a shader network that produced incorrect results when we DISABLED instance merging (when two shader instance nodes in a group have identical inputs and do identical computations, they are merged and the newly single node's outputs are sent to wherever the original ones' outputs went). We're accustomed to occasionally seeing some kind of logic error on our/my part where two nodes were merged when it wasn't actually safe to do so, but in this case, the merged result was correct but when merging was disabled, it was wrong. Real head scratcher.

Looks like Alex has tracked it down, which is awesome and an impressive feat of detective work.

Alex, I wonder if you could spell out for us exactly what was going wrong and how this fixes things? I think I get the gist -- there were two nodes that needed the same bit of named userdata, but one needed the derivs of the userdata and the other did not. Once merged, the derivatives would be retrieved by the merged node, since its output would be connected to the downstream sink that needed derivs. But when unmerged, if the one that didn't need derivs retrieved the userdata first, no derivs would be present (and by the time the one that needed derivs runs, the userdata is already computed and cached, deriv-free).

I think I get that part, but I am still a little hazy about how exactly this is fixed by removing the layer index from the sorting criteria of the list of userdata we will need. Won't you still have two entries in the list, one needing derivs and the other not, and so there is some kind of order dependence you are inadvertently relying on? Oh, is it a set of some kind, so there is not a second entry at all if the sort doesn't make them look different? But even in that case, if the first one added to the set doesn't need derivs but the second does, what is the mechanism by which the record already in the set becomes aware that it needs that userdata to compute derivs?

@aconty
Copy link
Contributor Author

aconty commented Apr 24, 2023

Exactly, it is a set<> built on that index that was considering the layer num. Now that it ignores it both layer will get the same entry in the set, and since one needs derivs, it will promote the entry. Without this change two entries are put in the set, and then find_userdata_index() will get the first one, which may not have derivs and fail to retrieve them.

We had to identical layers asking for Pref [ lockgeom off ]. One needed derivs, the other one did. When not merging we were unlucky and find_userdata_index() was going always for the one without derivs.

Copy link
Collaborator

@lgritz lgritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! LGTM

@lgritz lgritz merged commit 5600d02 into AcademySoftwareFoundation:main Apr 24, 2023
std::min(symbol_data_size, udata_size));
if (symbol_data_size > udata_size)
memset((char*)symbol_data + udata_size, 0,
symbol_data_size - udata_size);
return 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wide_shadingsys.cpp __OSL_OP(bind_interpolated_param) could use this same type of fix, although because of the SOA data layout it might be more complicated. In particular can't just copy less bytes. Are there any unit tests to exercise this exact issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the fixed version of the set<> index, the safeguard in bind_interpolated_param shouldn't be necessary. I added it first to confirm we were getting garbage, then I left it in case the policy changed. No unit test, but with the other fix it won't even fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you making the statement
OSL_ASSERT(symbol_data_size == udata_size);
?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside:

I would love to eventually work toward as much as possible merging the separate scalar and batch versions of the ShadingSystem and BackendLLVM classes, or making them both inherit from a common base class with most of the functionality that is shared, or making batch a derived class of scalar, or something. Making "batch-ness" be more of a mode or a subclass that has only a few critical methods overloaded/replaced, will help to reduce these recurring issues where the scalar and batch systems have large numbers of identical or very similar methods and patchers (myself included) often only remember to change one of them.

In retrospect, I think it's less problematic how we did the OptiX back end, in which there are a variety of #if USE_OPTIX (for compile time) and if (optixmode) (for runtime) clauses inside the methods that really need them, rather than completely replicating the class hierarchy. That seems to suffer from fewer cases where we make a change to scalar or optix but forget to make the corresponding fix to the other... because for most of it, there are not separate code paths.

Granted, the changes necessary for batch shading are more intrusive and extensive -- the whole Cuda shtick is that you can largely write the code as if it's running on one point at a time, so the Cuda/OptiX code is closer in spirt to the CPU scalar path than the CPU simd path is to the CPU scalar path.

lgritz pushed a commit to lgritz/OpenShadingLanguage that referenced this pull request Apr 25, 2023
This includes two fixes:

 1) When registering symbols that need user data, sort the entries in
 the set so the layer number is ignored. A needed udata iteam shouldn't
 depend on the layer and separating them makes find_userdata_index()
 sometimes find an index with different derivs status.

 2) osl_bind_interpolated_param() is memcpy'ing derivs that might not
 be there, yielding corrupted derivs and possibly a crash.

Signed-off-by: Alejandro Conty <aconty@imageworks.com>
chellmuth pushed a commit to chellmuth/OpenShadingLanguage that referenced this pull request Sep 6, 2024
Highlights:

  * GPU/OptiX support of ReParameter (AcademySoftwareFoundation#1686)
  * Fix OptiX userdata derivatives for interpolated params on GPU (AcademySoftwareFoundation#1685)
  * Fix userdata binding corner case (AcademySoftwareFoundation#1673)
  * Fix constant float values being converted to ints (AcademySoftwareFoundation#1674)

Restock the abi dump file as we push to 1.13.4.

See merge request spi/dev/3rd-party/osl-feedstock!48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants