Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check Jolt with TSAN in CI #1278

Merged
merged 7 commits into from
Sep 26, 2024
Merged

Check Jolt with TSAN in CI #1278

merged 7 commits into from
Sep 26, 2024

Conversation

romasandu-gaijin
Copy link
Contributor

@romasandu-gaijin romasandu-gaijin commented Sep 24, 2024

Added a new CI workflow for checking Jolt with TSAN. I think it works.
Also patched Reference.h and JobSystem.h to not use atomic_thread_fences under TSAN.
Tackles #1275

This commit adds a new GitHub Actions workflow that checks
Jolt with ThreadSanitizer under Ubuntu & Clang. The workflow
is triggered on every push to the repository.

This commit also replaces usage of atomic_thread_fences in
Reference.h and JobSystem.h with regular atomic ops when
building under TSAN, as it does not support fences and unlikely
ever will do so.
@CLAassistant
Copy link

CLAassistant commented Sep 24, 2024

CLA assistant check
All committers have signed the CLA.

@jrouwe
Copy link
Owner

jrouwe commented Sep 24, 2024

Thanks! I'll make some cosmetic changes to it before merging. I'm surprised that it needed this few changes, when I tried it last it triggered so many false positives that there was no way I was going to fix them all.

* Moving tsan_check.yml into the main build.yml
* Added a separate ReleaseTSAN target (similar to e.g. ReleaseUBSAN)
* Including Core.h is not needed (if it hasn't been included JPH_NAMESPACE_BEGIN will not have been defined)
* Style changes
@jrouwe
Copy link
Owner

jrouwe commented Sep 25, 2024

I'm a bit at a loss why the TSAN test passed at first. If I use your original change and locally run what was in tsan_check.yml:

./cmake_linux_clang_gcc.sh RelWithDebInfo clang++-15 -DCMAKE_CXX_FLAGS="-fsanitize=thread" -DCROSS_PLATFORM_DETERMINISTIC=OFF -DTARGET_VIEWER=OFF -DTARGET_SAMPLES=OFF -DTARGET_HELLO_WORLD=OFF -DTARGET_UNIT_TESTS=ON -DTARGET_PERFORMANCE_TEST=ON

compile and run:

ctest --output-on-failure --verbose

then I get the same errors as we're seeing now.

@jrouwe
Copy link
Owner

jrouwe commented Sep 25, 2024

Ah, I think I understand, you can see in the build log:

[ 69%] Built target HelloWorld

while -DTARGET_HELLO_WORLD=OFF was specified. It looks like passing parameters using multiple lines:

run: ./cmake_linux_clang_gcc.sh RelWithDebInfo ${{env.UBUNTU_CLANG_VERSION}} \
        -DCMAKE_CXX_FLAGS="-fsanitize=thread" \
        -DCROSS_PLATFORM_DETERMINISTIC=OFF \
        -DTARGET_VIEWER=OFF \
        -DTARGET_SAMPLES=OFF \
        -DTARGET_HELLO_WORLD=OFF \
        -DTARGET_UNIT_TESTS=ON \
        -DTARGET_PERFORMANCE_TEST=ON

means they're not actually passed to cmake.

Edit: Put them all on the same line and now HelloWorld is no longer built. So it appears the first test run wasn't using TSAN at all.

inline int GetValue() const
{
#if defined(JPH_TSAN_ENABLED) && !defined(JPH_PLATFORM_WINDOWS)
// TSAN complains that we're reading mCount without locking. We don't care if we read a stale value. Inserting a lock to keep TSAN happy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be better to use an atomic_int in relaxed mode?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to change it to __attribute__((no_sanitize("thread"))), but yes that's a better idea.

…rformanceTest later again)

Turn on compiler optimizations
Implemented @SirLynix's suggestion
@jrouwe jrouwe merged commit 8f69986 into jrouwe:master Sep 26, 2024
69 checks passed
@jrouwe
Copy link
Owner

jrouwe commented Sep 26, 2024

@romasandu-gaijin Thanks for getting the ball rolling on this! I'll re-add the PerformanceTest later (if I can get it to work).

@jrouwe
Copy link
Owner

jrouwe commented Sep 29, 2024

The PerformanceTests have been added. They actually found one valid race condition and I had to disable a bunch of false positives. See #1285.

@romasandu-gaijin
Copy link
Contributor Author

@jrouwe oh snap. How come newlines don't actually work? O_o
I swear I checked it manually with the same command line as CI should've used and it passed without any warnings... Oh well. Thank you for finishing this!

@romasandu-gaijin
Copy link
Contributor Author

Also @jrouwe, a general observation of mine is that TSAN never actually emits false positives. Every single warning is an actual violation of the C++ standard memory model in my experience, it's just that sometimes it is REALLY hard to understand why exactly something is a data race. I might try and take a look at all the places you've suppressed and try to convince you that they are actual races and not false positives =)

@jrouwe
Copy link
Owner

jrouwe commented Sep 30, 2024

I might try and take a look at all the places you've suppressed and try to convince you that they are actual races and not false positives =)

Yes, please go ahead! I looked at them very carefully and I think that they are not.

  • Profiler::NextFrame - This is definitively a race, but I don't care as it is debug code and in practice it isn't a race since all other threads are idle (through the use of a semaphore) by the time this function is called.
  • Body::GetIndexInActiveBodiesInternal/IsActive - I added my reasoning in the comment. I was able to fix this detected race by changing mIndexInActiveBodies to an atomic and read/write everything in relaxed mode. This however, results in the exact same assembly code on both ARM and x86 (just a regular load / store operation). You could argue that the compiler could rearrange code in such a way that it would break, but that is not possible since the operations are in different functions and there is another atomic variable that uses acquire/release semantics to prevent code from being moved too far. I would have preferred to keep using the atomic, but this triggered some code gen issue on MSVC that I really didn't want to debug.
  • PhysicsSystem::JobFindCollisions - This is the one I'm least sure of, it involves 2 atomic variables mReadIdx and mWriteIdx that are both used with memory order sequentially consistent (I think it can be much more relaxed than that, so it could run faster on ARM). Again, the reasoning is in the comments. Everything is localized to that function, so it should also be the easiest one to read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants