FATAL: ThreadSanitizer CHECK failed #950

WallStProg · 2018-05-03T15:41:59Z

I'm getting the following error consistently w/several of my test pgms when built w/TSAN.

FATAL: ThreadSanitizer CHECK failed: /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:69 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
    #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_report.cc:48 (ITimerTest+0x492ad3)
    #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:79 (ITimerTest+0x4ae005)
    #2 __sanitizer::DeadlockDetectorTLS<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >::addLock(unsigned long, unsigned long, unsigned int) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:69 (ITimerTest+0x4a246d)
    #3 __sanitizer::DeadlockDetector<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >::onLockAfter(__sanitizer::DeadlockDetectorTLS<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >*, unsigned long, unsigned int) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:220 (ITimerTest+0x4a246d)
    #4 __sanitizer::DD::MutexAfterLock(__sanitizer::DDCallback*, __sanitizer::DDMutex*, bool, bool) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector1.cc:170 (ITimerTest+0x4a246d)
    #5 __tsan::MutexPostLock(__tsan::ThreadState*, unsigned long, unsigned long, unsigned int, int) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cc:200 (ITimerTest+0x4911cf)
    #6 pthread_mutex_lock /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4033 (ITimerTest+0x440cf0)
    #7 lockTimerHeap /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/timers.c:411:5 (libmamazmqimpl.so+0x2e047)
    #8 zmqBridgeMamaTimer_destroy /home/btorpey/work/OpenMAMA-zmq/2.0/src/timer.c:168:4 (libmamazmqimpl.so+0x13d21)
    #9 mamaTimer_destroy /shared/work/OpenMAMA/6.2.1/mama/c_cpp/src/c/timer.c:229:37 (libmama.so+0x92968)
    #10 mamaEnvTimer_destroy /home/btorpey/work/transact/4.0.0/src/common/Middleware/MamaAdapter/mme/mamaEnvTimer.c:91:19 (libmme.so+0x5a94)
    #11 mamaEnvTimer_onTimerDestroy /home/btorpey/work/transact/4.0.0/src/common/Middleware/MamaAdapter/mme/mamaEnvTimer.c:131 (libmme.so+0x5a94)
    #12 wombatQueue_dispatchInt /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/queue.c:326:9 (libmama.so+0xe0930)
    #13 wombatQueue_timedDispatch /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/queue.c:342:12 (libmama.so+0xe09bd)
    #14 zmqBridgeMamaQueue_dispatch /home/btorpey/work/OpenMAMA-zmq/2.0/src/queue.c:253:16 (libmamazmqimpl.so+0x103a3)
    #15 mamaQueue_dispatch /shared/work/OpenMAMA/6.2.1/mama/c_cpp/src/c/queue.c:825:12 (libmama.so+0x8cb86)
    #16 dispatchThreadProc /shared/work/OpenMAMA/6.2.1/mama/c_cpp/src/c/queue.c:1303:30 (libmama.so+0x8e63a)
    #17 __tsan_thread_start_func /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:955 (ITimerTest+0x4247fd)
    #18 start_thread <null> (libpthread.so.0+0x33c0607aa0)
    #19 clone <null> (libc.so.6+0x33bfee8bcc)

Any ideas on how to work around this would be much appreciated -- thanks in advance!

The text was updated successfully, but these errors were encountered:

kcc · 2018-05-03T18:13:54Z

At the very least you may disable the deadlock detector (TSAN_OPTIONS=detect_deadlocks=0)

If you have a reasonably small reproducer we may try to get this fixed.

WallStProg · 2018-05-03T18:53:39Z

Thanks Kostya.

I've already disabled the deadlock detector, but wanted to enable it if possible.

It looks like the problem may be too many mutexes? That's quite possible, as the code that triggers this is an internal stress test pgm that creates many objects. If so, I could try building locally with a different size for all_locks_with_contexts_ -- do you know where that is set?

Also, on a related note, it also looks like TSAN reports what appear to be false positives w/recursive mutexes -- is that true?

Thanks again!

kcc · 2018-05-04T19:39:45Z

The code in question is in lib/sanitizer_common/sanitizer_deadlock_detector.h
It limits the number of simultaneously held locks in a given thread to an arbitrary large number 64.
If you hold 65 locks in one thread at once, this will fail.

I have to admit that I don't remember fine details of this code any more (haven't touched since 2014).
recursive mutexes should work, but OTOH we don't have too many of them, which means their support is not well tested.
If you can provide a minimal repro, please open a separate bug.

dvyukov · 2018-05-07T09:14:20Z

@WallStProg you showed some "destroy of a locked mutex" reports. I suspect they cause the unbounded number of mutexes locked by a thread.

POSIX is very clear on this:

Attempting to destroy a locked mutex results in undefined behavior.

http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_destroy.html

WallStProg · 2018-05-09T20:47:14Z

It appears that the CHECK failure is in fact caused by creating > 64 mutexes in a single thread -- unusual, but this code is a stress test that does that deliberately.

I hear you about the "destroy locked" reports, and I'm in the process of fixing that. (I inherited this code, which has been running for quite some time w/no apparent issues, but I agree that UB is not OK).

For now, I've disabled the problematic tests when running w/"detect_deadlocks=1".

Thanks!

TSAN limits the number of simultaneous lock acquisitions in a single thread to 64 when using the deadlock detector[1]. However, compaction can select up to 128 (128MB budget / 1MB min rowset size) rowsets in a single op. kudu-tool-test's TestNonRandomWorkloadLoadgen almost always hits TSAN's limit when the KUDU-1400 changes following this patch are applied. This patch prevents this by limiting the number of rowsets selected for a compaction to 32 when running under TSAN. I ran the test with the KUDU-1400 changes on top and saw 97/100 failures. With the change, I saw 100 successes. [1]: google/sanitizers#950 Change-Id: I01ad4ba3a13995c194c3308d72c1eb9b611ef766 Reviewed-on: http://gerrit.cloudera.org:8080/11885 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo <adar@cloudera.com> Reviewed-by: Andrew Wong <awong@cloudera.com>

As we hold a mutex for our custom C++ Node, when calling reentrant backward from custom C++ function, we will cocurrently holding many mutexes up to MAX_DEPTH. TSAN only allow 65 mutexes at once, otherwise it will complain. This PR lower the limit according to TSAN. TSAN Reference: google/sanitizers#950 [ghstack-poisoned]

As we hold a mutex for our custom C++ Node, when calling reentrant backward from custom C++ function, we will cocurrently holding many mutexes up to MAX_DEPTH. TSAN only allow 65 mutexes at once, otherwise it will complain. This PR lower the limit according to TSAN. TSAN Reference: google/sanitizers#950 ghstack-source-id: de61c260ea671025b486c0118af782efdf07aab3 Pull Request resolved: #36745

Summary: Pull Request resolved: #36745 As we hold a mutex for our custom C++ Node, when calling reentrant backward from custom C++ function, we will cocurrently holding many mutexes up to MAX_DEPTH. TSAN only allow 65 mutexes at once, otherwise it will complain. This PR lower the limit according to TSAN. TSAN Reference: google/sanitizers#950 Test Plan: Imported from OSS Differential Revision: D21072604 Pulled By: wanchaol fbshipit-source-id: 99cd1acab41a203d834fa4947f4e6f0ffd2e70f2

…utexes Under TSan you can lock only not more then 64 mutexes from one thread at once [1] [2], while RESTART REPLICAS can acquire more (it depends on the number of replicated tables). [1]: google/sanitizers#950 (comment) [2]: https://github.com/llvm/llvm-project/blob/b02eab9058e58782fca32dd8b1e53c27ed93f866/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h#L67 And since stress tests run tests in parallel, you can have more then 64 ReplicatedMergeTree tables at once (even though it is unlikely). Fix this by using RESTART REPLICA table over RESTART REPLICAS.

This commit adds a new GitHub Actions workflow that checks Jolt with ThreadSanitizer under Ubuntu & Clang. * Replaces usage of atomic_thread_fences in Reference.h and JobSystem.h with regular atomic ops when building under TSAN, as it does not support fences and unlikely ever will do so. * Limits the max number of mutexes to use under TSAN to work around a TSAN limitation, see: google/sanitizers#950. * Replaces Semaphore::mCount with an atomic int that's used in relaxed mode. Co-authored-by: Jorrit Rouwe <jrouwe@gmail.com>

WallStProg closed this as completed May 9, 2018

alexey-milovidov mentioned this issue Aug 20, 2019

Added a test for race conditions. ClickHouse/ClickHouse#6514

Merged

alexey-milovidov mentioned this issue Aug 29, 2019

Fix for data race in StorageMerge ClickHouse/ClickHouse#6717

Merged

wanchaol mentioned this issue Apr 16, 2020

[autograd] lower MAX_DEPTH limit according to TSAN limit pytorch/pytorch#36745

Closed

Oipo mentioned this issue May 7, 2020

test_framework thread sanitizer apache/celix#219

Closed

azat mentioned this issue Jan 9, 2021

Add fsync support for Distributed engine. ClickHouse/ClickHouse#18864

Merged

azat mentioned this issue Jan 10, 2021

Drop RESTART REPLICAS from stateless tests to avoid locking lots of mutexes ClickHouse/ClickHouse#18897

Merged

louiswilliams mentioned this issue Jan 25, 2023

TSAN deadlock detector checks fail even when suppressed #1611

Open

rcorre mentioned this issue Apr 30, 2023

CHECK failed: tsan_interceptors_posix.cpp:1987 "((thr->slot)) != (0)" / segfault running Godot 4 with TSAN #1647

Closed

thatsafunnyname mentioned this issue Sep 20, 2023

Use *next_sequence -1 here facebook/rocksdb#11861

Closed

JaySon-Huang mentioned this issue Jul 26, 2024

Failed test StoreIngestTest.ConcurrentIngestAndWrite under tsan pingcap/tiflash#9257

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FATAL: ThreadSanitizer CHECK failed #950

FATAL: ThreadSanitizer CHECK failed #950

WallStProg commented May 3, 2018

kcc commented May 3, 2018

WallStProg commented May 3, 2018

kcc commented May 4, 2018

dvyukov commented May 7, 2018

WallStProg commented May 9, 2018

FATAL: ThreadSanitizer CHECK failed #950

FATAL: ThreadSanitizer CHECK failed #950

Comments

WallStProg commented May 3, 2018

kcc commented May 3, 2018

WallStProg commented May 3, 2018

kcc commented May 4, 2018

dvyukov commented May 7, 2018

WallStProg commented May 9, 2018