FATAL: ThreadSanitizer CHECK failed #950
At the very least you may disable the deadlock detector (detect_deadlocks=0). If you have a reasonably small reproducer we may try to get this fixed.
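For reference, the deadlock detector is a run-time feature that can be toggled through the `TSAN_OPTIONS` environment variable. A minimal sketch, assuming a TSAN-instrumented binary (`./my_test` is a placeholder name, not from this thread):

```shell
# Disable TSAN's deadlock detector for a single run of a
# TSAN-instrumented binary ("./my_test" is a placeholder).
TSAN_OPTIONS="detect_deadlocks=0" ./my_test

# Multiple runtime flags are colon-separated, e.g. also enlarging
# the per-thread history buffer:
TSAN_OPTIONS="detect_deadlocks=0:history_size=7" ./my_test
```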
Thanks Kostya. I've already disabled the deadlock detector, but wanted to enable it if possible. It looks like the problem may be too many mutexes? That's quite possible, as the code that triggers this is an internal stress-test program that creates many objects. If so, I could try building locally with a different size for all_locks_with_contexts_ -- do you know where that is set? Also, on a related note, it looks like TSAN reports what appear to be false positives with recursive mutexes -- is that true? Thanks again!
The code in question is in lib/sanitizer_common/sanitizer_deadlock_detector.h. I have to admit that I don't remember the fine details of this code any more (I haven't touched it since 2014).
@WallStProg you showed some "destroy of a locked mutex" reports. I suspect those are what cause the apparently unbounded number of mutexes locked by a thread. POSIX is very clear on this:
http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_destroy.html
It appears that the CHECK failure is in fact caused by creating > 64 mutexes in a single thread -- unusual, but this code is a stress test that does that deliberately. I hear you about the "destroy locked" reports, and I'm in the process of fixing that. (I inherited this code, which has been running for quite some time with no apparent issues, but I agree that UB is not OK.) For now, I've disabled the problematic tests when running with "detect_deadlocks=1". Thanks!
TSAN limits the number of simultaneous lock acquisitions in a single thread to 64 when using the deadlock detector[1]. However, compaction can select up to 128 (128MB budget / 1MB min rowset size) rowsets in a single op. kudu-tool-test's TestNonRandomWorkloadLoadgen almost always hits TSAN's limit when the KUDU-1400 changes following this patch are applied. This patch prevents this by limiting the number of rowsets selected for a compaction to 32 when running under TSAN. I ran the test with the KUDU-1400 changes on top and saw 97/100 failures. With the change, I saw 100 successes. [1]: google/sanitizers#950 Change-Id: I01ad4ba3a13995c194c3308d72c1eb9b611ef766 Reviewed-on: http://gerrit.cloudera.org:8080/11885 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo <adar@cloudera.com> Reviewed-by: Andrew Wong <awong@cloudera.com>
Summary: Pull Request resolved: #36745 As we hold a mutex for our custom C++ Node, calling reentrant backward from a custom C++ function can leave us concurrently holding many mutexes, up to MAX_DEPTH. TSAN allows at most 64 mutexes to be held at once by a single thread and complains beyond that. This PR lowers our limit to stay within TSAN's bound. TSAN Reference: google/sanitizers#950 Test Plan: Imported from OSS Differential Revision: D21072604 Pulled By: wanchaol fbshipit-source-id: 99cd1acab41a203d834fa4947f4e6f0ffd2e70f2
…utexes Under TSan you can lock no more than 64 mutexes from one thread at once [1] [2], while RESTART REPLICAS can acquire more (it depends on the number of replicated tables). [1]: google/sanitizers#950 (comment) [2]: https://github.com/llvm/llvm-project/blob/b02eab9058e58782fca32dd8b1e53c27ed93f866/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h#L67 And since stress tests run tests in parallel, you can have more than 64 ReplicatedMergeTree tables at once (even though it is unlikely). Fix this by using RESTART REPLICA per table instead of RESTART REPLICAS.
This commit adds a new GitHub Actions workflow that checks Jolt with ThreadSanitizer under Ubuntu & Clang. * Replaces usage of atomic_thread_fences in Reference.h and JobSystem.h with regular atomic ops when building under TSAN, as it does not support fences and likely never will. * Limits the maximum number of mutexes used under TSAN to work around a TSAN limitation, see: google/sanitizers#950. * Replaces Semaphore::mCount with an atomic int that is used in relaxed mode. Co-authored-by: Jorrit Rouwe <jrouwe@gmail.com>
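The fence replacement in the first bullet can be sketched as follows. This is a hypothetical publish/consume pair, not Jolt's actual Reference.h code; `__SANITIZE_THREAD__` is the macro GCC (and recent Clang) defines under `-fsanitize=thread`. Since TSAN does not model `std::atomic_thread_fence`, the fence-plus-relaxed-op pair is swapped for a single release store / acquire load with equivalent ordering:

```cpp
#include <atomic>

std::atomic<int> ready{0};
int payload = 0;

void publish(int value) {
    payload = value;
#if defined(__SANITIZE_THREAD__)
    // TSAN-visible release: orders the payload write before the flag.
    ready.store(1, std::memory_order_release);
#else
    // Fence form: same ordering, but invisible to TSAN.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
#endif
}

int consume() {
#if defined(__SANITIZE_THREAD__)
    if (ready.load(std::memory_order_acquire) == 1) return payload;
#else
    if (ready.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return payload;
    }
#endif
    return -1;  // not yet published
}
```

Both branches give the same happens-before guarantee; the TSAN branch simply expresses it in a form the tool can track.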
I'm getting the following error consistently with several of my test programs when built with TSAN.
Any ideas on how to work around this would be much appreciated -- thanks in advance!