SQLite tests having intermittent failures in the threading test #1011

designatednerd · 2020-02-12T01:14:08Z

Background: While investigating #993 I determined that the intermittent failures of testThreadedCache appear to only be happening in the SQLite tests when the DispatchGroup gets leave called more times than it should. I added counters to the three closures and was seeing wayyyy more than the expected 1001 leave calls, which would cause a crash since enter and leave calls to DispatchGroup need to be balanced.

In the discussion for #1010, there was a general agreement that this test uncovered existing incorrect behavior rather than having the incorrect behavior introduced in the corresponding PR, and this test was updated to not run on SQLite tests in order to keep further forward movement unblocked.

I have no idea why a) This is happening b) It only seems to be happening on the SQLite tests, even though the client in the testThreadedCache is hard-coded to be an in-memory cache. The failures are intermittent, but they do still happen even when the test is run in isolation, so it doesn't seem like it's leakage from another test.

Any help we can get on this would be really great.

The text was updated successfully, but these errors were encountered:

cysp · 2020-02-12T07:32:04Z

I enabled TSan and ran the test suite and it pointed to data races on properties of GraphQLQueryWatcher:

Enable-TSan.diff

diff --git i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
index 1e839927..a375da33 100644
--- i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
+++ w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
@@ -27,6 +27,7 @@
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
       shouldUseLaunchSchemeArgsEnv = "NO"
+      enableThreadSanitizer = "YES"
       codeCoverageEnabled = "YES"
       onlyGenerateCoverageForSpecifiedTargets = "YES">
       <MacroExpansion>
@@ -81,6 +82,7 @@
       buildConfiguration = "Debug"
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
+      enableThreadSanitizer = "YES"
       launchStyle = "0"
       useCustomWorkingDirectory = "NO"
       ignoresPersistentStateOnLaunch = "NO"

TSan's report

This naïve introduction of a lock around GraphQLQueryWatcher silences TSan and lets me see passing test runs, even when running code from prior to #1010. What do you think?

GraphQLQueryWatcher-locking.diff

diff --git i/Sources/Apollo/GraphQLQueryWatcher.swift w/Sources/Apollo/GraphQLQueryWatcher.swift
index 52e1b6a3..7a1cb221 100644
--- i/Sources/Apollo/GraphQLQueryWatcher.swift
+++ w/Sources/Apollo/GraphQLQueryWatcher.swift
@@ -2,6 +2,8 @@ import Dispatch
 
 /// A `GraphQLQueryWatcher` is responsible for watching the store, and calling the result handler with a new result whenever any of the data the previous result depends on changes.
 public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, ApolloStoreSubscriber {
+  private let lock: NSRecursiveLock
+
   weak var client: ApolloClientProtocol?
   let query: Query
   let resultHandler: GraphQLResultHandler<Query.Data>
@@ -21,6 +23,9 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   public init(client: ApolloClientProtocol,
               query: Query,
               resultHandler: @escaping GraphQLResultHandler<Query.Data>) {
+    lock = NSRecursiveLock()
+    lock.name = "GraphQLQueryWatcher.lock"
+
     self.client = client
     self.query = query
     self.resultHandler = resultHandler
@@ -34,9 +39,15 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   }
 
   func fetch(cachePolicy: CachePolicy) {
+    lock.lock()
+    defer { lock.unlock() }
+
     fetching = client?.fetch(query: query, cachePolicy: cachePolicy, context: &context, queue: .main) { [weak self] result in
       guard let `self` = self else { return }
 
+      self.lock.lock()
+      defer { self.lock.unlock() }
+
       switch result {
       case .success(let graphQLResult):
         self.dependentKeys = graphQLResult.dependentKeys
@@ -50,6 +61,9 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
 
   /// Cancel any in progress fetching operations and unsubscribe from the store.
   public func cancel() {
+    lock.lock()
+    defer { lock.unlock() }
+
     fetching?.cancel()
     client?.store.unsubscribe(self)
   }
@@ -57,6 +71,9 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   func store(_ store: ApolloStore,
              didChangeKeys changedKeys: Set<CacheKey>,
              context: UnsafeMutableRawPointer?) {
+    lock.lock()
+    defer { lock.unlock() }
+
     if context == &self.context { return }
 
     guard let dependentKeys = dependentKeys else { return }

RolandasRazma · 2020-02-12T09:22:01Z

Looking at fetch(cachePolicy: CachePolicy), @designatednerd , don't you need to cancel previous fetching?

  func fetch(cachePolicy: CachePolicy) {
    fetching?.cancel() // <- missing?
    fetching = client?.fetch(query: query, cachePolicy: cachePolicy, context: &context, queue: .main) { ... }
  }

designatednerd · 2020-02-12T18:32:58Z

@RolandasRazma - ooh, that could be it - and that's probably a better way to do it than locking the cache up. Will give it a try!

cysp · 2020-02-12T20:55:36Z

Performing that cancellation exposes a data race on AsynchronousOperation.state, which looks tricky because there are two mutations (one setting to .finished because the operation has been cancelled and one setting to .ready because of an internal call to an isReady getter triggered by the cancel function itself) and I'm not sure that just ensuring exclusive access to the state property is a correct solution. It seems like it would be bad if a solution was implemented that prevented the data race but resulted in the logic in this isReady getter "winning" and state being set to .ready.

Enable-TSan-and-Cancel-Fetch.diff

diff --git i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
index 1e839927..a375da33 100644
--- i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
+++ w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
@@ -27,6 +27,7 @@
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
       shouldUseLaunchSchemeArgsEnv = "NO"
+      enableThreadSanitizer = "YES"
       codeCoverageEnabled = "YES"
       onlyGenerateCoverageForSpecifiedTargets = "YES">
       <MacroExpansion>
@@ -81,6 +82,7 @@
       buildConfiguration = "Debug"
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
+      enableThreadSanitizer = "YES"
       launchStyle = "0"
       useCustomWorkingDirectory = "NO"
       ignoresPersistentStateOnLaunch = "NO"
diff --git i/Sources/Apollo/GraphQLQueryWatcher.swift w/Sources/Apollo/GraphQLQueryWatcher.swift
index 52e1b6a3..4612c748 100644
--- i/Sources/Apollo/GraphQLQueryWatcher.swift
+++ w/Sources/Apollo/GraphQLQueryWatcher.swift
@@ -34,6 +34,7 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   }
 
   func fetch(cachePolicy: CachePolicy) {
+    fetching?.cancel()
     fetching = client?.fetch(query: query, cachePolicy: cachePolicy, context: &context, queue: .main) { [weak self] result in
       guard let `self` = self else { return }

TSan's report

designatednerd · 2020-02-12T21:33:57Z

Eesh, setting that in a getter seems like a bad idea in general, that's one hell of a side effect.

It does seem like that cancellation helps the issue at hand (although I still have no idea why it was only happening on SQLite test runs). I'll see what I can figure out about why that getter is doing what it is. I might wind up doing this as separate PRs though.

designatednerd · 2020-02-12T21:48:47Z

Interestingly even if I comment out that whole isReady override, there's still a data race happening.

There's obviously something else going on there - my hunch is something in our Promise implementation, since that's doing some threading stuff it maaaybe shouldn't be.

designatednerd · 2020-12-01T22:44:41Z

Thanks to @martijnwalraven's hard work, all our tests are now passing with TSAN as of version 0.38.0! If you run into TSAN issues in that version or later, please open a new issue. Thank you!

designatednerd added the help wanted Issues the Apollo team would love help from the community about label Feb 12, 2020

designatednerd mentioned this issue Feb 12, 2020

Don't run in-memory cache tests during SQLite cache testing #1010

Merged

designatednerd added the caching label Feb 12, 2020

This was referenced Feb 12, 2020

Cancel existing operation in a watcher before starting a new one #1012

Merged

Flaky cache tests #993

Closed

designatednerd added the threading-sadness label Apr 12, 2020

hwillson self-assigned this Jul 15, 2020

designatednerd added this to the Next Release milestone Nov 30, 2020

designatednerd closed this as completed Dec 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQLite tests having intermittent failures in the threading test #1011

SQLite tests having intermittent failures in the threading test #1011

designatednerd commented Feb 12, 2020

cysp commented Feb 12, 2020 •

edited

Loading

RolandasRazma commented Feb 12, 2020

designatednerd commented Feb 12, 2020

cysp commented Feb 12, 2020

designatednerd commented Feb 12, 2020

designatednerd commented Feb 12, 2020

designatednerd commented Dec 1, 2020

SQLite tests having intermittent failures in the threading test #1011

SQLite tests having intermittent failures in the threading test #1011

Comments

designatednerd commented Feb 12, 2020

cysp commented Feb 12, 2020 • edited Loading

RolandasRazma commented Feb 12, 2020

designatednerd commented Feb 12, 2020

cysp commented Feb 12, 2020

designatednerd commented Feb 12, 2020

designatednerd commented Feb 12, 2020

designatednerd commented Dec 1, 2020

cysp commented Feb 12, 2020 •

edited

Loading