Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLite tests having intermittent failures in the threading test #1011

Closed
designatednerd opened this issue Feb 12, 2020 · 7 comments
Closed
Assignees
Labels
caching help wanted Issues the Apollo team would love help from the community about threading-sadness
Milestone

Comments

@designatednerd
Copy link
Contributor

Background: While investigating #993 I determined that the intermittent failures of testThreadedCache appear to only be happening in the SQLite tests when the DispatchGroup gets leave called more times than it should. I added counters to the three closures and was seeing wayyyy more than the expected 1001 leave calls, which would cause a crash since enter and leave calls to DispatchGroup need to be balanced.

In the discussion for #1010, there was a general agreement that this test uncovered existing incorrect behavior rather than having the incorrect behavior introduced in the corresponding PR, and this test was updated to not run on SQLite tests in order to keep further forward movement unblocked.

I have no idea why a) This is happening b) It only seems to be happening on the SQLite tests, even though the client in the testThreadedCache is hard-coded to be an in-memory cache. The failures are intermittent, but they do still happen even when the test is run in isolation, so it doesn't seem like it's leakage from another test.

Any help we can get on this would be really great.

@designatednerd designatednerd added the help wanted Issues the Apollo team would love help from the community about label Feb 12, 2020
@cysp
Copy link
Contributor

cysp commented Feb 12, 2020

I enabled TSan and ran the test suite and it pointed to data races on properties of GraphQLQueryWatcher:

Enable-TSan.diff
diff --git i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
index 1e839927..a375da33 100644
--- i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
+++ w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
@@ -27,6 +27,7 @@
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
       shouldUseLaunchSchemeArgsEnv = "NO"
+      enableThreadSanitizer = "YES"
       codeCoverageEnabled = "YES"
       onlyGenerateCoverageForSpecifiedTargets = "YES">
       <MacroExpansion>
@@ -81,6 +82,7 @@
       buildConfiguration = "Debug"
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
+      enableThreadSanitizer = "YES"
       launchStyle = "0"
       useCustomWorkingDirectory = "NO"
       ignoresPersistentStateOnLaunch = "NO"
TSan's report image image

This naïve introduction of a lock around GraphQLQueryWatcher silences TSan and lets me see passing test runs, even when running code from prior to #1010. What do you think?

GraphQLQueryWatcher-locking.diff
diff --git i/Sources/Apollo/GraphQLQueryWatcher.swift w/Sources/Apollo/GraphQLQueryWatcher.swift
index 52e1b6a3..7a1cb221 100644
--- i/Sources/Apollo/GraphQLQueryWatcher.swift
+++ w/Sources/Apollo/GraphQLQueryWatcher.swift
@@ -2,6 +2,8 @@ import Dispatch
 
 /// A `GraphQLQueryWatcher` is responsible for watching the store, and calling the result handler with a new result whenever any of the data the previous result depends on changes.
 public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, ApolloStoreSubscriber {
+  private let lock: NSRecursiveLock
+
   weak var client: ApolloClientProtocol?
   let query: Query
   let resultHandler: GraphQLResultHandler<Query.Data>
@@ -21,6 +23,9 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   public init(client: ApolloClientProtocol,
               query: Query,
               resultHandler: @escaping GraphQLResultHandler<Query.Data>) {
+    lock = NSRecursiveLock()
+    lock.name = "GraphQLQueryWatcher.lock"
+
     self.client = client
     self.query = query
     self.resultHandler = resultHandler
@@ -34,9 +39,15 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   }
 
   func fetch(cachePolicy: CachePolicy) {
+    lock.lock()
+    defer { lock.unlock() }
+
     fetching = client?.fetch(query: query, cachePolicy: cachePolicy, context: &context, queue: .main) { [weak self] result in
       guard let `self` = self else { return }
 
+      self.lock.lock()
+      defer { self.lock.unlock() }
+
       switch result {
       case .success(let graphQLResult):
         self.dependentKeys = graphQLResult.dependentKeys
@@ -50,6 +61,9 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
 
   /// Cancel any in progress fetching operations and unsubscribe from the store.
   public func cancel() {
+    lock.lock()
+    defer { lock.unlock() }
+
     fetching?.cancel()
     client?.store.unsubscribe(self)
   }
@@ -57,6 +71,9 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   func store(_ store: ApolloStore,
              didChangeKeys changedKeys: Set<CacheKey>,
              context: UnsafeMutableRawPointer?) {
+    lock.lock()
+    defer { lock.unlock() }
+
     if context == &self.context { return }
 
     guard let dependentKeys = dependentKeys else { return }

@RolandasRazma
Copy link
Contributor

Looking at fetch(cachePolicy: CachePolicy), @designatednerd , don't you need to cancel previous fetching?

  func fetch(cachePolicy: CachePolicy) {
    fetching?.cancel() // <- missing?
    fetching = client?.fetch(query: query, cachePolicy: cachePolicy, context: &context, queue: .main) { ... }
  }

@designatednerd
Copy link
Contributor Author

@RolandasRazma - ooh, that could be it - and that's probably a better way to do it than locking the cache up. Will give it a try!

@cysp
Copy link
Contributor

cysp commented Feb 12, 2020

Performing that cancellation exposes a data race on AsynchronousOperation.state, which looks tricky because there are two mutations (one setting to .finished because the operation has been cancelled and one setting to .ready because of an internal call to an isReady getter triggered by the cancel function itself) and I'm not sure that just ensuring exclusive access to the state property is a correct solution. It seems like it would be bad if a solution was implemented that prevented the data race but resulted in the logic in this isReady getter "winning" and state being set to .ready.

Enable-TSan-and-Cancel-Fetch.diff
diff --git i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
index 1e839927..a375da33 100644
--- i/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
+++ w/Apollo.xcodeproj/xcshareddata/xcschemes/Apollo.xcscheme
@@ -27,6 +27,7 @@
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
       shouldUseLaunchSchemeArgsEnv = "NO"
+      enableThreadSanitizer = "YES"
       codeCoverageEnabled = "YES"
       onlyGenerateCoverageForSpecifiedTargets = "YES">
       <MacroExpansion>
@@ -81,6 +82,7 @@
       buildConfiguration = "Debug"
       selectedDebuggerIdentifier = "Xcode.DebuggerFoundation.Debugger.LLDB"
       selectedLauncherIdentifier = "Xcode.DebuggerFoundation.Launcher.LLDB"
+      enableThreadSanitizer = "YES"
       launchStyle = "0"
       useCustomWorkingDirectory = "NO"
       ignoresPersistentStateOnLaunch = "NO"
diff --git i/Sources/Apollo/GraphQLQueryWatcher.swift w/Sources/Apollo/GraphQLQueryWatcher.swift
index 52e1b6a3..4612c748 100644
--- i/Sources/Apollo/GraphQLQueryWatcher.swift
+++ w/Sources/Apollo/GraphQLQueryWatcher.swift
@@ -34,6 +34,7 @@ public final class GraphQLQueryWatcher<Query: GraphQLQuery>: Cancellable, Apollo
   }
 
   func fetch(cachePolicy: CachePolicy) {
+    fetching?.cancel()
     fetching = client?.fetch(query: query, cachePolicy: cachePolicy, context: &context, queue: .main) { [weak self] result in
       guard let `self` = self else { return }
 
TSan's report image image

@designatednerd
Copy link
Contributor Author

Eesh, setting that in a getter seems like a bad idea in general, that's one hell of a side effect.

It does seem like that cancellation helps the issue at hand (although I still have no idea why it was only happening on SQLite test runs). I'll see what I can figure out about why that getter is doing what it is. I might wind up doing this as separate PRs though.

@designatednerd
Copy link
Contributor Author

Interestingly even if I comment out that whole isReady override, there's still a data race happening.

There's obviously something else going on there - my hunch is something in our Promise implementation, since that's doing some threading stuff it maaaybe shouldn't be.

@designatednerd
Copy link
Contributor Author

Thanks to @martijnwalraven's hard work, all our tests are now passing with TSAN as of version 0.38.0! If you run into TSAN issues in that version or later, please open a new issue. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
caching help wanted Issues the Apollo team would love help from the community about threading-sadness
Projects
None yet
Development

No branches or pull requests

4 participants