Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup & Fix data races in ApolloWebSocket #307

Closed
wants to merge 8 commits into from

Conversation

js
Copy link
Contributor

@js js commented Jul 3, 2018

This PR commits the ultimate sin of mixing several changes in the same PR, sorry not sorry.

The ApolloWebSocket had a vastly different coding style than the rest of the project, and diverged a bit from standard swift style as well, so since I'm a total snob I had to fix that before I could fix the actual problem:

Due to the fact that Starscream delivers its callbacks on the main queue by default (it can be configured to another DispatchQueue but we'd still have the same problem) and the NetworkTransport's send(....) is operated from whatever queue the calling AsynchronousOperation gets assigned to, this can lead to race conditions on some of the state variables, as detected by the Thread Sanitizer. So this PR adds a lock around some of the state variables (mainly the acking of messages) as well as processing of starscream callbacks on a serial queue.
I'm not too happy about the latter there as I feel there should be another design around coordinating thread access with the NetworkTransport implementations, what do you think?

@martijnwalraven
Copy link
Contributor

Thanks! I'm not too familiar with the code, so I think @knutaa is the best person to give detailed feedback on the changes.

As for the concurrency issues, would it be possible to have all processing happen on a single serial queue? If we configure Starscream to use that as its callbackQueue, and async dispatch to that queue from the WebSocketNetworkTransport's send(), wouldn't that mean we could get rid of explicit locks?

@js
Copy link
Contributor Author

js commented Jul 3, 2018

Maybe, there's a few other critical paths outside the callbacks though, such as (un)subscribing and write() (which mutates the messageQueue)

I gave it a quick shot locally, but I get test failures by just setting the callbackQueue on the starscream websocket object (and no other changes), haven't dug into why yet…

@martijnwalraven
Copy link
Contributor

Regardless if we can get rid of the lock, I think avoiding hitting the main queue when WebSocket messages are received would make me feel a lot better. Thanks for looking into this, I didn't realize that was what Starscream was doing!

@knutaa knutaa mentioned this pull request Jul 4, 2018
2 tasks
@psi-gh
Copy link
Contributor

psi-gh commented Jul 4, 2018

I'm really waiting for this because it's a fix #308

@knutaa
Copy link
Contributor

knutaa commented Jul 4, 2018

The websocket in WebSocketTransport declared as

internal let websocket: WebSocketClient

is the protocol and not the full Starscream WebSocket. This is done in support of the mocked testing.

It should be possible to set the callback processingQueue with something like

// websocket represent the WebSocketClient protocol, not the Starscream WebSocket
// this is done in support of the test mocking of the protocol
if let websocket = self.websocket as? WebSocket {
  websocket.callbackQueue = processingQueue
}
self.websocket.connect()

Would it then be possible to wrap the content of write() and replace the locks in sendHelper and unsubscribe? Other locks should be redundant.

The current use of processingQueue for the delegate methods should (of course?) be redundant.

@js
Copy link
Contributor Author

js commented Jul 5, 2018

Yeah I didn't discover the callbackQueue Starscream API until after I made this PR. Also, if callbackQueue is made part of the WebSocketClient then we don't have the awkwardness of asking "hey so are you really a WebSocket?".

I quickly tried setting the callbackQueue and wrapping write(..) in the same queue, but there's a test failure in testSubscribeMultipleReview() iirc where the last change isn't always pushed to the subscription. Didn't have time to dig more into it though.

Honestly I feel a bit bad about dumping this PR in your lap. It contains some opinionated stylistic and future proofing changes mixed in with some actual fixes.

@js
Copy link
Contributor Author

js commented Jul 5, 2018

in case it isn't obvious you run the tests with TSAN by going to Edit Scheme (opt click the run icon) for the ApolloWebSockets scheme, then select the Test configuration in the sidebar thing, then the Diagnostics tab > ThreadSanitizer in case any one else want to have a look the data race it finds.

@js js mentioned this pull request Jul 5, 2018
@js
Copy link
Contributor Author

js commented Jul 5, 2018

Based on #310 here is a compare with just the concurrency fixes

It still uses a NSLock, because, well, dispatch queues are a pretty heavy hammer for a small locking problem, and it tends to complicate things by turning a synchronous problem into an asynchronous one.

TSAN still triggers on the state mutation FetchQueryOperation.fetchFromNetwork() in the main Apollo project though.

@designatednerd
Copy link
Contributor

Tying this to #600 for when I look into that issue - thanks for doing this @js!

@designatednerd
Copy link
Contributor

@js ping me when you get back from vacation, let's try to get this good to merge 😃

@designatednerd
Copy link
Contributor

@js You still on vacation or do you think you might have a chance to take a look at this?

@js
Copy link
Contributor Author

js commented Aug 30, 2019

@designatednerd I'll take a look over the weekend

@aivcec
Copy link
Contributor

aivcec commented Sep 24, 2019

Any updates on this? I am experiencing race condition issues on disconnecting from multiple subscriptions and am interested in seeing this merged.

@js
Copy link
Contributor Author

js commented Sep 26, 2019

Sorry, I simply haven't found the time to look at this. Took a brief look last night but I've completely lost all context I had a year ago :) and my project still doesn't use graphql subscriptions so haven't found the excuse to weave it in there

When running the tests (the main apollo test bundle), TSAN seems triggered by something with the ReadWriteLock inside the barrier blocks but it doesn't give much of the usual backtrace/hints as to what it thinks the problem is… 🤷‍♂

I'm closing this, but if anyone else wants the interesting bits seems to be in ec42e1d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants