Fix MGLMapSnapshotter concurrency bugs (issue #11827). #11831

ChrisLoer · 2018-05-04T00:51:06Z

I took a stab at making the changes @julianrex and I have been discussing for fixing issue #11827.

- Biggest change: when we apply the watermark on a background thread, don't capture self (turn most of the related instance methods into class methods).
- Don't call mbglMapSnapshotter->snapshot from a user-provided queue, since it's an asynchronous call anyway and starting it on the user's queue requires capturing self.

The design goal here is that we should not be passing shared pointers/references across thread boundaries. Because startWithQueue allows users to provide their own queue for handling the completion, we have to assume that anything on that queue can run on another thread (even though the more common startWithCompletion will run on the main/UI thread). When we capture a block for execution on a (potentially) different thread, we need to copy everything it needs by value. Any further changes to MGLMapSnapshotter should only affect the result of future start calls.
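To make the copy-by-value rule concrete, here is a minimal C++ sketch rather than the code in this PR. `SnapshotOptions`, `Snapshotter`, `setScale`, and `makeJob` are hypothetical names invented for illustration, and the job is invoked on the same thread to keep the sketch dependency-free. Handing it to a worker queue would not change what it sees.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the options a snapshot run depends on.
struct SnapshotOptions {
    std::string styleURL;
    int scale;
};

// Hypothetical owner object, playing the role of MGLMapSnapshotter.
class Snapshotter {
public:
    explicit Snapshotter(SnapshotOptions options) : options_(std::move(options)) {}

    void setScale(int scale) { options_.scale = scale; }

    // Builds the work item that would be handed to another queue.
    // The lambda captures a copy of the options, never `this`, so it
    // stays valid and unchanged whatever happens to the Snapshotter.
    std::function<std::string()> makeJob() const {
        SnapshotOptions snapshotOfOptions = options_;
        return [snapshotOfOptions]() {
            return snapshotOfOptions.styleURL + "@" +
                   std::to_string(snapshotOfOptions.scale) + "x";
        };
    }

private:
    SnapshotOptions options_;
};
```

A job created before a call to `setScale(2)` still reports scale 1 when it eventually runs. Only a job created afterwards sees 2, which mirrors "further changes should only affect the result of future start calls".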

It's now possible to remove the "can't restart after cancel" constraint we had (just need to use the last-set "options" to re-initialize the mbglMapSnapshotter), but I haven't done that yet. I'm still getting up to speed on all the context for how this interface is/should be used (see issue #11825).

I'm a bit out of my element in Objective-C so careful review from someone on the iOS team is definitely a must. I'm also not sure of the best way to test beyond what CI gives us -- I've done just the basic loading of snapshots in the macos and ios test apps.

/cc @julianrex @friedbunny @1ec5

ChrisLoer · 2018-05-04T01:01:32Z

@julianrex One thing you mentioned in #11801 is that we may need to make sure the completion callback is called even if the background snapshot operation is abandoned (either by an explicit "cancel" or by destruction of the MGLMapSnapshotter). If that's a requirement, we'll have to figure out how to add that logic explicitly (basically whenever mbglMapSnapshotter or the snapshotCallback can get destroyed/reset).

ChrisLoer · 2018-05-04T22:25:05Z

~~Not sure if we'll try to touch the docs with this PR, but I was just looking at the camera/coordinateBounds interplay, and the docs say:~~

"If this property is non-nil and the coordinateBounds property is set to a non-empty coordinate bounds, the camera’s center coordinate and altitude are ignored in favor of the coordinateBounds property."

~~As far as I can tell, bearing and pitch are ignored as well. But I'm not sure I understand how cameraForLatLngs is supposed to work.~~

EDIT: I was wrong, ignore the above.

ChrisLoer · 2018-05-17T18:10:36Z

Current state here is that we're able to frequently produce a deadlock somewhere in GL drawing code when we're running six snapshotters at once in the ios test app on the iOS simulator. The deadlock can happen either in native CoreGraphics rendering (applying attribution to a rendered map), or in the background gl-native calls into the EAGLBackend. The trigger seems to be running the attribution-stamping code in a GCD worker queue -- although we haven't identified a specific unsafe operation and it's possible we're just shifting timing around when we move it to the foreground. We haven't seen the deadlock running on a physical iPhone or on macOS.

I would really like to get to the bottom of the cause of the deadlock so we can be confident we're not causing a (potentially rare) deadlock bug in the wild. However, our immediate fix may be to just stop doing attribution on a GCD worker queue. It should be a cheap operation, and fine to do on the foreground. I talked to @ivovandongen today about why we originally did it on the background, and his memory was that we were mainly just copying an Android design pattern of doing as little non-UI work on the foreground as possible. It's worth keeping in mind, though, that our API allows the completion callback to run on arbitrary threads, so if there's a gotcha on sharing NSImage/UIImage across threads we need to figure it out even if we abandon the worker queue.

cc @frederoni (Ivo tells me you helped him with the original implementation)

ChrisLoer · 2018-05-21T17:32:04Z

🤔 Seems like attribution is much more expensive than I would have guessed, here's timing from six runs (in microseconds):

Attribution time: 834962
Attribution time: 515704
Attribution time: 447460
Attribution time: 448551
Attribution time: 453873
Attribution time: 461213

The first run loads logo images from disk, and I thought that might be the expensive part so I cached the logos in memory for subsequent runs, but it still seems to take around half a second to run. That is... way longer than I expected... and probably a good reason to keep doing attribution in the background?
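For anyone wanting to repeat this kind of measurement, a minimal C++ sketch of a microsecond timer follows. It is not the instrumentation used for the numbers above, and `elapsedMicroseconds` is a hypothetical helper name. It uses a monotonic clock so wall-clock adjustments cannot skew the result.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `work` once and returns how long it took, in microseconds.
template <typename Work>
std::int64_t elapsedMicroseconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Wrapping only the attribution call, separately from image loading, is one way to check whether the cost sits in the drawing itself or in the setup around it.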

julianrex · 2018-05-21T17:57:26Z

> probably a good reason to keep doing attribution in the background

Sounds like it.

If we temporarily remove the attribution code - is "everything OK"?

ChrisLoer · 2018-05-29T18:31:11Z

@julianrex pending your 🍏 , I think we should merge these changes and include them in the next patch release. I'm unhappy that we've been unable to satisfactorily explain the deadlock we see in the simulator, but I don't think it's a good idea to include the various "fixes" we've tried (e.g. wrapping EAGLBackend usage with a global lock), since we don't have a theoretical justification for them and we're probably just altering the timing of the reproduction case.

Current state:

- None of us have seen any deadlock on physical devices. At this point @julianrex and I have run these tests hundreds of times.
- On this branch, the "six snapshotter" case reproduces a deadlock frequently (but not 100% of the time) on the iOS simulator.
- Many of our experimental fixes appear to make the problem go away, but as a caveat, one of my "global lock" experiments seemed to fix the problem but after maybe 30 runs of the repro case it showed up again.
- I have reproduced the deadlock on the current version of release-boba (although it seemed to take more runs until I saw it), so if this is an issue in production, it's at least not something being introduced by this PR.

julianrex · 2018-05-29T18:41:54Z

@ChrisLoer will review! Do you think this could go into the next minor release instead (4.1.0 on iOS)?

> I'm unhappy that we've been unable to satisfactorily explain the deadlock we see in the simulator, but I don't think it's a good idea to include the various "fixes" we've tried (e.g. wrapping EAGLBackend usage with a global lock), since we don't have a theoretical justification for them and we're probably just altering the timing of the reproduction case.

Agreed. We could wrap any locks/unlocks with a simulator #ifdef, but I'd rather the simulator crashed in this instance and that we reference this pull request in the tests.

/cc @lilykaiser

julianrex

A few minor tweaks - also needs an entry in the ios & macos change logs.

julianrex · 2018-05-29T19:01:14Z

platform/darwin/src/MGLMapSnapshotter.mm

    _snapshotCallback = std::make_unique<mbgl::Actor<mbgl::MapSnapshotter::Callback>>(*mbgl::Scheduler::GetCurrent(), [=](std::exception_ptr mbglError, mbgl::PremultipliedImage image, mbgl::MapSnapshotter::Attributions attributions, mbgl::MapSnapshotter::PointForFn pointForFn) {
        __typeof__(self) strongSelf = weakSelf;
+        // If self had died, _snapshotCallback would have been destroyed and this block would not be executed
+        assert(strongSelf);


We tend to prefer to use the NSAssert family of assert macros. I think in this case we might need to use NSCAssert since the NSAssert macro uses self.

I think it's important that we call the completion block in the case where _snapshotCallback is destroyed. Is that straightforward?

I pushed a change with NSCAssert.

I think calling the iOS completion block on cancellation is outside the scope of the core code -- that is, I don't think you'd want the callback destructor to block waiting for a forced run of the completion callback (which might not be on the same thread the destructor is running from). To handle this in MGLMapSnapshotter, I guess you'd want to explicitly call the completion callback with an error value from cancel and also from...what is it, dealloc? I guess the short answer is that it's not terribly difficult but also not totally straightforward. I think we can call it outside the scope of this fix, although I agree that it's a better interface if the callback always either succeeds or errors. FWIW, the Android interface also allows you to cancel without receiving a callback.
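For reference, the "completion always fires" idea could look roughly like the C++ sketch below. This is not something the PR implements. `CancellableSnapshot`, its methods, and the error strings are all hypothetical, and the callback is invoked synchronously on whichever thread calls `cancel` or runs the destructor, which sidesteps the queue question raised above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical completion signature: an empty error string means success.
using Completion = std::function<void(const std::string& error)>;

class CancellableSnapshot {
public:
    void start(Completion completion) {
        // A request that is still pending gets an error instead of being dropped.
        finish("superseded");
        completion_ = std::move(completion);
    }

    // Would be called when rendering finishes normally.
    void succeed() { finish(""); }

    // Explicit cancel reports an error to the caller.
    void cancel() { finish("cancelled"); }

    // Destruction with a pending request also reports an error.
    ~CancellableSnapshot() { finish("deallocated"); }

private:
    void finish(const std::string& error) {
        if (!completion_) return;  // nothing pending, or already reported
        // Move the callback out first so it can never fire twice.
        Completion completion = std::move(completion_);
        completion_ = nullptr;
        completion(error);
    }

    Completion completion_;
};
```

With this shape every `start` is matched by exactly one callback: success, an explicit cancel, or an error when the owner goes away. A second `cancel` is a no-op.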

julianrex · 2018-05-29T19:04:58Z

platform/darwin/src/MGLMapSnapshotter.mm

+
+ (void)drawAttributedSnapshotWorker:(mbgl::MapSnapshotter::Attributions)attributions snapshotImage:(MGLImage *)mglImage pointForFn:(mbgl::MapSnapshotter::PointForFn)pointForFn queue:(dispatch_queue_t)queue scale:(CGFloat)scale size:(CGSize)size completionHandler:(MGLMapSnapshotCompletionHandler)completion {
+
+    NS_ARRAY_OF(MGLAttributionInfo *)* attributionInfo = [MGLMapSnapshotter generateAttributionInfos:attributions];


Just an FYI for when it comes to merge to master: NS_ARRAY_OF and friends have been removed (but are still in release-boba)

julianrex · 2018-05-29T19:08:46Z

platform/ios/Integration Tests/Snapshotter Tests/MGLMapSnapshotterTest.m

+    [self waitForExpectations:@[expectation] timeout:10.0];
+}
+
+- (void)testMultipleSnapshotters {


I think it's worth temporarily disabling the test so that we don't get false positives during CI. Let's just rename the failing methods something like - (void)disabled_test....

When we merge to master, we can mark this as a pending test (which uses an environment variable to decide whether to run or not).

I don't know the details of how this runs on CI, but I agree we should disable tests that may fail intermittently. Could you disable the tests that you think need disabling?

ChrisLoer · 2018-05-29T21:16:21Z

I think the changelog entry should be:

* Fixed race conditions that could cause crashes when re-using `MGLMapSnapshotter` or using multiple snapshotters at the same time. ([#11831](https://github.com/mapbox/mapbox-gl-native/pull/11831))

I'm not sure where it's supposed to go in the changelog since I'm not sure which release this will go in.


ChrisLoer · 2018-06-01T22:50:55Z

Rebased on master, added changelog entries, and tried to make the integration test changes @julianrex suggested. I have a really shaky understanding of how these integration tests are supposed to work, though, so please review/test/edit accordingly.

julianrex

Looks good

ChrisLoer added the ⚠️ DO NOT MERGE label (work in progress, proof of concept, or on hold) May 4, 2018

ChrisLoer requested a review from julianrex May 4, 2018 00:51

ChrisLoer mentioned this pull request May 4, 2018

Examine lifetime of _snapshotCallback in MGLMapSnapshotter startWithQueue:completionHandler: #11801

Closed

ChrisLoer force-pushed the snapshotter-concurrency branch from 4f2970b to 3ed3c30 Compare May 29, 2018 17:37

ChrisLoer added the iOS, macOS, and crash labels and removed the ⚠️ DO NOT MERGE label May 29, 2018

ChrisLoer force-pushed the snapshotter-concurrency branch from 3ed3c30 to 98727ca Compare May 29, 2018 17:41

julianrex suggested changes May 29, 2018


ChrisLoer and others added 6 commits June 1, 2018 15:35

[ios, test] Add MGLMapSnapshotter integration tests.

eb39d71

[ios, macos] Raise exception if MGLMapSnapshotter is used from thread…

adcb935

… without active run loop.

[ios, macos] Use NSCAssert instead of assert.

5b22c6b

[docs,ios,macos] Changelog entries for MGLMapSnapshotter fix.

501a819

[test,ios] Disable multiple mapsnapshotter test.

4f1360c

8 simultaneous mapsnapshotter test periodically deadlocks in simulator. Also, increase timeouts to decrease chance of spurious test failure.

ChrisLoer force-pushed the snapshotter-concurrency branch from 3dfc244 to 4f1360c Compare June 1, 2018 22:49

ChrisLoer changed the base branch from release-boba to master June 1, 2018 22:50

julianrex approved these changes Jun 4, 2018


ChrisLoer merged commit 6eee872 into master Jun 4, 2018

ChrisLoer mentioned this pull request Jun 4, 2018

MGLMapSnapshotter not thread-safe, can cause crashes #11827

Closed

fabian-guerra mentioned this pull request Jun 7, 2018

[ios, macos] CP 11831 into release-chai #12087

Merged

friedbunny added the needs changelog label (indicates PR needs a changelog entry prior to merging) Jun 7, 2018

friedbunny deleted the snapshotter-concurrency branch June 11, 2018 18:37

friedbunny removed the needs changelog label Jun 11, 2018

ChrisLoer mentioned this pull request Jul 6, 2018

MGLMapSnapshotter iOS issue specific to Mapbox version 4.1+ #12336

Closed
