Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Fix MGLMapSnapshotter concurrency bugs (issue #11827). #11831

Merged
merged 6 commits into from
Jun 4, 2018

Conversation

ChrisLoer
Copy link
Contributor

I took a stab at making the changes @julianrex and I have been discussing for fixing issue #11827.

  • Biggest change: when we apply the watermark on a background thread, don't capture self (turn most of the related instance methods into class methods)
  • Don't call mbglMapSnapshotter->snapshot from a user-provided queue, since it's an asynchronous call anyway and starting it on the user's queue requires capturing self.

The design goal here is that we should not be passing shared pointers/references across thread boundaries. Because startWithQueue allows users to provide their own queue for handling the completion, we have to assume that anything on that queue can run on another thread (even though the more common startWithCompletion will run on the main/UI thread). When we capture a block for execution on a (potentially) different thread, we need to copy everything it needs by value. Any further changes to MGLMapSnapshotter should only affect the result of future start calls.

It's now possible to remove the "can't restart after cancel" constraint we had (just need to use the last-set "options" to re-initialize the mbglMapSnapshotter), but I haven't done that yet. I'm still getting up to speed on all the context for how this interface is/should be used (see issue #11825).

I'm a bit out of my element in Objective-C so careful review from someone on the iOS team is definitely a must. I'm also not sure of the best way to test beyond what CI gives us -- I've done just the basic loading of snapshots in the macos and ios test apps.

/cc @julianrex @friedbunny @1ec5

@ChrisLoer
Copy link
Contributor Author

@julianrex One thing you mentioned in #11801 is that we may need to make sure the completion callback is called even if the background snapshot operation is abandoned (either by an explicit "cancel" or by destruction of the MGLMapSnapshotter). If that's a requirement, we'll have to figure out how to add that logic explicitly (basically whenever mbglMapSnapshotter or the snapshotCallback can get destroyed/reset).

@ChrisLoer
Copy link
Contributor Author

ChrisLoer commented May 4, 2018

Not sure if we'll try to touch the docs with this PR, but I was just looking at the camera/coordinateBounds interplay, and the docs say:

"If this property is non-nil and the coordinateBounds property is set to a non-empty coordinate bounds, the camera’s center coordinate and altitude are ignored in favor of the coordinateBounds property."

As far as I can tell, bearing and pitch are ignored as well. But I'm not sure I understand how cameraForLatLngs is supposed to work.

EDIT: I was wrong, ignore the above.

@ChrisLoer
Copy link
Contributor Author

Current state here is that we're able to frequently produce a deadlock somewhere in GL drawing code when we're running six snapshotters at once in the ios test app on the iOS simulator. The deadlock can happen either in native CoreGraphics rendering (applying attribution to a rendered map), or in the background gl-native calls into the EAGLBackend. The trigger seems to be running the attribution-stamping code in a GCD worker queue -- although we haven't identified a specific unsafe operation and it's possible we're just shifting timing around when we move it to the foreground. We haven't seen the deadlock running on a physical iPhone or on macOS.

I would really like to get to the bottom of the cause of the deadlock so we can be confident we're not causing a (potentially rare) deadlock bug in the wild. However, our immediate fix may be to just stop doing attribution on a GCD worker queue. It should be a cheap operation, and fine to do on the foreground. I talked to @ivovandongen today about why we originally did it on the background, and his memory was that we were mainly just copying an Android design pattern of doing as little non-UI work on the foreground as possible. It's worth keeping in mind, though, that our API allows the completion callback to run on arbitrary threads, so if there's a gotcha on sharing NSImage/UIImage across threads we need to figure it out even if we abandon the worker queue.

cc @frederoni (Ivo tells me you helped him with the original implementation)

@ChrisLoer
Copy link
Contributor Author

🤔 Seems like attribution is much more expensive than I would have guessed, here's timing from six runs (in microseconds):

Attribution time: 834962
Attribution time: 515704
Attribution time: 447460
Attribution time: 448551
Attribution time: 453873
Attribution time: 461213

The first run loads logo images from disk, and I thought that might be the expensive part so I cached the logos in memory for subsequent runs, but it still seems to take around half a second to run. That is... way longer than I expected... and probably a good reason to keep doing attribution in the background?

@julianrex
Copy link
Contributor

probably a good reason to keep doing attribution in the background

Sounds like it.

If we temporarily remove the attribution code - is "everything OK"?

@ChrisLoer ChrisLoer added iOS Mapbox Maps SDK for iOS macOS Mapbox Maps SDK for macOS crash and removed ⚠️ DO NOT MERGE Work in progress, proof of concept, or on hold labels May 29, 2018
@ChrisLoer
Copy link
Contributor Author

@julianrex pending your 🍏 , I think we should merge these changes and include them in the next patch release. I'm unhappy that we've been unable to satisfactorily explain the deadlock we see in the simulator, but I don't think it's a good idea to include the various "fixes" we've tried (e.g. wrapping EAGLBackend usage with a global lock), since we don't have a theoretical justification for them and we're probably just altering the timing of the reproduction case.

Current state:

  • None of us have seen any deadlock on physical devices. At this point @julianrex and I have run these tests hundreds of times.
  • On this branch, the "six snapshotter" case reproduces a deadlock frequently (but not 100% of the time) on the iOS simulator.
  • Many of our experimental fixes appear to make the problem go away, but as a caveat, one of my "global lock" experiments seemed to fix the problem but after maybe 30 runs of the repro case it showed up again.
  • I have reproduced the deadlock on the current version of release-boba (although it seemed to take more runs until I saw it), so if this is an issue in production, it's at least not something being introduced by this PR.

@julianrex
Copy link
Contributor

@ChrisLoer will review! Do you think this could go into the next minor release instead (4.1.0 on iOS)?

I'm unhappy that we've been unable to satisfactorily explain the deadlock we see in the simulator, but I don't think it's a good idea to include the various "fixes" we've tried (e.g. wrapping EAGLBackend usage with a global lock), since we don't have a theoretical justification for them and we're probably just altering the timing of the reproduction case.

Agreed. We could wrap any locks/unlocks with a simulator #ifdef, but I'd rather the simulator crashed in this instance and that we reference this pull request in the tests.

/cc @lilykaiser

Copy link
Contributor

@julianrex julianrex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor tweaks - also needs an entry in the ios & macos change logs.

_snapshotCallback = std::make_unique<mbgl::Actor<mbgl::MapSnapshotter::Callback>>(*mbgl::Scheduler::GetCurrent(), [=](std::exception_ptr mbglError, mbgl::PremultipliedImage image, mbgl::MapSnapshotter::Attributions attributions, mbgl::MapSnapshotter::PointForFn pointForFn) {
__typeof__(self) strongSelf = weakSelf;
// If self had died, _snapshotCallback would have been destroyed and this block would not be executed
assert(strongSelf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tend to prefer to use the NSAssert family of assert macros. I think in this case we might need to use NSCAssert since the NSAssert macro uses self.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important that we call the completion block in the case where _snapshotCallback is destroyed. Is that straightforward?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a change with NSCAssert.

I thinking calling the iOS completion block on cancellation is outside the scope of the core code -- that is, I don't think you'd want the callback destructor to block waiting for a forced run of the completion callback (which might not be on the same thread the destructor is running from). To handle this in MGLMapSnapshotter, I guess you'd want to explicitly call the completion callback with an error value from cancel and also from...what is it, dealloc? I guess the short answer is that it's not terribly difficult but also not totally straightforward. I think we can call it outside the scope of this fix, although I agree that it's a better interface if the callback always either succeeds or errors. FWIW, the Android interface also allows you to cancel without receiving a callback.


+ (void)drawAttributedSnapshotWorker:(mbgl::MapSnapshotter::Attributions)attributions snapshotImage:(MGLImage *)mglImage pointForFn:(mbgl::MapSnapshotter::PointForFn)pointForFn queue:(dispatch_queue_t)queue scale:(CGFloat)scale size:(CGSize)size completionHandler:(MGLMapSnapshotCompletionHandler)completion {

NS_ARRAY_OF(MGLAttributionInfo *)* attributionInfo = [MGLMapSnapshotter generateAttributionInfos:attributions];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an FYI for when it comes to merge to master: NS_ARRAY_OF and friends have been removed (but are still in release-boba)

[self waitForExpectations:@[expectation] timeout:10.0];
}

- (void)testMultipleSnapshotters {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth temporarily disabling the test so that we don't get false positives during CI. Let's just rename the failing methods something like - (void)disabled_test....

When we merge to master, we can mark this as a pending test (which uses a environment variable to decide whether to run or not)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the details of how this runs on CI, but I agree we should disable tests that may fail intermittently. Could you disable the tests that you think need disabling?

@ChrisLoer
Copy link
Contributor Author

ChrisLoer commented May 29, 2018

I think the changelog entry should be:

* Fixed race conditions that could cause crashes when re-using `MGLMapSnapshotter` or using multiple snapshotters at the same time. ([#11831](https://github.com/mapbox/mapbox-gl-native/pull/11831))

I'm not sure where it's supposed to go in the changelog since I'm not sure which release this will go in.

ChrisLoer and others added 6 commits June 1, 2018 15:35
- Biggest change: when we apply the watermark on a background thread, don't capture self (turn most of the related instance methods into class methods)
- Don't call mbglMapSnapshotter->snapshot from a user-provided queue, since it's an asynchronous call anyway and starting it on the user's queue requires capturing self.
8 simultaneous mapsnapshotter test periodically deadlocks in simulator.
Also, increase timeouts to decrease chance of spurious test failure.
@ChrisLoer ChrisLoer changed the base branch from release-boba to master June 1, 2018 22:50
@ChrisLoer
Copy link
Contributor Author

Rebased on master, added changelog entries, and tried to make integration test changes @julianrex suggested although I have a really shaky understanding of how these integration tests are supposed to work, so please review/test/edit accordingly.

Copy link
Contributor

@julianrex julianrex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@friedbunny friedbunny added the needs changelog Indicates PR needs a changelog entry prior to merging. label Jun 7, 2018
@friedbunny friedbunny deleted the snapshotter-concurrency branch June 11, 2018 18:37
@friedbunny friedbunny removed the needs changelog Indicates PR needs a changelog entry prior to merging. label Jun 11, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
crash iOS Mapbox Maps SDK for iOS macOS Mapbox Maps SDK for macOS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants