Synchronous SGD via layer-wise parallelism #2219
This means that Caffe::Get has to be moved to common.cpp, and loses its "inline" (but there are no real performance implications).
Instead of just keeping track of input and output blobs, also keep track of layer dependencies. (Also adjust AppendBottom's argument types to avoid passing an input as a pointer.)
This simplifies the OS X build, and will allow use of the per-thread default stream for running existing layer code asynchronously.
Note that this may cause issues with code that assumes either explicit or device-level synchronization, which we'll fix in the next commit.
This ensures that layers are synchronous with respect to each other, even when layer code doesn't use explicit streams.
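For context, a minimal sketch of the kind of event-based ordering this implies (hypothetical names, not the PR's code; with `--default-stream per-thread`, each thread's default stream is distinct, so cross-stream dependencies need explicit events):

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: make a consumer layer's stream wait until a producer
// layer's stream has finished its queued work, without blocking the CPU.
void SyncLayers(cudaStream_t producer, cudaStream_t consumer) {
  cudaEvent_t done;
  // Timing is not needed here; disabling it makes the event cheaper.
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, producer);         // marks the end of producer's work
  cudaStreamWaitEvent(consumer, done, 0);  // consumer waits on the GPU side
  cudaEventDestroy(done);
}
```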
There are no cases where Forward is called without Reshape, so we can simplify the call structure.
This will allow us to cleanly kill compute threads that are waiting for work.
This gives us a way to specify layer-level execution placement for layerwise parallelism, implemented in future commits.
Split layer gains a param, top_device, which allows tops to exist on different (explicitly specified) devices. Params are automatically copied and diffs are automatically accumulated. Because the implementation is now device-agnostic, it's done in (only) the *_cpu functions.
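As an illustration of the diff accumulation described above, here is a minimal, hypothetical sketch of a split layer's CPU backward pass (the function name and signature are made up; the PR's actual implementation is device-agnostic and lives in the `*_cpu` functions):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: a split layer's backward pass sums the diffs of all
// of its tops into its single bottom diff.
void SplitBackwardCpu(const std::vector<const float*>& top_diffs,
                      float* bottom_diff, int count) {
  std::fill(bottom_diff, bottom_diff + count, 0.0f);
  for (const float* top_diff : top_diffs) {
    for (int i = 0; i < count; ++i) {
      bottom_diff[i] += top_diff[i];
    }
  }
}
```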
This fills in the top_device param of split layer according to the device params of the connecting layers.
This is necessary to ensure that buffers are allocated on the correct devices.
Compute threads hold (blocking) queues of forward or backward commands, which are synchronized according to the layer graph through Net member variables.
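A minimal sketch of this pattern (hypothetical names, using `std::thread`-era primitives rather than the `boost::thread` machinery the PR actually uses):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical sketch: each compute thread pops forward/backward commands
// from a blocking queue; Net pushes commands in layer-graph order.
class BlockingQueue {
 public:
  void Push(std::function<void()> cmd) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(cmd));
    }
    cond_.notify_one();
  }
  std::function<void()> Pop() {  // blocks until a command is available
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return !queue_.empty(); });
    std::function<void()> cmd = std::move(queue_.front());
    queue_.pop();
    return cmd;
  }
 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::queue<std::function<void()>> queue_;
};

// A compute thread runs commands until it receives an empty one -- the
// "cleanly kill" signal mentioned in an earlier commit.
void ComputeThreadLoop(BlockingQueue* queue) {
  while (std::function<void()> cmd = queue->Pop()) {
    cmd();
  }
}
```

In this reading, Net would push one command per layer into the queue matching the layer's thread ID, after enforcing the DAG dependencies described above.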
This fully exercises the multi-GPU case, and saves time.
This is necessary to ensure that operations are performed on the correct device.
Why would a PR include, and be blocked by, so many other PRs?
The peer copy fails when using the cudaDeviceEnablePeerAccess() function in net.cpp at line 349.
This PR upgrades Caffe with integrated support for layer-wise multi-device parallelism, which lets Caffe perform parallel synchronous SGD using model-parallel, data-parallel, or hybrid approaches, including parallelization across GPUs, CPU/GPU parallelism, and multi-threaded CPU parallelism. It is a work in progress in proof-of-concept state.
The basic approach is as follows:

- Each layer gains two new parameters, `device` and `thread_id`. Each provides a (logical) description of which device the layer will be run on (and where its memory will be allocated). Thread IDs work like CUDA streams; layers are launched in their specified order, and automatically synchronized according to the network DAG. Devices and thread IDs are specified manually; poor choices will result in poor performance.
- Copies between devices are performed by split layers. This adds a parameter (`top_device`) to split layer, but this is done automatically for automatically inserted split layers; a sketch of the copy appears after this list.
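As a rough illustration (not the PR's code; the function name and signature are made up), the copy a split layer with an explicit `top_device` has to perform might look like this:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical sketch: copy a bottom blob on one device to a top blob on
// another, as a split layer with top_device set would need to do.
void CopyAcrossDevices(const void* bottom, int bottom_device,
                       void* top, int top_device, size_t bytes) {
  cudaSetDevice(top_device);
  // If cudaDeviceEnablePeerAccess has been called for this device pair, the
  // copy goes directly over the peer path; otherwise the runtime stages it
  // through host memory, which is slower but still correct.
  cudaMemcpyPeer(top, top_device, bottom, bottom_device, bytes);
}
```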
Limitations:

- Automatically inserted split layers have `thread_id` -1, and therefore all copies are done in the same thread. This can be worked around by manually specifying split layers.
- There is no way to select particular physical devices (`device_id`s as given to layers are purely logical and mapped arbitrarily to physical IDs starting from 0). Use `CUDA_VISIBLE_DEVICES` instead.
- So far this has only been used with `DummyDataLayer`. Using this with real data layers will cause problems, due to some subtle and unnecessary singleton interaction. Ultimately prefetching should probably just be integrated into this framework, but that requires additional functionality for pipelining computation. There are very ugly ways to hack around this for now if you really want to.
- Layer-level device selection still has to be reconciled with `Caffe`'s own `Brew` enum somehow.

A simple example is included for CaffeNet data parallelism (not using real data layers). I've found that it performs near 2x on 2 GPUs, ~2.5x on 3 GPUs, and poorly on 4 GPUs. I'm sure that with additional tuning/better execution planning, much nearer-linear scaling can be achieved. The example is generated using #2086 (but the output is included). Note that actually solving will have some (small) additional overhead from H2D transfer of data (which should be made asynchronous) and from the solver parameter-update code.
NVTX instrumentation of forward/backward calls is included, and together with `nvprof`/`nvvp` it makes performance issues pretty easy to find. It's a build option, `WITH_NVTX`.
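For readers unfamiliar with NVTX, instrumentation of this kind looks roughly as follows (a generic sketch, not the PR's code; the wrapper function is hypothetical):

```cpp
#include <nvToolsExt.h>

// Generic sketch: wrap a layer's forward call in a named NVTX range so it
// shows up as a labeled interval in nvprof/nvvp timelines.
void TimedForward(const char* layer_name) {
  nvtxRangePushA(layer_name);  // open a range named after the layer
  // ... run the layer's Forward here ...
  nvtxRangePop();              // close the range
}
```

Using NVTX requires linking against `nvToolsExt`; presumably the `WITH_NVTX` build option takes care of that.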
A hacky option, `-notime_layers`, is added to `caffe time` to do rough-and-ready parallel timing.

This PR is largely orthogonal to the net-level parallelism of the parallel branch, even though both can be used to implement some of the same things, like pure data parallelism for synchronous SGD (see #2114). Some common functionality has been factored out. I have no plans to support multi-node parallelism or asynchronous SGD using this code; you can combine it with the parallel branch for that.
There are many existing PRs included in this one, among them:

- `net.hpp` include changes, so that `Net` can get intimate with `boost::thread`
- a `param_bottoms` option, so that split layer can be used for sharing parameters

Note that omitting included PRs and (very long) examples, the total diff here is on the order of a few hundred lines.
This is still work-in-progress, but hopefully usable/hackable by eager beavers.
Major TODOs:
- make mode a per-`Net` option, or coordinate this PR with mode switching some other way