Fix crash when pairing an odd number of devices without P2P (BVLC/github issue #3531) #3586
Conversation
/cc @thatguymike
Hi SvenTwo, this patch did not fix it for me. While exploring NVIDIA P2P access further, I found some more information about my system; please see below: root@fs3:/usr/local/cuda-7.5/samples/0_Simple/simpleP2P# ./simpleP2P
Checking GPU(s) for support of peer to peer memory access...
From the output above, it appears that P2P between the GPUs is not supported, although each GPU is capable of it. I also ran a command to check the topology: root@fs3:/usr/local/cuda-7.5/samples/0_Simple/simpleP2P# nvidia-smi topo -m
GPU0 X PIX SOC SOC 0-7,72-127 Legend: X = Self
From the topology above, it appears that GPU0 and GPU1, as well as GPU2 and GPU3, are connected to each other via PIX, and the other combinations via SOC. Based on these results, my conclusion is:
Query:
I think caffe pairs everything with P2P first and then does a non-P2P fallback for the rest. The non-P2P fallback had a bug that caused list corruption, which I'm fixing (at least for my machine) in this pull request. Do you still get the exact same stack trace in your crash? What is the output of the pairing? I believe caffe tells you exactly which devices it paired and whether it was using P2P (run the failing test manually and specify --logtostderr to get the output logs of the test).
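To make the pairing strategy described above concrete, here is a minimal sketch in Python. This is my own illustration, not Caffe's actual C++ code; the `can_access_peer` predicate is a hypothetical stand-in for the CUDA peer-access check. The point is the two-phase structure: pair P2P-capable devices first, then fall back to pairing the remainder, where an odd count leaves one device over.

```python
def pair_devices(devices, can_access_peer):
    """Return (pairs, leftover): pairs prefer P2P-capable links.

    Illustrative sketch only. With an odd device count, one device stays
    unpaired; mishandling that leftover is the kind of list corruption
    this pull request addresses.
    """
    remaining = list(devices)
    pairs = []
    # First pass: greedily pair devices that can reach each other via P2P.
    i = 0
    while i < len(remaining):
        partner = next((j for j in range(i + 1, len(remaining))
                        if can_access_peer(remaining[i], remaining[j])), None)
        if partner is not None:
            pairs.append((remaining[i], remaining.pop(partner)))
            remaining.pop(i)  # remove the first member of the pair as well
        else:
            i += 1
    # Fallback: pair whatever is left without P2P.
    while len(remaining) >= 2:
        pairs.append((remaining.pop(0), remaining.pop(0)))
    return pairs, remaining
```

For example, with four devices where only 0↔1 and 2↔3 have P2P, both pairs come from the first pass; with three devices and no P2P at all, the fallback pairs (0, 1) and leaves device 2 over.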
This is a gdb backtrace for your reference:
[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
Program received signal SIGSEGV, Segmentation fault.
You could compile a debug build (set DEBUG in Makefile.config). It may trigger an assertion earlier if, e.g., vectors are accessed out of bounds. Otherwise, sorry, I have no idea. Maybe it's a different bug after all.
I enabled DEBUG in the Makefile and then took a gdb trace; it is more detailed than the previous one:
[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
Program received signal SIGSEGV, Segmentation fault.
@anuphalarnkar Sorry, I have no idea. But you may find more hints in the log file of the test. E.g., if you run: ./build/test/test_all.testbin --gtest_filter="SGDSolverTest/2.TestLeastSquaresUpdate" --logtostderr you should get all the standard caffe output, including the log messages about GPU pairing. It will be something like "parallel.cpp:390] GPUs pairs 0:1". You might be able to deduce which pairings cause the problem. (Use 2>filename.log to redirect it to a file.) I don't think it's related to this pull request, though.
#3539 is fixed with this pull request
The example describes LevelDB as the DB backend but passes lmdb as the parameter in the execution command.
This can happen if, e.g., testing never occurs in the log.
This would've saved me an overnight download (slow connection here). I tested it, and it worked for me.
refactor duplicate code into separate update function for smoothed loss; fix naming convention
…ype: boost::shared_ptr<caffe::Blob<float> >
and ScaleLayer. The behavior of ChannelwiseAffineLayer can be reproduced by a ScaleLayer with `scale_param { bias_term: true }`. BiasLayer and ScaleLayer each take 1 or 2 bottoms, with the output having the same shape as the first. The second input -- either another bottom or a learned parameter -- will have its axes (virtually) broadcast and tiled to have the same shape as the first, after which elementwise addition (Bias) or multiplication (Scale) is performed.
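As a concrete illustration of the replacement described above, a ScaleLayer reproducing the deprecated behavior might look like the following prototxt fragment. This is a minimal sketch; the layer and blob names (`scale1`, `data`, `scaled`) are placeholders, while `scale_param { bias_term: true }` is taken from the text above.

```
layer {
  name: "scale1"          # placeholder name
  type: "Scale"
  bottom: "data"          # first bottom: the blob to be scaled
  top: "scaled"
  scale_param {
    bias_term: true       # also learn a bias, as with BiasLayer
  }
}
```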
Fix some typos. Correct imports. Refactor data layers. Apply PEP8 formatting.
output info, warnings, and errors for fuller description of the upgrade
check all conditions all the time; V0 -> V1 and V1 -> V2 do not suffice.
convert inputs in legacy definitions (prototxt), but simply strip inputs from legacy weights (caffemodel). fix BVLC#3750
die loudly if a net definition (prototxt) mixes proto formats by defining both `layer` and `layers` fields instead of complaining but discarding and continuing. fix BVLC#3381
- cosmetic change to MKL-related doc
This is a temporary measure to avoid an apparent upstream issue with protobuf 3.0.0b2.post1.
This provides a framework for automatically aligning different layers of a net despite up/downsampling, padding, and output size rounding.
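The alignment framework described above can be understood through the affine coordinate map each layer induces from output coordinates back to input coordinates. The sketch below is my own illustration, not the actual net spec code, and assumes simple conv/pool coordinate arithmetic: an output index x maps to input position stride * x + (kernel - 1)/2 - pad, and maps compose affinely across stacked layers.

```python
def layer_coord_map(kernel, stride, pad):
    """Affine map (a, b) taking an output index x_out to the input-space
    center of its receptive field: x_in = a * x_out + b.
    For a conv/pool layer: a = stride, b = (kernel - 1) / 2 - pad."""
    return stride, (kernel - 1) / 2.0 - pad

def compose(m1, m2):
    """Compose two maps applied in sequence (m1 closer to the input):
    x = a1 * (a2 * y + b2) + b1 = (a1 * a2) * y + (a1 * b2 + b1)."""
    a1, b1 = m1
    a2, b2 = m2
    return a1 * a2, a1 * b2 + b1
```

For example, a 3x3 convolution with stride 1 and pad 1 yields the identity map (1, 0.0), so it preserves alignment, while a 2x2 stride-2 pool yields (2, 0.5); composing the two shows the net downsampling and offset that a Crop layer would need to compensate for.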
- document by docstring and comment
- pep8
- add latest layers and alphabetize
- respect default crop params
- handle graphs with compositions of crops by walking only the first, cropped bottom of Crop layers
- make python3 happy by replacing arg tuple unpacking
- crop -> offset
- adjust crop axis by 1
- test known mappings: conv-pool-deconv stack, ReLU and 1x1 conv
- test effects of padding
- test rectangular/anisotropic coordinate mapping, test N-D
- catch error cases: negative crop, scale mismatch, tops that are not spatially connected
configure offset(s) through proto definition.
Changes are:
- reduce test blob dims for speed
- use standard Gaussian filler
- polish formatting and rename tests
- test HW crop and 5D crop
- standard gradient checks
there seems to be a caching issue at the moment; this is a temporary fix for BVLC#3786
The MKLROOT variable is set by the MKL scripts, so it should also be used in the Makefile.
…ime on repeated calls.
…spec interface. The deconvolution layer uses the convolution_param subgroup for its parameters, so hardcode that connection into param_name_dict().
Resolved by the switch to new parallelism in #4563. Thanks for submitting a fix all the same!
Also simplify the code by not relying on the log2 computation to pre-estimate the pairing counts; just pair until one device remains.
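The simplified strategy above can be sketched as a round-based reduction. This is an illustration under my own assumptions, not the actual implementation: in each round, adjacent devices in the current list are paired, the first of each pair advances to the next round, and an odd leftover simply carries over unpaired, so no log2-based pre-estimate of the pairing count is needed.

```python
def pair_until_one_remains(devices):
    """Return the list of pairing rounds; pairing stops when one
    device (the root of the reduction tree) remains.
    Sketch only, not the actual Caffe code."""
    rounds = []
    current = list(devices)
    while len(current) > 1:
        paired, nxt = [], []
        # Pair adjacent devices in this round.
        for i in range(0, len(current) - 1, 2):
            paired.append((current[i], current[i + 1]))
            nxt.append(current[i])  # the first of each pair advances
        # An odd leftover passes through unpaired to the next round.
        if len(current) % 2 == 1:
            nxt.append(current[-1])
        rounds.append(paired)
        current = nxt
    return rounds
```

With three devices, for instance, the first round pairs (0, 1) and carries 2 over, and the second round pairs (0, 2), leaving device 0 as the single remaining root.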