
Deal with Non-Deterministic Behavior (Ensure Determinism?) #3168

Open
ronghanghu opened this issue Oct 8, 2015 · 8 comments

Comments

@ronghanghu
Member

Although a lot of effort has gone into Caffe (such as the unified RNG) to ensure reproducible and deterministic results, Caffe is currently still non-deterministic in several ways, as described below:

  1. GPU mode: cuDNN can be numerically non-deterministic with CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3.
  2. CPU mode: Intel MKL can be numerically non-deterministic. Details: test_gradient_based_solver fails #3109 (comment)
  3. Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net are non-deterministic (based on race conditions). This is my fault in Multi-GPU Data Parallelism (with Parallel Data Layers) #2903.

1 & 2 (numerical non-determinism) can cause tests that rely on deterministic behavior (such as TestSnapshot in test_gradient_based_solver.cpp) to fail, while 3 can result in bugs like #2977.

This thread is open to discuss how to cope with these issues (and possibly to ensure determinism in Caffe?).
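
For context, a minimal sketch (illustrative only, not taken from the Caffe test suite) of what a fully seeded run looks like today, and why items 1-3 above can still make two such runs diverge:

#include "caffe/caffe.hpp"

int main() {
  // Fix everything that goes through the unified RNG:
  // weight fillers, dropout masks, data shuffling, etc.
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
  caffe::Caffe::set_random_seed(1701);

  // ... build a SolverParameter and run the solver here ...
  // Even with the seed fixed, items 1-3 above (non-deterministic cuDNN
  // backward algorithms, MKL, and the multi-GPU data-fetch ordering) can
  // still make the accumulated gradients differ bitwise between runs.
  return 0;
}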

@RolanChen

I have encountered similar issues while using a self-coded LSTM layer to train a translation model. I am sure I did not introduce any random factors in my own code or in the .prototxt config -- the random seeds remained the same each time; yet each time I ran the training procedure it yielded different loss values, except for the initial one.

Moreover, I turned off the cuDNN flag while compiling the environment, so I guess there might still be some random factors in the training-related parts of Caffe. FYI, I ran the experiments on a single GPU.

@FicusRong

I think the cuDNN non-deterministic behavior is caused by resetting the diffs every time in CuDNNConvolutionLayer::Backward_gpu(). Actually, the diffs are already cleared in Net::ClearParamDiffs(). It seems that this is not a bug in the multi-GPU case but in the "iter_size > 1" case.

template <typename Dtype>
void CuDNNConvolutionLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = NULL;
  Dtype* weight_diff = NULL;
  if (this->param_propagate_down_[0]) {
    weight = this->blobs_[0]->gpu_data();
    weight_diff = this->blobs_[0]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[0]->count(), Dtype(0), weight_diff);
  }
  Dtype* bias_diff = NULL;
  if (this->bias_term_ && this->param_propagate_down_[1]) {
    bias_diff = this->blobs_[1]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[1]->count(), Dtype(0), bias_diff);
  }
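
For what it's worth, here is a rough, paraphrased sketch of the accumulation loop that "iter_size > 1" relies on (not copied from solver.cpp; names and call signatures are simplified), to show why zeroing weight_diff/bias_diff inside Backward_gpu() is a problem:

#include "caffe/net.hpp"

template <typename Dtype>
Dtype AccumulateGradients(caffe::Net<Dtype>* net, int iter_size) {
  // The solver clears the parameter diffs once per iteration...
  net->ClearParamDiffs();
  Dtype loss = 0;
  for (int i = 0; i < iter_size; ++i) {
    // ...and expects every Backward pass to accumulate into them.
    // If CuDNNConvolutionLayer::Backward_gpu() resets the diffs itself,
    // the gradients of the first (iter_size - 1) passes are thrown away
    // and only the last minibatch contributes to the update.
    loss += net->ForwardBackward();
  }
  return loss / iter_size;
}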

@ronghanghu
Member Author

@FicusRong These two lines seem to have been introduced in #3160. I'll take a look. Thanks for reporting!

@ylongqi

ylongqi commented Jan 21, 2016

I encountered problem 3 in multi-GPU training. I use two data input layers (one for images and the other for multi-dimensional labels), and the program crashed with the following error (training is fine on a single GPU). Is there currently any workaround for this problem?

*** Aborted at 1453395416 (unix time) try "date -d @1453395416" if you are using GNU date ***
PC: @ 0x7f17e7c7da5f (unknown)
*** SIGSEGV (@0xaa1c000) received by PID 22896 (TID 0x7f17e9587780) from PID 178372608; stack trace: ***
@ 0x7f17e7b62d40 (unknown)
@ 0x7f17e7c7da5f (unknown)
@ 0x7f17e8c43a9c std::vector<>::erase()
@ 0x7f17e8c42807 caffe::DevicePair::compute()
@ 0x7f17e8c47c4c caffe::P2PSync<>::run()
@ 0x407dc1 train()
@ 0x405bc1 main
@ 0x7f17e7b4dec5 (unknown)
@ 0x4062d1 (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)

@rodrigoberriel

rodrigoberriel commented Feb 15, 2017

@ronghanghu to fix 1, would it be okay if the default algorithms (bwd_filter_algo_ and bwd_data_algo_) were changed to 1 (deterministic, according to the cuDNN docs) when a random_seed is given by the user? I mean, if the user sets the random_seed, they must be expecting deterministic behavior.

I couldn't find any information on the impact this would have in terms of performance. Should we expect any side effects besides performance issues if we manually set those algorithms to 1 and rebuild Caffe as a temporary fix (instead of disabling cuDNN)? NVIDIA's fork already has this change.
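
For anyone who wants to try this locally, a rough sketch of the change (the helper below is hypothetical; in Caffe itself the assignments would go where cudnn_conv_layer.cpp currently selects bwd_filter_algo_ and bwd_data_algo_):

#include <vector>
#include <cudnn.h>

// Pin the backward passes to the ALGO_1 variants, which the cuDNN
// documentation lists as deterministic, instead of the algorithms
// cuDNN would otherwise pick (e.g. BWD_FILTER_ALGO_3, BWD_DATA_ALGO_0).
void ForceDeterministicBackwardAlgos(
    std::vector<cudnnConvolutionBwdFilterAlgo_t>* bwd_filter_algo,
    std::vector<cudnnConvolutionBwdDataAlgo_t>* bwd_data_algo) {
  for (size_t i = 0; i < bwd_filter_algo->size(); ++i) {
    (*bwd_filter_algo)[i] = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;
  }
  for (size_t i = 0; i < bwd_data_algo->size(); ++i) {
    (*bwd_data_algo)[i] = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
  }
}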

@shelhamer
Member

  3. Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net are non-deterministic (based on race conditions). This is my fault in Multi-GPU Data Parallelism (with Parallel Data Layers) #2903.

This is fixed by the switch to the new parallelism in #4563. The non-determinism of cuDNN can be addressed by setting engine: CAFFE instead, and for the CPU one can pick a BLAS other than MKL. I think these are acceptable workarounds. However, a FORCE_DETERMINISM mode that trades performance for determinism could be incorporated for a more fatalistic Caffe.
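
For completeness, a hedged sketch of the engine: CAFFE workaround expressed through the generated protobuf API (the helper and its name are made up for illustration; in practice the same setting is usually written directly in the .prototxt, and the BLAS choice is a build-time option rather than something set here):

#include "caffe/proto/caffe.pb.h"

// Illustrative helper, not part of Caffe: route every convolution layer
// through the Caffe engine instead of cuDNN before the net is constructed.
void UseCaffeEngineForConvolutions(caffe::NetParameter* net_param) {
  for (int i = 0; i < net_param->layer_size(); ++i) {
    caffe::LayerParameter* layer = net_param->mutable_layer(i);
    if (layer->type() == "Convolution") {
      layer->mutable_convolution_param()->set_engine(
          caffe::ConvolutionParameter_Engine_CAFFE);
    }
  }
}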

@gaobb

gaobb commented Sep 24, 2018

I have met the same issue.

@Himeshi

Himeshi commented Mar 17, 2019

I am faced with the same issue.

I tried rodrigoberriel's solution, but still got non-deterministic results when training. Is there a way to get consistent results for each run without disabling cuDNN, since doing so will slow down the training?
