
Deal with Non-Deterministic Behavior (Ensure Determinism?) #3168

Open
ronghanghu opened this issue Oct 8, 2015 · 8 comments

Comments

@ronghanghu
Member

Although a lot of effort has gone into Caffe (such as the unified RNG) to ensure reproducible and deterministic results, Caffe is currently still non-deterministic in several ways, as described below:

  1. GPU mode: cuDNN can be numerically non-deterministic with CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3.
  2. CPU mode: Intel MKL can be numerically non-deterministic. Details: test_gradient_based_solver fails #3109 (comment)
  3. Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net are non-deterministic (based on race conditions). This is my fault in Multi-GPU Data Parallelism (with Parallel Data Layers) #2903.

1 & 2 (numerical non-determinism) can cause tests that rely on deterministic behavior (such as TestSnapshot in test_gradient_based_solver.cpp) to fail, while 3 can result in bugs like #2977.

This thread is open to discuss how to cope with these issues (and possibly to ensure determinism in Caffe?).
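
For context, a minimal sketch (illustrative only, not taken from the Caffe test suite) of what a fully seeded run looks like today, and why items 1-3 above can still make two such runs diverge:

#include "caffe/caffe.hpp"

int main() {
  // Fix everything that goes through the unified RNG:
  // weight fillers, dropout masks, data shuffling, etc.
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
  caffe::Caffe::set_random_seed(1701);

  // ... build a SolverParameter and run the solver here ...
  // Even with the seed fixed, items 1-3 above (non-deterministic cuDNN
  // backward algorithms, MKL, and the multi-GPU data-fetch ordering) can
  // still make the accumulated gradients differ bitwise between runs.
  return 0;
}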

@RolanChen

I have encountered similar issues while using a self-coded LSTM layer to train a translation model. I am sure I did not introduce any random factors in my own code or in the .prototxt config -- the random seeds remained the same each time; yet each time I ran the training procedure it yielded different loss values, except for the initial one.

Moreover, I turned off the cuDNN flag while compiling the environment, so I guess there might still be some random factors in the training-related parts of Caffe. FYI, I ran the experiments on a single GPU.

@FicusRong

I think the cuDNN non-deterministic behavior is caused by resetting the diffs every time in CuDNNConvolutionLayer::Backward_gpu(). Actually, the diffs are already cleared in Net::ClearParamDiffs(). It seems that this is not a bug in the multi-GPU case but in the "iter_size > 1" case.

template <typename Dtype>
void CuDNNConvolutionLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = NULL;
  Dtype* weight_diff = NULL;
  if (this->param_propagate_down_[0]) {
    weight = this->blobs_[0]->gpu_data();
    weight_diff = this->blobs_[0]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[0]->count(), Dtype(0), weight_diff);
  }
  Dtype* bias_diff = NULL;
  if (this->bias_term_ && this->param_propagate_down_[1]) {
    bias_diff = this->blobs_[1]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[1]->count(), Dtype(0), bias_diff);
  }
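
For what it's worth, here is a rough, paraphrased sketch of the accumulation loop that "iter_size > 1" relies on (not copied from solver.cpp; names and call signatures are simplified), to show why zeroing weight_diff/bias_diff inside Backward_gpu() is a problem:

#include "caffe/net.hpp"

template <typename Dtype>
Dtype AccumulateGradients(caffe::Net<Dtype>* net, int iter_size) {
  // The solver clears the parameter diffs once per iteration...
  net->ClearParamDiffs();
  Dtype loss = 0;
  for (int i = 0; i < iter_size; ++i) {
    // ...and expects every Backward pass to accumulate into them.
    // If CuDNNConvolutionLayer::Backward_gpu() resets the diffs itself,
    // the gradients of the first (iter_size - 1) passes are thrown away
    // and only the last minibatch contributes to the update.
    loss += net->ForwardBackward();
  }
  return loss / iter_size;
}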

@ronghanghu
Member Author

@FicusRong These two lines seem to have been introduced in #3160. I'll take a look. Thanks for reporting!

@ylongqi

ylongqi commented Jan 21, 2016

I encountered problem 3 in multi-GPU training. I use two data input layers (one for images and the other for multi-dimensional labels), and the program crashed with the following error (training is fine on a single GPU). Is there currently any workaround for this problem?

*** Aborted at 1453395416 (unix time) try "date -d @1453395416" if you are using GNU date ***
PC: @ 0x7f17e7c7da5f (unknown)
*** SIGSEGV (@0xaa1c000) received by PID 22896 (TID 0x7f17e9587780) from PID 178372608; stack trace: ***
@ 0x7f17e7b62d40 (unknown)
@ 0x7f17e7c7da5f (unknown)
@ 0x7f17e8c43a9c std::vector<>::erase()
@ 0x7f17e8c42807 caffe::DevicePair::compute()
@ 0x7f17e8c47c4c caffe::P2PSync<>::run()
@ 0x407dc1 train()
@ 0x405bc1 main
@ 0x7f17e7b4dec5 (unknown)
@ 0x4062d1 (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)

@rodrigoberriel

rodrigoberriel commented Feb 15, 2017

@ronghanghu to fix 1, would it be okay if the default algorithms (bwd_filter_algo_ and bwd_data_algo_) were changed to 1 (deterministic, according to the cuDNN docs) when a random_seed is given by the user? I mean, if the user sets the random_seed, they must be expecting deterministic behavior.

I couldn't find any information on the impact this would have in terms of performance. Should we expect any side effects besides performance issues if we manually set those algorithms to 1 and rebuild Caffe as a temporary fix (instead of disabling cuDNN)? NVIDIA's fork already has this change.
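
For anyone who wants to try this locally, a rough sketch of the change (the helper below is hypothetical; in Caffe itself the assignments would go where cudnn_conv_layer.cpp currently selects bwd_filter_algo_ and bwd_data_algo_):

#include <vector>
#include <cudnn.h>

// Pin the backward passes to the ALGO_1 variants, which the cuDNN
// documentation lists as deterministic, instead of the algorithms
// cuDNN would otherwise pick (e.g. BWD_FILTER_ALGO_3, BWD_DATA_ALGO_0).
void ForceDeterministicBackwardAlgos(
    std::vector<cudnnConvolutionBwdFilterAlgo_t>* bwd_filter_algo,
    std::vector<cudnnConvolutionBwdDataAlgo_t>* bwd_data_algo) {
  for (size_t i = 0; i < bwd_filter_algo->size(); ++i) {
    (*bwd_filter_algo)[i] = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;
  }
  for (size_t i = 0; i < bwd_data_algo->size(); ++i) {
    (*bwd_data_algo)[i] = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
  }
}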

@shelhamer
Member

  3. Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net are non-deterministic (based on race conditions). This is my fault in Multi-GPU Data Parallelism (with Parallel Data Layers) #2903.

This is fixed by the switch to the new parallelism in #4563. The non-determinism of cuDNN can be addressed by setting engine: CAFFE instead, and for the CPU one can pick a BLAS other than MKL. I think these are acceptable workarounds. However, a FORCE_DETERMINISM mode that trades performance for determinism could be incorporated for a more fatalistic Caffe.
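
For completeness, a hedged sketch of the engine: CAFFE workaround expressed through the generated protobuf API (the helper and its name are made up for illustration; in practice the same setting is usually written directly in the .prototxt, and the BLAS choice is a build-time option rather than something set here):

#include "caffe/proto/caffe.pb.h"

// Illustrative helper, not part of Caffe: route every convolution layer
// through the Caffe engine instead of cuDNN before the net is constructed.
void UseCaffeEngineForConvolutions(caffe::NetParameter* net_param) {
  for (int i = 0; i < net_param->layer_size(); ++i) {
    caffe::LayerParameter* layer = net_param->mutable_layer(i);
    if (layer->type() == "Convolution") {
      layer->mutable_convolution_param()->set_engine(
          caffe::ConvolutionParameter_Engine_CAFFE);
    }
  }
}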

@gaobb

gaobb commented Sep 24, 2018

I have met the same issue.

@Himeshi

Himeshi commented Mar 17, 2019

I am faced with the same issue.

I tried rodrigoberriel's solution, but still got non-deterministic results when training. Is there a way to get consistent results for each run without disabling cuDNN, since doing so will slow down the training?
