Multi GPU training #125

alessandro-montanari · 2018-01-13T20:53:23Z

I tried this code on CPU and on a single GPU and it works fine. I tried a previous version on 4 GPUs and it worked fine too. I will be able to try this version on multiple GPUs on Monday.
Let me know what you think.

experiencor · 2018-01-14T13:33:19Z

This is great! Will let you know if I am able to run the code.

experiencor · 2018-01-15T15:47:14Z

@alessandro-montanari I find that the multi-GPU version produces worse results than the single-GPU version: it makes a lot of wrong detections. Is there anything I need to watch out for when running the multi-GPU version?

alessandro-montanari · 2018-01-15T16:12:00Z

That's weird.
What's your batch size? Maybe you need to train for longer, because with a bigger batch size there are fewer weight updates per epoch.
I am trying it on the raccoon dataset.

alessandro-montanari · 2018-01-15T17:55:36Z

Unfortunately I am running into an issue with the images: the code fails in preprocessing.py line 238 (h, w, c = image.shape) with ValueError: not enough values to unpack (expected 3, got 2). This is not caused by the code itself. I am running it on a cluster where I can test multiple GPUs, but where I have always had strange problems with JPEG files. I also tried the master branch and got the same error.
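For reference, "expected 3, got 2" is what Python reports when the unpacked shape tuple has only two entries, which would be the case if an image were loaded as a 2-D (height, width) array with no channel axis. Whether that is what happens on this cluster is not confirmed in the thread. The helper below is only a sketch of a defensive unpacking step; `unpack_image_shape` is a hypothetical name and is not part of preprocessing.py.

```python
def unpack_image_shape(shape):
    """Return (height, width, channels) for a 2-D or 3-D image shape tuple.

    A 2-D shape (no channel axis) is reported as having one channel.
    Hypothetical helper, not part of the project's preprocessing.py.
    """
    if len(shape) == 2:
        h, w = shape
        return h, w, 1
    if len(shape) == 3:
        h, w, c = shape
        return h, w, c
    raise ValueError("unsupported image shape: %r" % (shape,))


# Example shapes: a single-channel 2-D image and a 3-channel colour image.
print(unpack_image_shape((480, 640)))
print(unpack_image_shape((480, 640, 3)))
```

It would be called as `h, w, c = unpack_image_shape(image.shape)`. This only avoids the crash at the unpacking line; any later code that indexes the channel axis would still need the array itself to be expanded to three dimensions.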

Anyway, with the code we are using for our application (basically this one plus some other changes to your implementation) we didn't see any loss in accuracy going from 1 GPU (batch size = 40) to 4 GPUs (batch size = 160).
Is your validation loss very different from that of the single-GPU version of the code?
Do you evaluate the model immediately after training, or do you reload the weights first?

I'll try to come back to this, but please let me know if you have any news.

msis · 2018-02-14T19:00:13Z

Has anyone been able to train with more than one GPU?
Over here, Keras crashes at the end of the first epoch when trying to save the model.

alessandro-montanari · 2018-02-15T10:59:39Z

@msis what error do you get?

msis · 2018-02-15T13:25:51Z

@alessandro-montanari Here's the trace with Python 2:

Traceback (most recent call last):
  File "train.py", line 144, in <module>
    _main_(args)
  File "train.py", line 140, in _main_
    debug              = config['train']['debug'])
  File "/home/ubuntu/dl/basic-yolo-keras/frontend.py", line 478, in train
    max_queue_size   = 8)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/training.py", line 2213, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/callbacks.py", line 76, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/callbacks.py", line 418, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/topology.py", line 2573, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/models.py", line 111, in save_model
    'config': model.get_config()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/topology.py", line 2414, in get_config
    return copy.deepcopy(config)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
TypeError: can't pickle NotImplementedType objects

and with Python 3:

Traceback (most recent call last):
  File "train.py", line 144, in <module>
    _main_(args)
  File "train.py", line 140, in _main_
    debug              = config['train']['debug'])
  File "/home/smr/tmp/basic-yolo-keras/frontend.py", line 478, in train
    max_queue_size   = 8)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 2117, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py", line 73, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py", line 414, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 2556, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/models.py", line 107, in save_model
    'config': model.get_config()
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 2397, in get_config
    return copy.deepcopy(config)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 218, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 306, in _reconstruct
    y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f40ec384e48>>
Traceback (most recent call last):
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable

N.B. I used 2to3 to run the project with Python 3.
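Both traces fail inside the checkpoint callback, at self.model.save(...) -> get_config() -> copy.deepcopy(config). If the parallel model in this PR is built with keras.utils.multi_gpu_model (an assumption; the thread does not say), the Keras documentation for that function recommends saving through the template model that was passed to it rather than through the returned multi-GPU model. A minimal sketch of a callback that does this is below. The names `TemplateCheckpoint` and `template_model` are hypothetical, and the fallback to `object` only exists so the sketch can be read and exercised without Keras installed.

```python
try:
    from keras.callbacks import Callback
except ImportError:
    # Fallback so the sketch runs without Keras; real training needs Keras.
    Callback = object


class TemplateCheckpoint(Callback):
    """Save the weights of the single-device template model after each epoch.

    Hypothetical replacement for a checkpoint callback attached to the
    multi-GPU wrapper: it never calls save() on the wrapper itself.
    """

    def __init__(self, template_model, filepath):
        super().__init__()
        self.template_model = template_model
        self.filepath = filepath

    def on_epoch_end(self, epoch, logs=None):
        # Same naming convention as ModelCheckpoint: epoch is 1-based
        # in the file name, and logged metrics may be used as fields.
        path = self.filepath.format(epoch=epoch + 1, **(logs or {}))
        self.template_model.save_weights(path)
```

In use, one would pass something like TemplateCheckpoint(model, "weights_{epoch:02d}.h5") in the callbacks list of the parallel model's fit_generator call, where model is the original single-device model. This has not been tested against this PR.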

wooeagle · 2019-06-11T02:06:47Z

I have an issue with multi-GPU training.
I trained on my own dataset (6,000 images) using your multi-GPU code,
but I got an mAP of 0.00 when evaluating the multi-GPU model (single-GPU = 0.3 / multi-GPU = 0.00).

The only difference in configuration is the batch size (multi-GPU: 64, single-GPU: 16).
What's the problem?
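The suggestion earlier in the thread, that a larger batch size means fewer weight updates for the same number of epochs, can be checked against these numbers. The sketch below assumes all 6,000 images are used for training, which the comment does not state. It only quantifies the difference in update count and does not by itself explain an mAP of 0.00.

```python
import math


def updates_per_epoch(num_images, batch_size):
    """Number of optimizer steps in one pass over the data,
    counting a final partial batch as one step."""
    return math.ceil(num_images / batch_size)


num_images = 6000  # assumed to be the full training set
single_gpu_steps = updates_per_epoch(num_images, 16)
multi_gpu_steps = updates_per_epoch(num_images, 64)

print(single_gpu_steps, multi_gpu_steps)  # 375 94
```

With the same number of epochs, the multi-GPU run therefore performs roughly a quarter as many updates as the single-GPU run.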

Multi GPU training

2256cc1

alessandro-montanari mentioned this pull request Jan 14, 2018

How I can train on multiGPU #94

Open
