Multi GPU training #125

Open
wants to merge 1 commit into
base: master
from

Conversation

alessandro-montanari

I tried this code on CPU and on a single GPU and it works fine. I tried a previous version on 4 GPUs and it worked fine too. I will be able to try this version on multiple GPUs on Monday.
Let me know what you think.

@experiencor
Owner

This is great! Will let you know if I am able to run the code.

@experiencor
Owner

experiencor commented Jan 15, 2018

@alessandro-montanari I find that the multi-GPU version produces worse results than the single-GPU version. It makes a lot of wrong detections. Is there anything I should take note of when running the multi-GPU version?

@alessandro-montanari
Author

alessandro-montanari commented Jan 15, 2018

That's weird.
What's your batch size? Maybe you need to train for longer, because with a bigger batch size there are fewer updates per epoch.
I am trying it on the raccoon dataset.

@alessandro-montanari
Author

Unfortunately, I am hitting a strange problem with the images: the code fails in preprocessing.py line 238 (h, w, c = image.shape) with ValueError: not enough values to unpack (expected 3, got 2). This is not caused by the code. I am running on a cluster where I can test multiple GPUs, but I have always had odd problems with JPEG files there. I also tried the master branch and got the same error.
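
One possible reading of this error, offered only as a guess since the cause on the cluster was not confirmed, is that some JPEGs are decoded as 2-D grayscale arrays, so `image.shape` has two entries instead of three. A minimal sketch of a defensive unpacking helper follows; `image_dims` is a hypothetical name, not a function from this repository.

```python
def image_dims(shape):
    """Return (height, width, channels) for a 2-D or 3-D image shape."""
    if len(shape) == 2:
        # Grayscale image: no channel axis, so treat it as one channel.
        height, width = shape
        return height, width, 1
    if len(shape) == 3:
        return tuple(shape)
    raise ValueError("unexpected image shape: {!r}".format(shape))


# Usage at the failing line would be: h, w, c = image_dims(image.shape)
print(image_dims((416, 416)))      # (416, 416, 1)
print(image_dims((416, 416, 3)))   # (416, 416, 3)
```

This only avoids the unpacking error. If the rest of the pipeline expects three channels, the array itself would also need converting.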

Anyway, with the code we use for our application (basically this one plus some other changes to your implementation), we saw no loss in accuracy going from 1 GPU (batch size = 40) to 4 GPUs (batch size = 160).
Is your validation loss very different from that of the single-GPU version?
Do you evaluate the model immediately after training, or do you reload the weights first?

I'll try to come back to this, but please let me know if you have any news.

@msis

msis commented Feb 14, 2018

Has anyone been able to train with more than one GPU?
Over here, at the end of the first epoch, Keras crashes when trying to save the model.

@alessandro-montanari
Author

@msis what error do you get?

@msis

msis commented Feb 15, 2018

@alessandro-montanari Here's the trace with Python 2:

Traceback (most recent call last):
  File "train.py", line 144, in <module>
    _main_(args)
  File "train.py", line 140, in _main_
    debug              = config['train']['debug'])
  File "/home/ubuntu/dl/basic-yolo-keras/frontend.py", line 478, in train
    max_queue_size   = 8)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/training.py", line 2213, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/callbacks.py", line 76, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/callbacks.py", line 418, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/topology.py", line 2573, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/models.py", line 111, in save_model
    'config': model.get_config()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/topology.py", line 2414, in get_config
    return copy.deepcopy(config)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
TypeError: can't pickle NotImplementedType objects

and in Python 3:

Traceback (most recent call last):
  File "train.py", line 144, in <module>
    _main_(args)
  File "train.py", line 140, in _main_
    debug              = config['train']['debug'])
  File "/home/smr/tmp/basic-yolo-keras/frontend.py", line 478, in train
    max_queue_size   = 8)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 2117, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py", line 73, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py", line 414, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 2556, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/models.py", line 107, in save_model
    'config': model.get_config()
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 2397, in get_config
    return copy.deepcopy(config)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 218, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/anaconda/envs/py35/lib/python3.5/copy.py", line 306, in _reconstruct
    y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f40ec384e48>>
Traceback (most recent call last):
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable

N.B. I used 2to3 to run the project with Python 3.
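
Both traces end inside `self.model.save(...)`, called from a Keras callback at the end of the epoch. Assuming the training model here is the wrapper returned by `keras.utils.multi_gpu_model` (the traces do not show this directly), the Keras documentation recommends saving through the template model that was passed to `multi_gpu_model` rather than through the wrapper. A minimal sketch of a checkpoint callback that does so is below. `TemplateCheckpoint` and its arguments are hypothetical names, and the `object` fallback exists only so the snippet can be imported without Keras.

```python
try:
    from keras.callbacks import Callback
except ImportError:
    # Fallback so the sketch can be imported without Keras installed.
    Callback = object


class TemplateCheckpoint(Callback):
    """Save the single-GPU template model whenever the monitored loss improves."""

    def __init__(self, template_model, filepath, monitor="val_loss"):
        super(TemplateCheckpoint, self).__init__()
        self.template_model = template_model  # model passed to multi_gpu_model
        self.filepath = filepath
        self.monitor = monitor
        self.best = float("inf")

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is not None and current < self.best:
            self.best = current
            # Save via the template, not the multi-GPU wrapper in self.model.
            self.template_model.save_weights(self.filepath)
```

Under these assumptions it would be passed to `fit_generator` in the `callbacks` list in place of the built-in checkpoint callback, so that `save` is never called on the multi-GPU wrapper.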

@wooeagle

I have an issue with multi-GPU training.
I trained on my own dataset (6,000 images) using your multi-GPU code,
but I got an mAP of 0.00 when evaluating (single-GPU = 0.3 / multi-GPU = 0.00).

The only difference in configuration is the batch size (multi-GPU: 64, single-GPU: 16).
What's the problem?
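
The earlier suggestion in this thread, that a larger batch means fewer weight updates, can be checked against these numbers. A rough sketch, assuming one pass over the 6,000 images per epoch and that a final partial batch still counts as a step (the repository's actual step count is not shown here):

```python
import math


def updates_per_epoch(num_images, batch_size):
    """Number of gradient updates in one pass over the dataset."""
    return int(math.ceil(num_images / float(batch_size)))


single_gpu = updates_per_epoch(6000, 16)   # 375
multi_gpu = updates_per_epoch(6000, 64)    # 94
print(single_gpu, multi_gpu)
```

For the same number of epochs, the multi-GPU run therefore makes roughly a quarter of the updates, which may call for more epochs or a retuned learning rate. This does not by itself explain an mAP of exactly 0.00.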


4 participants