
Adding mixed precision training for RTX graphic cards #210

Closed
drapado opened this issue Apr 12, 2019 · 20 comments
Labels
enhancement New feature or request

Comments

@drapado

drapado commented Apr 12, 2019

Hi,

Thanks for your work, it's a very nice implementation of yolov3.

I have an RTX 2060 and by using mixed precision training I got a speed increase in the training process, and you can almost double the batch size.

I added NVIDIA apex.amp to your code, which is a very easy way to add mixed precision training to a PyTorch model. The code you need to add is:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

And replace loss.backward() with

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

And with that you are using mixed precision training and can almost double the batch size while training on a GPU with tensor cores.
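For reference, here is a minimal end-to-end sketch of the same integration on a toy model (the model, data, and training loop below are illustrative placeholders, not this repo's train.py):

import torch
import torch.nn as nn
from apex import amp  # NVIDIA apex: https://github.com/NVIDIA/apex

# Toy stand-ins for the real model and data (illustrative only)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Wrap model and optimizer once, before the training loop.
# O1 patches common ops to run in FP16 while keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for _ in range(10):
    x = torch.randn(32, 64, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    # Scaling the loss keeps small FP16 gradients from underflowing;
    # apex unscales them again before optimizer.step().
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()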

@drapado drapado added the enhancement New feature or request label Apr 12, 2019
@glenn-jocher
Member

@drapado wow, I didn't realize it was that easy. We do our onsite training using a 1080 Ti; we'd held off buying an RTX card because the FP32 bump wasn't too impressive, about 1/3 faster I believe.

  • Have you benchmarked your training speed before and after this change?
  • I read that the best way to implement FP16 is to use FP16 for the forward passes, but then compute the gradient and apply the optimizer as FP32. Does your update handle this correctly?

@glenn-jocher
Member

glenn-jocher commented Apr 12, 2019

Could you submit a PR with the changes? The scaled_loss.backward() replacement seems easy enough, but where exactly would model, optimizer = amp.initialize(model, optimizer, opt_level="O1") go, before or after we pass model to model = torch.nn.parallel.DistributedDataParallel(model)?
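(The apex docs appear to call amp.initialize on the bare model, i.e. before the DDP wrap. A minimal ordering sketch, assuming model and optimizer are already constructed:)

from apex import amp
import torch

# Initialize amp on the bare model and optimizer first...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# ...then wrap the returned model for distributed training.
model = torch.nn.parallel.DistributedDataParallel(model)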

A useful test for the PR would be to plot one of the training tutorials with and without the code. For example data_100img.txt runs in about 5-10 minutes:
https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

git pull  # download latest updates
rm results.txt  # remove existing results
python3 train.py --nosave --data data/coco_10img.data && mv results.txt results_10img.txt
python3 train_new.py --nosave --data data/coco_10img.data && mv results.txt results_10img_fp16.txt
python3 -c "from utils import utils; utils.plot_results()"

@drapado
Author

drapado commented Apr 12, 2019

Have you benchmarked your training speed before and after this change?

Yes, I did a small test on my RTX 2060 with 6 GB of RAM. These are the results:

type   img-size  batch-size  time per batch (s)
fp32   416       5           0.209
mixed  416       5           0.16
mixed  416       13          0.305
fp32   256       20          0.26
mixed  256       20          0.202
mixed  256       36          0.314

The time per batch is approximate, since I took a couple of values from the terminal output while training. The batch size is set to the maximum that doesn't give an out-of-memory error. You can increase the batch size considerably with mixed precision.

I read that the best way to implement FP16 is to use FP16 for the forward passes, but then compute the gradient and apply the optimizer as FP32. Does your update handle this correctly?

Yes, it does. You can also configure it with different optimization levels (see opt levels). I use level O1, since I didn't notice any difference in speed with O2.
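For reference, a short summary of the documented apex opt levels (model and optimizer are assumed to be already defined; the comments paraphrase the apex docs):

from apex import amp

# opt_level choices (paraphrased from the apex docs):
#   "O0" - pure FP32 baseline (amp is effectively a no-op)
#   "O1" - mixed precision: casts ops to FP16 where it is known to be safe
#   "O2" - "almost FP16": FP16 model with FP32 batchnorm and FP32 master weights
#   "O3" - pure FP16, mainly useful as a speed baseline
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")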

Could you submit a PR with the changes?

Unfortunately I don't have the time to submit a PR right now, sorry. But I attach the train.py that I've been using with the changes. The changes are in lines 11, 14, 116-117 and 162-166. It's quite straightforward.
train.py.txt

@glenn-jocher
Member

I've added mixed precision support to train.py in f299d83. I'll try and recreate the tutorial training curves to overlay and compare.

@drapado I assume that during training test.py also exploits mixed precision, since it uses the model passed to it by train.py, but if test.py is called by itself it does not currently support mixed precision. That would be useful for establishing test accuracies of the mixed precision models, but it looks like the amp API needs an optimizer passed to it. Do you know what happens if you pass None or [] instead of supplying an optimizer in model, optimizer = amp.initialize(model, optimizer, opt_level="O1")?

@drapado
Author

drapado commented Apr 13, 2019

I've added mixed precision support to train.py in f299d83. I'll try and recreate the tutorial training curves to overlay and compare.

Nice, thanks!

I also tried to make the forward pass in test.py run as FP16 using nvidia amp, but I didn't manage it either. In my experience amp.initialize needs an optimizer, so I created one just to call the function. However, when I tried to run test.py I got errors saying that some steps in models.py require float32 tensors instead of half tensors (FP16). Did you get the same error?

I'll let you know if I manage to run test.py with an FP16 model.
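In case it helps anyone hitting the same dtype errors: a common manual FP16-inference pattern (a general sketch, not this repo's test.py; the helper name is made up) is to halve the model and inputs but keep BatchNorm layers in FP32, since on CUDA the batch-norm kernels accept FP16 inputs with FP32 parameters:

import torch
import torch.nn as nn

def half_except_bn(model):
    # Cast parameters/buffers to FP16, but return BatchNorm layers to FP32;
    # their running statistics are sensitive to reduced precision.
    model.half()
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.float()
    return model

# Toy usage (illustrative model, not the repo's Darknet)
model = half_except_bn(nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).cuda()).eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 32, 32, device="cuda").half())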

@glenn-jocher
Member

glenn-jocher commented Apr 14, 2019

@drapado actually I wasn't able to test it on our GCP VM (https://docs.ultralytics.com/yolov5/environments/google_cloud_quickstart_tutorial/) because I had some install problems with apex. It seems not to be preinstalled on the DL VM that Google offers, so I tried installing it the two separate ways from https://github.com/NVIDIA/apex/tree/master/examples/imagenet, but after installing, from apex import amp always fails.

It's a shame; it looks like it could speed up training on V100s significantly. Any recommendations on the install?

@drapado
Author

drapado commented Apr 14, 2019

Yeah, installing it with the C++ extensions is complicated; I didn't manage because you sort of have to build pytorch on your own with some flag set to 1 (I don't remember which one). So in the end I just use the Python-only build, pip install -v --no-cache-dir . (see https://github.com/NVIDIA/apex#quick-start).

I tested your code; it seems there is an error here:

yolov3/train.py

Line 104 in 52464f5

model, optimizer = amp.initialize(model, optimizer, opt_level='01')

It should be O1 with the letter O instead of a zero. Once that is changed, the software works for me.

@glenn-jocher
Member

Ah! Got it, just committed the fix. We are planning some hyperparameter searches soon to try and improve the training; it would be awesome if we could run these all at mixed precision, as the time (and money) savings would be enormous. I'll keep trying the installation.

BTW, we recently and inadvertently discovered a change that can improve the training (reducing the wh loss multiple from 4 to 1), which you can pick up if you git pull. We tried the same with xy, but that produced worse results. These are the sorts of hyperparameter searches that need more time and effort. See #211.

@glenn-jocher
Member

glenn-jocher commented Apr 17, 2019

I can't get the pip install -v --no-cache-dir . install to work on our GCP Deep Learning VMs; I've raised an issue over at the apex repo:
NVIDIA/apex#259

@mcarilli

mcarilli commented Apr 17, 2019

In top-of-tree Apex, the optimizers argument to amp.initialize is now optional (a number of people have asked for this). You can pass a model or a list of models without supplying an optimizer.
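A minimal sketch of that optimizer-less call for an inference-only script (the model here is an illustrative placeholder; per the apex docs, when no optimizer is passed only the model is returned):

import torch.nn as nn
from apex import amp

model = nn.Linear(10, 2).cuda()
# With top-of-tree apex the optimizer argument may be omitted;
# amp.initialize then returns just the patched model(s).
model = amp.initialize(model, opt_level="O1")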

Yeah, installing it with the C++ extensions is complicated; I didn't manage because you sort of have to build pytorch on your own with some flag set to 1 (I don't remember which one). So in the end I just use the Python-only build, pip install -v --no-cache-dir . (see https://github.com/NVIDIA/apex#quick-start).

Are you referring to the -D_GLIBCXX_USE_CXX11_ABI=0 or 1 issue? Under the hood, Pytorch's extension builder detects/anticipates two possible cases (see NVIDIA/apex#212 (comment)):

  1. If you're compiling Apex against a pip or conda installed torch binary (which are compiled by upstream with -D_GLIBCXX_USE_CXX11_ABI=0), Pytorch's extension builder should detect this, and set -D_GLIBCXX_USE_CXX11_ABI=0 for the Apex extension build as well.
  2. If you're compiling Apex against a version of Pytorch that you installed from source, Pytorch's extension builder does NOT attempt to set -D_GLIBCXX_USE_CXX11_ABI=anything. Rather, it assumes that the environment (including the current value of -D_GLIBCXX_USE_CXX11_ABI) in which you're currently compiling Apex extensions is the same as it was when you compiled Pytorch on your system, in which case the value of that variable used while compiling Apex will match the value that was used while compiling Pytorch.

In both case 1 and case 2, the upshot is that compiling extensions should "just work" without having to worry about -D_GLIBCXX_USE_CXX11_ABI=0 or 1. If this is not the case, I suspect you are somehow violating the assumptions/cases that Pytorch's extension builder can handle (in your issue NVIDIA/apex#220 you said you were doing an Arch Linux install, which I have never tried).

@drapado
Author

drapado commented Apr 18, 2019

Hi @mcarilli, thanks for your answers

Are you referring to the -D_GLIBCXX_USE_CXX11_ABI=0 or 1 issue? Under the hood, Pytorch's extension builder detects/anticipates two possible cases (see NVIDIA/apex#212 (comment)):

Yes, I'm referring to this issue. It seems the solution to make apex work with the cpp extensions on Arch Linux is to build pytorch yourself with the proper -D_GLIBCXX_USE_CXX11_ABI flag (see https://aur.archlinux.org/packages/python-apex-git/). I also tried with a conda installation of pytorch and it doesn't work either.

@glenn-jocher
Member

@drapado mixed precision is now integrated into the repo, and will be used by default if nvidia apex is installed. Closing issue.

@H-YunHui

H-YunHui commented Oct 1, 2019

@glenn-jocher
when I use mixed_precision to train, something goes wrong like this:

File "/home/cumt_506/anaconda/envs/zyh/lib/python3.7/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

I have tried many times but failed to solve it. Have you ever encountered this problem?

@glenn-jocher
Member

@H-YunHui mixed precision is used automatically if you have nvidia apex installed. See https://github.com/NVIDIA/apex

@H-YunHui

H-YunHui commented Oct 3, 2019

@glenn-jocher
have you installed nvidia apex and used mixed_precision to train in this repo?
I have installed nvidia apex and tried mixed precision training many times, but every time something goes wrong as shown above. I looked at https://github.com/NVIDIA/apex but couldn't find the answer to my question; I'm confused.

@glenn-jocher
Member

glenn-jocher commented Oct 3, 2019

Mixed precision works correctly if you have apex installed. Install it correctly, or run from Docker or Colab to use it.

@piyp791

piyp791 commented Oct 23, 2019

Hi,

I am trying to train the model on a single GPU.
Here's the start configuration:

Namespace(accumulate=2, adam=False, arc='default', batch_size=32, bucket='', cache_images=False, cfg='cfg/yolov3-custom.cfg', data='data/coco.data', device='', epochs=5, evolve=False, img_size=416, img_weights=False, multi_scale=False, name='', nosave=False, notest=False, prebias=False, rect=False, resume=False, transfer=True, var=None, weights=' weights/darknet53.conv.74')
Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)

I don't have apex installed and I have mixed_precision set to False (in train.py), but I am still getting the following error:

Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fc61fd624e0>>
Traceback (most recent call last):
  File "/home/local/ASUAD/ppapreja/anaconda3/envs/pygpu36/lib/python3.6/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fc6b2a78048>>
Traceback (most recent call last):
  File "/home/local/ASUAD/ppapreja/anaconda3/envs/pygpu36/lib/python3.6/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

Am I missing something here?
My environment:
Python: 3.6
Torch: 1.1
CUDA: 9.0
Is there a way to run without mixed-precision?

@glenn-jocher
Member

@piyp791 sure, you set this in train.py:

yolov3/train.py

Lines 12 to 17 in aae39ca

mixed_precision = True
try:  # Mixed precision training https://github.com/NVIDIA/apex
    from apex import amp
except:
    mixed_precision = False  # not installed

@zhangyilalala

@piyp791 Hi, have you solved this problem? I've run into the same problem.

@pderrenger
Member

@zhangyilalala the error you're encountering seems to be related to Tkinter and not directly to mixed precision training or the YOLOv3 code. It might be an issue with your environment or a conflict with another library. Make sure you're running your training script from the command line and not within an interactive environment like Jupyter notebooks or IDLE, which can sometimes cause issues with threading and Tkinter.

If you're not using Tkinter in your code, try to ensure there are no background processes or other parts of your code that might be inadvertently invoking Tkinter. If the problem persists, you might want to isolate the training script in a clean environment to rule out any conflicts.
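If the Tkinter calls are coming from matplotlib's default TkAgg backend (a guess based on the tkinter.PhotoImage in the tracebacks above), forcing the non-interactive Agg backend before pyplot is first imported is a common fix:

# Assumption: the tkinter calls originate from matplotlib's TkAgg backend.
# Agg is non-interactive and never touches Tk; select it before any
# code imports matplotlib.pyplot.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1])
plt.savefig("results.png")  # figures are written to disk instead of shown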
