
Adding mixed precision training for RTX graphic cards #210

Closed
drapado opened this issue Apr 12, 2019 · 20 comments
Labels
enhancement New feature or request

Comments

@drapado

drapado commented Apr 12, 2019

Hi,

Thanks for your work, it's a very nice implementation of yolov3.

I have an RTX 2060 and by using mixed precision training I got a speed increase in the training process, and you can almost double the batch size.

I added NVIDIA apex.amp to your code, which is a very easy way to add mixed precision training to a PyTorch model. The code you need to add is:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

And replace loss.backward() with

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

And with that you are using mixed precision training and can almost double the batch size while training on a GPU with tensor cores.
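For reference, here is a minimal end-to-end sketch of the same integration on a toy model (the model, data, and training loop below are illustrative placeholders, not this repo's train.py):

import torch
import torch.nn as nn
from apex import amp  # NVIDIA apex: https://github.com/NVIDIA/apex

# Toy stand-ins for the real model and data (illustrative only)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Wrap model and optimizer once, before the training loop.
# O1 patches common ops to run in FP16 while keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for _ in range(10):
    x = torch.randn(32, 64, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    # Scaling the loss keeps small FP16 gradients from underflowing;
    # apex unscales them again before optimizer.step().
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()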

@drapado drapado added the enhancement New feature or request label Apr 12, 2019
@glenn-jocher
Member

@drapado wow, I didn't realize it was that easy. We do our onsite training using a 1080 Ti; we'd held off buying an RTX card because the FP32 bump wasn't too impressive, about 1/3 faster I believe.

  • Have you benchmarked your training speed before and after this change?
  • I read that the best way to implement FP16 is to use FP16 for the forward passes, but then compute the gradient and apply the optimizer as FP32. Does your update handle this correctly?

@glenn-jocher
Member

glenn-jocher commented Apr 12, 2019

Could you submit a PR with the changes? The scaled_loss.backward() replacement seems easy enough, but where exactly would model, optimizer = amp.initialize(model, optimizer, opt_level="O1") go, before or after we pass model to model = torch.nn.parallel.DistributedDataParallel(model)?
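(The apex docs appear to call amp.initialize on the bare model, i.e. before the DDP wrap. A minimal ordering sketch, assuming model and optimizer are already constructed:)

from apex import amp
import torch

# Initialize amp on the bare model and optimizer first...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# ...then wrap the returned model for distributed training.
model = torch.nn.parallel.DistributedDataParallel(model)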

A useful test for the PR would be to plot one of the training tutorials with and without the code. For example data_100img.txt runs in about 5-10 minutes:
https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

git pull  # download latest updates
rm results.txt  # remove existing results
python3 train.py --nosave --data data/coco_10img.data && mv results.txt results_10img.txt
python3 train_new.py --nosave --data data/coco_10img.data && mv results.txt results_10img_fp16.txt
python3 -c "from utils import utils; utils.plot_results()"

@drapado
Author

drapado commented Apr 12, 2019

Have you benchmarked your training speed before and after this change?

Yes, I did a small test on my RTX 2060 with 6 GB of RAM. These are the results:

type   img-size  batch-size  time per batch (s)
fp32   416       5           0.209
mixed  416       5           0.16
mixed  416       13          0.305
fp32   256       20          0.26
mixed  256       20          0.202
mixed  256       36          0.314

The time per batch is approximate, since I took a couple of values from the terminal output while training. The batch size is set to the maximum that doesn't give an out-of-memory error. You can increase the batch size considerably with mixed precision.

I read that the best way to implement FP16 is to use FP16 for the forward passes, but then compute the gradient and apply the optimizer as FP32. Does your update handle this correctly?

Yes, it does. You can also configure it with different optimization levels (see opt levels). I use level O1, since I didn't notice any difference in speed with O2.
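For reference, a short summary of the documented apex opt levels (model and optimizer are assumed to be already defined; the comments paraphrase the apex docs):

from apex import amp

# opt_level choices (paraphrased from the apex docs):
#   "O0" - pure FP32 baseline (amp is effectively a no-op)
#   "O1" - mixed precision: casts ops to FP16 where it is known to be safe
#   "O2" - "almost FP16": FP16 model with FP32 batchnorm and FP32 master weights
#   "O3" - pure FP16, mainly useful as a speed baseline
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")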

Could you submit a PR with the changes?

Unfortunately I don't have the time to submit a PR right now, sorry. But I attach the train.py that I've been using with the changes. The changes are in lines 11, 14, 116-117 and 162-166. It's quite straightforward.
train.py.txt

@glenn-jocher
Member

I've added mixed precision support to train.py in f299d83. I'll try and recreate the tutorial training curves to overlay and compare.

@drapado I assume that during training test.py also exploits mixed precision, since it uses the model passed to it by train.py, but if test.py is called by itself it does not currently support mixed precision. That would be useful for establishing test accuracies of the mixed precision models, but it looks like the amp API needs an optimizer passed to it. Do you know what happens if you pass None or [] instead of supplying an optimizer in model, optimizer = amp.initialize(model, optimizer, opt_level="O1")?

@drapado
Author

drapado commented Apr 13, 2019

I've added mixed precision support to train.py in f299d83. I'll try and recreate the tutorial training curves to overlay and compare.

Nice, thanks!

I also tried to make the forward pass in test.py run as FP16 using nvidia amp, but I didn't manage it either. In my experience amp.initialize needs an optimizer, so I created one just to call the function. However, when I tried to run test.py I got errors saying that some steps in models.py require float32 tensors instead of half tensors (FP16). Did you get the same error?

I'll let you know if I manage to run test.py with an FP16 model.
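In case it helps anyone hitting the same dtype errors: a common manual FP16-inference pattern (a general sketch, not this repo's test.py; the helper name is made up) is to halve the model and inputs but keep BatchNorm layers in FP32, since on CUDA the batch-norm kernels accept FP16 inputs with FP32 parameters:

import torch
import torch.nn as nn

def half_except_bn(model):
    # Cast parameters/buffers to FP16, but return BatchNorm layers to FP32;
    # their running statistics are sensitive to reduced precision.
    model.half()
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.float()
    return model

# Toy usage (illustrative model, not the repo's Darknet)
model = half_except_bn(nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).cuda()).eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 32, 32, device="cuda").half())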

@glenn-jocher
Member

glenn-jocher commented Apr 14, 2019

@drapado actually I wasn't able to test it on our GCP VM (https://docs.ultralytics.com/yolov5/environments/google_cloud_quickstart_tutorial/) because I had some install problems with apex. It seems not to be preinstalled on the DL VM that Google offers, so I tried installing it the two separate ways from https://github.com/NVIDIA/apex/tree/master/examples/imagenet, but after installing, from apex import amp always fails.

It's a shame; it looks like it could speed up training on V100s significantly. Any recommendations on the install?

@drapado
Author

drapado commented Apr 14, 2019

Yeah, installing it with the C++ extensions is complicated; I didn't manage because you sort of have to build pytorch on your own with some flag set to 1 (I don't remember which one). So in the end I just use the Python-only build, pip install -v --no-cache-dir . (see https://github.com/NVIDIA/apex#quick-start).

I tested your code; it seems there is an error here:

yolov3/train.py

Line 104 in 52464f5

model, optimizer = amp.initialize(model, optimizer, opt_level='01')

It should be O1 with the letter O instead of a zero. Once that is changed, the software works for me.

@glenn-jocher
Member

Ah! Got it, just committed the fix. We are planning some hyperparameter searches soon to try and improve the training; it would be awesome if we could run these all at mixed precision, as the time (and money) savings would be enormous. I'll keep trying the installation.

BTW, we recently and inadvertently discovered a change that can improve the training (reducing the wh loss multiple from 4 to 1), which you can pick up if you git pull. We tried the same with xy, but that produced worse results. These are the sorts of hyperparameter searches that need more time and effort. See #211.

@glenn-jocher
Member

glenn-jocher commented Apr 17, 2019

I can't get the pip install -v --no-cache-dir . install to work on our GCP Deep Learning VMs; I've raised an issue over at the apex repo:
NVIDIA/apex#259

@mcarilli

mcarilli commented Apr 17, 2019

In top-of-tree Apex, the optimizers argument to amp.initialize is now optional (a number of people have asked for this). You can pass a model or a list of models without supplying an optimizer.
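A minimal sketch of that optimizer-less call for an inference-only script (the model here is an illustrative placeholder; per the apex docs, when no optimizer is passed only the model is returned):

import torch.nn as nn
from apex import amp

model = nn.Linear(10, 2).cuda()
# With top-of-tree apex the optimizer argument may be omitted;
# amp.initialize then returns just the patched model(s).
model = amp.initialize(model, opt_level="O1")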

Yeah, installing it with the C++ extensions is complicated; I didn't manage because you sort of have to build pytorch on your own with some flag set to 1 (I don't remember which one). So in the end I just use the Python-only build, pip install -v --no-cache-dir . (see https://github.com/NVIDIA/apex#quick-start).

Are you referring to the -D_GLIBCXX_USE_CXX11_ABI=0 or 1 issue? Under the hood, Pytorch's extension builder detects/anticipates two possible cases (see NVIDIA/apex#212 (comment)):

  1. If you're compiling Apex against a pip or conda installed torch binary (which are compiled by upstream with -D_GLIBCXX_USE_CXX11_ABI=0), Pytorch's extension builder should detect this, and set -D_GLIBCXX_USE_CXX11_ABI=0 for the Apex extension build as well.
  2. If you're compiling Apex against a version of Pytorch that you installed from source, Pytorch's extension builder does NOT attempt to set -D_GLIBCXX_USE_CXX11_ABI=anything. Rather, it assumes that the environment (including the current value of -D_GLIBCXX_USE_CXX11_ABI) in which you're currently compiling Apex extensions is the same as it was when you compiled Pytorch on your system, in which case the value of that variable used while compiling Apex will match the value that was used while compiling Pytorch.

In both case 1 and case 2, the upshot is that compiling extensions should "just work" without having to worry about -D_GLIBCXX_USE_CXX11_ABI=0 or 1. If this is not the case, I suspect you are somehow violating the assumptions/cases that Pytorch's extension builder can handle (in your issue NVIDIA/apex#220 you said you were doing an Arch Linux install, which I have never tried).

@drapado
Author

drapado commented Apr 18, 2019

Hi @mcarilli, thanks for your answers

Are you referring to the -D_GLIBCXX_USE_CXX11_ABI=0 or 1 issue? Under the hood, Pytorch's extension builder detects/anticipates two possible cases (see NVIDIA/apex#212 (comment)):

Yes, I'm referring to this issue. It seems the solution to make apex work with the cpp extensions on Arch Linux is to build pytorch yourself with the proper -D_GLIBCXX_USE_CXX11_ABI flag (see https://aur.archlinux.org/packages/python-apex-git/). I also tried with a conda installation of pytorch and it doesn't work either.

@glenn-jocher
Member

@drapado mixed precision is now integrated into the repo, and will be used by default if nvidia apex is installed. Closing issue.

@H-YunHui

H-YunHui commented Oct 1, 2019

@glenn-jocher
when I use mixed_precision to train, something goes wrong like this:

File "/home/cumt_506/anaconda/envs/zyh/lib/python3.7/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

I have tried many times but failed to solve it. Have you ever encountered this problem?

@glenn-jocher
Member

@H-YunHui mixed precision is used automatically if you have nvidia apex installed. See https://github.com/NVIDIA/apex

@H-YunHui

H-YunHui commented Oct 3, 2019

@glenn-jocher
have you installed nvidia apex and used mixed_precision to train in this repo?
I have installed nvidia apex and tried mixed precision training many times, but every time something goes wrong as shown above. I looked at https://github.com/NVIDIA/apex but couldn't find the answer to my question; I'm confused.

@glenn-jocher
Member

glenn-jocher commented Oct 3, 2019

Mixed precision works correctly if you have apex installed. Install it correctly, or run from Docker or Colab to use it.

@piyp791

piyp791 commented Oct 23, 2019

Hi,

I am trying to train the model on a single GPU.
Here's the start configuration:

Namespace(accumulate=2, adam=False, arc='default', batch_size=32, bucket='', cache_images=False, cfg='cfg/yolov3-custom.cfg', data='data/coco.data', device='', epochs=5, evolve=False, img_size=416, img_weights=False, multi_scale=False, name='', nosave=False, notest=False, prebias=False, rect=False, resume=False, transfer=True, var=None, weights=' weights/darknet53.conv.74')
Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11172MB)

I don't have apex installed and I have mixed_precision set to False (in train.py), but I am still getting the following error:

Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fc61fd624e0>>
Traceback (most recent call last):
  File "/home/local/ASUAD/ppapreja/anaconda3/envs/pygpu36/lib/python3.6/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fc6b2a78048>>
Traceback (most recent call last):
  File "/home/local/ASUAD/ppapreja/anaconda3/envs/pygpu36/lib/python3.6/tkinter/__init__.py", line 3507, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

Am I missing something here?
My environment:
Python: 3.6
Torch: 1.1
CUDA: 9.0
Is there a way to run without mixed-precision?

@glenn-jocher
Member

@piyp791 sure, you set this in train.py:

yolov3/train.py

Lines 12 to 17 in aae39ca

mixed_precision = True
try:  # Mixed precision training https://github.com/NVIDIA/apex
    from apex import amp
except:
    mixed_precision = False  # not installed

@zhangyilalala

@piyp791 Hi, have you solved this problem? I've run into the same problem.

@pderrenger
Member

@zhangyilalala the error you're encountering seems to be related to Tkinter and not directly to mixed precision training or the YOLOv3 code. It might be an issue with your environment or a conflict with another library. Make sure you're running your training script from the command line and not within an interactive environment like Jupyter notebooks or IDLE, which can sometimes cause issues with threading and Tkinter.

If you're not using Tkinter in your code, try to ensure there are no background processes or other parts of your code that might be inadvertently invoking Tkinter. If the problem persists, you might want to isolate the training script in a clean environment to rule out any conflicts.
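If the Tkinter calls are coming from matplotlib's default TkAgg backend (a guess based on the tkinter.PhotoImage in the tracebacks above), forcing the non-interactive Agg backend before pyplot is first imported is a common fix:

# Assumption: the tkinter calls originate from matplotlib's TkAgg backend.
# Agg is non-interactive and never touches Tk; select it before any
# code imports matplotlib.pyplot.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1])
plt.savefig("results.png")  # figures are written to disk instead of shown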
