Adding mixed precision training for RTX graphics cards #210
@drapado wow, I didn't realize it was that easy. We do our onsite training using a 1080 Ti; we'd held off buying an RTX card because the FP32 bump wasn't too impressive, about 1/3 faster I believe.
Could you submit a PR with the changes? A useful test for the PR would be to plot one of the training tutorials with and without the code.
Yes, I did a small test on my RTX 2060 with 6 GB of RAM. The time per batch is approximate, since I took a couple of values from the terminal output while training. The batch size is set to the maximum that doesn't give an out-of-memory error. You can increase the batch size considerably with mixed precision.
Yes, it does. There are also different levels to configure it (see the apex opt levels). I use level O1, since I didn't notice any difference in speed with O2.
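For readers new to apex, a minimal, self-contained sketch of what initialization with an opt level looks like (the Linear model and SGD optimizer below are placeholders, not the repo's actual model):

import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# opt_level controls how aggressively apex casts to FP16:
# "O0" = pure FP32, "O1" = patch ops per whitelist/blacklist,
# "O2" = FP16 weights with FP32 master weights, "O3" = pure FP16
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")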
Unfortunately I don't have the time to submit a PR right now, sorry. But I attach the train.py that I've been using with the changes. The changes are in lines 11, 14, 116-117, and 162-166. It's quite straightforward.
I've added mixed precision support to train.py in f299d83. I'll try and recreate the tutorial training curves to overlay and compare. @drapado I assume that during training test.py also exploits mixed precision, since it uses the model passed to it by train.py, but if test.py is called by itself it does not currently support mixed precision. This might be useful for establishing test accuracies of the mixed precision models, but it looks like amp needs an optimizer passed to it. Do you know what happens if you pass …
Nice, thanks! I also tried to make the forward pass in test.py run as FP16 using NVIDIA apex, but I didn't manage it either. I'll let you know if I manage to run test.py with an FP16 model.
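For the standalone test.py case, one route that avoids amp entirely is casting the model and inputs to FP16 by hand; a minimal sketch under the assumption of a CUDA device (the conv layer and input shape are placeholders):

import torch

model = torch.nn.Conv2d(3, 16, 3).cuda().eval().half()  # cast weights to FP16
imgs = torch.rand(1, 3, 416, 416).cuda().half()         # inputs must match dtype
with torch.no_grad():
    out = model(imgs)
print(out.dtype)  # torch.float16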
@drapado actually I wasn't able to test it on our GCP VM (https://docs.ultralytics.com/yolov5/environments/google_cloud_quickstart_tutorial/) because I had some install problems with apex. It seems not to be preinstalled on the DL VM that Google offers, so I tried installing it the two separate ways from https://github.com/NVIDIA/apex/tree/master/examples/imagenet, but ran into errors after install. It's a shame; it looks like it could speed up training on V100s significantly. Any recommendations on the install?
Yeah, installing it with the C++ extensions is complicated; I didn't manage it because you sort of have to build PyTorch on your own with some lib=1 flag set (I don't remember which one). So in the end I just use the Python-only build.
I tested your code, and it seems there is an error at line 104 in 52464f5: it should be O1 with the letter O instead of a zero. Once that is changed, the code works for me.
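A cheap guard against this easy-to-miss typo, assuming the opt level is carried in a string variable (the name opt_level is illustrative):

opt_level = "O1"  # the letter O, not the digit zero
assert opt_level in ("O0", "O1", "O2", "O3"), "apex opt levels begin with the letter O"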
Ah! Got it, just committed the fix. We are planning some hyperparameter searches soon to try and improve the training; it would be awesome if we could run these all at mixed precision, the time (and money) savings would be enormous. I'll keep trying the installation. BTW, we recently inadvertently discovered a change that can improve the training (reducing the …)
I can't get the …
In top-of-tree Apex, the …
Are you referring to the -D_GLIBCXX_USE_CXX11_ABI=0 or 1 issue? Under the hood, PyTorch's extension builder detects/anticipates two possible cases (see NVIDIA/apex#212 (comment)).
In both cases, the upshot is that compiling extensions should "just work" without having to worry about -D_GLIBCXX_USE_CXX11_ABI=0 or 1. If this is not the case, I suspect you are somehow violating the assumptions/cases that PyTorch's extension builder can handle (in your issue NVIDIA/apex#220 you said you were doing an Arch Linux install, which I have never tried).
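For anyone debugging this, PyTorch records which ABI it was built with, and its extension builder mirrors that flag when compiling extensions; a quick check:

import torch

# True: built with the new C++11 ABI; False: the old pre-C++11 ABI
print(torch._C._GLIBCXX_USE_CXX11_ABI)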
Hi @mcarilli, thanks for your answers.
Yes, I'm referring to this issue. It seems the way to make apex work with the C++ extensions on Arch Linux is to build PyTorch yourself with the proper -D_GLIBCXX_USE_CXX11_ABI flag (see https://aur.archlinux.org/packages/python-apex-git/). I also tried with a conda installation of PyTorch and it doesn't work either.
@drapado mixed precision is now integrated into the repo, and will be used by default if nvidia apex is installed. Closing issue. |
@glenn-jocher I have tried many times but failed to solve it. Have you ever encountered this problem?
@H-YunHui mixed precision is used automatically if you have nvidia apex installed. See https://github.com/NVIDIA/apex |
@glenn-jocher …
Mixed precision works correctly if you have apex installed. Install it correctly, or run from Docker or Colab to use it.
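For reference, the install recipe from the apex README of that era (quoted from memory, so treat as approximate; the Python-only build skips the fused C++/CUDA kernels but still provides amp):

git clone https://github.com/NVIDIA/apex
cd apex
# Full build with C++ and CUDA extensions:
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# Or the Python-only fallback:
pip install -v --no-cache-dir ./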
Hi, I am trying to train the model on a single GPU.
I don't have apex installed and I have mixed_precision set to False (in train.py), but I am still getting an error.
Am I missing something here?
@peps0791 Sure, you set this in train.py (lines 12 to 17 in aae39ca).
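Those lines hold the optional-apex guard; a sketch of the pattern (the exact text at aae39ca may differ slightly):

mixed_precision = True
try:  # Mixed precision training https://github.com/NVIDIA/apex
    from apex import amp
except ImportError:
    mixed_precision = False  # apex not installed; fall back to FP32

With apex absent, the flag stays False and the script skips every amp call.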
@peps0791 Hi, have you solved this problem? I'm running into the same one.
@zhangyilalala the error you're encountering seems to be related to Tkinter, not to mixed precision training or the YOLOv3 code itself. It might be an issue with your environment or a conflict with another library. Make sure you're running your training script from the command line and not within an interactive environment like Jupyter notebooks or IDLE, which can sometimes cause issues with threading and Tkinter. If you're not using Tkinter in your code, ensure there are no background processes or other parts of your code that might be inadvertently invoking it. If the problem persists, isolate the training script in a clean environment to rule out any conflicts.
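If the Tkinter error is coming from matplotlib trying to open its default TkAgg backend on a headless machine (an assumption, since the original traceback isn't shown), forcing a non-GUI backend before pyplot is imported usually resolves it:

import matplotlib
matplotlib.use("Agg")  # non-GUI backend; must be set before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
fig.savefig("results.png")  # render to file instead of a Tk window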
Hi,
Thanks for your work, it's a very nice implementation of YOLOv3.
I have an RTX 2060, and by using mixed precision training I got a speed increase in the training process, since you can almost double the batch size.
I added NVIDIA apex.amp to your code, which is a very easy way to add mixed precision training to a PyTorch model. The code you need to add is:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
And change
loss.backward()
to
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
And with that you are using mixed precision training, and you can almost double the batch size while training on a GPU with tensor cores.
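Putting the two changes together, a minimal end-to-end sketch of an amp training step; the model, data, and loss here are toy placeholders standing in for the real training script:

import torch
from apex import amp

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.rand(8, 10).cuda()
y = torch.rand(8, 1).cuda()
for _ in range(10):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()  # gradients computed on the scaled loss
    optimizer.step()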