Mixed precision training slow #325
Comments
A single V100, or multiple? Also, what level of device utilization are you achieving? For a quick-and-dirty (by no means definitive) check, try watching device utilization while training runs. We've got some people right now working on optimizing BERT specifically. I'll let you know if we observe similar behavior, and detail whatever best practices we discover.
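As a rough illustration (a generic check, not necessarily the exact one suggested here), device utilization can be watched by polling nvidia-smi from a separate process:

```python
# Rough utilization check (assumes nvidia-smi is on PATH): poll GPU
# utilization and memory once per second while the training job runs.
import subprocess
import time

for _ in range(10):
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(out.stdout.strip())
    time.sleep(1)
```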
A single V100.
Update: I tried the code with a 2080 Ti and Docker, and everything works fine: the memory usage is reduced and training is also faster. I hope this helps to find the problem.
Use CUDA 10; Tensor Core computation is much faster with it (2080 Ti).
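To confirm which CUDA and cuDNN builds your PyTorch install is actually using, a quick check is:

```python
import torch

# The CUDA / cuDNN versions this PyTorch build was compiled against,
# plus the visible GPU.
print(torch.version.cuda)              # e.g. "10.0"
print(torch.backends.cudnn.version())  # e.g. 7601
print(torch.cuda.get_device_name(0))   # e.g. "GeForce RTX 2080 Ti"
```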
Recently, I was helping optimize an internal version of BERT with @sharatht, using Amp. We noticed that the dictionary size was not a multiple of 8, which prevented Tensor Core use for the FP16 GEMMs in a particular linear decoder layer, causing that layer to take an annoyingly long time, even with Amp. See #221 (comment). BERT is not RNN-based, but the same concepts apply: to enable Tensor Core use with Amp, you should make sure any dimensions that participate in GEMMs are multiples of 8.
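As an illustration of the multiple-of-8 rule (the numbers below are only examples), the decoder's output dimension can be padded like this:

```python
import torch.nn as nn

hidden_size = 768
vocab_size = 30522   # example dictionary size; not a multiple of 8

# Round up to the next multiple of 8 so the FP16 GEMM in the decoder
# (hidden_size x padded_vocab) is eligible for Tensor Cores.
padded_vocab = ((vocab_size + 7) // 8) * 8   # 30528

decoder = nn.Linear(hidden_size, padded_vocab)
# The extra output columns are never used as targets; they exist only
# to make the weight matrix dimensions Tensor Core friendly.
```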
I have the same issue. My model is this one: https://github.com/kenshohara/3D-ResNets-PyTorch. Activating O1 with Apex gives degraded performance on a 2080 Ti compared to a 1080 Ti, but a dumb .half() cast does not.
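For reference, the two setups being compared look roughly like this (a sketch with a toy stand-in for the 3D ResNet; apex must be installed):

```python
import torch
import torch.nn as nn
from apex import amp

# Toy stand-in for the 3D ResNet from the linked repo.
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Amp O1: patches PyTorch functions so whitelisted ops run in FP16,
# while model weights stay in FP32 and loss scaling is handled dynamically.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# The "dumb" alternative: cast the model and the inputs to FP16 by hand.
# model = model.half()
# inputs = inputs.half()
```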
Hi @hyperfraise, no script in your repo seems to import Apex.
Please see this link with reproducible code: https://github.com/hyperfraise/Apex-bench
I think this is related to pytorch/pytorch#22961.
After profiling via torch.autograd.profiler.profile, I observed the following issue: a significant amount of time is spent on the CPU side during CudnnConvolutionBackward, cudnn_convolution_backward, CudnnBatchNormBackward, and cudnn_batch_norm_backward. Note that I am using half precision (via Apex), and my network uses 3D convolution operations. I use cuDNN 7.6.1, CUDA 10.0, and PyTorch 1.1.0. The GPU is an RTX 2080 Ti. In contrast, a dumb approach which uses .half() only spends a tiny fraction of this time on the CPU side.
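A minimal version of that kind of profiling run, using a toy FP16 3D convolution in place of the real network:

```python
import torch
import torch.nn as nn

# Toy FP16 3D convolution standing in for the real model; shapes are illustrative.
model = nn.Conv3d(3, 32, kernel_size=3, padding=1).cuda().half()
x = torch.randn(2, 3, 8, 56, 56, device="cuda", dtype=torch.half, requires_grad=True)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    out = model(x)
    out.sum().backward()

# Sorting by CPU time shows whether ops like CudnnConvolutionBackward
# spend most of their time on the host rather than on the GPU.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```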
@OValery16 I have the same problem; have you solved it? It seems that the GPU is not used during the backward pass.
Hi,
I'm trying to fine-tune BERT using the Bert fine-tuning example.
My problem is that after enabling Apex, the GPU memory usage is reduced, but the training time is about 1.3 times what it was before.
My GPU is a V100 (16 GB, CUDA 9, cuDNN 7), and my PyTorch version is 1.0.
Is it a problem with my hardware?
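For context, the usual Amp pattern in such a fine-tuning loop looks roughly like the following (a runnable sketch with a tiny stand-in classifier; the real script would build BERT and its optimizer instead):

```python
import torch
import torch.nn as nn
from apex import amp

# Tiny stand-in classifier; the real script would build BERT here.
model = nn.Linear(768, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# O1 casts whitelisted ops to FP16 while keeping weights in FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(32, 768, device="cuda")
labels = torch.randint(0, 2, (32,), device="cuda")

for step in range(3):
    loss = nn.functional.cross_entropy(model(inputs), labels)
    optimizer.zero_grad()
    # scale_loss applies loss scaling so small FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```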