
Can not use tensor cores #221

Closed
vaibhav0195 opened this issue Mar 26, 2019 · 8 comments

Comments

@vaibhav0195

Hi,
I am on an Ubuntu 16.04 machine with an RTX 2080 Ti, using CUDA 10.0, cuDNN 7.4, Python 3.7, and PyTorch 1.0.1.
I converted the model to use Tensor Cores with the amp module, as shown in this example:

https://nvidia.github.io/apex/amp.html

but when I run my Python program under the nvprof profiler as described here:
https://devtalk.nvidia.com/default/topic/1047165/how-to-confirm-whether-tensor-core-is-working-or-not-/

I get:

No events/metrics were profiled.

which, as stated by the moderator, should not occur if my Tensor Cores were being used.
Can anyone help me understand why this is happening? Any help is appreciated.
Thanks

@mcarilli
Contributor

What was the command line you used to run your script under nvprof?

@vaibhav0195
Author

/usr/local/cuda/bin/nvprof --kernels compute_gemm --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization python myscript.py

@hellojialee

Hi, @vaibhav0195, @mcarilli, must we change all the dimensions (N, C, H, W) of a tensor so that they are divisible by 8 before we can make use of Tensor Cores?

@vaibhav0195
Author

@mcarilli I think making just the input and output channels of the conv layers and the batch size multiples of 8 should do the trick.

@mcarilli
Contributor

Convolutions:
For cuDNN versions 7.2 and earlier, @vaibhav0195 is correct: input channels, output channels, and batch size should be multiples of 8 to use Tensor Cores. This requirement is lifted for cuDNN 7.3 and later, so with those versions you don't need to worry about making your channels or batch size multiples of 8 to enable Tensor Core use.
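For the cuDNN <= 7.2 case, the alignment rule amounts to rounding each relevant dimension up to the next multiple of 8. A small sketch (the helper name `pad_to_multiple` is made up here for illustration):

```python
def pad_to_multiple(n: int, multiple: int = 8) -> int:
    """Round n up to the nearest multiple (8 for Tensor Core alignment)."""
    return ((n + multiple - 1) // multiple) * multiple

# e.g. a conv layer with 30 input and 70 output channels would need to be
# widened to 32 and 72 channels for Tensor Cores to engage on cuDNN <= 7.2
print(pad_to_multiple(30))  # 32
print(pad_to_multiple(70))  # 72
print(pad_to_multiple(64))  # 64 (already aligned)
```

In practice you would pick the padded channel counts when defining the model, rather than reshaping tensors at runtime.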

GEMMs (fully connected layers):
For matrix A x matrix B, where A has size [I, J] and B has size [J, K], I, J, and K must all be multiples of 8 to use Tensor Cores. This requirement holds for all cublas and cudnn versions. For a bare fully connected layer, that means the batch size, input features, and output features must be multiples of 8. For RNNs, you usually (though not always; it can depend on the architecture and what you use for the encoder/decoder) need the batch size, hidden size, embedding size, and dictionary size to be multiples of 8.
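The GEMM rule above can be written as a one-line check over the three dimensions (an illustrative predicate, not part of any NVIDIA API):

```python
def gemm_uses_tensor_cores(i: int, j: int, k: int) -> bool:
    """For A[I, J] @ B[J, K] in FP16, all of I, J, K must be multiples of 8."""
    return all(d % 8 == 0 for d in (i, j, k))

# batch=64, in_features=1000, out_features=512: all divisible by 8
print(gemm_uses_tensor_cores(64, 1000, 512))  # True
# out_features=500 breaks the rule (500 % 8 != 0)
print(gemm_uses_tensor_cores(64, 1000, 500))  # False
```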

@hellojialee

@mcarilli Thank you for your clear explanation.

@mcarilli
Contributor

mcarilli commented Mar 30, 2019

It may also help to set
torch.backends.cudnn.benchmark = True
at the top of your script, which enables PyTorch's cudnn autotuner. Each time PyTorch encounters a new set of convolution parameters, it tests all available cudnn algorithms to find the fastest one, then caches that choice and reuses it whenever it encounters the same set of convolution parameters again. The first iteration of your network will be slower while PyTorch benchmarks the cudnn algorithms for each convolution, but the second and later iterations will likely be faster.

@zhenhuahu


Hi, thanks for your detailed explanation. Is the command to enable the autotuner,
torch.backends.cudnn.benchmark = True
specific to Apex, or can we use it in more general cases?
Thanks.
