Problem with AMD benchmark #105

Open
computingdolas opened this issue Jul 26, 2018 · 18 comments

Comments

@computingdolas

I am getting about 1/10 of the FLOP/s on the AMD Vega architecture compared to the numbers in the results folder. Does anybody know why?

@sunway513

Hi @computingdolas, could you provide the versions of your ROCm software environment?

@computingdolas
Author

Thanks @sunway513 for your response. Here is what rocminfo says:

=====================
HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (number of timestamp)
Machine Model: LARGE
System Endianness: LITTLE

@sunway513

Thanks @computingdolas, can you share more information? For example, the output of:
apt list --installed | grep rocm
Or the following on CentOS:
rpm -qa | grep rocm

@computingdolas
Author

Hey @sunway513

See this :)

rocm-clang-ocl-0.3.0_7997136-1.x86_64
rocm-device-libs-0.0.1-1.x86_64
rocm-dev-1.8.192-1.x86_64
rocminfo-1.0.0-1.x86_64
rocm-amdgpu-pro-icd-17.50-552542.el7.x86_64
rocm-opencl-1.2.0-2018071635.x86_64
rocm-libs-1.8.192-1.x86_64
rocm-smi-1.0.0_46_g81ef66f-1.x86_64
rocm-amdgpu-pro-17.50-552542.el7.x86_64
rocm-utils-1.8.192-1.x86_64
rocm-profiler-5.4.6878-g15f6673.x86_64
rocm-opencl-devel-1.2.0-2018071635.x86_64
rocm-dkms-1.8.192-1.x86_64
rocm-amdgpu-pro-opencl-17.50-552542.el7.x86_64
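For comparing two installs, the package list above can be split into name, version, release, and architecture. The following is a minimal sketch, assuming the lines follow the default `rpm -qa` layout `name-version-release.arch`; the `RpmPackage` type and `parse_rpm_name` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class RpmPackage(NamedTuple):
    name: str
    version: str
    release: str
    arch: str


def parse_rpm_name(line: str) -> RpmPackage:
    """Split one 'name-version-release.arch' line (assumed rpm -qa layout)."""
    # The architecture is everything after the last dot.
    nvr, arch = line.strip().rsplit(".", 1)
    # Version and release are the last two hyphen-separated fields;
    # the package name itself may contain hyphens.
    name, version, release = nvr.rsplit("-", 2)
    return RpmPackage(name, version, release, arch)


# A few lines copied from the list above.
lines = [
    "rocm-dev-1.8.192-1.x86_64",
    "rocm-opencl-1.2.0-2018071635.x86_64",
    "rocm-amdgpu-pro-icd-17.50-552542.el7.x86_64",
]

for pkg in map(parse_rpm_name, lines):
    print(f"{pkg.name:24s} version={pkg.version:10s} release={pkg.release}")
```

Running the same parser over the output from a second machine or container would make version differences between the two software stacks easier to spot.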

@sunway513

Great, so you are on the latest ROCm, thanks :-)
I'll try to reproduce your result and update here.

@computingdolas
Author

Here are my results for the gemm benchmark in the code/amd folder. The FLOP count of a matrix multiply is approximately 2*m*n*k, and TFLOPS = (flops/time)/10^12, with time in seconds. For the first case below I am getting around 0.15 TFLOPS, but according to the results folder I should get 1.5 TFLOPS. Please find the data below:
m n k a_t b_t time (usec)
1760 16 1760 n n 644
1760 32 1760 n n 657
1760 64 1760 n n 690
1760 128 1760 n n 692
1760 7000 1760 n n 8626
2048 16 2048 n n 752
2048 32 2048 n n 774
2048 64 2048 n n 881
2048 128 2048 n n 933
2048 7000 2048 n n 11562
2560 16 2560 n n 939
2560 32 2560 n n 963
2560 64 2560 n n 1043
2560 128 2560 n n 1110
2560 7000 2560 n n 17520
4096 16 4096 n n 1523
4096 32 4096 n n 1556
4096 64 4096 n n 1860
4096 128 4096 n n 2314
4096 7000 4096 n n 48839
1760 16 1760 t n 1104
1760 32 1760 t n 1114
1760 64 1760 t n 1173
1760 128 1760 t n 1307
1760 7000 1760 t n 10391
2048 16 2048 t n 1375
2048 32 2048 t n 1404
2048 64 2048 t n 1889
2048 128 2048 t n 2131
2048 7000 2048 t n 14931
2560 16 2560 t n 1853
2560 32 2560 t n 1889
2560 64 2560 t n 2146
2560 128 2560 t n 2324
2560 7000 2560 t n 21081
4096 16 4096 t n 3368
4096 32 4096 t n 3459
4096 64 4096 t n 3660
4096 128 4096 t n 12966
4096 7000 4096 t n 57209
1760 7133 1760 n t 7234
2048 7133 2048 n t 8275
2560 7133 2560 n t 13501
4096 7133 4096 n t 36544
5124 9124 1760 n n 32020
35 8457 1760 n n 985
5124 9124 2048 n n 41212
35 8457 2048 n n 2658
5124 9124 2560 n n 48522
35 8457 2560 n n 1729
5124 9124 4096 n n 82356
35 8457 4096 n n 4522
5124 9124 1760 t n 37142
35 8457 1760 t n 1438
5124 9124 2048 t n 46961
35 8457 2048 t n 2351
5124 9124 2560 t n 54639
35 8457 2560 t n 2441
5124 9124 4096 t n 92358
35 8457 4096 t n 4228
7680 16 2560 n n 989
7680 32 2560 n n 977
7680 64 2560 n n 1162
7680 128 2560 n n 1337
7680 16 2560 t n 2262
7680 32 2560 t n 2257
7680 64 2560 t n 3044
7680 128 2560 t n 3402
3072 16 1024 n n 389
3072 32 1024 n n 399
3072 64 1024 n n 496
3072 128 1024 n n 586
3072 16 1024 t n 882
3072 32 1024 t n 902
3072 64 1024 t n 1034
3072 128 1024 t n 1527
3072 7435 1024 n t 7455
7680 5481 2560 n t 34993
512 8 500000 n n 176088
1024 8 500000 n n 175514
512 16 500000 n n 178471
1024 16 500000 n n 177038
512 8 500000 t n 336153
1024 8 500000 t n 337628
512 16 500000 t n 337572
1024 16 500000 t n 336461
1024 700 512 n n 304
1024 700 512 t n 446
7680 24000 2560 n n 187187
6144 24000 2048 n n 124284
4608 24000 1536 n n 68394
8448 24000 2816 n n 225464
3072 24000 1024 n n 31407
7680 48000 2560 n n 380244
6144 48000 2048 n n 249364
4608 48000 1536 n n 137419
8448 48000 2816 n n 454826
3072 48000 1024 n n 65120
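The conversion from the timings above to TFLOPS can be sketched as follows. This is a minimal illustration of the formula stated in the comment (2*m*n*k floating-point operations divided by the elapsed time); the function name `gemm_tflops` is hypothetical and not part of the benchmark code.

```python
def gemm_tflops(m: int, n: int, k: int, time_usec: float) -> float:
    """Approximate GEMM throughput in TFLOPS from a timing in microseconds."""
    flops = 2 * m * n * k          # one multiply and one add per inner-product term
    seconds = time_usec * 1e-6     # the table reports microseconds
    return flops / seconds / 1e12


# A few rows copied from the table above: (m, n, k, a_t, b_t, time_usec)
rows = [
    (1760, 16, 1760, "n", "n", 644),
    (1760, 7000, 1760, "n", "n", 8626),
    (4096, 7000, 4096, "n", "n", 48839),
]

for m, n, k, a_t, b_t, t in rows:
    print(f"{m:5d} {n:5d} {k:5d} {a_t} {b_t} {gemm_tflops(m, n, k, t):6.2f} TFLOPS")
```

For the first row this gives roughly 0.15 TFLOPS, matching the figure quoted in the comment.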

Thank you for your support @sunway513

@sunway513

Hi @computingdolas, my numbers are very different from yours (much faster); please find them here:
rocm-deepbench.log

To reproduce my numbers, please use the following commands to run the Docker image I've prepared:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video -v $HOME/dockerx:/dockerx'
drun rocm/deepbench:latest

@computingdolas
Author

Hey @sunway513, thank you for your response. Those are nice numbers for AMD GPUs. Any idea why I am getting this issue?

@computingdolas
Author

Just to confirm: are you still using AMD Vega gfx900? I am using an AMD Pro SSG.

@computingdolas
Author

Is it a driver problem? I am really confused now. I am looking for a non-Docker solution, and I want to know what causes these numbers to be so bad.

@sunway513

Hi @computingdolas, my test GPU is an MI25, which is GFX900 based.
Could you first try with the Docker image? If you still get suboptimal performance data, that means your driver stack was not properly configured. If Docker boosts your performance, then it's a userland software issue -- we can take a further look from there.
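The triage described above can be sketched as a small decision function. This is only an illustration of the reasoning (a container brings its own userland libraries but shares the host's kernel driver); the `diagnose` function and its `tolerance` threshold are hypothetical and not part of ROCm or DeepBench.

```python
def diagnose(native_tflops: float, docker_tflops: float,
             reference_tflops: float, tolerance: float = 0.5) -> str:
    """Classify a slowdown by comparing native and Docker runs to a reference.

    tolerance is an assumed cutoff: a run counts as healthy if it reaches
    at least this fraction of the reference throughput.
    """
    threshold = tolerance * reference_tflops
    if native_tflops >= threshold:
        return "native stack OK"
    if docker_tflops >= threshold:
        # Docker fixed it, so the host's userland packages are suspect.
        return "userland software issue"
    # Still slow inside the container: the shared driver stack is suspect.
    return "driver stack misconfigured"


# 0.15 and 1.5 TFLOPS are the figures from this thread;
# the Docker value of 1.4 is made up for the example.
print(diagnose(native_tflops=0.15, docker_tflops=1.4, reference_tflops=1.5))
```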

@dagamayank
Contributor

@computingdolas

I am getting the 1/10 flops/s on the AMD Vega architecture as compared to one mentioned in the results folder.

Please also clarify which GPU you are actually using. I saw a reference to the AMD Pro SSG, and that is NOT one of the supported deep learning AMD GPUs.

@computingdolas
Author

Hi @sunway513, OK, I will try the Docker solution and update you on that :)

@dagamayank Hey, I am using the AMD SSG-PRO, which is based on the Vega 10 XT architecture. Are you saying we have good ROCm support for that GPU? I read the white paper and the data sheet, and I saw many references that mention its capabilities for deep learning. Can you tell me more about this?

Thanks :)

@computingdolas
Author

Correction, @dagamayank: are you saying we don't have good ROCm support for this GPU?

@computingdolas
Author

@sunway513 Is it possible to give me remote access to your AMD MI25 GPU?

@sunway513

@computingdolas, I'm not able to provide public access to the MI25 node. Alternatively, you can try a third-party cloud service that offers VegaFE:
https://www.gpueater.com/

@computingdolas
Author

Hey @sunway513, but https://www.gpueater.com/ doesn't have MI25 GPUs, although it does have the Vega Frontier Edition.

@sunway513

Yes, VegaFE should run ROCm fine, with performance similar to what I've provided in my log for MI25.


3 participants