Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add No Support Label for ROCm GPU in pytorch profiler tutorial #2674

Closed
wants to merge 3 commits into from

Conversation

guptaaryan16
Copy link

@guptaaryan16 guptaaryan16 commented Nov 12, 2023

Fixes #2014

Description

This PR is in relation to the issue of ROCm support in PyTorch Profiler, which I have seen can give some bugs on testing with PyTorch API. The official files in PyTorch/Profiler does not have an option for its support yet as given here: https://github.com/pytorch/pytorch/blob/7f1cbc8b5a7de8794c36833baa47bb1343833589/torch/profiler/profiler.py#L35

ROCm usage can give some unexpected results so right now it's best to mark the usage unsupported. I plan to raise this issue with the PyTorch team as it will be important to support ROCm GPUs as their usage grows.

Checklist

  • The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included into this pull request.

cc @aaronenyeshi @chaekit @jcarreiro @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen

Copy link

pytorch-bot bot commented Nov 12, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2674

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3aa5d06 with merge base dc448c2 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link

pytorch-bot bot commented Nov 12, 2023

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@guptaaryan16
Copy link
Author

@pytorchbot label "docathon-h2-2023"

@guptaaryan16 guptaaryan16 marked this pull request as draft November 14, 2023 18:03
@guptaaryan16 guptaaryan16 marked this pull request as ready for review November 14, 2023 18:03
@@ -7,7 +7,7 @@
Introduction
------------
PyTorch 1.8 includes an updated profiler API capable of
recording the CPU side operations as well as the CUDA kernel launches on the GPU side.
recording the CPU side operations as well as the CUDA kernel launches on the GPU side (``AMD ROCm™`` GPUs are not supported).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffdaily is this indeed the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why are you adding a backtricks here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROCm has full upstream support of both modern pytorch profiling via kineto + ROCm's libroctracer, as well as the older autograd profiler via ROCm's roctx.

Run any application, pytorch or otherwise, using ROCm's rocprof. The PyTorch GPU traces will be collected as part of your rocprof output.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malfet I quoted this function supported_activities in official PyTorch API which does not mention anything about ROCm GPUs. I suppose they should have something here for the ROCm profiling if that's the case.

Also the backtricks is due to an error in PySpelling test where ROCm is not in dictionary, so the test caught it as a misspelled word. I suppose I can either do this or add it in a custom dictionary list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update that comment. ROCm profiler uses roctracer library to trace on-device HIP kernels. Passing ProfilerActivity.CUDA is the correct activity to use for both CUDA and ROCm due to ROCm PyTorch's strategy of "masquerading" as CUDA so that users do not have to make any changes to their pytorch models when running on ROCm. PyTorch chose to expose "CUDA" in public APIs and we chose to reuse them to make the transition to ROCm from CUDA easier on users.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To further substantiate the claim, can you please specify in PR description why your are making it (I.e. you've tried to run the tutorial on GPU such-n-such using PyTorch-X.Y.Z and instead of certain output you got something else. Because right now you just add a blank statement, which would be very hard to verify

@guptaaryan16
Copy link
Author

guptaaryan16 commented Nov 14, 2023

@malfet I was testing the issue #2014 but since I did not have specific hardware(i.e. AMD GPU) for the testing the trace I assumed that the person who raised the issue has provided a correct trace image which seems to give wrong output. Since the trace is fine for CPU and CUDA (on testing), and there as no mention of ROCm in profiler API if they are supported, I raised this PR. Can you verify this tutorial and https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html on ROCm and see if the output trace is correct? Seems like it is off in time for about 250ms which is not the case for CPU and CUDA devices. Thanks @jeffdaily

@jeffdaily
Copy link

I'm assigning this to @hongxiayang to verify.

@hongxiayang
Copy link
Contributor

hongxiayang commented Nov 15, 2023

I'm assigning this to @hongxiayang to verify.

#2684

@guptaaryan16
Copy link
Author

Ok seems like this issue is resolved, so I am closing this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

profiler ROCm tracing result was wrong
5 participants