Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MIGraphX EP] Fix CopyTensorAsync and add guards for stream sync CopyTensors #16787

Merged
merged 3 commits into from
Jul 22, 2023

Conversation

TedThemistokleous
Copy link
Contributor

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following

  • remove excess hipStreamSyncronize to nullstream on CopyTensor calls
  • Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur.

Description

Remove excess syncronization when stream sync is enabled but also block the DeviceTohost CopyTensorAsync() to the desired GPU stream.

Motivation and Context

Without this change we fail to properly wait for a kernel to compute and synchronize correctly if memory isn't specifically pinned for the task. This was observed when doing another run with test_parity_gelu and test_parity_layernorm tests.

image

becomes

image

…Tensors()

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for
adding the following

- remove excess hipStreamSyncronize to nullstream on CopyTensor calls
- Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for cards that
don't use pinned memory thus causing hipMemcpy() calls to occur before certain
kernel operations occur.
@TedThemistokleous
Copy link
Contributor Author

ping @cloudhan @PeixuanZuo . This is related to the issue I found earlier. #16774

@cloudhan
Copy link
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@cloudhan
Copy link
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

This is already handled in the EP as end of run performs:

OnRunEnd()->hipStreamQuery()->hipStreamSyncronize()

as well as

Sync()->hipStreamSync()

Also after every hipMemCpy() we perform a hipStreamSyncronize(stream)
@TedThemistokleous
Copy link
Contributor Author

TedThemistokleous commented Jul 21, 2023

@cloudhan @PeixuanZuo @ytaous

let me know if you need anything further for this or there are any additional concerns.

@cloudhan
Copy link
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@cloudhan
Copy link
Member

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 8 pipeline(s).

@TedThemistokleous
Copy link
Contributor Author

@cloudhan, looks like this test isn't building for some reason

4: Test command: 
4: Working Directory: D:/a/onnxruntime/onnxruntime/build/Release/_deps/tvm-build
4/4 Test #4: cpptest_NOT_BUILT ................***Not Run   0.00 sec

75% tests passed, 1 tests failed out of 4

@cloudhan cloudhan merged commit 488544b into microsoft:main Jul 22, 2023
65 of 66 checks passed
@TedThemistokleous TedThemistokleous deleted the migx_gpu_async_copy_fix branch July 24, 2023 13:54
TedThemistokleous added a commit to TedThemistokleous/onnxruntime that referenced this pull request Jul 25, 2023
…Tensors (microsoft#16787)

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC
for adding the following

- remove excess hipStreamSyncronize to nullstream on CopyTensor calls
- Add proper call for stream synchronized CopyTensorAsync for
DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for
cards that don't use pinned memory thus causing hipMemcpy() calls to
occur before certain kernel operations occur.

![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b)

becomes

![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d)

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
TedThemistokleous added a commit to TedThemistokleous/onnxruntime that referenced this pull request Jul 25, 2023
…Tensors (microsoft#16787)

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC
for adding the following

- remove excess hipStreamSyncronize to nullstream on CopyTensor calls
- Add proper call for stream synchronized CopyTensorAsync for
DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for
cards that don't use pinned memory thus causing hipMemcpy() calls to
occur before certain kernel operations occur.

![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b)

becomes

![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d)

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
jeffdaily pushed a commit to ROCm/onnxruntime that referenced this pull request Jul 25, 2023
…Tensors (microsoft#16787) (#13)

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC
for adding the following

- remove excess hipStreamSyncronize to nullstream on CopyTensor calls
- Add proper call for stream synchronized CopyTensorAsync for
DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for
cards that don't use pinned memory thus causing hipMemcpy() calls to
occur before certain kernel operations occur.

![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b)

becomes

![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d)

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
TedThemistokleous added a commit to TedThemistokleous/onnxruntime that referenced this pull request Jul 26, 2023
…Tensors (microsoft#16787)

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC
for adding the following

- remove excess hipStreamSyncronize to nullstream on CopyTensor calls
- Add proper call for stream synchronized CopyTensorAsync for
DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for
cards that don't use pinned memory thus causing hipMemcpy() calls to
occur before certain kernel operations occur.

![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b)

becomes

![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d)

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
jchen351 pushed a commit that referenced this pull request Aug 12, 2023
…Tensors (#16787)

Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC
for adding the following

- remove excess hipStreamSyncronize to nullstream on CopyTensor calls
- Add proper call for stream synchronized CopyTensorAsync for
DeviceToHost case

Without this change subsequent CopyTensorAsync() calls will fail for
cards that don't use pinned memory thus causing hipMemcpy() calls to
occur before certain kernel operations occur.

![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b)

becomes

![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d)

---------

Co-authored-by: Ted Themistokleous <tthemist@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants