
Onnx GPU runtime fails to fallback to CPU when GPU is not available/busy #5304

Merged (6 commits) Oct 3, 2020

Conversation

vladbph
Contributor

@vladbph vladbph commented Sep 26, 2020

Description:
Handle the "GPU not available/busy" exception in the InferenceSession constructor by falling back to the default providers (CPU, etc.). This fallback already exists in the run() method, but not in the constructor, where the exception happens in the first place. The changes introduce consistent behaviour.
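The shape of the change can be sketched as follows. This is a minimal illustration, not the actual onnxruntime code: `_create_session` and `_gpu_available` are hypothetical stand-ins for the native session construction, and the fallback set is simplified to CPU only.

```python
def _gpu_available():
    # Pretend the GPU is busy/unavailable for this sketch.
    return False

class InferenceSession:
    # Simplified fail-safe set; the real run() fallback also retries CUDA.
    _fallback_providers = ["CPUExecutionProvider"]

    def __init__(self, path, providers=None):
        try:
            # First attempt: honour the user-requested providers (e.g. GPU).
            self._create_session(path, providers)
        except RuntimeError:
            # GPU unavailable/busy: retry with the fallback set, matching
            # the behaviour already present in run().
            self._create_session(path, self._fallback_providers)

    def _create_session(self, path, providers):
        # Stand-in for the real native session construction.
        if providers and "CUDAExecutionProvider" in providers and not _gpu_available():
            raise RuntimeError("CUDA device unavailable or busy")
        self.providers = providers or ["CPUExecutionProvider"]
```

With the GPU "busy", constructing a session that requests CUDA silently lands on the CPU provider instead of raising.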

Motivation and Context

@vladbph vladbph requested a review from a team as a code owner September 26, 2020 19:04
@ghost

ghost commented Sep 26, 2020

CLA assistant check
All CLA requirements met.

Contributor

@pranavsharma pranavsharma left a comment


Thanks for your contribution.

    for i, provider in enumerate(providers):
        if provider in self._fallback_providers:
            fallback_providers.append(provider)
        try:
Contributor


Do you need a try/catch here? Would if i < len(provider_options) not work?
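The bounds check the reviewer suggests would look something like this. Illustrative only: `providers` and `provider_options` are assumed to be parallel lists, as in the Python binding, with options possibly shorter than the provider list.

```python
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
provider_options = [{"device_id": "0"}]  # options supplied only for the first provider

fallback_providers = []
for i, provider in enumerate(providers):
    # Explicit bounds check instead of a try/except around provider_options[i]:
    if i < len(provider_options):
        fallback_providers.append((provider, provider_options[i]))
    else:
        fallback_providers.append((provider, {}))
```

Both forms behave the same for an IndexError; the bounds check just makes the "options list may be shorter" case explicit.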

Member


Not sure if all this logic is really needed. It would be simpler if we just fell back to the hardcoded set of providers (same as the Run() path).
The goal of the fallback path is to activate a fail-safe execution mode, so it's best to fall back to CPU if session creation or an inference run fails (because ORT guarantees the CPU and CUDA providers should always work).
The other providers, not so much.
Incorporating the user-specified providers list would not provide such guarantees.

Contributor


@jywu-msft Given this check, "if provider in self._fallback_providers:", I thought this method would always fall back to only the hardcoded set of providers. The objective of this logic is to preserve the provider options, which get lost when we fall back in the run method.

Member

@jywu-msft jywu-msft Sep 29, 2020


@jywu-msft Given this check, "if provider in self._fallback_providers:", I thought this method would always fall back to only the hardcoded set of providers. The objective of this logic is to preserve the provider options, which get lost when we fall back in the run method.

That's true, on further inspection. I guess it's also up to the user to provide fallbacks in the correct order.
I wonder if we should separate these cases:
if the user provides the order and provider options, we go with the user-provided information and treat it as an override;

if the user doesn't provide any explicit providers/options, we fall back to the hardcoded set.

Currently the logic can mix both?

Contributor Author


Either way is fine with me; I just thought to bring this issue (the user's order) up for consideration. Just let me know what the common ground is...

Contributor


@jywu-msft and I had a discussion. We think it's best to keep the behavior simple and consistent between run and session ctor. Hence, let's just use the hardcoded fallback providers.
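The agreed behaviour reduces to a simple retry wrapper, sketched below. This is a hypothetical illustration, not the merged code: `create_with_fallback` and `flaky_create` are stand-ins, and the point is that the user-specified list is deliberately discarded on failure so the constructor and run() behave the same way.

```python
# Hardcoded fail-safe set, kept identical between run() and the constructor.
FALLBACK_PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]

def create_with_fallback(create_fn, requested_providers):
    """Try the requested providers; on failure, retry with the fixed set."""
    try:
        return create_fn(requested_providers)
    except RuntimeError:
        # User-specified order/options are deliberately discarded here:
        # simple, and consistent with the run() fallback path.
        return create_fn(FALLBACK_PROVIDERS)

def flaky_create(providers):
    # Stand-in session factory: pretend only TensorRT initialisation fails.
    if "TensorrtExecutionProvider" in providers:
        raise RuntimeError("TensorRT init failed")
    return providers
```

A TensorRT request that fails to initialise lands on the hardcoded CUDA/CPU set rather than some user-reordered variant.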

@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@vladbph
Contributor Author

vladbph commented Sep 29, 2020 via email

…rder, IF they are included into providers list.
pranavsharma previously approved these changes Oct 1, 2020
@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@pranavsharma
Contributor

pranavsharma commented Oct 1, 2020

Thank you. I know nothing about coding. God bless

What's the reason for requesting changes for this PR?

@vladbph vladbph changed the title Onnx GPU runtime fails to fallback to CPU when GPU is not available OR busy Onnx GPU runtime fails to fallback to CPU when GPU is not available/busy Oct 1, 2020
@jywu-msft
Member

looks like the test pipelines are failing due to PEP8 check failure.
see: https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style

@vladbph
Contributor Author

vladbph commented Oct 1, 2020

looks like the test pipelines are failing due to PEP8 check failure.
see: https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style

CI pipeline is not very informative. What is the exact failure?

@pranavsharma
Contributor

looks like the test pipelines are failing due to PEP8 check failure.
see: https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style

CI pipeline is not very informative. What is the exact failure?

It's not meeting the coding guidelines. Please see the link above to rectify.

@vladbph
Contributor Author

vladbph commented Oct 2, 2020 via email

@pranavsharma
Contributor

pranavsharma commented Oct 2, 2020

I don't see anything wrong with the submitted 5 lines of code

     [flake8 PEP8 ERROR] C:/a/1/s/onnxruntime/python/session.py:197:9: E722 do not use bare 'except'

@vladbph
Contributor Author

vladbph commented Oct 2, 2020

I don't see anything wrong with the submitted 5 lines of code

     [flake8 PEP8 ERROR] C:/a/1/s/onnxruntime/python/session.py:197:9: E722 do not use bare 'except'

thanks, update pushed
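The E722 fix is mechanical: name the exception you expect instead of using a bare except, which would also swallow KeyboardInterrupt and SystemExit. A generic illustration (not the actual session.py code; `load_options` is a made-up helper):

```python
def load_options(options, i):
    # flake8 E722 flags a bare "except:" here; catching the specific
    # expected failure keeps unrelated exceptions visible.
    try:
        return options[i]
    except IndexError:
        return {}
```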

@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@vladbph
Contributor Author

vladbph commented Oct 2, 2020

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

It is failing again... and no reason is provided. Is there any way to run the pre-checks locally?

@jywu-msft
Member

[flake8 PEP8 ERROR] /onnxruntime_src/onnxruntime/python/session.py:233:5: E303 too many blank lines (2)
CMakeFiles/pep8_check.dir/build.make:87: recipe for target 'CMakeFiles/pep8_check' failed
CMakeFiles/Makefile2:400: recipe for target 'CMakeFiles/pep8_check.dir/all' failed
make[2]: *** [CMakeFiles/pep8_check] Error 1

My suggestion is to follow the instructions in https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style
to resolve the PEP8 issues.

@vladbph
Contributor Author

vladbph commented Oct 2, 2020 via email

@pranavsharma
Contributor

It would be way more productive if CI showed the actual error instead of a cmake exit code of 1... don't you think?

You can always click on the CI run and see what error it is throwing.

@vladbph
Contributor Author

vladbph commented Oct 2, 2020 via email

@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@jywu-msft
Member

/azp run centos7_cpu

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jywu-msft jywu-msft merged commit c20fcf2 into microsoft:master Oct 3, 2020