ONNX GPU runtime fails to fall back to CPU when GPU is not available/busy #5304
Conversation
Thanks for your contribution.
onnxruntime/python/session.py (Outdated)

@@ -192,9 +192,40 @@ def __init__(self, path_or_bytes, sess_options=None, providers=None, provider_options=None):
         self._enable_fallback = True
         self._read_config_from_model = os.environ.get('ORT_LOAD_CONFIG_FROM_MODEL') == '1'

-        self._create_inference_session(providers, provider_options)
+        try:
+            self._create_inference_session(providers, provider_options)
+        except:
+            if self._enable_fallback:
+                # Collect fallback providers matching user's order
+                fallback_providers = []
+                fallback_providers_options = []
+                providers = providers or []
+
+                # Are there any user providers from the default fallback list?
+                for i, provider in enumerate(providers):
+                    if provider in self._fallback_providers:
+                        fallback_providers.append(provider)
+                        try:
Do you need a try/catch here? Would `if i < len(provider_options)` not work?
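As a hedged sketch of the suggested alternative (the standalone function wrapper and the `{}` default are assumptions added for illustration; only the bounds check itself comes from the comment):

```python
def collect_fallback(providers, provider_options, fallback_set):
    """Hypothetical standalone version of the loop under review, using an
    explicit bounds check instead of a try/except around provider_options[i]."""
    providers = providers or []
    provider_options = provider_options or []
    fallback_providers = []
    fallback_providers_options = []
    for i, provider in enumerate(providers):
        if provider in fallback_set:
            fallback_providers.append(provider)
            # The bounds check replaces catching IndexError; an empty
            # options dict is an assumed default when the user gave none.
            if i < len(provider_options):
                fallback_providers_options.append(provider_options[i])
            else:
                fallback_providers_options.append({})
    return fallback_providers, fallback_providers_options
```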
Not sure if all this logic is really needed. It would be simpler if we just fall back to the hardcoded set of providers (same as the Run() path).
The goal of the fallback path is to activate a fail-safe execution mode, so it's best to fall back to CPU if session creation or an inference run fails (ORT guarantees the CPU and CUDA providers should always work; the other providers, not so much).
Incorporating the user-specified providers list would not provide such guarantees.
@jywu-msft Given this check, `if provider in self._fallback_providers:`, I thought this method will always fall back to only the hardcoded set of providers. The objective of this logic is to preserve the provider options, which get lost when we fall back in the run method.
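To make that objective concrete, here is a hypothetical run of the `collect_fallback` sketch above (the provider names are real ORT identifiers; the option values are made up):

```python
fallback_set = {"CUDAExecutionProvider", "CPUExecutionProvider"}
providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]
provider_options = [{}, {"device_id": "1"}]

fp, fpo = collect_fallback(providers, provider_options, fallback_set)
print(fp)   # ['CUDAExecutionProvider']
print(fpo)  # [{'device_id': '1'}]  -- the user's CUDA options survive
```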
> @jywu-msft Given this check, `if provider in self._fallback_providers:`, I thought this method will always fall back to only the hardcoded set of providers. The objective of this logic is to preserve the provider options, which get lost when we fall back in the run method.
That's true upon further inspection. I guess it's also up to the user to provide fallbacks in the correct order.
I wonder if we should separate out these cases: if the user provides the order and provider options, we go with the user-provided information and treat it as an override; if the user doesn't provide any explicit providers/options, then we fall back to the hardcoded set.
Currently the logic can mix both?
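The split being proposed might look roughly like this (a hedged sketch; `pick_fallback` and `DEFAULT_FALLBACK` are hypothetical names, not ORT API, and it sidesteps the question of when each case should actually trigger a retry):

```python
# Hypothetical helper sketching the two-case split discussed above.
DEFAULT_FALLBACK = ["CUDAExecutionProvider", "CPUExecutionProvider"]

def pick_fallback(providers, provider_options):
    if providers:
        # Case 1: the user gave an explicit list -> treat it as an
        # override and keep the user's order and options untouched.
        return providers, provider_options
    # Case 2: nothing explicit -> fail safe with the hardcoded set.
    return DEFAULT_FALLBACK, None
```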
Either way is fine with me; I just thought to bring this issue (the user's order) up for consideration. Just let me know what the common ground is...
@jywu-msft and I had a discussion. We think it's best to keep the behavior simple and consistent between run and session ctor. Hence, let's just use the hardcoded fallback providers.
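A minimal sketch of the agreed behavior (names mirror the diff above, but this is illustrative, not the actual onnxruntime implementation):

```python
class SketchSession:
    """Illustrative only: the ctor falls back exactly like run() does."""

    _fallback_providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

    def __init__(self, path_or_bytes, sess_options=None,
                 providers=None, provider_options=None):
        self._enable_fallback = True
        try:
            self._create_inference_session(providers, provider_options)
        except Exception:
            if not self._enable_fallback:
                raise
            # Fail-safe: retry with the hardcoded set, same as the run()
            # path; user-supplied provider options are deliberately dropped.
            self._create_inference_session(self._fallback_providers, None)

    def _create_inference_session(self, providers, provider_options):
        # Stand-in for the real native session creation.
        pass
```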
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
Initially those were my exact thoughts as well, but then I decided to put in a future-proof solution where you might have a few fallbacks and the user can control priorities. Anyway, it is totally fine with me. I'll push the simplified version tomorrow.
…rder, IF they are included in the providers list.
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
What's the reason for requesting changes for this PR?
Looks like the test pipelines are failing due to a PEP8 check failure; see https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style
The CI pipeline is not very informative. What is the exact failure?
It's not meeting the coding guidelines. Please see the link above to rectify.
I don't see anything wrong with the submitted 5 lines of code.
Thanks, update pushed.
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 4 pipeline(s).
It is failing again... and no reason is provided. Is there any way to run the pre-checks locally?
[flake8 PEP8 ERROR] /onnxruntime_src/onnxruntime/python/session.py:233:5: E303 too many blank lines (2)
CMakeFiles/pep8_check.dir/build.make:87: recipe for target 'CMakeFiles/pep8_check' failed
CMakeFiles/Makefile2:400: recipe for target 'CMakeFiles/pep8_check.dir/all' failed
make[2]: *** [CMakeFiles/pep8_check] Error 1
My suggestion is to follow the instructions in https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style to resolve the PEP8 issues.
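To reproduce this gate locally (a hedged note: the coding-conventions doc linked above has the exact flake8 configuration the CI uses; this is just the generic invocation against the file the error names):

```
pip install flake8
python -m flake8 onnxruntime/python/session.py
```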
It would be way more productive if CI showed the actual error instead of cmake exit code 1... don't you think?
You can always click on the CI run and see what error it is throwing.
Hmmm, I did, and it only shows cmake exit code 1.
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 4 pipeline(s).
/azp run centos7_cpu
Azure Pipelines successfully started running 1 pipeline(s).
Description:
Handle the "GPU not available/busy" exception in the ctor of InferenceSession and fall back to the default (CPU, etc.) providers. This functionality is present in the run() method, but not in the ctor, where the exception happens in the first place. The changes introduce consistent behaviour.
Motivation and Context
ONNX GPU runtime fails to fall back to CPU when GPU is not available or busy. #5299
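For illustration, a hedged end-user view of the fix (assumes a CUDA-enabled onnxruntime build; `model.onnx` is a placeholder, and the exact post-fallback provider list depends on the build):

```python
import onnxruntime as ort

# Before this change, session construction raised if the CUDA provider
# failed to initialize (GPU absent or busy). With the fix, the ctor
# falls back to the hardcoded providers, consistent with run().
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # e.g. ['CPUExecutionProvider'] after fallback
```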