
Onnx GPU runtime fails to fallback to CPU when GPU is not available/busy #5304

Merged (6 commits) Oct 3, 2020

Conversation

vladbph
Contributor

@vladbph vladbph commented Sep 26, 2020

Description:
Handle the "GPU not available/busy" exception in the InferenceSession constructor by falling back to the default providers (CPU, etc.). This fallback already exists in the run() method, but not in the constructor, where the exception happens in the first place. The changes introduce consistent behaviour.
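The shape of the change can be sketched as follows. This is a minimal illustration, not the actual onnxruntime code: `_create_session` and `_gpu_available` are hypothetical stand-ins for the native session construction, and the fallback set is simplified to CPU only.

```python
def _gpu_available():
    # Pretend the GPU is busy/unavailable for this sketch.
    return False

class InferenceSession:
    # Simplified fail-safe set; the real run() fallback also retries CUDA.
    _fallback_providers = ["CPUExecutionProvider"]

    def __init__(self, path, providers=None):
        try:
            # First attempt: honour the user-requested providers (e.g. GPU).
            self._create_session(path, providers)
        except RuntimeError:
            # GPU unavailable/busy: retry with the fallback set, matching
            # the behaviour already present in run().
            self._create_session(path, self._fallback_providers)

    def _create_session(self, path, providers):
        # Stand-in for the real native session construction.
        if providers and "CUDAExecutionProvider" in providers and not _gpu_available():
            raise RuntimeError("CUDA device unavailable or busy")
        self.providers = providers or ["CPUExecutionProvider"]
```

With the GPU "busy", constructing a session that requests CUDA silently lands on the CPU provider instead of raising.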

Motivation and Context

@vladbph vladbph requested a review from a team as a code owner September 26, 2020 19:04
@ghost

ghost commented Sep 26, 2020

CLA assistant check
All CLA requirements met.

Contributor

@pranavsharma pranavsharma left a comment


Thanks for your contribution.

    for i, provider in enumerate(providers):
        if provider in self._fallback_providers:
            fallback_providers.append(provider)
        try:
Contributor


Do you need a try/catch here? Would if i < len(provider_options) not work?
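The bounds check the reviewer suggests would look something like this. Illustrative only: `providers` and `provider_options` are assumed to be parallel lists, as in the Python binding, with options possibly shorter than the provider list.

```python
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
provider_options = [{"device_id": "0"}]  # options supplied only for the first provider

fallback_providers = []
for i, provider in enumerate(providers):
    # Explicit bounds check instead of a try/except around provider_options[i]:
    if i < len(provider_options):
        fallback_providers.append((provider, provider_options[i]))
    else:
        fallback_providers.append((provider, {}))
```

Both forms behave the same for an IndexError; the bounds check just makes the "options list may be shorter" case explicit.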

Member


Not sure if all this logic is really needed. It would be simpler if we just fell back to the hardcoded set of providers (same as the Run() path).
The goal of the fallback path is to activate a fail-safe execution mode, so it's best to fall back to CPU if session creation or an inference run fails (because ORT guarantees the CPU and CUDA providers should always work).
The other providers, not so much.
Incorporating the user-specified providers list would not provide such guarantees.

Contributor


@jywu-msft Given this check, "if provider in self._fallback_providers:", I thought this method would always fall back to only the hardcoded set of providers. The objective of this logic is to preserve the provider options, which get lost when we fall back in the run method.

Member

@jywu-msft jywu-msft Sep 29, 2020


@jywu-msft Given this check, "if provider in self._fallback_providers:", I thought this method would always fall back to only the hardcoded set of providers. The objective of this logic is to preserve the provider options, which get lost when we fall back in the run method.

That's true, on further inspection. I guess it's also up to the user to provide fallbacks in the correct order.
I wonder if we should separate these cases:
if the user provides the order and provider options, we go with the user-provided information and treat it as an override;

if the user doesn't provide any explicit providers/options, we fall back to the hardcoded set.

Currently the logic can mix both?

Contributor Author


Either way is fine with me; I just thought to bring this issue (the user's order) up for consideration. Just let me know what the common ground is...

Contributor


@jywu-msft and I had a discussion. We think it's best to keep the behavior simple and consistent between run and session ctor. Hence, let's just use the hardcoded fallback providers.
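The agreed behaviour reduces to a simple retry wrapper, sketched below. This is a hypothetical illustration, not the merged code: `create_with_fallback` and `flaky_create` are stand-ins, and the point is that the user-specified list is deliberately discarded on failure so the constructor and run() behave the same way.

```python
# Hardcoded fail-safe set, kept identical between run() and the constructor.
FALLBACK_PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]

def create_with_fallback(create_fn, requested_providers):
    """Try the requested providers; on failure, retry with the fixed set."""
    try:
        return create_fn(requested_providers)
    except RuntimeError:
        # User-specified order/options are deliberately discarded here:
        # simple, and consistent with the run() fallback path.
        return create_fn(FALLBACK_PROVIDERS)

def flaky_create(providers):
    # Stand-in session factory: pretend only TensorRT initialisation fails.
    if "TensorrtExecutionProvider" in providers:
        raise RuntimeError("TensorRT init failed")
    return providers
```

A TensorRT request that fails to initialise lands on the hardcoded CUDA/CPU set rather than some user-reordered variant.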

@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@vladbph
Contributor Author

vladbph commented Sep 29, 2020 via email

…rder, IF they are included into providers list.
pranavsharma previously approved these changes Oct 1, 2020
@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@pranavsharma
Contributor

pranavsharma commented Oct 1, 2020

Thank you. I know nothing about coding. God bless

What's the reason for requesting changes for this PR?

@vladbph vladbph changed the title Onnx GPU runtime fails to fallback to CPU when GPU is not available OR busy Onnx GPU runtime fails to fallback to CPU when GPU is not available/busy Oct 1, 2020
@jywu-msft
Member

looks like the test pipelines are failing due to PEP8 check failure.
see: https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style

@vladbph
Contributor Author

vladbph commented Oct 1, 2020

looks like the test pipelines are failing due to PEP8 check failure.
see: https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style

CI pipeline is not very informative. What is the exact failure?

@pranavsharma
Contributor

looks like the test pipelines are failing due to PEP8 check failure.
see: https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style

CI pipeline is not very informative. What is the exact failure?

It's not meeting the coding guidelines. Please see the link above to rectify.

@vladbph
Contributor Author

vladbph commented Oct 2, 2020 via email

@pranavsharma
Contributor

pranavsharma commented Oct 2, 2020

I don't see anything wrong with the submitted 5 lines of code

     [flake8 PEP8 ERROR] C:/a/1/s/onnxruntime/python/session.py:197:9: E722 do not use bare 'except'

@vladbph
Contributor Author

vladbph commented Oct 2, 2020

I don't see anything wrong with the submitted 5 lines of code

     [flake8 PEP8 ERROR] C:/a/1/s/onnxruntime/python/session.py:197:9: E722 do not use bare 'except'

thanks, update pushed
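The E722 fix is mechanical: name the exception you expect instead of using a bare except, which would also swallow KeyboardInterrupt and SystemExit. A generic illustration (not the actual session.py code; `load_options` is a made-up helper):

```python
def load_options(options, i):
    # flake8 E722 flags a bare "except:" here; catching the specific
    # expected failure keeps unrelated exceptions visible.
    try:
        return options[i]
    except IndexError:
        return {}
```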

@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@vladbph
Contributor Author

vladbph commented Oct 2, 2020

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

It is failing again... and no reason is provided. Is there any way to run the pre-checks locally?

@jywu-msft
Member

[flake8 PEP8 ERROR] /onnxruntime_src/onnxruntime/python/session.py:233:5: E303 too many blank lines (2)
CMakeFiles/pep8_check.dir/build.make:87: recipe for target 'CMakeFiles/pep8_check' failed
CMakeFiles/Makefile2:400: recipe for target 'CMakeFiles/pep8_check.dir/all' failed
make[2]: *** [CMakeFiles/pep8_check] Error 1

My suggestion is to follow the instructions in https://github.com/microsoft/onnxruntime/blob/master/docs/Coding_Conventions_and_Standards.md#python-code-style
to resolve the PEP8 issues.

@vladbph
Contributor Author

vladbph commented Oct 2, 2020 via email

@pranavsharma
Contributor

It would be way more productive if CI showed the actual error instead of a cmake exit code of 1... don't you think?

You can always click on the CI run and see what error it is throwing.

@vladbph
Contributor Author

vladbph commented Oct 2, 2020 via email

@pranavsharma
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@pranavsharma
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-mac-ci-pipeline,Linux OpenVINO CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@jywu-msft
Member

/azp run centos7_cpu

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jywu-msft jywu-msft merged commit c20fcf2 into microsoft:master Oct 3, 2020