
ONNX Inference Speed extremely slow compared to .pt Model #4808

Closed
shrijan00 opened this issue Sep 15, 2021 · 30 comments · Fixed by #5087
Labels
question Further information is requested

Comments

@shrijan00

Hi,
I tried to run inference on an image of resolution 1024*1536 using the ONNX model and the .pt model.
As you can see in the screenshot below, there is a huge time difference between the two cases.

(screenshot: ONNX vs .pt inference timing)

Any reason for this?

@shrijan00 shrijan00 added the question Further information is requested label Sep 15, 2021
@glenn-jocher
Member

@shrijan00 ONNX models run on CPU

@shrijan00
Author

ONNX models run on CPU
I tried adding this in detect.py:
session = onnxruntime.InferenceSession(w, None, providers='CUDAExecutionProvider')
but it doesn't seem to work. Is there another way to run ONNX on GPU?
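
For reference, onnxruntime expects providers as a list of strings rather than a single string, so a minimal sketch (assuming the onnxruntime-gpu package is installed and w is the model path) would be:

import onnxruntime
session = onnxruntime.InferenceSession(w, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
print(session.get_providers())  # lists the providers actually registered for this session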

@glenn-jocher
Member

@shrijan00 I don't know, but if you find a good solution make sure to submit a PR to help others run ONNX on GPU!

@zhaojun060708

@glenn-jocher Why does the exported ONNX model not support GPU?

@happyday-lkj

Have you solved the problem?

@callbarian

In detect.py, change this line:

check_requirements(('onnx', 'onnxruntime'))

to

check_requirements(('onnx', 'onnxruntime-gpu'))

so that the code does not install onnxruntime, which is the CPU-only package.

Make sure you have installed CUDA and cuDNN to use onnxruntime-gpu.

@glenn-jocher
Member

@callbarian thanks for the pointer! I didn't know about the -gpu package. We should make the requirements check conditional on the hardware then, with GPU-enabled systems installing -gpu automatically. I'll submit a PR for this fix.

TODO: Install onnxruntime-gpu automatically if the user has a CUDA-enabled system.
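
A minimal sketch of such a conditional check (illustrative only, not necessarily the exact code that will land in the PR):

import torch
# pick the ONNX Runtime package matching the available hardware
check_requirements(('onnx', 'onnxruntime-gpu' if torch.cuda.is_available() else 'onnxruntime'))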

@glenn-jocher
Member

glenn-jocher commented Oct 7, 2021

@callbarian I opened a PR #5087 for this, but testing this PR does not show improved ONNX inference speeds even after installing onnxruntime-gpu. Are there additional steps required for ONNX to use your GPU? There's a cryptic warning message about CUDA/CPU ExecutionProvider. This is in Colab.

!python export.py --weights yolov5s.pt --include onnx --dynamic --simplify

!python detect.py --weights yolov5s.onnx

detect: weights=['yolov5s.onnx'], source=data/images, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False
YOLOv5 🚀 v5.0-498-g16f413b torch 1.9.0+cu111 CUDA:0 (Tesla P100-PCIE-16GB, 16280.875MB)

/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:353: UserWarning: Deprecation warning. This ORT build has ['CUDAExecutionProvider', 'CPUExecutionProvider'] enabled. The next release (ORT 1.10) will require explicitly setting the providers parameter (as opposed to the current behavior of providers getting set/registered by default based on the build flags) when instantiating InferenceSession.For example, onnxruntime.InferenceSession(..., providers=["CUDAExecutionProvider"], ...)
  "based on the build flags) when instantiating InferenceSession."

image 1/2 /content/yolov5/data/images/bus.jpg: 640x640 4 class0s, 1 class5, Done. (0.846s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 640x640 2 class0s, 2 class27s, Done. (0.262s)
Speed: 2.0ms pre-process, 554.0ms inference, 1.8ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp3
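
As the warning suggests, passing providers explicitly also makes it easy to verify whether the GPU provider is present in the installed build; a quick sketch (assuming onnxruntime-gpu is installed):

import onnxruntime as ort
print(ort.get_available_providers())  # a -gpu build should list 'CUDAExecutionProvider'
session = ort.InferenceSession('yolov5s.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
print(session.get_providers())  # providers actually registered for this session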

@glenn-jocher
Member

@callbarian can you comment on PR #5087, which adds onnxruntime-gpu installation to detect.py but does not result in faster inference? Thanks!

@MrRace

MrRace commented Apr 12, 2022


@glenn-jocher From your test results above it still looks extremely slow at 554.0ms inference, so the problem has not been fixed.

@MrRace

MrRace commented Apr 12, 2022

@callbarian have you solved the problem?

@glenn-jocher
Member

@MrRace ONNX export inference is working correctly, with speeds comparable to PyTorch; see #6963:

Run YOLOv5 benchmarks (speed and accuracy) for all supported export formats. This PR adds GPU benchmarking capability following CPU benchmarking PR #6613.

Format                   export.py --include   Model
PyTorch                  -                     yolov5s.pt
TorchScript              torchscript           yolov5s.torchscript
ONNX                     onnx                  yolov5s.onnx
OpenVINO                 openvino              yolov5s_openvino_model/
TensorRT                 engine                yolov5s.engine
CoreML                   coreml                yolov5s.mlmodel
TensorFlow SavedModel    saved_model           yolov5s_saved_model/
TensorFlow GraphDef      pb                    yolov5s.pb
TensorFlow Lite          tflite                yolov5s.tflite
TensorFlow Edge TPU      edgetpu               yolov5s_edgetpu.tflite
TensorFlow.js            tfjs                  yolov5s_web_model/

Usage:

git clone https://github.com/ultralytics/yolov5 -b update/bench_gpu  # clone
cd yolov5
pip install -qr requirements.txt coremltools onnx onnxruntime-gpu openvino-dev  # install
pip install -U nvidia-tensorrt --index-url https://pypi.ngc.nvidia.com  # TensorRT

python utils/benchmarks.py --weights yolov5s.pt --img 640 --device 0

Colab++ V100 High-RAM Results

benchmarks: weights=/content/yolov5/yolov5s.pt, imgsz=640, batch_size=1, data=/content/yolov5/data/coco128.yaml, device=0, half=False
Checking setup...
YOLOv5 🚀 v6.1-48-g0c1025f torch 1.10.0+cu111 CUDA:0 (Tesla V100-SXM2-16GB, 16160MiB)
Setup complete ✅ (8 CPUs, 51.0 GB RAM, 46.1/166.8 GB disk)

Benchmarks complete (433.63s)
                   Format  mAP@0.5:0.95  Inference time (ms)
0                 PyTorch      0.462296             9.159939
1             TorchScript      0.462296             6.607546
2                    ONNX      0.462296            12.698026
3                OpenVINO           NaN                  NaN
4                TensorRT      0.462280             1.725197
5                  CoreML           NaN                  NaN
6   TensorFlow SavedModel      0.462296            20.273019
7     TensorFlow GraphDef      0.462296            20.212173
8         TensorFlow Lite           NaN                  NaN
9     TensorFlow Edge TPU           NaN                  NaN
10          TensorFlow.js           NaN                  NaN

Note: TensorRT exports are fixed at FP16.

@MrRace

MrRace commented Apr 12, 2022

@glenn-jocher Thanks for your reply, I will try following your guide.

@MrRace

MrRace commented Apr 12, 2022


@glenn-jocher
When I run git clone https://github.com/ultralytics/yolov5 -b update/bench_gpu, I get:

Cloning into 'yolov5'...
fatal: Remote branch update/bench_gpu not found in upstream origin
Unexpected end of command stream

@glenn-jocher
Member

@MrRace the aforementioned PR is already merged; all of this is in master. If you already have YOLOv5 you don't need to do anything except install the dependencies and run the benchmarks:

pip install -qr requirements.txt coremltools onnx onnxruntime-gpu openvino-dev  # install
pip install -U nvidia-tensorrt --index-url https://pypi.ngc.nvidia.com  # TensorRT

python utils/benchmarks.py --weights yolov5s.pt --img 640 --device 0

@MrRace

MrRace commented Apr 12, 2022

I used master; the result:

Checking setup...
YOLOv5 🚀 v6.1-124-g8c420c4 torch 1.9.1+cu102 CUDA:0 (Tesla T4, 15110MiB)
Setup complete ✅ (40 CPUs, 156.6 GB RAM, 881.3/984.2 GB disk)

Benchmarks complete (445.65s)
                   Format  mAP@0.5:0.95  Inference time (ms)
0                 PyTorch        0.4623                 7.54
1             TorchScript        0.4623                 7.47
2                    ONNX        0.4623                14.99
3                OpenVINO           NaN                  NaN
4                TensorRT        0.4620                 2.96
5                  CoreML           NaN                  NaN
6   TensorFlow SavedModel           NaN                  NaN
7     TensorFlow GraphDef           NaN                  NaN
8         TensorFlow Lite           NaN                  NaN
9     TensorFlow Edge TPU           NaN                  NaN
10          TensorFlow.js           NaN                  NaN

@glenn-jocher ONNX is still obviously slower than PyTorch. Could you share the Docker environment?

@glenn-jocher
Member

@MrRace your results look fine. What makes you think that ONNX should be faster than PyTorch?

@glenn-jocher
Member

@MrRace my previous results are labelled 'Colab V100 High-RAM Results', but in any case the Docker image is also readily available in the README in the Environments section: https://github.com/ultralytics/yolov5#environments

@MrRace

MrRace commented Apr 12, 2022

@MrRace your results look fine. What makes you think that ONNX should be faster than PyTorch?

@glenn-jocher Thanks for your prompt reply.

  1. ONNX Runtime provides some graph optimizations (see the sketch below).
  2. In my previous experience, converting a PyTorch model to ONNX normally speeds up inference.
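
For context, ONNX Runtime's graph optimizations are configured through SessionOptions; a minimal sketch (assuming onnxruntime is installed):

import onnxruntime as ort
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL  # constant folding, node fusion, etc.
session = ort.InferenceSession('yolov5s.onnx', sess_options=opts,
                               providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])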

@glenn-jocher
Member

glenn-jocher commented Apr 12, 2022

I've never seen ONNX speedup on GPU for YOLOv5. If you manage any speed improvements though feel free to submit a PR.

Please see our ✅ Contributing Guide to get started.

@glenn-jocher
Member

@MrRace latest GPU results on our server below. I also exported ONNX at --half but saw no speedup compared to FP32.

benchmarks: weights=/usr/src/app/yolov5s.pt, imgsz=640, batch_size=1, data=/usr/src/app/data/coco128.yaml, device=, half=False, test=False
Checking setup...
YOLOv5 🚀 v6.1-129-g74aaab3 torch 1.11.0+cu113 CUDA:0 (A100-SXM-80GB, 81251MiB)
Setup complete ✅ (96 CPUs, 1007.7 GB RAM, 1925.3/3519.3 GB disk)

Benchmarks complete (536.24s)
                   Format  mAP@0.5:0.95  Inference time (ms)
0                 PyTorch        0.4624                 6.45
1             TorchScript        0.4624                 4.57
2                    ONNX        0.4623                 6.90
3                OpenVINO           NaN                  NaN
4                TensorRT        0.4618                 1.17
5                  CoreML           NaN                  NaN
6   TensorFlow SavedModel        0.4623                17.72
7     TensorFlow GraphDef        0.4623                18.26
8         TensorFlow Lite           NaN                  NaN
9     TensorFlow Edge TPU           NaN                  NaN
10          TensorFlow.js           NaN                  NaN

@MrRace

MrRace commented Apr 13, 2022

@glenn-jocher I also tried the half-precision version with --half: python utils/benchmarks.py --weights models/yolov5s.pt --img 640 --device 0 --half. The result:

benchmarks: weights=models/yolov5s.pt, imgsz=640, batch_size=1, data=/usr/src/app/data/coco128.yaml, device=0, half=True, test=False
Checking setup...
YOLOv5 🚀 v6.1-124-g8c420c4 torch 1.9.1+cu102 CUDA:0 (Tesla T4, 15110MiB)
Setup complete ✅ (40 CPUs, 156.6 GB RAM, 788.4/984.2 GB disk)

Benchmarks complete (427.96s)
                   Format  mAP@0.5:0.95  Inference time (ms)
0                 PyTorch        0.4622                 6.32
1             TorchScript        0.4622                 5.57
2                    ONNX        0.4596                11.38
3                OpenVINO           NaN                  NaN
4                TensorRT        0.4599                 2.71
5                  CoreML           NaN                  NaN
6   TensorFlow SavedModel           NaN                  NaN
7     TensorFlow GraphDef           NaN                  NaN
8         TensorFlow Lite           NaN                  NaN
9     TensorFlow Edge TPU           NaN                  NaN
10          TensorFlow.js           NaN                  NaN

The FP16 TensorRT result is just the same as the FP32 TensorRT result above. I checked the code and found this in export.py:

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

Even if I do not set --half, builder.platform_has_fast_fp16 is always True, which means the engine is always FP16. In other words, for TensorRT, python utils/benchmarks.py --weights yolov5s.pt --img 640 --device 0 and python utils/benchmarks.py --weights yolov5s.pt --img 640 --device 0 --half both produce a half-precision engine. Do your FP32 and FP16 TensorRT versions have the same issue?
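
A minimal way to make FP16 opt-in would be to gate the flag on half as well (a sketch, not the repository's current behavior):

if half and builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)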

@glenn-jocher
Member

@MrRace yes that's correct! TRT is pinned to FP16 as we saw no observable benefit to FP32 TRT exports.

@MrRace

MrRace commented Apr 18, 2022

@glenn-jocher I use the same model and test data for detect.py and val.py, and their inference times are significantly different. For example,

detect.py inference message:

Speed: 0.4ms pre-process, 12.4ms inference, 0.9ms NMS per image at shape (1, 3, 640, 640)

val.py inference message:

Speed: 0.2ms pre-process, 7.6ms inference, 1.0ms NMS per image at shape (1, 3, 640, 640)

Have you ever encountered this problem, @glenn-jocher?

@glenn-jocher
Member

glenn-jocher commented Apr 19, 2022

@MrRace 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to start investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@armap94

armap94 commented Sep 28, 2022

Running the model on Colab with a P100 GPU, I have the following results:

benchmarks: weights=yolov5s.pt, imgsz=640, batch_size=1, data=/content/yolov5/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 v6.2-165-g966b0e0 Python-3.7.14 torch-1.12.1+cu113 CUDA:0 (Tesla P100-PCIE-16GB, 16281MiB)
Setup complete ✅ (4 CPUs, 25.5 GB RAM, 44.2/166.8 GB disk)

Benchmarks complete (258.79s)
                   Format  Size (MB)  mAP50-95  Inference time (ms)
0                 PyTorch       14.1    0.4716                 6.36
1             TorchScript       28.1    0.4716                 6.12
2                    ONNX       28.0    0.4716                15.00
3                OpenVINO        NaN       NaN                  NaN
4                TensorRT       33.2    0.4716                 4.58
5                  CoreML        NaN       NaN                  NaN
6   TensorFlow SavedModel       27.8    0.4716                20.79
7     TensorFlow GraphDef       27.8    0.4716                20.80
8         TensorFlow Lite        NaN       NaN                  NaN
9     TensorFlow Edge TPU        NaN       NaN                  NaN
10          TensorFlow.js        NaN       NaN                  NaN
11           PaddlePaddle       57.0    0.4716               409.37

This shows that in the benchmarks ONNX, while slower, is still comparable. However, when running the detection script, inference with OpenCV DNN is dramatically slower.

!python detect.py --weights yolov5s.pt --img 640 --conf 0.25 --source data/images  
Speed: 0.5ms pre-process, 15.3ms inference, 1.8ms NMS per image at shape (1, 3, 640, 640)
!python detect.py --weights yolov5s.onnx --img 640 --conf 0.25 --source data/images  
Speed: 2.5ms pre-process, 15.6ms inference, 2.8ms NMS per image at shape (1, 3, 640, 640)
!python detect.py --weights yolov5s.onnx --img 640 --conf 0.25 --source data/images  --dnn
Speed: 2.9ms pre-process, 749.4ms inference, 2.6ms NMS per image at shape (1, 3, 640, 640)

@glenn-jocher Is this OpenCV DNN inference speed expected? Is there any way to improve it?

@glenn-jocher
Member

@armap94 --dnn inference is likely using CPU. I'm not very familiar with DNN, but if you'd like to submit a PR for DNN inference that would be useful. The relevant code area is here:

Loading:

yolov5/models/common.py

Lines 355 to 358 in 2373d54

elif dnn:  # ONNX OpenCV DNN
    LOGGER.info(f'Loading {w} for ONNX OpenCV DNN inference...')
    check_requirements('opencv-python>=4.5.4')
    net = cv2.dnn.readNetFromONNX(w)

Inference:

yolov5/models/common.py

Lines 507 to 510 in 2373d54

elif self.dnn:  # ONNX OpenCV DNN
    im = im.cpu().numpy()  # torch to numpy
    self.net.setInput(im)
    y = self.net.forward()
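
If OpenCV itself is built with CUDA support (the stock opencv-python wheels are not), the DNN module can in principle be pointed at the GPU; a hedged sketch:

import cv2
net = cv2.dnn.readNetFromONNX('yolov5s.onnx')
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)  # requires a CUDA-enabled OpenCV build
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)    # otherwise OpenCV falls back to CPU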

@armap94

armap94 commented Sep 29, 2022


When the .pt file is converted to .onnx using export.py, if the flag --device 0 is used, doesn't that force ONNX to use the GPU during inference? Or are extra steps required to ensure that ONNX uses the GPU during inference?

@glenn-jocher
Member

@armap94 inference device is independent of export device.
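
For example (a sketch), the same exported yolov5s.onnx can be run on either device by choosing it at inference time:

python detect.py --weights yolov5s.onnx --device cpu  # CPUExecutionProvider
python detect.py --weights yolov5s.onnx --device 0    # CUDAExecutionProvider, if onnxruntime-gpu is installed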
