
Add trtexec TensorRT export #6984

Closed · wants to merge 8 commits

Conversation

@triple-Mu (Contributor) commented Mar 15, 2022

I tried adding trtexec TensorRT export and got very interesting results, as follows.
1. Using the original export method
The mAP results:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.374
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.570
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.401
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.216
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.423
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.489
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.311
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.516
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.566
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.377
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.627
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.718
The FPS results:
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 5000/5000 [00:37<00:00, 134.16it/s]
all 5000 36335 0.661 0.524 0.615 0.439
Speed: 0.2ms pre-process, 1.5ms inference, 0.5ms NMS per image at shape (1, 3, 640, 640)

2. Using the trtexec export method
The mAP results:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.374
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.571
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.401
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.216
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.423
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.489
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.311
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.516
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.566
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.378
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.628
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.718
The FPS results:
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 5000/5000 [00:46<00:00, 108.09it/s]
all 5000 36335 0.659 0.525 0.616 0.44
Speed: 0.2ms pre-process, 3.5ms inference, 0.5ms NMS per image at shape (1, 3, 640, 640)

3. Summary
The trtexec export method may give slightly better AP@0.5 and mAR (small/medium), but it increases inference time from 1.5 ms to 3.5 ms.
All result images and logs are attached (original export, trtexec export, and the corresponding mAP/FPS logs).

So using trtexec may help us get slightly more accurate results.
Thanks!

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

WARNING ⚠️ this PR is very large, summary may not cover all changes.

🌟 Summary

This PR introduced enhancements to TensorRT export functionality in YOLOv5.

📊 Key Changes

  • Added support for onnx-graphsurgeon to optimize ONNX models
  • Added a 'dynamic_axes' parameter to allow dynamic input sizes during ONNX export (see the sketch after this list)
  • Reduced model export size and memory consumption
  • Improved model inference times and GPU utilization
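
For context, dynamic ONNX export in YOLOv5 is normally driven from the command line; a minimal sketch assuming the repo's standard --dynamic flag (the PR's exact flag wiring may differ):

python export.py --weights yolov5s.pt --include onnx --dynamic  # ONNX export with dynamic batch/height/width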

🎯 Purpose & Impact

  • Enhanced Performance: Users can expect faster model inference with less memory overhead, making deployment on diverse platforms more efficient.
  • Dynamic Input Handling: The ability to handle dynamic input sizes provides flexibility for various use cases and input data.
  • Optimized Model Size: Reduced model export size facilitates easier deployment, especially in edge computing scenarios where resources are limited.

@github-actions bot (Contributor) left a comment:

👋 Hello @triple-Mu, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with upstream/master. If your PR is behind upstream/master an automatic GitHub Actions merge may be attempted by writing /rebase in a new comment, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
# git checkout feature  # <--- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@glenn-jocher (Member)

@triple-Mu thanks for the PR! I was not familiar with trtexec. What's the main difference from the default TensorRT export?

BTW note that the default TRT export will always be in FP16 mode regardless of --half. We use this by default as we did not observe any mAP drops but did observe significant speedup in --half mode. Full benchmarking results are in #6963

Colab++ V100 High-RAM Results

benchmarks: weights=/content/yolov5/yolov5s.pt, imgsz=640, batch_size=1, data=/content/yolov5/data/coco128.yaml, device=0, half=False
Checking setup...
YOLOv5 🚀 v6.1-48-g0c1025f torch 1.10.0+cu111 CUDA:0 (Tesla V100-SXM2-16GB, 16160MiB)
Setup complete ✅ (8 CPUs, 51.0 GB RAM, 46.1/166.8 GB disk)

Benchmarks complete (433.63s)
                   Format  mAP@0.5:0.95  Inference time (ms)
0                 PyTorch      0.462296             9.159939
1             TorchScript      0.462296             6.607546
2                    ONNX      0.462296            12.698026
3                OpenVINO           NaN                  NaN
4                TensorRT      0.462280             1.725197
5                  CoreML           NaN                  NaN
6   TensorFlow SavedModel      0.462296            20.273019
7     TensorFlow GraphDef      0.462296            20.212173
8         TensorFlow Lite           NaN                  NaN
9     TensorFlow Edge TPU           NaN                  NaN
10          TensorFlow.js           NaN                  NaN

glenn-jocher changed the title from "Add trtexec tensorrt export" to "Add trtexec TensorRT export" on Mar 15, 2022
@triple-Mu (Contributor, Author) commented Mar 15, 2022

(quoting @glenn-jocher's comment above)

@glenn-jocher Thank you for your reply. trtexec applies machine-specific GPU optimizations when building the engine, so the export may take longer. At the same time, we can view detailed information during the export process, such as inference time on random inputs and the time consumed by each layer of the network, which is very convenient.
Besides, trtexec is installed with TensorRT by default, which avoids installing the Python tensorrt wheel, and it is more convenient to use on NVIDIA Jetson Nano/TX2/AGX.
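
For readers unfamiliar with the tool, a typical invocation looks like the following; this is a generic sketch using standard trtexec flags and example filenames, not necessarily the exact command this PR runs:

# Build an FP16 engine from an exported ONNX model and print per-layer timing details.
# Note: --workspace (MiB) is replaced by --memPoolSize in newer TensorRT releases.
/usr/src/tensorrt/bin/trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --fp16 --workspace=4096 --verbose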

@glenn-jocher (Member) commented Mar 15, 2022

@triple-Mu got it, thanks! trtexec export actually seems faster in your results, i.e. 180 seconds instead of 380 seconds. I tried to run the PR but got this error:

/bin/sh: 1: /usr/src/tensorrt/bin/trtexec: not found

Existing pip install does not appear to install trtexec:

pip install -U nvidia-tensorrt --index-url https://pypi.ngc.nvidia.com

@triple-Mu (Contributor, Author) commented Mar 15, 2022

@glenn-jocher
What is your TensorRT installation path?
If it was installed from the deb package, trtexec will be at this path by default; otherwise you need to change '/usr/src/tensorrt/bin/trtexec' to your TensorRT directory's bin folder, e.g. TensorRT-8.2.3.1/bin/trtexec.
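
If you are unsure where TensorRT ended up, a generic shell check like this (illustrative, not part of the PR) should locate the binary:

# Check PATH first, then the default deb install location, then search the filesystem.
which trtexec || ls /usr/src/tensorrt/bin/trtexec 2>/dev/null || find / -name trtexec -type f 2>/dev/null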

@glenn-jocher (Member) commented Mar 15, 2022

@triple-Mu this is the full code I'm using to clone the PR, install requirements and run export. I'm running this in Colab:
https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb?hl=en

!git clone https://github.com/triple-Mu/yolov5 -b tripleMu # clone
%cd yolov5
%pip install -qr requirements.txt  # install
%pip install -U nvidia-tensorrt --index-url https://pypi.ngc.nvidia.com  # install
!python export.py --weights yolov5s.pt --include engine --device 0 --trtexec

@triple-Mu (Contributor, Author)

@glenn-jocher
I tried FP16 export and got a weird result: the AP and AR differ slightly, but the inference time is the same as with the original method. (FP16 results attached.)

@glenn-jocher (Member) commented Mar 15, 2022

@triple-Mu I'm using the same Colab code as in my earlier comment to clone the PR, install the requirements and run the export.

But there's no trtexec file that I can find:

(screenshot attached)

@zhiqwang (Contributor)

Just FYI @glenn-jocher, trtexec is a command-line wrapper tool (examples below):

  • It's useful for benchmarking networks on random or user-provided input data.
  • It's useful for generating serialized engines from models.
  • It's useful for generating a serialized timing cache from the builder.

And it seems that trtexec is difficult to obtain if we install TensorRT via pip.
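
For illustration, generic invocations for two of these uses (filenames are examples, and --timingCacheFile requires a recent TensorRT 8.x):

# Benchmark an existing serialized engine on random input data.
trtexec --loadEngine=yolov5s.engine --iterations=100
# Build an engine while writing a timing cache to speed up future builds.
trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --timingCacheFile=timing.cache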

@triple-Mu (Contributor, Author)

@glenn-jocher
I don't know how you installed TensorRT; do you know the path where it is stored? This command-line tool should be in the bin folder under the TensorRT path.
Refer to https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec

@zhiqwang (Contributor)

Actually, trtexec plays the same role as the Python wrapper. I think it would be better to document the usage of trtexec and add instructions informing users that they can also use trtexec to generate serialized engines, rather than repackaging trtexec back into Python.

@glenn-jocher (Member)

@triple-Mu quick question: is this PR compatible with your new PR #7736, or does the new PR replace this one?

@triple-Mu (Contributor, Author)

(quoting @glenn-jocher's question above)

The new PR has nothing to do with the old one. Whatever you prefer is fine; it does not matter to me.

triple-Mu closed this on May 19, 2022
glenn-jocher removed the TODO label on May 19, 2022
triple-Mu deleted the tripleMu branch on May 20, 2022