Support dynamic batch for TensorRT and onnxruntime #329
Conversation
Closing #280 in favor of this PR.
@triple-Mu Why don't we use dynamic batch by default?
@AlexeyAB

Static batch=1
[07/28/2022-15:56:12] [I] === Performance summary ===
[07/28/2022-15:56:12] [I] Throughput: 863.856 qps
[07/28/2022-15:56:12] [I] Latency: min = 1.08133 ms, max = 1.18787 ms, mean = 1.14559 ms, median = 1.1438 ms, percentile(99%) = 1.18066 ms
[07/28/2022-15:56:12] [I] Enqueue Time: min = 0.0108643 ms, max = 0.0632324 ms, mean = 0.0199817 ms, median = 0.0189209 ms, percentile(99%) = 0.0423584 ms
[07/28/2022-15:56:12] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:56:12] [I] GPU Compute Time: min = 1.08133 ms, max = 1.18787 ms, mean = 1.14559 ms, median = 1.1438 ms, percentile(99%) = 1.18066 ms
[07/28/2022-15:56:12] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:56:12] [I] Total Host Walltime: 3.00282 s
[07/28/2022-15:56:12] [I] Total GPU Compute Time: 2.97166 s
[07/28/2022-15:56:12] [W] * GPU compute time is unstable, with coefficient of variance = 1.19178%.
[07/28/2022-15:56:12] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:56:12] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:56:12] [V]
[07/28/2022-15:56:12] [V] === Explanations of the performance metrics ===
[07/28/2022-15:56:12] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/28/2022-15:56:12] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/28/2022-15:56:12] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/28/2022-15:56:12] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/28/2022-15:56:12] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/28/2022-15:56:12] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/28/2022-15:56:12] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/28/2022-15:56:12] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/28/2022-15:56:12] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch1.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:1x3x640x640

Static batch=16
[07/28/2022-15:58:53] [I] === Performance summary ===
[07/28/2022-15:58:53] [I] Throughput: 71.0509 qps
[07/28/2022-15:58:53] [I] Latency: min = 13.8823 ms, max = 15.3702 ms, mean = 14.0723 ms, median = 14.0103 ms, percentile(99%) = 15.1163 ms
[07/28/2022-15:58:53] [I] Enqueue Time: min = 0.0202637 ms, max = 0.167694 ms, mean = 0.105239 ms, median = 0.112793 ms, percentile(99%) = 0.154663 ms
[07/28/2022-15:58:53] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:58:53] [I] GPU Compute Time: min = 13.8823 ms, max = 15.3702 ms, mean = 14.0723 ms, median = 14.0103 ms, percentile(99%) = 15.1163 ms
[07/28/2022-15:58:53] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:58:53] [I] Total Host Walltime: 3.026 s
[07/28/2022-15:58:53] [I] Total GPU Compute Time: 3.02554 s
[07/28/2022-15:58:53] [W] * GPU compute time is unstable, with coefficient of variance = 1.61678%.
[07/28/2022-15:58:53] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:58:53] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:58:53] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch16.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:16x3x640x640

Static batch=32
[07/28/2022-15:59:33] [I] === Performance summary ===
[07/28/2022-15:59:33] [I] Throughput: 36.1283 qps
[07/28/2022-15:59:33] [I] Latency: min = 27.0797 ms, max = 31.2945 ms, mean = 27.6771 ms, median = 27.4131 ms, percentile(99%) = 30.8654 ms
[07/28/2022-15:59:33] [I] Enqueue Time: min = 0.0166016 ms, max = 0.152283 ms, mean = 0.096474 ms, median = 0.104919 ms, percentile(99%) = 0.144287 ms
[07/28/2022-15:59:33] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:59:33] [I] GPU Compute Time: min = 27.0797 ms, max = 31.2945 ms, mean = 27.6771 ms, median = 27.4131 ms, percentile(99%) = 30.8654 ms
[07/28/2022-15:59:33] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:59:33] [I] Total Host Walltime: 3.0447 s
[07/28/2022-15:59:33] [I] Total GPU Compute Time: 3.04448 s
[07/28/2022-15:59:33] [W] * GPU compute time is unstable, with coefficient of variance = 2.442%.
[07/28/2022-15:59:33] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:59:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:59:33] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch32.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:32x3x640x640

Dynamic batch=1
[07/28/2022-16:00:55] [I] === Performance summary ===
[07/28/2022-16:00:55] [I] Throughput: 716.632 qps
[07/28/2022-16:00:55] [I] Latency: min = 1.3476 ms, max = 1.5657 ms, mean = 1.38629 ms, median = 1.38037 ms, percentile(99%) = 1.55853 ms
[07/28/2022-16:00:55] [I] Enqueue Time: min = 0.0109863 ms, max = 0.0975342 ms, mean = 0.022185 ms, median = 0.0200195 ms, percentile(99%) = 0.071167 ms
[07/28/2022-16:00:55] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:00:55] [I] GPU Compute Time: min = 1.3476 ms, max = 1.5657 ms, mean = 1.38629 ms, median = 1.38037 ms, percentile(99%) = 1.55853 ms
[07/28/2022-16:00:55] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:00:55] [I] Total Host Walltime: 3.00294 s
[07/28/2022-16:00:55] [I] Total GPU Compute Time: 2.9833 s
[07/28/2022-16:00:55] [W] * GPU compute time is unstable, with coefficient of variance = 2.10486%.
[07/28/2022-16:00:55] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:00:55] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:00:55] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:1x3x640x640

Dynamic batch=16
[07/28/2022-16:01:33] [I] === Performance summary ===
[07/28/2022-16:01:33] [I] Throughput: 70.9378 qps
[07/28/2022-16:01:33] [I] Latency: min = 13.8496 ms, max = 15.4686 ms, mean = 14.0947 ms, median = 14.037 ms, percentile(99%) = 15.0651 ms
[07/28/2022-16:01:33] [I] Enqueue Time: min = 0.0184326 ms, max = 0.318115 ms, mean = 0.1048 ms, median = 0.112305 ms, percentile(99%) = 0.220886 ms
[07/28/2022-16:01:33] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:33] [I] GPU Compute Time: min = 13.8496 ms, max = 15.4686 ms, mean = 14.0947 ms, median = 14.037 ms, percentile(99%) = 15.0651 ms
[07/28/2022-16:01:33] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:33] [I] Total Host Walltime: 3.03083 s
[07/28/2022-16:01:33] [I] Total GPU Compute Time: 3.03036 s
[07/28/2022-16:01:33] [W] * GPU compute time is unstable, with coefficient of variance = 1.65175%.
[07/28/2022-16:01:33] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:01:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:01:33] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:16x3x640x640

Dynamic batch=32
[07/28/2022-16:01:51] [I] === Performance summary ===
[07/28/2022-16:01:51] [I] Throughput: 35.8421 qps
[07/28/2022-16:01:51] [I] Latency: min = 27.3459 ms, max = 29.6284 ms, mean = 27.8981 ms, median = 27.7662 ms, percentile(99%) = 29.568 ms
[07/28/2022-16:01:51] [I] Enqueue Time: min = 0.0256348 ms, max = 0.164413 ms, mean = 0.106011 ms, median = 0.110474 ms, percentile(99%) = 0.153809 ms
[07/28/2022-16:01:51] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:51] [I] GPU Compute Time: min = 27.3459 ms, max = 29.6284 ms, mean = 27.8981 ms, median = 27.7662 ms, percentile(99%) = 29.568 ms
[07/28/2022-16:01:51] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:51] [I] Total Host Walltime: 3.06901 s
[07/28/2022-16:01:51] [I] Total GPU Compute Time: 3.06879 s
[07/28/2022-16:01:51] [W] * GPU compute time is unstable, with coefficient of variance = 1.55253%.
[07/28/2022-16:01:51] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:01:51] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:01:51] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:32x3x640x640

Performance Mean Latency Summary
As shown above, at the same batch size the dynamic-batch engine achieves almost the same inference performance as the static-batch engines, but we no longer need to export three separate batch models! Note that the inference scripts for dynamic batches are different.
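To make the comparison concrete, the mean "GPU Compute Time" values reported in the logs above can be reduced to per-image latency. This is a small illustrative calculation using only the numbers from those runs (one particular GPU); the dictionaries below are built by hand from the logs, not produced by any yolov7 script.

```python
# Mean GPU Compute Time (ms) per query, copied from the trtexec logs above.
static_mean_ms = {1: 1.14559, 16: 14.0723, 32: 27.6771}
dynamic_mean_ms = {1: 1.38629, 16: 14.0947, 32: 27.8981}

for batch in (1, 16, 32):
    per_img_static = static_mean_ms[batch] / batch
    per_img_dynamic = dynamic_mean_ms[batch] / batch
    # Relative cost of the dynamic engine vs. the static one at this batch.
    overhead_pct = (dynamic_mean_ms[batch] / static_mean_ms[batch] - 1) * 100
    print(f"batch={batch:2d}  static={per_img_static:.3f} ms/img  "
          f"dynamic={per_img_dynamic:.3f} ms/img  overhead={overhead_pct:+.1f}%")
```

The dynamic engine pays a visible penalty only at batch=1 (roughly +21%); at batch 16 and 32 the difference is well under 1%, which matches the "almost the same performance" observation.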
Hello @triple-Mu, hello @AlexeyAB,

Great work!
Hi @philipp-schmidt, hi @triple-Mu. Now I see: there are cases where static batch is better; it is more stable in general, and faster for batch=1.
Yes, we are interested in this!
Also, here you can see the difference a dynamic batch size makes (yolov7 performance with dynamic batch size, min 1, opt 8, max 8): 76% more throughput at only 57% of the latency.
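The throughput gain from batching also shows up directly in the qps numbers of the dynamic-engine runs above: effective image throughput is qps times batch size. A quick check with the logged values (copied by hand, illustrative only):

```python
# Throughput (qps) of the dynamic-batch engine, from the trtexec logs above.
qps = {1: 716.632, 16: 70.9378, 32: 35.8421}

for batch, q in sorted(qps.items()):
    # One "query" processes `batch` images, so images/s = qps * batch.
    print(f"batch={batch:2d}  {q:8.3f} qps  ->  {q * batch:7.1f} images/s")
```

Going from batch 1 to batch 16 raises effective throughput from about 717 to about 1135 images/s on this GPU, at the cost of higher per-query latency, which is the usual throughput/latency trade-off when choosing the opt shape.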
@triple-Mu Thanks! What does this value mean?

python export.py --weights ./yolov7-tiny.pt --grid --end2end --simplify \
    --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 \
    --img-size 640 640 \
    --dynamic-batch \
    --max-wh 7680
Here is my introduction: #273 (comment). It is the same as https://github.com/WongKinYiu/yolov7/blob/main/utils/general.py#L681-L681
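For readers unfamiliar with that line in general.py: the idea behind an upper bound like --max-wh is to offset each box by its class id times a value larger than any image dimension, so a single class-agnostic NMS pass never suppresses boxes of different classes (shifted boxes of different classes cannot overlap). A minimal numpy sketch of that trick, with a hypothetical helper name:

```python
import numpy as np

def class_offset_boxes(boxes, class_ids, max_wh=7680):
    """Shift each (x1, y1, x2, y2) box by class_id * max_wh.

    After the shift, boxes of different classes live in disjoint coordinate
    ranges, so their IoU is zero and one class-agnostic NMS pass behaves
    like per-class NMS. Illustrative sketch, not the yolov7 source code.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    class_ids = np.asarray(class_ids)
    return boxes + class_ids[:, None] * max_wh

# Two heavily overlapping boxes, but with different classes:
boxes = np.array([[10.0, 10.0, 50.0, 50.0],
                  [12.0, 12.0, 52.0, 52.0]])
classes = np.array([0, 1])
shifted = class_offset_boxes(boxes, classes)
# The class-0 box is unchanged; the class-1 box now starts at x = 7692,
# far outside any 640x640 image, so NMS cannot suppress one with the other.
```

This is why --max-wh must be at least as large as the maximum image width/height you expect at inference time.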
@triple-Mu @philipp-schmidt Can you add code for inference on video? After the model is converted into .trt we want to run inference on video.
Already there, see "python3 client.py video {input}"
@philipp-schmidt It's not in the Google Colab which you have provided.
* Support dynamic batch for TensorRT and onnxruntime
* Fix output name
* Add some images
* Add dynamic-batch usage notebook
* Add example notebook for onnxruntime and tensorrt
As shown in the yolov7 issue #273 (comment), we will support dynamic batch for end2end detection!
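The trtexec runs above only load prebuilt engines with --shapes; to build the dynamic engine itself from the exported ONNX, trtexec needs an optimization profile via --minShapes/--optShapes/--maxShapes. A sketch of such an invocation, under the assumption that the exported file is named yolov7-tiny-nms.onnx (adjust names to your export):

```shell
# Build a TensorRT engine with a dynamic batch dimension (1..32, tuned for 16).
trtexec \
  --onnx=yolov7-tiny-nms.onnx \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:16x3x640x640 \
  --maxShapes=images:32x3x640x640 \
  --saveEngine=yolov7-tiny-nms.trt \
  --fp16
```

Any batch size within [min, max] can then be passed at runtime; TensorRT optimizes kernels for the opt shape, which is consistent with batch=1 being slightly slower on the dynamic engine in the benchmarks above.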