
Support dynamic batch for TensorRT and onnxruntime #329

Merged: 10 commits from the dynamic-batch branch into WongKinYiu:main on Jul 28, 2022

Conversation

@triple-Mu (Contributor)

As discussed in yolov7 #273 (comment), we will support dynamic batch for end2end detection!
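
For illustration only, here is a minimal sketch (the ONNX file name and the random input are placeholders, not part of this PR) of what the dynamic batch axis means on the onnxruntime side: one exported end2end model serves every batch size.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file name; use whatever export.py --dynamic-batch produced.
session = ort.InferenceSession(
    "yolov7-tiny-end2end.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

for batch in (1, 4, 16):
    # Dummy preprocessed input: NCHW float32, 640x640, values in [0, 1].
    images = np.random.rand(batch, 3, 640, 640).astype(np.float32)
    # Thanks to the dynamic batch axis, the same session accepts any batch size.
    outputs = session.run(None, {"images": images})
    print(batch, [o.shape for o in outputs])
```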

@triple-Mu (Contributor Author)

@philipp-schmidt (Contributor)

Notebook is not set to shared

@philipp-schmidt (Contributor)

Closing #280 in favor of this PR.

@AlexeyAB (Collaborator)

@triple-Mu Why don't we use dynamic batch by default?
In which cases would a static batch be better than a dynamic batch?

@triple-Mu (Contributor Author)

@AlexeyAB
YOLOv5 is working on dynamic-batch export as well (ultralytics/yolov5#8526), and TensorRT 8.4 seems to have good support for dynamic batch.
Here is a performance summary for batch 1/16/32 with static and dynamic batch engines:

Static batch=1

[07/28/2022-15:56:12] [I] === Performance summary ===
[07/28/2022-15:56:12] [I] Throughput: 863.856 qps
[07/28/2022-15:56:12] [I] Latency: min = 1.08133 ms, max = 1.18787 ms, mean = 1.14559 ms, median = 1.1438 ms, percentile(99%) = 1.18066 ms
[07/28/2022-15:56:12] [I] Enqueue Time: min = 0.0108643 ms, max = 0.0632324 ms, mean = 0.0199817 ms, median = 0.0189209 ms, percentile(99%) = 0.0423584 ms
[07/28/2022-15:56:12] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:56:12] [I] GPU Compute Time: min = 1.08133 ms, max = 1.18787 ms, mean = 1.14559 ms, median = 1.1438 ms, percentile(99%) = 1.18066 ms
[07/28/2022-15:56:12] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:56:12] [I] Total Host Walltime: 3.00282 s
[07/28/2022-15:56:12] [I] Total GPU Compute Time: 2.97166 s
[07/28/2022-15:56:12] [W] * GPU compute time is unstable, with coefficient of variance = 1.19178%.
[07/28/2022-15:56:12] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:56:12] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:56:12] [V] 
[07/28/2022-15:56:12] [V] === Explanations of the performance metrics ===
[07/28/2022-15:56:12] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/28/2022-15:56:12] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/28/2022-15:56:12] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/28/2022-15:56:12] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/28/2022-15:56:12] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/28/2022-15:56:12] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/28/2022-15:56:12] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/28/2022-15:56:12] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/28/2022-15:56:12] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch1.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:1x3x640x640

Static batch=16

[07/28/2022-15:58:53] [I] === Performance summary ===
[07/28/2022-15:58:53] [I] Throughput: 71.0509 qps
[07/28/2022-15:58:53] [I] Latency: min = 13.8823 ms, max = 15.3702 ms, mean = 14.0723 ms, median = 14.0103 ms, percentile(99%) = 15.1163 ms
[07/28/2022-15:58:53] [I] Enqueue Time: min = 0.0202637 ms, max = 0.167694 ms, mean = 0.105239 ms, median = 0.112793 ms, percentile(99%) = 0.154663 ms
[07/28/2022-15:58:53] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:58:53] [I] GPU Compute Time: min = 13.8823 ms, max = 15.3702 ms, mean = 14.0723 ms, median = 14.0103 ms, percentile(99%) = 15.1163 ms
[07/28/2022-15:58:53] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:58:53] [I] Total Host Walltime: 3.026 s
[07/28/2022-15:58:53] [I] Total GPU Compute Time: 3.02554 s
[07/28/2022-15:58:53] [W] * GPU compute time is unstable, with coefficient of variance = 1.61678%.
[07/28/2022-15:58:53] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:58:53] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:58:53] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch16.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:16x3x640x640

Static batch=32

[07/28/2022-15:59:33] [I] === Performance summary ===
[07/28/2022-15:59:33] [I] Throughput: 36.1283 qps
[07/28/2022-15:59:33] [I] Latency: min = 27.0797 ms, max = 31.2945 ms, mean = 27.6771 ms, median = 27.4131 ms, percentile(99%) = 30.8654 ms
[07/28/2022-15:59:33] [I] Enqueue Time: min = 0.0166016 ms, max = 0.152283 ms, mean = 0.096474 ms, median = 0.104919 ms, percentile(99%) = 0.144287 ms
[07/28/2022-15:59:33] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:59:33] [I] GPU Compute Time: min = 27.0797 ms, max = 31.2945 ms, mean = 27.6771 ms, median = 27.4131 ms, percentile(99%) = 30.8654 ms
[07/28/2022-15:59:33] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:59:33] [I] Total Host Walltime: 3.0447 s
[07/28/2022-15:59:33] [I] Total GPU Compute Time: 3.04448 s
[07/28/2022-15:59:33] [W] * GPU compute time is unstable, with coefficient of variance = 2.442%.
[07/28/2022-15:59:33] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:59:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:59:33] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch32.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:32x3x640x640

Dynamic batch=1

[07/28/2022-16:00:55] [I] === Performance summary ===
[07/28/2022-16:00:55] [I] Throughput: 716.632 qps
[07/28/2022-16:00:55] [I] Latency: min = 1.3476 ms, max = 1.5657 ms, mean = 1.38629 ms, median = 1.38037 ms, percentile(99%) = 1.55853 ms
[07/28/2022-16:00:55] [I] Enqueue Time: min = 0.0109863 ms, max = 0.0975342 ms, mean = 0.022185 ms, median = 0.0200195 ms, percentile(99%) = 0.071167 ms
[07/28/2022-16:00:55] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:00:55] [I] GPU Compute Time: min = 1.3476 ms, max = 1.5657 ms, mean = 1.38629 ms, median = 1.38037 ms, percentile(99%) = 1.55853 ms
[07/28/2022-16:00:55] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:00:55] [I] Total Host Walltime: 3.00294 s
[07/28/2022-16:00:55] [I] Total GPU Compute Time: 2.9833 s
[07/28/2022-16:00:55] [W] * GPU compute time is unstable, with coefficient of variance = 2.10486%.
[07/28/2022-16:00:55] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:00:55] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:00:55] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:1x3x640x640

Dynamic batch=16

[07/28/2022-16:01:33] [I] === Performance summary ===
[07/28/2022-16:01:33] [I] Throughput: 70.9378 qps
[07/28/2022-16:01:33] [I] Latency: min = 13.8496 ms, max = 15.4686 ms, mean = 14.0947 ms, median = 14.037 ms, percentile(99%) = 15.0651 ms
[07/28/2022-16:01:33] [I] Enqueue Time: min = 0.0184326 ms, max = 0.318115 ms, mean = 0.1048 ms, median = 0.112305 ms, percentile(99%) = 0.220886 ms
[07/28/2022-16:01:33] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:33] [I] GPU Compute Time: min = 13.8496 ms, max = 15.4686 ms, mean = 14.0947 ms, median = 14.037 ms, percentile(99%) = 15.0651 ms
[07/28/2022-16:01:33] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:33] [I] Total Host Walltime: 3.03083 s
[07/28/2022-16:01:33] [I] Total GPU Compute Time: 3.03036 s
[07/28/2022-16:01:33] [W] * GPU compute time is unstable, with coefficient of variance = 1.65175%.
[07/28/2022-16:01:33] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:01:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:01:33] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:16x3x640x640

Dynamic batch=32

[07/28/2022-16:01:51] [I] === Performance summary ===
[07/28/2022-16:01:51] [I] Throughput: 35.8421 qps
[07/28/2022-16:01:51] [I] Latency: min = 27.3459 ms, max = 29.6284 ms, mean = 27.8981 ms, median = 27.7662 ms, percentile(99%) = 29.568 ms
[07/28/2022-16:01:51] [I] Enqueue Time: min = 0.0256348 ms, max = 0.164413 ms, mean = 0.106011 ms, median = 0.110474 ms, percentile(99%) = 0.153809 ms
[07/28/2022-16:01:51] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:51] [I] GPU Compute Time: min = 27.3459 ms, max = 29.6284 ms, mean = 27.8981 ms, median = 27.7662 ms, percentile(99%) = 29.568 ms
[07/28/2022-16:01:51] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:51] [I] Total Host Walltime: 3.06901 s
[07/28/2022-16:01:51] [I] Total GPU Compute Time: 3.06879 s
[07/28/2022-16:01:51] [W] * GPU compute time is unstable, with coefficient of variance = 1.55253%.
[07/28/2022-16:01:51] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:01:51] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:01:51] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:32x3x640x640

Performance Mean Latency Summary

|         | batch 1    | batch 16   | batch 32   |
| ------- | ---------- | ---------- | ---------- |
| static  | 1.14559 ms | 14.0723 ms | 27.6771 ms |
| dynamic | 1.38629 ms | 14.0947 ms | 27.8981 ms |

As shown in the table above, at the same batch size the dynamic-batch engine achieves almost the same inference latency as the static-batch engine, but we no longer need to export three separate models!

However, the inference script for a dynamic-batch model is different, and support for dynamic batch varies across GPUs and TensorRT versions. In that respect, the static-batch model is the more stable option.
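
To make the difference in inference scripts concrete, here is a minimal, hypothetical sketch of running a dynamic-batch engine with the TensorRT 8.x Python API plus pycuda (the engine file name, the single input at binding 0, and the random input are assumptions): the extra step compared to a static-batch script is choosing the batch size at runtime via the input binding shape.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 - creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("yolov7-tiny-nms.trt", "rb") as f:  # engine built with an optimization profile
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

batch = 4  # any size inside the profile's min/max range
context.set_binding_shape(0, (batch, 3, 640, 640))  # the step a static-batch script never needs

host_bufs, device_bufs = [], []
for i in range(engine.num_bindings):  # assumes binding 0 is the single "images" input
    shape = tuple(context.get_binding_shape(i))  # output shapes now reflect the chosen batch size
    dtype = trt.nptype(engine.get_binding_dtype(i))
    if engine.binding_is_input(i):
        host = np.random.rand(*shape).astype(dtype)  # dummy preprocessed NCHW images in [0, 1]
    else:
        host = np.empty(shape, dtype=dtype)
    host_bufs.append(host)
    device_bufs.append(cuda.mem_alloc(host.nbytes))

cuda.memcpy_htod(device_bufs[0], np.ascontiguousarray(host_bufs[0]))  # upload the input batch
context.execute_v2([int(d) for d in device_bufs])                     # synchronous inference
for i in range(1, engine.num_bindings):                               # download all outputs
    cuda.memcpy_dtoh(host_bufs[i], device_bufs[i])

print([o.shape for o in host_bufs[1:]])  # detection outputs sized for the chosen batch
```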

@philipp-schmidt (Contributor) commented Jul 28, 2022

Hello @triple-Mu, hello @AlexeyAB,

  1. Dynamic batching needs an "optimal batch size" to optimize the engine for. So if you test the dynamic-batch engine with batch size 4, the benchmark is only really representative if you select the correct optimization profile with "--optShapes=images:4x3x640x640" and tell TRT that batch size 4 will be the most common input size (alongside the min and max batch sizes); a build sketch follows after this comment.

  2. Your current performance benchmarks miss an important point about dynamic batching. It is not about getting roughly the same latency as the respective static engine; that is just a bonus. It opens up the option of combining multiple batch-size-1 requests into one request with a bigger batch size server-side, and therefore better throughput (for a potentially acceptable latency trade-off, even for many realtime applications: 15 FPS video -> 66 ms).
     Triton Inference Server makes use of this with its dynamic batching strategy, and I can tell you it gives some real-world deployments up to an additional 50% throughput compared to a bunch of independent realtime applications sending single-image requests to the server without this strategy. It basically gives you the throughput of batch size X for the network, completely transparently to a bunch of batch-size-1 realtime apps, with lots of room for latency trade-offs.
     My initial PR would have showcased this with a short tutorial, but it got replaced by this one. If a Triton deployment tutorial (similar to our repo for yolov4) with dynamic batching and a Python client is wanted, I am happy to make a separate PR, but I would need a confirmation of interest from the authors here.
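
To make point 1 concrete, here is a minimal, hypothetical sketch (TensorRT 8.x Python API; file names are placeholders) of building a dynamic-batch engine with an explicit optimization profile. It is the API-level equivalent of trtexec's --minShapes/--optShapes/--maxShapes; the "opt" shape is the batch size TensorRT tunes its kernels for.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")  # register EfficientNMS_TRT used by the end2end graph

builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("yolov7-tiny-nms.onnx"):  # ONNX exported with --dynamic-batch
    raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# min / opt / max batch sizes: make "opt" the batch size you expect to see most often,
# e.g. 4 if you benchmark with --shapes=images:4x3x640x640.
profile.set_shape("images",
                  (1, 3, 640, 640),    # min
                  (4, 3, 640, 640),    # opt
                  (16, 3, 640, 640))   # max
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("yolov7-tiny-nms.trt", "wb") as f:
    f.write(engine_bytes)
```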

@triple-Mu closed this on Jul 28, 2022
@triple-Mu reopened this on Jul 28, 2022
@triple-Mu (Contributor Author)

Great work!
Looking forward to your Triton tutorial!

@AlexeyAB (Collaborator)

Hi @philipp-schmidt, hi @triple-Mu,

Now I see: there are cases where a static batch is better (it is more stable in general and faster for batch=1), while in other cases a dynamic batch is better.

> I would need a confirmation of interest from the authors here.

Yes, we are interested in this!

@philipp-schmidt (Contributor)

#346

Here you can also see the difference that dynamic batch size makes: yolov7 performance dynamic batch size

With dynamic batch size (min 1, opt 8, max 8):
Concurrent clients: 16, throughput: 590.119 infer/sec, latency 27080 usec
Without dynamic batch size:
Concurrent clients: 16, throughput: 335.587 infer/sec, latency 47616 usec

So roughly 76% more throughput at only 57% of the latency.

@AlexeyAB merged commit a7c0029 into WongKinYiu:main on Jul 28, 2022
@AlexeyAB (Collaborator) commented Jul 28, 2022

@triple-Mu Thanks!

What does this value 7680 mean in the export command? https://github.com/WongKinYiu/yolov7/blob/main/tools/YOLOv7-Dynamic-Batch-ONNXRUNTIME.ipynb

python export.py --weights ./yolov7-tiny.pt --grid --end2end --simplify \
    --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 \
    --img-size 640 640 \
    --dynamic-batch \
    --max-wh 7680

@triple-Mu (Contributor Author) commented Jul 29, 2022

Here is my introduction: #273 (comment)

It is the same as the max_wh used in https://github.com/WongKinYiu/yolov7/blob/main/utils/general.py#L681-L681
and https://github.com/WongKinYiu/yolov7/blob/main/utils/general.py#L619-L619.
We support 3 NMS variants:

  1. For onnxruntime class-agnostic NMS, set it to 0.
  2. For onnxruntime non-agnostic NMS, set it to a large integer, for example 7680, 4096, or 640.
  3. For TensorRT EfficientNMS, which is non-agnostic by default, we keep the default None.

Therefore, max-wh is used to distinguish which of these NMS variants the exported graph uses; a sketch of the trick it relies on follows below.
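
For illustration, a minimal sketch (hypothetical helper name, torchvision's NMS used in place of the repo's code) of the max-wh trick the exported graph relies on: offsetting each box by class_id * max_wh pushes boxes of different classes into disjoint coordinate ranges, so a single class-agnostic NMS call behaves like per-class NMS, as long as max_wh is larger than any possible image side (hence values like 7680).

```python
import torch
from torchvision.ops import nms


def nms_with_max_wh(boxes, scores, classes, iou_thres=0.65, max_wh=7680):
    """Per-class NMS built from a class-agnostic NMS op.

    boxes: (N, 4) xyxy, scores: (N,), classes: (N,) integer class ids.
    Shifting every box by class_id * max_wh means boxes of different
    classes can never overlap, so the agnostic NMS below never
    suppresses across classes. With max_wh=0 the shift vanishes and
    the call degenerates to plain class-agnostic NMS.
    """
    offsets = classes.to(boxes) * max_wh            # (N,)
    keep = nms(boxes + offsets[:, None], scores, iou_thres)
    return keep
```

Read this way, --max-wh 0 gives purely class-agnostic NMS for onnxruntime, a large value such as 7680 gives per-class behaviour, and TensorRT's EfficientNMS plugin handles classes natively, so no offset (and no max-wh) is needed there.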

@triple-Mu deleted the dynamic-batch branch on July 29, 2022 05:24
@akashAD98 (Contributor)

@triple-Mu @philipp-schmidt Can you add code for inference on video? After the model is converted to .trt, we want to run inference on video.

@philipp-schmidt (Contributor)

Already there; see "python3 client.py video {input}".
