Support dynamic batch for TensorRT and onnxruntime #329
Conversation
Closing #280 in favor of this PR.
@triple-Mu Why don't we use dynamic batch by default?
@AlexeyAB

Static batch=1
[07/28/2022-15:56:12] [I] === Performance summary ===
[07/28/2022-15:56:12] [I] Throughput: 863.856 qps
[07/28/2022-15:56:12] [I] Latency: min = 1.08133 ms, max = 1.18787 ms, mean = 1.14559 ms, median = 1.1438 ms, percentile(99%) = 1.18066 ms
[07/28/2022-15:56:12] [I] Enqueue Time: min = 0.0108643 ms, max = 0.0632324 ms, mean = 0.0199817 ms, median = 0.0189209 ms, percentile(99%) = 0.0423584 ms
[07/28/2022-15:56:12] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:56:12] [I] GPU Compute Time: min = 1.08133 ms, max = 1.18787 ms, mean = 1.14559 ms, median = 1.1438 ms, percentile(99%) = 1.18066 ms
[07/28/2022-15:56:12] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:56:12] [I] Total Host Walltime: 3.00282 s
[07/28/2022-15:56:12] [I] Total GPU Compute Time: 2.97166 s
[07/28/2022-15:56:12] [W] * GPU compute time is unstable, with coefficient of variance = 1.19178%.
[07/28/2022-15:56:12] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:56:12] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:56:12] [V]
[07/28/2022-15:56:12] [V] === Explanations of the performance metrics ===
[07/28/2022-15:56:12] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/28/2022-15:56:12] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/28/2022-15:56:12] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/28/2022-15:56:12] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/28/2022-15:56:12] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/28/2022-15:56:12] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/28/2022-15:56:12] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/28/2022-15:56:12] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/28/2022-15:56:12] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch1.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:1x3x640x640

Static batch=16
[07/28/2022-15:58:53] [I] === Performance summary ===
[07/28/2022-15:58:53] [I] Throughput: 71.0509 qps
[07/28/2022-15:58:53] [I] Latency: min = 13.8823 ms, max = 15.3702 ms, mean = 14.0723 ms, median = 14.0103 ms, percentile(99%) = 15.1163 ms
[07/28/2022-15:58:53] [I] Enqueue Time: min = 0.0202637 ms, max = 0.167694 ms, mean = 0.105239 ms, median = 0.112793 ms, percentile(99%) = 0.154663 ms
[07/28/2022-15:58:53] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:58:53] [I] GPU Compute Time: min = 13.8823 ms, max = 15.3702 ms, mean = 14.0723 ms, median = 14.0103 ms, percentile(99%) = 15.1163 ms
[07/28/2022-15:58:53] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:58:53] [I] Total Host Walltime: 3.026 s
[07/28/2022-15:58:53] [I] Total GPU Compute Time: 3.02554 s
[07/28/2022-15:58:53] [W] * GPU compute time is unstable, with coefficient of variance = 1.61678%.
[07/28/2022-15:58:53] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:58:53] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:58:53] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch16.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:16x3x640x640

Static batch=32
[07/28/2022-15:59:33] [I] === Performance summary ===
[07/28/2022-15:59:33] [I] Throughput: 36.1283 qps
[07/28/2022-15:59:33] [I] Latency: min = 27.0797 ms, max = 31.2945 ms, mean = 27.6771 ms, median = 27.4131 ms, percentile(99%) = 30.8654 ms
[07/28/2022-15:59:33] [I] Enqueue Time: min = 0.0166016 ms, max = 0.152283 ms, mean = 0.096474 ms, median = 0.104919 ms, percentile(99%) = 0.144287 ms
[07/28/2022-15:59:33] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:59:33] [I] GPU Compute Time: min = 27.0797 ms, max = 31.2945 ms, mean = 27.6771 ms, median = 27.4131 ms, percentile(99%) = 30.8654 ms
[07/28/2022-15:59:33] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-15:59:33] [I] Total Host Walltime: 3.0447 s
[07/28/2022-15:59:33] [I] Total GPU Compute Time: 3.04448 s
[07/28/2022-15:59:33] [W] * GPU compute time is unstable, with coefficient of variance = 2.442%.
[07/28/2022-15:59:33] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-15:59:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-15:59:33] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=./yolov7-tiny-batch32.plan --verbose --useCudaGraph --noDataTransfers --shapes=images:32x3x640x640

Dynamic batch=1
[07/28/2022-16:00:55] [I] === Performance summary ===
[07/28/2022-16:00:55] [I] Throughput: 716.632 qps
[07/28/2022-16:00:55] [I] Latency: min = 1.3476 ms, max = 1.5657 ms, mean = 1.38629 ms, median = 1.38037 ms, percentile(99%) = 1.55853 ms
[07/28/2022-16:00:55] [I] Enqueue Time: min = 0.0109863 ms, max = 0.0975342 ms, mean = 0.022185 ms, median = 0.0200195 ms, percentile(99%) = 0.071167 ms
[07/28/2022-16:00:55] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:00:55] [I] GPU Compute Time: min = 1.3476 ms, max = 1.5657 ms, mean = 1.38629 ms, median = 1.38037 ms, percentile(99%) = 1.55853 ms
[07/28/2022-16:00:55] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:00:55] [I] Total Host Walltime: 3.00294 s
[07/28/2022-16:00:55] [I] Total GPU Compute Time: 2.9833 s
[07/28/2022-16:00:55] [W] * GPU compute time is unstable, with coefficient of variance = 2.10486%.
[07/28/2022-16:00:55] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:00:55] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:00:55] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:1x3x640x640

Dynamic batch=16
[07/28/2022-16:01:33] [I] === Performance summary ===
[07/28/2022-16:01:33] [I] Throughput: 70.9378 qps
[07/28/2022-16:01:33] [I] Latency: min = 13.8496 ms, max = 15.4686 ms, mean = 14.0947 ms, median = 14.037 ms, percentile(99%) = 15.0651 ms
[07/28/2022-16:01:33] [I] Enqueue Time: min = 0.0184326 ms, max = 0.318115 ms, mean = 0.1048 ms, median = 0.112305 ms, percentile(99%) = 0.220886 ms
[07/28/2022-16:01:33] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:33] [I] GPU Compute Time: min = 13.8496 ms, max = 15.4686 ms, mean = 14.0947 ms, median = 14.037 ms, percentile(99%) = 15.0651 ms
[07/28/2022-16:01:33] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:33] [I] Total Host Walltime: 3.03083 s
[07/28/2022-16:01:33] [I] Total GPU Compute Time: 3.03036 s
[07/28/2022-16:01:33] [W] * GPU compute time is unstable, with coefficient of variance = 1.65175%.
[07/28/2022-16:01:33] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:01:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:01:33] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:16x3x640x640

Dynamic batch=32
[07/28/2022-16:01:51] [I] === Performance summary ===
[07/28/2022-16:01:51] [I] Throughput: 35.8421 qps
[07/28/2022-16:01:51] [I] Latency: min = 27.3459 ms, max = 29.6284 ms, mean = 27.8981 ms, median = 27.7662 ms, percentile(99%) = 29.568 ms
[07/28/2022-16:01:51] [I] Enqueue Time: min = 0.0256348 ms, max = 0.164413 ms, mean = 0.106011 ms, median = 0.110474 ms, percentile(99%) = 0.153809 ms
[07/28/2022-16:01:51] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:51] [I] GPU Compute Time: min = 27.3459 ms, max = 29.6284 ms, mean = 27.8981 ms, median = 27.7662 ms, percentile(99%) = 29.568 ms
[07/28/2022-16:01:51] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(99%) = 0 ms
[07/28/2022-16:01:51] [I] Total Host Walltime: 3.06901 s
[07/28/2022-16:01:51] [I] Total GPU Compute Time: 3.06879 s
[07/28/2022-16:01:51] [W] * GPU compute time is unstable, with coefficient of variance = 1.55253%.
[07/28/2022-16:01:51] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/28/2022-16:01:51] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/28/2022-16:01:51] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # trtexec --loadEngine=yolov7-tiny-nms.trt --verbose --useCudaGraph --noDataTransfers --shapes=images:32x3x640x640

Performance Mean Latency Summary
As shown above, at the same batch size the dynamic-batch engine achieves almost the same inference performance as the static-batch engines, but we no longer need to export three separate batch models! Note that the inference scripts for dynamic batches are different.
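To make the comparison concrete, the mean "GPU Compute Time" values reported in the logs above can be reduced to per-image latency. This is a small illustrative calculation using only the numbers from those runs (one particular GPU); the dictionaries below are built by hand from the logs, not produced by any yolov7 script.

```python
# Mean GPU Compute Time (ms) per query, copied from the trtexec logs above.
static_mean_ms = {1: 1.14559, 16: 14.0723, 32: 27.6771}
dynamic_mean_ms = {1: 1.38629, 16: 14.0947, 32: 27.8981}

for batch in (1, 16, 32):
    per_img_static = static_mean_ms[batch] / batch
    per_img_dynamic = dynamic_mean_ms[batch] / batch
    # Relative cost of the dynamic engine vs. the static one at this batch.
    overhead_pct = (dynamic_mean_ms[batch] / static_mean_ms[batch] - 1) * 100
    print(f"batch={batch:2d}  static={per_img_static:.3f} ms/img  "
          f"dynamic={per_img_dynamic:.3f} ms/img  overhead={overhead_pct:+.1f}%")
```

The dynamic engine pays a visible penalty only at batch=1 (roughly +21%); at batch 16 and 32 the difference is well under 1%, which matches the "almost the same performance" observation.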
Hello @triple-Mu, hello @AlexeyAB,

Great work!
Hi @philipp-schmidt, hi @triple-Mu. Now I see: there are cases where static batch is better; it is more stable in general, and faster for batch=1.
Yes, we are interested in this!
Also, here you can see the difference a dynamic batch size makes (yolov7 performance with dynamic batch size, min 1, opt 8, max 8): 76% more throughput at only 57% of the latency.
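The throughput gain from batching also shows up directly in the qps numbers of the dynamic-engine runs above: effective image throughput is qps times batch size. A quick check with the logged values (copied by hand, illustrative only):

```python
# Throughput (qps) of the dynamic-batch engine, from the trtexec logs above.
qps = {1: 716.632, 16: 70.9378, 32: 35.8421}

for batch, q in sorted(qps.items()):
    # One "query" processes `batch` images, so images/s = qps * batch.
    print(f"batch={batch:2d}  {q:8.3f} qps  ->  {q * batch:7.1f} images/s")
```

Going from batch 1 to batch 16 raises effective throughput from about 717 to about 1135 images/s on this GPU, at the cost of higher per-query latency, which is the usual throughput/latency trade-off when choosing the opt shape.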
@triple-Mu Thanks! What does this value mean?

python export.py --weights ./yolov7-tiny.pt --grid --end2end --simplify \
    --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 \
    --img-size 640 640 \
    --dynamic-batch \
    --max-wh 7680
Here is my introduction: #273 (comment). It is the same as https://github.com/WongKinYiu/yolov7/blob/main/utils/general.py#L681-L681
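For readers unfamiliar with that line in general.py: the idea behind an upper bound like --max-wh is to offset each box by its class id times a value larger than any image dimension, so a single class-agnostic NMS pass never suppresses boxes of different classes (shifted boxes of different classes cannot overlap). A minimal numpy sketch of that trick, with a hypothetical helper name:

```python
import numpy as np

def class_offset_boxes(boxes, class_ids, max_wh=7680):
    """Shift each (x1, y1, x2, y2) box by class_id * max_wh.

    After the shift, boxes of different classes live in disjoint coordinate
    ranges, so their IoU is zero and one class-agnostic NMS pass behaves
    like per-class NMS. Illustrative sketch, not the yolov7 source code.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    class_ids = np.asarray(class_ids)
    return boxes + class_ids[:, None] * max_wh

# Two heavily overlapping boxes, but with different classes:
boxes = np.array([[10.0, 10.0, 50.0, 50.0],
                  [12.0, 12.0, 52.0, 52.0]])
classes = np.array([0, 1])
shifted = class_offset_boxes(boxes, classes)
# The class-0 box is unchanged; the class-1 box now starts at x = 7692,
# far outside any 640x640 image, so NMS cannot suppress one with the other.
```

This is why --max-wh must be at least as large as the maximum image width/height you expect at inference time.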
@triple-Mu @philipp-schmidt Can you add code for inference on video? After the model is converted into .trt we want to run inference on video.
Already there, see "python3 client.py video {input}"
@philipp-schmidt It's not in the Google Colab which you have provided.
* Support dynamic batch for TensorRT and onnxruntime
* Fix output name
* Add some images
* Add dynamic-batch usage notebook
* Add example notebook for onnxruntime and tensorrt
As shown in the yolov7 issue #273 (comment), we will support dynamic batch for end2end detection!
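The trtexec runs above only load prebuilt engines with --shapes; to build the dynamic engine itself from the exported ONNX, trtexec needs an optimization profile via --minShapes/--optShapes/--maxShapes. A sketch of such an invocation, under the assumption that the exported file is named yolov7-tiny-nms.onnx (adjust names to your export):

```shell
# Build a TensorRT engine with a dynamic batch dimension (1..32, tuned for 16).
trtexec \
  --onnx=yolov7-tiny-nms.onnx \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:16x3x640x640 \
  --maxShapes=images:32x3x640x640 \
  --saveEngine=yolov7-tiny-nms.trt \
  --fp16
```

Any batch size within [min, max] can then be passed at runtime; TensorRT optimizes kernels for the opt shape, which is consistent with batch=1 being slightly slower on the dynamic engine in the benchmarks above.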