
Multiple TF export improvements #4824

Merged — 4 commits merged into ultralytics:master on Sep 16, 2021

Conversation

@zldrobit (Contributor) commented Sep 16, 2021

  • Fuse YOLOv5 models before TF export.
  • Set params to non-trainable during the export process.
  • Fix TFLite fp16 model export.

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Improved support for TensorFlow Lite (TFLite) export and Keras model conversion πŸš€

πŸ“Š Key Changes

  • Set keras_model.trainable to False to freeze the model weights during export.
  • Changed the TFLite file naming convention to include the floating-point precision (-fp16) in the filename.
  • Added support for exporting to TensorFlow Lite with float16 quantization (see the sketch after this list).
  • Updated Conv2D initialization in TensorFlow models to check whether the layer is followed by batch normalization and to add a bias only when it is not.
  • Enforced fusing of model layers during loading for all TensorFlow export formats.
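
A minimal sketch of these export steps, assuming keras_model stands in for the converted Keras model built during export (the output filename is illustrative):

import tensorflow as tf

keras_model.trainable = False  # freeze weights so the exported graph is inference-only
keras_model.summary()

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.target_spec.supported_types = [tf.float16]  # fp16 post-training quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

open('yolov5s-fp16.tflite', 'wb').write(tflite_model)  # precision suffix in the filename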

🎯 Purpose & Impact

  • Freezing the Keras model weights during export ensures the model remains unchanged for inference, improving reliability.
  • New naming convention for TFLite files clarifies the precision level of the exported model, aiding in user understanding.
  • Float16 support in TFLite can reduce model size, potentially leading to faster inference on compatible hardware while maintaining good accuracy.
  • Correct handling of biases in Conv2D layers during TensorFlow model conversion improves model equivalence between PyTorch and TensorFlow implementations.
  • Enforcing layer fusing aligns PyTorch model loading with TensorFlow's optimization practices, leading to potentially faster and more efficient inference operations.

Overall, these changes aim to enhance the user experience by providing more efficient and understandable model export options, ensuring more consistent cross-platform model performance. πŸ“ˆπŸ”’
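
To illustrate the Conv2D bias handling mentioned in the key changes, here is a hedged sketch of how a PyTorch conv layer might map to a Keras Conv2D, adding a bias only when no batch normalization follows; pt_conv and has_bn are hypothetical names, not the repo's actual code:

import tensorflow as tf

def make_conv2d(pt_conv, has_bn):
    # pt_conv is assumed to be a torch.nn.Conv2d; add a bias only when there is no BatchNorm
    return tf.keras.layers.Conv2D(
        filters=pt_conv.out_channels,
        kernel_size=pt_conv.kernel_size,
        strides=pt_conv.stride,
        padding='valid',
        use_bias=not has_bn,
        # PyTorch weights are (out, in, kH, kW); Keras expects (kH, kW, in, out)
        kernel_initializer=tf.keras.initializers.Constant(
            pt_conv.weight.permute(2, 3, 1, 0).detach().numpy()),
        bias_initializer='zeros' if has_bn else tf.keras.initializers.Constant(
            pt_conv.bias.detach().numpy()))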

@glenn-jocher (Member)

@zldrobit can confirm super speedup on TFLite detect.py inference:

Before

detect: weights=['yolov5s.tflite'], source=data/images, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False
YOLOv5 πŸš€ v5.0-436-g6b44ecd torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

image 1/2 /content/yolov5/data/images/bus.jpg: 640x640 4 class0s, 1 class5, Done. (23.460s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 640x640 2 class0s, 2 class27s, Done. (23.573s)
Speed: 5.0ms pre-process, 23516.5ms inference, 8.4ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp

This PR

detect: weights=['yolov5s-fp16.tflite'], source=data/images, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False
YOLOv5 πŸš€ v3.0-901-gfe2b1ec torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

image 1/2 /content/yolov5/data/images/bus.jpg: 640x640 4 class0s, 1 class5, Done. (0.403s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 640x640 2 class0s, 2 class27s, Done. (0.320s)
Speed: 4.8ms pre-process, 361.5ms inference, 7.7ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp

@glenn-jocher (Member)

@zldrobit also confirm trainable params are now 0:

Total params: 7,266,973
Trainable params: 0
Non-trainable params: 7,266,973

@glenn-jocher glenn-jocher changed the title from "Fix issues discussed in https://github.com/ultralytics/yolov5/pull/4479" to "Multiple TF export improvements" on Sep 16, 2021
@glenn-jocher glenn-jocher merged commit 3beb871 into ultralytics:master Sep 16, 2021
@glenn-jocher (Member)

@zldrobit PR is merged. Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐

@glenn-jocher glenn-jocher added the enhancement (New feature or request) label on Sep 16, 2021
@alexdwu13

@zldrobit now that the TFLite export defaults to FP16 post-training quantization, how does the --half argument fit in? Is there any difference between enabling the flag and leaving it off?

@glenn-jocher (Member)

@alexdwu13 that's a good question. --half is really meant for exporting other formats like ONNX in FP16. I'm not sure what effect, if any, --half has on TF/TFLite/TFjs models. Have you exported with and without --half to compare?

@alexdwu13

@glenn-jocher so I've actually been unable to use the --half option successfully, as it requires --device 0, which breaks for me both on Colab and in my local environment. It seems this bug has been addressed before, though the fix does not seem applicable to the current code (#2106).

But I did compare tf.float16 vs. the default (tf.float32) by commenting out line 191 in yolov5/export.py:

# converter.target_spec.supported_types = [tf.float16]

If you drop the two models into https://netron.app/ you can see that yolov5s6-fp16.tflite has a number of Dequantize layers compared to yolov5s6-fp32.tflite. Additionally, the FP16 model runs ~20% slower with the TFLite GPU delegate because the additional Dequantize layers run on the CPU:

$ ./android_aarch64_benchmark_model --use_gpu=true --graph=yolov5s6-fp32.tflite                                                                                    
STARTING!
Log parameter values verbosely: [0]
Graph: [yolov5s6-fp32.tflite]
Use gpu: [1]
Loaded model yolov5s6-fp32.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 13.2676
Initialized session in 2974.27ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=219376 curr=202144 min=197099 max=219376 avg=206206 std=9537

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=202314 curr=198963 min=197728 max=204225 avg=201360 std=1733
$ ./android_aarch64_benchmark_model --use_gpu=true --graph=yolov5s6-fp16.tflite                                                                                    
STARTING!
Log parameter values verbosely: [0]
Graph: [yolov5s6-fp16.tflite]
Use gpu: [1]
Loaded model yolov5s6-fp16.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Following operations are not supported by GPU delegate:
DEQUANTIZE: 
521 operations will run on the GPU, and the remaining 5 operations will run on the CPU.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
Explicitly applied GPU delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
The input model file size (MB): 25.5305
Initialized session in 1865.52ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=228432 curr=253663 min=228432 max=253663 avg=239246 std=10610

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=252072 curr=248613 min=235499 max=256871 avg=251860 std=4912

Since the GPU delegate should be able to run natively in FP16, it seems strange that these Dequantize ops are added. I wonder if they'd go away if the model was exported with --half --device 0...

yolov5s6-fp32.tflite.zip
yolov5s6-fp16.tflite.zip

@glenn-jocher (Member)

@alexdwu13 yes, it's true that there is an assert to avoid using --half on CPU. This is because PyTorch is unable to run CPU inference with FP16 models, and we do dry runs with the PyTorch model here, for example to build grids:

yolov5/export.py

Lines 296 to 298 in 2c2ef25

for _ in range(2):
y = model(im) # dry runs
print(f"\n{colorstr('PyTorch:')} starting from {file} ({file_size(file):.1f} MB)")

Maybe we could cast the model and image to half after this pytorch inference?
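
For illustration, a hedged sketch of that idea (not the repo's actual code): keep the dry runs in FP32 and apply the cast only afterwards, reusing the existing model, im and half variables from export.py:

for _ in range(2):
    y = model(im)  # dry runs in FP32 to build grids on CPU

if half:
    im, model = im.half(), model.half()  # cast to FP16 only after the dry runs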

@glenn-jocher (Member) commented Sep 25, 2021

@alexdwu13 OK, I ran an experiment. If I cast to .half() after PyTorch inference, I get errors on TorchScript and ONNX export:

PyTorch: starting from /Users/glennjocher/PycharmProjects/yolov5/yolov5s.pt (14.8 MB)

TorchScript: starting export with torch 1.9.1...
TorchScript: export failure: "unfolded2d_copy" not implemented for 'Half'

ONNX: starting export with onnx 1.10.1...
ONNX: export failure: "unfolded2d_copy" not implemented for 'Half'

But TFLite export works fine, though the exported -fp16.tflite model still has Dequantize blocks in it. Interestingly, all of the actual Conv layers in the -fp16.tflite model are defined in FP32, so yes, it seems like providing the model directly in FP32 is most efficient unless there is a way to have the TFLite Conv layers exist natively in FP16. @zldrobit what do you think?

[Screenshot: Netron view of the exported -fp16.tflite model showing Conv layers in FP32]

@zldrobit (Contributor, Author)

@alexdwu13 @glenn-jocher I inspected the fp32 model @alexdwu13 provided and found that it is actually an int8 model. Also, the size of *-fp32.tflite is half of that of *-fp16.tflite.

TFLite now supports int8 model acceleration via the GPU delegate (tensorflow/tensorflow#41485 (comment)). This explains why the *-fp32.tflite model (actually int8) runs faster than the *-fp16.tflite model.
In order to export an fp32 TFLite model,

yolov5/export.py

Lines 191 to 192 in 39c17ce

converter.target_spec.supported_types = [tf.float16]
converter.optimizations = [tf.lite.Optimize.DEFAULT]

have to be commented out.
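
A hedged sketch of that plain fp32 export (keras_model stands in for the converted Keras model; the int8 path additionally needs a representative dataset and is omitted here):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)

# fp32 export: use the converter as-is, with no post-training quantization
# converter.target_spec.supported_types = [tf.float16]   # enable for fp16 export
# converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable for fp16 export

open('yolov5s-fp32.tflite', 'wb').write(converter.convert())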

I tested *-fp32.tflite, *-fp16.tflite and *-int8.tflite, and the results are as follows:

gemini:/data/local/tmp $ ./android_aarch64_benchmark_model --use_gpu=true --graph=yolov5s-fp32.tflite      
STARTING!                                                                                                    
Log parameter values verbosely: [0]                                                                          
Graph: [yolov5s-fp32.tflite]                                                                                 
Use gpu: [1]                                                                                                 
Loaded model yolov5s-fp32.tflite                                                                             
INFO: Initialized TensorFlow Lite runtime.                                                                   
INFO: Created TensorFlow Lite delegate for GPU.                                                              
GPU delegate created.                                                                                        
INFO: Replacing 268 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions.                 
INFO: Initialized OpenCL-based API.                                                                          
INFO: Created 1 GPU delegate kernels.                                                                        
Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.            
The input model file size (MB): 29.2149                                                                      
Initialized session in 6345.86ms.                                                                            
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds. 
count=3 first=211973 curr=186730 min=179841 max=211973 avg=192848 std=13812                                  
                                                                                                             
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.  
count=50 first=188118 curr=186836 min=184382 max=190163 avg=186622 std=1324                                  
gemini:/data/local/tmp $ ./android_aarch64_benchmark_model --use_gpu=true --graph=yolov5s-fp16.tflite       
STARTING!                                                                                                   
Log parameter values verbosely: [0]                                                                         
Graph: [yolov5s-fp16.tflite]                                                                                
Use gpu: [1]                                                                                                
Loaded model yolov5s-fp16.tflite                                                                            
INFO: Initialized TensorFlow Lite runtime.                                                                  
INFO: Created TensorFlow Lite delegate for GPU.                                                             
GPU delegate created.                                                                                       
INFO: Replacing 404 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions.                
INFO: Initialized OpenCL-based API.                                                                         
INFO: Created 1 GPU delegate kernels.                                                                       
Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.           
The input model file size (MB): 14.6815                                                                     
Initialized session in 6280.89ms.                                                                           
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=223643 curr=191280 min=188645 max=223643 avg=201189 std=15913                                 
                                                                                                            
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds. 
count=50 first=190210 curr=192567 min=185459 max=193569 avg=189499 std=1719                                 
gemini:/data/local/tmp $ ./android_aarch64_benchmark_model --use_gpu=true --graph=yolov5s-int8.tflite       
STARTING!                                                                                                   
Log parameter values verbosely: [0]                                                                         
Graph: [yolov5s-int8.tflite]                                                                                
Use gpu: [1]                                                                                                
Loaded model yolov5s-int8.tflite                                                                            
INFO: Initialized TensorFlow Lite runtime.                                                                  
INFO: Created TensorFlow Lite delegate for GPU.                                                             
GPU delegate created.                                                                                       
INFO: Replacing 293 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions.                
INFO: Initialized OpenCL-based API.                                                                         
INFO: Created 1 GPU delegate kernels.                                                                       
Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.           
The input model file size (MB): 7.75111                                                                     
Initialized session in 8884.92ms.                                                                           
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=242261 curr=205828 min=199662 max=242261 avg=215917 std=18797                                 
                                                                                                            
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds. 
count=50 first=206814 curr=200704 min=196685 max=209705 avg=204450 std=2308                                 

The fp32 and fp16 models take almost the same time. This is because --gpu_precision_loss_allowed (allow computation in lower precision than FP32 on the GPU) is enabled by default, and it turns on fp16 inference even when using fp32 models. I turned off this option and ran the fp32 model again:

gemini:/data/local/tmp $ ./android_aarch64_benchmark_model --use_gpu=true --gpu_precision_loss_allowed=false --graph=yolov5s-fp32.tflite
STARTING!                                                                                                                               
Log parameter values verbosely: [0]                                                                                                     
Graph: [yolov5s-fp32.tflite]                                                                                                            
Use gpu: [1]                                                                                                                            
Allow lower precision in gpu: [0]                                                                                                       
Loaded model yolov5s-fp32.tflite                                                                                                        
INFO: Initialized TensorFlow Lite runtime.                                                                                              
INFO: Created TensorFlow Lite delegate for GPU.                                                                                         
GPU delegate created.                                                                                                                   
INFO: Replacing 268 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions.                                            
INFO: Initialized OpenCL-based API.                                                                                                     
INFO: Created 1 GPU delegate kernels.                                                                                                   
Explicitly applied GPU delegate, and the model graph will be completely executed by the delegate.                                       
The input model file size (MB): 29.2149                                                                                                 
Initialized session in 5587.49ms.                                                                                                       
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.                            
count=2 first=469445 curr=438071 min=438071 max=469445 avg=453758 std=15687                                                             
                                                                                                                                        
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.                             
count=50 first=438022 curr=441749 min=431991 max=442449 avg=438025 std=1666                                                             

The elapsed time is more than doubled.

The Dequantize op is not performed by the GPU delegate; it is only performed on the CPU (https://www.tensorflow.org/lite/performance/post_training_quantization#float16_quantization).

Therefore, an fp16 model is as efficient as an fp32 model on the GPU delegate, while being 50% smaller.

yolov5s-fp32.tflite.zip
yolov5s-fp16.tflite.zip
yolov5s-int8.tflite.zip

@alexdwu13

@zldrobit thank you for catching that! Yeah, it looks like converter.optimizations = [tf.lite.Optimize.DEFAULT] defaults to int8 – I guess I would have caught that if I had noticed the "fp32" model size was half that of the fp16... that's on me.

It looks like the reason I was getting slower execution times with the fp16 model was that I was using an older version of android_aarch64_benchmark_model, which required the Dequantize nodes to run on the CPU:

...
ERROR: Following operations are not supported by GPU delegate:
DEQUANTIZE: 
521 operations will run on the GPU, and the remaining 5 operations will run on the CPU
...

With the newest version of the benchmark tool – which uses TfLiteGpuDelegateV2 – I'm able to confirm that fp16 gives roughly the same latency as fp32.

So my question is: assuming --gpu_precision_loss_allowed=true, what's the difference under the hood between an fp16 post-quantized model and an fp32 model? Besides memory footprint, is there any advantage to quantizing to fp16 over using the default fp32 with --gpu_precision_loss_allowed=true?

@JNaranjo-Alcazar commented Sep 28, 2021

Very interesting thread!
I would like to know the fastest possible configuration on a Raspberry Pi.
As far as I know, first of all the model should be exported as TFLite int8.

Then, at prediction time with detect.py, I don't see any difference between setting the --half flag or not. I understand this flag only works with a GPU.

The inference time per image is about 400 ms.

Do I have any options left that might allow me to make inference faster?

@zldrobit (Contributor, Author)

@alexdwu13

So my question is: assuming --gpu_precision_loss_allowed=true, what's the difference under the hood between an fp16 post-quantized model and an fp32 model?

I cannot find any documents that elaborate on the mechanism of fp16 TFLite inference, but you could refer to some more general materials, e.g. on ARM GPUs and Nvidia GPUs.

Besides memory footprint, is there any advantage to quantizing to fp16 over using the default fp32 with --gpu_precision_loss_allowed=true?

Considering that a cell phone has less storage than a PC, mobile app developers prefer small models to large ones, so they would choose fp16 or even int8 models when running on the GPU. Note that fp16 precision is enabled by default when using the GPU with the TFLite Java API on Android (https://stackoverflow.com/a/62088843/3036450).

CesarBazanAV pushed a commit to CesarBazanAV/yolov5 that referenced this pull request Sep 29, 2021
* Add fused conv support

* Set all saved_model values to non trainable

* Fix TFLite fp16 model export

* Fix int8 TFLite conversion
@zldrobit (Contributor, Author)

@JNaranjo-Alcazar The --half option does not take effect in TFLite export. Running python export.py --include tflite now only exports fp16 TFLite models, whether or not --half is given. If your RPi has a GPU, you could try running with the GPU delegate for acceleration; that may require building the TFLite GPU delegate for Linux (tensorflow/tensorflow#38336 (comment), zldrobit/onnx_tflite_yolov3#15).
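
For reference, a hedged sketch (not from the repo) of running an exported fp16 TFLite model with an external GPU delegate from Python; the delegate library path is hypothetical and depends on how it was built:

import numpy as np
import tensorflow as tf

delegate = tf.lite.experimental.load_delegate('libtensorflowlite_gpu_delegate.so')  # hypothetical path
interpreter = tf.lite.Interpreter(model_path='yolov5s-fp16.tflite',
                                  experimental_delegates=[delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp['index'], np.zeros(inp['shape'], dtype=inp['dtype']))
interpreter.invoke()
pred = interpreter.get_tensor(out['index'])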

PS:
I don't have an RPI right now.

BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
* Add fused conv support

* Set all saved_model values to non trainable

* Fix TFLite fp16 model export

* Fix int8 TFLite conversion