No performance improvement with CUDNN_HALF=1 on Jetson Xavier AGX #5234

Closed
pullmyleg opened this issue Apr 15, 2020 · 13 comments
Labels
Bug fixed The problem is solved by fixing the source code


@pullmyleg commented Apr 15, 2020

Hi @AlexeyAB, I am having the same issue on a Jetson Xavier AGX as seen in issue #4691.

Jetson Xavier AGX

  • Jetpack 4.3 (latest)
  • CUDA 10.0 & cuDNN
  • OpenCV 4.2 (also tried 4.1)
  • Built from the latest darknet repo, cloned 14.04.2020
  • Untouched-from-source yolov3.cfg
  • Pre-trained yolov3.weights
  • Test video
  1. Building with CUDNN_HALF=0 or CUDNN_HALF=1 in the Makefile gives the same AVG_FPS of 14.8 when running the demo with -benchmark. See the details below.

  2. If I build with CMake, it compiles with CUDNN_HALF=0. I am not sure whether this is expected behaviour, a clue to the issue, or an environment problem.

So I have deleted the repo and tried recompiling multiple times with make / make clean, adjusting the Makefile as below. I have also reflashed the device and installed OpenCV 4.2 with CUDA & cuDNN support.

Any ideas for a fix would be greatly appreciated. The FPS I see is exactly the same as @vitotsai's CUDNN_HALF=0 performance.

Make with:

```
GPU=1
CUDNN=1
CUDNN_HALF=0 or 1
OPENCV=1
AVX=0
OPENMP=1
LIBSO=0
ZED_CAMERA=0 # ZED SDK 3.0 and above
ZED_CAMERA_v2_8=0 # ZED SDK 2.X

ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]
```

**-benchmark with CUDNN_HALF=0: FPS:14.8 AVG_FPS:14.8**

```
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights cartest.mp4 -benchmark
CUDA-version: 10000 (10000), cuDNN: 7.6.3, GPU count: 1
OpenCV version: 4.1.1
Demo
compute_capability = 720, cudnn_half = 0
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
.....
Total BFLOPS 65.879
avg_outputs = 532444
Allocate additional workspace_size = 52.43 MB
Loading weights from yolov3.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
video file: cartest.mp4
Video stream: 1280 x 720
```

**-benchmark with CUDNN_HALF=1: FPS:14.8 AVG_FPS:14.7**

```
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights cartest.mp4 -benchmark
CUDA-version: 10000 (10000), cuDNN: 7.6.3, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.1.1
Demo
compute_capability = 720, cudnn_half = 1
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
....
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 65.879
avg_outputs = 532444
Allocate additional workspace_size = 52.43 MB
Loading weights from yolov3.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
video file: cartest.mp4
Video stream: 1280 x 720
```

@pullmyleg (Author)

Thanks @AlexeyAB, fixed with the latest check-in.

@AlexeyAB (Owner)

@pullmyleg What FPS can you get now?

AlexeyAB added the **Bug fixed** label (The problem is solved by fixing the source code) on Apr 18, 2020
@pullmyleg (Author)

@AlexeyAB from 14 FPS to 20.9 FPS using the standard YOLOv3 config at 416x416.

This is a huge improvement and allows me to run 1080p detection from the UAV for very small objects (dolphins) using YOLOv3-tiny at 19 FPS.

Thanks!

@AlexeyAB (Owner)

@pullmyleg Do you use yolov3-tiny.cfg with width=1088 height=1088 in the cfg file and get 19 FPS?

@AlexeyAB (Owner)

@pullmyleg Download the latest Darknet code, the new code is +10% faster.

@pullmyleg (Author)

@AlexeyAB thanks. On the standard YOLOv3 config, the latest benchmark is 22.4 FPS on the Jetson Xavier AGX, a ~10% improvement. This is great, thanks!

YOLOv3-tiny runs detection at 19 FPS at 1920 x 1088 (21 FPS now with the latest change). Please let me know if this doesn't make sense.

We are a not-for-profit (MAUI63) and we are looking for the world's rarest dolphin (Maui) using object detection and a large UAV that flies at 120 km/h. If you are interested, see the fundraising video here :). The higher we can fly and the smaller the objects (dolphins) we can detect, the more area we can cover per flight. Once we spot a dolphin, the UAV will circle and follow the pod of dolphins until the pilot tells it to continue surveying.

The goal is to find the model that performs most accurately on the smallest possible objects in 1080p 30 fps footage, using a Jetson Xavier AGX on board. We need a minimum of 12 FPS to be able to spot dolphins at 120 km/h, but I think 20 FPS+ is preferable and will work better; see the rough arithmetic below.
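
As a rough sanity check on those targets (my own back-of-the-envelope arithmetic, not a measurement), this computes how far the UAV travels between processed frames at 120 km/h:

```python
# Ground distance covered between processed frames at 120 km/h.
# Fewer metres per frame means more chances to see each dolphin.
speed_kmh = 120
for fps in (12, 20, 30):
    metres_per_frame = (speed_kmh / 3.6) / fps   # km/h -> m/s, then per frame
    print(f"{fps} FPS -> {metres_per_frame:.2f} m of ground track per frame")
# 12 FPS -> 2.78 m, 20 FPS -> 1.67 m, 30 FPS -> 1.11 m
```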

I am currently training and benchmarking a range of models and configurations for this project, compiled by reading through issues and suggestions on this project. You can see the list below; it is not complete and is a work in progress. I am still gathering results and training the different models for comparison.

If you or anyone else has any suggestions on other models/configurations I should try, please let me know.

Tomorrow I will have access to some new hardware (1 x Tesla V100 32 GB, 48 GB RAM, 12 CPUs) and soon will have access to Azure NCv3 VMs, where I will do some large-batch training trialling GPU & CPU memory. The models so far have been trained on a 1080 Ti (Beast) / 2070 (GS65).

**Training Results**

| Name | Source | Model | Machine | Dataset | CPU Memory | Batch | Subdivisions | Random | Width | Height | Calc Anchors | Iterations | mAP | Xavier FPS | 1920x1080 Video Result (Manual) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yolov3-tiny-maui-1536.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | N | 7200 | 89.7% | 21 | Good @1920x1088 detection | |
| yolov3-tiny-maui-20161152.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 2016 | 1152 | N | 6900 | 84.0% | | | Accuracy dropped with width/height changed; not square, no anchor recalc |
| yolov3-tiny-maui-544544.txt | Darknet | Yolov3-Tiny | GS65 | Complete | N | 64 | 4 | Y | 544 | 544 | N | 19000 | 83.8% | | | Large batch = better result, even at low resolutions; higher iterations also help |
| yolov3-tiny-maui20161152-anc | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 2016 | 1152 | Y | | | | | |
| yolov3-maui-576.txt | Darknet | Yolov3 | Beast | Complete | N | 64 | 16 | Y | 576 | 576 | N | 8800 | 88.4% | | | |
| yolov3-tiny-maui-1536-anc.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | Y | 6900 | 83.9% | | | Worst mAP performance with calculated anchors |
| yolov3-tiny-maui-small-1536 | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 1504 | 1504 | N | | | | | |
| yolov3-tiny-maui-small-1536-anc | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 1504 | 1504 | Y | | | | | |
| yolov3-tiny-maui-small-20161152 | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 2016 | 1152 | N | | | | | |
| yolov3-tiny-maui-small-20161152-anc | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 2016 | 1152 | Y | | | | | |
| yolov3-tiny-maui-bb | Darknet | Yolov3-Tiny | UoA | Complete | N | 64 | 4 | Y | 1536 | 1536 | N | | | | | |
| yolov3-tiny-maui-bb-memory | Darknet | Yolov3-Tiny | Azure | Complete | Y | 64 | 2 | N | 1920 | 1920 | N | | | | | |
| yolov3-19201080 | Darknet | yolov3 | UoA | | | | | | 1920 | 1080 | ? | | | | | |
| Yolov3-SPP | Darknet | yolov3-spp | | | | | | | | | | | | | | |
| Yolov3-SPP-tiny | Darknet | yolov3-spp-tiny | | | | | | | | | | | | | | |
| Yolov3-LSTM | Darknet | yolov3-spp | | Complete - ordered frames | | | | | | | | | | | | |
| Yolov3-LSTM-tiny | Darknet | yolov3-spp-tiny | | Complete - ordered frames | | | | | | | | | | | | |
| Yolo-v3-tp3 | Darknet | yolo_v3_tiny_pan3.cfg | | | | | | | | | | | | | | |
| Yolo_tiny-prn | Darknet | yolo_tiny - prn | | | | | | | | | | | | | | |
| yolov3-tiny-maui-1536-ten | Tensorflow | Yolov3-Tiny | | | | | | | | | | | | | | |
| yolov3-light-maui-1536 | Tensorflow | Yolov3-Tiny-Light | | | | | | | | | | | | | | |
| yolov3-Nano-maui-1536 | Tensorflow | Yolov3-nano | | | | | | | | | | | | | | |
| yolov3-light-maui-1536-spp | Tensorflow | Yolov3-Tiny-Light-SPP | | | | | | | | | | | | | | |
| yolov3-tiny-maui-1536 | Pytorch | Yolov3-Tiny | | | | | | | | | | | | | | |
| yolov3-maui-1536 | Pytorch | Yolov3 | | | | | | | | | | | | | | |
| small network model | Darknet | Network-Model | | | | | | | | | | | | | | |

pullmyleg reopened this on Apr 18, 2020
@AlexeyAB (Owner)

> **Training Results**
> yolov3-tiny-maui-1536.txt | Darknet | Yolov3-Tiny | Beast | Complete | batch 64 | subdivisions 16 | random Y | 1536x1536 | calc anchors N | 7200 iterations | mAP 89.7% | Xavier FPS 21 | Good @1920x1088 detection
  • Does this mean that you trained the model with width=1536 height=1536 in the cfg, and after training changed it to width=1920 height=1088 in the cfg for detection? Don't do that if you train on images from the same camera.

  • Do you use a separate validation dataset for the mAP calculation?


Try to train these 3 yolov3-tiny models - they are implemented for aerial detection: #4495 (comment)

Train with width=1920 height=1088 in the cfg and use the same width=1920 height=1088 for detection, and train using the pre-trained weights file yolov3-tiny.conv.15 (see the example command after the list below): https://github.com/AlexeyAB/darknet#how-to-train-tiny-yolo-to-detect-your-custom-objects

  1. Tiny_3l_rotate_whole_maxout - https://github.com/AlexeyAB/darknet/files/3995740/yolov3-tiny_3l_rotate_whole_maxout.cfg.txt

  2. Tiny_3l_stretch_sway_whole_concat_maxout - https://github.com/AlexeyAB/darknet/files/4003688/yolov3-tiny_3l_stretch_sway_whole_concat_maxout.cfg.txt

  3. Tiny_3l_resize - https://github.com/AlexeyAB/darknet/files/3995772/yolov3-tiny_3l_resize.cfg.txt
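
For reference, a training command in the style of the linked README would look like `./darknet detector train data/obj.data yolov3-tiny_3l_rotate_whole_maxout.cfg yolov3-tiny.conv.15`, where the .data file name is a placeholder for your own data file and the cfg is whichever of the three above you are training.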

@pullmyleg (Author)

Thanks @AlexeyAB

> Does this mean that you trained the model with width=1536 height=1536 in the cfg, and after training changed it to width=1920 height=1088 in the cfg for detection?

Yes, in that example I trained at 1536x1536 and detect at 1920x1088. Why should I not change it? I assume it's because I should train at the same aspect ratio that I want to detect at?

> Don't do that if you train on images from the same camera.

I am training on images from a different camera than the one that will be in the final UAV. The images are frames (6 per second) from 4K video (3840 x 2160 px). I do not have footage of the dolphins from the final UAV camera yet; it is still being built.

> Do you use a separate validation dataset for the mAP calculation?

Yes, the training set is different from the validation set. One of the 8 videos is used in the test set; the final video for manual testing is one of the videos in the validation set.

Complete dataset:

  • Training set: ~9000 images
  • Validation set: ~300 images

Small-objects-only dataset (from high altitude, very small dolphins only):

  • Training set: ~3200 images
  • Validation set: ~700 images

> Try to train these 3 yolov3-tiny models - they are implemented for aerial detection: #4495

Ok, thank you. I will train these next and post results when finished.

@AlexeyAB (Owner)

> Yes, in that example I trained at 1536x1536 and detect at 1920x1088. Why should I not change it? I assume it's because I should train at the same aspect ratio that I want to detect at?

Yes, the aspect ratio should be the same, so use the same network resolution for training and detection.

Also try to train a 4th yolov3-tiny model with width=1920 height=1088 in the cfg: yolo_v3_tiny_pan3_scale_giou.cfg.txt

@pullmyleg (Author)

Hi @AlexeyAB, I know this question has been answered many times, but I just want to confirm that what I am doing with calculated anchors is correct. I understand that the anchors are the widths and heights of the closest-matching objects for that layer, but what I don't understand is why they are split by size across the layers, e.g. why anchors greater than 60x60 go in the first layer.

My understanding from the readme is:

  • Anchors greater than 60x60: layer 1.
  • Anchors greater than 30x30 but smaller than 60x60: layer 2.
  • Anchors smaller than 30x30: layer 3.

Note I have 2 datasets (complete and small). Small is footage from 40 m+ altitude only (very small objects); complete is from 10 m - 40 m (very small and medium-small sized objects).

The anchors below are for the small dataset. I am using the small-object dataset because the smaller the objects we can detect, the higher we can fly and the more area we can cover in one flight.

```
num_of_clusters = 9, width = 1920, height = 1080
 read labels from 3232 images
 loaded 	 image: 3232 	 box: 3019
 all loaded.

 calculating k-means++ ...

 iterations = 16

counters_per_class = 3019

 avg IoU = 79.59 %

Saving anchors to the file: anchors.txt
anchors =  29, 23,  21, 39,  50, 26,  38, 36,  30, 50,  49, 49,  70, 38,  42, 73,  70, 69
```

Option 1 - based on one dimension of each anchor fitting the size range:

Layer 1 mask = 6,7,8
Layer 2 mask = 3,4,5
Layer 3 mask = 0,1,2
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69.

Mask 6 is actually smaller than 60x60 and mask 2 is greater than 30x30, but I noticed a similar approach in the original config when an anchor is close or one of its values is >= 60, e.g. mask 5 in layer 2 of the original config is 59,119, which is greater than 60x60.

Option 2 - based on total anchor area, e.g. 60*60:

Layer 1 mask = 8
Layer 2 mask = 2,3,4,5,6,7
Layer 3 mask = 0,1
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69

I will adjust the filters according to the masks used in each layer, as sketched below.
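
To double-check the Option 2 grouping, here is a quick sketch (my own helper, not Darknet code) that splits the calculated anchors by area using the 60x60 and 30x30 cut-offs:

```python
# Split the k-means anchors across the three [yolo] layers by area (Option 2).
# Layer 1 is the coarsest grid (largest objects), layer 3 the finest.
anchors = [(29, 23), (21, 39), (50, 26), (38, 36), (30, 50),
           (49, 49), (70, 38), (42, 73), (70, 69)]

def layer_for(w, h):
    if w * h >= 60 * 60:
        return 1
    if w * h >= 30 * 30:
        return 2
    return 3

for i, (w, h) in enumerate(anchors):
    print(f"mask {i}: {w}x{h} -> layer {layer_for(w, h)}")
# Reproduces the grouping above: layer 1 = mask 8, layer 2 = masks 2-7,
# layer 3 = masks 0-1.
```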

Thanks again for your help!

@AlexeyAB
Copy link
Owner

There is no strict rule. There is just an empirical recommendation:

  • Anchors greater than 64x64 for the layer behind 5 subsampling layers (stride=2), because it has a receptive field >= 32 = pow(2,5) (actually it is higher than 32x32, because conv3x3 layers also increase the receptive field, not only the layers with stride=2)
  • Anchors greater than 32x32 but smaller than 64x64 for the layer behind 4 subsampling layers (stride=2)
  • Anchors smaller than 32x32 for the layer behind 3 subsampling layers (stride=2)

You can add show_receptive_field=1 to the [net] section of the cfg to print the receptive field size of each layer to the console during network initialization; a rough sketch of the computation is below.
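
A minimal sketch (not Darknet's own implementation) of the standard receptive-field recurrence, showing why a head behind 5 stride=2 conv3x3 layers already sees well over 32x32 input pixels:

```python
# Receptive field of a stack of conv layers: each layer widens the field by
# (kernel - 1) * cumulative stride, and each stride=2 layer doubles the jump.
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 2)] * 5))  # 5 subsampling conv3x3 layers -> 63
print(receptive_field([(3, 2)] * 4))  # 4 subsampling conv3x3 layers -> 31
print(receptive_field([(3, 2)] * 3))  # 3 subsampling conv3x3 layers -> 15
```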


This is a more complex issue - you should take into account the number of objects per image at each size, the number of overlapping objects at each size, ...

I would recommend you to either:

  • use the default anchors, or
  • use Option 2, but add default anchors: 2 anchors to layer 1 and 1 anchor to layer 3

@pullmyleg (Author)

OK, thank you @AlexeyAB. I will try training with both and compare the results.

To confirm, Option 2 with the additional default anchors should look like this (all bold values are new):

Layer 1 mask = 9,10,11
Layer 2 mask = 3,4,5,6,7,8
Layer 3 mask = 0,1,2
anchors = **10,13**, 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69, **116,90**, **156,198**
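
With those masks, the filters= value in the [convolutional] layer before each [yolo] layer has to match (classes + 5) * <number of masks in that layer>. A quick check, assuming a single dolphin class (adjust classes if you have more):

```python
# filters = (classes + 5) * number_of_masks for the conv layer before each
# [yolo] layer; assuming 1 class (dolphin).
masks = {"layer 1": [9, 10, 11],
         "layer 2": [3, 4, 5, 6, 7, 8],
         "layer 3": [0, 1, 2]}
for name, m in masks.items():
    print(f"{name}: filters = {(1 + 5) * len(m)}")
# layer 1: 18, layer 2: 36, layer 3: 18
```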

@AlexeyAB (Owner)

@pullmyleg Yes.
