No performance improvement with CUDNN_HALF=1 on Jetson Xavier AGX #5234

Closed
pullmyleg opened this issue Apr 15, 2020 · 13 comments
Labels
Bug fixed The problem is solved by fixing the source code


@pullmyleg commented Apr 15, 2020

Hi @AlexeyAB, I am having the same issue on a Jetson Xavier AGX as seen in issue #4691.

Jetson Xavier AGX

  • Jetpack 4.3 (latest)
  • CUDA 10.0 & cuDNN
  • OpenCV 4.2 (also tried 4.1)
  • Built from the latest darknet repo, cloned 14.04.2020
  • Untouched-from-source yolov3.cfg
  • Pre-trained yolov3.weights
  • Test video
  1. Building with CUDNN_HALF=0 or CUDNN_HALF=1 in the Makefile gives the same AVG_FPS of 14.8 when running the demo with -benchmark. See the details below.

  2. If I build with CMake, it compiles with CUDNN_HALF=0. I am not sure whether this is expected behaviour, a clue to the issue, or an environment problem.

So I have deleted the repo and tried recompiling multiple times with make / make clean, adjusting the Makefile as below. I have also reflashed the device and installed OpenCV 4.2 with CUDA & cuDNN support.

Any ideas for a fix would be greatly appreciated. The FPS I see is exactly the same as @vitotsai's CUDNN_HALF=0 performance.

Make with:

```
GPU=1
CUDNN=1
CUDNN_HALF=0 or 1
OPENCV=1
AVX=0
OPENMP=1
LIBSO=0
ZED_CAMERA=0 # ZED SDK 3.0 and above
ZED_CAMERA_v2_8=0 # ZED SDK 2.X

ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]
```

**-benchmark with CUDNN_HALF=0: FPS:14.8 AVG_FPS:14.8**

```
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights cartest.mp4 -benchmark
CUDA-version: 10000 (10000), cuDNN: 7.6.3, GPU count: 1
OpenCV version: 4.1.1
Demo
compute_capability = 720, cudnn_half = 0
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
.....
Total BFLOPS 65.879
avg_outputs = 532444
Allocate additional workspace_size = 52.43 MB
Loading weights from yolov3.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
video file: cartest.mp4
Video stream: 1280 x 720
```

**-benchmark with CUDNN_HALF=1: FPS:14.8 AVG_FPS:14.7**

```
./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights cartest.mp4 -benchmark
CUDA-version: 10000 (10000), cuDNN: 7.6.3, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.1.1
Demo
compute_capability = 720, cudnn_half = 1
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
....
[yolo] params: iou loss: mse (2), iou_norm: 0.75, cls_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 65.879
avg_outputs = 532444
Allocate additional workspace_size = 52.43 MB
Loading weights from yolov3.weights...
seen 64, trained: 32013 K-images (500 Kilo-batches_64)
Done! Loaded 107 layers from weights-file
video file: cartest.mp4
Video stream: 1280 x 720
```

@pullmyleg (Author)

Thanks @AlexeyAB, fixed with the latest check-in.

@AlexeyAB (Owner)

@pullmyleg What FPS can you get now?

AlexeyAB added the **Bug fixed** label (The problem is solved by fixing the source code) on Apr 18, 2020
@pullmyleg (Author)

@AlexeyAB from 14 FPS to 20.9 FPS using the standard YOLOv3 config at 416x416.

This is a huge improvement and allows me to run 1080p detection from the UAV for very small objects (dolphins) using YOLOv3-tiny at 19 FPS.

Thanks!

@AlexeyAB (Owner)

@pullmyleg Do you use yolov3-tiny.cfg with width=1088 height=1088 in the cfg file and get 19 FPS?

@AlexeyAB (Owner)

@pullmyleg Download the latest Darknet code, the new code is +10% faster.

@pullmyleg (Author)

@AlexeyAB thanks. On the standard YOLOv3 config, the latest benchmark is 22.4 FPS on the Jetson Xavier AGX, a ~10% improvement. This is great, thanks!

YOLOv3-tiny runs detection at 19 FPS at 1920 x 1088 (21 FPS now with the latest change). Please let me know if this doesn't make sense.

We are a not-for-profit (MAUI63) and we are looking for the world's rarest dolphin (Maui) using object detection and a large UAV that flies at 120 km/h. If you are interested, see the fundraising video here :). The higher we can fly and the smaller the objects (dolphins) we can detect, the more area we can cover per flight. Once we spot a dolphin, the UAV will circle and follow the pod of dolphins until the pilot tells it to continue surveying.

The goal is to find the model that performs most accurately on the smallest possible objects in 1080p 30 fps footage, using a Jetson Xavier AGX on board. We need a minimum of 12 FPS to be able to spot dolphins at 120 km/h, but I think 20 FPS+ is preferable and will work better; see the rough arithmetic below.
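
As a rough sanity check on those targets (my own back-of-the-envelope arithmetic, not a measurement), this computes how far the UAV travels between processed frames at 120 km/h:

```python
# Ground distance covered between processed frames at 120 km/h.
# Fewer metres per frame means more chances to see each dolphin.
speed_kmh = 120
for fps in (12, 20, 30):
    metres_per_frame = (speed_kmh / 3.6) / fps   # km/h -> m/s, then per frame
    print(f"{fps} FPS -> {metres_per_frame:.2f} m of ground track per frame")
# 12 FPS -> 2.78 m, 20 FPS -> 1.67 m, 30 FPS -> 1.11 m
```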

I am currently training and benchmarking a range of models and configurations for this project, compiled by reading through issues and suggestions on this project. You can see the list below; it is not complete and is a work in progress. I am still gathering results and training the different models for comparison.

If you or anyone else has any suggestions on other models/configurations I should try, please let me know.

Tomorrow I will have access to some new hardware (1 x Tesla V100 32 GB, 48 GB RAM, 12 CPUs) and soon will have access to Azure NCv3 VMs, where I will do some large-batch training trialling GPU & CPU memory. The models so far have been trained on a 1080 Ti (Beast) / 2070 (GS65).

**Training Results**

| Name | Source | Model | Machine | Dataset | CPU Memory | Batch | Subdivisions | Random | Width | Height | Calc Anchors | Iterations | mAP | Xavier FPS | 1920x1080 Video Result (Manual) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yolov3-tiny-maui-1536.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | N | 7200 | 89.7% | 21 | Good @1920x1088 detection | |
| yolov3-tiny-maui-20161152.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 2016 | 1152 | N | 6900 | 84.0% | | | Accuracy dropped with width/height changed; not square, no anchor recalc |
| yolov3-tiny-maui-544544.txt | Darknet | Yolov3-Tiny | GS65 | Complete | N | 64 | 4 | Y | 544 | 544 | N | 19000 | 83.8% | | | Large batch = better result, even at low resolutions; higher iterations also help |
| yolov3-tiny-maui20161152-anc | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 2016 | 1152 | Y | | | | | |
| yolov3-maui-576.txt | Darknet | Yolov3 | Beast | Complete | N | 64 | 16 | Y | 576 | 576 | N | 8800 | 88.4% | | | |
| yolov3-tiny-maui-1536-anc.txt | Darknet | Yolov3-Tiny | Beast | Complete | N | 64 | 16 | Y | 1536 | 1536 | Y | 6900 | 83.9% | | | Worst mAP performance with calculated anchors |
| yolov3-tiny-maui-small-1536 | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 1504 | 1504 | N | | | | | |
| yolov3-tiny-maui-small-1536-anc | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 1504 | 1504 | Y | | | | | |
| yolov3-tiny-maui-small-20161152 | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 2016 | 1152 | N | | | | | |
| yolov3-tiny-maui-small-20161152-anc | Darknet | Yolov3-Tiny | Beast | Small obj only | N | 64 | 16 | Y | 2016 | 1152 | Y | | | | | |
| yolov3-tiny-maui-bb | Darknet | Yolov3-Tiny | UoA | Complete | N | 64 | 4 | Y | 1536 | 1536 | N | | | | | |
| yolov3-tiny-maui-bb-memory | Darknet | Yolov3-Tiny | Azure | Complete | Y | 64 | 2 | N | 1920 | 1920 | N | | | | | |
| yolov3-19201080 | Darknet | yolov3 | UoA | | | | | | 1920 | 1080 | ? | | | | | |
| Yolov3-SPP | Darknet | yolov3-spp | | | | | | | | | | | | | | |
| Yolov3-SPP-tiny | Darknet | yolov3-spp-tiny | | | | | | | | | | | | | | |
| Yolov3-LSTM | Darknet | yolov3-spp | | Complete - ordered frames | | | | | | | | | | | | |
| Yolov3-LSTM-tiny | Darknet | yolov3-spp-tiny | | Complete - ordered frames | | | | | | | | | | | | |
| Yolo-v3-tp3 | Darknet | yolo_v3_tiny_pan3.cfg | | | | | | | | | | | | | | |
| Yolo_tiny-prn | Darknet | yolo_tiny - prn | | | | | | | | | | | | | | |
| yolov3-tiny-maui-1536-ten | Tensorflow | Yolov3-Tiny | | | | | | | | | | | | | | |
| yolov3-light-maui-1536 | Tensorflow | Yolov3-Tiny-Light | | | | | | | | | | | | | | |
| yolov3-Nano-maui-1536 | Tensorflow | Yolov3-nano | | | | | | | | | | | | | | |
| yolov3-light-maui-1536-spp | Tensorflow | Yolov3-Tiny-Light-SPP | | | | | | | | | | | | | | |
| yolov3-tiny-maui-1536 | Pytorch | Yolov3-Tiny | | | | | | | | | | | | | | |
| yolov3-maui-1536 | Pytorch | Yolov3 | | | | | | | | | | | | | | |
| small network model | Darknet | Network-Model | | | | | | | | | | | | | | |

pullmyleg reopened this on Apr 18, 2020
@AlexeyAB (Owner)

> **Training Results**
> yolov3-tiny-maui-1536.txt | Darknet | Yolov3-Tiny | Beast | Complete | batch 64 | subdivisions 16 | random Y | 1536x1536 | calc anchors N | 7200 iterations | mAP 89.7% | Xavier FPS 21 | Good @1920x1088 detection
  • Does this mean that you trained the model with width=1536 height=1536 in the cfg, and after training changed it to width=1920 height=1088 in the cfg for detection? Don't do that if you train on images from the same camera.

  • Do you use a separate validation dataset for the mAP calculation?


Try to train these 3 yolov3-tiny models - they are implemented for aerial detection: #4495 (comment)

Train with width=1920 height=1088 in the cfg and use the same width=1920 height=1088 for detection, and train using the pre-trained weights file yolov3-tiny.conv.15 (see the example command after the list below): https://github.com/AlexeyAB/darknet#how-to-train-tiny-yolo-to-detect-your-custom-objects

  1. Tiny_3l_rotate_whole_maxout - https://github.com/AlexeyAB/darknet/files/3995740/yolov3-tiny_3l_rotate_whole_maxout.cfg.txt

  2. Tiny_3l_stretch_sway_whole_concat_maxout - https://github.com/AlexeyAB/darknet/files/4003688/yolov3-tiny_3l_stretch_sway_whole_concat_maxout.cfg.txt

  3. Tiny_3l_resize - https://github.com/AlexeyAB/darknet/files/3995772/yolov3-tiny_3l_resize.cfg.txt
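
For reference, a training command in the style of the linked README would look like `./darknet detector train data/obj.data yolov3-tiny_3l_rotate_whole_maxout.cfg yolov3-tiny.conv.15`, where the .data file name is a placeholder for your own data file and the cfg is whichever of the three above you are training.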

@pullmyleg (Author)

Thanks @AlexeyAB

> Does this mean that you trained the model with width=1536 height=1536 in the cfg, and after training changed it to width=1920 height=1088 in the cfg for detection?

Yes, in that example I trained at 1536x1536 and detect at 1920x1088. Why should I not change it? I assume it's because I should train at the same aspect ratio that I want to detect at?

> Don't do that if you train on images from the same camera.

I am training on images from a different camera than the one that will be in the final UAV. The images are frames (6 per second) from 4K video (3840 x 2160 px). I do not have footage of the dolphins from the final UAV camera yet; it is still being built.

> Do you use a separate validation dataset for the mAP calculation?

Yes, the training set is different from the validation set. One of the 8 videos is used in the test set; the final video for manual testing is one of the videos in the validation set.

Complete dataset:

  • Training set: ~9000 images
  • Validation set: ~300 images

Small-objects-only dataset (from high altitude, very small dolphins only):

  • Training set: ~3200 images
  • Validation set: ~700 images

> Try to train these 3 yolov3-tiny models - they are implemented for aerial detection: #4495

Ok, thank you. I will train these next and post results when finished.

@AlexeyAB (Owner)

> Yes, in that example I trained at 1536x1536 and detect at 1920x1088. Why should I not change it? I assume it's because I should train at the same aspect ratio that I want to detect at?

Yes, the aspect ratio should be the same, so use the same network resolution for training and detection.

Also try to train a 4th yolov3-tiny model with width=1920 height=1088 in the cfg: yolo_v3_tiny_pan3_scale_giou.cfg.txt

@pullmyleg (Author)

Hi @AlexeyAB, I know this question has been answered many times, but I just want to confirm that what I am doing with calculated anchors is correct. I understand that the anchors are the widths and heights of the closest-matching objects for that layer, but what I don't understand is why they are split by size across the layers, e.g. why anchors greater than 60x60 go in the first layer.

My understanding from the readme is:

  • Anchors greater than 60x60: layer 1.
  • Anchors greater than 30x30 but smaller than 60x60: layer 2.
  • Anchors smaller than 30x30: layer 3.

Note I have 2 datasets (complete and small). Small is footage from 40 m+ altitude only (very small objects); complete is from 10 m - 40 m (very small and medium-small sized objects).

The anchors below are for the small dataset. I am using the small-object dataset because the smaller the objects we can detect, the higher we can fly and the more area we can cover in one flight.

```
num_of_clusters = 9, width = 1920, height = 1080
 read labels from 3232 images
 loaded 	 image: 3232 	 box: 3019
 all loaded.

 calculating k-means++ ...

 iterations = 16

counters_per_class = 3019

 avg IoU = 79.59 %

Saving anchors to the file: anchors.txt
anchors =  29, 23,  21, 39,  50, 26,  38, 36,  30, 50,  49, 49,  70, 38,  42, 73,  70, 69
```

Option 1 - based on one dimension of each anchor fitting the size range:

Layer 1 mask = 6,7,8
Layer 2 mask = 3,4,5
Layer 3 mask = 0,1,2
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69.

Mask 6 is actually smaller than 60x60 and mask 2 is greater than 30x30, but I noticed a similar approach in the original config when an anchor is close or one of its values is >= 60, e.g. mask 5 in layer 2 of the original config is 59,119, which is greater than 60x60.

Option 2 - based on total anchor area, e.g. 60*60:

Layer 1 mask = 8
Layer 2 mask = 2,3,4,5,6,7
Layer 3 mask = 0,1
anchors = 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69

I will adjust the filters according to the masks used in each layer, as sketched below.
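
To double-check the Option 2 grouping, here is a quick sketch (my own helper, not Darknet code) that splits the calculated anchors by area using the 60x60 and 30x30 cut-offs:

```python
# Split the k-means anchors across the three [yolo] layers by area (Option 2).
# Layer 1 is the coarsest grid (largest objects), layer 3 the finest.
anchors = [(29, 23), (21, 39), (50, 26), (38, 36), (30, 50),
           (49, 49), (70, 38), (42, 73), (70, 69)]

def layer_for(w, h):
    if w * h >= 60 * 60:
        return 1
    if w * h >= 30 * 30:
        return 2
    return 3

for i, (w, h) in enumerate(anchors):
    print(f"mask {i}: {w}x{h} -> layer {layer_for(w, h)}")
# Reproduces the grouping above: layer 1 = mask 8, layer 2 = masks 2-7,
# layer 3 = masks 0-1.
```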

Thanks again for your help!

@AlexeyAB
Copy link
Owner

There is no strict rule. There is just an empirical recommendation:

  • Anchors greater than 64x64 for the layer behind 5 subsampling layers (stride=2), because it has a receptive field >= 32 = pow(2,5) (actually it is higher than 32x32, because conv3x3 layers also increase the receptive field, not only the layers with stride=2)
  • Anchors greater than 32x32 but smaller than 64x64 for the layer behind 4 subsampling layers (stride=2)
  • Anchors smaller than 32x32 for the layer behind 3 subsampling layers (stride=2)

You can add show_receptive_field=1 to the [net] section of the cfg to print the receptive field size of each layer to the console during network initialization; a rough sketch of the computation is below.
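
A minimal sketch (not Darknet's own implementation) of the standard receptive-field recurrence, showing why a head behind 5 stride=2 conv3x3 layers already sees well over 32x32 input pixels:

```python
# Receptive field of a stack of conv layers: each layer widens the field by
# (kernel - 1) * cumulative stride, and each stride=2 layer doubles the jump.
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 2)] * 5))  # 5 subsampling conv3x3 layers -> 63
print(receptive_field([(3, 2)] * 4))  # 4 subsampling conv3x3 layers -> 31
print(receptive_field([(3, 2)] * 3))  # 3 subsampling conv3x3 layers -> 15
```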


This is a more complex issue - you should take into account the number of objects per image at each size, the number of overlapping objects at each size, ...

I would recommend you to either:

  • use the default anchors, or
  • use Option 2, but add default anchors: 2 anchors to layer 1 and 1 anchor to layer 3

@pullmyleg (Author)

OK, thank you @AlexeyAB. I will try training with both and compare the results.

To confirm, Option 2 with the additional default anchors should look like this (all bold values are new):

Layer 1 mask = 9,10,11
Layer 2 mask = 3,4,5,6,7,8
Layer 3 mask = 0,1,2
anchors = **10,13**, 29,23, 21,39, 50,26, 38,36, 30,50, 49,49, 70,38, 42,73, 70,69, **116,90**, **156,198**
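
With those masks, the filters= value in the [convolutional] layer before each [yolo] layer has to match (classes + 5) * <number of masks in that layer>. A quick check, assuming a single dolphin class (adjust classes if you have more):

```python
# filters = (classes + 5) * number_of_masks for the conv layer before each
# [yolo] layer; assuming 1 class (dolphin).
masks = {"layer 1": [9, 10, 11],
         "layer 2": [3, 4, 5, 6, 7, 8],
         "layer 3": [0, 1, 2]}
for name, m in masks.items():
    print(f"{name}: filters = {(1 + 5) * len(m)}")
# layer 1: 18, layer 2: 36, layer 3: 18
```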

@AlexeyAB (Owner)

@pullmyleg Yes.
