
Detected object coordinate (x, y) and custom training #2

Closed
MyVanitar opened this issue Dec 9, 2016 · 38 comments

@MyVanitar

Hello,

How can I get coordinate information (x, y) of detected object(s)?

How can I train the Yolo2 for my own desired objects?

@AlexeyAB
Owner

AlexeyAB commented Dec 9, 2016

@VanitarNordic Hi,

You can add printf("%d, %d, %d, %d \n", left, right, top, bot); here: https://github.com/AlexeyAB/darknet/blob/master/src/image.c#L219

or also add:

int x_center = b.x * im.w;
int y_center = b.y * im.h;
int width    = b.w * im.w;
int height   = b.h * im.h;
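
For illustration, here is a minimal sketch of how those lines might sit together inside draw_detections() in src/image.c (the variables b, im, left, right, top, bot follow the snippet linked above; treat this as a sketch, not the exact upstream code):

box b = boxes[i];                     /* relative box: x, y = center, w, h = size, all in 0..1 */
int left  = (b.x - b.w / 2.) * im.w;  /* corner coordinates in pixels */
int right = (b.x + b.w / 2.) * im.w;
int top   = (b.y - b.h / 2.) * im.h;
int bot   = (b.y + b.h / 2.) * im.h;
printf("%d, %d, %d, %d \n", left, right, top, bot);

int x_center = b.x * im.w;            /* center and size in pixels */
int y_center = b.y * im.h;
int width    = b.w * im.w;
int height   = b.h * im.h;
printf("center (%d, %d), size %d x %d \n", x_center, y_center, width, height);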

The training guide is still in progress: https://groups.google.com/d/msg/darknet/0ksFU91emmc/QMEO0HnHAgAJ

@MyVanitar
Author

Thank you very much.

Do you know how we can add live video camera support instead of an image as input? You mentioned a camera installed on a network (accessible by IP), but I mean host-connected cameras such as an internal webcam, USB3 cameras, and similar.

@AlexeyAB
Owner

@VanitarNordic

Yes, for web camera number 0 you can use: darknet.exe detector demo data/voc.data yolo-voc.cfg yolo-voc.weights -c 0

@AlexeyAB
Owner

@VanitarNordic

How can I train the Yolo2 for my own desired objects?

Now you can train Yolo v2 using the following instructions: https://github.com/AlexeyAB/darknet#how-to-train-pascal-voc-data

Original for Linux: http://pjreddie.com/darknet/yolo/#train-voc

@MyVanitar
Author

Thank you, gentleman.

I read it briefly, but as far as I can tell it is about regenerating the training data files based on VOC. What if we have our own selected set of 1000 discrete image files (which contain variations of a desired object among other objects) and decide to train Yolo-2 on these?

I mean training on our own image files from scratch.

@AlexeyAB
Owner

AlexeyAB commented Dec 13, 2016

@VanitarNordic

To train for your 2 objects:

  1. Copy yolo-voc.cfg to yolo-obj.cfg and change the line classes=20 to classes=2

  2. Create file obj.names with the 2 object names, each on a new line (an example follows after these steps)

  3. Create file train.txt with the filenames of your images, each on a new line

  4. Create file obj.data containing:

classes= 2
train  = train.txt
valid  = test.txt
names = obj.names
backup = backup/
  5. Create a .txt-file for each image-file, with the same name but a .txt extension, and put in it one line per object in the image: <object-class> <x> <y> <width> <height>, where the values are floats relative to the width and height of the image.

For example (attention: x, y are the centers of the rectangle), for img1.jpg you create img1.txt containing:

1 0.716797 0.395833 0.216406 0.147222
0 0.687109 0.379167 0.255469 0.158333
1 0.420312 0.395833 0.140625 0.166667

  6. Download the pre-trained weights for the convolutional layers (76 MB): http://pjreddie.com/media/files/darknet19_448.conv.23 and put them in the directory build\darknet\x64

  7. Run training: darknet.exe detector train obj.data yolo-obj.cfg darknet19_448.conv.23
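
As a concrete illustration of steps 2 and 3 (the class names and image paths here are hypothetical), obj.names for two classes could contain:

car
person

and train.txt could contain:

data/obj/img1.jpg
data/obj/img2.jpg
data/obj/img3.jpg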

@MyVanitar
Author

Thank you again Alexey.

I have some more questions:

  1. In step 1 you mentioned: "Copy yolo-voc.cfg to yolo-obj.cfg and ...". Do you mean replacing the "yolo-voc.cfg" file with "yolo-obj.cfg"?

  2. In step 4, do you mean just creating a file that contains that information?

  3. In step 5, do you know any tool that generates such annotation files? OpenCV has such a tool, but it produces annotation files differently (x, y are the top-left coordinates and they are integer values).

@AlexeyAB
Owner

@VanitarNordic

  1. I mean you should create a new file "yolo-obj.cfg" with the same content as "yolo-voc.cfg", but with only one change: classes=2

  2. Yes.

  3. No, I don't know such software. Which tool in OpenCV are you talking about? Can you give a link?

Also you can ask about it here: https://groups.google.com/forum/#!forum/darknet

@AlexeyAB
Owner

@VanitarNordic

Also you should change filters=(classes + 5)*5 in your yolo-voc.cfg
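
For example, with classes=2 this gives filters=(2 + 5)*5 = 35, and with classes=1 it gives filters=(1 + 5)*5 = 30.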

I added How to train (to detect your custom objects): https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

@MyVanitar
Author

MyVanitar commented Dec 17, 2016

Thank you Alexey

Very good explanation.

I have a few more questions.

  1. If I wanted to detect one object type, such as just cars and nothing else, would the number of classes be equal to 1?

  2. Does the first name in the first line of the obj.names file relate to class 1, and similarly does line 2 correspond to class 2?

Finally, I still don't understand why the <x> <y> <width> <height> values for each image are float numbers. If I understood why it is like that, I could maybe create software to make these files and values, in case we can't find the tool the authors used to make them.

@AlexeyAB
Owner

@VanitarNordic

  1. Yes, for 1 object, classes=1 in obj.data and in yolo-obj.cfg

(and filters=30 in yolo-obj.cfg)

  2. Numbering starts at zero. If you have only one class of object, then <object-class> will always be 0.

Float values are used for <x> <y> <width> <height> because they are relative to the absolute width x height of the image, and they range from 0.0 to 1.0. The advantage of relative values is that they stay valid for any resizing of the image.

Input images can be any size (any width and height), both for training and prediction; any image is resized to the neural-network size (416x416 or 448x448), but the relative values <x> <y> <width> <height> stay valid without changes: https://github.com/AlexeyAB/darknet/blob/master/src/demo.c#L49

@MyVanitar
Author

Thanks,

Please correct me if the calculation below is wrong:


(x, y: center of the rectangle)

relative x = absolute x / width 
relative y = absolute y / height

relative height = absolute height / height
relative width = absolute width / width
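
For illustration, here is a minimal C sketch of this conversion (the function print_yolo_line and its test values are hypothetical; it assumes x, y are the center of the rectangle, in pixels):

#include <stdio.h>

/* Convert an absolute pixel-space box (center x, y and size w, h) into
   one Yolo v2 annotation line: <object-class> <x> <y> <width> <height> */
void print_yolo_line(int object_class, float x, float y, float w, float h,
                     int img_w, int img_h)
{
    printf("%d %f %f %f %f\n", object_class,
           x / img_w,    /* relative x = absolute x / width  */
           y / img_h,    /* relative y = absolute y / height */
           w / img_w,    /* relative width  */
           h / img_h);   /* relative height */
}

int main(void)
{
    /* If img1.jpg were 1280x720, a class-1 box centered at (917.5, 285)
       with size 277x106 would reproduce the first line of the img1.txt
       example above: 1 0.716797 0.395833 0.216406 0.147222 */
    print_yolo_line(1, 917.5f, 285.0f, 277.0f, 106.0f, 1280, 720);
    return 0;
}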

@AlexeyAB
Owner

@VanitarNordic

Yes.

I created a new repository with GUI software for generating annotation files for Yolo v2, which I had written earlier: https://github.com/AlexeyAB/Yolo_mark

@MyVanitar
Author

MyVanitar commented Dec 20, 2016

Thank you,

May I ask what speed (FPS) you have achieved testing Yolo-2 on the CPU? Mine is very slow (a few seconds per image). Other DNN-based algorithms are slow in training but okay at test and run time. Am I doing something wrong?

@MyVanitar
Author

no idea?

@AlexeyAB
Owner

@VanitarNordic

  • CPU Intel Core i7-6700K - 4 GHz 4(8) Cores: 0.3 FPS
  • GPU GeForce GTX 970 - 1 GHz 1664 Cores: 32 FPS

Darknet Yolo v2 is not optimized for CPU and uses only 1-2 cores.

@MyVanitar
Author

You have a sophisticated graphics card but get 32 FPS. It should be at least 60 FPS to avoid flicker and be real-time. Why do the YOLO 1 and 2 authors always claim it is a fast algorithm?

@AlexeyAB
Owner

AlexeyAB commented Dec 29, 2016

I got 32 FPS for full Yolo v2 480x480 on a GTX 970 without cuDNN. It is not a fast GPU; the top GPU, the Nvidia Titan X GP102, is 3x faster.

  • GeForce GTX 970 - 3.5 TFlops-SP (without cuDNN)
  • GeForce GTX Titan X GM200 - 6.1 TFlops-SP (x1.74 faster)

Results:

  • YOLOv2 480x480 VOC - 32 FPS on GTX 970 (without cuDNN)
  • YOLOv2 480x480 VOC - 59 FPS on Titan X GM200 (x1.84 faster)

Did you try any other object detectors: Faster-RCNN ResNet-152, SSD 300/500 (old & new)?
[chart: mAP vs FPS comparison of object detectors]

@MyVanitar
Author

Is 480x480 the input resolution (image or video)?
From the curve I can assume that Yolo-2 is a trade-off between speed and accuracy, isn't it?

I have tried Dlib and it seems faster and more accurate.

@AlexeyAB
Owner

480x480 is the input resolution of the neural network. All YoloV2 points lie on the optimal Pareto frontier, i.e. it is state of the art. If you want more than 30 FPS on a Titan X, then there is nothing better at the moment for accuracy/speed.

All of dlib's object detectors are much less accurate. Which object detector from dlib do you use?

@MyVanitar
Author

MyVanitar commented Dec 29, 2016

Actually you got 59 FPS on the Titan X, as I see, which is good.

I am not deeply familiar with the algorithm itself, so if the input to the neural network differs from the main input, what is the resolution of the main input images (or video from the camera)? And what if we decide to use an HD camera as input (such as an HDMI camera)?

I used face pose detection on the CPU and it was good, but because I do not have a professional GPU, I have not tested his latest post here: http://blog.dlib.net/. What he claims about speed and accuracy is very good, if he is right; it seems the accuracy is better than RCNN.

@AlexeyAB
Owner

AlexeyAB commented Dec 30, 2016

If you use 480x480 Yolo v2 and capture FullHD video 1920x1080, then each frame will be resized to 480x480 and processed by the neural network, with the best accuracy/speed among all real-time (>30 FPS) object detectors.

If you want to detect very small objects (15x15 pixels), then you can divide the input image (1920x1080) into overlapping (10%) small images (480x480) and process each of them. You have to write this code yourself; a sketch of the idea follows below.
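
As a minimal sketch of that tiling idea (my own illustration, not code from this repository), the following enumerates 480x480 crop origins over a 1920x1080 frame with a 10% (48 px) overlap, clamping the last row and column to the frame border; each crop would then be passed to the detector separately:

#include <stdio.h>

int main(void)
{
    const int frame_w = 1920, frame_h = 1080;  /* FullHD input frame */
    const int tile = 480;                      /* network input size */
    const int step = tile - tile / 10;         /* 10% overlap -> 432 px stride */

    for (int y0 = 0; y0 < frame_h; y0 += step) {
        for (int x0 = 0; x0 < frame_w; x0 += step) {
            /* clamp so every crop stays fully inside the frame */
            int x = (x0 + tile > frame_w) ? frame_w - tile : x0;
            int y = (y0 + tile > frame_h) ? frame_h - tile : y0;
            printf("crop origin (%4d, %4d), size %dx%d\n", x, y, tile, tile);
        }
    }
    return 0;
}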

@MyVanitar
Author

What about Dlib's latest blog post?

Also I have heard about Caffe. What is your opinion about them?

@AlexeyAB
Owner

AlexeyAB commented Jan 10, 2017

@VanitarNordic

It is necessary to distinguish between frameworks, approaches to region proposals, and neural networks.

For example, these are commonly used together:

  • framework(Caffe) + approach(SSD) + network(VGG16)
  • framework(Caffe) + approach(Faster-RCNN) + network(VGG16)
  • framework(Caffe) + approach(RFCN) + network(ResNet-101)
  • framework(Darknet) + approach(Yolo) + network(Yolo v2)
  • framework(Caffe) + approach(DetectNet based on Yolo v1) + network(DetectNet based on GoogLeNet)

@MyVanitar
Author

Thanks,

I mean DetectNet (object detection), which is trained with NVCaffe; GoogLeNet does the classification.

@AlexeyAB
Owner

@VanitarNordic
DetectNet is worse than Yolo v2.

Results for DetectNet are absent from all public tests for detection.

DetectNet uses: framework(Caffe) + approach(DetectNet based on the old Yolo v1) + network(DetectNet based on GoogLeNet)

@MyVanitar
Author

MyVanitar commented Jan 10, 2017

  1. What about Dlib 19.2?

  2. I am curious whether I could train Yolo-2 with DIGITS; it probably needs a caffemodel and a prototxt file.

  3. What is your opinion of the GTX 1080 GPU? Can you predict how fast Yolo-2 would be (FPS) on this graphics card (for detection)?

@AlexeyAB
Owner

AlexeyAB commented Jan 16, 2017

@VanitarNordic

  1. As said here, dlib-cnn + MMOD was compared only with Caffe-FasterRCNN-VGG16, and only for faces. In this case it perhaps gives good results, better than FasterRCNN-VGG16.

But for objects other than faces it may give bad results; dlib is absent from all public tests for detection.

Also, the current best approach, Caffe + RFCN + ResNet-101 (https://github.com/daijifeng001/r-fcn), has a much better result, with 2x fewer errors than FasterRCNN-VGG16.

I.e., dlib is not the best, but it is good.

  2. No, you can't train a Yolo model in Caffe or Caffe-DIGITS. There is software to convert a Yolo v1 cfg-file and weights-file to prototxt and caffemodel, but it works only for the old Yolo v1: https://github.com/xingwangsfu/caffe-yolo

  3. You can simply compare the results from the picture for the nVidia Titan X GM200 with 6144 GFlops with any nVidia GPU from this list: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series

[picture: results comparison for the nVidia Titan X GM200]

  • nVidia Titan X GM200 with 6144 GFlops
  • nVidia GeForce GTX 1080 with 8228 GFlops - i.e. x1.34 faster than shown in the picture
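
To make that concrete: scaling the 59 FPS of YOLOv2 480x480 on the Titan X GM200 linearly by GFlops suggests roughly 59 x 1.34 ≈ 79 FPS on a GTX 1080 (a rough estimate, since real throughput also depends on memory bandwidth, cuDNN, and other factors).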

@MyVanitar
Author

MyVanitar commented Jan 16, 2017

Thank you, again a very professional and comprehensive explanation. Really, I have nothing more to ask. Fantastic :-)
Also you gave me a parameter (GFlops) to compare GPUs for DNNs if I decide to purchase one wisely.

So, by the way, Yolo-2 should be the best both in terms of precision and speed, yes?

@AlexeyAB
Owner

AlexeyAB commented Jan 16, 2017

@VanitarNordic In different tests there may be different winners.
But there are three of the best methods for real-time use.

For not real-time, the best is Caffe-RFCN+ResNet101: https://github.com/daijifeng001/r-fcn

@MyVanitar
Author

Which model in the picture does Caffe-PVANet refer to (on the VOC 2007 test, I mean)?

SSD512 is accurate but slow even on a Titan X.

@AlexeyAB
Owner

AlexeyAB commented Jan 16, 2017

It is not on VOC2007 but on VOC2012 (a comparison of DNNs trained on a very large data set): http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=4&submid=9804

@MyVanitar
Author

MyVanitar commented Jan 16, 2017

Well, according to the GitHub description it achieved mAP=84.9 on VOC2007, but the speed (FPS) is not mentioned.

@AlexeyAB
Owner

AlexeyAB commented Jan 16, 2017

All on Titan X (GM200):
PVANet+: mAP=84.2, FPS=22
PVANet+ (compressed): mAP=83.7, FPS=31
https://arxiv.org/pdf/1611.08588v2.pdf

@MyVanitar
Author

MyVanitar commented Jan 16, 2017

  1. When the FPS is low and the model is accurate, is there any way to achieve a higher speed? Is there any hardware that performs faster than a GPU?

  2. Where did you get the Pascal VOC 2012 results?

  3. Does the memory of the GPU influence model accuracy in training? (Typically we have to adjust the batch size to fit GPUs with less memory.)

@MyVanitar
Author

Also, have you heard about YOLO9000?

@MyVanitar
Author

There was a chart in your previous posts about the competition results, but I cannot see that image now. Can you upload it again or mention the source?

@AlexeyAB
Owner

AlexeyAB commented Feb 5, 2017

@VanitarNordic All on nVidia Titan X (GM200)
